
Parts of the merged cells' text are getting cut off when merged

Open: GradyMellin opened this issue 2 years ago • 1 comment

When I merge cells whose text spans multiple cells, across both rows and columns, only the text from the first cell is transferred. I assume I have to do something like the combine_headers function, but I am having trouble working out how to access the other cells. I have added a picture of a table similar to the one giving me problems, along with my code and results. Any help with this would be greatly appreciated!

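from textractcaller.t_call import call_textract, Textract_Features
from trp import Document
from trp.trp2 import TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo
import pandas as pd

# documentName is assumed to be defined earlier (the input document to analyze)
textract_json = call_textract(input_document=documentName, features=[Textract_Features.TABLES])

# Order the blocks geometrically, then load them into the trp Document model
t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))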

table_index = 1
dataframes = []

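def combine_headers(top_h, mid_h, bottom_h):
    # Prefix the two bottom-level headers (columns 4 and 5) with the spanning
    # headers above them. Modifies bottom_h in place.
    try:
        prefix = top_h[4] + " " + mid_h[4]
        bottom_h[4] = prefix + " " + bottom_h[4]
        bottom_h[5] = prefix + " " + bottom_h[5]
    except (IndexError, TypeError):
        # Fewer header columns than expected, or a non-string header: leave as-is
        pass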

for page in trp_doc.pages:
    for table in page.tables:
        table_data = []
        headers = table.get_header_field_names()
        # Three header rows are expected: top, middle, and bottom
        if len(headers) >= 3:
            print("Statement headers: " + repr(headers))
            top_header = headers[0]
            middle_header = headers[1]
            bottom_header = headers[2]
            combine_headers(top_header, middle_header, bottom_header)
            for row in table.rows_without_header:
                table_data.append([cell.mergedText for cell in row.cells])

            if len(table_data) > 0:
                df = pd.DataFrame(table_data, columns=bottom_header)
                dataframes.append(df)
                # Print inside the loop so df is always defined when printed
                print(df.to_markdown())

Table: Screenshot (196)

As you can see below, the header text after "Local (Up" gets cut off because it runs into the next cell. The same happens with all of the length class rows, which lose the "pages)" part, and with the extra-long books row. Results:

Length Class Category Class Codes Codes Distribution Local (Up To Mark Up Factor Distribution Local (Up To Cost Factor
0 Short Books (0 100 Children's Non-fiction Fiction 011-- 012-- 1.10 1.00
1 Short Books (0 100 Mystery Non-fiction Fiction 021-- 022-- 1.55 1.15
2 Short Books (0 100 Romance Non-fiction Fiction 031-- 032-- 1.40 1.00
3
4 Medium Books (101 500 Children's Non-fiction Fiction 211-- 212-- 1.05 0.95
5 Medium Books (101 500 Mystery Non-fiction Fiction 221-- 222-- 1.50 0.70
6 Medium Books (101 500 Romance Non-fiction Fiction 231-- 232-- 1.40 0.75
7
8 Long Books (501 - 1,000 Children's Non-fiction Fiction 311-- 312-- 1.10 0.65
9 Long Books (501 - 1,000 Mystery Non-fiction Fiction 321-- 322-- 1.55 0.90
10 Long Books (501 - 1,000 Romance Non-fiction Fiction 331-- 332-- 1.25 0.70
11
12 Extra-Long (Over 1,000 Extra-Long (Over 1,000 Non-fiction Fiction 401-- 402-- 2.45 1.15
13
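
One possible way to reach the other cells is to bypass the parser and read the raw Textract blocks directly. The sketch below assumes the response contains `MERGED_CELL` blocks whose `CHILD` relationships list the underlying `CELL` blocks, and that each `CELL` lists its `WORD` blocks the same way. The helper names (`child_ids`, `merged_cell_texts`) and the sample blocks are hypothetical, made up for illustration and only loosely modeled on the table above. It also assumes the child Ids appear in reading order, which I have not verified.

```python
from typing import Dict, List


def child_ids(block: dict) -> List[str]:
    """Return the Ids listed in a block's CHILD relationships, in order."""
    ids: List[str] = []
    for rel in block.get("Relationships") or []:
        if rel.get("Type") == "CHILD":
            ids.extend(rel.get("Ids", []))
    return ids


def merged_cell_texts(blocks: List[dict]) -> Dict[str, str]:
    """Map each MERGED_CELL block Id to the joined text of every cell it spans."""
    by_id = {b["Id"]: b for b in blocks}
    texts: Dict[str, str] = {}
    for block in blocks:
        if block.get("BlockType") != "MERGED_CELL":
            continue
        words: List[str] = []
        for cell_id in child_ids(block):
            cell = by_id.get(cell_id)
            if cell is None:
                continue
            for word_id in child_ids(cell):
                word = by_id.get(word_id, {})
                # Skip anything that is not a WORD (e.g. selection elements)
                if word.get("BlockType") == "WORD":
                    words.append(word.get("Text", ""))
        texts[block["Id"]] = " ".join(words)
    return texts


# Hand-made sample (hypothetical Ids and text): one label split over two cells
sample_blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Short"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Books"},
    {"Id": "w3", "BlockType": "WORD", "Text": "(0"},
    {"Id": "w4", "BlockType": "WORD", "Text": "100"},
    {"Id": "w5", "BlockType": "WORD", "Text": "pages)"},
    {"Id": "c1", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
    {"Id": "c2", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["w4", "w5"]}]},
    {"Id": "m1", "BlockType": "MERGED_CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
]

result = merged_cell_texts(sample_blocks)
print(result)
```

With a real response, the input would presumably be `textract_json["Blocks"]`; mapping the merged text back onto DataFrame rows and columns would still need the `RowIndex` and `ColumnIndex` of each merged cell.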

GradyMellin, Mar 17 '23 19:03

Hey @GradyMellin, I am also facing the same issue. Did you find a workaround for this?

pranavbhat12, Jan 08 '24 11:01