Parts of the merged cell text are getting cut off when merged

Open GradyMellin opened this issue 2 years ago • 1 comments

When I merge cells whose text spans multiple cells, across both rows and columns, only the text from the first cell of the span is transferred. I assume I have to do something like the combine headers function, but I am having trouble finding out how to access the other cells. I have added a picture of a table similar to the one that is giving me problems, as well as my code and results. Any help with this would be greatly appreciated!

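# Imports assumed from the amazon-textract-caller and amazon-textract-response-parser packages
import pandas as pd
from textractcaller.t_call import call_textract, Textract_Features
from trp import Document
from trp.trp2 import TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo

textract_json = call_textract(input_document=documentName, features=[Textract_Features.TABLES])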

t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))

table_index = 1
dataframes = []

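def combine_headers(top_h, mid_h, bottom_h):
    # Prefix the two rightmost bottom headers with the spanning top and middle
    # header text, which only appears in column 4. Modifies bottom_h in place.
    try:
        bottom_h[4] = top_h[4] + " " + mid_h[4] + " " + bottom_h[4]
        bottom_h[5] = top_h[4] + " " + mid_h[4] + " " + bottom_h[5]
    except IndexError:
        # Table has fewer than six columns; leave the headers unchanged
        pass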

for page in trp_doc.pages:
    for table in page.tables:
        table_data = []
        headers = table.get_header_field_names()
        if len(headers) > 0:
            print("Statement headers: " + repr(headers))
            top_header = headers[0]
            middle_header = headers[1]
            bottom_header = headers[2]
            combine_headers(top_header, middle_header, bottom_header)
            for r, row in enumerate(table.rows_without_header):
                table_data.append([])
                for c, cell in enumerate(row.cells):
                    table_data[r].append(cell.mergedText)

            if len(table_data) > 0:
                df = pd.DataFrame(table_data, columns=bottom_header)
                # Print inside the guard so df is always defined at this point
                print(df.to_markdown())
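
For comparison, the hard-coded `combine_headers` above could be generalized so it does not depend on column positions 4 and 5. The following is only a sketch: `forward_fill` and `combine_header_rows` are hypothetical helpers, not part of trp, and the sample header rows are made up for illustration. It assumes each header row is a list of strings in which a spanning header's text appears only in the first column it covers, with the remaining covered columns empty.

```python
from typing import List, Sequence


def forward_fill(row: Sequence[str]) -> List[str]:
    """Copy each non-empty header text rightwards into the empty cells after it."""
    filled, last = [], ""
    for value in row:
        text = (value or "").strip()
        if text:
            last = text
        filled.append(last)
    return filled


def combine_header_rows(header_rows: Sequence[Sequence[str]]) -> List[str]:
    """Join stacked header rows into one header string per column."""
    if not header_rows:
        return []
    width = max(len(r) for r in header_rows)
    padded = [list(r) + [""] * (width - len(r)) for r in header_rows]
    # Spanning rows are forward-filled; the bottom row is taken as-is.
    filled = [forward_fill(r) for r in padded[:-1]]
    filled.append([(v or "").strip() for v in padded[-1]])
    combined = []
    for col in range(width):
        parts: List[str] = []
        for row in filled:
            if row[col] and row[col] not in parts:
                parts.append(row[col])
        combined.append(" ".join(parts))
    return combined


# Made-up header rows, shaped like a three-level header
top = ["Length Class", "Category Class", "Codes", "", "Distribution", ""]
mid = ["", "", "", "", "Local", ""]
bottom = ["", "", "Non-fiction", "Fiction", "Mark Up Factor", "Cost Factor"]
print(combine_header_rows([top, mid, bottom]))
```

Note that forward-filling treats every empty cell in an upper row as covered by the header to its left, so a column that genuinely has no upper header would be mislabeled. This only combines text that was already extracted; it does not recover text that was dropped from a merged cell.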

Table: Screenshot (196)

As you can see below, in the headers the text after "Local (Up" gets cut off because it runs into the next cell. The same happens with all of the Length Class rows, where the "pages)" part of the text is cut off. It also happens with the Extra-Long books row. Results:

|    | Length Class | Category Class | Codes | Codes | Distribution Local (Up To Mark Up Factor | Distribution Local (Up To Cost Factor |
|---:|:---|:---|:---|:---|:---|:---|
| 0 | Short Books (0 100 | Children's | Non-fiction Fiction | 011-- 012-- | 1.10 | 1.00 |
| 1 | Short Books (0 100 | Mystery | Non-fiction Fiction | 021-- 022-- | 1.55 | 1.15 |
| 2 | Short Books (0 100 | Romance | Non-fiction Fiction | 031-- 032-- | 1.40 | 1.00 |
| 3 |  |  |  |  |  |  |
| 4 | Medium Books (101 500 | Children's | Non-fiction Fiction | 211-- 212-- | 1.05 | 0.95 |
| 5 | Medium Books (101 500 | Mystery | Non-fiction Fiction | 221-- 222-- | 1.50 | 0.70 |
| 6 | Medium Books (101 500 | Romance | Non-fiction Fiction | 231-- 232-- | 1.40 | 0.75 |
| 7 |  |  |  |  |  |  |
| 8 | Long Books (501 - 1,000 | Children's | Non-fiction Fiction | 311-- 312-- | 1.10 | 0.65 |
| 9 | Long Books (501 - 1,000 | Mystery | Non-fiction Fiction | 321-- 322-- | 1.55 | 0.90 |
| 10 | Long Books (501 - 1,000 | Romance | Non-fiction Fiction | 331-- 332-- | 1.25 | 0.70 |
| 11 |  |  |  |  |  |  |
| 12 | Extra-Long (Over 1,000 | Extra-Long (Over 1,000 | Non-fiction Fiction | 401-- 402-- | 2.45 | 1.15 |
| 13 |  |  |  |  |  |  |
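
One way to check whether the missing text is present in the raw response at all is to walk the Textract JSON directly rather than going through trp. In the Textract response, a `MERGED_CELL` block lists the `CELL` blocks it covers as `CHILD` relationships, and each `CELL` lists its `WORD` blocks the same way. The sketch below relies on that structure; the helper names (`child_ids`, `cell_text`, `merged_cell_text`) are hypothetical, and the sample blocks are made up for illustration rather than taken from the document above.

```python
from typing import Dict, List


def child_ids(block: dict, rel_type: str = "CHILD") -> List[str]:
    """Return the ids listed under the given relationship type of a block."""
    ids: List[str] = []
    for rel in block.get("Relationships") or []:
        if rel.get("Type") == rel_type:
            ids.extend(rel.get("Ids", []))
    return ids


def cell_text(cell: dict, blocks_by_id: Dict[str, dict]) -> str:
    """Join the WORD children of a single CELL block."""
    words = []
    for cid in child_ids(cell):
        child = blocks_by_id.get(cid)
        if child and child.get("BlockType") == "WORD":
            words.append(child.get("Text", ""))
    return " ".join(words)


def merged_cell_text(merged: dict, blocks_by_id: Dict[str, dict]) -> str:
    """Join the text of every CELL under a MERGED_CELL, in row/column order."""
    cells = [blocks_by_id[i] for i in child_ids(merged) if i in blocks_by_id]
    cells.sort(key=lambda c: (c.get("RowIndex", 0), c.get("ColumnIndex", 0)))
    parts = [cell_text(c, blocks_by_id) for c in cells]
    return " ".join(p for p in parts if p)


# Made-up blocks: one merged cell covering two cells in the same column
sample_blocks = [
    {"Id": "m1", "BlockType": "MERGED_CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["c2", "c1"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 3, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4", "w5", "w6"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Short"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Books"},
    {"Id": "w3", "BlockType": "WORD", "Text": "(0"},
    {"Id": "w4", "BlockType": "WORD", "Text": "-"},
    {"Id": "w5", "BlockType": "WORD", "Text": "100"},
    {"Id": "w6", "BlockType": "WORD", "Text": "pages)"},
]
blocks_by_id = {b["Id"]: b for b in sample_blocks}
merged_texts = [merged_cell_text(b, blocks_by_id)
                for b in sample_blocks if b["BlockType"] == "MERGED_CELL"]
print(merged_texts)
```

With a real response, `sample_blocks` would be replaced by `textract_json["Blocks"]`. If the words show up here but not in `cell.mergedText`, the text is being lost during parsing rather than during extraction. Selection elements and other non-`WORD` children are ignored in this sketch.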

Mar 17 '23 19:03 GradyMellin

Hey @GradyMellin, I am also facing the same issue. Did you find a workaround for this?

Jan 08 '24 11:01 pranavbhat12