bug/_convert_table_to_text index out of range
Describe the bug A list index out of range occurs in _convert_table_to_text during docx parsing.
To Reproduce I was operating on 1360 docx files from this source: https://www.3gpp.org/ftp/Specs/latest/Rel-17 For the .doc files, I first converted them to docx using the command below:
C:\"Program Files"\LibreOffice\program\soffice.exe --headless --convert-to docx --outdir out in\<filename>
Expected behavior _convert_table_to_text to correctly convert all docx tables
Desktop:
- OS: Windows 10
- Python version: 3.10.10
Additional context 21101-h10.docx 21202-h00.docx 21205-h10.docx 21905-h10.docx 22003-h00.docx 22004-h00.docx 22011-h60.docx 22022-h00.docx 22030-h00.docx 22031-h00.docx 22032-h00.docx 22041-h00.docx 22042-h00.docx 22057-h00.docx 22071-h00.docx 22072-h00.docx 22081-h00.docx 22084-h00.docx 22087-h00.docx 22090-h00.docx 22094-h00.docx 22096-h00.docx 22097-h00.docx 22101-h50.docx 22104-h70.docx 22115-h10.docx 22119-h00.docx 22125-h60.docx 22135-h00.docx 22142-h00.docx 22146-h00.docx 22182-h00.docx 22185-h00.docx 22186-h00.docx 22220-h00.docx 22226-h00.docx 22242-h00.docx 22246-h00.docx 22259-h00.docx 22261-hb0.docx 22263-h40.docx 22268-h00.docx 22278-h20.docx 22279-h00.docx 22282-h00.docx 22346-h00.docx 22368-h00.docx 22468-h01.docx 22519-h00.docx 22826-h20.docx 22829-h10.docx 22832-h40.docx 22836-h10.docx 22866-h10.docx 22873-020.docx 22881-020_cl.docx 22889-h40.docx 22912-h00.docx 22936-h00.docx 22944-h00.docx 22948-h00.docx 22967-h00.docx 22973-h00.docx 22978-h00.docx 22979-h00.docx 22986-h00.docx 22987-h00.docx 23035-h00.docx 23041-h40.docx 23172-h00.docx 23222-h70.docx 23281-h60.docx 23303-h00.docx 23379-h90.docx 23402-h00.docx 23554-h20.docx 23558-h70.docx 23744-h10.docx 23755-h00.docx 23758-h00.docx 23783-1a0_sAnnex_A.docx 23783-1a0_sAnnex_D.docx 23783-1a0_sAnnex_E.docx 24002-h00.docx 24022-h00.docx 24166-h00.docx 24250-h00.docx 24322-h00.docx 24323-h10.docx 24333-h00.docx 24341-h10.docx 24371-h10.docx 24391-h00.docx 24483-h70.docx 25102-h00.docx 25113-h00.docx 25116-h00.docx 25153-h00.docx 25171-h00.docx 25172-h00.docx 25173-h00.docx 25213-h00.docx 25214-h00.docx 25221-h00.docx 25222-h00.docx 25224-h00.docx 25304-h10.docx 25305-h00.docx 25306-h10.docx 25321-h00.docx 25322-h00.docx 25323-h00.docx 25327-h00.docx 25401-h00.docx 25411-h00.docx 25412-h00.docx 25413-h00.docx 25420-h00.docx 25421-h00.docx 25422-h00.docx 25423-h00.docx 25424-h00.docx 25430-h00.docx 25431-h00.docx 25435-h00.docx 25442-h00.docx 25444-h00.docx 25446-h00.docx 25450-h00.docx 25453-h00.docx 
25461-h00.docx 25470-h00.docx 25912-h00.docx 25914-h00.docx 25943-h00.docx 25951-h00.docx 25963-h00.docx 25967-h00.docx 25968-h00.docx 25993-h00.docx 26074-h01.docx 26090-h00.docx 26091-h00.docx 26101-h00.docx 26102-h00.docx 26103-h00.docx 26110-h00.docx 26117-h00.docx 26131-h30.docx 26132-h20.docx 26140-h00.docx 26141-h00.docx 26142-h00.docx 26150-h00.docx 26173-h11.docx 26177-h00.docx 26179-h00.docx 26193-h00.docx 26201-h00.docx 26204-h10.docx 26231-h00.docx 26234-h00.docx 26243-h00.docx 26245-h00.docx 26247-h30.docx 26267-h00.docx 26268-h00.docx 26273-h00.docx 26347-h20.docx 26403-h00.docx 26404-h00.docx 26410-h01.docx 26411-h00.docx 26412-h00.docx 26430-h00.docx 26445-h00_1_s05_s0501.docx 26445-h00_2_s0502_s050203.docx 26445-h00_4_s050206.docx 26445-h00_5_s0503.docx 26445-h00_6_s0504_s0506.docx 26445-h00_9_s0602_s0607.docx 26445-h00_a_s0608_sHistory.docx 26446-h00.docx 26447-h10.docx 26448-h00.docx 26450-h00.docx 26452-h00.docx 26511-h10.docx 26903-h00.docx 26904-h00.docx 26907-h00.docx 26911-h00.docx 26918-h00.docx 26923-h00.docx 26925-h10.docx 26937-h00.docx 26938-h00.docx 26943-h00.docx 26944-h00.docx 26946-h00.docx 26947-h00.docx 26949-h00.docx 26952-h00.docx 26957-h00.docx 26959-h00.docx 26967-h00.docx 26980-h00.docx 27002-h00.docx 27003-h00.docx 27010-h00.docx 28302-h00.docx 28303-h00.docx 28305-h00.docx 28308-h00.docx 28310-h50.docx 28311-h00.docx 28402-h00.docx 28403-h00.docx 28404-h40.docx 28405-h40.docx 28510-h00.docx 28511-h00.docx 28513-h00.docx 28520-h00.docx 28521-h00.docx 28525-h00.docx 28526-h00.docx 28528-h00.docx 28530-h40.docx 28531-h70.docx 28533-h30.docx 28540-h30.docx 28550-h10.docx 28623-h51.docx 28626-h00.docx 28628-h00.docx 28629-h00.docx 28631-h00.docx 28656-h00.docx 28657-h00.docx 28658-h10.docx 28662-h00.docx 28667-h00.docx 28668-h00.docx 28669-h00.docx 28672-h00.docx 28681-h00.docx 28682-h00.docx 28683-h00.docx 28701-h00.docx 28702-h00.docx 28707-h00.docx 28708-h00.docx 28731-h00.docx 28732-h00.docx 28735-h00.docx 28751-h00.docx 
28812-h10.docx 29007-h00.docx 29108-h00.docx 29153-h00.docx 29164-h00.docx 29215-h00.docx 29217-h00.docx 29250-h00.docx 29251-h20.docx 29343-h00.docx 29368-h00.docx 29414-h00.docx 29486-h60.docx 29507-h90.docx 29508-ha0.docx 29512-ha0.docx 29517-h90.docx 29523-h80.docx 29549-h70.docx 29554-h40.docx 29594-h50.docx 29658-h00.docx 29675-h70.docx 29949-h00.docx 32111-1-h00.docx 32111-2-h00.docx 32111-6-h00.docx 32121-h00.docx 32122-h00.docx 32126-h00.docx 32130-h60.docx 32153-h00.docx 32154-h00.docx 32157-h00.docx 32158-h40.docx 32160-h70.docx 32181-h00.docx 32182-h00.docx 32250-h00.docx 32253-h00.docx 32254-h30.docx 32255-h90.docx 32256-h20.docx 32270-h00.docx 32271-h00.docx 32274-h20.docx 32275-h30.docx 32280-h00.docx 32290-h60.docx 32293-h00.docx 32300-h00.docx 32301-h00.docx 32306-h00.docx 32312-h00.docx 32321-h00.docx 32331-h00.docx 32336-h00.docx 32341-h00.docx 32356-h00.docx 32361-h00.docx 32371-h00.docx 32381-h00.docx 32386-h00.docx 32391-h00.docx 32404-h00.docx 32407-h00.docx 32408-h00.docx 32409-h00.docx 32411-h00.docx 32421-h40.docx 32425-h10.docx 32436-h00.docx 32442-h00.docx 32446-h00.docx 32450-h00.docx 32452-h10.docx 32453-h00.docx 32501-h00.docx 32506-h00.docx 32531-h00.docx 32536-h00.docx 32541-h00.docx 32572-h00.docx 32581-h00.docx 32582-h00.docx 32583-h00.docx 32592-h10.docx 32594-h00.docx 32600-h00.docx 32601-h00.docx 32602-h00.docx 32612-h00.docx 32690-h00.docx 32901-h00.docx 33102-h00.docx 33106-h00.docx 33110-h00.docx 33117-h30.docx 33122-h10.docx 33187-h00.docx 33203-h10.docx 33204-h00.docx 33210-h10.docx 33216-h00.docx 33221-h00.docx 33234-h00.docx 33246-h00.docx 33250-h00.docx 33259-h00.docx 33303-h10.docx 33310-h60.docx 33320-h00.docx 33402-h00.docx 33511-h31.docx 33513-h10.docx 33514-h00.docx 33515-h00.docx 33518-h00.docx 33824-h00.docx 33916-h00.docx 33937-h00.docx 33995-h00.docx 34109-h00.docx 34926-h00.docx 35201-h00.docx 35204-h00.docx 35207-h00.docx 35216-h00.docx 35217-h00.docx 35218-h00.docx 35222-h00.docx 35232-h00.docx 
35233-h00.docx 35935-h00.docx 35936-h00.docx 36360-h00.docx 36361-h00.docx 36414-h00.docx 36422-h00.docx 36425-h00.docx 36441-h00.docx 36442-h00.docx 36443-h01.docx 36444-h00.docx 36455-h10.docx 36456-h01.docx 36457-h00.docx 36462-h00.docx 36463-h00.docx 36903-h00.docx 36904-h00.docx 36905-h00.docx 36913-h00.docx 37460-h00.docx 37470-h00.docx 37481-h00.docx 38201-h00.docx 38411-h00.docx 38412-h00.docx 38414-h00.docx 38422-h00.docx 38462-h00.docx 38463-h00.docx 38913-h00.docx 41101-h00.docx 42068-h00.docx 42069-h00.docx 43010-h00.docx 43020-h00.docx 43026-h00.docx 43030-h00.docx 43055-h00.docx 43058-h00.docx 43059-h00.docx 43064-h00.docx 43129-h00.docx 43246-h00.docx 43318-h00.docx 43902-h00.docx 44004-h00.docx 44012-h00.docx 44014-h00.docx 44060-h00.docx 44071-h00.docx 44901-h00.docx 45002-h00.docx 45008-h00.docx 45010-h00.docx 45050-h00.docx 45056-h00.docx 45903-h00.docx 45912-h00.docx 45913-h00.docx 45926-h00.docx 46001-h00.docx 46002-h00.docx 46007-h00.docx 46008-h00.docx 46011-h00.docx 46012-h00.docx 46020-h00.docx 46021-h00.docx 46054-h00.docx 46061-h00.docx 46081-h00.docx 48001-h00.docx 48006-h00.docx 48008-h00.docx 48014-h00.docx 48016-h00.docx 48018-h00.docx 48031-h00.docx 48049-h00.docx 48052-h00.docx 48054-h00.docx 48056-h00.docx 48058-h00.docx 48061-h00.docx 48103-h00.docx 49031-h00.docx 49995-h00.docx 51021-h00.docx 51026-h00.docx 52008-h00.docx 52402-h00.docx 55205-h00.docx 55217-h00.docx 55226-h00.docx 55236-h00.docx 55241-h00.docx 55243-h00.docx 55252-h00.docx Readme_VAD2_TV_h01.docx 22.890-040_rm.docx
Using the following prevents the script from breaking:
@property
def _cells(self):
    """
    A sequence of |_Cell| objects, one for each cell of the layout grid.
    If the table contains a span, one or more |_Cell| object references
    are repeated.
    """
    col_count = self._column_count
    cells = []
    for tc in self._tbl.iter_tcs():
        for grid_span_idx in range(tc.grid_span):
            # if tc.vMerge == ST_Merge.CONTINUE:
            #     cells.append(cells[-col_count])
            if grid_span_idx > 0:
                cells.append(cells[-1])
            else:
                cells.append(_Cell(tc, self))
    return cells
Thanks for submitting this, @igoforth ! I'm attempting to reproduce from the list of files you provided. My initial take is that it looks like the error is occurring in the docx library, so it's possible an issue needs to be submitted there. I'm going to look into it though to see if some changes are appropriate from our side.
I've parsed a couple hundred of these docs now without errors. Can you by any chance point to a specific document that gives you the error you got?
I wish I could, but as you might understand I didn't want to try to debug a python program that, in this case, took over two days to finish lol.
If not random, understanding in which order the implementation of unstructured acts on files could clue us in. Perhaps file #286?
Reference my gist here for how I converted originally https://gist.github.com/igoforth/80b86cc4a256db502b5d8bed3b857113
It's worth noting that my memory and CPU usage were both close to full. Could there be a timing issue?
After the original error, I commented out the block which looked like an edge case. It then ran great for me.
Aside from that, LibreOffice could've done something funky which messed with whatever tc.vMerge is. Did you try both 7.5.4 and 7.4.7? I probably used the stable branch. Apologies for not having looked into it further yet.
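One way to pin down a specific failing document without aborting a long run is to catch the exception per file and record the filename. The following is only a minimal sketch: `find_failing_files` and its `parse` argument are hypothetical names introduced here for illustration, and `parse` stands in for whatever call actually processes a single docx file.

```python
import logging
from pathlib import Path


def find_failing_files(directory, parse, pattern="*.docx"):
    """Run `parse` on every matching file and return the names that raise IndexError.

    `parse` is a hypothetical callable taking a file path as a string;
    it is an assumption of this sketch, not part of any library.
    """
    failures = []
    # Sort so the processing order is reproducible between runs.
    for path in sorted(Path(directory).glob(pattern)):
        try:
            parse(str(path))
        except IndexError as exc:
            # Log and continue instead of stopping the whole batch.
            logging.warning("IndexError in %s: %s", path.name, exc)
            failures.append(path.name)
    return failures
```

With unstructured installed, something like `find_failing_files("out", lambda p: partition_docx(filename=p))` could be passed in, assuming `partition_docx` is the entry point in use.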
Ok it looks like you're running into the issue referenced here and here. It looks like this issue has been around for a while in the python-docx library with no fix.
Reopening per report from community Slack. @scanny - per @qued 's comment this might stem from python-docx, thought you'd know better than us 😄
Ah, right, this can happen when a Word table becomes non-uniform, that is, when not all rows contain the same number of cells (after accounting for merged cells). Unfortunately, Word itself can produce this in certain table-editing situations where row endings don't line up. I'll change the docx partitioner to not assume tables are uniform.
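To illustrate why a non-uniform table can trigger the index error, here is a toy model of the grid-building loop. It is a simplified sketch, not the python-docx implementation: rows are plain lists, the string "V" is a made-up marker for a vertically merged continuation cell, and `layout_cells` and `is_uniform` are hypothetical helper names.

```python
def layout_cells(rows, col_count):
    """Flatten rows into one cell list, resolving "V" by looking one full row back."""
    cells = []
    for row in rows:
        for cell in row:
            if cell == "V":
                # Assumes the row above contributed exactly col_count cells.
                cells.append(cells[-col_count])
            else:
                cells.append(cell)
    return cells


def is_uniform(rows, col_count):
    """True when every row has exactly col_count cells."""
    return all(len(row) == col_count for row in rows)


uniform = [["a", "b", "c"], ["V", "d", "e"]]
ragged = [["a"], ["V", "d", "e"]]  # first row is short

print(layout_cells(uniform, 3))
print(is_uniform(ragged, 3))
try:
    layout_cells(ragged, 3)
except IndexError as exc:
    print("ragged table failed:", exc)
```

In the ragged case only one cell exists when the "V" marker is reached, so looking back three positions falls off the front of the list. Checking uniformity first, or not relying on it at all, avoids the failure.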
This is fixed on unstructured@main. It should appear in v0.13.7 which should be released within a week or so.
If you want to try it out in the meantime one option is:
- clone unstructured
- run $ make install in the repo root directory, once you have activated the target virtualenv.