
bug/element type for non-English languages

Open cm-halfspace opened this issue 1 year ago • 1 comments

Describe the bug

When I partition a Danish .docx file, I notice some weird classifications of the element types.

I think this is related to the fact that the languages list is not being set in _parse_paragraph_text_for_element_type, e.g. in the call to is_possible_narrative_text(text).

Looking at the definition of is_possible_narrative_text, a quick temporary fix would be to at least include language_checks in the condition on line 90, so that it becomes:

if "eng" in languages and language_checks and (sentence_count(text, 3) < 2) and (not contains_verb(text)):

To Reproduce

from unstructured.partition.text_type import is_possible_narrative_text
text = "Dette er et eksempel på en kort sætning."
is_possible_narrative_text(text)

This returns False right now. With the quick fix above, it would return True as expected.
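To make the effect of the proposed condition concrete, here is a minimal, self-contained sketch. It does not import unstructured: `sentence_count` and `contains_verb` below are simplified, hypothetical stand-ins for the library helpers (the verb list is invented for illustration), and the "current" condition is only what this report implies line 90 looks like today. The assumption is that when the guard is true, the text is rejected as narrative text.

```python
import re

# Hypothetical stand-in: a tiny English verb list. The real library helper
# is more sophisticated; this only mimics "no English verb detected".
ENGLISH_VERBS = {"is", "are", "was", "were", "has", "have", "be"}


def sentence_count(text: str, min_length: int = 0) -> int:
    """Count sentences that contain at least `min_length` words (simplified)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(1 for s in sentences if len(s.split()) >= min_length)


def contains_verb(text: str) -> bool:
    """Return True if any word matches the toy English verb list."""
    return any(w.strip(".,!?").lower() in ENGLISH_VERBS for w in text.split())


def guard_current(text: str, languages: list[str]) -> bool:
    """Condition as implied by this report: language_checks is not consulted."""
    return bool(
        "eng" in languages
        and (sentence_count(text, 3) < 2)
        and (not contains_verb(text))
    )


def guard_proposed(text: str, languages: list[str], language_checks: bool) -> bool:
    """Condition with the suggested quick fix: also require language_checks."""
    return bool(
        "eng" in languages
        and language_checks
        and (sentence_count(text, 3) < 2)
        and (not contains_verb(text))
    )


danish = "Dette er et eksempel på en kort sætning."
# With languages defaulting to English and language checks disabled,
# only the current condition fires for the Danish sentence.
fires_now = guard_current(danish, ["eng"])
fires_after_fix = guard_proposed(danish, ["eng"], language_checks=False)
```

In this sketch the Danish sentence trips the English-only verb check under the current condition, while the proposed condition skips the check when `language_checks` is off. Passing the correct `languages` value from the caller would address the root cause; the extra `language_checks` term is only the stopgap described above.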

cm-halfspace, May 17 '24 11:05

Hi @cm-halfspace - thanks for reporting this! We'll look at this as soon as we can, or happy to review if you want to open a PR with your suggested change.

MthwRobinson, May 17 '24 12:05