
bug/Wrong parsing of html, xml code blocks in markdown

Open cgjosephlee opened this issue 1 year ago • 2 comments

Describe the bug HTML and XML code blocks in markdown are not parsed properly.

Results:

HTML Example
```html
Hello, World!
This is a simple HTML example.
```
XML Example
xml <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
```xml
```
```xml
```
  • HTML tags are not preserved.
  • XML code is malformed. The blank lines may erase the context.
  • <?xml version='1.0' encoding='UTF-8'?> line breaks the parser.
Traceback (most recent call last):
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/test.py", line 14, in <module>
    elems = partition_html(
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
    elements = list(
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
    elements = list(elements)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
    yield from cls(opts)._iter_elements()
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
    for e in self._main.iter_elements():
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
    yield from self._element_from_text_or_tail(block_item.tail or "", q)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
    for node in self._iter_text_segments(text, q):
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
    while q and q[0].is_phrasing:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
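
The `_ProcessingInstruction` in the error above suggests that the `<?xml ...?>` line reaches the HTML parser as a processing-instruction node rather than as text. The sketch below illustrates that classification using only the standard library's `html.parser`. It is an illustration, not what unstructured does internally: the traceback shows unstructured parsing with lxml, and the two parsers differ in detail. `PICollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Record every processing instruction the parser encounters."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        # Called for any `<?...>` construct, which covers XML declarations
        # as well as PHP-style tags.
        self.instructions.append(data)


collector = PICollector()
collector.feed(
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    "<note><to>Tove</to></note>\n"
    "<?php phpinfo(); ?>\n"
)
collector.close()

for pi in collector.instructions:
    print(repr(pi))
```

Both `<?...?>` constructs are reported as processing instructions, so any code that assumes every node is an ordinary element would need to handle or skip them.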

To Reproduce

## HTML Example

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample HTML</title>
</head>
<body>
    <h1>Hello, World!</h1>
    <p>This is a simple HTML example.</p>
</body>
</html>
```

## XML Example

```xml
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
```

```xml

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

```

```xml
<?xml version='1.0' encoding='UTF-8'?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
```

Expected behavior The content in code blocks should be preserved as it is.

Environment Info 0.15.7

Additional context Since Markdown is first converted to HTML, adding extensions=['fenced_code'] to the Markdown parser solves the issue. A better approach would be to make the extensions list a configurable parameter. https://github.com/Unstructured-IO/unstructured/blob/f440eb476cf75d6109e8a3719cadf893529dcef8/unstructured/partition/md.py#L109
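
To make the suggestion above concrete, here is a minimal standard-library sketch of the effect a fence-aware conversion has: the body of each fenced block is escaped and wrapped in `<pre><code>`, so a downstream HTML parser sees text instead of tags. This is not Python-Markdown's implementation of `fenced_code`, and `escape_fenced_blocks` and `FENCE_RE` are hypothetical names used only for illustration.

```python
import html
import re

# Opening fence with an optional language tag, the block body, then a closing fence.
FENCE_RE = re.compile(
    r"^```(?P<lang>[\w+-]*)[ \t]*\n(?P<body>.*?)^```[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def escape_fenced_blocks(md_text: str) -> str:
    """Replace each fenced block with an escaped <pre><code> element."""

    def to_pre(match: re.Match) -> str:
        lang = match.group("lang")
        # Escape <, > and & so the tags inside the block survive as literal text.
        body = html.escape(match.group("body"), quote=False)
        cls = f' class="language-{lang}"' if lang else ""
        return f"<pre><code{cls}>{body}</code></pre>"

    return FENCE_RE.sub(to_pre, md_text)


sample = (
    "## XML Example\n"
    "\n"
    "```xml\n"
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    "<note>\n"
    "    <to>Tove</to>\n"
    "</note>\n"
    "```\n"
)

converted = escape_fenced_blocks(sample)
print(converted)
```

With Python-Markdown itself, the equivalent step is to pass `extensions=["fenced_code"]` to `markdown.markdown(...)`, as the report proposes. The regex here handles only simple triple-backtick fences and is not a full Markdown parser.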

cgjosephlee avatar Aug 29 '24 08:08 cgjosephlee

Hi @cgjosephlee - Thanks for the report and the detailed reproduction steps. We'll take a look as soon as we're able. cc @scanny .

MthwRobinson avatar Aug 29 '24 20:08 MthwRobinson

@MthwRobinson

The same bug also occurs with text containing PHP tags. Please take a look. Sample text:

<?php phpinfo(); ?>

UPON-2021 avatar Apr 18 '25 03:04 UPON-2021

This should be fixed by this PR. Going to close this issue now, but please reopen if you still see the same problem.

jiajun-unstructured avatar Jul 01 '25 20:07 jiajun-unstructured

Closing as resolved by #4044

qued avatar Jul 08 '25 01:07 qued