Unstructured library is unable to extract CDATA from XML data

Open • PhaneendraGunda opened this issue 1 year ago • 1 comment

Sample XML:

<GENERAL_INFO><TITLE><![CDATA[Mobile Apple Devices (iPhones, iPads, and Smartwatches)]]></TITLE><SUMMARY><![CDATA[<p>This article highlights the key benefits and specifications of Apple iPhones, iPads, and Smartwatches.</p></SUMMARY></GENERAL_INFO>

Code to fetch data from the XML:

from unstructured.partition.html import partition_html

# _html_text holds the sample XML above as a string
_text = ' '.join(element.text for element in partition_html(text=_html_text))

Is there a flag or function that enables extracting content from CDATA sections?

PhaneendraGunda, May 22 '24 19:05

Thanks for the issue, @PhaneendraGunda! We'll discuss and follow up.

shreyanid, May 22 '24 20:05
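
Until there is an answer on a built-in option, one possible workaround is to read the CDATA sections with Python's standard-library XML parser before handing anything to unstructured. The sketch below is not an unstructured feature; it assumes the input is well-formed XML, so it uses a variant of the sample in which the SUMMARY CDATA section is closed with `]]>` (the sample as posted appears to be missing that terminator). The names `sample_xml` and `extract_cdata_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the sample from the issue: the SUMMARY CDATA
# section is closed with "]]>" here (assumption, see note above).
sample_xml = (
    "<GENERAL_INFO>"
    "<TITLE><![CDATA[Mobile Apple Devices (iPhones, iPads, and Smartwatches)]]></TITLE>"
    "<SUMMARY><![CDATA[<p>This article highlights the key benefits and "
    "specifications of Apple iPhones, iPads, and Smartwatches.</p>]]></SUMMARY>"
    "</GENERAL_INFO>"
)


def extract_cdata_text(xml_text):
    """Map each direct child tag of the root to its text content.

    The XML parser exposes CDATA content as ordinary element text,
    so no special flag is needed to read it.
    """
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


fields = extract_cdata_text(sample_xml)
print(fields["TITLE"])
print(fields["SUMMARY"])
```

The SUMMARY value is itself an HTML fragment, so it could then be passed to `partition_html(text=fields["SUMMARY"])` as in the original snippet, while plain-text fields such as TITLE can be used directly.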