inscriptis

Label offset not accurate in case of table

Open ShakedAharonn opened this issue 1 year ago • 6 comments

Hi, I encountered this bug while trying to scrape a specific site:

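```python
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig

# Markup assumed: the original tags were lost; a one-row table with a <ul>
# in each of its two cells is inferred from the output below.
page = """
<table>
  <tr>
    <td><ul><li>item1</li><li>item2</li><li>item3</li><li>item4</li></ul></td>
    <td><ul><li>item5</li><li>item6</li><li>item7</li><li>item8</li></ul></td>
  </tr>
</table>
"""

rules = {'ul': ['ul'], 'table': ['table']}

output = get_annotated_text(page, ParserConfig(annotation_rules=rules))
# {'text': ' * item1 * item5\n * item2 * item6\n * item3 * item7\n * item4 * item8\n',
#  'label': [(0, 85, 'table'), (0, 40, 'ul'), (11, 51, 'ul')]}

start_index, end_index, annotation = output['label'][1]
output['text'][start_index:end_index]
# ' * item1 * item5\n * item2 * item'
```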

As can be seen, slicing the text with a label's offsets does not return the text of the annotated element: the offsets are not accurate when the list sits inside a table.

ShakedAharonn, Feb 04 '25 09:02

Would annotating li rather than ul fix your problem?

AlbertWeichselbraun, Feb 04 '25 18:02

I can try, but that would defeat the purpose of capturing the full list as a single segment, wouldn't it?

ShakedAharonn, Feb 05 '25 09:02

With the current implementation, an annotation covers the area between an element's start and end tag.

For a ul inside a table cell, this leads to overlaps with the start tags of the uls in the following cells. Avoiding them would require one annotation to yield multiple areas (i.e., one for each line) rather than a single one.
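The overlap can be reproduced without inscriptis. The following is a minimal sketch in plain Python, assuming a two-cell row whose cells are rendered side by side; the names `left`, `right`, and `cell_spans` are hypothetical and are not part of the inscriptis API. It contrasts a single start/stop pair with one span per rendered line:

```python
# Hypothetical cell contents, one entry per rendered line.
left = [" * item1", " * item2", " * item3", " * item4"]
right = [" * item5", " * item6", " * item7", " * item8"]

# Cells of the same table row are laid out side by side, line by line.
lines = [l + r for l, r in zip(left, right)]
text = "\n".join(lines) + "\n"

# A single (start, end) pair from the first to the last character of the
# left cell necessarily spans the right cell's text on the lines in between.
start = text.index(left[0])
end = text.index(left[-1]) + len(left[-1])
single_span = text[start:end]


def cell_spans(lines, col_start, col_width):
    """Return one (start, end) span per rendered line for a fixed-width column."""
    spans, offset = [], 0
    for line in lines:
        spans.append((offset + col_start, offset + col_start + col_width))
        offset += len(line) + 1  # +1 for the newline that ends each line
    return spans


# One span per line recovers exactly the left cell's text.
left_spans = cell_spans(lines, 0, len(left[0]))
left_text = [text[s:e] for s, e in left_spans]
```

Here `single_span` contains items from the right cell, while `left_text` contains only the left cell's lines. Returning several spans per annotation would, however, change the output format.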

In my opinion, this is a use case where it makes more sense to capture the content of the ul tag with an XPath expression (e.g., via lxml) and then use inscriptis to convert the extracted content to text.

AlbertWeichselbraun, Feb 05 '25 10:02

The issue with that solution is that I use inscriptis in an automated RT process that parses many different domains. There is no way to know the structure of each page beforehand.

ShakedAharonn, Feb 05 '25 17:02

I see. At the moment, I do not see how this use case could be supported with the current annotation design, which marks relevant text with a start and stop index.

Your use case would require returning a tree structure outlining which li belong to which ul or ol tags. Unless you have an idea on how to solve this without changing the output format (i.e., breaking compatibility), this will probably be an extension to consider for the next major inscriptis release.
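To make the idea of such a structure concrete, here is a rough sketch built on the standard library's `html.parser`. It is not an inscriptis feature and not a proposed implementation; the class `ListCollector`, the function `collect_lists`, and the output shape are hypothetical. It groups the text of each li under its enclosing ul or ol, and for simplicity it drops any text of an outer li that follows a nested list:

```python
from html.parser import HTMLParser


class ListCollector(HTMLParser):
    """Group the text of <li> elements under their enclosing <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.lists = []     # one dict per list, in document order
        self._open = []     # stack of lists that are currently open
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            node = {"tag": tag, "items": []}
            self.lists.append(node)
            self._open.append(node)
        elif tag == "li" and self._open:
            self._open[-1]["items"].append("")
            self._in_li = True

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._open:
            self._open.pop()
        elif tag == "li":
            self._in_li = False

    def handle_data(self, data):
        # Append text only while inside an <li> of the innermost open list.
        if self._in_li and self._open and self._open[-1]["items"]:
            self._open[-1]["items"][-1] += data


def collect_lists(html_text):
    collector = ListCollector()
    collector.feed(html_text)
    return collector.lists


page = (
    "<table><tr>"
    "<td><ul><li>item1</li><li>item2</li><li>item3</li><li>item4</li></ul></td>"
    "<td><ul><li>item5</li><li>item6</li><li>item7</li><li>item8</li></ul></td>"
    "</tr></table>"
)
lists = collect_lists(page)
```

Each entry in `lists` holds one list's tag and item texts, independent of how the table is laid out in the rendered text.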

AlbertWeichselbraun, Feb 08 '25 06:02

Thanks, I will keep following this and hope the feature gets added in a future release. By the way, a probably easy solution would be to return the text for each tag instead of a list of start and end indexes.

ShakedAharonn, Feb 18 '25 13:02