JsonCssExtractionStrategy Fails to Handle Lists of Elements

Open hypy13 opened this issue 1 year ago • 0 comments

After several hours of debugging, I found a significant problem with JsonCssExtractionStrategy when extracting a list of elements.

Example:

In the example provided in the repository:
https://github.com/unclecode/crawl4ai/blob/main/docs/examples/v0_4_24_walkthrough.py

The example HTML contains two <article class="post"> elements, but JsonCssExtractionStrategy returns only the first one.

Root Cause:

The issue lies in the _get_elements() method of JsonCssExtractionStrategy, which fetches only the first element matching the selector:

def _get_elements(self, element, selector: str):
    selected = element.select_one(selector)  # Only gets the first match
    return [selected] if selected else []

Because select_one() returns at most one match, every additional matching element is silently dropped. As a result, I couldn't even retrieve a list of tags inside a <div>.
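
For illustration, here is a minimal standalone sketch of the difference. It assumes BeautifulSoup (bs4) is the underlying parser, as the select_one() call suggests. The function names get_first_element and get_all_elements and the sample HTML are hypothetical and are not taken from the crawl4ai code base.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two elements matching the same selector.
HTML = """
<div class="posts">
  <article class="post"><h2>First</h2></article>
  <article class="post"><h2>Second</h2></article>
</div>
"""


def get_first_element(element, selector: str):
    """Mirrors the behavior quoted above: returns at most one match."""
    selected = element.select_one(selector)
    return [selected] if selected else []


def get_all_elements(element, selector: str):
    """Possible alternative (an assumption, not the library's code):
    returns every match in document order, or an empty list."""
    return list(element.select(selector))


soup = BeautifulSoup(HTML, "html.parser")

first_only = get_first_element(soup, "article.post")
all_matches = get_all_elements(soup, "article.post")
titles = [article.h2.get_text() for article in all_matches]

print(len(first_only), len(all_matches), titles)
```

With select(), a caller that expects a single value would need to pick the first item of the returned list explicitly.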

My Thoughts:

This limitation has been a frustrating blocker and cost me several hours. I'm open to contributing a rewrite of this method so that it handles lists of elements properly, but doing so would require a change in the current strategy. Let me know if this aligns with your goals, and I can propose an updated implementation.

Jan 08 '25 20:01 hypy13