JsonCssExtractionStrategy Fails to Handle Lists of Elements
After working for several hours, I discovered a significant problem with the JsonCssExtractionStrategy when trying to extract a list of elements.
Example:
In the example provided in the repository:
https://github.com/unclecode/crawl4ai/blob/main/docs/examples/v0_4_24_walkthrough.py
There are two <article class="post"> elements, but JsonCssExtractionStrategy only retrieves the first <article> tag.
Root Cause:
The issue lies in the implementation of the _get_elements() method in JsonCssExtractionStrategy, which is designed to fetch only the first element matching the selector:
def _get_elements(self, element, selector: str):
    selected = element.select_one(selector)  # Only gets the first match
    return [selected] if selected else []
This approach discards every match after the first, so a selector that should yield several elements returns at most one. As a result, I couldn't even retrieve a list of tags inside a <div>.
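For illustration, here is a minimal sketch of the behaviour I would expect. It assumes the element is a BeautifulSoup Tag, which the select_one call suggests. The names get_all_elements and demo are hypothetical and not part of crawl4ai, and the HTML is made up for this example.

```python
def get_all_elements(element, selector: str) -> list:
    """Return every descendant of `element` matching the CSS selector.

    Hypothetical helper, not crawl4ai code. BeautifulSoup's select()
    returns all matches in document order, whereas select_one() returns
    only the first match (or None).
    """
    return list(element.select(selector))


def demo() -> list:
    # Imported here so the helper above has no hard dependency on bs4.
    from bs4 import BeautifulSoup  # requires the beautifulsoup4 package

    # Made-up markup with two <article class="post"> elements.
    html = """
    <div>
      <article class="post"><h2>First</h2></article>
      <article class="post"><h2>Second</h2></article>
    </div>
    """
    soup = BeautifulSoup(html, "html.parser")
    posts = get_all_elements(soup, "article.post")
    # Collect the heading text of every matched article.
    return [post.select_one("h2").get_text(strip=True) for post in posts]


if __name__ == "__main__":
    print(demo())
```

With select() both articles come back. Whether the fix belongs in _get_elements() itself or in a separate list-oriented method is a design decision for the maintainers.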
My Thoughts:
This limitation has been a frustrating blocker, and it wasted several hours of my time. I'm open to contributing and rewriting this method to handle lists of elements properly, but doing so would require a change in the current strategy. Let me know if this aligns with your goals, and I can propose an updated implementation.