
[Bug]: 'NoneType' object has no attribute 'new_context'

Open betterthanever2 opened this issue 11 months ago • 5 comments

crawl4ai version

0.4.248

Expected Behavior

No errors, content is collected.

Current Behavior

Every crawl attempt fails with 'NoneType' object has no attribute 'new_context'

Is this reproducible?

Yes

Inputs Causing the Bug

I get the error on a variety of URLs (normally supplied in batches of 5). Some examples:
- https://comitet.su/item/prosto-kinuli.html
- https://comitet.su/item/v-berdyanske-prozvuchal-novyj-vzryv.html
- https://rusvesna.su/news/1664030090
- https://antifashist.com/item/postanovki-sbu-v-stile-95-kvartala-deputata-kunickogo-obvinyayut-v-izbienii-cheloveka.html

Steps to Reproduce

- It's a Python app packaged as a Docker container; `crawl4ai` is used as a library.
- A VPN is used inside the container to ensure access to the required resources.

Code snippets

class WebpageFetcher:
    ...
    
    @abstractmethod
    async def fetch_webpages(self):
        ...
    
    @abstractmethod
    async def parse_webpage(self, page: Webpage, content: Any):
        ...

    async def parse_web_result(self, page: Webpage, result: CrawlResult) -> Webpage:
        """
        Parse the result of a webpage crawl and update the Webpage object accordingly.

        Args:
            page (Webpage): The Webpage object to update.
            result (CrawlResult): The result of the webpage crawl.

        Returns:
            Webpage: The updated Webpage object.

        Raises:
            Exception: If there was an error fetching the webpage.
        """
        if result.status_code == 200:
            parsed_page = await self.parse_webpage(page, result)
            return parsed_page
        elif result.status_code == 404:
            page.is_removed = True
            return page
        else:
            raise Exception(f"Error fetching webpage: {result.status_code}:: {result.error_message}")
        
    
class WebpageCrawler(WebpageFetcher):
    crawler: AsyncWebCrawler
    extraction_strategy: ExtractionStrategy | None = None
    crawl_dispatcher: MemoryAdaptiveDispatcher

    @abstractmethod
    async def parse_web_result(self, page: Webpage, result: CrawlResult) -> Webpage:
        ...

    def generate_config(self) -> CrawlerRunConfig:
        """
        Generate a configuration for the webpage crawler.

        This method can be redefined in subclasses to add additional options.

        Returns:
            CrawlerRunConfig: The generated crawler configuration.
        """
        return CrawlerRunConfig(cache_mode=CacheMode.BYPASS,
                                magic=True,
                                delay_before_return_html=2,
                                remove_overlay_elements=True,
                                stream=False,
                                remove_forms=True,
                                page_timeout=6000)  # 6 seconds
    
    async def fetch_webpages(self) -> list[Webpage]:
        """
        Fetch multiple webpages concurrently using the configured crawler.

        The fetched webpages are parsed and updated.

        Returns:
            list[Webpage]: The list of updated Webpage objects.

        Raises:
            Exception: If there was an error fetching or parsing the webpages.
        """
        urls = [page.url for page in self.webpages]
        try:
            results = await self.crawler.arun_many(urls=urls,
                                                   extraction_strategy=self.extraction_strategy,
                                                   config=self.generate_config(),
                                                   dispatcher=self.crawl_dispatcher)
        except Exception as e:
            raise e
        
        matched_pages = [(page, result) for result in results 
                        for page in self.webpages if page.url == result.url]
        
        tasks = [self.parse_web_result(page, result) for page, result in matched_pages]
        try:
            updated_pages = await asyncio.gather(*tasks)
            return updated_pages
        except Exception as e:
            raise e
        
    async def parse_webpage(self, page: Webpage, content: CrawlResult) -> Webpage:
        """
        Parse the content of a crawled webpage and update the Webpage object.

        Args:
            page (Webpage): The Webpage object to update.
            content (CrawlResult): The result of the webpage crawl.

        Returns:
            Webpage: The updated Webpage object.

        Raises:
            Exception: If there was an error parsing the extracted content or accommodating the pieces.
        """
        page_content = content.fit_markdown or content.markdown_v2.fit_markdown or content.markdown
        metadata = content.metadata
        media = content.media
        links = content.links
        
        page.content = page_content
        page.page_metadata = metadata
        page.media = media
        page.links = links

        ## clean text for NER
        page.content_ner = self.clean_text(page_content)
        
        try:
            pieces = json.loads(content.extracted_content)
        except Exception as e:
            raise e
        
        try:
            page = await self.accomodate_pieces(page, pieces)
        except Exception as e:
            raise e

        return page
    
    @abstractmethod
    async def accomodate_pieces(self, page: Webpage, pieces: dict) -> Webpage:
        """
        Accommodate the extracted pieces of data into the Webpage object.

        This method should be implemented by subclasses to handle the specific logic
        for storing the extracted data in the appropriate fields of the Webpage object.

        Args:
            pieces (dict): The extracted pieces of data, as a dictionary.

        Raises:
            Exception: If there was an error accommodating the extracted data.
        """
        ...
    
    @abstractmethod
    def define_strategy(self):
        """
        Define the extraction strategy for the webpage crawler.

        This method should be implemented by subclasses to specify the schema and rules
        for extracting data from the crawled webpages.

        Returns:
            The defined extraction strategy, in a format specific to the subclass.
        """
        ...
(There are also higher-level classes that inherit from WebpageCrawler and implement the `define_strategy` and `accomodate_pieces` methods specific to each target.)
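
For context, a concrete subclass might look roughly like the sketch below. This is purely illustrative: the `NewsSiteCrawler` name, the CSS schema, and the `page.extracted` field are assumptions, not taken from the actual project.

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

class NewsSiteCrawler(WebpageCrawler):
    def define_strategy(self) -> JsonCssExtractionStrategy:
        # Hypothetical schema; the selectors depend on the target site.
        schema = {
            "name": "article",
            "baseSelector": "article",
            "fields": [
                {"name": "title", "selector": "h1", "type": "text"},
                {"name": "body", "selector": ".article-text", "type": "text"},
            ],
        }
        return JsonCssExtractionStrategy(schema)

    async def accomodate_pieces(self, page: Webpage, pieces: dict) -> Webpage:
        # Store the extracted data on an assumed field of the Webpage model.
        page.extracted = pieces
        return page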

OS

Linux Ubuntu 22

Python version

3.13.1

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

periodicals-fetch  | [ERROR]... × https://comitet.su/item/daniya-bez-reform-na-ukrai... | Error: 
periodicals-fetch  | │ × Unexpected error in _crawl_web at line 664 in create_browser_context (../crawl4ai/async_crawler_strategy.py):       │
periodicals-fetch  | │   Error: 'NoneType' object has no attribute 'new_context'                                                             │
periodicals-fetch  | │                                                                                                                       │
periodicals-fetch  | │   Code context:                                                                                                       │
periodicals-fetch  | │   659               }                                                                                                 │
periodicals-fetch  | │   660               # Update context settings with text mode settings                                                 │
periodicals-fetch  | │   661               context_settings.update(text_mode_settings)                                                       │
periodicals-fetch  | │   662                                                                                                                 │
periodicals-fetch  | │   663           # Create and return the context with all settings                                                     │
periodicals-fetch  | │   664 →         context = await self.browser.new_context(**context_settings)                                          │
periodicals-fetch  | │   665                                                                                                                 │
periodicals-fetch  | │   666           # Apply text mode settings if enabled                                                                 │
periodicals-fetch  | │   667           if self.config.text_mode:                                                                             │
periodicals-fetch  | │   668               # Create and apply route patterns for each extension                                              │
periodicals-fetch  | │   669               for ext in blocked_extensions:                                                                    │

crawl4ai_trace.txt

betterthanever2 avatar Feb 10 '25 17:02 betterthanever2

@betterthanever2 Can you check whether there's a browser installed and accessible to Crawl4AI in your Docker container? That's where this error occurs. It means we were not able to create a browser context with the specified configuration.
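
One way to verify this independently of crawl4ai is to try launching a browser with Playwright directly inside the container. A minimal sketch (assuming Playwright's Chromium build is the browser crawl4ai ends up using):

import asyncio
from playwright.async_api import async_playwright

async def main() -> None:
    # If this fails, the problem is the browser install, not crawl4ai itself.
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://example.com")
        print("Browser OK, page title:", await page.title())
        await browser.close()

asyncio.run(main())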

Are you using this Dockerfile, or did you create one yourself to build the container image?

aravindkarnam avatar Feb 11 '25 05:02 aravindkarnam

OK, thanks for the clue. It is quite possible that something is going wrong with the browser install.

I'm not installing the browser in the Dockerfile; instead, I have this command in my compose file: `sh -c "playwright install && playwright install chrome && openvpn --config /etc/openvpn/config.ovpn --auth-user-pass /etc/openvpn/a.txt & infisical run --projectId ${INFISICAL_PROJECT} -- python periodicals/core.py"`. This is done to avoid adding ~1.2 GB of data to the image at build time. (I also have this data mounted as a volume on the host system to avoid downloading it every time.)

For reference, here's my Dockerfile:

# syntax=docker/dockerfile:1

ARG PYTHON_VERSION=3.12

FROM python:${PYTHON_VERSION}-slim-bullseye AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    bash \
    curl \
    tar \
    coreutils \
    postgresql-client \
    openvpn \
    libwoff1 \
    libopus0 \
    # libwebp7 \
    libwebpdemux2 \
    libenchant-2-2 \
    libgudev-1.0-0 \
    libsecret-1-0 \
    libhyphen0 \
    libgdk-pixbuf2.0-0 \
    libegl1 \
    libnotify4 \
    libxslt1.1 \
    libevent-2.1-7 \
    libgles2 \
    libxcomposite1 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libepoxy0 \
    libgtk-3-0 \
    libharfbuzz-icu0 \
    libgstreamer-gl1.0-0 \
    libgstreamer-plugins-bad1.0-0 \
    gstreamer1.0-plugins-good \
    gstreamer1.0-plugins-bad \
    libxt6 \
    libxaw7 \
    xvfb \
    fonts-noto-color-emoji \
    libfontconfig \
    libfreetype6 \
    xfonts-cyrillic \
    xfonts-scalable \
    fonts-liberation \
    fonts-ipafont-gothic \
    fonts-wqy-zenhei \
    fonts-tlwg-loma-otf \
    fonts-freefont-ttf \
    && curl -1sLf 'https://dl.cloudsmith.io/public/infisical/infisical-cli/setup.deb.sh' | bash \
    && apt-get update && apt-get install -y infisical \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN python -m pip install --no-cache-dir --upgrade pip

FROM builder AS final

ARG wheel=periodicals-0.9.11-py3-none-any.whl

COPY ./dist/$wheel .
RUN pip install --no-cache-dir --upgrade $wheel

WORKDIR /usr/local/lib/python3.12/site-packages/wapaganda

Can you advise on the proper arrangement here?

betterthanever2 avatar Feb 11 '25 07:02 betterthanever2

Hey, @aravindkarnam is there anything you can advise here?

betterthanever2 avatar Feb 14 '25 19:02 betterthanever2

I'm having the same issue in my project, and I am running it locally, not in a Docker container. I have tried running `python -m playwright install --with-deps chromium`, but it didn't help.

bujarinnovationnorway avatar Feb 21 '25 06:02 bujarinnovationnorway

Changing from this:

crawler = AsyncWebCrawler(config=browser_cfg)
crawl_result = await crawler.arun(url=url, config=crawl_config)

to this helped in my case:

async with AsyncWebCrawler() as crawler:
    crawl_result = await crawler.arun(url=url, config=crawl_config)
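
If you need to keep the crawler instance around instead of using the context manager, the explicit lifecycle below should be equivalent (a sketch; it assumes a crawl4ai version that exposes `start()`/`close()`, and `browser_cfg`, `crawl_config`, and `url` are your own objects):

# The key point: the browser must be started before arun()/arun_many(),
# otherwise self.browser is still None and new_context() fails.
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
try:
    crawl_result = await crawler.arun(url=url, config=crawl_config)
finally:
    await crawler.close()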

bujarinnovationnorway avatar Feb 21 '25 07:02 bujarinnovationnorway