[Bug]: 'NoneType' object has no attribute 'new_context'
crawl4ai version
0.4.248
Expected Behavior
No errors, content is collected.
Current Behavior
Every crawl attempt fails with 'NoneType' object has no attribute 'new_context'
Is this reproducible?
Yes
Inputs Causing the Bug
I get the error on a variety of URLs (normally supplied in batches of 5). Some examples:
- https://comitet.su/item/prosto-kinuli.html
- https://comitet.su/item/v-berdyanske-prozvuchal-novyj-vzryv.html
- https://rusvesna.su/news/1664030090
- https://antifashist.com/item/postanovki-sbu-v-stile-95-kvartala-deputata-kunickogo-obvinyayut-v-izbienii-cheloveka.html
Steps to Reproduce
- It's a Python app packaged as a Docker container; `crawl4ai` is used as a library.
- A VPN is used inside the container to ensure access to the required resources.
Code snippets
```python
class WebpageFetcher:
    ...
    @abstractmethod
    async def fetch_webpages(self):
        ...

    @abstractmethod
    async def parse_webpage(self, content: Any):
        ...

    async def parse_web_result(self, page: Webpage, result: CrawlResult) -> Webpage:
        """
        Parse the result of a webpage crawl and update the Webpage object accordingly.

        Args:
            page (Webpage): The Webpage object to update.
            result (CrawlResult): The result of the webpage crawl.

        Returns:
            Webpage: The updated Webpage object.

        Raises:
            Exception: If there was an error fetching the webpage.
        """
        if result.status_code == 200:
            parsed_page = await self.parse_webpage(page, result)
            return parsed_page
        elif result.status_code == 404:
            page.is_removed = True
            return page
        else:
            raise Exception(f"Error fetching webpage: {result.status_code}:: {result.error_message}")


class WebpageCrawler(WebpageFetcher):
    crawler: AsyncWebCrawler
    extraction_strategy: ExtractionStrategy | None = None
    crawl_dispatcher: MemoryAdaptiveDispatcher

    @abstractmethod
    async def parse_web_result(self, page: Webpage, result: CrawlResult) -> Webpage:
        ...

    def generate_config(self) -> CrawlerRunConfig:
        """
        Generate a configuration for the webpage crawler.
        This method can be redefined in subclasses to add additional options.

        Returns:
            CrawlerRunConfig: The generated crawler configuration.
        """
        return CrawlerRunConfig(cache_mode=CacheMode.BYPASS,
                                magic=True,
                                delay_before_return_html=2,
                                remove_overlay_elements=True,
                                stream=False,
                                remove_forms=True,
                                page_timeout=6000)  # 6 seconds

    async def fetch_webpages(self) -> list[Webpage]:
        """
        Fetch multiple webpages concurrently using the configured crawler.
        The fetched webpages are parsed and updated.

        Returns:
            list[Webpage]: The list of updated Webpage objects.

        Raises:
            Exception: If there was an error fetching or parsing the webpages.
        """
        urls = [page.url for page in self.webpages]
        try:
            results = await self.crawler.arun_many(urls=urls,
                                                   extraction_strategy=self.extraction_strategy,
                                                   config=self.generate_config(),
                                                   dispatcher=self.crawl_dispatcher)
        except Exception as e:
            raise e
        matched_pages = [(page, result) for result in results
                         for page in self.webpages if page.url == result.url]
        tasks = [self.parse_web_result(page, result) for page, result in matched_pages]
        try:
            updated_pages = await asyncio.gather(*tasks)
            return updated_pages
        except Exception as e:
            raise e

    async def parse_webpage(self, page: Webpage, content: CrawlResult) -> Webpage:
        """
        Parse the content of a crawled webpage and update the Webpage object.

        Args:
            page (Webpage): The Webpage object to update.
            content (CrawlResult): The result of the webpage crawl.

        Returns:
            Webpage: The updated Webpage object.

        Raises:
            Exception: If there was an error parsing the extracted content or accommodating the pieces.
        """
        page_content = content.fit_markdown or content.markdown_v2.fit_markdown or content.markdown
        metadata = content.metadata
        media = content.media
        links = content.links
        page.content = page_content
        page.page_metadata = metadata
        page.media = media
        page.links = links
        ## clean text for NER
        page.content_ner = self.clean_text(page_content)
        try:
            pieces = json.loads(content.extracted_content)
        except Exception as e:
            raise e
        try:
            page = await self.accomodate_pieces(page, pieces)
        except Exception as e:
            raise e
        return page

    @abstractmethod
    async def accomodate_pieces(self, page: Webpage, pieces: dict) -> Webpage:
        """
        Accommodate the extracted pieces of data into the Webpage object.
        This method should be implemented by subclasses to handle the specific logic
        for storing the extracted data in the appropriate fields of the Webpage object.

        Args:
            pieces (dict): The extracted pieces of data, as a dictionary.

        Raises:
            Exception: If there was an error accommodating the extracted data.
        """
        ...

    @abstractmethod
    def define_strategy(self):
        """
        Define the extraction strategy for the webpage crawler.
        This method should be implemented by subclasses to specify the schema and rules
        for extracting data from the crawled webpages.

        Returns:
            The defined extraction strategy, in a format specific to the subclass.
        """
        ...
```
(There are then higher-level classes that inherit from WebpageCrawler and implement the `define_strategy` and `accomodate_pieces` methods specific to each target.)
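For illustration, a stripped-down subclass might look roughly like this. The schema, selectors, and field names below are made up for the example, not taken from the real targets; `Webpage` and `WebpageCrawler` are the classes from the snippet above:

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


class ExampleArticleCrawler(WebpageCrawler):
    """Hypothetical target-specific crawler, shown only to illustrate the layering."""

    def define_strategy(self) -> JsonCssExtractionStrategy:
        # Made-up schema; real subclasses carry selectors tailored to each site.
        schema = {
            "name": "article",
            "baseSelector": "article",
            "fields": [
                {"name": "title", "selector": "h1", "type": "text"},
                {"name": "body", "selector": ".article-text", "type": "text"},
            ],
        }
        return JsonCssExtractionStrategy(schema)

    async def accomodate_pieces(self, page: Webpage, pieces: dict) -> Webpage:
        # Made-up mapping of the extracted pieces onto the Webpage model.
        page.title = pieces.get("title")
        page.content = pieces.get("body") or page.content
        return page
```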
OS
Linux Ubuntu 22
Python version
3.13.1
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
```
periodicals-fetch | [ERROR]... × https://comitet.su/item/daniya-bez-reform-na-ukrai... | Error:
periodicals-fetch | │ × Unexpected error in _crawl_web at line 664 in create_browser_context (../crawl4ai/async_crawler_strategy.py): │
periodicals-fetch | │   Error: 'NoneType' object has no attribute 'new_context'                                                      │
periodicals-fetch | │                                                                                                                │
periodicals-fetch | │   Code context:                                                                                                │
periodicals-fetch | │   659               }                                                                                          │
periodicals-fetch | │   660           # Update context settings with text mode settings                                              │
periodicals-fetch | │   661           context_settings.update(text_mode_settings)                                                    │
periodicals-fetch | │   662                                                                                                          │
periodicals-fetch | │   663           # Create and return the context with all settings                                              │
periodicals-fetch | │   664 →         context = await self.browser.new_context(**context_settings)                                   │
periodicals-fetch | │   665                                                                                                          │
periodicals-fetch | │   666           # Apply text mode settings if enabled                                                          │
periodicals-fetch | │   667           if self.config.text_mode:                                                                      │
periodicals-fetch | │   668               # Create and apply route patterns for each extension                                        │
periodicals-fetch | │   669               for ext in blocked_extensions:                                                             │
```
@betterthanever2 Can you check whether there's a browser installed and accessible to Crawl4AI in your Docker container? That's where this error occurs: it means we were not able to create a browser context with the specified configuration.
Are you using this Dockerfile, or did you create one yourself to build the container image?
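A quick way to verify, independent of Crawl4AI, is a bare Playwright launch inside the container. This is just a sanity-check sketch, not Crawl4AI code; if it fails, `new_context()` will fail too:

```python
# Minimal check: can Playwright find and launch Chromium in this environment?
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        print("OK:", await page.title())
        await browser.close()


asyncio.run(main())
```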
Ok, thanks for the clue. It's quite possible that something goes wrong with the browser install.
I'm not installing the browser in the Dockerfile; instead, I have this command in my compose file: `sh -c "playwright install && playwright install chrome && openvpn --config /etc/openvpn/config.ovpn --auth-user-pass /etc/openvpn/a.txt & infisical run --projectId ${INFISICAL_PROJECT} -- python periodicals/core.py"`.
This is done to avoid adding ~1.2 GB of data to the image at build time. (I also have this data mounted as a volume on the host system to avoid downloading it every time.)
For reference, here's my Dockerfile:
```dockerfile
# syntax=docker/dockerfile:1
ARG PYTHON_VERSION=3.12

FROM python:${PYTHON_VERSION}-slim-bullseye AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    bash \
    curl \
    tar \
    coreutils \
    postgresql-client \
    openvpn \
    libwoff1 \
    libopus0 \
    # libwebp7 \
    libwebpdemux2 \
    libenchant-2-2 \
    libgudev-1.0-0 \
    libsecret-1-0 \
    libhyphen0 \
    libgdk-pixbuf2.0-0 \
    libegl1 \
    libnotify4 \
    libxslt1.1 \
    libevent-2.1-7 \
    libgles2 \
    libxcomposite1 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libepoxy0 \
    libgtk-3-0 \
    libharfbuzz-icu0 \
    libgstreamer-gl1.0-0 \
    libgstreamer-plugins-bad1.0-0 \
    gstreamer1.0-plugins-good \
    gstreamer1.0-plugins-bad \
    libxt6 \
    libxaw7 \
    xvfb \
    fonts-noto-color-emoji \
    libfontconfig \
    libfreetype6 \
    xfonts-cyrillic \
    xfonts-scalable \
    fonts-liberation \
    fonts-ipafont-gothic \
    fonts-wqy-zenhei \
    fonts-tlwg-loma-otf \
    fonts-freefont-ttf \
    && curl -1sLf 'https://dl.cloudsmith.io/public/infisical/infisical-cli/setup.deb.sh' | bash \
    && apt-get update && apt-get install -y infisical \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN python -m pip install --no-cache-dir --upgrade pip

FROM builder AS final

ARG wheel=periodicals-0.9.11-py3-none-any.whl
COPY ./dist/$wheel .
RUN pip install --no-cache-dir --upgrade $wheel

WORKDIR /usr/local/lib/python3.12/site-packages/wapaganda
```
Can you advise on the proper arrangement here?
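For what it's worth, a sanity check along these lines could confirm whether the runtime install actually lands where Playwright expects it. The default cache path here is an assumption based on Playwright's documented behaviour on Linux, not something crawl4ai requires:

```python
import os
from pathlib import Path

# Playwright looks for browsers under PLAYWRIGHT_BROWSERS_PATH if it is set,
# otherwise under ~/.cache/ms-playwright on Linux (Playwright's default).
browsers_dir = Path(
    os.environ.get("PLAYWRIGHT_BROWSERS_PATH", Path.home() / ".cache" / "ms-playwright")
)

if not browsers_dir.exists() or not any(browsers_dir.glob("chromium-*")):
    raise RuntimeError(
        f"No Chromium found under {browsers_dir}; "
        "run `playwright install chromium` before starting the crawler."
    )
```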
Hey @aravindkarnam, is there anything you can advise here?
Having the same issue in my project, and I am running it locally, not in a Docker container. I have tried running `python -m playwright install --with-deps chromium`, but it didn't help.
Changing from this:

```python
crawler = AsyncWebCrawler(config=browser_cfg)
crawl_result = await crawler.arun(url=url, config=crawl_config)
```

to this helped in my case:

```python
async with AsyncWebCrawler() as crawler:
    crawl_result = await crawler.arun(url=url, config=crawl_config)
```
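That fits the traceback: the context manager starts the browser on entry, so `self.browser` is no longer `None` when `new_context()` is called; constructing `AsyncWebCrawler(...)` alone does not launch anything. For a crawler that has to live on a class attribute (as in the batch setup above), a rough sketch of both options, assuming the `start()`/`close()` pair that recent crawl4ai versions document:

```python
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig


async def fetch(urls: list[str]) -> list:
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Option 1: the context manager starts and stops the browser around the batch.
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun_many(urls=urls, config=config)


async def fetch_long_lived(urls: list[str]) -> list:
    # Option 2 (assumed API): explicit lifecycle for a crawler that is kept around.
    crawler = AsyncWebCrawler()
    await crawler.start()  # launches the browser so new_context() has one to use
    try:
        return await crawler.arun_many(
            urls=urls, config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
    finally:
        await crawler.close()
```

Either way, the key point is that the browser has to be started (via the context manager or an explicit start) before any `arun`/`arun_many` call.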