
[Bug]: wrong permissions on the .cache folder in docker image

Open faileon opened this issue 2 months ago • 4 comments

Hello, I'm trying to use the link scoring feature with the config below, but I'm getting the error shown. crawl4ai is running in Docker. Any idea what is wrong?

[LINK_EXTRACT] ℹ Starting link head extraction for 3 internal and 0 external links
[LINK_EXTRACT] ℹ Error during link head extraction: [Errno 13] Permission denied: '/home/appuser/.cache/url_seeder'

Config:

                    link_preview_config: {
                        type: 'LinkPreviewConfig',
                        params: {
                            verbose: true,
                            include_internal: true,
                            include_external: false,
                            max_links: 10,
                            concurrency: 5,
                            timeout: 10,
                            query: 'foo bar',
                            score_threshold: 0.3,
                        },
                    },

It seems there is a permission issue inside the container:

appuser@fae68547e9e3:/app$ ls -ld /home/appuser /home/appuser/.cache
drwxr-xr-x 1 appuser appuser 4096 Nov 24 07:01 /home/appuser
drwxr-xr-x 3 root    root    4096 Nov 14 10:01 /home/appuser/.cache

The .cache folder belongs to the root user.

Workaround:

# open a root shell in the container
docker exec -it -u 0 crawl4ai bash

#  change permissions
chown -R appuser:appuser /home/appuser/.cache
chmod -R 700 /home/appuser/.cache

But the .cache folder should belong to appuser, so I think this should be fixed in the crawl4ai Docker image itself.

crawl4ai version: 0.7.7

faileon avatar Nov 24 '25 12:11 faileon

User comment (https://discord.com/channels/1278297938551902308/1278298697540567132/1442493367035367566): also, I think there might be more issues, perhaps related to this. When I fixed the caching folder permissions with the workaround, I now get the scores, but no matter what query I used in the LinkPreviewConfig, it kept returning the same score...

ntohidi avatar Nov 25 '25 09:11 ntohidi

Root Cause:

  • During Docker build, crawl4ai-setup and other commands run as root
  • When AsyncUrlSeeder is initialized, it creates ~/.cache/url_seeder directory
  • This creates the directory with root ownership
  • Later, when the app runs as appuser, it can't write to this directory

ntohidi avatar Nov 25 '25 10:11 ntohidi


@faileon could you please provide your code? It would help me with debugging.

ntohidi avatar Nov 25 '25 10:11 ntohidi

Sure, here is an example request body I am calling via the REST API in docker:

{
  "crawler_config": {
    "css_selector": "#main",
    "excluded_selector": ".footer, #bar",
    "remove_forms": true,
    "markdown_generator": {
      "type": "DefaultMarkdownGenerator",
      "params": {
        "content_filter": {
          "type": "PruningContentFilter",
          "params": {
            "min_word_threshold": 0,
            "threshold": 0.1
          }
        }
      }
    },
    "link_preview_config": {
      "type": "LinkPreviewConfig",
      "params": {
        "max_links": 10,
        "concurrency": 5,
        "timeout": 10,
        "query": "Foo bar",
        "score_threshold": 0.3
      }
    },
    "score_links": true,
    "stream": false
  },
  "urls": ["https://example.com"]
}

faileon avatar Nov 25 '25 10:11 faileon