[Bug]: wrong permissions on the .cache folder in docker image
Hello, I'm trying to use the link scoring feature with the config below, but I'm getting the following error. crawl4ai is running in Docker. Any idea what is wrong?
```
[LINK_EXTRACT] ℹ Starting link head extraction for 3 internal and 0 external links
[LINK_EXTRACT] ℹ Error during link head extraction: [Errno 13] Permission denied: '/home/appuser/.cache/url_seeder'
```
Config:

```
link_preview_config: {
  type: 'LinkPreviewConfig',
  params: {
    verbose: true,
    include_internal: true,
    include_external: false,
    max_links: 10,
    concurrency: 5,
    timeout: 10,
    query: 'foo bar',
    score_threshold: 0.3,
  },
},
```
It seems there is a permission issue inside the container:
```
appuser@fae68547e9e3:/app$ ls -ld /home/appuser /home/appuser/.cache
drwxr-xr-x 1 appuser appuser 4096 Nov 24 07:01 /home/appuser
drwxr-xr-x 3 root root 4096 Nov 14 10:01 /home/appuser/.cache
```
The .cache folder is owned by root.
Workaround:

```shell
# open a root shell in the container
docker exec -it -u 0 crawl4ai bash
# change ownership and permissions
chown -R appuser:appuser /home/appuser/.cache
chmod -R 700 /home/appuser/.cache
```
But the .cache folder should belong to appuser, so I think this should be fixed in the crawl4ai Docker image itself.
crawl4ai version - 0.7.7
User comment (https://discord.com/channels/1278297938551902308/1278298697540567132/1442493367035367566): I also think there might be more issues, perhaps related to this. After I fixed the cache folder permissions with the workaround, I now get scores, but no matter what query I used in the LinkPreviewConfig, it kept returning the same score...
Root Cause:
- During Docker build, crawl4ai-setup and other commands run as root
- When AsyncUrlSeeder is initialized, it creates ~/.cache/url_seeder directory
- This creates the directory with root ownership
- Later, when the app runs as appuser, it can't write to this directory
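Given that root cause, one possible fix at the image level is a sketch like the following; the paths and user name are taken from this report, and the actual crawl4ai Dockerfile layout may differ:

```dockerfile
# Sketch of a Dockerfile-level fix (hypothetical; adapt to the real Dockerfile).
# After running setup steps as root, hand the cache directory back to appuser:
RUN mkdir -p /home/appuser/.cache \
    && chown -R appuser:appuser /home/appuser/.cache

# Alternatively, run the setup step as appuser in the first place so the
# cache directory is created with the right owner:
# USER appuser
# RUN crawl4ai-setup
```

Either variant would keep `~/.cache/url_seeder` writable once the container drops to appuser at runtime.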
@faileon could you please provide your code? It would help me with debugging.
Sure, here is an example request body I am calling via the REST API in Docker:

```json
{
  "crawler_config": {
    "css_selector": "#main",
    "excluded_selector": ".footer, #bar",
    "remove_forms": true,
    "markdown_generator": {
      "type": "DefaultMarkdownGenerator",
      "params": {
        "content_filter": {
          "type": "PruningContentFilter",
          "params": {
            "min_word_threshold": 0,
            "threshold": 0.1
          }
        }
      }
    },
    "link_preview_config": {
      "type": "LinkPreviewConfig",
      "params": {
        "max_links": 10,
        "concurrency": 5,
        "timeout": 10,
        "query": "Foo bar",
        "score_threshold": 0.3
      }
    },
    "score_links": true,
    "stream": false
  },
  "urls": ["https://example.com"]
}
```
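For anyone reproducing this from a script, the same body can be built and sent from Python. This is only a sketch: the `/crawl` endpoint and port 11235 are the crawl4ai Docker defaults as far as I know, so adjust them to your setup (the payload is trimmed to the fields relevant to link scoring):

```python
import json

# Request body equivalent to the JSON above (trimmed to the
# link-scoring-relevant fields).
payload = {
    "crawler_config": {
        "css_selector": "#main",
        "link_preview_config": {
            "type": "LinkPreviewConfig",
            "params": {
                "max_links": 10,
                "concurrency": 5,
                "timeout": 10,
                "query": "Foo bar",
                "score_threshold": 0.3,
            },
        },
        "score_links": True,
        "stream": False,
    },
    "urls": ["https://example.com"],
}

body = json.dumps(payload)

# To actually send it (requires the `requests` package and a running
# container; host/port are assumptions, not confirmed defaults for your setup):
# import requests
# resp = requests.post("http://localhost:11235/crawl", json=payload, timeout=60)
# print(resp.json())
```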