jetnet
Is it possible to configure a log file per crawler, as it worked in v.2? I tried the following config, but `sd:type` does not get resolved. Thanks ``` %d{HH:mm:ss.SSS} [%t] %-5level...
We need to crawl many Internet sites and encountered an issue with the `www` prefix: some sites redirect to their domains without `www`, others the other way round. Unfortunately, such a case cannot...
## Describe the bug **Command Name** `az aks command invoke -n $AKS_NAME -c "kubectl cluster-info"` **Errors:** ``` (KubernetesOperationError) Failed to run command due to cluster perf issue, container command-0be71db980254f398cdecce07419fbed in...
Hello Pascal, I'd like to use several methods (e.g. `csv` and `regex`) in the `KeepOnlyTagger`, but it seems only one `fieldMatcher` is allowed: ```xml crawl_date,type,content,collector.depth,document.language (thumbnailImage|imagePHash).* ``` Error: ``` 1...
Hello Pascal, one quick question: do you plan to develop an external API tagger / transformer? Similar to the existing ones, but instead of starting an executable, calling an external...
Hello Pascal, is it possible to configure the Document Parser to apply the OCR processing for images from a given size / dimension? There are some metadata that could be...
It would be great if there were examples in the documentation of how to set various web-driver capabilities, in particular: * proxy with and without auth * user agent *...
### What is the issue? * Podman container start: ```bash podman run -d \ --device nvidia.com/gpu=all \ --memory=100g \ -v ollama:$HOME/.ollama \ -v /local_path/ollama/models:/models \ -p 11434:11434 \ -e OLLAMA_MODELS=/models...
I'd like to disable Tesseract OCR for images. Norconex v.3.1 Rendered config for `documentParsedFactory`: ```xml deu,eng DISABLED_image/(jpeg|png|gif) ``` does not prevent `tesseract` from being started, e.g.: ``` └─java -Dlog4j2.configurationFile=file:/storage/norconex/etc/test/log4j2.xml -Xms4G -Xmx16G...
Requirements: - Download text content and images from the "main" site - Download images from the main site and from external sites, which are referenced on the "main" site Need...