incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

Results 89 incubator-stormcrawler issues

From a user `Links that were once pages and then turn to redirects are our issue. Our content management system auto creates clean URLs. If the title of the page...

core

Hi @jnioche, I was looking into https://github.com/DigitalPebble/storm-crawler/pull/989#discussion_r918581042 and reviewed the old code to make sure that I get the intended behaviour (see https://github.com/FelixEngl/storm-crawler/blob/834347e53f79376d3a79f125a6203c91d062e04f/external/elasticsearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java). Now I am wondering, shouldn't...

```
022-07-15 09:57:16.851 o.a.s.e.e.ReportError Thread-43-fetcher-executor[15, 15] [ERROR] Error
java.lang.RuntimeException: java.lang.RuntimeException: java.util.ConcurrentModificationException
    at org.apache.storm.utils.Utils$1.run(Utils.java:411) ~[storm-client-2.4.0.jar:2.4.0]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.RuntimeException: java.util.ConcurrentModificationException
    at org.apache.storm.executor.Executor.accept(Executor.java:301) ~[storm-client-2.4.0.jar:2.4.0]
    at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:113) ~[storm-client-2.4.0.jar:2.4.0]
    at org.apache.storm.utils.JCQueue.consume(JCQueue.java:89) ~[storm-client-2.4.0.jar:2.4.0]
...
```

bug

Maybe https://github.com/inoio/solrs would be useful?

enhancement
SOLR
help wanted

Just like it's done in ES, we could route the documents in the statusupdaterbolt based on the host / name or IP and in the spouts check that the number...

enhancement
SOLR
help wanted
good first issue

https://www.elastic.co/blog/aggregate-data-faster-with-new-the-random-sampler-aggregation

elasticsearch

The High Level Rest Client is deprecated in favor of the [Elasticsearch Java API Client](https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/current/introduction.html). This will affect SC if we want to upgrade Elasticsearch from 7.5.2 to 7.17.0. We...

elasticsearch

https://www.elastic.co/blog/whats-new-elasticsearch-kibana-cloud-8-1-0
For the _status_ index, only the _key_ and _nextFetchDate_ are frequently queried. Anything else (URL, metadata, status) is mostly used in Kibana (unless there is filtering on the source)....

enhancement
elasticsearch

Hello @jnioche, sorry for the first PR regarding the metadata; that PR was a mistake. I thought that a draft was only visible to me. I improved the class Metadata...