
Finished crawling with no results

Open tobiasstrauss opened this issue 5 years ago • 13 comments

Mandatory

  • [x] I read the documentation (readme and wiki).
  • [x] I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.

Related issues:

  • add them here

Describe your question The given CLI example returns no pages from zeit.de. I have the same problem with other web pages. No error is thrown; the crawler just returns and claims to be finished. So the question is whether there is a way to approach the problem. I attached the log file. log.txt

Versions (please complete the following information):

  • OS: [e.g. MacOS 10.2] Ubuntu 18.04
  • Python Version [e.g. 3.6] 3.6
  • news-please Version [e.g. 1.2] 1.5.13

Intent (optional; we'll use this info to prioritize upcoming tasks to work on)

  • [x] personal
  • [ ] academic
  • [ ] business
  • [ ] other
  • Some information on your project:

I train language models for fine-tuning on other tasks like NER or text classification.

tobiasstrauss avatar Aug 26 '20 12:08 tobiasstrauss

Strange, especially that there's no error in the log! Does the extraction work for you when you use library mode instead of CLI mode (see readme.md)?

fhamborg avatar Sep 05 '20 09:09 fhamborg

Actually not. The problem seems to be that one has to accept the advertisement pop-up first. The output was (translated from German): "zeit.de with advertising. Visit zeit.de as usual, with advertising and tracking. You can find details on tracking in the privacy policy and in the Privacy Center." :-/

tobiasstrauss avatar Sep 10 '20 19:09 tobiasstrauss

Did I understand you correctly that:

  1. when using library mode, e.g., from_url, you retrieve the above text? In which field of the NewsArticle object is it set, e.g., title, maintext, etc.?

  2. And, respectively, when using CLI mode, nothing is returned, not even an (empty) article object?

fhamborg avatar Sep 15 '20 11:09 fhamborg

I had this problem myself, I am pretty sure I had a configuration issue that was failing silently. I remade my configuration file basing it off of the examples and things seemed to start working. My assumption was some weird python tabs or spaces problem.

JermellB avatar Oct 12 '20 17:10 JermellB

Did I understand you correctly that:

  1. when using library mode, e.g., from_url, you retrieve the above text? In which field of the NewsArticle object is it set, e.g., title, maintext, etc.?
  2. And, respectively, when using CLI mode, nothing is returned, not even an (empty) article object?

Regarding 1: exactly! I only asked for maintext and title. Regarding 2: in CLI mode there is not even a folder for zeit.de. Meanwhile I have set up a new system with Ubuntu 20.04: same problem, also with a new configuration (I used the configuration given in the example). This is strange behavior, since other pages like faz.net seem to work perfectly. @fhamborg thanks for sharing this great tool. Although zeit.de is not working for me, I was able to crawl many other pages.

edit: my config file:

{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      # zeit.de has a blog which we do not want to crawl
      "url": "http://www.zeit.de",

      "overwrite_heuristics": {
        # because we do not want to crawl that blog, disable all downloads from
        # subdomains
        "is_not_from_subdomain": true
      },
      # Update the condition as well, all the other heuristics are enabled in
      # newscrawler.cfg
      "pass_heuristics_condition": "is_not_from_subdomain and og_type and self_linked_headlines and linked_headlines"
    }
  ]
}
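As a quick sanity check against silently failing configs (like the one @JermellB describes), the sitelist above can be validated before a crawl. news-please parses it as HJSON; the sketch below only strips the `#` line comments and parses the rest as plain JSON, which is a rough approximation, not how news-please itself reads the file:

```python
import json
import re


def strip_comments(text):
    # Remove full-line '#' comments so the HJSON-style sitelist parses as JSON.
    # (news-please uses a real HJSON parser; this is only a rough check.)
    return re.sub(r"^\s*#.*$", "", text, flags=re.MULTILINE)


config_text = '''
{
  "base_urls": [
    {
      # zeit.de has a blog which we do not want to crawl
      "url": "http://www.zeit.de",
      "overwrite_heuristics": {"is_not_from_subdomain": true},
      "pass_heuristics_condition": "is_not_from_subdomain and og_type"
    }
  ]
}
'''

config = json.loads(strip_comments(config_text))
print(config["base_urls"][0]["url"])  # http://www.zeit.de
```

If the parse fails here, a stray tab, comma, or comment in the real file is a likely cause of the crawler finishing with no results.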

tobiasstrauss avatar Oct 13 '20 15:10 tobiasstrauss

@tobiasstrauss I agree with you; the issue is the website's consent pop-up at https://www.zeit.de/zustimmung?url=https%3A%2F%2Fwww.zeit.de%2Findex

peterkabz avatar Mar 13 '21 03:03 peterkabz

Hey there @tobiasstrauss, you can bypass the issue by sending the appropriate cookie with the crawl request (a cookie named 'zonconsent'; you have to get the appropriate value by visiting the site manually once). I've been implementing a couple of changes, including this one, which I could push, though I'm not 100% sure whether there are any legal implications to programmatically bypassing such consent pop-ups. Is anyone more knowledgeable about the relevant legal issues?

woxxel avatar Jul 22 '21 14:07 woxxel

Hey @woxxel, I am currently experiencing the same issues as @tobiasstrauss. Could you share your approach on how to send the cookie with the crawl request? I tried to implement it myself but failed so far. Thanks a lot!

SamuelHelspr avatar Jan 05 '22 21:01 SamuelHelspr

@SamuelHelspr or @woxxel have either of you (or anyone reading) figured out how to send a cookie? I've been using the from_url function and it seems there's no option to pass it.

loughnane avatar Aug 08 '23 13:08 loughnane

If no one has figured this out in a week, ping me and I'll write a quick patch for you. I was doing some decently large-scale crawls with this, and to reach that scale this was something I had to do.


JermellB avatar Aug 08 '23 14:08 JermellB

Hey @JermellB, I'd gladly take you up on that patch.

loughnane avatar Aug 25 '23 00:08 loughnane

I just had the same experience. Interestingly, some sites (Guardian, FAZ) work fine even though there are ads in between.

But for Spiegel, the maintext is not returned at all for most content.

@JermellB any updates from you? Do you need help getting started on this patch?

BilalReffas avatar Sep 25 '23 16:09 BilalReffas