python-readability

isProbablyReaderable

Open Uzay-G opened this issue 3 years ago • 3 comments

How difficult would it be to implement isProbablyReaderable(doc, options) (from https://github.com/mozilla/readability#isprobablyreaderabledocument-options)?

This would make it possible to check whether a webpage is actually interesting / relevant before scraping it, and so save time.

Would this be hard to implement? I could also try working on it.

Uzay-G, May 25 '22 13:05

It's not difficult to implement that way, but I'm afraid you won't get any big improvement in parsing time (typical article processing time is currently 0.1-0.4 s per page), nor would it be reliable. To be more precise:

  • If you use the minScore check, the readability algorithm is exactly the same, just without the cleaning phase, so it will take almost the same time.
  • If you only check the raw HTML, the result is completely unreliable.

buriy, May 27 '22 05:05

Oh, I see. How could I use readability to check whether a webpage actually has interesting content?

For example, an actual article should pass this check, while something like the Google homepage shouldn't.

Uzay-G, May 28 '22 15:05

The main check should be whether there's something to read: at least 300 characters of text, ideally 500+. You can check this after processing with readability: just convert the result to text and check its length.

buriy, May 29 '22 05:05
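
For anyone landing here, the length check suggested in the last comment could look roughly like the sketch below. It is only an illustration, not part of python-readability: the names `html_to_text`, `has_enough_text` and `looks_like_article` are made up for this example, the text conversion uses the standard library's `html.parser` (any HTML-to-text method would do), and the 300-character default is simply the lower bound mentioned in the thread. The only library call assumed is `Document(html).summary()` from readability-lxml.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace so markup indentation does not inflate the length.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(fragment):
    """Hypothetical helper: strip tags from an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.text()


def has_enough_text(summary_html, min_chars=300):
    """Hypothetical helper: True if the extracted text has at least min_chars characters."""
    return len(html_to_text(summary_html)) >= min_chars


def looks_like_article(page_html, min_chars=300):
    """Hypothetical helper: run readability first, then apply the length check."""
    # Third-party dependency (pip install readability-lxml), imported lazily
    # so the helpers above can be used without it.
    from readability import Document

    summary_html = Document(page_html).summary()
    return has_enough_text(summary_html, min_chars)
```

Calling `looks_like_article(response_text)` on a fetched page still runs the full readability pass, so as noted above it filters out pages with little readable text but does not save processing time. Raising `min_chars` to 500 gives the stricter variant.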