python-readability issues

summary() throws away lots of text at some websites

2

For example, at https://edition.cnn.com/2020/07/24/politics/donald-trump-coronavirus-briefing-jacksonville/index.html a lot of the text content is missing from the .summary() result.

yevgenpapernyk

Wrong length-check in summary when using xpath=True results in wrong summaries

When using xpath=True in summary to extract the main content, you get the wrong result for several webpages; for the others it is correct. The reason is that the length check...

yeus

Break Inline tags

I noticed that sequences like ` A B C ` are transformed into ` A B C ` instead of something like ` A B C `. This causes...

Amecom

Leave necessary images

2

Hi! Sorry for asking what is probably a simple question. Is it possible to use this readability library to simplify a webpage while keeping its pictures? Like, let's say, most Medium articles....

ozhyrenkov

Some tags create unnecessary paragraphs

1

On some websites (Phoronix, for example) some tags (notably `a` and `em`) are wrapped in their own unnecessary paragraph. This causes unnecessary line breaks, ultimately hurting the page's readability. Here's...

GabMus

Pass LXML object straight to readability?

2

As of now only strings containing HTML seem to be acceptable input. Is there a way to pass an object parsed by LXML or `lxml.html` (types: `etree._ElementTree` and `html.HtmlElement`) straight...

adbar

instagram and twitter embedded content filtered

It seems that embedded Instagram and Twitter content is filtered out. How can I keep it? (Example: https://www.businessinsider.com/elon-musk-tweets-bernie-sanders-meme-2020-2)

pierrehabte2017

Remove distracting and unnecessary tags

6

SVGs often render way too big on most websites (see e.g. [github](https://github.com/baskerville/plato/blob/e6d071a3258f2ef9eb38881ff5641b8782c3c30f/doc/LIBRARY.md) and the [mozilla docs](https://developer.mozilla.org/en-US/docs/Learn/HTML/Multimedia_and_embedding/Adding_vector_graphics_to_the_Web), see also the screenshot below), providing quite the distraction. Moreover, they are generally non-informative...

rien333

significant portion of content missed by readability

1

We are processing the text from https://www.fiolinjurylaw.com/ using readability, and much of the content is missing. I've attached the readability output as generated by: $ python -m readability.readability -u...

robh71

Splitting the text in scoring

4

In the score_paragraphs method, the content score is calculated like this: **content_score += len(inner_text.split(','))**. But I think it should be like below, because there may be no comma in a text....

haziyevv
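
To make the quoted expression concrete, here is a minimal sketch of how `len(inner_text.split(','))` behaves on a few inputs, including text with no comma at all. The helper name `comma_score` and the sample strings are hypothetical, chosen only for illustration; this is not the library's scoring code, and it does not reproduce the change proposed in the truncated issue text.

```python
def comma_score(inner_text: str) -> int:
    """Return the number of comma-separated segments in inner_text.

    Hypothetical helper: it isolates the expression quoted in the issue,
    len(inner_text.split(',')), so its behavior can be inspected directly.
    """
    return len(inner_text.split(','))


# Hypothetical sample inputs, not taken from any real page.
samples = ["", "no commas here", "one, two", "a, b, c, d"]

for text in samples:
    # str.split(',') always returns at least one element, so text
    # without any comma (even an empty string) yields a count of 1.
    print(repr(text), "->", comma_score(text))
```

Because `str.split(',')` never returns an empty list, the expression yields one more than the number of commas, so a comma-free paragraph still adds 1 rather than 0.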
