Using AdBlock rules to remove elements
AdBlock Plus element hiding rules identify elements to remove and are written as CSS selectors. This is easy to implement in lxml, if somewhat slow.
I'm using this in my own code to automatically remove social media share links from pages. You may want to consider including something similar in python-readability.
EasyList is dual-licensed under Creative Commons Attribution-ShareAlike 3.0 Unported and the GNU General Public License version 3. CC-BY-SA looks compatible with Apache-licensed projects.
Example
First download the rules:
$ wget https://easylist-downloads.adblockplus.org/fanboy-annoyance.txt
Then you can simply extract the CSS selectors to match against a document tree.
from lxml import html
from lxml.cssselect import CSSSelector

RULES_PATH = 'fanboy-annoyance.txt'

with open(RULES_PATH, 'r') as f:
    lines = f.read().splitlines()

# get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

def remove_ads(tree):
    for rule in rules:
        for matched in rule(tree):
            matched.getparent().remove(matched)

doc = html.document_fromstring("<html>...</html>")
remove_ads(doc)
+1 This feature seems quite good, why not add it?
@eromoe That's an interesting question. What do you think, will it work fine if turned on by default?
Currently readability-lxml counts some tags such as images when calculating the best candidate. Removing some ad elements would improve the accuracy (I don't feel adblock has ever blocked article content or any important part of a site, at least so far). But it should not be turned on by default, I think :)
Thanks. Will add a switchable extension then. There have been several other requests for switchable features.
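To illustrate what a switchable pre-processing step might look like, here is a hypothetical sketch that wires the remove_ads helper from the example above in front of readability's Document. The strip_ads flag and clean_html wrapper are made up for illustration only, not the actual extension:

import lxml.html
from readability import Document

def clean_html(raw_html, strip_ads=False):
    # strip_ads is a hypothetical switch: when enabled, apply the AdBlock
    # element hiding rules before handing the markup to readability.
    if strip_ads:
        tree = lxml.html.document_fromstring(raw_html)
        remove_ads(tree)  # helper from the example above
        raw_html = lxml.html.tostring(tree, encoding='unicode')
    return Document(raw_html).summary()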
Just to give an update for anyone who is using the above method for removing ad elements: don't store the CSSSelector rules in a list
rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']
It consumes nearly 200-250 MB of RAM. Instead, you can build each selector conditionally inside the loop, which consumes only 5-10 MB (see the sketch after the statistics below).
Below are the memory usage statistics:
Line #    Mem usage    Increment   Line Contents
================================================
   279     79.8 MiB      0.0 MiB   @profile
   280     79.8 MiB      0.0 MiB   def test_fanboy_content(self):
   281     79.8 MiB      0.0 MiB       from lxml.cssselect import CSSSelector
   282     79.8 MiB      0.0 MiB       from project.settings import ADBLOCK_RULES_PATH, ALREADY_MADE_RULES
   283     79.8 MiB      0.0 MiB       RULES_PATH = ADBLOCK_RULES_PATH
   284
   287     79.8 MiB      0.0 MiB       with open(RULES_PATH, 'r') as f:
   288     81.0 MiB      1.2 MiB           lines = f.read().splitlines()
   289     81.0 MiB      0.0 MiB       f.close()
   290
   291                                 # get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
   292    282.0 MiB    201.0 MiB       rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']
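For reference, here is a minimal sketch of that lower-memory variant, building each CSSSelector on the fly inside the loop instead of keeping them all in a list. The file name and helper name follow the earlier example; this is only an illustration, not the exact code profiled above:

from lxml.cssselect import CSSSelector
import cssselect

RULES_PATH = 'fanboy-annoyance.txt'

with open(RULES_PATH, 'r') as f:
    lines = f.read().splitlines()

def remove_ads(tree):
    # Build each selector when it is needed instead of holding a list of
    # CSSSelector objects in memory; this trades CPU time for RAM.
    for line in lines:
        if line[:2] == '##':
            try:
                selector = CSSSelector(line[2:])
            except cssselect.SelectorError:
                continue  # skip selectors cssselect cannot parse
            for matched in selector(tree):
                matched.getparent().remove(matched)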
@azhard4int Yes, you can regenerate the CSSSelector objects each time to save memory, but you're trading memory usage for performance. It is very slow to recreate the CSSSelector objects every time you process a document.
Instead, how about we just extract the XPath query that cssselect generates for each rule and join them all together with the XPath | (or) operator? Storing a single large XPath query string isn't nearly as bad as storing a list of CSSSelector objects. Then we can check in one pass whether any rule from the entire list matched, and delete the matched elements.
Here's a new implementation that uses that approach. I only did a little testing of it, but it seems to work fine.
import cssselect


class AdRemover(object):
    """
    This class applies elemhide rules from AdBlock Plus to an lxml
    document or element object. One or more AdBlock Plus filter
    subscription files must be provided.

    Example usage:

    >>> import lxml.html
    >>> remover = AdRemover('fanboy-annoyance.txt')
    >>> doc = lxml.html.document_fromstring("<html>...</html>")
    >>> remover.remove_ads(doc)
    """

    def __init__(self, *rules_files):
        if not rules_files:
            raise ValueError("one or more rules_files required")

        translator = cssselect.HTMLTranslator()
        rules = []

        for rules_file in rules_files:
            with open(rules_file, 'r') as f:
                for line in f:
                    # elemhide rules are prefixed by ## in the adblock filter syntax
                    if line[:2] == '##':
                        try:
                            rules.append(translator.css_to_xpath(line[2:]))
                        except cssselect.SelectorError:
                            # just skip bad selectors
                            pass

        # create one large query by joining them with the xpath | (or) operator
        self.xpath_query = '|'.join(rules)

    def remove_ads(self, tree):
        """Remove ads from an lxml document or element object.

        The object passed to this method will be modified in place."""
        for elem in tree.xpath(self.xpath_query):
            elem.getparent().remove(elem)
@bburky Interesting, were you able to check the processing time in both cases? In our scenario, the adblock rules were taking around 3-7 seconds to process a document, though that will vary with document size.
I was testing it on a completely empty document. The slowness I saw was entirely from creating CSSSelectors. I also noticed the memory usage you mentioned, but it was only about +50 MB I think, not 200 MB.
I can look at it tomorrow. I may have introduced another change by accident that caused all the slowdown.
I think there was actually a small speed-up from using a single merged XPath query. It's a really nice approach regardless.
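For anyone who wants to measure this on their own documents, a rough timing harness along these lines should work. It assumes the remove_ads function from the first example and the AdRemover class above are both defined, and 'page.html' is just a placeholder for a real saved page:

import timeit
import lxml.html

with open('page.html', 'r') as f:
    raw_html = f.read()

def time_approach(func, repeat=5):
    # Re-parse the document for every run so both approaches start from a fresh tree.
    def run():
        tree = lxml.html.document_fromstring(raw_html)
        func(tree)
    return min(timeit.repeat(run, number=1, repeat=repeat))

remover = AdRemover('fanboy-annoyance.txt')
print('per-rule CSSSelector loop: %.3fs' % time_approach(remove_ads))
print('single merged xpath query: %.3fs' % time_approach(remover.remove_ads))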
@bburky I had a few more sets of rules added, so combined it was using around 200 MB.
By the way, did you try running this stack (readability + custom rules like the adblocker) at a larger scale, processing more than 50,000-100,000 documents on a daily basis?
@azhard4int No. I really only used this once for downloading some blogs to create ebooks for personal reading. I wanted to get rid of all the social media buttons in the text.
It looks like you've integrated it into a large project though. I hope you found this useful.
I also have similar problems when running this code. Is there a newer version?