Using AdBlock rules to remove elements
AdBlock Plus element hiding rules identify elements to remove and are written as CSS selectors. This is easy to implement in lxml, if somewhat slow.
I'm using this in my own code to automatically remove social media share links from pages. You may want to consider including something similar in python-readability.
EasyList is dual-licensed under Creative Commons Attribution-ShareAlike 3.0 Unported and the GNU General Public License version 3. CC-BY-SA looks compatible with Apache-licensed projects.
Example
First download the rules:
$ wget https://easylist-downloads.adblockplus.org/fanboy-annoyance.txt
Then you can simply extract the CSS selectors to match against a document tree.
from lxml import html
from lxml.cssselect import CSSSelector

RULES_PATH = 'fanboy-annoyance.txt'

with open(RULES_PATH, 'r') as f:
    lines = f.read().splitlines()

# get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

def remove_ads(tree):
    for rule in rules:
        for matched in rule(tree):
            matched.getparent().remove(matched)

doc = html.document_fromstring("<html>...</html>")
remove_ads(doc)
+1 This feature seems quite good, why not add it?
@eromoe That's an interesting question. What do you think, will it work fine if turned on by default?
Currently readability-lxml counts some tags such as images when calculating the best candidate. Removing some ad elements would improve the accuracy (I don't feel adblock has ever blocked article content or any important part of a site, at least so far). But it should not be turned on by default, I think :)
Thanks. Will add a switchable extension then. There have been several other requests for switchable features.
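To illustrate what a switchable pre-processing step might look like, here is a hypothetical sketch that wires the remove_ads helper from the example above in front of readability's Document. The strip_ads flag and clean_html wrapper are made up for illustration only, not the actual extension:

import lxml.html
from readability import Document

def clean_html(raw_html, strip_ads=False):
    # strip_ads is a hypothetical switch: when enabled, apply the AdBlock
    # element hiding rules before handing the markup to readability.
    if strip_ads:
        tree = lxml.html.document_fromstring(raw_html)
        remove_ads(tree)  # helper from the example above
        raw_html = lxml.html.tostring(tree, encoding='unicode')
    return Document(raw_html).summary()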
Just to give an update for anyone who is using the above method for removing ad elements: don't store the CSSSelector rules in a list
rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']
It consumes nearly 200-250 MB of RAM. Instead, you can build each selector conditionally inside the loop, which consumes only 5-10 MB (see the sketch after the statistics below).
Below are the memory usage statistics:
Line #    Mem usage    Increment   Line Contents
================================================
   279     79.8 MiB      0.0 MiB   @profile
   280     79.8 MiB      0.0 MiB   def test_fanboy_content(self):
   281     79.8 MiB      0.0 MiB       from lxml.cssselect import CSSSelector
   282     79.8 MiB      0.0 MiB       from project.settings import ADBLOCK_RULES_PATH, ALREADY_MADE_RULES
   283     79.8 MiB      0.0 MiB       RULES_PATH = ADBLOCK_RULES_PATH
   284
   287     79.8 MiB      0.0 MiB       with open(RULES_PATH, 'r') as f:
   288     81.0 MiB      1.2 MiB           lines = f.read().splitlines()
   289     81.0 MiB      0.0 MiB       f.close()
   290
   291                                 # get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
   292    282.0 MiB    201.0 MiB       rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']
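For reference, here is a minimal sketch of that lower-memory variant, building each CSSSelector on the fly inside the loop instead of keeping them all in a list. The file name and helper name follow the earlier example; this is only an illustration, not the exact code profiled above:

from lxml.cssselect import CSSSelector
import cssselect

RULES_PATH = 'fanboy-annoyance.txt'

with open(RULES_PATH, 'r') as f:
    lines = f.read().splitlines()

def remove_ads(tree):
    # Build each selector when it is needed instead of holding a list of
    # CSSSelector objects in memory; this trades CPU time for RAM.
    for line in lines:
        if line[:2] == '##':
            try:
                selector = CSSSelector(line[2:])
            except cssselect.SelectorError:
                continue  # skip selectors cssselect cannot parse
            for matched in selector(tree):
                matched.getparent().remove(matched)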
@azhard4int Yes, you can regenerate the CSSSelector objects each time to save memory, but you're trading memory usage for performance. It is very slow to recreate the CSSSelector objects every time you process a document.
Instead, how about we just extract the XPath query that cssselect generates for each rule and join them all together with the XPath | (or) operator? Storing a single large XPath query string isn't nearly as bad as storing a list of CSSSelector objects. Then we can check in one pass whether any rule from the entire list matched, and delete the matched elements.
Here's a new implementation that uses that approach. I only did a little testing of it, but it seems to work fine.
import cssselect


class AdRemover(object):
    """
    This class applies elemhide rules from AdBlock Plus to an lxml
    document or element object. One or more AdBlock Plus filter
    subscription files must be provided.

    Example usage:

    >>> import lxml.html
    >>> remover = AdRemover('fanboy-annoyance.txt')
    >>> doc = lxml.html.document_fromstring("<html>...</html>")
    >>> remover.remove_ads(doc)
    """

    def __init__(self, *rules_files):
        if not rules_files:
            raise ValueError("one or more rules_files required")

        translator = cssselect.HTMLTranslator()
        rules = []

        for rules_file in rules_files:
            with open(rules_file, 'r') as f:
                for line in f:
                    # elemhide rules are prefixed by ## in the adblock filter syntax
                    if line[:2] == '##':
                        try:
                            rules.append(translator.css_to_xpath(line[2:]))
                        except cssselect.SelectorError:
                            # just skip bad selectors
                            pass

        # create one large query by joining them with the xpath | (or) operator
        self.xpath_query = '|'.join(rules)

    def remove_ads(self, tree):
        """Remove ads from an lxml document or element object.

        The object passed to this method will be modified in place."""
        for elem in tree.xpath(self.xpath_query):
            elem.getparent().remove(elem)
@bburky Interesting, were you able to check the processing time in both cases? In our scenario, the adblock rules were taking around 3-7 seconds to process a document, though that will vary with document size.
I was testing it on a completely empty document. The slowness I saw was entirely from creating CSSSelectors. I also noticed the memory usage you mentioned, but it was only about +50 MB I think, not 200 MB.
I can look at it tomorrow. I may have introduced another change by accident that caused all the slowdown.
I think there was actually a small speed-up from using a single merged XPath query. It's a really nice approach regardless.
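For anyone who wants to measure this on their own documents, a rough timing harness along these lines should work. It assumes the remove_ads function from the first example and the AdRemover class above are both defined, and 'page.html' is just a placeholder for a real saved page:

import timeit
import lxml.html

with open('page.html', 'r') as f:
    raw_html = f.read()

def time_approach(func, repeat=5):
    # Re-parse the document for every run so both approaches start from a fresh tree.
    def run():
        tree = lxml.html.document_fromstring(raw_html)
        func(tree)
    return min(timeit.repeat(run, number=1, repeat=repeat))

remover = AdRemover('fanboy-annoyance.txt')
print('per-rule CSSSelector loop: %.3fs' % time_approach(remove_ads))
print('single merged xpath query: %.3fs' % time_approach(remover.remove_ads))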
@bburky I had a few more sets of rules added, so combined it was using around 200 MB.
By the way, did you try running this stack (readability + custom rules like the adblocker) at a larger scale, processing more than 50,000-100,000 documents on a daily basis?
@azhard4int No. I really only used this once for downloading some blogs to create ebooks for personal reading. I wanted to get rid of all the social media buttons in the text.
It looks like you've integrated it into a large project though. I hope you found this useful.
I also have similar problems when running this code. Is there a newer version?