html-cluster
HTML Cluster
A command-line tool that clusters HTML pages by structural and style similarity. It is based on Page Compare.
Install
The quick way:
pip install html-cluster
How it works
- Download the HTML for a list of URLs.
html-cluster download-html urls.txt
- Create a similarity score file.
html-cluster make-score-similarity-file --structural-weight=0.3
- Create the Graphviz DOT file.
html-cluster make-graph similarity_score.json > graph.dot
- Render the image with Graphviz's neato.
neato -O -Tpng graph.dot
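To make the scoring and graph steps more concrete, here is a minimal Python sketch of the idea, not the tool's actual implementation. It assumes that `--structural-weight` is the share given to the structural score and that the remainder goes to the style score. The function names, the sample scores, the `threshold` cut-off and the edge-length formula are all hypothetical and chosen only for illustration.

```python
def combined_similarity(structural, style, structural_weight=0.3):
    """Blend two scores in [0, 1]; the weight is the share given to structure.

    Assumption: the remaining (1 - structural_weight) goes to the style score.
    """
    if not 0.0 <= structural_weight <= 1.0:
        raise ValueError("structural_weight must be between 0 and 1")
    return structural_weight * structural + (1.0 - structural_weight) * style


def to_dot(scores, threshold=0.5):
    """Render pairwise scores as an undirected Graphviz graph.

    `scores` maps (page_a, page_b) -> similarity. Pairs below the
    hypothetical `threshold` get no edge; more similar pages get a shorter
    preferred edge length (`len`), so neato tends to draw them closer.
    """
    lines = ["graph pages {"]
    nodes = sorted({page for pair in scores for page in pair})
    for node in nodes:
        lines.append(f'  "{node}";')
    for (a, b), score in sorted(scores.items()):
        if score >= threshold:
            length = 1.0 + (1.0 - score) * 4.0
            lines.append(f'  "{a}" -- "{b}" [len={length:.2f}];')
    lines.append("}")
    return "\n".join(lines)


# Made-up (structural, style) scores for three hypothetical pages.
raw = {
    ("a.html", "b.html"): (0.9, 0.8),
    ("a.html", "c.html"): (0.2, 0.1),
    ("b.html", "c.html"): (0.3, 0.2),
}
scores = {pair: combined_similarity(s, t, 0.3) for pair, (s, t) in raw.items()}
print(to_dot(scores))
```

With these sample values only `a.html` and `b.html` end up connected, which is the kind of grouping the rendered image is meant to reveal.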
Splash
We use Splash to take screenshots. You can use the following Bash script to run it with Docker:
#!/usr/bin/env bash
echo 'Splash is running on http://127.0.0.1:8050'
docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
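Once the container is up, a screenshot can be requested through Splash's `render.png` HTTP endpoint. The sketch below shows one way to do that from Python. It is not necessarily how html-cluster calls Splash, and the helper names, the `wait` value and the output path are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Matches the 8050 port mapping in the script above.
SPLASH_URL = "http://127.0.0.1:8050"


def screenshot_url(page_url, wait=0.5):
    """Build a Splash render.png request URL for `page_url`.

    `wait` is the number of seconds Splash waits after the page loads
    before taking the screenshot.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{SPLASH_URL}/render.png?{query}"


def save_screenshot(page_url, path):
    """Fetch a PNG from Splash and write it to `path` (needs Splash running)."""
    with urlopen(screenshot_url(page_url), timeout=60) as response:
        data = response.read()
    with open(path, "wb") as handle:
        handle.write(data)


if __name__ == "__main__":
    # Only prints the request URL, so this runs without a Splash instance.
    print(screenshot_url("https://example.com"))
```

Calling `save_screenshot("https://example.com", "example.png")` would write the image to disk, provided the Splash container is reachable.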
Examples of the images generated by the script
Example 1 using Splash

Example 2 using Splash

Example 2 without using Splash (the default behaviour)
