Exclude URLs based on HTML tags
Currently lychee seems to check each and every URL found in the HTML document.
Generally there is nothing wrong with that. We previously used liche, which only checks specific tags and attributes.
While checking URLs found in plain text is generally fine, IMO it would be great to be able to exclude certain elements, not just URLs matching a pattern. Since HTML files are already parsed as such, it should be cheap to exclude specific tags, e.g. <code> and <pre>. An option to exclude URLs based on their enclosing tag would be awesome.
For HTML that would work, as the tags have names. I'm not sure about Markdown or plaintext. If we add an option like --exclude-tags=pre,code, what would the behavior be for Markdown? Simply ignore it (which might be confusing), throw an error (which might be frustrating), or something else?
In general, I wonder if it makes a difference whether a link occurs inside or outside a pre tag for example. Do you have an example for links that predominantly occur in special tags that you couldn't exclude with a regex? I just don't want to make the project unnecessarily complex.
I personally check links in Markdown only after building the HTML pages from it. But I agree it should be made clear, when adding such an option, that it can and does only apply to HTML input files, e.g. by naming it --exclude-html-tags or similar. In Markdown things are much more complicated, due to the many extensions and ways to create code blocks and inline code, which also depend on indentation etc. So I guess it can practically only be used on HTML, but that should cover the majority of scanned documents, especially as websites can be scanned via URLs as well, and as this is currently the only way to scan internal links etc.
In general, I wonder if it makes a difference whether a link occurs inside or outside a pre tag for example.
We have no "links" in any pre or code tag, and you're right that this is not even possible, if by that you mean clickable links. Both tags take all content literally. But we do have URLs inside them, e.g. when telling users how to access the web interface of their locally running application via the loopback IP, localhost, or LAN hostnames/IPs. In the case of MkDocs, code tags are nice, as they allow easy copy & paste. The same applies when using a pre tag to show a script's content filled with LAN IPs/hostnames or example URLs.
I know about the -E option to exclude them all together (aside from invalid example URLs), but we cannot use it, as we must check internal links via a locally running webserver and hence use the loopback IP as base URL, which is then excluded as well. Actually, an alternative for at least our case would be if -E excludes were not applied to any URL that was generated for internal links via the base URL. That makes sense anyway, as no one would define a base URL only to then exclude it from checks 😄.
lychee should first distinguish strict and fuzzy link detections.
strict means a string is a link if and only if it's rendered as a hyperlink in an HTML file. This naturally excludes tags like code or pre.
fuzzy means a string is a link if it's a valid URI of a scheme we support.
After implementing this, it's easier to offer finer control to users.
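A minimal sketch of that strict/fuzzy classification; the names and types here are illustrative, not lychee's actual API:

```rust
#[derive(Debug, PartialEq)]
enum LinkKind {
    /// Rendered as a hyperlink (e.g. taken from an href/src attribute).
    Strict,
    /// Found in plain text and merely parses as a URI.
    Fuzzy,
}

/// Classify a detected link based on where it was found.
fn classify(from_attribute: bool, enclosing_element: Option<&str>) -> LinkKind {
    // Verbatim elements never render hyperlinks, so anything inside
    // them can only be a fuzzy (textual) match.
    let verbatim = matches!(enclosing_element, Some("pre" | "code"));
    if from_attribute && !verbatim {
        LinkKind::Strict
    } else {
        LinkKind::Fuzzy
    }
}

fn main() {
    // An href inside a regular <a> tag is rendered: strict.
    assert_eq!(classify(true, Some("a")), LinkKind::Strict);
    // A bare URL inside <pre> is never rendered: fuzzy.
    assert_eq!(classify(false, Some("pre")), LinkKind::Fuzzy);
}
```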
Agree with both of your comments. I guess your ideas would fit together quite nicely.
The collector could return links with a tag field, which would be an HTML tag. We'd classify those as strict or fuzzy, so a pre would be fuzzy, for example.
We could think about using different words than strict/fuzzy, like hyperlink/plaintext or just link/text, but I like the proposal in general.
While checking our forum for outdated links, I noticed that the shortened URL text within anchor tags is checked as well. phpBB shortens long URLs, so that an element like this is produced:
<a href="https://example.com/loong/looong/url/that/is/shorteneed/index.html" class="postlink">https://example.com/loong/looong/url/th ... index.html</a>
lychee now checks the URL itself, but also the text, where it identifies https://example.com/loong/looong/url/th as a URL, which of course fails.
So once basic functionality exists to exclude URLs/text based on HTML tags, the text content of elements that have an explicit src/href/... attribute could be excluded by default, as AFAIK such shortening is quite common.
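A sketch of that default, assuming the extractor knows whether an element carries an explicit href: check only the attribute value and skip the (possibly shortened) text content. All names here are hypothetical:

```rust
/// If the element has an explicit href, its text is display-only (phpBB
/// truncates it with "..."), so only the attribute target is worth checking.
/// Without an attribute, fall back to whatever URLs were found in the text.
fn urls_to_check<'a>(href: Option<&'a str>, text_urls: &[&'a str]) -> Vec<&'a str> {
    match href {
        Some(url) => vec![url],
        None => text_urls.to_vec(),
    }
}

fn main() {
    // The shortened display text is ignored; only the real target remains.
    let checked = urls_to_check(
        Some("https://example.com/loong/looong/url/that/is/shorteneed/index.html"),
        &["https://example.com/loong/looong/url/th"],
    );
    assert_eq!(
        checked,
        vec!["https://example.com/loong/looong/url/that/is/shorteneed/index.html"]
    );
    // Plain text without an attribute is checked as before.
    assert_eq!(urls_to_check(None, &["https://example.com/"]), vec!["https://example.com/"]);
}
```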
That's odd. If this is a file with a .html extension, it should have skipped that shortened version. Can you double check that this is the case?
Ah, this was when scanning URLs (with .php extensions and query strings), not local files. With URLs as input it's of course harder to derive/guess the type 🤔. The content-type: text/html response header could be used to interpret it as HTML, but I'm not sure how reliable that is compared to a file extension. Otherwise, additionally, a contained <html> tag? But that gets complicated, as currently the type/parser is chosen based on the extension only. Interpreting .php files as HTML in general also doesn't make sense, as those could be PHP source scripts as well.
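A purely illustrative sketch of such a fallback: prefer the file extension when there is one, otherwise consult the Content-Type header (which may carry parameters like a charset that need to be stripped first):

```rust
/// Decide whether an input should be parsed as HTML: trust a .html/.htm
/// extension first, and otherwise fall back to the Content-Type response
/// header (e.g. for .php pages served as text/html).
fn treat_as_html(path_or_url: &str, content_type: Option<&str>) -> bool {
    let ext_html = path_or_url.ends_with(".html") || path_or_url.ends_with(".htm");
    let header_html = content_type
        // Drop parameters like "; charset=utf-8" before comparing the MIME type.
        .and_then(|ct| ct.split(';').next())
        .map_or(false, |mime| mime.trim().eq_ignore_ascii_case("text/html"));
    ext_html || header_html
}

fn main() {
    // A PHP page served as HTML is parsed as HTML.
    assert!(treat_as_html("viewtopic.php?t=1", Some("text/html; charset=utf-8")));
    // A local .html file needs no header.
    assert!(treat_as_html("docs/index.html", None));
    // A raw PHP source script is not HTML.
    assert!(!treat_as_html("script.php", Some("text/plain")));
}
```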
Adding this idea here: CSS-selector-based excludes (and exclusive includes, if it turns out there is a wider use case for such) would be awesome. But I guess that's hard to achieve by hand, so only if the underlying HTML parser or another library already supports it.
As a workaround for HTML tags, I wonder if lychee could be combined with htmlq or something similar.
htmlq --invert 'pre' | htmlq --invert 'code' | lychee
The --invert flag doesn't exist yet, but we could create an issue for that.
There's also pup, which has a PR open for that for a long time: https://github.com/ericchiang/pup/pull/81.
But that would not work when using lychee's regular input arguments, especially complicated when wanting to scan URLs instead of local files.
It is not that big an issue for us. The -E flag plus a small number of --excludes does the trick. It was more something we were used to from liche, where code tags were hence a simple way to exclude example URLs or emails in web application access and login documentation, like (with Markdown `` converted to HTML code tags):
Access the web interface at: `http://<your.IP>:1234`
Username: `[email protected]`
We have at least one web application where indeed an invalid example email is required for initial login, from where it can be changed 😉.
Wait, you say liche supported this? Can you post an example that you used before? Can't find much about it in their repo.
We have at least one web application where indeed an invalid example email is required for initial login, from where it can be changed 😉.
That sounds more like a bug / weird feature on the site no? You could exclude that with --exclude-mail right now? (There is no way to exclude a single mail address yet, but that might be an easier thing to add.)
liche excludes code and pre tags OOTB. We just called:
liche -c 64 -x 'https://(github.com|twitter.com|www.spigotmc.org|pi-hole.net|www.php.net)' -d build -r README.md build/docs
without any additional configuration. When switching to lychee we needed to add all URLs, IPs and links from code and pre tags as excludes. But as you can see, with liche we needed some other excludes which are not required with lychee anymore.
That sounds more like a bug / weird feature on the site no?
😄 well it is not uncommon that a web application freshly installed ships with a default login. And the particular one uses an invalid email address as default login, as it generally aims to have email addresses as usernames, that's all.
You could exclude that with --exclude-mail right now? (There is no way to exclude a single mail address yet, but that might be an easier thing to add.)
The regular --exclude works for emails as well; we do not want to exclude all emails, just that one particular case 😉.
liche excludes code and pre tags OOTB
That's smart. Starting to work on that as part of https://github.com/lycheeverse/lychee/issues/414.
As part of https://github.com/lycheeverse/lychee/pull/424 I added element and attribute to the Request struct now. It's not exposed to the CLI yet, but with that we can add exclusion based on tags and set defaults in the future.
I'm thinking of supporting an option like --exclude-element pre,code as well as setting const DEFAULT_ELEMENTS_EXCLUDED: &'static [&'static str] = &["pre", "code"];.
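A rough sketch of how that exclusion could sit on top of the new element field; the Request struct here is a simplified stand-in, not lychee's actual type:

```rust
/// Default elements whose contained links are skipped, per the proposal above.
const DEFAULT_ELEMENTS_EXCLUDED: &[&str] = &["pre", "code"];

#[allow(dead_code)]
struct Request {
    url: String,
    /// HTML element the link was found in, if the input was HTML.
    element: Option<String>,
}

/// A request is excluded if the link was found inside one of the
/// excluded elements; links without element info are always checked.
fn is_excluded(req: &Request, excluded: &[&str]) -> bool {
    req.element.as_deref().map_or(false, |e| excluded.contains(&e))
}

fn main() {
    let in_code = Request {
        url: "http://localhost:1234".into(),
        element: Some("code".into()),
    };
    let in_anchor = Request {
        url: "https://example.com".into(),
        element: Some("a".into()),
    };
    assert!(is_excluded(&in_code, DEFAULT_ELEMENTS_EXCLUDED));
    assert!(!is_excluded(&in_anchor, DEFAULT_ELEMENTS_EXCLUDED));
}
```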
That sounds great, many thanks for working on this 👍.
Examples for CSS selectors, that we could support eventually. I saw these being used in an internal tool.
a[href]
[href*='.js']
[src*='.js']
CSS selectors would be awesome and very flexible, and simple tags can be selected just as simply as code/pre, without needing to know any further CSS selector syntax 👍.
Yeah I think selectors are the better abstraction over filtering by element name. I'm a bit concerned about the overhead.
@untitaker would filtering by CSS selector still be within the scope of html5gum? Would be nice if elements could be filtered out as early as possible to avoid unnecessary allocs. 😅
See also
- https://github.com/causal-agent/scraper
- https://github.com/importcjj/nipper
I'm a bit concerned about the overhead.
Ditto. At least it shouldn't cause overhead when the option is not used and on non-HTML inputs. However, for our projects it was trivial to work around the need for this via -E and a small number of explicit URI excludes. I think it is generally still useful, and excluding <code> and <pre> is even a reasonable default (they often contain example or schematic URIs), but until the issue/request triggers more comments from other users, it's fine not to work on it with high priority 🙂.
the overhead is proportional to the CSS path.
Either CSS path or XPath are just ways to specify a node in the HTML tree. Implementation wise that's really easy.
Makes sense. I just have no idea how much relative overhead it would be to check the expected CSS paths, like code or a.nocheck (I would expect or use relatively short ones myself: code tags/blocks and probably an explicit nocheck class), compared to parsing a tag and the contained URI in general and doing an offline or online URI check. Basically, would it make a notable difference, at least when doing offline-only checks, or would it be barely measurable? Likely something we won't know for sure until doing the implementation 😄.
@untitaker would filtering by CSS selector still be within the scope of html5gum? Would be nice if elements could be filtered out as early as possible to avoid unnecessary allocs. 😅
that requires adding the tree building/dom logic to html5gum, which is all well in scope but a very big task. the spec for assembling tokens into a tree structure is bigger than the tokenization spec.
you could have a css selector engine that does not allow you to probe for hierarchy. e.g. if you disallow p > a, p a etc it should be possible to patch into lychee without changes to html5gum. i am not sure how generally useful that is, and whether it would be a good fit for html5gum depends on the exact api proposal imo
Probably I need some explanation how this tokenization actually works, especially when comparing html5gum and html5ever.
I mean, lychee currently detects and checks specific HTML tags, respectively their src/href attributes (otherwise it wouldn't be able to detect relative/internal URLs), but it also detects URIs in any other tag and in raw text. Since it detects tags and their attributes, it should at least be possible to skip over defined tags, or tags with a defined class, shouldn't it? That would not allow full CSS selectors unless a DOM tree were generated while parsing, but for practical use, excluding specific tags or classes, or tags with classes, should cover the vast majority of cases. So basically a minimal CSS selector syntax like [tag][.class], with both elements optional, would be possible without the potentially heavy DOM tree generation.
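A minimal sketch of that [tag][.class] idea, matching a selector against a single element without any DOM tree; illustrative only, not an actual lychee API:

```rust
/// A selector of the form `tag`, `.class`, or `tag.class`;
/// both parts optional, no hierarchy (no `p > a` etc.).
struct SimpleSelector {
    tag: Option<String>,
    class: Option<String>,
}

fn parse_selector(s: &str) -> SimpleSelector {
    match s.split_once('.') {
        // "a.nocheck" -> tag "a", class "nocheck"; ".nocheck" -> class only.
        Some((tag, class)) => SimpleSelector {
            tag: (!tag.is_empty()).then(|| tag.to_string()),
            class: Some(class.to_string()),
        },
        // "code" -> tag only.
        None => SimpleSelector { tag: Some(s.to_string()), class: None },
    }
}

/// Match against one element's tag name and class list; an absent
/// selector part matches anything.
fn matches_element(sel: &SimpleSelector, tag: &str, classes: &[&str]) -> bool {
    sel.tag.as_deref().map_or(true, |t| t == tag)
        && sel.class.as_deref().map_or(true, |c| classes.contains(&c))
}

fn main() {
    let code = parse_selector("code");
    let nocheck = parse_selector("a.nocheck");
    assert!(matches_element(&code, "code", &[]));
    assert!(matches_element(&nocheck, "a", &["nocheck", "external"]));
    assert!(!matches_element(&nocheck, "a", &["postlink"]));
}
```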
We don't need upstream packages to incorporate filtering based on CSS paths. We only need to add additional checks in the extract module.
@MichaIng there's no conceptual difference between html5gum tokenizer and html5ever tokenizer. outside of the tokenizer however, html5ever contains a full DOM tree builder while html5gum does not. You're entirely correct with your assessment as to which css selectors would be possible right now. It's also true that html5gum doesn't have to be changed in any way to support the simple version of the feature you described.
I think if somebody wants to seriously pursue this it might be worth checking whether lol-html can be used as parser (i.e. try to remove html5gum again) -- it features full css selector support. i feel like the library is otherwise rather purpose-built around cloudflare's usecase and hard to use for others.
Quick update for anyone who'll stumble across this issue in the future:
Excluding code blocks and pre elements is now done by default.
There is a flag named --include-verbatim to also include links in these sections.
What's left to be done is support for custom exclusions via CSS selectors.
@MichaIng's proposal for [tag][.class] sounds reasonable.
Excluding code blocks and pre elements is now done by default.
the next version of html5gum will include an option to not parse the <a> tag inside of <script><a href="https://google.com"></a></script>, see https://github.com/untitaker/hyperlink/issues/148. that might be related to what we're doing here
I have a similar issue.
I'm using Jekyll + Hugo to generate web pages from Markdown:
Install [eksctl](https://eksctl.io/):
```bash
if ! command -v eksctl &> /dev/null; then
  # renovate: datasource=github-tags depName=weaveworks/eksctl
  EKSCTL_VERSION="0.118.0"
  curl -s -L "https://github.com/weaveworks/eksctl/releases/download/v${EKSCTL_VERSION}/eksctl_$(uname)_amd64.tar.gz" | sudo tar xz -C /usr/local/bin/
fi
```
which generates HTML code like:
<p>Install <a href="https://eksctl.io/">eksctl</a>:</p><div class="language-bash highlighter-rouge"><div class="code-header"> <span data-label-text="Shell"><i class="fas fa-code small"></i></span> <button aria-label="copy" data-title-succeed="Copied!"><i class="far fa-clipboard"></i></button></div><div class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre><td class="rouge-code"><pre><span class="k">if</span> <span class="o">!</span> <span class="nb">command</span> <span class="nt">-v</span> eksctl &> /dev/null<span class="p">;</span> <span class="k">then</span>
<span class="c"># renovate: datasource=github-tags depName=weaveworks/eksctl</span>
<span class="nv">EKSCTL_VERSION</span><span class="o">=</span><span class="s2">"0.118.0"</span>
curl <span class="nt">-s</span> <span class="nt">-L</span> <span class="s2">"https://github.com/weaveworks/eksctl/releases/download/v</span><span class="k">${</span><span class="nv">EKSCTL_VERSION</span><span class="k">}</span><span class="s2">/eksctl_</span><span class="si">$(</span><span class="nb">uname</span><span class="si">)</span><span class="s2">_amd64.tar.gz"</span> | <span class="nb">sudo tar </span>xz <span class="nt">-C</span> /usr/local/bin/
<span class="k">fi</span>
</pre></table></code></div></div>
Real example can be seen here: https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/
When I run lychee, it reports URLs inside "code" as invalid, for example https://github.com/weaveworks/eksctl/releases/download/v${EKSCTL_VERSION}/eksctl_$(uname)_amd64.tar.gz - which is correct, but I would like to exclude any URLs inside "code" sections from being scanned...
❯ lychee https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/
...
Issues found in 1 input. Find details below.
[https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/]:
✗ [404] https://github.com/weaveworks/eksctl/releases/download/v | Failed: Network error: Not Found
✗ [404] https://awscli.amazonaws.com/awscli-exe-linux-x86_64- | Failed: Network error: Not Found
✗ [404] https://stefanprodan.github.io/podinfo | Failed: Network error: Not Found
✗ [403] https://charts.bitnami.com/bitnami | Failed: Network error: Forbidden
✗ [404] https://ruzickap-github-io.pages.dev/assets/img/posts/2022/2022-11-27-cheapest-amazon-eks/https:/raw.githubusercontent.com/aws-samples/eks-workshop/65b766c494a5b4f5420b2912d8373c4957163541/static/images/icon-aws-amazon-eks.svg | Failed: Network error: Not Found
✗ [404] https://storage.googleapis.com/kubernetes-release/release/v | Failed: Network error: Not Found
✗ [ERR] file://tmp/ | Failed: Cannot find file
🔍 63 Total ✅ 52 OK 🚫 7 Errors (HTTP:7) 💤 4 Excluded
Can I instruct lychee to ignore the URLs which are part of the code sections somehow?
In "code" sections there may be some variables / templates / substitutions (like above) which should be "excluded" from URL checks.
Thank you...
Actually, code and pre tags have been excluded automatically for a while now. But what I noticed is that this does not apply to other tags inside code or pre tags, like span in your case. This is often the case with Markdown code highlighting. It's probably possible without much effort to exclude everything in verbatim tags recursively.
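A sketch of what "recursively" could mean in a streaming tokenizer: keep a verbatim depth counter so that URLs inside nested tags like span within code are excluded too. Token here is a simplified stand-in for a real tokenizer's output:

```rust
/// Simplified token stream: open tag, close tag, or a detected URL.
enum Token<'a> {
    Open(&'a str),
    Close(&'a str),
    Url(&'a str),
}

/// Collect URLs that occur outside any verbatim element, regardless of
/// how deeply other tags (e.g. <span>) are nested inside <code>/<pre>.
fn extract_urls<'a>(tokens: &[Token<'a>]) -> Vec<&'a str> {
    let mut verbatim_depth = 0usize;
    let mut urls = Vec::new();
    for t in tokens {
        match t {
            Token::Open(tag) if matches!(*tag, "pre" | "code") => verbatim_depth += 1,
            Token::Close(tag) if matches!(*tag, "pre" | "code") => {
                verbatim_depth = verbatim_depth.saturating_sub(1)
            }
            Token::Url(u) if verbatim_depth == 0 => urls.push(*u),
            _ => {}
        }
    }
    urls
}

fn main() {
    use Token::*;
    // Mirrors the Rouge output above: a real link in a <p>, and a URL
    // inside a <span> that is itself inside a <code> block.
    let tokens = [
        Open("p"), Url("https://eksctl.io/"), Close("p"),
        Open("code"), Open("span"),
        Url("https://github.com/weaveworks/eksctl"),
        Close("span"), Close("code"),
    ];
    assert_eq!(extract_urls(&tokens), vec!["https://eksctl.io/"]);
}
```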