rss-bridge icon indicating copy to clipboard operation
rss-bridge copied to clipboard

Improve html sanitization

Open Bockiii opened this issue 3 years ago • 8 comments

Grouping issue about html sanitization.

Please link issues/PRs to this. We can have the discussion and options ( like http://htmlpurifier.org/ ) in here and derive todos from it.

Bockiii avatar Mar 26 '22 11:03 Bockiii

Things to do

  • [ ] Research list of alternatives
  • [ ] Pro and con list of all alternatives
  • [ ] Select replacement
  • [ ] Impact analysis. Which files need changing, which operators need changing
  • [ ] Documentation assessment. Which parts of the documentation need to be adapted
  • [ ] New branch for all the neccessary changes, all changes go into that branch
  • [ ] Upon completion of all tasks, merge branch

Bockiii avatar Mar 26 '22 11:03 Bockiii

Check #2520 for references.

Bockiii avatar Mar 28 '22 18:03 Bockiii

Check https://github.com/RSS-Bridge/rss-bridge/blame/master/lib/FormatAbstract.php#L164 . logman already provided htmlpurifier as an alternative when he described the function.

Bockiii avatar Mar 30 '22 15:03 Bockiii

I found a relevant function in lib/html.php:

/**
 * Removes unwanted tags from a given HTML text.
 *
 * @param string $html The HTML text to sanitize.
 * @param array $tags_to_remove A list of tags to remove from the DOM.
 * @param array $attributes_to_keep A list of attributes to keep on tags (other
 * attributes are removed).
 * @param array $text_to_keep A list of tags where the innertext replaces the tag
 * (i.e. `<p>Hello World!</p>` becomes `Hello World!`).
 * @return object A simplehtmldom object of the remaining contents.
 *
 * @todo Check if this implementation is still necessary, because simplehtmldom
 * already removes some of the tags (search for `remove_noise` in simple_html_dom.php).
 */
function sanitize($html,
	$tags_to_remove = array('script', 'iframe', 'input', 'form'),
	$attributes_to_keep = array('title', 'href', 'src'),
	$text_to_keep = array()){

	$htmlContent = str_get_html($html);

	foreach($htmlContent->find('*') as $element) {
		if(in_array($element->tag, $text_to_keep)) {
			$element->outertext = $element->plaintext;
		} elseif(in_array($element->tag, $tags_to_remove)) {
			$element->outertext = '';
		} else {
			foreach($element->getAllAttributes() as $attributeName => $attribute) {
				if(!in_array($attributeName, $attributes_to_keep))
					$element->removeAttribute($attributeName);
			}
		}
	}

	return $htmlContent;
}

dvikan avatar Mar 30 '22 22:03 dvikan

Just so we are clear about what we are talking about here. We are talking about sanitizing user input that we know is html. This means we want to take a string of html and explicitly keep certain html tags and html tag attributes. The rest we throw away. The main purpose is to nicely display the feed item in a reader but to also remove harmful stuff such as:

<p onload="while(true) alert(42)">lawls</p>

Notice that sanitizing is different from html escaping. Html escaping means converting html to something that the browser does not interpret as html. So e.g. <h1>kek</h1> => &lt;h1&gt;kek&lt;/h1&gt;

dvikan avatar Apr 02 '22 15:04 dvikan

This is mainly a problem in our HtmlFormat. It is very easy for feed owners to insert any kind of html/js. Luckily, feed readers do sanitizing too. So it's not really a big problem.

A related issue is that we should cleanup the data we insert in our feeds so that they render properly in feed readers.

dvikan avatar Apr 08 '22 21:04 dvikan

https://github.com/tgalopin/html-sanitizer/

dvikan avatar Apr 09 '22 09:04 dvikan

While this is being considered; I'd like to chime in with a use case that is greatly harmed by sanitization; so a toggle of some sort might be considered.

When using rss-bridge to generate feeds, it can occasionally be very useful to include iframes. As the author of the programs generating the feeds' contents, I'd greatly prefer to not have content stripped out of the feeds I design.

I understand the security concerns, but needing to use rss-bridge to output garbage (including html as text etc.) and write custom plugins for feed readers to convert that garbage back to the correct content is both extremely inconvenient and extremely impactful to performance. An enormous benefit of rss-bridge is the ability to process feed contents at refresh-time rather than view-time, saving human users actual time via reduced page loads, so having to run another plugin in the reader on every refresh of every entry greatly increases load times.

For example, when writing a Bridge to parse Reddit posts, including gfycat links (or links from other specific trusted websites) as iframes means I can use their official url format and not need to query their API and parse json to make a direct link to a video; which would also mean giving up UI like author name/link, tags, etc.

RSS-Bridge is a fantastically useful program, and I think that giving bridge developers the option to disable sanitization on the output they design will make it even better.

Edit: removed style example, looks like that's being filtered out at some other stage.

Patricol avatar Apr 10 '22 22:04 Patricol