add testable rss reader with xpath
I did not like the rRSS::search method with removeTegs.
Problematic example when searching for <item(| .*?)>(.*?)</item> or <channel.*?>(.*?)</channel>:
<channel>...<item>...
<description>
<![CDATA[Some bad description with '</item>','<item>' or '</channel>' tags ]]>
</description>
...</item>...</channel>
(Additionally, I don't get why removeTegs was applied on URLs (and not on the title). Maybe to prevent some XSS?)
Although, I do not know for sure if my approach is also more robust for other examples, at least it has tests in case issues arise.
If libxml is available, DOMDocument and DOMXPath (with $doc->recover = true) are used to parse RSS/Atom feeds.
It will also show any parsing error messages to the user.
Unfortunately, a somewhat complex fallback to RegexRSSReader is required if libxml is not available (in order to deal with <![CDATA[ and <!-- sections).
The feed field mappings should not have changed. (see tests/plugins/rss/RSSTest.php)
Except:
- For atom feeds the description field consists of summary and content.
- An empty title is now
<No Title> - Time is
0instead ofnull
@stickz Damn, I haven't considered that at all. But do you think this PR is even worth pursuing (for rss parsing robustness)? I have my doubts since there haven't been issues of rss parsing as far as I know. Otherwise, I would just remove add php test for rRSS fetch if you see no other issues ( or have no other suggestions).
@TrimmingFool This would be a better question for @Novik. I don't know anything about RSS parsing. I just wanted to give my feedback on the requested change to the autoloader I built for ruTorrent, to greatly cut down on script execution times.
If he sees any value in this pull request, I would say to scrap the testing and merge it. But if we're trying to fix something that isn't broken, this could potentially introduce regressions. It would be a step backward in the wrong direction.
Weird, the php tests seem to work for me now even without the change to php/util.php.
... trying to fix something that isn't broken
Not broken per se, but vulnerable to xml tags in the rss description. For instance, a description author could trick some matching on the title by injecting new channel items. However, in my opinion this might be safely ignored since trusting the feed is already kind of the same as trusting description authors.
But I would also be interested in @Novik's opinion on this.
@TrimmingFool Any type of injection on a web application (including by not limited to xml tags) is very problematic for web security. This type of vulnerability should be fixed. We should not trust a third-party to secure their RSS feed. The entire RSS plugin becomes insecure if we do this. They can take every possible precaution, then someone can use a zero-day exploit on them. Even if the intruder has full access to the feed anyways, we still want to mitigate the impact and their ability to deceive the subscriber.
I can not take responsibility for this pull request and merge it because I'm not familiar with RSS. This is better suited for @Novik. I don't know if this is the right way to fix the problem or if the RSS portion of the pull request has any regressions.