KindleEar: question about "Failed to execute recipe: list index out of range"

[recipe_input.py:94] Failed to execute recipe "Laos RFA": list index out of range

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from calibre.web.feeds.recipes import BasicNewsRecipe

class AdvancedUserRecipe1723469468(BasicNewsRecipe):
    __created_date__ = "2024-08-12"
    title = "Laos RFA"
    description = "A selection of news from and about Laos."
    encoding = "utf-8"
    language = "en"
    max_articles_per_feed = 10
    oldest_article        = 7
    auto_cleanup = False
    keep_images = False
    use_embedded_content = True

    feeds = [
        ("Laos RFA", "https://www.rfa.org/english/news/laos_news/rss2.xml"),
    ]

This is a full-text RSS feed, and I don't know why it fails. I'm on version 3.0.7.

Aug 12 '24 13:08 Steven630

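Testing on Oracle returns a 403 (Forbidden) error, so I can't debug it there.

You can help narrow it down: add the following to the Calibre options, then post the error log here.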

{
  "log_level": "debug"
}

Aug 12 '24 15:08 cdhigh

I've added that option, but the log doesn't seem to be any more specific:

[recipe_input.py:94] Failed to execute recipe "Laos RFA": list index out of range

[plumber.py:394] Failed to execute input plugin: All feeds are empty, aborting.

[worker.py:149] There are no new feeds available: admin: [Laos RFA]

Aug 12 '24 22:08 Steven630

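That means your version is too old. Please upgrade and try again. I've updated the deployment script, so upgrading repeatedly will no longer incur charges.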

Aug 12 '24 23:08 cdhigh

OK. Does this mean the code for filtering recipes by language can no longer be used?

Aug 12 '24 23:08 Steven630

Right, mainly because I don't want the parameters to get too complicated.

Aug 12 '24 23:08 cdhigh

"[news.py:1960] Failed feed: Laos RFA: Exception Cannot fetch https://www.rfa.org/english/news/laos_news/rss2.xml:403 [news.py:1957->...->news.py:1957->news.py:1957]"

Aug 13 '24 01:08 Steven630

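I remember now: the index-out-of-range problem was already fixed some time ago. RFA can't be fetched because the site has strong anti-crawler protection. KindleEar only has fairly simple tricks for getting past anti-crawler measures and can't break through this block for now.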

Aug 13 '24 08:08 cdhigh

Even the official full-text RSS feed is behind anti-crawler protection? Oh well...

Aug 13 '24 10:08 Steven630

The anti-crawler protection covers the whole site; the site's engineers probably forgot to whitelist the RSS URLs.

Aug 13 '24 12:08 cdhigh

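Ah, nothing to be done then. How can the approach used by the old Zhihu forwarder be applied to the new version? Could you write a sample recipe? I think a forwarder could still solve a lot of blocked-IP problems.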

Aug 13 '24 16:08 Steven630

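I've updated the code to add one more line to the HTTP headers so that requests look more like a browser's, and RFA can now be fetched.
If you have a forwarder running on another server, you can use it with the new KindleEar too. For example, if your forwarder is at http://example.com, change the feed URL in the recipe to the following form: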

feeds = [
    ("Laos RFA", "http://example.com/?k=xzSlE&t=60&u=https://www.rfa.org/english/news/laos_news/rss2.xml"),
]

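Just note that a forwarder only addresses the "IP ban" kind of anti-crawler measure, and there are many other kinds. The RFA anti-crawler protection in this issue was not an IP ban.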
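The feed URL pattern above can also be generated instead of typed by hand. This is only a minimal sketch, assuming the forwarder takes the same k, t and u query parameters as in the example; forwarder_url is a hypothetical helper name, not something KindleEar provides.

```python
from urllib.parse import urlencode


def forwarder_url(forwarder, key, timeout, target):
    """Wrap `target` in a forwarder URL of the form shown in the thread.

    forwarder_url is a hypothetical helper, not part of KindleEar.
    """
    # safe=":/" leaves the target's scheme and path readable, matching the
    # hand-written example; other reserved characters are percent-encoded.
    query = urlencode({"k": key, "t": timeout, "u": target}, safe=":/")
    return f"{forwarder.rstrip('/')}/?{query}"


feeds = [
    ("Laos RFA", forwarder_url(
        "http://example.com", "xzSlE", 60,
        "https://www.rfa.org/english/news/laos_news/rss2.xml")),
]
```

If the target URL has its own query string, its "?", "&" and "=" characters get percent-encoded here. That relies on the forwarder decoding the u parameter the way a standard query parser does, which is an assumption worth checking against your forwarder.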

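Technology moves fast, and the serverless approach is now a better fit for forwarding. Put simply, it's a single JavaScript script with no server to configure; Cloudflare automatically deploys it to CDN servers around the world. I've also updated the forwarder repository with a Cloudflare Worker implementation.
KindleEar can be paired with RSSHUB. Before writing your own RSS scraping code, search RSSHUB first to see whether the content you want is already covered.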

Aug 13 '24 23:08 cdhigh

Thanks so much! I'll try it in the next couple of days. The latest commit shows a red cross (failure) on the right. Will that affect anything?

Aug 14 '24 13:08 Steven630

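That's the result of a GitHub Actions run. Right now it means the automatic documentation update (compiling *.md into *.html) failed; it has nothing to do with the code.

As for why the documentation update failed, perhaps a build dependency changed or the system environment had an error. It may well pass next time.

PS: The new deployment script fetches the latest Calibre recipes every time it runs, so whenever Calibre recipes are updated in the future, just rerun the deployment script to sync them to your project.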

Aug 14 '24 13:08 cdhigh

Understood. I just deployed to Cloudflare. Is this the URL?

https://my-worker.subdomain.workers.dev/?k=xzSlE&t=timeout&u=URL

How can I tell whether the deployment succeeded?

Aug 14 '24 13:08 Steven630

The dashboard has a "Visit" button, which is the link. If something goes wrong, check the logs in the dashboard.

Visiting the link directly should at least return

Auth Key is invalid!

Aug 14 '24 13:08 cdhigh

I just tried again and it seems to work now.

Aug 14 '24 15:08 Steven630

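The RFA feed uses full-text RSS, and when it is delivered to the Kindle the image caption text runs past the edge of the Kindle screen. I don't know why.

Also, opening a New York Times link through the forwarder gives a blank page, for example: https://www.nytimes.com/2024/08/12/us/politics/us-china-working-group-trade.html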

Aug 14 '24 16:08 Steven630

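The RFA images overflow because width/height are specified in the HTML. I've updated the code to strip those attributes from img tags.
Forwarding nytimes actually succeeded; it's just that the nytimes anti-crawler mechanism kicked in, and the returned source contains this notice: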

Please enable JS and disable any ad blocker

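After all, this forwarder is a lightweight, purpose-built tool, not a full proxy server, so its use cases are fairly limited. For example, if the returned HTML references images by relative path, the images can't be fetched. In that case you may need to override the BasicNewsRecipe method image_url_processor() to return the correct image URL.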
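To make the image_url_processor() suggestion concrete, here is a rough sketch of such an override. It assumes the method keeps the (baseurl, url) classmethod signature it has in calibre's BasicNewsRecipe; the class name, the site_root attribute and the example.org address are made up for illustration.

```python
from urllib.parse import urljoin

try:
    from calibre.web.feeds.recipes import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be read and run outside KindleEar/calibre.
    BasicNewsRecipe = object


class ForwardedSiteRecipe(BasicNewsRecipe):
    title = "Forwarded site (example)"
    # Hypothetical origin of the real site; replace with the site being forwarded.
    site_root = "https://www.example.org/"

    @classmethod
    def image_url_processor(cls, baseurl, url):
        # Resolve relative image paths against the real site, not against the
        # forwarder address the page was fetched through. Absolute URLs pass
        # through urljoin unchanged.
        return urljoin(cls.site_root, url)
```

Resolving against a fixed site root is the simplest choice. Paths that are relative to the article's own directory would need the article's original URL instead, which this sketch does not track.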

Either way, being able to solve our particular problem in just a few dozen lines of code makes it a pretty good tool.

Aug 14 '24 22:08 cdhigh

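Calibre seems to have updated its fetching code in the last couple of days. I saw the author and other users on mobileread mention an upcoming update while discussing the Science recipe: https://www.mobileread.com/forums/showthread.php?t=362642

There was another update yesterday. Not sure whether it helps KE: https://github.com/kovidgoyal/calibre/commit/e8453ed5906362e580592c4b2976f775b5424c88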

Aug 15 '24 14:08 Steven630

Qt is a desktop technology and can't be used in a server environment.

Aug 15 '24 23:08 cdhigh

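After some searching, I found that niquests, a library compatible with requests, already supports HTTP/1.1, HTTP/2 and HTTP/3, so if there is a real need later, it would only take a one-line change.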

Aug 16 '24 00:08 cdhigh

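The Science recipe apparently needs that support in order to display images. The Calibre author uploaded this recipe just in the last couple of days, so the new version has presumably solved the problem. I also wonder whether NYT could be fixed the same way.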

Aug 16 '24 13:08 Steven630

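OK, let's wait for him to release a stable version. After that, if any recipe you need can't be fetched in KindleEar, open an issue and I'll see whether I can fix it.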

Aug 16 '24 14:08 cdhigh

Using this NYT recipe: https://github.com/kovidgoyal/calibre/blob/master/recipes/nytfeeds.recipe

Error message: [simple.py:649] Could not fetch link https://www.nytimes.com/2024/09/23/opinion/iron-air-battery-china.html: NameError name 'datestring' is not defined [simple.py:604->...->:68->iso8601.py:16]

[recipe_input.py:89] Failed to execute recipe "NYT News": ValueError No articles downloaded, aborting [recipe_input.py:87->...->news.py:1249->news.py:1562]

Sep 26 '24 16:09 Steven630

Fixed.

Sep 26 '24 23:09 cdhigh

Uploading a cover image, a JPG of about 20 KB, always shows an error, and I don't know why.

Sep 28 '24 14:09 Steven630

If there's no specific error description, it's most likely a browser problem.

Sep 28 '24 15:09 cdhigh

I had been using Chrome, and switching to Edge did fix it.

Today NYT hit the error below again. Did it trigger the anti-crawler mechanism?

Traceback (most recent call last):
  File "/workspace/application/lib/urlopener.py", line 110, in open_remote_url
    resp = req_func(url, data=data, headers=headers, timeout=timeout, allow_redirects=True,
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/requests/adapters.py", line 682, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Sep 30 '24 23:09 Steven630

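Yes. Consider setting the recipe's simultaneous_downloads to 1 to turn off multithreaded downloading. If that isn't enough, also set a reasonable delay value. The unit is seconds, and you can use a float for a fetch interval of less than one second.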
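In recipe terms, those two settings are plain class attributes. A minimal sketch follows; the class name and the 0.5 second value are only illustrative, and the right delay depends on how aggressively the site throttles.

```python
try:
    from calibre.web.feeds.recipes import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be read and run outside KindleEar/calibre.
    BasicNewsRecipe = object


class NYTNewsThrottled(BasicNewsRecipe):
    title = "NYT News"
    # Fetch one article at a time instead of downloading in parallel.
    simultaneous_downloads = 1
    # Pause between consecutive fetches, in seconds; a float allows
    # sub-second intervals. 0.5 is an illustrative value, not a recommendation.
    delay = 0.5
```

In an existing recipe such as the NYT one, the same two attributes would be added to its class rather than to a new one.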

Oct 01 '24 01:10 cdhigh

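OK. Some recipes work fine when fetched a second time. Could you consider adding an option, which users can turn on or off, to automatically retry the download 10 minutes after a fetch fails because of a network problem?

Also, the "no news" message in the log on the main page could be more detailed and distinguish two cases: there really are no new articles, or the fetch failed because of a network or page error.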

Oct 01 '24 11:10 Steven630