grab
Web Scraping Framework
There are small typos in: - docs/grab/transport.rst - docs/spider/intro.rst - docs/spider/task.rst - docs/spider/transport.rst - docs/usage/installation.rst - grab/base.py - grab/spider/queue_backend/redis.py Fixes: - Should read `access` rather than `acess`. - Should read...
Fixes #396
``` /lib/python3.10/site-packages/grab/spider/base_service.py", line 64, in is_alive return self.thread.isAlive() AttributeError: 'Thread' object has no attribute 'isAlive'. Did you mean: 'is_alive'? ``` Seen on Python 3.10, but I saw the same problem on 3.9.
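For context, `threading.Thread.isAlive()` was removed in Python 3.9, and `is_alive()` is the supported spelling. Below is a minimal sketch of the replacement call. The helper name `thread_is_alive` is hypothetical and is not part of grab; it only illustrates what a fixed `is_alive` wrapper could delegate to.

```python
import threading


def thread_is_alive(thread):
    """Hypothetical helper (not grab's code): report whether a thread is running.

    Thread.isAlive() no longer exists on Python 3.9+, so call is_alive() instead.
    """
    return thread.is_alive()


# A worker that blocks until we release it, so the check is deterministic.
release = threading.Event()
worker = threading.Thread(target=release.wait)

before_start = thread_is_alive(worker)   # thread not started yet
worker.start()
while_running = thread_is_alive(worker)  # thread is blocked on the event
release.set()
worker.join()
after_join = thread_is_alive(worker)     # thread has finished
```

`is_alive()` is also available on older Python 3 releases, so switching to it should not need a version check.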
Content of `grab.response.head` at the moment the error happened: ``` b'HTTP/1.1 200 OK\r\nDate: Tue, 13 Jun 2017 22:16:36 GMT\r\nServer: Apache\r\nSet-Cookie: \xb3\xd2\xda\xcd\xd7=%96%A6g%9Ay%B0%A5g%A7tm%7C%95%9A; expires=Tue, 25-Jul-2017 14:16:36 GMT; path=/\r\nX-Powered-By: Apache2\r\nVary: Accept-Encoding\r\nContent-Encoding: gzip\r\nContent-Length: 4974\r\nContent-Type:...
Affected file: [grab/document.py](https://github.com/lorien/grab/blob/master/grab/document.py#L38) ``` >>> import libgenapi ... /usr/local/lib/python3.9/site-packages/grab/document.py:35: DeprecationWarning: defusedxml.lxml is no longer supported and will be removed in a future release. import defusedxml.lxml ``` The defusedxml.lxml subpackage will...
I want to point CURLOPT_RESOLVE at a specific IP address, so in create_grab_instance() I wrote: ``` ... g.setup_transport('pycurl') g.transport.curl.setopt(pycurl.RESOLVE, ['api.somesite.com:443:{}'.format(ip)]) return g ``` When I call spider.run(), I get the following...
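For reference, libcurl expects each CURLOPT_RESOLVE entry in the form `HOST:PORT:ADDRESS`. The sketch below builds such an entry and, only if pycurl happens to be installed, applies it to a standalone `pycurl.Curl` handle. The helper name `build_resolve_entry` and the address `203.0.113.10` (a documentation-range IP) are assumptions for illustration. This does not go through grab's transport and does not reproduce the spider failure described above.

```python
def build_resolve_entry(host, port, ip):
    """Hypothetical helper: format one CURLOPT_RESOLVE entry as HOST:PORT:ADDRESS."""
    return "{}:{}:{}".format(host, port, ip)


# 203.0.113.10 is a placeholder address, not a real endpoint.
entries = [build_resolve_entry("api.somesite.com", 443, "203.0.113.10")]

try:
    import pycurl  # optional: only needed to apply the entries to a handle
except ImportError:
    pycurl = None

if pycurl is not None:
    curl = pycurl.Curl()
    # Pin the hostname to the chosen address for this handle only.
    curl.setopt(pycurl.RESOLVE, entries)
    curl.close()
```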
Sometimes there is a real need to send a header field with an empty value. The example [here](https://curl.haxx.se/libcurl/c/httpcustomheader.html) explains how to do that.
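In libcurl's CURLOPT_HTTPHEADER convention, `Name:` with nothing after the colon suppresses that header, while `Name;` (trailing semicolon) sends it with an empty value. A rough sketch of how header strings could be prepared under that convention follows. The function `format_curl_headers` is hypothetical and is not grab's API; it only shows the formatting rule.

```python
def format_curl_headers(headers):
    """Hypothetical helper: turn a dict into libcurl-style header strings.

    An empty value is emitted as "Name;" so libcurl sends the header with
    no content instead of dropping it (which "Name:" would do).
    """
    lines = []
    for name, value in headers.items():
        if value == "":
            lines.append("{};".format(name))
        else:
            lines.append("{}: {}".format(name, value))
    return lines


header_lines = format_curl_headers({"X-Empty": "", "Accept": "text/html"})
```

The resulting list is the shape pycurl accepts for `pycurl.HTTPHEADER`.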
It works on my Debian dev machine. It fails in the GitHub Ubuntu CI environment. ```python @only_grab_transport("pycurl") def test_different_domains(self): import pycurl # pylint: disable=import-outside-toplevel grab = build_grab() names = [ "foo:%d:127.0.0.1"...
The documentation URL redirects to the old page. With this change it will redirect to the new one.