AttributeError while fetching page
Describe the bug When I execute a query without a proxy, an error during captcha resolution blocks the fetch.
Here is the error:
13:31:36 - Getting https://scholar.google.com/scholar?hl=en&q=Perception%20of%20physical%20stability%20and%20center%20of%20mass%20of%203D%20objects&as_vis=0&as_sdt=0,33
13:31:39 - Got a captcha request.
13:31:44 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:31:44 - Retrying with a new session.
13:31:54 - Got a captcha request.
13:32:01 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:32:01 - Retrying with a new session.
13:32:06 - Got a captcha request.
13:32:13 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:32:13 - Retrying with a new session.
13:32:18 - Got a captcha request.
13:32:25 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:32:25 - Retrying with a new session.
13:32:42 - Got a captcha request.
13:32:48 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:32:48 - Retrying with a new session.
Traceback (most recent call last):
File "C:\Users\**\Google Scholar Scrapper\src\main.py", line 8, in <module>
search_query = scholarly.search_pubs('Perception of physical stability and center of mass of 3D objects')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\_scholarly.py", line 160, in search_pubs
return self.__nav.search_publications(url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\_navigator.py", line 296, in search_publications
return _SearchScholarIterator(self, url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\publication_parser.py", line 53, in __init__
self._load_url(url)
File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\publication_parser.py", line 59, in _load_url
self._soup = self._nav._get_soup(url)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\_navigator.py", line 239, in _get_soup
html = self._get_page('https://scholar.google.com{0}'.format(url))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\_navigator.py", line 190, in _get_page
raise MaxTriesExceededException("Cannot Fetch from Google Scholar.")
scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar.
Process finished with exit code 1
The issue seems to come from _proxy_generator.py#_handle_captcha2, line 403, where the cookie variable doesn't have the expected value.
This error isn't present when proxies are activated.
To Reproduce
import logging
from scholarly import scholarly
logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.INFO, datefmt='%H:%M:%S')
search_query = scholarly.search_pubs('Perception of physical stability and center of mass of 3D objects')
scholarly.pprint(next(search_query))
Expected behavior The first result of the query is printed.
Desktop (please complete the following information):
- Proxy service: /
- python version: 3.11
- OS: Windows
- Version 11
Do you plan on contributing? Your response below will clarify whether the maintainers can expect you to fix the bug you reported.
- [ ] Yes, I will create a Pull Request with the bugfix.
I am seeing this error too.
In _proxy_generator.py, the self._session object is an httpx.Client, and the cookies property on this client is a special Cookies store provided by httpx.
According to the httpx docs, there are no attributes for accessing the parts of a cookie directly:
In [1]: from httpx import Cookies
In [2]: cookies = Cookies()
In [3]: cookies.set("chocolate cookie", "tasty", domain="example.org")
In [4]: type(cookies['chocolate cookie'])
Out[4]: str
In [5]: cookies['chocolate cookie']
Out[5]: 'tasty'
In [6]: cookies['chocolate cookie'].domain
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 cookies['chocolate cookie'].domain
AttributeError: 'str' object has no attribute 'domain'
In [7]: cookies['chocolate cookie'].value
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 cookies['chocolate cookie'].value
AttributeError: 'str' object has no attribute 'value'
What's strange is that if that's right, then the code on lines 405-413 could never have worked, which seems unlikely?
I am unable to reproduce this error, but the attribute access on the httpx cookies certainly looks incorrect. It used to work with requests, but httpx doesn't seem to have the same behaviour.
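For anyone looking at a fix: as far as I can tell, the attributes the code expects (`domain`, `value`) live on the standard library's `http.cookiejar.Cookie` objects, not on the strings that subscripting a cookie store returns. Below is a minimal sketch using only the standard library. `make_cookie` and `cookie_fields` are hypothetical helper names I made up for illustration, they are not part of scholarly or httpx, and this is not necessarily how scholarly should be patched.

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain):
    """Hypothetical helper: build a stdlib Cookie like an HTTP client would store."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=bool(domain),
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_fields(jar):
    """Hypothetical helper: return (name, value, domain) for each Cookie in a CookieJar."""
    # Iterating a CookieJar yields Cookie objects, which do carry
    # .name, .value and .domain attributes.
    return [(c.name, c.value, c.domain) for c in jar]


jar = CookieJar()
jar.set_cookie(make_cookie("chocolate cookie", "tasty", "example.org"))
print(cookie_fields(jar))
```

If I read the httpx docs correctly, `httpx.Cookies` exposes its underlying `CookieJar` as the `jar` attribute, so something like `cookie_fields(client.cookies.jar)` should give the same kind of objects. I have not tested that against scholarly itself.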
Hi,
I see the error whenever a captcha is encountered. I have a program that invokes scholarly to try to get some abstracts from Google. After about 10-15 requests, even with random pauses in between, I hit a captcha, this code block is entered, and an exception is thrown. The exception is caught, so scholarly carries on trying to do what it was doing, though unsuccessfully.
Reactivating this issue because I am encountering the same problem, and hoping someone with more expertise than me might be able to solve it. I investigated a little, and it seems that the cookies variable -- at least in my case -- is just a string (e.g. 'NID' or 'GSP'). Hence the error: Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",). Any help is very much appreciated! Thanks