SingleProxy returns True but queries still fail
Describe the bug: scholarly does not work even though I have set up a proxy and SingleProxy returns True. The code snippet is below:
from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success) # True here
scholarly.use_proxy(pg)
search_query = scholarly.search_pubs('A paper title')
pub = next(search_query)
print(pub.bib['cites'])
The error is reported as:
Traceback (most recent call last):
File "myenv\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
conn = connection.create_connection(
File "myenv\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
raise err
File "myenv\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "myenv\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "myenv\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
self._validate_conn(conn)
File "myenv\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
conn.connect()
File "myenv\lib\site-packages\urllib3\connection.py", line 309, in connect
conn = self._new_conn()
File "myenv\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "myenv\lib\site-packages\requests\adapters.py", line 439, in send
resp = conn.urlopen(
File "myenv\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "myenv\lib\site-packages\urllib3\util\retry.py", line 446, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "myenv\lib\site-packages\fp\fp.py", line 32, in get_proxy_list
page = requests.get(self.__website(repeat))
File "myenv\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "myenv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "myenv\lib\site-packages\requests\sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "myenv\lib\site-packages\requests\sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "myenv\lib\site-packages\requests\adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "scratch.py", line 10, in <module>
scholarly.use_proxy(pg)
File "myenv\lib\site-packages\scholarly\_scholarly.py", line 78, in use_proxy
self.__nav.use_proxy(proxy_generator, secondary_proxy_generator)
File "myenv\lib\site-packages\scholarly\_navigator.py", line 68, in use_proxy
proxy_works = self.pm2.FreeProxies()
File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 550, in FreeProxies
proxy = self._proxy_gen(None) # prime the generator
File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 509, in _fp_coroutine
all_proxies = freeproxy.get_proxy_list(repeat=False) # free-proxy >= 1.1.0
File "myenv\lib\site-packages\fp\fp.py", line 35, in get_proxy_list
raise FreeProxyException(
fp.errors.FreeProxyException: Request to https://www.sslproxies.org failed
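For context (a reading of the traceback above, not of scholarly's documented behavior): scholarly.use_proxy(pg) sets only the primary proxy generator, and _navigator.use_proxy then falls back to FreeProxies() for the secondary, which is what tries to fetch a proxy list from https://www.sslproxies.org and fails on this network. A minimal sketch of the two call forms, assuming the same local SOCKS5 proxy:
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')

# Primary only: per the traceback, the secondary falls back to FreeProxies(),
# which has to reach www.sslproxies.org.
# scholarly.use_proxy(pg)

# Primary and secondary: no FreeProxies fallback is attempted.
scholarly.use_proxy(pg, pg)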
Expected behavior: search_pubs should return results through the configured SOCKS5 proxy instead of raising a connection error.
Desktop (please complete the following information):
- Proxy service: Single Proxy, a socks5 proxy started locally
- Python version: 3.8
- OS: Windows 10
- scholarly version: 1.7.11
Do you plan on contributing? Your response below will clarify whether the maintainers can expect you to fix the bug you reported.
- [ ] Yes, I will create a Pull Request with the bugfix.
Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?
It reports scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar. However, the proxy itself seems to work. The modified code snippet is shown below:
import requests
from scholarly import scholarly, ProxyGenerator
proxies = {
"http": "socks5://localhost:1208",
"https": "socks5://localhost:1208"
}
url = 'https://api.ipify.org'
response = requests.get(url, proxies=proxies)
print(response.text) # code 200, returns a US IP address
pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success) # Print True here
scholarly.use_proxy(pg, pg)
search_query = scholarly.search_pubs('Paper title here')
pub = next(search_query)
print(pub.bib['cites'])
The proxy working, with success = True, means it is able to receive responses. However, Google Scholar might still identify the request as automated and block it, which means you'll need a more robust proxy.
I had already considered the case you suggested, so I visited Google Scholar via a web browser through the same proxy, and it worked. However, I will also follow your suggestion and look for a more robust proxy to check.
I have figured out the reason: I am behind a SOCKS proxy, but in _proxy_generator.py, if the proxy does not start with "http", the code adds that prefix, so the configuration became "http": "http://socks5://localhost:1208". I removed the corresponding logic and now the response code is 200. However, another bug involving captcha resolution was then triggered.
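For illustration, a minimal standalone sketch of the prefixing behavior just described and a scheme-aware alternative. The helper names are hypothetical; the actual logic in _proxy_generator.py may differ in detail.
def normalize_proxy_url(proxy):
    # Mirrors the reported behavior: prepending "http://" whenever the value
    # does not start with "http" turns a SOCKS URL into the invalid
    # "http://socks5://localhost:1208".
    if not proxy.startswith("http"):
        proxy = "http://" + proxy
    return proxy

def normalize_proxy_url_fixed(proxy):
    # Only adds a default scheme when the URL carries none, so
    # "socks5://..." is left untouched.
    if "://" in proxy:
        return proxy
    return "http://" + proxy

print(normalize_proxy_url("socks5://localhost:1208"))        # http://socks5://localhost:1208 (broken)
print(normalize_proxy_url_fixed("socks5://localhost:1208"))  # socks5://localhost:1208
print(normalize_proxy_url_fixed("localhost:1208"))           # http://localhost:1208
The captcha traceback mentioned above is: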
Traceback (most recent call last):
File "lib\site-packages\scholarly\_navigator.py", line 132, in _get_page
session = pm._handle_captcha2(pagerequest)
File "lib\site-packages\scholarly\_proxy_generator.py", line 404, in _handle_captcha2
cur_host = urlparse(self._get_webdriver().current_url).hostname
AttributeError: 'NoneType' object has no attribute 'current_url'
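For illustration only, a standalone sketch of the failure mode in this traceback: _get_webdriver() returned None, so accessing .current_url raised AttributeError. The guard below is hypothetical and is not scholarly's actual fix.
from urllib.parse import urlparse

def captcha_host(get_webdriver):
    # get_webdriver stands in for the library's _get_webdriver(); per the
    # traceback it can return None when no webdriver is available.
    driver = get_webdriver()
    if driver is None:
        # Hypothetical guard: fail with a clear message instead of
        # "'NoneType' object has no attribute 'current_url'".
        raise RuntimeError("No webdriver available to handle the captcha")
    return urlparse(driver.current_url).hostname

try:
    captcha_host(lambda: None)
except RuntimeError as exc:
    print(exc)  # No webdriver available to handle the captcha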
The error above regarding the captcha failure is definitely a legitimate bug that I'm fixing right now. Thank you for reporting this.