Search returns no results.
https://github.com/aviaryan/python-gsearch/blob/fba2f42fbf4c2672b72d05b53120adcb25ba8b69/gsearch/googlesearch.py#L120
is effectively destroying the results on some queries.
When searching for: 170PP+270PP
Returns the url https://books.google.es/books?id=XZd2DwAAQBAJ&[...]
re.sub(r'^.*?=', '', url, count=1)
Severs the url to: XZd2DwAAQBAJ&pg=PA283&[...]
What is the intended use of this line? Shouldn't it raise an exception when the return value is not valid?
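The truncation can be reproduced in isolation. This is only a minimal sketch: the URL is shortened to the parts visible in this thread (the rest is elided above), and the helper name `strip_up_to_first_equals` is made up for illustration, not taken from the library.

```python
import re

# Shortened to the parts quoted in this thread; the remaining query
# parameters are elided in the original report.
url = "https://books.google.es/books?id=XZd2DwAAQBAJ&pg=PA283"


def strip_up_to_first_equals(url):
    # Same substitution as the linked line: lazily match from the start of
    # the string up to and including the first '=' and drop it.
    return re.sub(r'^.*?=', '', url, count=1)


truncated = strip_up_to_first_equals(url)
print(truncated)  # XZd2DwAAQBAJ&pg=PA283
```

Because the URL has no redirect prefix, the first `=` it contains is the one in `?id=`, so the scheme, host and path are what get removed.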
Google search results HTML used to have a \url=.... prefix before the actual link, so this regex removes everything from the start of the string up to and including the first = character.
What happens when you search anything other than that query?
Btw, I just checked the script and it seems to be missing many results, specifically Google Books citation results. Also, the Google search HTML structure seems to have changed, which causes the scraping to miss results.
Would you be interested in doing a PR?
As The Zen of Python states
Explicit is better than implicit.
I'd rather re.sub the whole \url= than assume it will always be there. Google tends to change things, and I'd rather have a line that does one thing and does it right than one with side effects (like missing Google Books links).
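A rough sketch of what "explicit" could look like here. The function name `strip_redirect_prefix` and the default prefix `"/url="` are assumptions for illustration; the exact prefix Google emits is not confirmed in this thread, so it is left as a parameter.

```python
import re


def strip_redirect_prefix(url, prefix="/url="):
    """Remove `prefix` only if `url` actually starts with it.

    `prefix` is an assumed placeholder value; pass whatever prefix the
    scraped HTML really uses.
    """
    # Anchor at the start and match the literal prefix, nothing else.
    return re.sub('^' + re.escape(prefix), '', url, count=1)


# A prefixed link loses only the prefix.
print(strip_redirect_prefix("/url=https://example.com/page?x=1"))
# A link without the prefix (such as a Google Books result) is left untouched.
print(strip_redirect_prefix("https://books.google.es/books?id=XZd2DwAAQBAJ&pg=PA283"))
```

The point of matching the literal prefix is that a URL that does not carry it passes through unchanged, instead of being cut at its first `=`.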
In other searches, they usually have at least one (non google) result, so my script assumed everything was ok and kept running normally.
And finally, yes, I would be interested in doing some PR, as I intend to keep using this library. If you are interested and Ok with it I would like to do 4 different PRs.
- Fix Google Books (and possibly other) links not being properly parsed.
- Replace regex HTML scraping with BeautifulSoup.
- Properly identify and raise an exception if Google blocks you.
- Add recursive search to ensure the desired number of results is found (I found that sometimes the search fails for no apparent reason and succeeds after retrying; this might be fixed by the BS replacement).
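To make the last item concrete, here is one possible shape for the retry idea. It is only a sketch: it uses a plain loop rather than recursion, and `search_with_retry`, `search_fn`, `max_attempts` and `delay` are hypothetical names, not part of gsearch. The actual search function is passed in, so nothing here assumes the library's real signature.

```python
import time


def search_with_retry(search_fn, query, num_results=10, max_attempts=3, delay=1.0):
    """Call `search_fn(query, num_results)` until enough results come back.

    `search_fn` is any callable returning a list of results (hypothetical
    interface). The largest result set seen so far is kept, so a later
    failed attempt cannot discard an earlier partial success.
    """
    best = []
    for attempt in range(max_attempts):
        results = search_fn(query, num_results)
        if len(results) > len(best):
            best = results
        if len(best) >= num_results:
            break
        if attempt < max_attempts - 1:
            time.sleep(delay)  # brief pause before trying again
    return best[:num_results]
```

Capping the number of attempts matters: if Google is actively blocking the client, retrying forever would only make things worse, which is where the separate "raise an exception when blocked" item comes in.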
Hi @aviaryan, looking through the README I noticed that you state that gsearch works:
[...] without any external dependencies.
In that case, should I discard the PR idea of replacing regex with BS4?
@EndermanAPM Yes, that would be good. In fact, I don't think BS4 adds any real value. Structured parsing can be achieved using the inbuilt HTMLParser as well (#2). And doing that would be welcome. 😄
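For reference, a minimal sketch of dependency-free link extraction with the standard library parser. It assumes Python 3 (`html.parser`), and `LinkCollector` and `extract_links` are illustrative names, not existing gsearch code. It collects every `href` on an `<a>` tag and does not try to tell search results apart from other links on the page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # `attrs` is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<div><a href="https://example.com/a">A</a><a>no href</a></div>'
print(extract_links(sample))  # ['https://example.com/a']
```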
Looking forward to any PRs.