
searchOptions do not appear to be applied

Open lookatitude opened this issue 3 years ago • 8 comments

Hi guys,

I may be doing something wrong here, but I set the options and they don't seem to take effect. Here's my code:

func searchGoogle(query string) ([]googlesearch.Result, error) {
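	// ctx is a package-level variable set to context.Background() at the top of the file.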
	googlesearch.RateLimit = rate.NewLimiter(6, 10)
	opts := googlesearch.SearchOptions{
		CountryCode: "us",
		LanguageCode: "en",
		Limit: 10000,
		OverLimit: true,
		Start: 100,
	}

	log.Println("Start processing...")
	results, err := googlesearch.Search(ctx, query, opts)
	if err != nil {
		log.Printf("something went wrong: %v", err)
		return nil, err
	}

	return results, nil
}

Are the options not being loaded, or am I doing something wrong? I have the ctx variable set to context.Background at the beginning of the file. Why do I say the options aren't being applied? I get only 100 results no matter what options I set, it always starts at rank 1, and even if I change the CountryCode to "pt" I always get the exact same results.

Intent: given a certain search string, I want to grab the first 10K results. I will set up the proxy at some point; at the moment I can't get more than 100 results.

lookatitude · Oct 04 '22

This package scrapes Google's website. It doesn't use an API. If the website doesn't list 10,000 results, you can't get 10,000 results.

The other option is to use the Google API, but that costs money. It may or may not support grabbing 10,000 results.

Also, the rank is assigned manually by the package and always starts from 1.

The CountryCode is an option for the website, but I believe a few years ago Google announced they were going to determine your locale purely from your IP address (I'm not 100% sure).

pjebs · Oct 04 '22

Have a look at the code to get a better understanding of how it works. It's quite a simple code base.

pjebs · Oct 04 '22

Hi, I understand that it is not using the API but scraping the website; I just thought it would scrape one page after another until the number of results reached the limit.

My bad for not realizing that before. I'll have a look at the code as pjebs suggested; sorry for the misunderstanding on my part.

lookatitude · Oct 05 '22

Ah, that's interesting. Yes, it's certainly possible to go to the next page and so on and scrape it.

pjebs · Oct 05 '22

I've tested it using the start value to change pages and it works well; it captures 100 records per page. So here's my thought process:

1 - On the first search, try to collect the number of pages available and calculate how many calls we can make, either until there are no more pages or we reach the limit.
2 - Receive a list of proxy IPs.
3 - Using wait groups in Go or another technique, make the requests through the available proxies, or through 1 machine in case no proxy IPs are provided.
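
For step 3, here's a minimal sketch of the fan-out idea, assuming the page count has already been worked out in step 1 and leaving the proxy rotation of step 2 aside; searchPages is a hypothetical helper, and the Start/Limit values just mirror the 100-results-per-page behaviour described above:

import (
	"context"
	"log"
	"sync"

	googlesearch "github.com/rocketlaunchr/google-search"
)

// searchPages fetches pageCount result pages concurrently by varying Start.
// The page count is assumed to come from step 1; proxy rotation (step 2) is omitted.
func searchPages(ctx context.Context, query string, pageCount int) []googlesearch.Result {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []googlesearch.Result
	)
	for page := 0; page < pageCount; page++ {
		wg.Add(1)
		go func(start int) {
			defer wg.Done()
			opts := googlesearch.SearchOptions{Start: start, Limit: 100, OverLimit: true}
			res, err := googlesearch.Search(ctx, query, opts)
			if err != nil {
				log.Printf("page starting at %d failed: %v", start, err)
				return
			}
			mu.Lock()
			results = append(results, res...)
			mu.Unlock()
		}(page * 100)
	}
	wg.Wait()
	return results
}

The package-level RateLimit set earlier is shared by all of these goroutines, so they should still be throttled collectively rather than hammering Google.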

I can do steps 2 and 3 outside this library and use the current version to set up the individual searches and collect the results; the part that is more complicated is step 1, getting the number of pages available. Is this something interesting for you guys to add to this library, or is it something I should do on my own outside of this package?

I don't mind trying to implement something like this and contributing if you have a use for it in this package. :)

lookatitude · Oct 05 '22

It needs to be done using Colly. That is Colly's job. You just need to reliably point Colly to the page numbers at the bottom and extract their hrefs.
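
Not the package's current code, just a sketch of that idea with Colly; the a#pnnext selector for Google's "Next" link is an assumption about the markup and may need adjusting:

import (
	"log"

	"github.com/gocolly/colly/v2"
)

// findNextPage visits a results page and logs the URL of the "Next" link.
// The a#pnnext selector is an assumption about Google's current markup.
func findNextPage(searchURL string) {
	c := colly.NewCollector()
	c.OnHTML("a#pnnext", func(e *colly.HTMLElement) {
		next := e.Request.AbsoluteURL(e.Attr("href"))
		log.Println("next page:", next)
		// e.Request.Visit(next) // uncomment to keep following the pagination
	})
	if err := c.Visit(searchURL); err != nil {
		log.Println("visit failed:", err)
	}
}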

pjebs · Oct 05 '22

Yeah, I realise that, but the search function is already doing the request; my question is more whether it is OK to do that in the pkg, given that it changes the search function's response signature. Maybe a struct with the number of pages available and an array or slice of results as there is now.

The same goes for the proxies, since right now the package only counts on 1 proxy per request. To scrape 100 pages, either it will take a lot of time (provided that you give it a rate limit), or you apply the rate limit per proxy and use multiple proxies.

As I said before, for both of these options either we add the logic to this pkg, or I need to get the first page and, using Colly, get the max page number (by getting the URL and extracting the start value from that URL). Should I add that logic to this pkg and submit a pull request, keeping in mind that the response struct changes, or create that logic on my side and use the pkg as it is?
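
For the "extracting the start value from that URL" part, a small sketch using the standard net/url package (nextStart is just an illustrative helper):

import (
	"net/url"
	"strconv"
)

// nextStart extracts the start query parameter from a next-page URL,
// e.g. "https://www.google.com/search?q=example&start=100" returns 100.
func nextStart(nextPageURL string) (int, error) {
	u, err := url.Parse(nextPageURL)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(u.Query().Get("start"))
}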

lookatitude · Oct 05 '22

The search function only makes 1 request. The signature doesn't have to change since Colly can be configured to follow the links at the bottom.

Perhaps another field in the Options called "followLinks bool" will control Colly's operation.
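
A rough sketch of how that option could be wired in (hypothetical, not the package's actual code), assuming the same next-page selector as above:

import "github.com/gocolly/colly/v2"

// Hypothetical wiring: the proposed option simply gates whether Colly
// follows the next-page link. registerPagination is not a real package function.
func registerPagination(c *colly.Collector, followLinks bool) {
	c.OnHTML("a#pnnext", func(e *colly.HTMLElement) {
		if followLinks {
			e.Request.Visit(e.Attr("href"))
		}
	})
}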

With regards to the proxies, we can think about that later after Step 1 is solved.

I would like to add the logic to this package.

pjebs · Oct 05 '22

@pjebs I've made a pull request adding multi-page search. I renamed a few things, mainly the url function to buildUrl, as I'm using the "net/url" package to parse the URL of the next page and get the new start value. I also added all the calls to a queue so I could scrape one page after the other; let me know what you think.

lookatitude · Nov 05 '22