GoogleScraper

Proxy allocation to Selenium instances

Open · TheFifthFreedom opened this issue 10 years ago · 1 comment

I've noticed that in the loop that creates ScrapeWorkerFactory objects in core.py, there is a loop over every proxy in the given proxy file (if one chooses to use one), which ends up creating far more browser instances than the limit you set with num_workers in your config:

        # Let the games begin
        if method in ('selenium', 'http'):

            # Show the progress of the scraping
            q = queue.Queue()
            progress_thread = ShowProgressQueue(q, len(scrape_jobs))
            progress_thread.start()

            workers = queue.Queue()
            num_worker = 0
            for search_engine in search_engines:

                for proxy in proxies:

                    for worker in range(num_workers):
                        num_worker += 1
                        workers.put(
                            ScrapeWorkerFactory(
                                mode=method,
                                proxy=proxy,
                                search_engine=search_engine,
                                session=session,
                                db_lock=db_lock,
                                cache_lock=cache_lock,
                                scraper_search=scraper_search,
                                captcha_lock=captcha_lock,
                                progress_queue=q,
                                browser_num=num_worker
                            )
                        )
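To put numbers on it (illustrative figures only, not taken from any real config), the nested loops above create one browser per (search engine, proxy, worker) combination:

    # purely illustrative counts, not from an actual GoogleScraper config
    n_search_engines = 3
    n_proxies = 100            # e.g. a proxy file with 100 entries
    num_workers = 5            # the limit you actually wanted
    print(n_search_engines * n_proxies * num_workers)   # 1500 browser instances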

Not only is this not the behavior we want, it might end up crashing your machine if you have a set of, say, 100 proxies. I believe one solution to this problem would be to remove the proxy loop entirely and pick a proxy on each pass through the num_workers loop:

        # Let the games begin
        if method in ('selenium', 'http'):

            # Show the progress of the scraping
            q = queue.Queue()
            progress_thread = ShowProgressQueue(q, len(scrape_jobs))
            progress_thread.start()

            workers = queue.Queue()
            num_worker = 0
            for search_engine in search_engines:

                for worker in range(num_workers):
                    num_worker += 1
                    proxy_to_use = proxies[worker % len(proxies)]
                    workers.put(
                        ScrapeWorkerFactory(
                            mode=method,
                            proxy=proxy_to_use,
                            search_engine=search_engine,
                            session=session,
                            db_lock=db_lock,
                            cache_lock=cache_lock,
                            scraper_search=scraper_search,
                            captcha_lock=captcha_lock,
                            progress_queue=q,
                            browser_num=num_worker
                        )
                    )
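The worker % len(proxies) expression is a plain round-robin assignment: every worker still gets a proxy, but the total number of browsers stays at len(search_engines) * num_workers. Just as a sketch (the proxy values below are placeholders, and the rotation would continue across search engines rather than restarting per engine), the same idea can be written with itertools.cycle:

    import itertools

    proxies = ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080']   # placeholder proxies
    num_workers = 5

    proxy_pool = itertools.cycle(proxies)        # repeats the proxy list endlessly, in order

    for worker in range(num_workers):
        proxy_to_use = next(proxy_pool)          # same round-robin pick as worker % len(proxies)
        print(worker, proxy_to_use)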

What do you think, @NikolaiT?

TheFifthFreedom · Feb 22 '15 19:02

It's a bit of a late answer, but I've only just stuck my nose into this project.

I also don't really understand how @NikolaiT intended the num_workers setting to work; the threads are not throttled at all in Selenium mode. I reworked the loop and the code that follows it a bit, so that the threads no longer open all the browser windows at the same time, but only as many at once as specified by num_workers.

            num_worker = 0
            for search_engine in search_engines:

                for proxy in proxies:

                    # for worker in range(num_workers):

                    num_worker += 1
                    workers.put(
                        ScrapeWorkerFactory(
                            config,
                            cache_manager=cache_manager,
                            mode=method,
                            proxy=proxy,
                            search_engine=search_engine,
                            session=session,
                            db_lock=db_lock,
                            cache_lock=cache_lock,
                            scraper_search=scraper_search,
                            captcha_lock=captcha_lock,
                            progress_queue=q,
                            browser_num=num_worker
                        )
                    )

            # here we look for suitable workers
            # for all jobs created.
            for job in scrape_jobs:
                while True:
                    worker = workers.get()
                    if worker.is_suitabe(job):
                        worker.add_job(job)
                        workers.put(worker)
                        break

            threads = []

            while not workers.empty():
                worker = workers.get()
                thread = worker.get_worker()
                if thread:
                    threads.append(thread)

            # this is the old code:
            # for t in threads:
            #     t.join()
            # changed to the following:

            num_thread = 0

            while num_thread < len(threads):

                # start at most num_workers threads at a time ...
                for t in threads[num_thread:num_thread + num_workers]:
                    t.start()

                # ... and wait for the whole batch to finish before starting the next one
                for t in threads[num_thread:num_thread + num_workers]:
                    t.join()

                num_thread += num_workers

            # after threads are done, stop the progress queue.
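One drawback of starting and joining threads in fixed batches is that a single slow browser holds up the entire next batch. A common alternative, sketched below with placeholder values (WORKER_LIMIT and the dummy scrape function are not part of GoogleScraper), is to start all threads up front but gate their bodies with a BoundedSemaphore, so at most num_workers browsers are active at any moment and a new one starts as soon as a slot frees up:

    import threading
    import time

    WORKER_LIMIT = 5                       # hypothetical stand-in for num_workers
    slots = threading.BoundedSemaphore(WORKER_LIMIT)

    def scrape(job_id):
        # placeholder for whatever a real scrape worker thread does
        with slots:                        # at most WORKER_LIMIT bodies run concurrently
            time.sleep(0.1)
            print('job', job_id, 'done')

    threads = [threading.Thread(target=scrape, args=(i,)) for i in range(20)]
    for t in threads:
        t.start()                          # all threads start, but only 5 proceed at once
    for t in threads:
        t.join()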

It's already working much better in my opinion.

fassn · Apr 14 '17 21:04