The GitHub crawler does not seem to wait between each query to the GitHub API (API rate limit exceeded)
Summary
When crawling a GitHub organisation that has a lot of repositories, the API rate limit is exceeded at the very beginning of the crawl.
Type of Issue
It is a :
- [X] bug
- [X] request
- [ ] question regarding the documentation
Motivation
I am trying to crawl a GitHub organisation (https://github.com/python), but unfortunately, at a very early stage of the crawl, the github-crawler-starter-2.0.1-exec.jar throws this error:
2022-04-27 11:35:00.382 ERROR 23962 --- [ main] ication$$EnhancerBySpringCGLIB$$1b1fa732 : problem while running github crawler
com.fasterxml.jackson.module.kotlin.MissingKotlinParameterException: Instantiation of [simple type, class com.societegenerale.githubcrawler.model.SearchResult] value failed for JSON property items due to missing (therefore NULL) value for creator parameter items which is a non-nullable type
at [Source: (String)"{"message":"API rate limit exceeded for user ID [USERID].","documentation_url":"https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}"; line: 1, column: 158] (through reference chain: com.societegenerale.githubcrawler.model.SearchResult["items"])
[...]
or
com.fasterxml.jackson.module.kotlin.MissingKotlinParameterException: Instantiation of [simple type, class com.societegenerale.githubcrawler.model.SearchResult] value failed for JSON property items due to missing (therefore NULL) value for creator parameter items which is a non-nullable type
at [Source: (String)"{
"documentation_url": "https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits",
"message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again."
}
"; line: 4, column: 1] (through reference chain: com.societegenerale.githubcrawler.model.SearchResult["items"])
at com.fasterxml.jackson.module.kotlin.KotlinValueInstantiator.createFromObjectWith(KotlinValueInstantiator.kt:116) ~[jackson-module-kotlin-2.12.6.jar!/:2.12.6]
at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:202) ~[jackson-databind-2.12.6.jar!/:2.12.6]
at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:520) ~[jackson-databind-2.12.6.jar!/:2.12.6]
[...]
Both errors are of course caused by the rate limit: the response body does not contain the expected `items` payload, so deserialization fails. Unfortunately, the application terminates right there.
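As a sketch of how the crawler could fail more gracefully here (hypothetical code, not the project's actual implementation): the rate-limit error bodies shown above can be recognized before attempting to deserialize them into `SearchResult`, so the crawler could back off and retry instead of crashing on the missing `items` field:

```java
// Hypothetical guard (illustrative only): detect GitHub's rate-limit
// error body before handing it to Jackson. Both the primary and the
// secondary rate-limit responses carry a "message" mentioning the rate
// limit plus a "documentation_url" field.
public class RateLimitGuard {
    public static boolean isRateLimited(String responseBody) {
        return responseBody.contains("rate limit")
            && responseBody.contains("documentation_url");
    }
}
```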
Current Behavior
The rate limit is always exceeded, regardless of whether crawl-in-parallel is set to true or false.
Expected Behavior
There should be a default setting that respects the GitHub API by waiting (e.g. 10 seconds) between queries, to avoid getting banned. The user should also be able to change the delay between queries in the config file.
I hope this is still somehow possible in the current release. If I missed it, could you please let me know what I have to do to respect the GitHub API waiting time?
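The requested behaviour could look roughly like this (a minimal Java sketch; the `ThrottledFetcher` name, the wrapper structure, and any config key feeding it are assumptions, not part of the crawler's actual API):

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical sketch of the requested throttling: a wrapper that
// enforces a configurable pause before each GitHub API call, so
// consecutive requests are spaced out instead of fired back-to-back.
public class ThrottledFetcher {
    private final Duration delayBetweenCalls;

    public ThrottledFetcher(Duration delayBetweenCalls) {
        this.delayBetweenCalls = delayBetweenCalls;
    }

    // Sleeps for the configured delay, then performs the call, so two
    // consecutive fetches are separated by at least delayBetweenCalls.
    public <T> T fetch(Supplier<T> apiCall) throws InterruptedException {
        Thread.sleep(delayBetweenCalls.toMillis());
        return apiCall.get();
    }
}
```

A config-file entry (some hypothetical `delay-between-api-calls` property, defaulting to 10 seconds) could then supply the `Duration` passed to the constructor.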
Please do not hesitate to contact me if you need more information.
Duplicate of https://github.com/societe-generale/github-crawler/issues/58
The manual workaround is to run the crawler in debug mode and put a conditional breakpoint (triggered when it processes the repository just before hitting the throttle limit), wait long enough, then release the breakpoint.
I don't think I will have time to work on this any time soon... would you be interested in contributing? I would be able to support you a bit.