
Connector Config CI workflow

Open dovahcrow opened this issue 5 years ago • 6 comments

Currently, Connector Configs are stored in a GitHub repo. This means that every time a user instantiates a connector, a request is issued to GitHub to check whether the config file has been updated. Usually this is not a problem. However, GitHub also imposes a rate limit of 60 requests per hour for unauthenticated clients, so a user who uses the connector extensively will be rate-limited by GitHub.
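To illustrate, the current behaviour is roughly the following sketch; the repo path and file name below are assumptions for illustration, not DataPrep's actual internals:

```python
# Hypothetical sketch of the current behaviour: every connector
# instantiation asks GitHub whether the config changed, so each call
# counts against the unauthenticated API limit of 60 requests/hour.
# The repo path and file name are assumptions for illustration.
def config_api_url(website: str, branch: str = "master") -> str:
    return (
        "https://api.github.com/repos/sfu-db/DataConnectorConfigs"
        f"/contents/{website}/_meta.json?ref={branch}"
    )

# Fetching this URL (e.g. with urllib.request.urlopen) once per
# instantiation is what exhausts the rate limit.
```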

I suggest putting our configuration files to S3.

Secondly, our current configuration files live on the master branch, and every connector fetches its config file from the master branch too. This means that if we make an incompatible change to the connector, the config files in the configs repo must change as well. However, we cannot change the configs repo, because users running the released DataPrep (including our CI bots) would then get an error.

I suggest we use the same workflow on the config file repo as the DataPrep repo.

Moreover, we do have a version number for exactly this conflict-resolution purpose. But: 1. The connector is not released yet, so I would rather introduce breaking changes than bump the config version; bumping the version would bloat the DataPrep codebase with backward-compatibility code. 2. Introducing one more buffer layer is always a good thing in case someone screws up a PR. The downside of an additional buffer layer is performance, but I don't see a performance issue here.

The suggested new workflow is:

  1. Developers make changes to the develop branch on the config file repo.
  2. Upon releasing, the bot merges the develop branch into the master branch and uploads the config file content to S3.
  3. Every time a user instantiates a connector, it contacts S3 to get the catalog of supported websites and their configuration files.
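Step 3 could be sketched as follows; the bucket URL and the catalog's JSON layout are assumptions for illustration:

```python
import json

# Sketch of step 3: the connector downloads a catalog of supported
# websites from S3 and resolves a website name to its config file key.
# The bucket URL and the catalog layout are assumptions.
CATALOG_URL = "https://dataprep-configs.s3.amazonaws.com/catalog.json"

def parse_catalog(raw: str) -> dict:
    """Map website name -> config file key from the catalog JSON."""
    catalog = json.loads(raw)
    return {e["website"]: e["config_key"] for e in catalog["websites"]}
```

For example, a catalog like `{"websites": [{"website": "dblp", "config_key": "dblp/_meta.json"}]}` would resolve `dblp` to `dblp/_meta.json`.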

dovahcrow avatar Sep 24 '20 22:09 dovahcrow

@jnwang @peiwangdb @yxie66 @liuyejia What do you think about the problem and the solution?

dovahcrow avatar Sep 24 '20 22:09 dovahcrow

However, GitHub also imposes a rate limit of 60 requests per hour for unauthenticated clients, so a user who uses the connector extensively will be rate-limited by GitHub.

S3 is a good idea. Whether it is necessary depends on how often this rate-limit issue will actually happen. BTW, should we prioritize a schema check for config file testing?

peiwangdb avatar Sep 24 '20 22:09 peiwangdb

BTW, should we prioritize a schema check for config file testing?

Yes, I think so. Currently none of the configurations are tested. However, we need to come up with a solution for storing the API keys these tests would need.
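A schema check could start very small, e.g. a stdlib-only sanity check run in CI. The required field names below are assumptions for illustration; a real job might use a JSON Schema validator instead:

```python
# Minimal sketch of a config schema check; the required top-level
# fields are assumptions for illustration, not DataPrep's real schema.
REQUIRED_FIELDS = {"version", "request", "response"}

def missing_fields(config: dict) -> list:
    """Return the required top-level fields absent from a config."""
    return sorted(REQUIRED_FIELDS - config.keys())
```

CI would fail a PR whenever `missing_fields` returns a non-empty list for any config file in the repo.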

dovahcrow avatar Sep 24 '20 23:09 dovahcrow

I wonder whether the S3 free tier will be sufficient for storing the configuration files?

yxie66 avatar Sep 25 '20 01:09 yxie66

I wonder whether the S3 free tier will be sufficient for storing the configuration files?

1 GB of storage plus 20,000 GET requests per month costs about 0.03 USD. I think that is cheap enough for us to store the configuration files. Do you have other free suggestions?
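For reference, the arithmetic, using AWS's approximate public prices at the time (~0.023 USD per GB-month of standard storage and ~0.0004 USD per 1,000 GET requests; exact prices vary by region):

```python
# Back-of-the-envelope S3 cost estimate; prices are approximate
# AWS public list prices and vary by region.
storage_cost = 1 * 0.023               # 1 GB-month of standard storage
get_cost = (20_000 / 1_000) * 0.0004   # 20,000 GET requests
monthly_cost = storage_cost + get_cost # roughly 0.03 USD per month
```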

dovahcrow avatar Sep 25 '20 02:09 dovahcrow

Can we simplify the workflow by removing the dependency on S3?

  1. Developers make changes to the develop branch on the config file repo.
  2. Upon releasing, the bot merges the develop branch into the master branch ~~and uploads the config file content to S3~~.
  3. Every time a user instantiates a connector, it contacts GitHub to get the catalog of supported websites and their configuration files.

For the rate-limit issue, maybe we shouldn't download the configuration file again and again if it already exists locally. We can add a force-update argument for users who want to overwrite their local configuration file.
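The caching idea above could look roughly like this; the cache directory and the `fetch` callback are assumptions for illustration, not DataPrep's real API:

```python
from pathlib import Path

# Sketch of the proposed caching: download the config only when it is
# missing locally, unless force_update=True overwrites the local copy.
# The cache layout and the fetch callback are illustrative assumptions.
def load_config(website: str, fetch, cache_dir: Path,
                force_update: bool = False) -> str:
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / f"{website}.json"
    if force_update or not local.exists():
        local.write_text(fetch(website))  # at most one remote request
    return local.read_text()
```

With this, repeated instantiations of the same connector cost zero GitHub requests after the first download, which sidesteps the rate limit for the common case.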

jnwang avatar Sep 25 '20 05:09 jnwang