gemini
Advanced similarity and duplicate source code at scale.
I've created this PR for #219. This is **not** supposed to be merged but in this way, you can comment and also answer some questions in this report. After the...
I installed gemini following the docker-compose documentation, then ran: `> docker-compose exec gemini ./hash /repositories` It results in: ```Using spark-submit from /usr/local/spark Running Hashing as Apache Spark job, master:...
From team and sync meetings: > Gemini works, but if one day we want to turn it into a product and start including new big features, it needs previous work....
Currently some of them are printed to stdout, so `report --output-format=json > file.json` produces broken JSON (the first line has to be removed manually).
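Until stdout is clean, a saved report can be post-processed instead of edited by hand. A minimal sketch, assuming the stray lines come before the JSON document and that the JSON itself starts on its own line with `{` or `[`; the `extract_json` helper is hypothetical and not part of gemini:

```python
import json


def extract_json(text: str):
    """Return the first JSON document that starts at the beginning of a line.

    Assumption: any stray (non-JSON) output precedes the JSON document.
    """
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        # Only lines that look like the start of an object or array are candidates.
        if line.lstrip().startswith(("{", "[")):
            try:
                return json.loads("".join(lines[i:]))
            except json.JSONDecodeError:
                # For example a log line such as "[INFO] ..."; keep scanning.
                continue
    raise ValueError("no JSON document found")
```

For example, `extract_json(open("file.json").read())` would parse a report saved with the redirect above, skipping the leading line.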
workaround is here https://github.com/smacker/gemini/commit/dcaebc295ff490d2800ef80af07a29925201a673
From what I remember:
- `hashFeaturesRDD` function is never used
- `doc-freq-file` was deprecated
- remove `scalaJsonParser`

There might be some other unused/deprecated stuff.
jgit-spark-connector is deprecated and has some issues that won't be fixed, either by design or because it's deprecated. gitbase-spark-connector/gitbase-spark-connector-enterprise should solve many of gemini's current issues.
Some dependencies are very outdated. We need to update them to bring their fixes and improvements into gemini and to make it easier to add new dependencies when needed. As an example...
Every time I come back to gemini after a long pause, it's broken. Weekly CI should help to discover problems faster and include issues...
The full test suite takes A LOT of time because it runs hash on Spark multiple times. We have tests that don't depend on hashing or external services; they are [tagged...