dataflow-java

Scrub binary files from git history

Open jiridanek opened this issue 10 years ago • 2 comments

Before: Receiving objects: 100% (6743/6743), 121.52 MiB | 210.00 KiB/s, done.
After: Receiving objects: 100% (6421/6421), 36.37 MiB | 210.00 KiB/s, done.

This change has to be force-pushed. Merging does not do the trick, because a merge keeps the old commits, and the binaries in them, reachable. I am including the exact commands I ran to do this. It might be best if you just run the commands yourself.

Fixes #101

List all files ever in the repository

# https://git-scm.com/docs/git-log
# http://stackoverflow.com/a/13547351/1047788
git log --name-only --pretty=format: | sort | uniq

List all deleted files ever in the repository

# http://stackoverflow.com/a/21871377/1047788
git log --name-only --diff-filter=D --pretty=format: | sort | uniq

Get changelog

git log --name-status > changelog.txt

Decide what to scrub

# http://www.tldp.org/LDP/abs/html/here-docs.html
cat << EOF > filenamestoscrub.txt
contigs.fasta
google-genomics-dataflow.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141027/dataflow-sdk-1.0.141027.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141027/dataflow-sdk-1.0.141027.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141120/dataflow-sdk-1.0.141120.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141120/dataflow-sdk-1.0.141120-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141120/dataflow-sdk-1.0.141120.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141120/dataflow-sdk-1.0.141120-sources.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141206/dataflow-sdk-1.0.141206.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141206/dataflow-sdk-1.0.141206-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141206/dataflow-sdk-1.0.141206.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141206/dataflow-sdk-1.0.141206-sources.jar
jars/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml
jars/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml.md5
jars/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml.sha1
jars/org/broadinstitute/sting/gatk/gatk/3.1-1/gatk-3.1-1.jar
jars/org/sf/picard/picard/1.115/picard-1.115.jar
lib/bwa-0.7.9a/bamlite.c
lib/bwa-0.7.9a/bamlite.h
lib/bwa-0.7.9a/bntseq.c
lib/bwa-0.7.9a/bntseq.h
lib/bwa-0.7.9a/bwa.1
lib/bwa-0.7.9a/bwa.c
lib/bwa-0.7.9a/bwa.h
lib/bwa-0.7.9a/bwa-helper.js
lib/bwa-0.7.9a/bwamem.c
lib/bwa-0.7.9a/bwamem_extra.c
lib/bwa-0.7.9a/bwamem.h
lib/bwa-0.7.9a/bwamem_pair.c
lib/bwa-0.7.9a/bwape.c
lib/bwa-0.7.9a/bwase.c
lib/bwa-0.7.9a/bwase.h
lib/bwa-0.7.9a/bwaseqio.c
lib/bwa-0.7.9a/bwtaln.c
lib/bwa-0.7.9a/bwtaln.h
lib/bwa-0.7.9a/bwt.c
lib/bwa-0.7.9a/bwtgap.c
lib/bwa-0.7.9a/bwtgap.h
lib/bwa-0.7.9a/bwt_gen.c
lib/bwa-0.7.9a/bwt.h
lib/bwa-0.7.9a/bwtindex.c
lib/bwa-0.7.9a/bwt_lite.c
lib/bwa-0.7.9a/bwt_lite.h
lib/bwa-0.7.9a/bwtsw2_aux.c
lib/bwa-0.7.9a/bwtsw2_chain.c
lib/bwa-0.7.9a/bwtsw2_core.c
lib/bwa-0.7.9a/bwtsw2.h
lib/bwa-0.7.9a/bwtsw2_main.c
lib/bwa-0.7.9a/bwtsw2_pair.c
lib/bwa-0.7.9a/ChangeLog
lib/bwa-0.7.9a/COPYING
lib/bwa-0.7.9a/example.c
lib/bwa-0.7.9a/fastmap.c
lib/bwa-0.7.9a/is.c
lib/bwa-0.7.9a/kbtree.h
lib/bwa-0.7.9a/khash.h
lib/bwa-0.7.9a/kopen.c
lib/bwa-0.7.9a/kseq.h
lib/bwa-0.7.9a/ksort.h
lib/bwa-0.7.9a/kstring.c
lib/bwa-0.7.9a/kstring.h
lib/bwa-0.7.9a/ksw.c
lib/bwa-0.7.9a/ksw.h
lib/bwa-0.7.9a/kthread.c
lib/bwa-0.7.9a/kvec.h
lib/bwa-0.7.9a/main.c
lib/bwa-0.7.9a/Makefile
lib/bwa-0.7.9a/malloc_wrap.c
lib/bwa-0.7.9a/malloc_wrap.h
lib/bwa-0.7.9a/NEWS.md
lib/bwa-0.7.9a/pemerge.c
lib/bwa-0.7.9a/QSufSort.c
lib/bwa-0.7.9a/QSufSort.h
lib/bwa-0.7.9a/qualfa2fq.pl
lib/bwa-0.7.9a/README.md
lib/bwa-0.7.9a/utils.c
lib/bwa-0.7.9a/utils.h
lib/bwa-0.7.9a/xa2multi.pl
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar.md5
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar.sha1
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar.md5
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar.sha1
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom.md5
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom.sha1
lib/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml
lib/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml.md5
lib/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml.sha1
lib/org/broadinstitute/sting/gatk/gatk/3.1-1/gatk-3.1-1.jar
lib/org/sf/picard/picard/1.115/picard-1.115.jar
README.md~
EOF
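
If it helps when deciding what belongs on such a list, one way is to rank every blob in history by size. This is only a sketch: it assumes a git recent enough that `git cat-file --batch-check` accepts a format string, it has to be run inside the clone, and the function name `largest_blobs` and the default of 20 entries are made up for illustration.

```shell
# Hypothetical helper (not part of the original steps): print the N largest
# blobs reachable from any ref as "<size in bytes> <path>", biggest first.
largest_blobs() {
    n="${1:-20}"
    # rev-list prints "<sha> <path>"; cat-file resolves the sha and passes
    # the path through unchanged as %(rest).
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -k1,1nr |
        head -n "$n"
}
```

Running `largest_blobs 50` and comparing the output with changelog.txt shows which paths account for most of the clone size.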

Scrub the files from history

# DO NOT DO THIS: it runs one full history rewrite per file
# http://stackoverflow.com/a/1521498/1047788
while read -r filename; do
    # https://help.github.com/articles/remove-sensitive-data/
    git filter-branch --force --index-filter \
    "git rm --cached --ignore-unmatch $filename" \
    --prune-empty --tag-name-filter cat -- --all
done < filenamestoscrub.txt

Wait for this to complete. It takes a very long time, because every iteration rewrites the entire history again, which proves that scrubbing the files one by one was a bad idea.

# DO THIS INSTEAD
# http://stackoverflow.com/a/4229151/1047788
git filter-branch --force --index-filter \
"git rm --cached --ignore-unmatch -- $(tr '\n' ' ' < filenamestoscrub.txt)" \
--prune-empty --tag-name-filter cat -- --all
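
The reason the single call works is that the outer shell expands `$(tr ...)` once, before filter-branch starts, so the index filter is one `git rm` carrying every path. A small sketch of just that expansion, using a made-up three-entry list in a temporary file rather than the real filenamestoscrub.txt:

```shell
# Hypothetical three-entry list, written to a temporary file so the real
# filenamestoscrub.txt is left alone.
list="$(mktemp)"
printf '%s\n' contigs.fasta google-genomics-dataflow.jar 'README.md~' > "$list"

# The same substitution as in the filter-branch call: newlines become spaces,
# so every path ends up on a single command line.
index_filter="git rm --cached --ignore-unmatch -- $(tr '\n' ' ' < "$list")"
printf '%s\n' "$index_filter"
rm -f "$list"
```

Because the paths are joined with plain spaces, a path that itself contained a space would be split in two. None of the entries in the list here do, so that does not matter for this repository.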

Review and push the result

mvn package

git push origin --force --all
git push origin --force --tags
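
Before force-pushing it may be worth checking that none of the listed paths survive anywhere in the rewritten history. A sketch of such a check, assuming it is run inside the rewritten clone; the function name `remaining_scrubbed_paths` is hypothetical and not something I ran at the time:

```shell
# Hypothetical helper: print every path from the scrub list that still shows
# up in the history of any ref. No output means nothing from the list remains.
remaining_scrubbed_paths() {
    list="${1:-filenamestoscrub.txt}"
    git log --all --name-only --pretty=format: |
        sort -u |
        # First input (the list): remember each path, ignoring trailing
        # whitespace and blank lines. Second input (stdin): print matches.
        awk 'NR == FNR { sub(/[ \t]+$/, ""); if ($0 != "") want[$0] = 1; next }
             $0 in want' "$list" -
}
```

`git count-objects -vH` reports the size of the local packs, but that number only shrinks once the old objects have been pruned, so a fresh clone after the push is the more reliable size check.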

Local clones

Do steps 8 and 9 from https://help.github.com/articles/remove-sensitive-data/ on each local clone you have.
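
The linked article may have been renumbered since, so this is not a quote from it. As a sketch, the cleanup usually run in the clone where filter-branch was executed looks like the following: it drops the backup refs that filter-branch leaves under refs/original, expires the reflog, and prunes the unreachable objects. The function name `drop_rewritten_backups` is made up for illustration.

```shell
# Hypothetical helper: make the pre-rewrite objects unreachable and prune them.
drop_rewritten_backups() {
    # filter-branch keeps the old branch tips under refs/original; delete them.
    git for-each-ref --format='delete %(refname)' refs/original |
        git update-ref --stdin
    # Reflog entries also keep the old commits alive, so expire them all.
    git reflog expire --expire=now --all
    # With nothing referencing the old objects, gc can prune them right away.
    git gc --quiet --prune=now
}
```

Clones that never ran filter-branch have no refs/original. They need to fetch and reset their branches onto the rewritten ones (or simply be re-cloned) before any such cleanup has an effect.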

jiridanek, Jan 18 '16 22:01

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.

googlebot, Jan 18 '16 22:01

@jirkadanek Thanks so much for these detailed instructions!!! We will make it so.

deflaux, Feb 01 '16 18:02