[rush] Feature: a standardized "git lfs" precommit hook
Summary
We have been copying around a script to a couple monorepos to check whether a file is "too big" (and so probably needs to be added to LFS).
This seems like a very common desire in a monorepo, so maybe there's room to have this either be part of rush, or a rush "recipe" (kind of like the prettier autoinstaller script).
Today we use this script by calling it from a pre-commit hook with a size limit in bytes, e.g.:

```
node common/scripts/verify-large-files-are-in-lfs.js 307200
```
Original Script
Here's the current script that we run today (with some notes below).
```js
const child_process = require('child_process');
const fs = require('fs');
const path = require('path');

if (process.argv.length < 3) {
  throw new Error('Expected the maximum allowed file size in bytes as an argument to this script');
}

const MaxFileSize = parseInt(process.argv[2], 10);
if (!Number.isInteger(MaxFileSize)) {
  throw new Error('Expected the maximum allowed file size in bytes as an argument to this script');
}

const relevantFiles = relevantFileEntries(
  child_process.execSync('git lfs status', { encoding: 'utf-8' })
);
const filesThatAreTooBig = [];
const workingDir = child_process
  .execSync('git rev-parse --show-toplevel', { encoding: 'utf-8' })
  .trim();

relevantFiles.forEach(entry => {
  const filePath = path.join(workingDir, entry.file);
  if (fs.existsSync(filePath)) {
    const fileStats = fs.statSync(filePath);
    if (fileStats.size >= MaxFileSize) {
      const ext = path.extname(filePath);
      if (ext !== '.yaml' && ext !== '.js' && ext !== '.json' && ext !== '.snap') {
        filesThatAreTooBig.push(entry.file);
      }
    }
  }
});

if (filesThatAreTooBig.length > 0) {
  console.log('\x1b[31m', `Some of the files being committed are over ${Math.floor(MaxFileSize / 1024)}KB, and must be committed using git LFS.`, '\x1b[0m');
  console.log('See https://docs.github.com/en/repositories/working-with-files/managing-large-files/configuring-git-large-file-storage for instructions');
  console.log(filesThatAreTooBig.join('\n'));
  process.exit(1);
}

// Parses the human-readable output of `git lfs status` into { file, modes, shas }
// entries for the "Objects to be committed:" section.
function relevantFileEntries(gitLfsStatus) {
  const lines = gitLfsStatus.split(/\r?\n/).filter(line => line.length > 0);
  const start = lines.indexOf('Objects to be committed:') + 1;
  const finish = lines.indexOf('Objects not staged for commit:');
  // If the "not staged" section is absent, take everything after the header
  // (slicing to -1 would silently drop the last committed entry).
  const commitLines = finish === -1 ? lines.slice(start) : lines.slice(start, finish);
  return commitLines
    .map(line => {
      const match = line.match(/\t?(.+) \((.+?): (.+?)(?: -> (.+?): (.+?))?\)/);
      return {
        file: match[1],
        modes: [match[2], match[4]],
        shas: [match[3], match[5]]
      };
    })
    .filter(isRelevant);
}

// We are only interested in files that are NOT controlled by LFS after this commit.
// Sample `git lfs status` output:
//
//   On branch eoluyomi/lfs_files
//
//   Objects to be committed:
//
//   \t.gitattributes (Git: d52ece7 -> Git: 2e8f2b5)
//   \tREADME.md (Git: 7936208 -> File: deleted)
//   \tbig.abc (Git: 6053cad -> LFS: b654c8a)
//   \timg.png (LFS: da921f4)
//   \tnewtxt (Git: 5891b5b)
//
//   Objects not staged for commit:
//
//   \tcommon/scripts/verify-large-files-are-in-lfs.js (Git: 7e129c8 -> File: abcd7a6)
//
function isRelevant(entry) {
  return entry.modes[1] === 'Git' || (entry.modes[1] === undefined && entry.modes[0] === 'Git');
}
```
Details
You'll notice we are parsing the output of `git lfs status` rather than `git lfs status --json` -- unfortunately, the "JSON" variant doesn't include critical information needed to make decisions (for example, whether a file previously tracked by Git will now be tracked by LFS).
We also have a rather awkward "exceptions" section, because no matter what file-size cap you put in place to catch an errant 1MB image in a new file format, you will eventually have some JSON file or Jest snapshot that is bigger. (This maybe spawns a whole other discussion about 300KB+ React unit test snapshots -- but as long as the diffs are text, you aren't eating 300KB every time the file changes, which is what really matters.)
Suggestions
Internally, we've had some discussions suggesting this script is checking the wrong thing -- what we really care about is that binary files are stored in LFS, regardless of their size. So what we should really be asking Git is which files are not text, not which files are large.
With that change, the logic would become: first, use `git lfs status` to determine which files are relevant to check (files that will exist after check-in and will not be in LFS). Then, use a second command to check whether Git believes any of those files are binary.
We haven't quite decided what to use for the second command. One way to check all files in the repo is `git grep -Ic ''` (which lists all non-binary files); files in the first set but not in the second would then be errors. A faster way might be `git diff --cached --numstat`, which should print dashes for the line counts of every file Git believes is binary, but we haven't completely tested it yet.
(Open to ideas on what would be the fastest possible way to check this for a list of files.)
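The `git diff --cached --numstat` idea could be sketched as a small parser: Git prints `-` for both line counts when it considers a staged file binary, so any entry whose added/deleted counts are both dashes is a candidate for LFS. This is an untested sketch, not the real hook; the helper name is hypothetical, and renames (which numstat reports with `=>` in the path) would need extra handling.

```javascript
// Sketch: extract the files `git diff --cached --numstat` reports as binary.
// Each numstat line is "<added>\t<deleted>\t<path>"; binary files show "-" for
// both counts. Helper name and sample data are hypothetical.
function binaryFilesFromNumstat(numstatOutput) {
  return numstatOutput
    .split(/\r?\n/)
    .filter(line => line.length > 0)
    .map(line => line.split('\t'))
    .filter(([added, deleted]) => added === '-' && deleted === '-')
    .map(([, , file]) => file);
}

// Canned output for illustration; in a real hook this string would come from
// child_process.execSync('git diff --cached --numstat', { encoding: 'utf-8' }).
const sample = '12\t3\tsrc/index.js\n-\t-\timg.png\n0\t0\tdocs/readme.md\n';
console.log(binaryFilesFromNumstat(sample)); // [ 'img.png' ]
```

The result would then be intersected with the "relevant" file set from `git lfs status` -- anything in both lists is a binary file about to be committed outside LFS.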
Standard questions
Please answer these questions to help us investigate your issue more quickly:
| Question | Answer |
|---|---|
| @microsoft/rush globally installed version? | n/a |
| `rushVersion` from `rush.json`? | n/a |
| `useWorkspaces` from `rush.json`? | yes |
| Operating system? | Mac |
| Would you consider contributing a PR? | Yes |
| Node.js version (`node -v`)? | 14 |
Thanks @elliot-nelson! We encountered the same problem in our monorepo and will see if we can improve the script.
Not to muddy the water, but worth a mention:
In some new corporate projects on our side, that are less TypeScript-focused and more multi-language, we might be leaning towards the standard of https://github.com/pre-commit/pre-commit with https://github.com/pre-commit/pre-commit-hooks providing a "standard library" of hooks.
Interestingly, this exact script is already part of that standard library -- the downside is that it's written in Python (as are all of the hooks). But it's worth considering whether Rush needs to provide "all of" the features that make up a given monorepo.
@elliot-nelson If we wanted such a feature to be easily usable across repos for many different language platforms, it should be coded using a "universal" programming language. What are the requirements? Something like:
- Compatible with every major operating system
- Quick to install - For example, installing Ruby on Windows is a bigger download and a lot more steps than NVM
- Self-contained - For example, Bash requires lots of companion tools that all must be installed with the right version, otherwise scripts may fail
- Easy to debug - For example, Bash has no way to set a breakpoint, and every script constantly spawns thousands of subprocesses
- Inspectable scripts - For example, Golang produces opaque binaries that cannot be inspected, diffed, or manually edited
- Professional software engineering - For example, Python has no type checker, which makes it difficult to maintain scripts with nontrivial functionality
NPM packages or ts-node scripts actually seem like a pretty good fit for these requirements, if combined with something like shelljs... 🤔