ripgrep icon indicating copy to clipboard operation
ripgrep copied to clipboard

First entry of gitignore searched if it starts with BOM

Open tvrg opened this issue 4 years ago • 3 comments

What version of ripgrep are you using?

ripgrep 13.0.0 -SIMD -AVX (compiled) +SIMD +AVX (runtime)

How did you install ripgrep?

pacman -S ripgrep

What operating system are you using ripgrep on?

Arch Linux

Describe your bug.

When a .gitignore starts with a BOM, ripgrep does not ignore the first entry of the .gitignore. git itself ignores the first entry.

What are the steps to reproduce the behavior?

$ echo "\xef\xbb\xbftest\nignoreme" > .gitignore && echo "test" > ignoreme
$ rg test

What is the actual behavior?

$ rg --debug test
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(test)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:416: glob converted to regex: Glob { glob: "**/npm-debug.log*", re: "(?-u)^(?:/?|.*/)npm\\-debug\\.log[^/]*$", opts: GlobOptions { case_insensitive: false, literal_separator: true, backslash_escape: true }, tokens: Tokens([RecursivePrefix, Literal('n'), Literal('p'), Literal('m'), Literal('-'), Literal('d'), Literal('e'), Literal('b'), Literal('u'), Literal('g'), Literal('.'), Literal('l'), Literal('o'), Literal('g'), ZeroOrMore]) }
DEBUG|globset|crates/globset/src/lib.rs:416: glob converted to regex: Glob { glob: "**/yarn-debug.log*", re: "(?-u)^(?:/?|.*/)yarn\\-debug\\.log[^/]*$", opts: GlobOptions { case_insensitive: false, literal_separator: true, backslash_escape: true }, tokens: Tokens([RecursivePrefix, Literal('y'), Literal('a'), Literal('r'), Literal('n'), Literal('-'), Literal('d'), Literal('e'), Literal('b'), Literal('u'), Literal('g'), Literal('.'), Literal('l'), Literal('o'), Literal('g'), ZeroOrMore]) }
DEBUG|globset|crates/globset/src/lib.rs:416: glob converted to regex: Glob { glob: "**/yarn-error.log*", re: "(?-u)^(?:/?|.*/)yarn\\-error\\.log[^/]*$", opts: GlobOptions { case_insensitive: false, literal_separator: true, backslash_escape: true }, tokens: Tokens([RecursivePrefix, Literal('y'), Literal('a'), Literal('r'), Literal('n'), Literal('-'), Literal('e'), Literal('r'), Literal('r'), Literal('o'), Literal('r'), Literal('.'), Literal('l'), Literal('o'), Literal('g'), ZeroOrMore]) }
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 1 literals, 6 basenames, 3 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 3 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 1 basenames, 0 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|ignore::walk|crates/ignore/src/walk.rs:1741: ignoring ./.git: Ignore(IgnoreMatch(Hidden))
DEBUG|ignore::walk|crates/ignore/src/walk.rs:1741: ignoring ./.gitignore: Ignore(IgnoreMatch(Hidden))
ignoreme
1:test
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

What is the expected behavior?

It should ignore ignoreme because git does so as well. See e.g. git status.

tvrg avatar Apr 11 '22 12:04 tvrg

I believe the description of your bug report is correct, but your reproduction isn't. rg isn't ignoring ignoreme in your particular example because of the BOM, it's not ignoring it because it doesn't respect .gitignore outside of a git repository. You either need to run git init or mv .gitignore .ignore. Here's my repro:

$ echo "\xef\xbb\xbfignoreme1\nignoreme2" > .ignore && echo "test" > ignoreme1 && echo "test" > ignoreme2
$ rg test
ignoreme1
1:test

But rg test should return no results. Instead, it only ignores ignoreme2 and not ignoreme1. If I drop the BOM from the .ignore file, then ripgrep correctly ignores both .ignoreme1 and .ignoreme2.

I suspect the fix will be to tweak this code to handle the BOM:

https://github.com/BurntSushi/ripgrep/blob/ced5b92aa93eb47e892bd2fd26ab454008721730/crates/ignore/src/gitignore.rs#L386-L408

BurntSushi avatar Apr 11 '22 12:04 BurntSushi

Oh, yes, sorry for the incomplete reproduction. I was actually running the commands in an initialized git repo.

I'll try to fix this in the next days with the help of your pointer above if nobody else is faster.

tvrg avatar Apr 11 '22 12:04 tvrg

Here is what Git does: https://github.com/git/git/blob/21306a098c3f174ad4c2a5cddb9069ee27a548b0/utf8.c#L794-L803

const char utf8_bom[] = "\357\273\277";

int skip_utf8_bom(char **text, size_t len)
{
	if (len < strlen(utf8_bom) ||
	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
		return 0;
	*text += strlen(utf8_bom);
	return 1;
}

If ignore intends to match its behavior then it should just skip this sequence in the first line.

starthal avatar Apr 17 '24 19:04 starthal