Xueguang Ma 马雪光 comments

Results: 99 comments of Xueguang Ma 马雪光

How is document expansion helpful if p_max_len=192 in unicoil training and encoding command? Most MSMARCO passages are over 192 tokens

Hi @nirmal2k, yes, you can set p_max_len to 512 and encode with it. castorini/unicoil-d2q-msmarco-passage was trained with p_max_len 192. corpus-d2q contains `original msmarco-passage text token+ [SEP] + new tokens generated...

How are GitHub releases, version tags, and release cycles handled?

Hi @xhluca, thanks for the suggestion. We do need to handle the release process better. Thanks for sharing the link, I'll take a look.

Is there a difference between query and passage encoder?

Hi @gzerveas, 1. Some dense retriever models use untied parameters, where the query and passage encoders do not share parameters, e.g. DPR. 2. Some models use tied parameters, where query and...
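
To make the tied vs. untied distinction concrete, here is a minimal toy sketch in plain Python. `ToyEncoder` and `ToyDualEncoder` are hypothetical names made up for illustration; they are not Tevatron or DPR classes, and a single float stands in for the real model weights.

```python
class ToyEncoder:
    """Stand-in for a transformer encoder; one float plays the role of its weights."""

    def __init__(self, weight=1.0):
        self.weight = weight

    def encode(self, values):
        return [self.weight * v for v in values]


class ToyDualEncoder:
    """Hypothetical dual encoder that is either tied or untied."""

    def __init__(self, tied):
        self.query_encoder = ToyEncoder()
        # Tied: the passage encoder is the very same object as the query encoder.
        # Untied: the passage encoder gets its own, independent parameters.
        self.passage_encoder = self.query_encoder if tied else ToyEncoder()


tied_model = ToyDualEncoder(tied=True)
untied_model = ToyDualEncoder(tied=False)

# Simulate a training update that touches only the query encoder.
tied_model.query_encoder.weight = 2.0
untied_model.query_encoder.weight = 2.0

tied_passage = tied_model.passage_encoder.encode([1.0, 3.0])
untied_passage = untied_model.passage_encoder.encode([1.0, 3.0])
```

In the tied model the update also changes how passages are encoded, because both sides share one set of parameters; in the untied model the passage side is unaffected.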

Is there a difference between query and passage encoder?

Hi @gzerveas, the warning message is expected, as we don't need the pooler from BERT, and the way you get the CLS embeddings should be correct. Could you double check which checkpoint...

Is there a difference between query and passage encoder?

>Are you certain that Luyu/co-condenser-marco-retriever is a checkpoint of coCondenser and not Condenser? Yes. Does the corpus you use align with the one documented here? https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco The corpus in Tevatron has...

Is there a difference between query and passage encoder?

I think they use [SEP] to separate title and text during training and encoding: https://github.com/texttron/tevatron/blob/adf5ce45612332797931569d51cc5bcd8c1ac878/src/tevatron/preprocessor/preprocessor_tsv.py#L92
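
A rough sketch of what joining title and text with a separator could look like. `build_passage` is a hypothetical helper, and the exact separator string and spacing are assumptions; the linked preprocessor is the authoritative reference.

```python
SEP = " [SEP] "  # assumed separator string; check the linked preprocessor for the real one


def build_passage(title, text, sep=SEP):
    """Join title and text with a separator; fall back to text alone if there is no title."""
    title = title.strip()
    text = text.strip()
    return f"{title}{sep}{text}" if title else text


with_title = build_passage("Paris", "Paris is the capital of France.")
without_title = build_passage("", "Paris is the capital of France.")
```

Whatever format is used, the point is that it should be the same at training time and at encoding time.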

Receiving a `JSONDecodeError` when running `tevatron.driver.encode` on WQ dataset

Hi @xhluca, sorry for the late reply. Is the issue only with `Tevatron/wikipedia-wq-corpus`, or does `Tevatron/wikipedia-nq-corpus` also fail? It seems like an issue caused by the JSON environment. ``` data...

Receiving a `JSONDecodeError` when running `tevatron.driver.encode` on WQ dataset

Could you check whether a simple jsonl file can be read in your environment? Or could you try a conda environment? My environment is Python 3.8 with conda.
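
One way to run that check, as a minimal sketch using only the standard library. The two sample records and the `read_jsonl` helper are made up for illustration and are not part of Tevatron.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse one JSON object per non-empty line, reporting the line number on failure."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: {err}") from err
    return records


# Hypothetical two-line corpus written to a temporary file.
sample = [{"docid": "0", "text": "hello"}, {"docid": "1", "text": "world"}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in sample:
        tmp.write(json.dumps(record) + "\n")
    tmp_path = tmp.name

loaded = read_jsonl(tmp_path)
os.remove(tmp_path)
print(loaded == sample)
```

If this round trip works but the dataset still fails, the problem is more likely in the downloaded data or its cache than in the Python JSON setup.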

Fix bug in reducer and add ms marco passage ranking result

If doc_id is a string (i.e. it cannot be cast to int), the entire search pipeline won't work. Casting to str is a quick fix to make search work for...

Fix bug in reducer and add ms marco passage ranking result

It shouldn't break anything... the cast doesn't seem necessary... https://github.com/texttron/tevatron/blob/3cef7da6368827d9b8cf6d6c40db380b584c1752/src/tevatron/faiss_retriever/__main__.py#L26 but there is a chance that an int gets stored as a string type in the corpus embedding file, so we have to make sure...
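
A small sketch of the idea behind casting IDs to str, assuming a lookup table built from IDs that may have been stored as either int or str. `normalize_id` and the sample IDs are hypothetical and not taken from the Tevatron code.

```python
def normalize_id(doc_id):
    """Return the ID as a string, whether it was stored as an int or a str."""
    return str(doc_id)


# Hypothetical IDs as they might come out of an embedding file: mixed types,
# including one that cannot be cast to int at all.
stored_ids = [7, "8", "doc-9"]

# Normalizing every ID to str gives one consistent key type for lookups.
lookup = {normalize_id(d): position for position, d in enumerate(stored_ids)}

# Casting to int instead would fail on the non-numeric ID.
try:
    int("doc-9")
    int_cast_works = True
except ValueError:
    int_cast_works = False
```

With string keys, `7` and `"7"` resolve to the same entry, so it does not matter which type ended up in the file.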