Xueguang Ma 马雪光
Hi @nirmal2k, yes, you can set p_max_len to 512 and encode using it. castorini/unicoil-d2q-msmarco-passage is trained with p_max_len 192. corpus-d2q contains `original msmarco-passage text token+ [SEP] + new tokens generated...
Hi @xhluca, thanks for the suggestion. We do need to handle the release process better. Thanks for sharing the link, I'll take a look.
Hi @gzerveas 1. Some dense retriever models use untied parameters, where the query and passage encoders do not share parameters, e.g. DPR. 2. Some models use tied parameters, where query and...
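To make the tied vs. untied distinction concrete, here is a minimal toy sketch. `ToyEncoder`, `build_untied`, and `build_tied` are hypothetical names for illustration only, not Tevatron or DPR classes; a real setup would hold transformer encoders instead of a list of floats.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyEncoder:
    """Hypothetical stand-in for an encoder; `weights` plays the role of its parameters."""
    weights: List[float] = field(default_factory=lambda: [0.0, 0.0])


def build_untied() -> Tuple[ToyEncoder, ToyEncoder]:
    # Untied: two separate encoder objects, so parameters are not shared.
    return ToyEncoder(), ToyEncoder()


def build_tied() -> Tuple[ToyEncoder, ToyEncoder]:
    # Tied: the same encoder object serves both queries and passages.
    shared = ToyEncoder()
    return shared, shared


untied_q, untied_p = build_untied()
tied_q, tied_p = build_tied()

# Simulate one training update applied to the query encoder only.
untied_q.weights[0] += 1.0
tied_q.weights[0] += 1.0

print(untied_p.weights[0])  # passage encoder unchanged in the untied case
print(tied_p.weights[0])    # passage encoder sees the update in the tied case
```

In the tied case the update to the query side is visible on the passage side because both names point at the same parameters.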
Hi @gzerveas, the warning message is expected, as we don't need the pooler from BERT, and the way you get the CLS embeddings should be correct. Could you double check which checkpoint...
>Are you certain that Luyu/co-condenser-marco-retriever is a checkpoint of coCondenser and not Condenser?

Yes. Does the corpus you use align with the document here? https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco The corpus in Tevatron has...
I think they use [SEP] to separate title and text during training and encoding: https://github.com/texttron/tevatron/blob/adf5ce45612332797931569d51cc5bcd8c1ac878/src/tevatron/preprocessor/preprocessor_tsv.py#L92
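A rough sketch of that kind of concatenation, for illustration only. `join_title_and_text` is a hypothetical helper, and the exact separator string and spacing are assumptions; the linked preprocessor is the authority on what is actually used.

```python
def join_title_and_text(title: str, text: str, sep: str = " [SEP] ") -> str:
    """Join a passage title and body with a separator (assumed format)."""
    title = title.strip()
    text = text.strip()
    # With no title, return the body alone rather than a dangling separator.
    if not title:
        return text
    return f"{title}{sep}{text}"


passage = join_title_and_text("Example title", "Example passage body.")
print(passage)
```

Whatever format is used, it should be the same at training and encoding time, otherwise the encoder sees inputs it was not trained on.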
Hi @xhluca, sorry for the late reply. Is the issue only with `Tevatron/wikipedia-wq-corpus`? Does `Tevatron/wikipedia-nq-corpus` also fail? It seems like an issue caused by the json environment? ``` data...
Could you check whether a simple jsonl file can be read in your environment? Or could you try a conda environment? My environment is Python 3.8 with conda.
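A quick way to run that check with only the standard library might look like this. The field names (`docid`, `title`, `text`) are made up for the test file and are not claimed to match the corpus schema.

```python
import json
import os
import tempfile


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample rows, only used to exercise the round trip.
rows = [
    {"docid": "0", "title": "hello", "text": "world"},
    {"docid": "1", "title": "foo", "text": "bar"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    write_jsonl(path, rows)
    loaded = read_jsonl(path)

print(loaded == rows)
```

If this round trip works but loading the dataset still fails, the problem is probably not the basic json setup in the environment.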
If doc_id is a string (i.e. cannot be cast to int), the entire search pipeline won't work. Casting to str is a quick fix to make search work for...
It shouldn't break anything... the cast seems unnecessary... https://github.com/texttron/tevatron/blob/3cef7da6368827d9b8cf6d6c40db380b584c1752/src/tevatron/faiss_retriever/__main__.py#L26 but there is a chance that an int gets stored as a string type in the corpus embedding file, so we have to make sure...
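To illustrate the type concern, here is a small sketch of normalizing ids to `str` before comparing them. `normalize_ids` and the sample ids are hypothetical; this is not the code in the linked file.

```python
from typing import Iterable, List, Union

DocId = Union[int, str]


def normalize_ids(ids: Iterable[DocId]) -> List[str]:
    """Cast every id to str so int-like and string ids compare consistently."""
    return [str(i) for i in ids]


# Hypothetical ids as they might come out of an embedding file with mixed types.
stored_ids = [7, "8", "doc-9"]
lookup = normalize_ids(stored_ids)

# Hypothetical relevance judgments keyed by string ids.
relevant = {"7", "8", "doc-9"}
hits = [doc_id for doc_id in lookup if doc_id in relevant]

print(lookup)
print(hits)
```

Without the cast, the int `7` would not match the string `"7"`, so a lookup could silently miss documents.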