Zehan Li
Oh! I updated the version to 0.1.96 and everything works well. Thank you.
Hello again, I tested the bug and it appears again after a month... I'm using the newest versions of both packages:
```
pytorch-lightning==1.5.10
sentencepiece==0.1.96
```
The bug is triggered by...
Hi @yangky11, could you try switching the import order to see if that works?
```Python
import sentencepiece
import pytorch_lightning
```
Thank you so much! I changed the hostname to `http://localhost:9200` and it works. But when I run it to evaluate BM25, I get different scores across runs. For example,...
I see. It's fixed in the `beir` code but not yet included in the `examples`. I added a sleep and eventually got a consistent score.
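For context, here is a minimal sketch of why a sleep helps, assuming the run-to-run score differences come from querying Elasticsearch before the bulk-indexed documents are refreshed and searchable. The index name and toy corpus are hypothetical, and the client calls assume elasticsearch-py 8.x; this is not the actual `beir` fix.

```python
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
index_name = "bm25-demo"  # hypothetical index name

# Toy corpus, bulk-indexed the way a BM25 evaluation script would do it.
actions = [{"_index": index_name, "_id": str(i), "txt": f"document {i}"} for i in range(100)]
helpers.bulk(es, actions)

# Without a refresh/sleep, a query issued right after bulk indexing may see only
# part of the corpus, so BM25 scores differ from run to run.
es.indices.refresh(index=index_name)
time.sleep(2)  # extra safety margin, mirroring the sleep mentioned above

resp = es.search(index=index_name, query={"match": {"txt": "document"}})
print(resp["hits"]["total"])
```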
Hi @txsun1997, thanks for the update! With the new code and hyperparameters, I can successfully replicate the results.
```
********* Evaluated on dev set *********
Dev loss: 1.0993. Dev perf:...
```
Hi, have you updated the pip package? I still have this problem (installed via `pip install tevatron`).
What do you think? 
Hi @ChenghaoMou, I'm facing the same problem with another local minhash deduplication implementation, which removes significantly fewer documents than the Spark implementation. See https://github.com/huggingface/datatrove/issues/107
I have tried another implementation from [starcoder](https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/minhash_deduplication_spark.py), which produces nearly the same deduplication rate as the datatrove implementation. Is there any reason why bigcode didn't use the graphframes implementation in this code...
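To make the comparison concrete, here is a minimal, hypothetical sketch of the clustering step where implementations can diverge: once MinHash produces candidate duplicate pairs, the pairs have to be grouped into connected components (via graphframes, an iterative map-reduce scheme, or a plain union-find as below), and differences in that grouping change how many documents end up removed. None of this is the actual datatrove or bigcode code.

```python
from collections import defaultdict

def find(parent, x):
    # Path-halving find for the union-find structure.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_duplicates(pairs):
    """Group documents connected by any chain of duplicate pairs."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    clusters = defaultdict(set)
    for doc in parent:
        clusters[find(parent, doc)].add(doc)
    return list(clusters.values())

# Example: doc1-doc2 and doc2-doc3 collapse into one cluster of three documents
# (two of which get removed), plus a separate cluster {doc4, doc5}.
pairs = [("doc1", "doc2"), ("doc2", "doc3"), ("doc4", "doc5")]
print(cluster_duplicates(pairs))
```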