Riccardo Orlando

Results: 14 comments of Riccardo Orlando

The multilingual model `xx_sent_ud_sm` does not tokenize Chinese sentences correctly, while the Chinese model `zh_core_web_sm` does. For example:

```python
import spacy

nlp_ml = spacy.load("xx_sent_ud_sm")
nlp_ml.tokenizer("包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。")
# ['包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究', '。']

nlp_zh = spacy.load("zh_core_web_sm")
...
```
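For reference, a minimal side-by-side comparison of the two tokenizers (a sketch; it assumes both models are installed, and it does not claim any specific segmentation for `zh_core_web_sm`, since that output is truncated above):

```python
import spacy

text = "包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。"

for model_name in ("xx_sent_ud_sm", "zh_core_web_sm"):
    nlp = spacy.load(model_name)
    tokens = [t.text for t in nlp.tokenizer(text)]
    # The xx pipeline falls back to whitespace/punctuation rules, so the whole
    # clause stays one token; the zh pipeline runs a word segmenter instead.
    print(model_name, len(tokens), tokens)
```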

> @Riccorl: This is the expected behavior for the base `xx` tokenizer used in that model, which just doesn't work for languages without whitespace between tokens. It was a...

It seems it doesn't work with 1.13; it works with 1.12, though.

@sshleifer It seems the problem is not `num_candidates=1`. The model sees a binary classification task when there is only one candidate in the utterance sample.
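A toy sketch of the distinction (illustrative PyTorch, not the project's actual loss code): with several candidates per sample the head ranks them against each other, whereas with a single candidate there is nothing to rank, so the objective collapses into a per-sample "is this candidate correct?" decision.

```python
import torch
import torch.nn.functional as F

batch = 4

# Ranking framing: 8 candidates per sample, gold candidate at index 0.
ranking_logits = torch.randn(batch, 8)
gold_index = torch.zeros(batch, dtype=torch.long)
ranking_loss = F.cross_entropy(ranking_logits, gold_index)

# Single-candidate framing: one score per sample and a 0/1 correctness
# label -- effectively binary classification rather than candidate ranking.
single_logits = torch.randn(batch)
binary_labels = torch.randint(0, 2, (batch,)).float()
binary_loss = F.binary_cross_entropy_with_logits(single_logits, binary_labels)

print(ranking_loss, binary_loss)
```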

I have the same issue on an iPhone 12 with iOS 16.4.1. Here is the log:

```csv
created_at,level,context,message,stacktrace
2023-04-15 10:24:50.079767,LogLevel.INFO,"AssetNotifier","state is already up-to-date",""
2023-04-15 10:24:50.079218,LogLevel.INFO,"AssetNotifier","Load assets: 142ms",""
2023-04-15 10:23:29.795768,LogLevel.INFO,"BackupNotifier","Found 37 local albums",""
...
```

> I think I figured out the problem. Go into all of your shared albums and make sure that everything is downloaded. In my case, I had a shared album...

I changed `ConcatTokensDataset.__iter__` to this:

```python
def __iter__(self) -> Iterable[Dict[str, bytes]]:
    buffer = []
    # self.write_batch_size = 10_000
    shards = self.hf_dataset.num_rows // self.write_batch_size + 1
    for i in range(shards):
        shard...
```
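Since the comment is truncated, here is a self-contained sketch of the sharded-iteration idea (the class name `ShardedDataset` and the loop body are assumptions; only `hf_dataset`, `write_batch_size`, and the shard count come from the snippet above):

```python
from typing import Iterable


class ShardedDataset:
    """Toy stand-in for ConcatTokensDataset that walks a HF dataset in shards."""

    def __init__(self, hf_dataset, write_batch_size: int = 10_000):
        self.hf_dataset = hf_dataset
        self.write_batch_size = write_batch_size

    def __iter__(self) -> Iterable[dict]:
        # One extra shard covers the remainder when num_rows is not an
        # exact multiple of write_batch_size.
        shards = self.hf_dataset.num_rows // self.write_batch_size + 1
        for i in range(shards):
            start = i * self.write_batch_size
            end = min(start + self.write_batch_size, self.hf_dataset.num_rows)
            if start >= end:
                break
            # datasets.Dataset.select takes an iterable of row indices and
            # returns a view over just that shard.
            shard = self.hf_dataset.select(range(start, end))
            # The real __iter__ would tokenize the shard here and yield
            # Dict[str, bytes] samples; this sketch yields rows as-is.
            yield from shard
```

Processing the dataset in `write_batch_size`-sized shards means the tokenizer sees whole batches at a time, which is what makes the multithreaded tokenization discussed in the next comment pay off.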

> Thanks for your update! Do you modify other files to enable multithreaded?

Yes, sorry, I also removed `os.environ["TOKENIZERS_PARALLELISM"] = "false"` from `ConcatTokensDataset.__init__`.
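For context, `TOKENIZERS_PARALLELISM` is the environment variable that Hugging Face's `tokenizers` library checks to decide whether its Rust-side thread pool may run; projects often pin it to `"false"` to silence the fork-after-parallelism warning. A minimal sketch of re-enabling it (the model name is just an example, and any speedup depends on your machine and tokenizer):

```python
import os

# Must be set before the tokenizer is first used: "false" forces
# single-threaded encoding, anything else allows the internal thread pool.
os.environ["TOKENIZERS_PARALLELISM"] = "true"

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# Batch encoding is where the parallelism actually kicks in.
encodings = tok(["some text to encode"] * 10_000)
```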