Thomas Wang

49 comments of Thomas Wang

Indeed! Sorry, I've sort of dropped this as well for now, as we've been focusing on other aspects of BigScience. I'll try to split the PRs when I get the chance! My...

Wait, I think you might have broken old links.

```
Traceback (most recent call last):
  File "/Users/thomas/code/bigscience/transformers-Official/src/transformers/configuration_utils.py", line 619, in _get_config_dict
    resolved_config_file = cached_path(
  File "/Users/thomas/code/bigscience/transformers-Official/src/transformers/utils/hub.py", line 285, in cached_path
    ...
```

I'd say this is a breaking change. @sgugger does the `from_pretrained` method not take into account redirection?

@Narsil is there a reason not to do

```python
for processor in processors:
    encoding, pair_encoding = processor(encoding, pair_encoding, add_special_tokens)  # This last one is really just needed for backward compatibility, ...
```

> Using Vec actually instead of pairs since pair is also limiting in some form (https://github.com/huggingface/tokenizers/issues/804), but roughly it's this. Maybe a HashMap also?

> Let me just re-share...
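To make the shape difference concrete, here is a minimal Python sketch (hypothetical names and signatures, not the actual tokenizers API): a pair-based interface is hard-wired to at most two sequences, while a Vec/list-based one generalizes to the n-ary cases discussed in issue #804.

```python
from typing import List, Optional

class Encoding:
    """Stand-in for tokenizers' Encoding type; details omitted."""

# Pair-based signature: the interface itself caps input at two sequences.
def process_pair(encoding: Encoding,
                 pair_encoding: Optional[Encoding],
                 add_special_tokens: bool) -> Encoding:
    ...

# List-based signature: any number of sequences fits without changing
# the interface, which is what makes it less limiting than pairs.
def process_list(encodings: List[Encoding],
                 add_special_tokens: bool) -> List[Encoding]:
    ...
```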

> It's more that I think it's safer to keep the methods exposed before for backward compatibility reasons

I don't agree; it's a lot more work to maintain 2x the...

Essentially this is what I believe @mishig25 has done, but I still think we shouldn't have such a design, as it allows

```
Sequence([ByteLevel])
```

and

```
ByteLevel
```

to...
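To illustrate the concern with a minimal Python sketch (hypothetical stand-in classes, not the actual tokenizers types): once a bare processor and a `Sequence` wrapping it are both accepted in the same slot, every pipeline has two spellings.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ByteLevel:
    pass

@dataclass
class Sequence:
    processors: List[object] = field(default_factory=list)

# Two spellings of the same post-processing pipeline:
pipeline_a = ByteLevel()
pipeline_b = Sequence([ByteLevel()])

# Semantically equivalent, but they compare (and would serialize)
# differently, so every consumer has to normalize one into the other.
print(pipeline_a == pipeline_b)  # False
```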

Hi @catqaq, not sure exactly what the question is, but we've indeed gone forward with a byte-level tokenizer. The idea is just not to have any unknown tokens. cc...

So I'm not super familiar with all tokenization models. I think for non-generation tasks byte-level tokenizers are great:
- no unknown characters
- no issue with non-valid UTF-8...
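As a small illustration of those two points (plain Python, not the tokenizers implementation): with a vocabulary covering all 256 byte values, every string encodes to known IDs, but decoding arbitrary byte sequences, as a generative model may produce, can hit invalid UTF-8.

```python
# Minimal sketch: token IDs are just the UTF-8 bytes, so there is no <unk>.
def byte_level_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def byte_level_decode(ids: list[int]) -> str:
    # The generation-side caveat: emitted bytes may not form valid UTF-8,
    # so decoding needs an error policy.
    return bytes(ids).decode("utf-8", errors="replace")

print(byte_level_encode("héllo"))      # [104, 195, 169, 108, 108, 111]
print(byte_level_decode([104, 0xC3]))  # truncated multi-byte char -> 'h�'
```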

Not sure I understand why we use `process_chain` ... I think all `fn process` should take `encodings: Vec<Encoding>` instead, and `Sequence` is just a for loop that calls `process` instead...
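Sketched in Python for brevity (the tokenizers code is Rust, and these names are assumptions): if every post-processor's `process` takes the full list of encodings, `Sequence` reduces to a plain for loop and no separate `process_chain` entry point is needed.

```python
from typing import List

class Encoding:
    """Stand-in for tokenizers' Encoding type."""

class Sequence:
    def __init__(self, processors):
        self.processors = processors

    def process(self, encodings: List[Encoding],
                add_special_tokens: bool) -> List[Encoding]:
        # Chaining is just iteration: each processor consumes and returns
        # the whole list of encodings.
        for processor in self.processors:
            encodings = processor.process(encodings, add_special_tokens)
        return encodings
```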