Thomas Wang
Indeed! Sorry, I've sort of dropped this as well for now, as we've been focusing on other aspects of BigScience. I'll try to split the PRs when I get the chance! My...
Wait, I think you might have broken old links.
```
Traceback (most recent call last):
  File "/Users/thomas/code/bigscience/transformers-Official/src/transformers/configuration_utils.py", line 619, in _get_config_dict
    resolved_config_file = cached_path(
  File "/Users/thomas/code/bigscience/transformers-Official/src/transformers/utils/hub.py", line 285, in cached_path
...
```
I'd say this is a breaking change. @sgugger does the `from_pretrained` method not take redirection into account?
@Narsil is there a reason not to do
```python
for processor in processors:
    encoding, pair_encoding = processor(encoding, pair_encoding, add_special_tokens)
    # This last one is really just needed for backward compatibility,...
```
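For clarity (since the snippet above is cut off), here is a self-contained sketch of the loop I have in mind; `Processor`, `process_all`, and the encoding arguments are made-up stand-ins for the real Rust types, not the actual crate API:

```python
from typing import List

class Processor:
    """Hypothetical stand-in for a post-processor: consumes the current
    (encoding, pair_encoding) and returns the updated pair."""
    def __call__(self, encoding, pair_encoding, add_special_tokens: bool):
        raise NotImplementedError

def process_all(processors: List[Processor], encoding, pair_encoding=None,
                add_special_tokens: bool = True):
    # Each processor sees the output of the previous one, so chaining is
    # just a plain loop rather than a dedicated chaining method.
    for processor in processors:
        encoding, pair_encoding = processor(encoding, pair_encoding, add_special_tokens)
    # `add_special_tokens` is only threaded through for backward compatibility.
    return encoding, pair_encoding
```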
> Using Vec actually instead of pairs since pair is also limiting in some form (https://github.com/huggingface/tokenizers/issues/804), but roughly it's this. Maybe a HashMap also?

> Let me just re-share...
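Roughly, the difference in shape I mean (pair vs. list vs. keyed map); these signatures are illustrative only, written as Python stand-ins rather than the real Rust types:

```python
from typing import Dict, List, Optional, Tuple

Encoding = List[int]  # placeholder for the real Encoding type

# Pair-based: hard-wired to at most two sequences.
def process_pair(encoding: Encoding, pair_encoding: Optional[Encoding] = None) -> Tuple[Encoding, Optional[Encoding]]:
    return encoding, pair_encoding

# List-based: any number of sequences (see issue #804); a pair is just a list of length two.
def process_list(encodings: List[Encoding]) -> List[Encoding]:
    return encodings

# Map-based: sequences addressed by name instead of position.
def process_map(encodings: Dict[str, Encoding]) -> Dict[str, Encoding]:
    return encodings
```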
> It's more that I think it's safer to keep the methods exposed before for backward compatibility reasons

I don't agree, it's a lot more work to maintain 2x the...
Essentially this is what I believe @mishig25 has done, but I still think we shouldn't have such a design, as it allows both `Sequence([ByteLevel])` and `ByteLevel` to...
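To make the concern concrete, here is the kind of redundancy I mean, shown with the pre-tokenizer variants (which already accept both spellings); I'm assuming the same shape carries over to post-processors:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel, Sequence

tok = Tokenizer(BPE())

# Two ways to configure the exact same behaviour:
tok.pre_tokenizer = ByteLevel()              # bare component
tok.pre_tokenizer = Sequence([ByteLevel()])  # one-element chain wrapping it
```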
Hi @catqaq, not sure exactly what the question is, but we've indeed gone forward with a byte-level tokenizer. The idea is just not to have any unknown tokens. cc...
So I'm not super familiar with all tokenization models. I think for non-generation tasks byte-level tokenizers are great:
- no unknown characters
- no issue with non-valid utf-8...
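A minimal sketch of why byte-level sidesteps both points: every input, valid text or not, reduces to byte values 0..255, so the base vocabulary always covers it (the helper below is made up, not the library's API):

```python
def byte_level_ids(data) -> list:
    # Accept either text or raw bytes: everything reduces to byte values
    # 0..255, so no character (or malformed byte) falls outside the vocabulary.
    if isinstance(data, str):
        data = data.encode("utf-8")
    return list(data)

print(byte_level_ids("héllo 🤗"))   # any unicode text, no <unk> needed
print(byte_level_ids(b"\xc3\x28"))  # even bytes that are not valid UTF-8
```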
Not sure I understand why we use `process_chain` ... I think all `fn process` should take `encodings: Vec` instead, and `Sequence` is just a for loop that calls `process` instead...
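In Python-ish pseudocode (stand-ins for the Rust types, not the real signatures), this is the shape I'm suggesting: every processor's `process` takes the full list of encodings, and `Sequence` is nothing more than a loop over its children:

```python
from typing import List

class Encoding:
    """Placeholder for the real Encoding struct."""
    def __init__(self, ids: List[int]):
        self.ids = ids

class PostProcessor:
    def process(self, encodings: List[Encoding], add_special_tokens: bool = True) -> List[Encoding]:
        raise NotImplementedError

class Sequence(PostProcessor):
    def __init__(self, processors: List[PostProcessor]):
        self.processors = processors

    def process(self, encodings, add_special_tokens=True):
        # No separate `process_chain` needed: chaining is just a for loop
        # that feeds each processor the previous processor's output.
        for processor in self.processors:
            encodings = processor.process(encodings, add_special_tokens)
        return encodings
```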