Thomas Wang

49 comments of Thomas Wang

Indeed! Sorry, I've sort of dropped this as well for now, as we've been focusing on other aspects of BigScience. I'll try to split the PRs when I get the chance! My...

Wait, I think you might have broken old links.

```
Traceback (most recent call last):
  File "/Users/thomas/code/bigscience/transformers-Official/src/transformers/configuration_utils.py", line 619, in _get_config_dict
    resolved_config_file = cached_path(
  File "/Users/thomas/code/bigscience/transformers-Official/src/transformers/utils/hub.py", line 285, in cached_path
    ...
```

I'd say this is a breaking change. @sgugger does the `from_pretrained` method not take into account redirection?

@Narsil is there a reason not to do

```python
for processor in processors:
    encoding, pair_encoding = processor(encoding, pair_encoding, add_special_tokens)  # This last one is really just needed for backward compatibility, ...
```

> Using Vec actually instead of pairs since pair is also limiting in some form (https://github.com/huggingface/tokenizers/issues/804), but roughly it's this. Maybe a HashMap also?

> Let me just re-share...
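To make the shape difference concrete, here is a minimal Python sketch (hypothetical names and signatures, not the actual tokenizers API): a pair-based interface is hard-wired to at most two sequences, while a Vec/list-based one generalizes to the n-ary cases discussed in issue #804.

```python
from typing import List, Optional

class Encoding:
    """Stand-in for tokenizers' Encoding type; details omitted."""

# Pair-based signature: the interface itself caps input at two sequences.
def process_pair(encoding: Encoding,
                 pair_encoding: Optional[Encoding],
                 add_special_tokens: bool) -> Encoding:
    ...

# List-based signature: any number of sequences fits without changing
# the interface, which is what makes it less limiting than pairs.
def process_list(encodings: List[Encoding],
                 add_special_tokens: bool) -> List[Encoding]:
    ...
```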

> It's more that I think it's safer to keep the methods exposed before for backward compatibility reasons

I don't agree; it's a lot more work to maintain 2x the...

Essentially this is what I believe @mishig25 has done, but I still think we shouldn't have such a design, as it allows

```
Sequence([ByteLevel])
```

and

```
ByteLevel
```

to...
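To illustrate the concern with a minimal Python sketch (hypothetical stand-in classes, not the actual tokenizers types): once a bare processor and a `Sequence` wrapping it are both accepted in the same slot, every pipeline has two spellings.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ByteLevel:
    pass

@dataclass
class Sequence:
    processors: List[object] = field(default_factory=list)

# Two spellings of the same post-processing pipeline:
pipeline_a = ByteLevel()
pipeline_b = Sequence([ByteLevel()])

# Semantically equivalent, but they compare (and would serialize)
# differently, so every consumer has to normalize one into the other.
print(pipeline_a == pipeline_b)  # False
```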

Hi @catqaq, not sure exactly what the question is, but we've indeed gone forward with a byte-level tokenizer. The idea is just not to have any unknown tokens. cc...

So I'm not super familiar with all tokenization models. I think for non-generation tasks byte-level tokenizers are great:
- no unknown characters
- no issue with non-valid UTF-8...
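As a small illustration of those two points (plain Python, not the tokenizers implementation): with a vocabulary covering all 256 byte values, every string encodes to known IDs, but decoding arbitrary byte sequences, as a generative model may produce, can hit invalid UTF-8.

```python
# Minimal sketch: token IDs are just the UTF-8 bytes, so there is no <unk>.
def byte_level_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def byte_level_decode(ids: list[int]) -> str:
    # The generation-side caveat: emitted bytes may not form valid UTF-8,
    # so decoding needs an error policy.
    return bytes(ids).decode("utf-8", errors="replace")

print(byte_level_encode("héllo"))      # [104, 195, 169, 108, 108, 111]
print(byte_level_decode([104, 0xC3]))  # truncated multi-byte char -> 'h�'
```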

Not sure I understand why we use `process_chain` ... I think all `fn process` should take `encodings: Vec<Encoding>` instead, and `Sequence` is just a for loop that calls `process` instead...
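Sketched in Python for brevity (the tokenizers code is Rust, and these names are assumptions): if every post-processor's `process` takes the full list of encodings, `Sequence` reduces to a plain for loop and no separate `process_chain` entry point is needed.

```python
from typing import List

class Encoding:
    """Stand-in for tokenizers' Encoding type."""

class Sequence:
    def __init__(self, processors):
        self.processors = processors

    def process(self, encodings: List[Encoding],
                add_special_tokens: bool) -> List[Encoding]:
        # Chaining is just iteration: each processor consumes and returns
        # the whole list of encodings.
        for processor in self.processors:
            encodings = processor.process(encodings, add_special_tokens)
        return encodings
```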