
Context-Aware Tokenization

tjwald opened this issue 1 year ago • 8 comments

Is your feature request related to a problem? Please describe. Continuing this thread from https://github.com/microsoft/semantic-kernel/issues/9793#:~:text=Context%20Tokenization . Migrating from Python to C# is difficult enough; supporting the same feature set while being more efficient is important for the migration story. One feature that is currently missing is context-aware tokenization.

In Python, Hugging Face tokenizers support this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/path/to/tokenizer')

context = 'some context'
sentence = 'some sentence'

# Passing two strings encodes them together as a (context, sentence) pair.
tokens = tokenizer(context, sentence)

Here the tokenizer tokenizes the sentence together with the given context.

Describe the solution you'd like Add a new API (including a batch API) that supports this:

class Tokenizer
{
    public IReadOnlyList<int> EncodeToIds(string context, string text, bool considerPreTokenization = true, bool considerNormalization = true)
    ...
    // relating to: https://github.com/dotnet/machinelearning/issues/7371
    public void BatchTokenize<T, K>(ReadOnlySpan<string> contexts, ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> tokenIds, Tensor<K> mask)
        where T : INumber<T>
        where K : INumber<K>
}
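
For illustration, a hypothetical call to the proposed batch API might look like the sketch below; the Tensor.Create allocation, the [batch, maxTokenCount] shape, and the tokenizer variable are assumptions for illustration, not existing API:

ReadOnlySpan<string> contexts = new[] { "context one", "context two" };
ReadOnlySpan<string> texts = new[] { "sentence one", "sentence two" };

// Assumed layout: one row per context/text pair, padded or truncated to maxTokenCount.
const int maxTokenCount = 512;
Tensor<long> tokenIds = Tensor.Create<long>(new nint[] { contexts.Length, maxTokenCount });
Tensor<long> mask = Tensor.Create<long>(new nint[] { contexts.Length, maxTokenCount });

tokenizer.BatchTokenize(contexts, texts, maxTokenCount, tokenIds, mask);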

Describe alternatives you've considered Study the inner workings of the different tokenizers in the Python implementation and see how to port the same functionality.

Additional context Continuing the thread related to tokenizers from https://github.com/microsoft/semantic-kernel/issues/9793

tjwald avatar Jan 23 '25 07:01 tjwald

@tjwald would you be able to share docs / guidance that provide more info on this approach? Thanks.

luisquintanilla avatar Mar 03 '25 19:03 luisquintanilla

This looks specific to the Bert tokenizer.

https://huggingface.co/docs/transformers/v4.49.0/en/model_doc/bert#transformers.BertForNextSentencePrediction.forward.example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
context = "Hugging Face provides powerful NLP tools."
# The second sentence of the pair (recovered from the printed output below).
sentence = "Transformers library is widely used for deep learning."
tokens = tokenizer(context, sentence, padding=True, truncation=True, return_tensors="pt")
print(tokens)

This will produce the following output (note how token_type_ids marks the context tokens with 0 and the sentence tokens with 1):

{'input_ids': tensor([[  101, 17662,  2227,  3640,  3928, 17953,  2361,  5906,  1012,   102,
         19081,  3075,  2003,  4235,  2109,  2005,  2784,  4083,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

It is related to batching too. @tjwald feel free to add any details here.

tarekgh avatar Mar 03 '25 20:03 tarekgh

This looks specific to the Bert tokenizer.

@tarekgh You are correct that BERT supports this; however, it is part of the generic tokenizer API, and there are other implementations of it. Most of them just concatenate the inputs, but there are some that don't [ex, ex].

I have collected a few documents mentioning this, including the implementation of BERT from the Hugging Face repo.

https://medium.com/axinc-ai/how-tokenizer-parameters-impact-transformers-behavior-8be8030637c6

Question-answering tasks: https://huggingface.co/docs/transformers/tasks/question_answering
Multiple-choice tasks: https://huggingface.co/docs/transformers/tasks/multiple_choice
BERT preprocessing: https://huggingface.co/google-bert/bert-base-uncased#preprocessing
BERT implementation: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/tokenization_bert.py -> build_inputs_with_special_tokens

It is related to batching too

This feature should be supported both for single context-sentence pairs and for batches.

tjwald avatar Mar 03 '25 22:03 tjwald

Thanks @tjwald!

I am not sure if you already saw that we support build_inputs_with_special_tokens with the Bert tokenizer: https://github.com/dotnet/machinelearning/blob/142d7f5afaaf8b156f71b629576badbe0b9048e9/src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs#L290

and we have an overload that works with spans too: https://github.com/dotnet/machinelearning/blob/142d7f5afaaf8b156f71b629576badbe0b9048e9/src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs#L339

tarekgh avatar Mar 04 '25 00:03 tarekgh

@tarekgh Very cool! Is there documentation for this? I would add how to replicate the Hugging Face behaviours. Also, how would truncation work?

tjwald avatar Mar 04 '25 07:03 tjwald

Is there documentation for this?

We are in the process of updating all the docs. Hopefully that will be done soon.

tarekgh avatar Mar 04 '25 15:03 tarekgh

@tarekgh I have just tried to use the API you suggested, and I don't see how to use it:

 public IReadOnlyList<int> BuildInputsWithSpecialTokens(IEnumerable<int> tokenIds, IEnumerable<int>? additionalTokenIds = null) 

To call this API I need to have already tokenized the inputs - so let's say I call it like so:

public IReadOnlyList<int> Tokenize(string context, string text)
{
    var tokenizedContext = _tokenizer.EncodeToIds(context, _tokenizerOptions.MaxTokenLength, out _, out _);
    var tokenizedText = _tokenizer.EncodeToIds(text, _tokenizerOptions.MaxTokenLength, out _, out _);

    return _tokenizer.BuildInputsWithSpecialTokens(tokenizedContext, tokenizedText);
}

Well, now I get: [cls] [cls] <context tokens> [sep] [sep] [cls] <text tokens> [sep] [sep]

Which I definitely do not want 😄

I am now trying this:

public List<int> Tokenize(string text)
{
    return (List<int>)_tokenizer.EncodeToIds(text, _tokenizerOptions.MaxTokenLength, out _, out _);
}

public IReadOnlyList<int> Tokenize(string context, string text)
{
    List<int> tokenizedContext = Tokenize(context);
    List<int> tokenizedText = Tokenize(text);

    // Manually stitch the two encodings together: append a separator
    // after the context, then skip the text's leading [CLS].
    tokenizedContext.Add(_tokenizer.SeparatorTokenId);
    tokenizedContext.AddRange(tokenizedText.Skip(1));
    return tokenizedContext;
}

Which obviously isn't ideal.

tjwald avatar Mar 18 '25 19:03 tjwald

When encoding to IDs, you can call the following overload and pass addSpecialTokens = false:

https://github.com/dotnet/machinelearning/blob/81122c4c48d84fe2f49afd430a2c8fc214311baa/src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs#L158
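
Putting it together, a minimal sketch of the intended usage (assuming a BertTokenizer field named _tokenizer; the overloads used are the ones linked here and earlier in the thread):

public IReadOnlyList<int> Tokenize(string context, string text)
{
    // Encode each piece without [CLS]/[SEP] so that BuildInputsWithSpecialTokens
    // adds them exactly once: [CLS] <context tokens> [SEP] <text tokens> [SEP].
    IReadOnlyList<int> contextIds = _tokenizer.EncodeToIds(context, addSpecialTokens: false);
    IReadOnlyList<int> textIds = _tokenizer.EncodeToIds(text, addSpecialTokens: false);

    return _tokenizer.BuildInputsWithSpecialTokens(contextIds, textIds);
}

Truncation could presumably be layered on top using the maxTokenCount overloads of EncodeToIds, though that is an assumption rather than documented behavior.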

tarekgh avatar Mar 18 '25 21:03 tarekgh

The workaround works. I think we should create an interface for this so that users can do a capability check on the Tokenizer abstraction. @tarekgh @luisquintanilla WDYT?

tjwald avatar Sep 06 '25 18:09 tjwald

The interface would be useful if most tokenizers supported functionality similar to the Bert tokenizer. However, users typically know which tokenizer they want to use and can directly call the additional functionality specific to it. We can revisit introducing an interface if we see sufficient demand.

tarekgh avatar Sep 06 '25 19:09 tarekgh

I am trying to write a generic library, so I wouldn't know what tokenizer is being used. I guess I can create my own interface and force my users to write adapters.
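
For example, such an adapter might look like the sketch below; every name here is hypothetical and not part of Microsoft.ML.Tokenizers.

// Hypothetical capability interface a generic library could depend on.
public interface IPairTokenizer
{
    IReadOnlyList<int> EncodeToIds(string context, string text);
}

// Example adapter over BertTokenizer, reusing the addSpecialTokens: false workaround.
public sealed class BertPairTokenizerAdapter : IPairTokenizer
{
    private readonly BertTokenizer _tokenizer;

    public BertPairTokenizerAdapter(BertTokenizer tokenizer) => _tokenizer = tokenizer;

    public IReadOnlyList<int> EncodeToIds(string context, string text) =>
        _tokenizer.BuildInputsWithSpecialTokens(
            _tokenizer.EncodeToIds(context, addSpecialTokens: false),
            _tokenizer.EncodeToIds(text, addSpecialTokens: false));
}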

tjwald avatar Sep 06 '25 19:09 tjwald

I guess I can create my own interface and force my users to write adapters.

You know your scenario and goals better than anyone 😄. From what we’ve discussed so far, this is about Bert, where you can easily check the tokenizer type at runtime. But again, I’m just making an assumption here.

tarekgh avatar Sep 06 '25 19:09 tarekgh

Closing for now - thanks for the feedback 😄

tjwald avatar Sep 06 '25 21:09 tjwald