Hi, we need Java/Scala tokenizers bindings

Open mullerhai opened this issue 8 months ago • 5 comments

Hi, we now use PyTorch in Scala and want to use tokenizers, but we currently cannot integrate it with the Rust implementation. Are there plans to support Scala bindings?

mullerhai avatar Jun 01 '25 03:06 mullerhai

Hey! There is https://github.com/sbrunk/tokenizers-scala, which is probably not fully up to date but should be good enough. We've had requests for C/C++ bindings as well. I'd be happy if we could just generate them automatically!

ArthurZucker avatar Jun 02 '25 08:06 ArthurZucker

I know this repo, but it cannot download pretrained models and tokenizers. We need to make that a reality.

mullerhai avatar Jun 02 '25 08:06 mullerhai

Hi! I'm new to this repository and I'm interested in the possibility of having Java/Scala bindings for the tokenizers library. I've noticed there are already Python and Node.js bindings in the bindings/ directory, and I'm curious about how feasible it would be to add Java/Scala support in a similar way.

As someone new to the project, I have some questions:

  1. I see that both Python and Node.js bindings use Rust as the core implementation. Is this the standard approach for adding new language bindings? I'd like to understand the general architecture better.

  2. What would be the first steps for someone new to contribute a new language binding? I'm particularly interested in:

    • The basic structure needed (I see Cargo.toml, language-specific config files, tests, etc.)
    • Any specific tools or approaches recommended for Java/Scala integration
    • Common pitfalls to avoid

  3. Are there any existing discussions or documentation about adding new language bindings that I should be aware of?

  4. Would the team be open to mentoring or reviewing a contribution for Java/Scala bindings? I'm eager to learn and contribute, but would appreciate guidance on the best approach.

Thank you for your time! I'm excited to learn more about the project and potentially help add Java/Scala support.

AshAnand34 avatar Jun 09 '25 22:06 AshAnand34

  1. Recommendation to use Rust bindings: I recommend building on the Rust bindings. You can refer to the project mentioned earlier; essentially, you only need to implement whatever functionality it is missing on top of its existing work. After all, Rust offers superior performance. You could also develop everything purely in Scala 3 without relying on Rust or Python, but that would involve a significant amount of development work.

  2. Additional suggestions: I don't have in-depth knowledge of the other aspects, so you should try to communicate more with the official Transformers team. While I am unable to review your code, we can help test your implementations. I highly recommend prioritizing the development of a Scala 3 version of Transformers, as this task is of great importance.

I greatly admire your abilities and convictions, and I look forward to seeing you make even greater contributions to Scala 3.

mullerhai avatar Jun 10 '25 01:06 mullerhai

Hi @ArthurZucker, a low-level C binding would allow people to easily bind further on their side into other higher-level languages such as Java, C++, ... The very first (baby?) step would be just to add a few #[no_mangle] attributes and extern "C" declarations to some public Rust functions/structs, so that some key symbols would be visible and unmangled in the final libtokenizers.rlib. At the moment, for example:

~/repos/tokenizers/tokenizers (main) $ nm target/release/libtokenizers.rlib | grep from_pretrained
nm: lib.rmeta: no symbols
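
For illustration only, here is a minimal sketch of what such exported symbols could look like. It does not use the tokenizers crate at all: `ToyTokenizer` and the `toy_tokenizer_*` functions are hypothetical stand-ins, and the whitespace "tokenization" is just a placeholder. `#[unsafe(no_mangle)]` is the newer spelling of `#[no_mangle]` (accepted since Rust 1.82 and required in the 2024 edition). A C-linkable library would typically also need a `cdylib` or `staticlib` crate type rather than an rlib.

```rust
use std::ffi::{c_char, CStr};

/// Hypothetical stand-in for a real tokenizer type.
/// C callers only ever see it as an opaque pointer.
pub struct ToyTokenizer {
    max_len: usize,
}

impl ToyTokenizer {
    /// Placeholder logic: count whitespace-separated tokens, capped at max_len.
    fn count_tokens(&self, text: &str) -> usize {
        text.split_whitespace().count().min(self.max_len)
    }
}

/// Allocate a tokenizer and hand ownership to the caller as a raw pointer.
#[unsafe(no_mangle)]
pub extern "C" fn toy_tokenizer_new(max_len: usize) -> *mut ToyTokenizer {
    Box::into_raw(Box::new(ToyTokenizer { max_len }))
}

/// Count tokens in a NUL-terminated UTF-8 string.
/// Returns -1 on a null pointer or invalid UTF-8.
///
/// # Safety
/// `tok` must come from `toy_tokenizer_new` and `text` must point to a
/// valid NUL-terminated string.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn toy_tokenizer_count(
    tok: *const ToyTokenizer,
    text: *const c_char,
) -> i64 {
    if tok.is_null() || text.is_null() {
        return -1;
    }
    let tok = unsafe { &*tok };
    let text = unsafe { CStr::from_ptr(text) };
    match text.to_str() {
        Ok(s) => tok.count_tokens(s) as i64,
        Err(_) => -1,
    }
}

/// Release a tokenizer created by `toy_tokenizer_new`.
///
/// # Safety
/// `tok` must be null or a pointer from `toy_tokenizer_new` that has not
/// already been freed.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn toy_tokenizer_free(tok: *mut ToyTokenizer) {
    if !tok.is_null() {
        drop(unsafe { Box::from_raw(tok) });
    }
}
```

The opaque-pointer pattern (new / use / free) is a common choice for this kind of layer because JNI, JNA, or a C++ wrapper can hold the pointer as a plain integer handle without knowing the Rust struct layout.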

Would you consider a PR? Best,

WilliamTambellini avatar Jul 16 '25 22:07 WilliamTambellini