Hi, we need Java/Scala bindings for tokenizers.
Hi, we currently use PyTorch from Scala and would like to use tokenizers as well, but we can't easily interoperate with the Rust library right now. Are there plans to support Scala bindings?
Hey! There is https://github.com/sbrunk/tokenizers-scala, which may not be fully up to date but should be good enough. We've had requests for C/C++ bindings as well. I'd be happy if we could just automatically generate them!
I know that repo, but it can't download pretrained models and tokenizers; we need to make that work.
Hi! I'm new to this repository and I'm interested in the possibility of having Java/Scala bindings for the tokenizers library. I've noticed there are already Python and Node.js bindings in the bindings/ directory, and I'm curious about how feasible it would be to add Java/Scala support in a similar way.
As someone new to the project, I have some questions:

- I see that both Python and Node.js bindings use Rust as the core implementation. Is this the standard approach for adding new language bindings? I'd like to understand the general architecture better.
- What would be the first steps for someone new to contribute a new language binding? I'm particularly interested in:
  - The basic structure needed (I see Cargo.toml, language-specific config files, tests, etc.)
  - Any specific tools or approaches recommended for Java/Scala integration
  - Common pitfalls to avoid
- Are there any existing discussions or documentation about adding new language bindings that I should be aware of?
- Would the team be open to mentoring or reviewing a contribution for Java/Scala bindings? I'm eager to learn and contribute, but would appreciate guidance on the best approach.
Thank you for your time! I'm excited to learn more about the project and potentially help add Java/Scala support.
- Recommendation to use Rust bindings: I recommend building on the Rust bindings. You can refer to the project mentioned earlier; essentially, you only need to implement whatever functionality it is missing on top of its existing work. After all, Rust offers superior performance. Of course, you could also develop purely in Scala 3 without relying on Rust or Python, but that would involve a significant amount of development work.
- Additional suggestions: I don't have much in-depth knowledge of the other aspects, so you should try to communicate more with the official Transformers team. While I am unable to review your code, we can help test the effectiveness of your implementation. I highly recommend prioritizing a Scala 3 version of Transformers, as that task is of great importance.
I greatly admire your abilities and convictions, and I look forward to seeing you make even greater contributions to Scala 3.
Hi @ArthurZucker. A low-level C binding would allow people to easily bind further on their side into other higher-level languages like Java, C++, etc. The very first (baby?) step would be just to add a few `#[no_mangle]` and `extern "C"` annotations to some public Rust functions/structs, so that some key symbols would be visible and unmangled in the final libtokenizers library. Currently none are exported, e.g.:
```shell
~/repos/tokenizers/tokenizers (main) $ nm target/release/libtokenizers.rlib | grep from_pretrained
nm: lib.rmeta: no symbols
```
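To illustrate what such an export could look like, here is a minimal hedged sketch. The function name `tokenizers_input_len` and its byte-counting body are purely illustrative placeholders, not part of the real tokenizers API; a real binding would forward into the tokenizer instead.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Hypothetical C-ABI export: `#[no_mangle]` keeps the symbol name intact and
// `extern "C"` gives it the C calling convention, so `nm` would list it and
// JNI/JNA/ctypes callers could link against it.
#[no_mangle]
pub extern "C" fn tokenizers_input_len(text: *const c_char) -> usize {
    // Placeholder body: counting bytes keeps this sketch self-contained
    // and runnable; a real binding would call the tokenizer here.
    if text.is_null() {
        return 0;
    }
    unsafe { CStr::from_ptr(text) }.to_bytes().len()
}

fn main() {
    // Smoke test of the exported function from the Rust side.
    let s = std::ffi::CString::new("hello").unwrap();
    assert_eq!(tokenizers_input_len(s.as_ptr()), 5);
    println!("ok");
}
```

Note that an `.rlib` is a Rust-only archive; for the symbols to appear in a C-linkable artifact, the crate would also need `crate-type = ["cdylib"]` (or `"staticlib"`) in Cargo.toml.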
Would you consider a PR? Best