Two approaches to try:
1. Use cross-lingual embeddings as input to an MLP or tree-based model in a transfer-learning fashion (see the sketch below).
2. Fine-tune the cross-lingual translator with a softmax output.
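A minimal sketch of approach 1, assuming cross-lingual sentence embeddings are already available as a NumPy array; the `get_embeddings` helper and the toy labels are hypothetical stand-ins for the translator's embedding API and a real downstream task.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def get_embeddings(texts):
    # Placeholder: in practice, obtain these from the translator's
    # embedding/vectorization API.
    return np.random.rand(len(texts), 768)

texts = ["Hello world", "Bonjour le monde", "Hola mundo", "Hallo Welt"]
labels = [0, 0, 1, 1]  # toy downstream-task labels

X = get_embeddings(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0
)

# Train a small MLP on top of the frozen embeddings.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```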
Fine-tune the cross-lingual translator for text2text generation tasks (e.g. question generation, question answering, summarization) to demonstrate cross-lingual alignment, zero-shot generation, and related capabilities. For example, can we demonstrate question generation or question answering in languages that were not seen during fine-tuning?
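One way to sketch such fine-tuning with Hugging Face mBART; the model choice and the toy question-generation pair are assumptions, not the text2text internals.

```python
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# Toy (context, question) pair for question generation.
sources = ["The Eiffel Tower is in Paris."]
targets = ["Where is the Eiffel Tower?"]

tokenizer.src_lang = "en_XX"
inputs = tokenizer(sources, text_target=targets, return_tensors="pt", padding=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # a few toy gradient steps
    loss = model(**inputs).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```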
Perform a study similar to https://arxiv.org/pdf/1907.04307.pdf, but expanded to support 100 languages using the [embeddings from the translator](https://github.com/artitw/text2text#embedding--vectorization). Possibly start with the paper's [code sample](https://www.tensorflow.org/hub/tutorials/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder).
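A rough sketch of the similarity comparison such a study would run; `embed` below is a hypothetical stand-in for the translator's embedding/vectorization API linked above.

```python
import numpy as np

def embed(texts):
    # Placeholder returning unit-norm vectors; swap in real embeddings.
    vecs = np.random.rand(len(texts), 768)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

english = ["dog", "Puppies are nice.", "I enjoy taking long walks."]
spanish = ["perro", "Los cachorros son agradables.", "Disfruto de caminatas largas."]

en_vecs, es_vecs = embed(english), embed(spanish)
# Cosine similarity matrix between the two languages; well-aligned
# embeddings should score highest on the diagonal.
similarity = en_vecs @ es_vecs.T
print(similarity.round(2))
```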
Training and inference performance could be better. We need to update and test https://github.com/artitw/apex
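For reference, a minimal sketch of what mixed-precision training with Apex AMP looks like (requires CUDA and an Apex install); the model and optimizer are toy stand-ins.

```python
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# opt_level "O1" enables mixed precision with automatic casting.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 10).cuda()
loss = model(x).sum()
# Scale the loss to avoid underflow in half-precision gradients.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```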
Currently, the documentation consists of the README, which is very brief. There is much more functionality in the text2text API that is not described. Such functionality could be better documented for...
There is currently no type checking, so we could adopt the practices from https://docs.python.org/3/library/typing.html
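A small sketch of the kind of annotations this would add; the `translate` signature here is illustrative, not the actual text2text API.

```python
from typing import List, Optional

def translate(texts: List[str], tgt_lang: str, src_lang: Optional[str] = None) -> List[str]:
    """Translate each input string into the target language."""
    # Placeholder implementation; a type checker such as mypy can
    # verify call sites against this signature.
    return [f"[{tgt_lang}] {t}" for t in texts]

results: List[str] = translate(["Hello world"], tgt_lang="zh")
print(results)
```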
Follow the guidelines from the official Python documentation for unit testing: https://docs.python.org/3/library/unittest.html
Turn the Colab demo notebook into integration tests (see the sketch below): https://colab.research.google.com/drive/1LE_ifTpOGO5QJCKNQYtZe6c_tjbwnulR https://github.com/artitw/text2text/blob/master/text2text_demo.ipynb
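A sketch of how a notebook cell could become a unittest-based integration test, following the guidelines above; the `Translator().transform` call is an assumption about the API, so adjust it to match the real one.

```python
import unittest

class TestTranslation(unittest.TestCase):
    def test_translation_returns_nonempty_output(self):
        import text2text as t2t
        # Hypothetical API call mirroring a demo-notebook cell.
        result = t2t.Translator().transform(
            ["Hello world"], src_lang="en", tgt_lang="zh"
        )
        self.assertEqual(len(result), 1)
        self.assertTrue(result[0])  # non-empty output

if __name__ == "__main__":
    unittest.main()
```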
Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer competitive accuracy without such manual curation.
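A sketch of STF-IDF using an off-the-shelf multilingual subword tokenizer with scikit-learn's TF-IDF; the tokenizer choice is one common option, not a requirement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "El rapido zorro marron salta sobre el perro perezoso.",
    "Machine translation aligns languages.",
]

# analyzer=tokenizer.tokenize makes TF-IDF operate on subword units,
# replacing hand-curated tokenization, stop words, and stemming.
vectorizer = TfidfVectorizer(analyzer=tokenizer.tokenize)
doc_matrix = vectorizer.fit_transform(docs)

query_matrix = vectorizer.transform(["zorro marron"])
scores = cosine_similarity(query_matrix, doc_matrix)[0]
print(scores.argmax(), scores)  # index and scores of best-matching document
```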