artitw

9 issues

Two approaches to try:
1. Use cross-lingual embeddings as input to an MLP or tree-based model, in transfer-learning fashion
2. Fine-tune the cross-lingual translator with a softmax output
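Approach 1 can be sketched with a small MLP trained on frozen embeddings. The vectors below are mock stand-ins for the translator's cross-lingual embeddings (the real ones would come from the text2text embedding API), and the hand-rolled numpy training loop is just for illustration; a real implementation would use scikit-learn or PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock "cross-lingual embeddings": two classes whose vectors cluster
# around different means, regardless of source language. In practice
# these would be produced by the translator's embedding layer.
d = 16
X = np.vstack([
    rng.normal(loc=-1.0, scale=0.5, size=(50, d)),
    rng.normal(loc=1.0, scale=0.5, size=(50, d)),
])
y = np.array([0] * 50 + [1] * 50)

# One-hidden-layer MLP; the embeddings are frozen inputs (transfer
# learning), so only the MLP weights are trained.
W1 = rng.normal(scale=0.1, size=(d, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(200):
    H = np.tanh(X @ W1 + b1)
    p = sigmoid(H @ W2 + b2).ravel()
    # Gradients of mean binary cross-entropy w.r.t. each layer
    g = (p - y)[:, None] / len(y)
    gW2 = H.T @ g; gb2 = g.sum(0)
    gH = (g @ W2.T) * (1.0 - H ** 2)
    gW1 = X.T @ gH; gb1 = gH.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

pred = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2).ravel() > 0.5).astype(int)
acc = (pred == y).mean()
```

A tree-based model (e.g. gradient-boosted trees) could be swapped in for the MLP with no change to the embedding side.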

Fine-tune the cross-lingual translator for text2text generation tasks (e.g. question generation, question answering, summarization) to demonstrate cross-lingual alignment, zero-shot generation, etc. For example, can we demonstrate question generation or question...

Perform a study similar to https://arxiv.org/pdf/1907.04307.pdf, but expanded to support 100 languages using the [embeddings from the translator](https://github.com/artitw/text2text#embedding--vectorization). Possibly start with the paper's [code sample](https://www.tensorflow.org/hub/tutorials/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder).
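The core of such a study is cross-lingual retrieval: embed parallel sentences and check that each source sentence is nearest to its translation. A minimal sketch, with hand-picked mock vectors standing in for the translator's embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Mock embeddings (hypothetical): en[i] and xx[i] represent an English
# sentence and its translation, which should land near each other in
# the shared cross-lingual space.
en = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]]
xx = [[0.85, 0.15, 0.05], [0.05, 0.75, 0.25], [0.0, 0.1, 0.95]]

# For each source sentence, retrieve the nearest target; the evaluation
# reports the accuracy of this matching across language pairs.
matches = [max(range(len(xx)), key=lambda j: cosine(e, xx[j])) for e in en]
accuracy = sum(i == j for i, j in enumerate(matches)) / len(en)
```

Scaling this to 100 languages is then a matter of repeating the retrieval evaluation per language pair.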

Training and inference performance could be better. We need to update and test https://github.com/artitw/apex

Currently, the documentation consists only of the README, which is very brief. Much of the functionality in the text2text API is not described there. Such functionality can be better documented for...

There is currently no type checking, so we can follow the practices from https://docs.python.org/3/library/typing.html
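A minimal sketch of what adopting type hints would look like, using hypothetical helper functions rather than actual text2text API names:

```python
from typing import Optional

def top_k(scores: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    # Annotations document intent and let a checker such as mypy flag
    # misuse (e.g. passing a list of strings instead of a score dict).
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

def maybe_first(items: list[str]) -> Optional[str]:
    # Optional[...] makes "may return None" explicit to callers.
    return items[0] if items else None
```

Hints are gradual, so they can be added module by module and verified in CI with `mypy`.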

Follow the guidelines from the official Python documentation for unit testing: https://docs.python.org/3/library/unittest.html
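A minimal `unittest`-style sketch; `tokenize` is a hypothetical helper standing in for a text2text function, which real tests would import from the package instead:

```python
import io
import unittest

def tokenize(text):
    # Hypothetical function under test; replace with an import from
    # the text2text package in real tests.
    return text.lower().split()

class TestTokenize(unittest.TestCase):
    def test_lowercases_and_splits(self):
        self.assertEqual(tokenize("Hello World"), ["hello", "world"])

    def test_empty_input(self):
        self.assertEqual(tokenize(""), [])

# Load and run the suite programmatically, as CI would do via
# `python -m unittest discover`.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTokenize)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```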

Turn the Colab demo notebook into integration tests: https://colab.research.google.com/drive/1LE_ifTpOGO5QJCKNQYtZe6c_tjbwnulR https://github.com/artitw/text2text/blob/master/text2text_demo.ipynb
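One simple way to do this is to execute the notebook's code cells in a shared namespace and let in-cell assertions fail the test. The notebook dict below is a tiny stand-in for the real `text2text_demo.ipynb`, which an integration test would load from disk with `json.load`:

```python
import json

# Minimal stand-in for the demo notebook's JSON (hypothetical content);
# the real test would read text2text_demo.ipynb instead.
notebook = json.loads("""
{
  "cells": [
    {"cell_type": "markdown", "source": ["# Demo"]},
    {"cell_type": "code", "source": ["result = 2 + 2\\n"]},
    {"cell_type": "code", "source": ["assert result == 4\\n"]}
  ]
}
""")

def run_notebook(nb):
    # Execute each code cell in a shared namespace, as the notebook
    # would; any failing assert inside a cell fails the test run.
    ns = {}
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            exec("".join(cell["source"]), ns)
    return ns

ns = run_notebook(notebook)
```

Tools like `nbconvert --execute` or `nbval` offer more robust versions of the same idea.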

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop-word, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer...
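A minimal sketch of the STF-IDF idea, using character n-grams as a stand-in for a trained subword tokenizer (e.g. BPE) and a hand-rolled TF-IDF index; note there are no stop-word or stemming rules anywhere:

```python
import math
from collections import Counter

def subword_tokens(text, n=3):
    # Character n-grams approximate a learned subword vocabulary and
    # work the same way for any language.
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def tfidf_index(docs, n=3):
    token_lists = [subword_tokens(d, n) for d in docs]
    df = Counter()
    for toks in token_lists:
        df.update(set(toks))
    # Smoothed IDF; +1 keeps tokens that appear in every document
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in df}
    vectors = [
        {t: (c / len(toks)) * idf[t] for t, c in Counter(toks).items()}
        for toks in token_lists
    ]
    return vectors, idf

def search(query, vectors, idf, n=3):
    toks = subword_tokens(query, n)
    tf = Counter(toks)
    q = {t: (c / len(toks)) * idf[t] for t, c in tf.items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    return max(range(len(vectors)), key=lambda i: cosine(q, vectors[i]))

docs = ["the cat sat on the mat", "dogs chase cats", "el gato duerme"]
vectors, idf = tfidf_index(docs)
best = search("cats", vectors, idf)
```

The same index answers queries in any language the subword vocabulary covers, e.g. `search("gato", vectors, idf)` retrieves the Spanish document.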