shuoyangd
Hi, I realized that when searching the ACL Anthology, sorting results by relevance (the default) always returns many more results than sorting by year of publication. For example, when...
This is the refactored implementation from the MTMA-2022 [kNN-based Retrieval Module for Sockeye](https://docs.google.com/document/d/1_Lea0E4g-VyqiRTqVbMXfA9iLTcJyQNVQDdkiKosoJ4/edit#heading=h.656mkzwobicc) project. It re-implements the models from [Khandelwal et al. 2021](https://arxiv.org/abs/2010.00710). #### Pull Request Checklist ## -...
## Description
This PR adds support for parallel data curation. Namely:
- A new dataset class `ParallelDataset` that supports loading and writing parallel data in simple bitext format.
- A...
## Description
Add a modifier that performs regex replacements.
## Usage
```
regex_params = [
    {"pattern": "ö", "repl": "o"},
    {
        "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]",
        "repl": "",
    },
]
modifier = RegexModifier(regex_params)
...
```
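To illustrate what a list of `pattern`/`repl` pairs like the one above might do, here is a minimal sketch. It assumes the rules are applied in order with `re.sub` semantics. `RegexModifierSketch` and its `modify` method are hypothetical names used only for this illustration, and the second pattern is a simplified stand-in; none of this is taken from the PR itself.

```python
import re


class RegexModifierSketch:
    """Hypothetical stand-in, not the PR's class: applies regex rules in order."""

    def __init__(self, regex_params):
        # Compile each rule once; every entry is a dict with "pattern" and "repl".
        self._rules = [(re.compile(p["pattern"]), p["repl"]) for p in regex_params]

    def modify(self, text):
        # Apply the rules sequentially, so later rules see earlier replacements.
        for pattern, repl in self._rules:
            text = pattern.sub(repl, text)
        return text


regex_params = [
    {"pattern": "ö", "repl": "o"},               # normalize one character
    {"pattern": "[^ a-zA-Z0-9!?]", "repl": ""},  # simplified allow-list (assumption)
]
modifier = RegexModifierSketch(regex_params)
cleaned = modifier.modify("Schön, wirklich?")
print(cleaned)
```

Rule order matters in this sketch: the `ö` replacement runs before the allow-list rule, so the character is normalized rather than deleted.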
## Description
Currently, the FastTextLangId filter only supports filtering by a language ID filter, but sometimes we already know what language the data is supposed to be in, and it would be...
## Description
This PR adds an option to attach skip labels to filtered entries in the JSON/Parquet outputs instead of removing those entries entirely. When this feature is enabled, it...
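The idea of labeling filtered entries rather than dropping them can be sketched as follows. This is only an illustration under stated assumptions: the function `label_skipped`, the field name `_skip`, and the example filter are all hypothetical and are not taken from the PR.

```python
def label_skipped(records, keep, skip_field="_skip"):
    """Return copies of all records, marking the ones a filter would have removed.

    `keep` is a predicate returning True for records that pass the filter.
    The field name `_skip` is an assumption made for this sketch.
    """
    labeled = []
    for record in records:
        out = dict(record)  # copy so the input records are left untouched
        if not keep(record):
            out[skip_field] = True  # label the entry instead of dropping it
        labeled.append(out)
    return labeled


records = [{"text": "hello world"}, {"text": ""}]
labeled = label_skipped(records, keep=lambda r: len(r["text"]) > 0)
print(labeled)
```

With this approach the output keeps the same number of rows as the input, so downstream steps can decide for themselves whether to honor the label.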