shuoyangd

Results 6 issues of shuoyangd

Hi, I noticed that when searching the ACL Anthology, sorting results by relevance (the default) always returns many more results than sorting by year of publication. For example, when...

This is the refactored implementation from the MTMA-2022 [kNN-based Retrieval Module for Sockeye](https://docs.google.com/document/d/1_Lea0E4g-VyqiRTqVbMXfA9iLTcJyQNVQDdkiKosoJ4/edit#heading=h.656mkzwobicc) project. It re-implements the models as in [Khandelwal et al. 2021](https://arxiv.org/abs/2010.00710). #### Pull Request Checklist ## -...

## Description This PR adds support for parallel data curation. Namely: - A new dataset class `ParallelDataset` that supports loading and writing parallel data in simple bitext format. - A...

## Description Add a modifier that performs regex replacements. ## Usage ``` regex_params = [ {"pattern": "ö", "repl": "o"}, { "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]", "repl": "", }, ] modifier = RegexModifier(regex_params)...
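As a rough illustration of the idea in the preview above, here is a minimal sketch of a modifier that applies a list of regex replacements in order. The class name `SimpleRegexModifier` and its `modify` method are hypothetical stand-ins, not the PR's actual `RegexModifier` implementation; only the `pattern`/`repl` parameter shape is taken from the usage snippet, and the second pattern is shortened for readability.

```python
import re
from typing import Dict, List


class SimpleRegexModifier:
    """Hypothetical stand-in for illustration; not the PR's RegexModifier."""

    def __init__(self, regex_params: List[Dict[str, str]]):
        # Compile each pattern once and keep the rules in the order given.
        self.rules = [
            (re.compile(params["pattern"]), params["repl"])
            for params in regex_params
        ]

    def modify(self, text: str) -> str:
        # Apply every replacement in sequence; later rules see the
        # output of earlier ones.
        for pattern, repl in self.rules:
            text = pattern.sub(repl, text)
        return text


regex_params = [
    {"pattern": "ö", "repl": "o"},
    {"pattern": "[^ A-Za-z]", "repl": ""},  # shortened allow-list (assumption)
]
modifier = SimpleRegexModifier(regex_params)
print(modifier.modify("Schön!"))
```

Rule order matters in this sketch: mapping `ö` to `o` first keeps the letter, whereas running the allow-list rule first would delete it.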

## Description Currently, the FastTextLangId filter only supports filtering by a language ID filter, but sometimes we know what language the data is supposed to be in, and it would be...

## Description This PR adds skip labels to filtered entries in the JSON/Parquet outputs instead of removing those entries entirely. When this feature is enabled, it...