Longxu Dou issues

Results 6 issues of Longxu Dou

Instructions for Successfully Installing JAMR by Updating Packages

### Reason for Failure The original JAMR was installed on the server in 2015 or 2016, so some packages are now broken or out of date. The default setup script in JAMR...

Welcome to try SailCraft - A data cleaning tool built upon this repository

We extend our gratitude to the authors of this repository! Your documentation and code have greatly benefited the community. We have used this repo in building the data processing pipeline...

Why stopwords_min_cutoff rather than stopwords_max_cutoff?

Thanks for your helpful codebase! I am a bit confused about `stop words filtering`. The released code removes a document if its stop-word ratio is below a certain cutoff. https://github.com/bigscience-workshop/data-preparation/blob/9d0588419073cc5bf0fb92b58f37f2a1016572c3/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py#L590...
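
To make the behavior in question concrete, here is a minimal sketch of a stop-word ratio filter with a minimum cutoff. It is not the repository's implementation: the function names, the whitespace tokenization, and the tiny stop-word set are assumptions chosen for illustration. Only the rule itself (drop a document whose stop-word ratio falls below the cutoff) comes from the issue text.

```python
# Hypothetical stop-word list; a real pipeline would use a per-language list.
STOPWORDS = {"the", "a", "of", "is", "and"}


def stopword_ratio(text, stopwords):
    """Return the fraction of whitespace-separated tokens that are stop words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in stopwords for word in words) / len(words)


def keep_document(text, stopwords, stopwords_min_cutoff):
    """Keep a document only if its stop-word ratio reaches the minimum cutoff."""
    return stopword_ratio(text, stopwords) >= stopwords_min_cutoff


# A sentence with ordinary function words passes the filter;
# a keyword-only string with no stop words does not.
natural = "the cat is on a mat and the dog"
keywords = "buy cheap watches online now"
print(keep_document(natural, STOPWORDS, 0.3))
print(keep_document(keywords, STOPWORDS, 0.3))
```

Under this rule a document is dropped for having too few stop words, not too many, which is the direction the issue title asks about.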

Can't find the Deduplication Report

Thanks for your amazing codebase! I found that the link to the [Deduplication Report](https://chenghaomou.github.io/1%20Projects/BigScience/SubProjects/Deduplication%20report) in `preprocessing/training/01b_oscar_cleaning_and_filtering/deduplicate/README.md` is not accessible. Could you please update it?

About theano version and .theanorc

Hi, I'm amazed by your work. I have encountered some problems while reproducing your model in my environment. I suspect it is because of the complex version...

Incomplete SQL prediction with PICARD

Thank you for this interesting work! I trained a new T5 model from scratch using your script and predicted with PICARD, but encountered a problem. **Modification**: replacing the `COLUMN` with `TABLE.COLUMN`...
