Xiaohan Zhang

Results 20 issues of Xiaohan Zhang

# What does this PR do? # What issue(s) does this change relate to? # Before submitting - [ ] Have you read the [contributor guidelines](https://github.com/mosaicml/composer/blob/dev/CONTRIBUTING.md)? - [ ] Is...

JIRA: https://databricks.atlassian.net/jira/software/c/projects/STR/issues/STR-141?filter=allissues This script is useful in scenarios where the FT API data input has been malformed. It acts as a preventive measure to ensure data integrity and helps in...

* add notebook/data_validation_notebook which runs data preparation and token counting from byod/data_validation branch. Merged to main to keep underlying functions up-to-date.
* add utils functions used by notebook/data_validation_notebook
* shuffle...

Enable a Delta table as input for CPT. For CPT, you need to provide some tokenizer arguments so that the resulting MDS dataset can be written: `python scripts/data_prep/convert_delta_to_json.py --delta_table_name main.streaming.random_cpt_table --processes 128...`

## Description of changes: Make the merge_index utility run in parallel using multiprocessing. Note that the normal use case for merge_index arises after MDS shards are written to a number of...

## Description of changes: Add an ingestion helper utility for downloading Hugging Face datasets. Building on snapshot_download, some improvements include: - Enable resume = True and retry when the network fails -...
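The retry behaviour mentioned in this entry can be illustrated with a minimal sketch. It is not the PR's actual helper: `retry_download`, its parameters, and the exponential backoff policy are all hypothetical, and the download itself is passed in as a plain callable so the example stays self-contained.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_download(
    download: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `download` until it succeeds; re-raise after the last attempt.

    Hypothetical helper: the name, parameters, and backoff schedule are
    assumptions made for illustration only.
    """
    for attempt in range(max_retries + 1):
        try:
            return download()
        except OSError:
            # ConnectionError and TimeoutError are subclasses of OSError,
            # so typical network failures are caught here.
            if attempt == max_retries:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

In practice the callable might wrap `huggingface_hub.snapshot_download` (for example through `functools.partial` with a `repo_id` and `repo_type="dataset"`), which skips files already present in the target directory when it is called again; whether the PR does it this way is not stated in the snippet.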

## Description of changes: Use temporary cloud credentials to test two functions, "dataframe_to_mds" and "merge_index", on [dbfs:/Volume, s3, gcs, oci]. ## Issue #, if available: ## Merge Checklist: _Put an...

# What does this PR do? Restore dev version to 0.24.0.dev0

## Description of changes: ## Issue #, if available: ## Merge Checklist: _Put an `x` without space in the boxes that apply. If you are unsure about any checklist, please...