Yuhang Tao

Results: 4 comments of Yuhang Tao

```python
def tokenize_function(examples):
    return tokenizer(examples["text"], max_length=512, padding="max_length", truncation=True)

raw_datasets = load_dataset(extension, data_files=data_files)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```

The jsonl data is loaded directly with `datasets` and converted straight to ids, with no other preprocessing. May I ask whether there are other possible causes, or how a case like this should be debugged?
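If it helps, one quick sanity check is to decode a few tokenized examples back and compare them with the raw records. This is only a minimal sketch, assuming the `tokenizer`, `raw_datasets`, and `tokenized_datasets` from the snippet above and a `"train"` split (adjust the split name if yours differs):

```python
# Sanity check: decode a few tokenized samples and compare them with the raw text.
# Assumes `tokenizer`, `raw_datasets`, and `tokenized_datasets` defined above,
# and that the data lives under a "train" split.
for i in range(3):
    raw_text = raw_datasets["train"][i]["text"]
    ids = tokenized_datasets["train"][i]["input_ids"]
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    print("RAW    :", raw_text[:100])
    print("DECODED:", decoded[:100])
    print("-" * 40)
```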

Training code:

```python
import os
from datasets import load_dataset
from transformers import BertTokenizer, BartForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

# ========= Constants =========
path = "models/NLP/bart-large-chinese"
text_column = "input_text"
summary_column = "output_text"
max_source_length = 384
...
```
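Since the snippet above is cut off, here is a minimal sketch, not the original code, of how the remaining pieces of such a `Seq2SeqTrainer` setup are usually wired together, continuing from the imports and constants shown. The `max_target_length`, the data file path, and the training arguments are assumptions:

```python
# Minimal sketch, NOT the original truncated code: typical completion of a
# BART summarization fine-tune using the imports/constants defined above.
max_target_length = 128                       # assumption; the original value is not shown
data_files = {"train": "data/train.jsonl"}    # hypothetical path

tokenizer = BertTokenizer.from_pretrained(path)
model = BartForConditionalGeneration.from_pretrained(path)

def preprocess_function(examples):
    # bart-large-chinese ships a BertTokenizer, so source and target share one vocab.
    model_inputs = tokenizer(examples[text_column],
                             max_length=max_source_length, truncation=True)
    labels = tokenizer(examples[summary_column],
                       max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

raw_datasets = load_dataset("json", data_files=data_files)
tokenized = raw_datasets.map(preprocess_function, batched=True,
                             remove_columns=raw_datasets["train"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="outputs/bart-large-chinese-sum",  # hypothetical
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    # DataCollatorForSeq2Seq pads dynamically and sets label padding to -100.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```

Leaving the padding to `DataCollatorForSeq2Seq` (rather than `padding="max_length"` in preprocessing) keeps batches smaller and masks label padding from the loss automatically.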

> Maybe it's some other part inside `datasets`? You could print some intermediate results and take a look.

Indeed, print really is the universal debugging method, haha.
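For reference, a minimal sketch of what printing intermediate results can look like here, reusing the `tokenize_function` logic from the earlier snippet; the one-shot print guard and `load_from_cache_file=False` are just for illustration:

```python
# Print the first batch's raw text and token ids while mapping, to narrow down
# where things go wrong. Assumes `tokenizer` and `raw_datasets` from the earlier snippet.
_printed = {"done": False}

def tokenize_function_debug(examples):
    outputs = tokenizer(examples["text"], max_length=512,
                        padding="max_length", truncation=True)
    if not _printed["done"]:
        print("first raw text :", examples["text"][0][:80])
        print("first input_ids:", outputs["input_ids"][0][:20])
        _printed["done"] = True
    return outputs

# load_from_cache_file=False forces recomputation so the prints actually run
# instead of being skipped because of a cached result.
debug_datasets = raw_datasets.map(tokenize_function_debug, batched=True,
                                  load_from_cache_file=False)
```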

```shell
INFO:root:Logging to file: /opt/repo/data-clean/audio/audio2txt/logs/20240929105547.log
2024-09-29 10:55:47,489 - INFO - Logging to file: /opt/repo/data-clean/audio/audio2txt/logs/20240929105547.log
INFO:root:Found 10 audio files in /opt/data/douyin/recorder
2024-09-29 10:55:47,494 - INFO - Found 10 audio files in...
```