Yuhang Tao

Results: 4 comments of Yuhang Tao

```python
def tokenize_function(examples):
    return tokenizer(examples["text"], max_length=512, padding="max_length", truncation=True)

raw_datasets = load_dataset(extension, data_files=data_files)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```

The jsonl data is loaded directly with `datasets` and converted straight to ids, with no other preprocessing. May I ask whether there are other possible causes, or how a case like this should be debugged?
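If it helps, one quick sanity check is to decode a few tokenized examples back and compare them with the raw records. This is only a minimal sketch, assuming the `tokenizer`, `raw_datasets`, and `tokenized_datasets` from the snippet above and a `"train"` split (adjust the split name if yours differs):

```python
# Sanity check: decode a few tokenized samples and compare them with the raw text.
# Assumes `tokenizer`, `raw_datasets`, and `tokenized_datasets` defined above,
# and that the data lives under a "train" split.
for i in range(3):
    raw_text = raw_datasets["train"][i]["text"]
    ids = tokenized_datasets["train"][i]["input_ids"]
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    print("RAW    :", raw_text[:100])
    print("DECODED:", decoded[:100])
    print("-" * 40)
```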

Training code:

```python
import os
from datasets import load_dataset
from transformers import BertTokenizer, BartForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

# ========= Constants =========
path = "models/NLP/bart-large-chinese"
text_column = "input_text"
summary_column = "output_text"
max_source_length = 384
...
```
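Since the snippet above is cut off, here is a minimal sketch, not the original code, of how the remaining pieces of such a `Seq2SeqTrainer` setup are usually wired together, continuing from the imports and constants shown. The `max_target_length`, the data file path, and the training arguments are assumptions:

```python
# Minimal sketch, NOT the original truncated code: typical completion of a
# BART summarization fine-tune using the imports/constants defined above.
max_target_length = 128                       # assumption; the original value is not shown
data_files = {"train": "data/train.jsonl"}    # hypothetical path

tokenizer = BertTokenizer.from_pretrained(path)
model = BartForConditionalGeneration.from_pretrained(path)

def preprocess_function(examples):
    # bart-large-chinese ships a BertTokenizer, so source and target share one vocab.
    model_inputs = tokenizer(examples[text_column],
                             max_length=max_source_length, truncation=True)
    labels = tokenizer(examples[summary_column],
                       max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

raw_datasets = load_dataset("json", data_files=data_files)
tokenized = raw_datasets.map(preprocess_function, batched=True,
                             remove_columns=raw_datasets["train"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="outputs/bart-large-chinese-sum",  # hypothetical
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    # DataCollatorForSeq2Seq pads dynamically and sets label padding to -100.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```

Leaving the padding to `DataCollatorForSeq2Seq` (rather than `padding="max_length"` in preprocessing) keeps batches smaller and masks label padding from the loss automatically.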

> Maybe it's some other part inside `datasets`? You could print some intermediate results and take a look.

Indeed, print really is the universal debugging method, haha.
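For reference, a minimal sketch of what printing intermediate results can look like here, reusing the `tokenize_function` logic from the earlier snippet; the one-shot print guard and `load_from_cache_file=False` are just for illustration:

```python
# Print the first batch's raw text and token ids while mapping, to narrow down
# where things go wrong. Assumes `tokenizer` and `raw_datasets` from the earlier snippet.
_printed = {"done": False}

def tokenize_function_debug(examples):
    outputs = tokenizer(examples["text"], max_length=512,
                        padding="max_length", truncation=True)
    if not _printed["done"]:
        print("first raw text :", examples["text"][0][:80])
        print("first input_ids:", outputs["input_ids"][0][:20])
        _printed["done"] = True
    return outputs

# load_from_cache_file=False forces recomputation so the prints actually run
# instead of being skipped because of a cached result.
debug_datasets = raw_datasets.map(tokenize_function_debug, batched=True,
                                  load_from_cache_file=False)
```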

```shell
INFO:root:Logging to file: /opt/repo/data-clean/audio/audio2txt/logs/20240929105547.log
2024-09-29 10:55:47,489 - INFO - Logging to file: /opt/repo/data-clean/audio/audio2txt/logs/20240929105547.log
INFO:root:Found 10 audio files in /opt/data/douyin/recorder
2024-09-29 10:55:47,494 - INFO - Found 10 audio files in...
```