bart_ls
Long-context pretrained encoder-decoder models
Adapting Pretrained Text-to-Text Models for Long Text Sequences
This repo contains code and checkpoints to reproduce the results of the paper: Adapting Pretrained Text-to-Text Models for Long Text Sequences. We further pretrain the BART model for long-sequence tasks, setting a new state of the art on abstractive summarization of long texts (e.g., GovReport, BookSum, SummScreen, QMSum). Our implementation is based on custom forks of fairseq and xformers. You can use this repo to finetune on your own long-context tasks or to implement efficient long-context models on top of the fast fairseq package.
Environment Setup
Our models were developed with A100 GPUs, CUDA 11.4, and PyTorch 1.12.1. Exact result numbers might vary due to environment differences.
- Install xformers and fairseq by running pip install -e . under each of their directories. Install apex following https://github.com/NVIDIA/apex.
- Install Triton (to suppress errors from xformers):
pip install triton
- Install pyrouge and rouge_score for summarization evaluation:
pip install -U git+https://github.com/pltrdy/pyrouge
pip install rouge_score
Summarization Performance
| Method | GovReport ROUGE-1/2 (# Params) | BookSum-Chapters ROUGE-1/2 (# Params) | SummScreen-FD ROUGE-1/2 (# Params) | SummScreen-TVM ROUGE-1/2 (# Params) |
|---|---|---|---|---|
| Previous SOTA | 61.0/28.8 (525M) | 38.3/9.2 (660M) | 36.8/9.2 (660M) | 51.0/14.7 (660M) |
| BART-LS (ours, 440M) | 62.0/30.9 | 38.5/10.3 | 39.1/10.7 | 51.8/17.2 |
Model Checkpoints
| Model Description | Download |
|---|---|
| Pretrained Model | model_100k.pt |
| Finetuned checkpoint on GovReport | model_gov.pt |
| Finetuned checkpoint on SummScreen-FD | model_fd.pt |
| Finetuned checkpoint on BookSum | model_book.pt |
| Dictionary/vocabulary file | dict.txt |
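A quick way to sanity-check a downloaded checkpoint before wiring it into the scripts below is to inspect it with plain torch.load. This is just a convenience sketch, not something the repo requires; the path is a placeholder for whichever checkpoint you downloaded.

```python
import torch

# Placeholder path: any of the downloaded checkpoints, e.g. the pretrained model.
ckpt = torch.load("checkpoints/model_100k.pt", map_location="cpu")

# fairseq checkpoints are plain dicts; typical keys include the model weights
# ("model") and the training configuration.
print(sorted(ckpt.keys()))
print(len(ckpt["model"]), "parameter tensors in the state dict")
```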
Code Structure
Tasks
- Pretraining task: fairseq-py/fairseq/tasks/long_denoising.py
- Summarization task: fairseq-py/fairseq/tasks/summarization.py
Architectures
- Pooling layers: fairseq-py/fairseq/models/long_transformers/pooling_layers.py
- Block Attention: xformers/xformers/components/attention/block_noglobal.py (a conceptual sketch follows this list)
- Integration to fairseq's transformer architecture: fairseq-py/fairseq/modules/multihead_attention.py
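For intuition, the block attention restricts each token to attend only within its local block, which is what keeps the attention cost linear in sequence length. Below is a conceptual, self-contained sketch of blockwise local attention in plain PyTorch; it is not the repo's implementation (block_noglobal.py additionally handles padding masks and the batch/head layout), and the function name and block size are illustrative.

```python
import torch
import torch.nn.functional as F

def block_local_attention(q, k, v, block_size=1024):
    """Attention restricted to non-overlapping local blocks (conceptual sketch).

    q, k, v: (batch * heads, seq_len, head_dim); seq_len is assumed to be
    padded to a multiple of block_size.
    """
    bsz, seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    # Fold the sequence into blocks so attention is computed independently per block.
    q = q.reshape(bsz, n_blocks, block_size, dim)
    k = k.reshape(bsz, n_blocks, block_size, dim)
    v = v.reshape(bsz, n_blocks, block_size, dim)
    scores = q @ k.transpose(-2, -1) / dim ** 0.5   # (bsz, n_blocks, block, block)
    probs = F.softmax(scores, dim=-1)
    out = probs @ v                                 # (bsz, n_blocks, block, dim)
    return out.reshape(bsz, seq_len, dim)
```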
Alternative Attention Implementations
Apart from the block attention implemented with native PyTorch operations, we also provide a faster Triton-based version within xformers: xformers/xformers/components/attention/blocksparse_local.py. This implementation brings roughly 20-30% efficiency gains at the cost of slightly worse results. To enable this option, simply pass --attention-name bs_local. You can easily implement other attention architectures without worrying about the rest of the transformer blocks.
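If you want to plug in your own attention variant, the xformers component registry is the natural hook. The sketch below is an assumption-heavy illustration: it assumes the fork keeps upstream xformers' register_attention / Attention / AttentionConfig API, the name "my_block_attention" and its config class are hypothetical, and the forward body is a plain full-attention placeholder rather than an efficient pattern.

```python
from dataclasses import dataclass
from typing import Optional

import torch
from xformers.components.attention import Attention, AttentionConfig, register_attention


@dataclass
class MyBlockAttentionConfig(AttentionConfig):
    block_size: int = 1024  # hypothetical extra hyperparameter


# "my_block_attention" is a hypothetical name; once registered, it could be
# selected the same way as bs_local (e.g. --attention-name my_block_attention).
@register_attention("my_block_attention", MyBlockAttentionConfig)
class MyBlockAttention(Attention):
    def __init__(self, dropout: float = 0.0, block_size: int = 1024, *args, **kwargs):
        super().__init__()
        self.block_size = block_size
        self.drop = torch.nn.Dropout(dropout)

    def forward(self, q, k, v, att_mask: Optional[torch.Tensor] = None, *args, **kwargs):
        # Placeholder: plain full attention with an optional additive mask;
        # replace with your own sparse/blockwise pattern.
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        if att_mask is not None:
            scores = scores + att_mask
        return self.drop(scores.softmax(dim=-1)) @ v
```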
Instructions for finetuning the pretrained model
- Prepare raw data. Organize your data as {train|val|test}.{src|tgt}, where each line corresponds to one example (see the sketch after this list).
- Under fairseq-py/, binarize the data with bash ./scripts/summarization/binarize.sh. For query-based summarization, check fairseq-py/scripts/summarization/qmsum_preprocess.sh.
- The hyperparameters we used for each dataset can be found in fairseq-py/fb_sweep/long_finetune/sweep_summ.py. After downloading the checkpoints and putting them under checkpoints/, run finetuning with:
bash scripts/summarization/ft_summ.sh
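As referenced in the first step above, here is a minimal sketch of writing raw files in the expected {train|val|test}.{src|tgt} layout. The example pairs and output directory are placeholders; the only requirement taken from the instructions is one example per line, with line i of the .src file aligned to line i of the .tgt file.

```python
import os

# Hypothetical in-memory dataset: lists of (source_document, target_summary) pairs.
splits = {
    "train": [("full document text ...", "its summary ...")],
    "val": [("another document ...", "another summary ...")],
    "test": [("a test document ...", "a test summary ...")],
}

out_dir = "raw_data"  # placeholder path
os.makedirs(out_dir, exist_ok=True)

for split, pairs in splits.items():
    with open(os.path.join(out_dir, f"{split}.src"), "w") as src_f, \
         open(os.path.join(out_dir, f"{split}.tgt"), "w") as tgt_f:
        for source, target in pairs:
            # One example per line; flatten newlines so the two files stay line-aligned.
            src_f.write(source.replace("\n", " ").strip() + "\n")
            tgt_f.write(target.replace("\n", " ").strip() + "\n")
```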
Using released summarization checkpoints
Generating summaries on SummScreen
python scripts/summarization/long_generate.py \
--model-dir ../checkpoints/model_fd.pt \
--data-dir ${BINARIZED_DATA} \
--save-dir ${SUMMARY_SAVE_DIR} \
--split valid \
--bsz 4
This script prints ROUGE numbers calculated with rouge_score, the library used by SCROLLS. In our paper, we reported ROUGE scores computed with files2rouge. Please follow their repo to install files2rouge and download Stanford CoreNLP for tokenization.
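For reference, here is a minimal sketch of scoring a single generated summary directly with rouge_score (the metric library named above). The reference and hypothesis strings are placeholders; note that rougeLsum expects sentences separated by newlines.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

# Placeholder texts; in practice, read these from the generated and gold summary files.
reference = "The committee approved the budget.\nIt also set a new deadline."
hypothesis = "The committee passed the budget.\nA new deadline was also set."

scores = scorer.score(target=reference, prediction=hypothesis)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.4f} recall={result.recall:.4f} f1={result.fmeasure:.4f}")
```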
BibTeX
If you find the repo useful, please consider citing our paper:
@article{xiong2022adapting,
title={Adapting Pretrained Text-to-Text Models for Long Text Sequences},
author={Xiong, Wenhan and Gupta, Anchit and Toshniwal, Shubham and Mehdad, Yashar and Yih, Wen-tau},
journal={arXiv preprint arXiv:2209.10052},
year={2022}
}
License
CC-BY-NC 4.0