
Pan-Tumor Radiology Foundation Model Utilizing Synthetic Training Data for Advanced Oncological Insights

PASTA Logo

PASTA

A Synthetic Data-Driven Radiology Foundation Model for Pan-tumor Clinical Diagnosis

Paper; Dataset; Code

Overview

PASTA (Pan-Tumor Analysis with Synthetic Training Augmentation) is a data-efficient foundation model for analyzing diverse tumor lesions in 3D CT scans. Leveraging PASTA-Gen-30K, a large-scale synthetic dataset of 30,000 CT volumes with precise lesion masks and structured textual reports, PASTA addresses the scarcity of high-quality annotated data that traditionally hinders radiological AI research.

PASTA achieves state-of-the-art results on a wide range of tasks, including:

  • Tumor detection in plain CT
  • Lesion segmentation
  • Tumor staging
  • Survival prediction
  • Structured report generation
  • Cross-modality transfer learning

Workflow of PASTA Model Development and Training Pipeline. a, Overview of organs and lesion types involved in PASTA training. b, Examples of lesions generated by PASTA-Gen from healthy organs. c, Lesion generation process pipeline of PASTA-Gen. d, Two-stage training of PASTA using the PASTA-Gen-30K dataset.

Key Features

  1. Synthetic Data Backbone: Relies on PASTA-Gen-30K for training, bypassing the privacy constraints and annotation burdens associated with real clinical data.
  2. Data-Efficient: Excels in few-shot settings, requiring only a small set of real-world annotated scans to reach high performance.
  3. Pan-Tumor Coverage: Encompasses malignancies across ten organ systems and five benign lesion types, designed for broad oncology analysis.

PASTA-Gen-30K

  • The pretraining dataset PASTA-Gen-30K is available on Hugging Face; a download sketch is shown below.
  • Each synthetic 3D CT volume includes pixel-level lesion annotations and a structured radiological report.
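
A minimal download sketch using the huggingface_hub client; the repo id below is a placeholder, so substitute the actual PASTA-Gen-30K repository id from the Hugging Face page:

# Download the synthetic dataset (repo id is a placeholder, not confirmed)
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/PASTA-Gen-30K",  # placeholder: replace with the real repo id
    repo_type="dataset",
    local_dir="./PASTA-Gen-30K",
)
print(f"Dataset downloaded to {local_dir}")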

PASTA Pretrained Checkpoint

The pretrained checkpoint PASTA_final.pth is the one referenced by the fine-tuning and classification commands below.

Installation

Quick Setup

# 1. Create conda environment (Python 3.9-3.11 recommended)
conda create -n pasta python=3.9
conda activate pasta

# 2. Install PyTorch (adjust CUDA version as needed)
# For CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install local nnUNetv2 (customized version for PASTA)
# This step will install nnUNet commands (nnUNetv2_plan_and_preprocess, etc.)
cd segmentation && pip install -e . && cd ..

# 5. Verify installation
which nnUNetv2_plan_and_preprocess  # Should show path in your conda env

What's Installed

After running the installation, you'll have access to these command-line tools:

  • nnUNetv2_plan_and_preprocess - Complete preprocessing pipeline
  • nnUNetv2_extract_fingerprint - Dataset fingerprint extraction
  • nnUNetv2_plan_experiment - Experiment planning
  • nnUNetv2_preprocess - Data preprocessing only
  • nnUNetv2_train - Model training
  • nnUNetv2_predict - Inference

Data Standardization

⚠️ All tasks require data standardization as the first step!

Quick Start

Standardize your dataset to 1x1x1mm spacing and RAS orientation:

python preprocess/NifitiStandard.py \
    -indir /path/to/original/data/root \
    -outdir /path/to/save/data/root

Details

The script chooses the interpolator from each volume's minimum intensity (see the sketch below):

  • CT images (min < -10): Linear interpolation
  • Segmentation labels (min > -10): Nearest-neighbor interpolation
  • Segmentation labels are not needed for classification tasks
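
A minimal sketch of this standardization logic, assuming SimpleITK; the repo's preprocess/NifitiStandard.py may differ in detail:

# Resample a NIfTI volume to 1x1x1 mm spacing in RAS orientation.
import SimpleITK as sitk

def standardize(in_path: str, out_path: str) -> None:
    img = sitk.ReadImage(in_path)
    img = sitk.DICOMOrient(img, "RAS")  # reorient to RAS

    # Heuristic from above: CT intensities drop well below -10 (air ~ -1000),
    # label maps do not, so the minimum decides the interpolator.
    stats = sitk.MinimumMaximumImageFilter()
    stats.Execute(img)
    interp = sitk.sitkLinear if stats.GetMinimum() < -10 else sitk.sitkNearestNeighbor

    new_spacing = (1.0, 1.0, 1.0)
    new_size = [int(round(sz * sp / ns))
                for sz, sp, ns in zip(img.GetSize(), img.GetSpacing(), new_spacing)]
    resampled = sitk.Resample(img, new_size, sitk.Transform(), interp,
                              img.GetOrigin(), new_spacing, img.GetDirection(),
                              0, img.GetPixelID())
    sitk.WriteImage(resampled, out_path)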

Feature Extraction

After standardizing your dataset, extract features with the pretrained PASTA model using the example script:

python feature_extraction.py
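
For intuition, feature extraction here means pooling a 3D encoder's feature map into one vector per scan. A generic illustration with a stand-in encoder (the tiny network below is not PASTA; the real logic lives in feature_extraction.py):

# Pool a 3D feature map into a per-scan feature vector.
import torch
import torch.nn as nn

encoder = nn.Sequential(  # stand-in for the pretrained PASTA encoder
    nn.Conv3d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

volume = torch.randn(1, 1, 128, 128, 128)   # a standardized CT patch
with torch.no_grad():
    fmap = encoder(volume)                  # (1, 64, 32, 32, 32)
    features = fmap.mean(dim=(2, 3, 4))     # global average pool -> (1, 64)
print(features.shape)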

Segmentation

⚠️ Segmentation tasks require nnUNet format. Please read nnUNet Dataset Format Documentation first to familiarize yourself with the data organization.

Setup Environment Variables

Setup nnUNet paths:

# Permanent (add to ~/.bashrc)
echo 'export nnUNet_raw="/path/to/nnUNet_raw"' >> ~/.bashrc
echo 'export nnUNet_preprocessed="/path/to/nnUNet_preprocessed"' >> ~/.bashrc
echo 'export nnUNet_results="/path/to/nnUNet_results"' >> ~/.bashrc
source ~/.bashrc

What are these paths?

  • nnUNet_raw: Your datasets in nnUNet format (you create this)
  • nnUNet_preprocessed: Auto-generated during training
  • nnUNet_results: Training outputs and checkpoints

Data Preparation

Step 1: Standardize (if not done yet)

python preprocess/NifitiStandard.py -indir /path/to/original -outdir /path/to/standardized

Step 2: Organize in nnUNet Format

Organize standardized data following nnUNet format:

$nnUNet_raw/DatasetXXX_TaskName/
├── imagesTr/
│   ├── case001_0000.nii.gz
│   └── case002_0000.nii.gz
├── labelsTr/
│   ├── case001.nii.gz
│   └── case002.nii.gz
└── dataset.json

Key points:

  • Images: caseName_0000.nii.gz (0000 for single-channel CT)
  • Labels: caseName.nii.gz (same name without _0000)
  • Create dataset.json with metadata (see the nnUNet docs and the sketch below)
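
A minimal sketch for writing dataset.json; the field names follow the nnUNetv2 dataset format, while the dataset name and labels are placeholders to adjust for your task:

# Write a minimal nnUNetv2 dataset.json.
import json
import os

dataset_json = {
    "channel_names": {"0": "CT"},
    "labels": {"background": 0, "tumor": 1},  # adjust to your task
    "numTraining": 2,                         # number of cases in imagesTr
    "file_ending": ".nii.gz",
}

out_dir = os.path.join(os.environ["nnUNet_raw"], "Dataset001_TaskName")  # placeholder name
with open(os.path.join(out_dir, "dataset.json"), "w") as f:
    json.dump(dataset_json, f, indent=4)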

Step 3: nnUNet Preprocessing

Set environment variables and run preprocessing:

# Set nnUNet paths
export nnUNet_raw="/path/to/nnUNet_raw"
export nnUNet_preprocessed="/path/to/nnUNet_preprocessed"
export nnUNet_results="/path/to/nnUNet_results"

# Run preprocessing (replace XXX with your dataset ID)
nnUNetv2_plan_and_preprocess -d XXX --verify_dataset_integrity

Example:

# For Dataset001_Adrenal
export nnUNet_raw="/data/PASTA/nnUNet_raw"
export nnUNet_preprocessed="/data/PASTA/nnUNet_preprocessed"
export nnUNet_results="/data/PASTA/nnUNet_results"

nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity

What this does:

  1. Verifies dataset integrity (all images have corresponding labels)
  2. Extracts dataset fingerprint (spacing, intensity statistics, etc.)
  3. Plans experiment configurations (patch size, batch size, etc.)
  4. Preprocesses all data (resampling, normalization, etc.)

Training & Inference

Finetuning

python segmentation/nnunetv2/run/run_finetuning_pasta.py \
    3d_fullres PASTATrainer TASKID FOLD \
    -pretrained_weights /path/to/PASTA_final.pth

Few-shot Training

First, modify splits_final.json to set the number of training samples per fold (a sketch follows the command), then:

python segmentation/nnunetv2/run/run_finetuning_pasta.py \
    3d_fullres PASTATrainer_fewshot TASKID FOLD \
    -pretrained_weights /path/to/PASTA_final.pth
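
A hedged sketch of the splits edit, assuming the standard nnUNet layout of splits_final.json (a list of {"train": [...], "val": [...]} dicts under $nnUNet_preprocessed/DatasetXXX_TaskName/):

# Cap each fold's training list at N cases for few-shot experiments.
import json
import random

N = 10  # few-shot budget per fold
path = "/path/to/nnUNet_preprocessed/DatasetXXX_TaskName/splits_final.json"

with open(path) as f:
    splits = json.load(f)

random.seed(0)  # reproducible subsampling
for fold in splits:
    fold["train"] = random.sample(fold["train"], k=min(N, len(fold["train"])))

with open(path, "w") as f:
    json.dump(splits, f, indent=4)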

Inference

# Fine-tuned model
python segmentation/inference/inference.py \
    -indir $nnUNet_raw/Dataset00x_xxx/imagesTr \
    -outdir $nnUNet_raw/Dataset00x_xxx/predictions \
    -split_json $nnUNet_raw/Dataset00x_xxx/splits_final.json \
    -trainer PASTATrainer_ft

# Few-shot model
python segmentation/inference/inference.py \
    -indir $nnUNet_raw/Dataset00x_xxx/imagesTr \
    -outdir $nnUNet_raw/Dataset00x_xxx/predictions_fewshot \
    -split_json $nnUNet_raw/Dataset00x_xxx/splits_final.json \
    -trainer PASTATrainer_fewshot

The fine-tuned PASTA segmentation checkpoint for the public datasets is available here.

Metric Calculation

python segmentation/inference/cal_metric.py \
    -predic_root /path/to/predictions \
    -label_root $nnUNet_raw/Dataset00x_xxx/labelsTr \
    -fg_class_num NUM_CLASSES
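
For reference, a minimal sketch of per-class Dice as this kind of script typically computes it (cal_metric.py may report additional metrics):

# Per-class Dice between a prediction and its ground-truth label map.
import nibabel as nib
import numpy as np

def dice_per_class(pred_path: str, label_path: str, fg_class_num: int) -> list[float]:
    pred = np.asarray(nib.load(pred_path).dataobj)
    label = np.asarray(nib.load(label_path).dataobj)
    scores = []
    for c in range(1, fg_class_num + 1):  # foreground classes 1..N
        p, g = (pred == c), (label == c)
        denom = p.sum() + g.sum()
        scores.append(2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0)
    return scores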

Classification

Plain-CT Tumor Detection

For plain-CT tumor detection tasks, prepare your dataset following the instructions in preprocess/Crop_plainct_tumor.ipynb.

Training:

Adjust train_image_list, train_label_list, valid_image_list, valid_label_list, and pretrained_model_path in the training script (a sketch of the list-file format follows the command):

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=$PORT ./train/train_classify_binary.py \
    --train_image_list config/data/PlainCT/yourdata/fold_0/train_image.txt \
    --train_label_list config/data/PlainCT/yourdata/fold_0/train_label.txt \
    --valid_image_list config/data/PlainCT/yourdata/fold_0/valid_image.txt \
    --valid_label_list config/data/PlainCT/yourdata/fold_0/valid_label.txt \
    --net_type Generic_UNet_classify \
    --input_channel 1 \
    --output_channel 2 \
    --base_feature_number 64 \
    --pretrained_model_path /path/to/pasta_checkpoint/PASTA_final.pth \
    --model_save_name weights/PASTA_classify/PASTA_classify_plainct_fold_0 \
    --batch_size 8 \
    --num_workers 6 \
    --learning_rate 1e-4 \
    --decay 1e-5 \
    --total_step 5000 \
    --start_step 0 \
    --save_step 1000 \
    --log_freq 100 \
    --accumulation_steps 1 \
    --class_num 2 \
    --class_weight 1 10 \
    --crop_shape 128 128 128
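
The list files are assumed to be plain text, with one image path per line in *_image.txt and one integer class label per line in *_label.txt, aligned by line number; this layout is an assumption, so check it against the repo's config/data examples. A sketch that writes a fold's lists:

# Write aligned image/label list files for one fold (assumed format).
import os

images = ["/data/plainct/case001.nii.gz", "/data/plainct/case002.nii.gz"]
labels = [0, 1]  # 0 = no tumor, 1 = tumor

fold_dir = "config/data/PlainCT/yourdata/fold_0"
os.makedirs(fold_dir, exist_ok=True)
with open(os.path.join(fold_dir, "train_image.txt"), "w") as f:
    f.write("\n".join(images) + "\n")
with open(os.path.join(fold_dir, "train_label.txt"), "w") as f:
    f.write("\n".join(str(lab) for lab in labels) + "\n")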

Template 5-fold training scripts for PASTA, SupreM, ModelsGenesis, and UNet are at:

bash classification/scripts/plainct/full/PASTA/train_all.sh
bash classification/scripts/plainct/full/SupreM/train_all.sh
bash classification/scripts/plainct/full/modelgenesis/train_all.sh
bash classification/scripts/plainct/full/UNet/train_all.sh

Inference:

Adjust valid_image_list, valid_label_list, and pretrained_model_path in the inference script; here, pretrained_model_path points to your fine-tuned checkpoint:

CUDA_VISIBLE_DEVICES=0 python test/test_classify_binary.py \
    --valid_image_list config/data/PlainCT/yourdata/fold_0/valid_image.txt \
    --valid_label_list config/data/PlainCT/yourdata/fold_0/valid_label.txt \
    --net_type Generic_UNet_classify \
    --input_channel 1 \
    --output_channel 2 \
    --base_feature_number 64 \
    --batch_size 1 \
    --pretrained_model_path weights/PASTA_classify/PASTA_classify_plainct_fold_0_best.tar \
    --class_num 2 \
    --crop_shape 128 128 128 \
    --output_json results/classify/Plain-CT/PASTA/fold_0.json

Template 5-fold inference scripts for PASTA, SupreM, ModelsGenesis, and UNet are at:

bash classification/scripts/plainct/full/PASTA/test_all.sh
bash classification/scripts/plainct/full/SupreM/test_all.sh
bash classification/scripts/plainct/full/modelgenesis/test_all.sh
bash classification/scripts/plainct/full/UNet/test_all.sh

Acknowledgement

@article{isensee2021nnu,
  title={nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation},
  author={Isensee, Fabian and Jaeger, Paul F and Kohl, Simon AA and Petersen, Jens and Maier-Hein, Klaus H},
  journal={Nature Methods},
  volume={18},
  number={2},
  pages={203--211},
  year={2021},
  publisher={Nature Publishing Group}
}

@article{huang2023stu,
  title={{STU-Net}: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training},
  author={Huang, Ziyan and Wang, Haoyu and Deng, Zhongying and Ye, Jin and Su, Yanzhou and Sun, Hui and He, Junjun and Gu, Yun and Gu, Lixu and Zhang, Shaoting and others},
  journal={arXiv preprint arXiv:2304.06716},
  year={2023}
}

@article{pai2024foundation,
  title={Foundation model for cancer imaging biomarkers},
  author={Pai, Suraj and Bontempi, Dennis and Hadzic, Ibrahim and Prudente, Vasco and Soka{\v{c}}, Mateo and Chaunzwa, Tafadzwa L and Bernatz, Simon and Hosny, Ahmed and Mak, Raymond H and Birkbak, Nicolai J and others},
  journal={Nature Machine Intelligence},
  volume={6},
  number={3},
  pages={354--367},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

@inproceedings{hu2023label,
  title={Label-free liver tumor segmentation},
  author={Hu, Qixin and Chen, Yixiong and Xiao, Junfei and Sun, Shuwen and Chen, Jieneng and Yuille, Alan L and Zhou, Zongwei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7422--7432},
  year={2023}
}