PASTA
Pan-Tumor Radiology Foundation Model Utilizing Synthetic Training Data for Advanced Oncological Insights
A Synthetic Data-Driven Radiology Foundation Model for Pan-tumor Clinical Diagnosis
Overview
PASTA (Pan-Tumor Analysis with Synthetic Training Augmentation) is a data-efficient foundation model for analyzing diverse tumor lesions in 3D CT scans. Leveraging PASTA-Gen-30K, a large-scale synthetic dataset of 30,000 CT volumes with precise lesion masks and structured textual reports, PASTA addresses the scarcity of high-quality annotated data that traditionally hinders radiological AI research.
PASTA achieves state-of-the-art results on a wide range of tasks, including:
- Tumor detection in plain CT
- Lesion segmentation
- Tumor staging
- Survival prediction
- Structured report generation
- Cross-modality transfer learning
Workflow of PASTA Model Development and Training Pipeline. a, Overview of organs and lesion types involved in PASTA training. b, Examples of lesions generated by PASTA-Gen from healthy organs. c, Lesion generation process pipeline of PASTA-Gen. d, Two-stage training of PASTA using the PASTA-Gen-30K dataset.
Key Features
- Synthetic Data Backbone: Relies on PASTA-Gen-30K for training, bypassing the privacy constraints and annotation burdens associated with real clinical data.
- Data-Efficient: Excels in few-shot settings, requiring only a small set of real-world annotated scans to reach high performance.
- Pan-Tumor Coverage: Encompasses malignancies across ten organ systems and five benign lesion types, designed for broad oncology analysis.
PASTA-Gen-30K
- Pretraining dataset PASTA-Gen-30K is available at Hugging Face.
- Each synthetic 3D CT volume includes pixel-level lesion annotations and a structured radiological report.
PASTA Pretrained Checkpoint
Installation
Quick Setup
# 1. Create conda environment (Python 3.9-3.11 recommended)
conda create -n pasta python=3.9
conda activate pasta
# 2. Install PyTorch (adjust CUDA version as needed)
# For CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# 3. Install dependencies
pip install -r requirements.txt
# 4. Install local nnUNetv2 (customized version for PASTA)
# This step will install nnUNet commands (nnUNetv2_plan_and_preprocess, etc.)
cd segmentation && pip install -e . && cd ..
# 5. Verify installation
which nnUNetv2_plan_and_preprocess # Should show path in your conda env
What's Installed
After running the installation, you'll have access to these command-line tools:
- nnUNetv2_plan_and_preprocess - Complete preprocessing pipeline
- nnUNetv2_extract_fingerprint - Dataset fingerprint extraction
- nnUNetv2_plan_experiment - Experiment planning
- nnUNetv2_preprocess - Data preprocessing only
- nnUNetv2_train - Model training
- nnUNetv2_predict - Inference
Data Standardization
⚠️ All tasks require data standardization as the first step!
Quick Start
Standardize your dataset to 1x1x1mm spacing and RAS orientation:
python preprocess/NifitiStandard.py \
-indir /path/to/original/data/root \
-outdir /path/to/save/data/root
Details
- CT images (minimum intensity < -10): linear interpolation
- Segmentation labels (minimum intensity > -10): nearest-neighbor interpolation
- Segmentation labels are not needed for classification tasks
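The interpolation rule above can be sketched as follows. This is a minimal illustration using `scipy.ndimage.zoom`, not the actual `preprocess/NifitiStandard.py` (which additionally handles NIfTI I/O and RAS reorientation):

```python
# Sketch of the standardization heuristic: resample a volume to isotropic
# 1 mm spacing, picking the interpolation order from the minimum intensity.
import numpy as np
from scipy.ndimage import zoom

def standardize(volume: np.ndarray, spacing: tuple) -> np.ndarray:
    """Resample `volume` (voxel `spacing` in mm) to 1x1x1 mm."""
    # Heuristic from the README: CT intensities go well below -10 HU,
    # while label maps contain small non-negative integers.
    is_label = volume.min() > -10
    order = 0 if is_label else 1  # nearest-neighbor for labels, linear for CT
    return zoom(volume, zoom=spacing, order=order)

# A 10^3 volume at 2 mm spacing resamples to 20^3 at 1 mm spacing.
ct = np.linspace(-1000, 1000, 1000, dtype=np.float32).reshape(10, 10, 10)
resampled = standardize(ct, spacing=(2.0, 2.0, 2.0))
print(resampled.shape)  # (20, 20, 20)
```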
Feature Extraction
After standardizing your dataset, you can extract features from the PASTA pretrained model with the example script:
python feature_extraction.py
Segmentation
⚠️ Segmentation tasks require nnUNet format. Please read nnUNet Dataset Format Documentation first to familiarize yourself with the data organization.
Setup Environment Variables
Setup nnUNet paths:
# Permanent (add to ~/.bashrc)
echo 'export nnUNet_raw="/path/to/nnUNet_raw"' >> ~/.bashrc
echo 'export nnUNet_preprocessed="/path/to/nnUNet_preprocessed"' >> ~/.bashrc
echo 'export nnUNet_results="/path/to/nnUNet_results"' >> ~/.bashrc
source ~/.bashrc
What are these paths?
- nnUNet_raw: Your datasets in nnUNet format (you create this)
- nnUNet_preprocessed: Auto-generated during training
- nnUNet_results: Training outputs and checkpoints
Data Preparation
Step 1: Standardize (if not done yet)
python preprocess/NifitiStandard.py -indir /path/to/original -outdir /path/to/standardized
Step 2: Organize in nnUNet Format
Organize standardized data following nnUNet format:
$nnUNet_raw/DatasetXXX_TaskName/
├── imagesTr/
│ ├── case001_0000.nii.gz
│ └── case002_0000.nii.gz
├── labelsTr/
│ ├── case001.nii.gz
│ └── case002.nii.gz
└── dataset.json
Key points:
- Images: caseName_0000.nii.gz (0000 for single-channel CT)
- Labels: caseName.nii.gz (same name without _0000)
- Create dataset.json with metadata (see nnUNet docs)
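For orientation, a minimal dataset.json for a single-channel CT task with one foreground class might look like the snippet below (field names follow the nnUNet v2 documentation; treat this as a sketch and consult the nnUNet docs for the authoritative schema):

```python
# Hypothetical minimal dataset.json for a one-class CT segmentation task.
import json

dataset_json = {
    "channel_names": {"0": "CT"},             # the _0000 suffix maps to channel 0
    "labels": {"background": 0, "tumor": 1},  # class name -> integer label
    "numTraining": 2,                         # number of cases in imagesTr/
    "file_ending": ".nii.gz",
}
print(json.dumps(dataset_json, indent=2))
```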
Step 3: nnUNet Preprocessing
Set environment variables and run preprocessing:
# Set nnUNet paths
export nnUNet_raw="/path/to/nnUNet_raw"
export nnUNet_preprocessed="/path/to/nnUNet_preprocessed"
export nnUNet_results="/path/to/nnUNet_results"
# Run preprocessing (replace XXX with your dataset ID)
nnUNetv2_plan_and_preprocess -d XXX --verify_dataset_integrity
Example:
# For Dataset001_Adrenal
export nnUNet_raw="/data/PASTA/nnUNet_raw"
export nnUNet_preprocessed="/data/PASTA/nnUNet_preprocessed"
export nnUNet_results="/data/PASTA/nnUNet_results"
nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity
What this does:
- Verifies dataset integrity (all images have corresponding labels)
- Extracts dataset fingerprint (spacing, intensity statistics, etc.)
- Plans experiment configurations (patch size, batch size, etc.)
- Preprocesses all data (resampling, normalization, etc.)
Training & Inference
Finetuning
python segmentation/nnunetv2/run/run_finetuning_pasta.py \
3d_fullres PASTATrainer TASKID FOLD \
-pretrained_weights /path/to/PASTA_final.pth
Few-shot Training
First, modify splits_final.json to set the number of training samples per fold, then:
python segmentation/nnunetv2/run/run_finetuning_pasta.py \
3d_fullres PASTATrainer_fewshot TASKID FOLD \
-pretrained_weights /path/to/PASTA_final.pth
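Editing splits_final.json by hand works, but it can also be scripted. The helper below is hypothetical (not part of the repo); it assumes splits_final.json is a list of folds, each a dict with "train" and "val" case-name lists, and truncates each training split to k cases:

```python
# Hypothetical helper: shrink each fold's training split to k cases
# for few-shot finetuning. Assumes the standard nnUNet splits layout:
# [{"train": [...], "val": [...]}, ...], one dict per fold.
import json
import random

def make_fewshot_splits(splits, k, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    fewshot = []
    for fold in splits:
        train = list(fold["train"])
        rng.shuffle(train)
        fewshot.append({"train": train[:k], "val": fold["val"]})
    return fewshot

splits = [{"train": [f"case{i:03d}" for i in range(20)], "val": ["case100"]}]
fs = make_fewshot_splits(splits, k=5)
print(len(fs[0]["train"]))  # 5
```

The validation split is left untouched so that few-shot results stay comparable with full finetuning.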
Inference
# Finetuning model
python segmentation/inference/inference.py \
-indir $nnUNet_raw/Dataset00x_xxx/imagesTr \
-outdir $nnUNet_raw/Dataset00x_xxx/predictions \
-split_json $nnUNet_raw/Dataset00x_xxx/splits_final.json \
-trainer PASTATrainer_ft
# Few-shot model
python segmentation/inference/inference.py \
-indir $nnUNet_raw/Dataset00x_xxx/imagesTr \
-outdir $nnUNet_raw/Dataset00x_xxx/predictions_fewshot \
-split_json $nnUNet_raw/Dataset00x_xxx/splits_final.json \
-trainer PASTATrainer_fewshot
The fine-tuned segmentation PASTA model checkpoint for public datasets is available here.
Metric Calculation
python segmentation/inference/cal_metric.py \
-predic_root /path/to/predictions \
-label_root $nnUNet_raw/Dataset00x_xxx/labelsTr \
-fg_class_num NUM_CLASSES
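For reference, per-class Dice is the standard segmentation metric; the sketch below shows how it is computed for one foreground class (the actual cal_metric.py may report additional metrics):

```python
# Per-class Dice coefficient: 2*|P ∩ L| / (|P| + |L|) for one class id.
import numpy as np

def dice(pred: np.ndarray, label: np.ndarray, cls: int) -> float:
    p, l = (pred == cls), (label == cls)
    denom = p.sum() + l.sum()
    # Convention: empty prediction and empty label count as a perfect match.
    return 1.0 if denom == 0 else float(2.0 * np.logical_and(p, l).sum() / denom)

pred = np.array([[0, 1], [1, 1]])
label = np.array([[0, 1], [0, 1]])
print(round(dice(pred, label, cls=1), 3))  # 2*2 / (3+2) = 0.8
```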
Classification
Plain-CT tumor detection
For plain-CT tumor detection tasks, please prepare your dataset following the instructions in preprocess/Crop_plainct_tumor.ipynb.
Training:
Adjust train_image_list, train_label_list, valid_image_list, valid_label_list, and pretrained_model_path in the training script:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=$PORT ./train/train_classify_binary.py \
--train_image_list config/data/PlainCT/yourdata/fold_0/train_image.txt \
--train_label_list config/data/PlainCT/yourdata/fold_0/train_label.txt \
--valid_image_list config/data/PlainCT/yourdata/fold_0/valid_image.txt \
--valid_label_list config/data/PlainCT/yourdata/fold_0/valid_label.txt \
--net_type Generic_UNet_classify \
--input_channel 1 \
--output_channel 2 \
--base_feature_number 64 \
--pretrained_model_path /path/to/pasta_checkpoint/PASTA_final.pth \
--model_save_name weights/PASTA_classify/PASTA_classify_plainct_fold_0 \
--batch_size 8 \
--num_workers 6 \
--learning_rate 1e-4 \
--decay 1e-5 \
--total_step 5000 \
--start_step 0 \
--save_step 1000 \
--log_freq 100 \
--accumulation_steps 1 \
--class_num 2 \
--class_weight 1 10 \
--crop_shape 128 128 128
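The `--class_weight 1 10` flag presumably applies class-weighted cross-entropy, so mistakes on the rare tumor class cost 10x more than on the dominant background class. A small numpy illustration of that effect (the normalization by summed weights follows the common PyTorch-style convention; the repo's exact loss may differ):

```python
# Illustration of class-weighted cross-entropy, the assumed effect of
# --class_weight 1 10: losses on the tumor class count 10x more.
import numpy as np

def weighted_ce(probs: np.ndarray, labels: np.ndarray, weights) -> float:
    """probs: (N, C) softmax outputs; labels: (N,) class ids; weights: (C,)."""
    w = np.asarray(weights, dtype=float)[labels]     # per-sample weight
    nll = -np.log(probs[np.arange(len(labels)), labels])
    return float((w * nll).sum() / w.sum())          # weight-normalized mean

probs = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
labels = np.array([0, 1])
# Up-weighting class 1 raises the loss, since class 1 is predicted less confidently.
print(weighted_ce(probs, labels, [1.0, 1.0]) < weighted_ce(probs, labels, [1.0, 10.0]))  # True
```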
The template 5-fold training scripts for PASTA, SupreM, ModelsGenesis, and UNet are at:
bash classification/scripts/plainct/full/PASTA/train_all.sh
bash classification/scripts/plainct/full/SupreM/train_all.sh
bash classification/scripts/plainct/full/modelgenesis/train_all.sh
bash classification/scripts/plainct/full/UNet/train_all.sh
Inference:
Adjust valid_image_list, valid_label_list, and pretrained_model_path in the inference script. Here, pretrained_model_path is the path to your finetuned checkpoint:
CUDA_VISIBLE_DEVICES=0 python test/test_classify_binary.py \
--valid_image_list config/data/PlainCT/yourdata/fold_0/valid_image.txt \
--valid_label_list config/data/PlainCT/yourdata/fold_0/valid_label.txt \
--net_type Generic_UNet_classify \
--input_channel 1 \
--output_channel 2 \
--base_feature_number 64 \
--batch_size 1 \
--pretrained_model_path weights/PASTA_classify/PASTA_classify_plainct_fold_0_best.tar \
--class_num 2 \
--crop_shape 128 128 128 \
--output_json results/classify/Plain-CT/PASTA/fold_0.json
The template 5-fold inference scripts for PASTA, SupreM, ModelsGenesis, and UNet are at:
bash classification/scripts/plainct/full/PASTA/test_all.sh
bash classification/scripts/plainct/full/SupreM/test_all.sh
bash classification/scripts/plainct/full/modelgenesis/test_all.sh
bash classification/scripts/plainct/full/UNet/test_all.sh
Acknowledgement
- We thank the authors of nnUNet, STU-Net, FMCIB, and SynTumor for their great work. Please cite their papers if you use our code.
@article{isensee2021nnu,
title={nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation},
author={Isensee, Fabian and Jaeger, Paul F and Kohl, Simon AA and Petersen, Jens and Maier-Hein, Klaus H},
journal={Nature methods},
volume={18},
number={2},
pages={203--211},
year={2021},
publisher={Nature Publishing Group}
}
@article{huang2023stu,
title={Stu-net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training},
author={Huang, Ziyan and Wang, Haoyu and Deng, Zhongying and Ye, Jin and Su, Yanzhou and Sun, Hui and He, Junjun and Gu, Yun and Gu, Lixu and Zhang, Shaoting and others},
journal={arXiv preprint arXiv:2304.06716},
year={2023}
}
@article{pai2024foundation,
title={Foundation model for cancer imaging biomarkers},
author={Pai, Suraj and Bontempi, Dennis and Hadzic, Ibrahim and Prudente, Vasco and Soka{\v{c}}, Mateo and Chaunzwa, Tafadzwa L and Bernatz, Simon and Hosny, Ahmed and Mak, Raymond H and Birkbak, Nicolai J and others},
journal={Nature machine intelligence},
volume={6},
number={3},
pages={354--367},
year={2024},
publisher={Nature Publishing Group UK London}
}
@inproceedings{hu2023label,
title={Label-free liver tumor segmentation},
author={Hu, Qixin and Chen, Yixiong and Xiao, Junfei and Sun, Shuwen and Chen, Jieneng and Yuille, Alan L and Zhou, Zongwei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={7422--7432},
year={2023}
}