Robust Attribution Regularization
This project accompanies the paper Robust Attribution Regularization. Parts of the code are adapted from MNIST Challenge, CIFAR10 Challenge, Deep traffic sign classification, tflearn oxflower17, and Interpretation of Neural Network is Fragile.
Introduction
This project addresses an emerging problem in trustworthy machine learning: training models that produce robust interpretations of their predictions. See the example below:

Preliminaries
The code has been tested on Ubuntu Linux 16.04.1 with Python 3.6, and requires several Python packages (including TensorFlow) to be installed.
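A minimal environment sketch is shown below. The exact dependency list and versions are not pinned in this README, so treat them as assumptions; since the codebase builds on the TensorFlow 1-era MNIST/CIFAR10 Challenge code, a TensorFlow 1.x release is assumed here.

```bash
# Sketch only: the dependency list and version constraints are assumptions,
# not specified by this README. Adjust to match the actual code.
pip install "tensorflow<2.0" numpy
```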
Downloading Datasets
- MNIST: included in Tensorflow.
- Fashion-MNIST: can be loaded via Tensorflow.
- GTSRB: we provide scripts to download it.
- Flower: we provide scripts to download it.
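For GTSRB and Flower, downloading boils down to running the provided script (also mentioned under Running Experiments below). The directory containing prepare_data.sh is not spelled out here, so the working directory in this sketch is an assumption:

```bash
# Run from the directory that ships prepare_data.sh (exact location assumed);
# the script downloads and prepares the GTSRB / Flower data.
bash prepare_data.sh
```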
Overview of the Code
Running Experiments
- Before running experiments, first edit the config.json file to specify the experiment settings. We provide template config files such as IG-NORM-config.json. You may need to run mkdir models before training models. (An end-to-end usage sketch follows this list.)
- For GTSRB and Flower, run prepare_data.sh to download the dataset.
- train_nat.py: the script used to train NATURAL models.
- train_adv.py: the script used to train Madry's models.
- train_attribution.py: the script used to train our models (IG-NORM or IG-SUM-NORM).
- eval_pgd_attack.py: the script used to evaluate NA (natural accuracy) and AA (adversarial accuracy) of the model.
- eval_attribution_attack.py: the script used to evaluate IN (top-K intersection) and CO (correlation) of the model.
- test_ig_attack.ipynb: the IPython notebook used to produce the demo figures.
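Putting the steps above together, a typical run looks roughly as follows. This is a sketch that assumes each script reads config.json from the working directory, as in the original MNIST/CIFAR10 Challenge setups; adapt the commands if the scripts take explicit arguments.

```bash
mkdir models                       # checkpoints are stored under models/ (see above)
bash prepare_data.sh               # GTSRB / Flower only
# edit config.json, e.g. starting from the IG-NORM-config.json template
python train_attribution.py        # or train_nat.py / train_adv.py for the baselines
python eval_pgd_attack.py          # natural and adversarial accuracy (NA, AA)
python eval_attribution_attack.py  # attribution robustness (IN, CO)
```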
Parameters in config.json
Model configuration:
model_dir: contains the path to the directory of the currently trained/evaluated model.
Data configuration:
data_path: contains the path to the dataset directory.
Training configuration:
- tf_random_seed: the seed for the RNG used to initialize the network weights.
- numpy_random_seed: the seed for the RNG used to pass over the dataset in random order.
- max_num_training_steps: the number of training steps.
- num_output_steps: the number of training steps between printing progress to standard output.
- num_summary_steps: the number of training steps between storing TensorBoard summaries.
- num_checkpoint_steps: the number of training steps between storing model checkpoints.
- training_batch_size: the size of the training batch.
- step_size_schedule: the learning rate schedule array.
- weight_decay: the weight decay rate.
- momentum: the momentum rate.
- m: m in the gradient step.
- continue_train: whether to continue from previous training. Should be True or False.
- lambda: lambda in IG-NORM, or beta in IG-SUM-NORM.
- approx_factor: (m / approx_factor) = m in the attack step.
- training_objective: 'ar' for IG-NORM and 'adv_ar' for IG-SUM-NORM.
Evaluation configuration:
- num_eval_examples: the number of examples to evaluate the model on.
- eval_batch_size: the size of the evaluation batches.
Adversarial examples configuration:
- epsilon: the maximum allowed perturbation per pixel.
- num_steps (or k): the number of PGD iterations used by the adversary.
- step_size (or a): the size of the PGD adversary steps.
- random_start: specifies whether the adversary starts iterating from the natural example or from a random perturbation of it.
- loss_func: the loss function PGD is run on. xent corresponds to the standard cross-entropy loss, cw corresponds to the loss function of Carlini and Wagner, ar_approx corresponds to the regularization term of our IG-NORM objective, and adv_ar_approx corresponds to our IG-SUM-NORM objective.
Integrated gradient configuration:
num_IG_steps: the number of segments used in the summation approximation of IG.
Attribution robustness configuration:
- attribution_attack_method: can be random, topK, mass_center, or target.
- attribution_attack_measure: can be kendall, intersection, spearman, or mass_center.
- saliency_type: can be ig or simple_gradient.
- k_top: the k used for the topK attack.
- eval_k_top: the k used for the evaluation metric -- top-K intersection.
- attribution_attack_step_size: the step size of the attribution attack.
- attribution_attack_steps: the number of iterations used by the attack.
- attribution_attack_times: the number of times the attack is repeated.
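To make the parameter descriptions above concrete, here is a rough sketch of what a config.json could look like. The keys are the ones documented above; every value is an illustrative placeholder rather than a recommended setting, so use the templates shipped with the repo (e.g. IG-NORM-config.json) as the real starting point.

```json
{
  "model_dir": "models/example_ig_norm",
  "data_path": "data",

  "tf_random_seed": 451760341,
  "numpy_random_seed": 216105420,
  "max_num_training_steps": 100000,
  "num_output_steps": 100,
  "num_summary_steps": 100,
  "num_checkpoint_steps": 1000,
  "training_batch_size": 50,
  "step_size_schedule": [[0, 0.1], [50000, 0.01], [75000, 0.001]],
  "weight_decay": 0.0002,
  "momentum": 0.9,
  "m": 50,
  "continue_train": false,
  "lambda": 0.1,
  "approx_factor": 10,
  "training_objective": "ar",

  "num_eval_examples": 10000,
  "eval_batch_size": 200,

  "epsilon": 0.3,
  "num_steps": 40,
  "step_size": 0.01,
  "random_start": true,
  "loss_func": "xent",

  "num_IG_steps": 50,

  "attribution_attack_method": "topK",
  "attribution_attack_measure": "kendall",
  "saliency_type": "ig",
  "k_top": 100,
  "eval_k_top": 100,
  "attribution_attack_step_size": 0.01,
  "attribution_attack_steps": 100,
  "attribution_attack_times": 1
}
```

Note that, as described above, some dataset-specific configs name the PGD parameters k and a instead of num_steps and step_size; check the corresponding template file before editing.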
Citation
Please cite our work if you use the codebase:
@inproceedings{chen2019robust,
title={Robust attribution regularization},
author={Chen, Jiefeng and Wu, Xi and Rastogi, Vaibhav and Liang, Yingyu and Jha, Somesh},
booktitle={Advances in Neural Information Processing Systems},
pages={14300--14310},
year={2019}
}
License
Please refer to the LICENSE.