Adversarial explanations for understanding image classification decisions and improved neural network robustness

A preprint version of the article is available at arXiv.


For sensitive problems, such as medical imaging or fraud detection, neural network (NN) adoption has been slow due to concerns about their reliability, leading to a number of algorithms for explaining their decisions. NNs have also been found to be vulnerable to a class of imperceptible attacks, called adversarial examples, which arbitrarily alter the output of the network. Here we demonstrate both that these attacks can invalidate previous attempts to explain the decisions of NNs, and that with very robust networks, the attacks themselves may be leveraged as explanations with greater fidelity to the model. We also show that the introduction of a novel regularization technique inspired by the Lipschitz constraint, alongside other proposed improvements including a half-Huber activation function, greatly improves the resistance of NNs to adversarial examples. On the ImageNet classification task, we demonstrate a network with an accuracy-robustness area (ARA) of 0.0053, an ARA 2.4 times greater than the previous state-of-the-art value. Improving the mechanisms by which NN decisions are understood is an important direction for both establishing trust in sensitive domains and learning more about the stimuli to which NNs respond.
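The abstract names a half-Huber activation function but does not define it. As a rough illustration only, the sketch below assumes it means the Huber function applied to positive inputs and zero elsewhere: quadratic just above zero, linear beyond a threshold. The names `half_huber` and `half_huber_grad` and the threshold parameter `delta` are hypothetical and are not taken from the article or its reference implementation.

```python
def half_huber(x: float, delta: float = 1.0) -> float:
    """Assumed half-Huber activation (illustrative, not the authors' definition).

    Returns 0 for x <= 0, x^2 / (2 * delta) for 0 < x <= delta,
    and x - delta / 2 for x > delta.
    """
    if delta <= 0.0:
        raise ValueError("delta must be positive")
    if x <= 0.0:
        return 0.0
    if x <= delta:
        return x * x / (2.0 * delta)
    return x - delta / 2.0


def half_huber_grad(x: float, delta: float = 1.0) -> float:
    """Derivative of the assumed activation; it rises from 0 to 1 without a jump."""
    if delta <= 0.0:
        raise ValueError("delta must be positive")
    if x <= 0.0:
        return 0.0
    if x <= delta:
        return x / delta
    return 1.0
```

Under this assumption the gradient is continuous at zero, unlike a standard rectified linear unit, whose slope jumps from 0 to 1 there. That smoothness is one plausible reason such an activation might pair well with a regularizer that constrains how quickly the output can change.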

Fig. 1: Comparing explanatory power between Grad-CAM and AEs when applied to a robust NN trained on CIFAR-10.
Fig. 2: Illustration showing how AEs might improve trust in a medical NN’s decision.
Fig. 3: Comparison of different state-of-the-art, robust NNs.
Fig. 4: Demonstration of different AEs for a car classification problem, as computed on four different NNs trained on the CIFAR-10 dataset.
Fig. 5: Different explanation techniques using ρ = 0.075 with an NN trained on the COCO dataset.
Fig. 6: Demonstration of the limited utility of attack ARA when considering if an NN has learned salient features.
Fig. 7: Illustration of the benefits of noisy training, even with a Lipschitz regularization.

Data availability

All data used in this work, including the CIFAR-10 (ref. 42), ILSVRC 2012 (ref. 33), JSRT (ref. 44) and COCO (ref. 45) datasets, are freely available.

Code availability

A reference implementation of the techniques presented throughout this work, applied to the CIFAR-10 dataset, can be found at


This work was supported in part by the Center for Brain-Inspired Computing (C-BRIC), one of six centres in the Joint University Microelectronics Program (JUMP), a Semiconductor Research Corporation (SRC) programme sponsored by the Defense Advanced Research Projects Agency (DARPA). W.W. acknowledges additional funding from the Defense Threat Reduction Agency (DTRA) (award no. HDTRA1-18-1-0009). J.C. acknowledges funding from the Maseeh College of Engineering & Computer Science’s Undergraduate Research and Mentoring Program and the SRC Education Alliance (award no. 2009-UR-2032G). W.W. and J.C. received funding from F. Maseeh. We thank A. Madry (refs. 27, 32) and J. Cohen (ref. 36) for helpful discussions and clarifications about their work. We thank FuR and A. Parise for assisting with the collection of photos for the examples throughout this work.

Author information

Authors and Affiliations



W.W. contributed the original idea, algorithms, experimental design, ablation studies, some active learning annotations and wrote the majority of the paper. J.C. conducted LIME and Grad-CAM integrations, annotated the majority of the active learning annotations, provided text for the active learning sections of the paper and contributed editing support. C.T. provided scope advisement, editing support and funding for the work.

Corresponding authors

Correspondence to Walt Woods, Jack Chen or Christof Teuscher.

Ethics declarations

Competing interests

The authors declare no competing interests.

Woods, W., Chen, J. & Teuscher, C. Adversarial explanations for understanding image classification decisions and improved neural network robustness. Nat Mach Intell 1, 508–516 (2019).

