## Introduction

As we seek to deploy machine learning systems not only on virtual domains, but also in real systems, it becomes critical that we examine not only whether the systems don’t simply work “most of the time”, but which are truly robust and reliable. Although many notions of robustness and reliability exist, one particular topic in this area that has raised a great deal of interest in recent years is that of adversarial robustness: can we develop classifiers that are robust to (test time) perturbations of their inputs, by an adversary intent on fooling the classifier? This is of course a very specific notion of robustness in general, but one that seems to bring to the forefront many of the deficiencies facing modern machine learning systems, especially those based upon deep learning.

This tutorial seeks to provide a broad, hands-on introduction to this topic of adversarial robustness in deep learning. The goal is combine both a mathematical presentation and illustrative code examples that highlight some of the key methods and challenges in this setting. With this goal in mind, the tutorial is provided as a static web site, but all the sections are also downloadable as Jupyter Notebooks; this lets you try out and build upon the ideas presented here.

Although we try to touch on most of the high-level ideas that have been driving research in this area of work, it is certain that we will also omit some highly relevant work. If you feel that some work deserves to be mentioned in the context that we are discussing, feel free to get in touch and let us know. Our hope is that this resource can serve as a starting point for people just getting involved in the area, as well as a launching pad of links and resources for those who want to pursue the ideas more deeply.

### Logistics

Although this tutorial is intended to be mainly read as a static page, because, as mentioned above, you can also download the notebooks for each section, we briefly mention the requirements for running the full examples seen here. Specifically, all example here use Python 3.7 (though should be compatible with 3.6), and make use of the following libraries:

Ideally, the precise version number should not matter, and the approach should be compatible with e.g. pytorch 0.4.x (but not earlier), earlier versions of pillow, etc. We installed all this software by with a fresh install of Anaconda (which includes all of the needed libraries except PyTorch and cvxpy), then used the conda install <package> or pip install <package> commands to install all the relevant software. We also provide a docker container capable of running all the notebooks in our github repository. You will only need cvxpy for some of the optimization-related approaches we discuss, and most of the tutorial can be done without this tool, but PyTorch is used quite heavily throughout.

Many of the simpler examples are quite fast to compute, and so we just implement them on a CPU. For the more time-intensive operations, however (especially the various types of adverarial training), it is necessarly to train the systems on a GPU to have any hope of being computationally efficient. Thus, in order to run these later examples, you will also need CUDA installed with the above version of PyTorch.

### Required background

This document assumes some degree of familiarity with basic deep learning, e.g., the basics of optimization, gradient descent, deep networks, etc (to the degree that is typically covered in an early graduate-level course on machine learning), plus some basic familiarity with PyTorch. If you do not have either of these, then a good resource to start will be:

## Diving right in

One of the beautiful things about deep learning is just how easy it is to jump right in and start seeing some actual results on real data. With that mindset, let’s start off by constructing our very first adversarial example. We will introduce a very small amount of mathematical notation here, which will be substantially expanded upon shortly, and the actual technique we use here is not the ultimate strategy that we will use, but it is fairly close in spirit, and actually captures most of the basic components that we will see later.

To start off, let’s use the (pre-trained) ResNet50 model within PyTorch to classify this picture of a pig.

'Show Pig' licensed under CC BY 2.0

The normal strategy for image classification in PyTorch is to first transform the image (to approximately zero-mean, unit variance) using the torchvision.transforms module. However, because we’d like to make perturbations in the original (unnormalized) image space, we’ll take a slightly different approach and actually build the transformations at PyTorch layers, so that we can directly feed the image in. First, let’s just load the image and resize the 224x224, which is the default size that most ImageNet images (and hence the pre-trained classifiers) take as input.

from PIL import Image
from torchvision import transforms

# read the image, resize to 224 and convert to PyTorch Tensor
pig_img = Image.open("pig.jpg")
preprocess = transforms.Compose([
transforms.Resize(224),
transforms.ToTensor(),
])
pig_tensor = preprocess(pig_img)[None,:,:,:]

# plot image (note that numpy using HWC whereas Pytorch user CHW, so we need to convert)
plt.imshow(pig_tensor[0].numpy().transpose(1,2,0))


Now let’s load the pre-trained ResNet50 model and apply it to the image, after necessary transforms (the weird indexing here is just used to comply with PyTorch standards that all inputs to modules should be of the form batch_size x num_channels x height x width).

import torch
import torch.nn as nn
from torchvision.models import resnet50

# simple Module to normalize an image
class Normalize(nn.Module):
def __init__(self, mean, std):
super(Normalize, self).__init__()
self.mean = torch.Tensor(mean)
self.std = torch.Tensor(std)
def forward(self, x):
return (x - self.mean.type_as(x)[None,:,None,None]) / self.std.type_as(x)[None,:,None,None]

# values are standard normalization for ImageNet images,
# from https://github.com/pytorch/examples/blob/master/imagenet/main.py
norm = Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# load pre-trained ResNet50, and put into evaluation mode (necessary to e.g. turn off batchnorm)
model = resnet50(pretrained=True)
model.eval();

# form predictions
pred = model(norm(pig_tensor))


Pred now contains a 1000 dimensional vector containing the class logits for the 1000 imagenet classes (i.e., if you wanted to convert this to a probability vector, you would apply the softmax operator to this vector). To find the highest likelihood class, we simply take the index of maximum value in this vector, and we can look this up in a list of imagenet classes to find the corresponding label.

import json
with open("imagenet_class_index.json") as f:
imagenet_classes = {int(i):x[1] for i,x in json.load(f).items()}
print(imagenet_classes[pred.max(dim=1)[1].item()])

hog


Loooking good! (note that ImageNet has one class correpsonding to both “hog” and “pig”, so this is the correct label). We should note that this is the first pig image we tried here, so it doesn’t take any tweaking to get a result like this … modern image classifiers are pretty impressive.

### Some introductory notation

Now let’s try to fool this classifier into thinking this image of a pig is something else. To explain this process, we’re going to introduce a bit more notation. Specifically, we’ll define the define the model, or hypothesis function, $h_\theta : \mathcal{X} \rightarrow \mathbb{R}^k$ as the mapping from input space (in the above example this would be a three dimensional tensor), to the output space, which is a $k$-dimensional vector, where $k$ is the number of classes being predicted; note that like in our model above, the output corresponds to the logit space, so these are real-valued numbers that can be positive or negative. The $\theta$ vector represents all the parameters defining this model, (i.e., all the convolutional filters, fully-connected layer weight matrices, baises, etc; the $\theta$ parameters are what we typically optimize over when we train a neural network. And finally, note that this $h_\theta$ corresponds precisely the model object in the Python code above.

Second, we define a loss function $\ell: \mathbb{R}^k \times \mathbb{Z}_+ \rightarrow \mathbb{R}_+$ as a mapping from the model predictions and true labels to a non-negative number. The semantics of this loss function are that the first argument is the model output (logits which can be positive or negative), and the second argument is the index of the true class (that is, a number from 1 to $k$ denoting the index of the true label). Thus, the notation

for $x \in \mathcal{X}$ the input and $y \in \mathbb{Z}$ the true class, denotes the loss that the classifier achieves in its predictions on $x$, assuming the true class is $y$. By far the most common form of loss used in deep learning is the cross entropy loss (also sometimes called the softmax loss), defined as

where $h_\theta(x)_j$ denotes the $j$th elements of the vector $h_\theta(x)$.

Aside: For those who are unfamiliar with the convention above, note that the form of this loss function comes from the typical softmax activation. Defining the softmax operator $\sigma : \mathbb{R}^k \rightarrow \mathbb{R}^k$ applied to a vector

to be a mapping from the class logits returned by $h_\theta$ to a probability distribution. Then the typical goal of training a network is to maximize the probability of the true class label. Since probabilities themselves get vanishingly small, it is more common to maximize the log of the probability of the true class label, which is given by

Since the convention is that we want to minimize loss (rather than maximizing probability), we use the negation of this quantity as our loss function. We can evaluate this loss in PyTorch using the following command.

# 341 is the class index corresponding to "hog"
print(nn.CrossEntropyLoss()(model(norm(pig_tensor)),torch.LongTensor([341])).item())

0.0038814544677734375


A loss of 0.0039 is pretty small: by the conventions above, that would correspond to an $\exp(-0.0039) \approx 0.996$ probability that the classifier believes this to be a pig.

So how do we manipulate this image to make the classifier believe it is something else? To answer this, note that by common approach to training a classifier is to optimize the parameters $\theta$, so as to minimize the average loss over some training set $\{x_i \in \mathcal{X}, y_i \in \mathbb{Z}\}$, $i=1,\ldots,m$, which we write as the optimization problem

which we typically solve by (stochastic) gradient descent. I.e., for some minibatch $\mathcal{B} \subseteq \{1,\ldots,m\}$, we compute the gradient of our loss with respect to the parameters $\theta$, and make a small adjustment to $\theta$ in this negative direction

where $\alpha$ is some step size, and we repeat this process for different minibatches covering the entire training set, until the parameters convergenc.

The key term of interest here is the gradient $\nabla_\theta \ell(h_\theta(x_i), y_i)$, which computes how a small adjustment to each the parameters $\theta$ will affect the loss function. For deep neural networks, this gradient is computed efficiently via backpropagation. However, the beauty of automatic differentiation (the mathematical technique that underlies backpropagation), is that we aren’t just limited to differentiating the loss with respect to $\theta$; we can just as easily compute the gradient of the loss with respect to the input $x_i$ itself. This quantity will tell us how small changes to the image itself affect the loss function.

This is exactly what we’re going to do to form an adversarial example. But instead of adjusting the image to minimize the loss, as we did when optimizing over the network parameters, we’re going to adjust the image to maximize the loss. That is, we want to solve the optimization problem

where $\hat{x}$ denotes our adversarial example that is attempting to maximize the loss. Of course, we cannot just optimize arbitrarily over $\hat{x}$ (there do exist, after all, some images that are not pigs, and if we change the image entirely, say to a dog, then it’s not particularly impressive that we can “fool” the classifier into thinking it’s not a pig). So we instead need to ensure that $\hat{x}$ is close to our original input $x$. By convention, we typically do this by optimizing over the perturbation to $x$, which we will denote $\delta$, and then by optimizing over $\delta$

where $\Delta$ represents an allowable set of perturbations. Characterizing the “correct” set of allowable perturbations is actually quite difficult: in theory, we would like $\Delta$ to capture anything that humans visually feel to be the “same” as the original input $x$. This can include anthing ranging from adding slight amounts of noise, to rotating, translating, scaling, or performing some 3D transformation on the underlying model, or even completely changing the image in the “non-pig” locations. Needless to say, it is not possible to give a mathematically rigorous definition of all the perturbations that should be allowed, but the philosophy behind adversarial examples is that we can consider some subset of the possible space of allowed perturbations, such that by any “reasonable” definition, the actual semantic content of the image could not change under this perturbation.

A common perturbation set to use, though by no means the only reasonable choice, is the $\ell_\infty$ ball, defined by the set

where the $\ell_\infty$ norm a vector $z$ is defined as

i.e., we allow the perturbation to have magnitude between $[-\epsilon, \epsilon]$ in each of its components (it is a slightly more complex, as we also need to ensure that $x + \delta$ is also bounded between $[0,1]$ so that it is still a valid image). We’ll return later to debeate whether or not is it reasonable to consider the $\ell_\infty$ ball, or norm-balls in general as perturbation sets. But all we will say for now is that the advantage of the $\ell_\infty$ ball is that for small $\epsilon$ it creates perturbations which add such a small component to each pixel in the image that they are visually indistinguishable from the original image, and thus provide a “necessarily-but-definitely-not-close-to-sufficient” condition for us to consider a classifier robust to perturbations. And the reality of deep networks is that they can very easily be fooled by manipulations precisely of this type.

Ok, enough discussion. Let’s see what this looks like in practice. The following example uses PyTorch’s SGD optimizer to adjust our perturbation to the input to maximize the loss. Derspite the name, since there is no notion of a training set or minibatches here, this is not actually stochastic gradient descent, but just gradient descent; and since we follow each step with a projection back onto the $\ell_\infty$ ball (done by simply clipping the values that exceed $\epsilon$ magnitude to $\pm \epsilon$), this is actually a procedure known as projected gradient descent (PGD). We’ll shortly consider slightly more complex versions (where we’ll need to things explicitly instead of using PyTorch’s optimization class), but we’ll keep things simple for now.

import torch.optim as optim
epsilon = 2./255

opt = optim.SGD([delta], lr=1e-1)

for t in range(30):
pred = model(norm(pig_tensor + delta))
loss = -nn.CrossEntropyLoss()(pred, torch.LongTensor([341]))
if t % 5 == 0:
print(t, loss.item())

loss.backward()
opt.step()
delta.data.clamp_(-epsilon, epsilon)

print("True class probability:", nn.Softmax(dim=1)(pred)[0,341].item())

0 -0.0038814544677734375
5 -0.00693511962890625
10 -0.015821456909179688
15 -0.08086681365966797
20 -12.229072570800781
25 -14.300384521484375
True class probability: 1.4027455108589493e-06


After 30 gradient steps, the ResNet50 thinks that this has less than a $10^{-5}$ chance of being a pig. (Note: we should also clip $x + \delta$ to be in $[0,1]$, but this already holds for any $\delta$ within the above bound, so we don’t need to do it explicitly here). Instead, it turns out that this classifier is quite sure the image is a wombat, as we can see from the following code, which computes the maximum class and its probability.

max_class = pred.max(dim=1)[1].item()
print("Predicted class: ", imagenet_classes[max_class])
print("Predicted probability:", nn.Softmax(dim=1)(pred)[0,max_class].item())

Predicted class:  wombat
Predicted probability: 0.9997960925102234


So what does this wombat-pig look like? Extremely similar to our original pig, unfortunately.

plt.imshow((pig_tensor + delta)[0].detach().numpy().transpose(1,2,0))


Here in fact is the delta we added to the image, heavily zoomed in by a factor of 50 because it would be impossible to see otherwise.

plt.imshow((50*delta+0.5)[0].detach().numpy().transpose(1,2,0))


So essentially, by adding a tiny multiple of this random-looking noise, we’re able to create an image that looks identical to our original image, yet is classified very incorrectly. Of course, to do all of this more correctly, we should quantize our noise to the allowable level of the image (i.e., to within steps of 1/255), but technicalities like this are easy to overcome, and we can indeed create valid images which are impossible for the human eye to distinguish between them and our original image, but which the classifier misclassifies.

### Targeted attacks

Ok, you might say. This is impressive, but a wombat really isn’t that different from a pig, so maybe the problem isn’t that bad. But it turns out this same technique can be used to make the image classified as virtually any class we desire. This is known as a “targeted attack”, and the only difference is that instead of trying to just maximize the loss of the correct class, we maximize the loss of the correct class while also minimizing the loss of the target class. That is, we solve the optimization problem

where the expression simplies because the $\log \left ( \sum_{j=1}^k \exp(h_\theta(x)_j) \right )$ terms from each loss cancel, and all that remains is the linear terms. Here is how this looks. Note that we tuned the step size a bit to make it work in this case, but we’ll shortly consider slightly different scaling methods for projected gradient descent where this isn’t needed.

delta = torch.zeros_like(pig_tensor, requires_grad=True)
opt = optim.SGD([delta], lr=5e-3)

for t in range(100):
pred = model(norm(pig_tensor + delta))
loss = (-nn.CrossEntropyLoss()(pred, torch.LongTensor([341])) +
nn.CrossEntropyLoss()(pred, torch.LongTensor([404])))
if t % 10 == 0:
print(t, loss.item())

loss.backward()
opt.step()
delta.data.clamp_(-epsilon, epsilon)

0 24.00604820251465
10 -0.1628284454345703
20 -8.026773452758789
30 -15.677117347717285
40 -20.60370635986328
50 -24.99606704711914
60 -31.009849548339844
70 -34.80946350097656
80 -37.928680419921875
90 -40.32395553588867

max_class = pred.max(dim=1)[1].item()
print("Predicted class: ", imagenet_classes[max_class])
print("Predicted probability:", nn.Softmax(dim=1)(pred)[0,max_class].item())

Predicted class:  airliner
Predicted probability: 0.9679961204528809


As before, here’s our airliner-pig, looking an awful lot like a normal pig (the target class of 404 from the code is indeed an airliner, so our targeted attack is working)

plt.imshow((pig_tensor + delta)[0].detach().numpy().transpose(1,2,0))


And here is our airliner noise.

plt.imshow((50*delta+0.5)[0].detach().numpy().transpose(1,2,0))


The conclusion, of course, is that with adversarial attacks and deep learning, you can make pigs fly.

We’ll discuss the practical questions that these attacks raise shortly, but the ease of such attacks raises an obvious question: can we train deep learning classifiers that are somehow resistant to such attacks? The short answer to this question is “yes”, but we (as a field) are a long way from really making such training practical, or achieving nearly the performance that we get with “standard” deep learning methods. This tutorial will cover both the attack and the defense side in great detail, and hopefully by the end of it, you will get a sense of the current state of the art, as well as the directions where we still need to make substantial progress.

## A brief (incomplete) history of adversarial robustness

• Origins (robust optimization)
• Support vector machines
• Adversarial classification (e.g. Domingos 2004)
• Distinctions between different types of robustness (test test, train time, etc)
• Szgegy et al., 2003, Goodfellow et al., 2004
• Many proposed defense methods
• Many proposed attack methods
• Exact verification methods
• Convex upper bound methods
• Recent trends

Let’s now consider, a bit more formally, the challenge of attacking deep learning classifiers (here meaning, constructing adversarial examples them the classifier), and the challenge of training or somehow modifying existing classifiers in a manner that makes them more resistant to such attacks.

### Brief review: risk, training, and testing sets

To begin, we can consider more formally the traditional notion of risk as it is used in machine learning. The risk of a classifier is it’s expected loss under the true distribution of samples, i.e.

where $\mathcal{D}$ denotes the true distribution over samples. In practice, of course, we do not know the underlying distribution of the actual data, so we approximate this quantity by considering a finite set of samples drawn draw i.i.d. from $\mathcal{D}$,

and we then consider the empirical risk

As mentioned above, the traditional process of training a machine learning algorithm is that of finding parameters that minimize the empirical risk on some training set denoted $D_{\mathrm{train}}$ (or possibly some regularized version of this objective)

Of course, once the parameters $\theta$ have been chosen based upon the training set $D_{\mathrm{train}}$, this data set can no longer give us an unbiased estimated of the risk of the resulting classifier, and so frequently an alterntaive data set $D_{\mathrm{test}}$ (also contains points sampled i.i.d. from the true underlying distribution $\mathcal{D}$), and we use $\hat{R}(h_\theta, D_{\mathrm{test}})$ as a proxy to estimate the true risk $R(h_{\theta})$.

As an alternative to the traditional risk, we can also consider an adversarial risk. This is like the traditional risk, except that instead of suffering the loss on each sample point $\ell(h_\theta(x), y)$, we suffer the worst case loss in some region around the sample point, that is
where for completeness, we explicitly allow that the perturbation region $\Delta(x)$ can depend on the sample point itself; using the examples from the previous sections, this would be necessary in order to ensure that perturbations respected the ultimate image bounds, but it can also potentially encode a great deal of semantic information about what sorts of perturbations would be allowed for each image.