Explaining and Harnessing Adversarial Examples - ShortScience.org

Goodfellow et al. introduce the fast gradient sign method (FGSM) to craft adversarial examples and further offer a possible explanation of adversarial examples based on linear models. FGSM is a gradient-based, one-step method for generating adversarial examples. In particular, letting $J$ be the objective optimized during training and $\epsilon$ be the maximum $L_\infty$ norm of the adversarial perturbation, FGSM computes
$x' = x + \eta = x + \epsilon \text{sign}(\nabla_x J(x, y))$
where $y$ is the label for sample $x$. The $\text{sign}$ function is applied element-wise here. The paper demonstrates the method on several examples, and it is commonly used in related work. In the remainder of the paper, Goodfellow et al. discuss a linear interpretation of why adversarial examples exist. Specifically, considering the dot product
$w^T x' = w^T x + w^T \eta$
it becomes apparent that the perturbation $\eta$, although insignificant on a per-pixel level (i.e., each entry is at most $\epsilon$ in absolute value), can significantly influence the activation of a single neuron. Choosing $\eta = \epsilon \text{sign}(w)$ maximizes the change under the max-norm constraint and gives $w^T \eta = \epsilon \|w\|_1$, so the effect becomes more pronounced the higher the dimensionality of $x$. Additionally, many network architectures today use $\text{ReLU}$ activations, which are piecewise linear and therefore leave the network behaving largely linearly. Goodfellow et al. conduct several more experiments; I want to highlight the conclusions of some of them:
- Training on adversarial samples can be seen as regularization. Based on experiments, it is more effective than $L_1$ regularization or adding random noise.
- The direction of the perturbation matters most. Adversarial samples may transfer between models because similar models learn similar functions, so the same perturbation directions are similarly effective against them.
- Ensembles are not necessarily resistant to perturbations.

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
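
To make the FGSM update and the linear argument from the summary more concrete, here is a minimal sketch in plain Python. It is not the paper's experimental setup: it assumes a toy logistic-regression model with random weights, and the names `fgsm`, `loss_gradient_wrt_input` and `activation_shift` are hypothetical helpers introduced only for illustration. For such a linear model the gradient of the loss with respect to $x$ is proportional to $w$, so the FGSM step should shift the pre-activation $w^T x$ by $\epsilon \|w\|_1$, which grows with the input dimension even though no single entry changes by more than $\epsilon$.

```python
import math
import random


def sign(v):
    """Element-wise sign helper: returns -1, 0 or 1."""
    return (v > 0) - (v < 0)


def dot(a, b):
    """Plain dot product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of the logistic loss J(x, y) w.r.t. x, for a label y in {0, 1}.

    With p = sigmoid(w^T x + b), the gradient is (p - y) * w.
    """
    p = sigmoid(dot(w, x) + b)
    return [(p - y) * wi for wi in w]


def fgsm(w, b, x, y, eps):
    """One FGSM step: x' = x + eps * sign(grad_x J(x, y))."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


def activation_shift(n, eps, seed=0):
    """Compare |w^T x' - w^T x| with eps * ||w||_1 for a random n-dim model.

    The random weights and inputs are an assumption made for this toy demo.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    x = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    x_adv = fgsm(w, 0.0, x, 1, eps)
    observed = abs(dot(w, x_adv) - dot(w, x))
    predicted = eps * sum(abs(wi) for wi in w)
    return observed, predicted


if __name__ == "__main__":
    # The per-entry change stays at eps, but the shift in w^T x grows with n.
    for n in (10, 100, 1000):
        observed, predicted = activation_shift(n, eps=0.1)
        print(n, round(observed, 3), round(predicted, 3))
```

In this sketch the observed shift and $\epsilon \|w\|_1$ should agree up to floating-point error, and both should increase as `n` grows, which is the dimensionality effect described above. For a deep network the gradient would come from backpropagation rather than a closed form, so this only illustrates the linear case.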