A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage

2023.08.28.

Ilyas et al. report a surprising result: a model trained on
adversarial examples is effective on clean data. They suggest this transfer is driven by adversarial
examples containing genuinely useful non-robust cues. But an alternative mechanism for the transfer could be a
kind of “robust feature leakage” where the model picks up on faint robust cues in the attacks.

We show that at least 23.5% (out of 88%) of the accuracy can be explained by robust features in
$\hat{\mathcal{D}}_{\text{rand}}$.

Lower Bounding Leakage

Our technique for quantifying leakage consists of two steps:

First, we construct features $f_i(x) = w_i^T x$
Next, we train a linear classifier as per Ilyas et al.,
Equation 3, on the datasets $\hat{\mathcal{D}}_{\text{det}}$
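The two steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the weight matrix `W`, the synthetic data, and the helper names `robust_features`, `train_softmax` and `accuracy` are all hypothetical, and a plain softmax regression fitted by gradient descent stands in for the training objective referenced above.

```python
import numpy as np

def robust_features(X, W):
    """Map inputs to linear features f_i(x) = w_i^T x (one row of W per feature)."""
    return X @ W.T

def train_softmax(F, y, num_classes, lr=0.5, steps=500):
    """Fit a linear (softmax) classifier on the features F only."""
    n, k = F.shape
    A = np.zeros((k, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]                       # one-hot labels
    for _ in range(steps):
        logits = F @ A + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n                              # gradient of mean cross-entropy w.r.t. logits
        A -= lr * (F.T @ G)
        b -= lr * G.sum(axis=0)
    return A, b

def accuracy(F, y, A, b):
    return float(np.mean(np.argmax(F @ A + b, axis=1) == y))

# Synthetic stand-in data (assumption): in the article, W would come from a
# robust linear model and (X, y) from the relabeled adversarial datasets.
rng = np.random.default_rng(0)
d, k, n = 20, 3, 600
W = rng.normal(size=(k, d)) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = np.argmax(X @ W.T, axis=1)   # labels depend on the features by construction

F = robust_features(X, W)
A, b = train_softmax(F, y, num_classes=k)
train_acc = accuracy(F, y, A, b)
```

Because the classifier only ever sees `F`, any accuracy it reaches is attributable to the chosen features, which is what makes the measurement a lower bound on leakage.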

Since Ilyas et al. only specify robustness in the two-class
case, we propose two possible specifications for what constitutes a robust feature in the multiclass
setting:

Specification 1
For at least one of the classes, the feature is $\gamma$

Specification 2
The feature comes from a robust model for which at least 80% of points in the test set have predictions
that remain static in a neighborhood of radius 0.25 on the $L_2$

We find features that satisfy both specifications by using the 10 linear features of a robust linear
model trained on CIFAR-10. Because the features are linear, the above two conditions can be certified
analytically. We leave the reader to inspect the weights corresponding to the features manually:

[Figure: 10 Features, $F_C$]
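As a rough illustration of why linear features allow analytic certification, consider a linear classifier with logits $Wx + b$ and predicted class $c$. The prediction is unchanged for every perturbation with $L_2$ norm at most $\epsilon$ exactly when $(w_c - w_j)^T x + (b_c - b_j) > \epsilon \|w_c - w_j\|_2$ for all $j \neq c$. The sketch below computes the fraction of points for which this holds; the function name `certified_static_fraction` and the toy weights are assumptions made for the example, not the article's model.

```python
import numpy as np

def certified_static_fraction(W, b, X, eps):
    """Fraction of rows of X whose linear prediction is provably constant
    on the L2 ball of radius eps around the point."""
    logits = X @ W.T + b                        # shape (n, k)
    n, _ = logits.shape
    pred = np.argmax(logits, axis=1)
    # margins[i, j] = logit of the predicted class minus logit of class j
    margins = logits[np.arange(n), pred][:, None] - logits
    # diff_norms[i, j] = ||w_pred(i) - w_j||_2
    diff_norms = np.linalg.norm(W[pred][:, None, :] - W[None, :, :], axis=2)
    slack = margins - eps * diff_norms          # worst-case margin inside the ball
    slack[np.arange(n), pred] = np.inf          # skip the predicted class itself
    return float(np.mean(slack.min(axis=1) > 0))

# Toy two-class model (assumption): the decision boundary is the line x_1 = 0.
W_toy = np.array([[1.0, 0.0], [-1.0, 0.0]])
b_toy = np.zeros(2)
X_toy = np.array([[1.0, 0.0], [0.1, 0.0], [-2.0, 0.0]])
frac = certified_static_fraction(W_toy, b_toy, X_toy, eps=0.25)
```

In the toy example the points lie at distances 1, 0.1 and 2 from the boundary, so only the middle one fails at radius 0.25. Under this reading, Specification 2 amounts to checking that the returned fraction is at least 0.8 on the test set.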

Training a linear model on the above robust features on $\hat{\mathcal{D}}_{\text{rand}}$

The contrasting results suggest that the two experiments should be interpreted differently. The
transfer results of $\hat{\mathcal{D}}_{\text{rand}}$

The results of $\hat{\mathcal{D}}_{\text{det}}$

Response Summary: This
is a valid concern that was actually one of our motivations for creating the
$\widehat{\mathcal{D}}_{det}$

Acknowledgments

Shan Carter (started the project), Preetum (technical discussion), Chris Olah (technical discussion), Ria
(technical discussion), Aditiya (feedback)

References

Adversarial examples are not bugs, they are features
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1905.02175.

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution in academic contexts, please cite this work as

Goh, "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage", Distill, 2019.

BibTeX citation

@article{goh2019a,
  author = {Goh, Gabriel},
  title = {A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage},
  journal = {Distill},
  year = {2019},
  note = {https://distill.pub/2019/advex-bugs-discussion/response-2},
  doi = {10.23915/distill.00019.2}
}