2023.08.28.

A Discussion of ‘Adversarial Examples Are Not Bugs, They Are Features’: Robust Feature Leakage

A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage


Ilyas et al. report a surprising result: a model trained on
adversarial examples is effective on clean data. They suggest this transfer is driven by adverserial
examples containing geuinely useful non-robust cues. But an alternate mechanism for the transfer could be a
kind of “robust feature leakage” where the model picks up on faint robust cues in the attacks.

We show that at least 23.5% (out of 88%) of the accuracy can be explained by robust features in
DrandD_text{rand}

Lower Bounding Leakage

Our technique for quantifying leakage consisting of two steps:

  1. First, we construct features fi(x)=wiTxf_i(x) = w_i^Tx
  2. Next, we train a linear classifier as per ,
    Equation 3
    on the datasets D^dethat{mathcal{D}}_{text{det}}

Since Ilyas et al. only specify robustness in the two class
case, we propose two possible specifications for what constitutes a robust feature in the multiclass
setting:

Specification 1
For at least one of the
classes, the feature is γgamma-robustly useful with
γ=0gamma = 0, and the set of valid perturbations equal to an L2L_2
Specification 2
The feature comes from a robust model for which at least 80% of points in the test set have predictions
that remain static in a neighborhood of radius 0.25 on the L2L_2

We find features that satisfy both specifications by using the 10 linear features of a robust linear
model trained on CIFAR-10. Because the features are linear, the above two conditions can be certified
analytically. We leave the reader to inspect the weights corresponding to the features manually:

10 Features, FCF_C

Training a linear model on the above robust features on D^randhat{mathcal{D}}_{text{rand}}

The contrasting results suggest that the the two experiements should be interpreted differently. The
transfer results of D^randhat{mathcal{D}}_{text{rand}}

The results of D^dethat{mathcal{D}}_{text{det}}

Response Summary: This
is a valid concern that was actually one of our motivations for creating the
D^detwidehat{mathcal{D}}_{det}

Acknowledgments

Shan Carter (started the project), Preetum (technical discussion), Chris Olah (technical discussion), Ria
(technical discussion), Aditiya (feedback)


References

  1. Adversarial examples are not bugs, they are features
    Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1905.02175.


Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution in academic contexts, please cite this work as

Goh, "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage", Distill, 2019.

BibTeX citation

@article{goh2019a,
  author = {Goh, Gabriel},
  title = {A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage},
  journal = {Distill},
  year = {2019},
  note = {https://distill.pub/2019/advex-bugs-discussion/response-2},
  doi = {10.23915/distill.00019.2}
}



Source link

Facebook
Twitter
LinkedIn
Pinterest