BlurNet: Defense by Filtering the Feature Maps
Adversarial machine learning has recently attracted attention by showing that state-of-the-art deep neural networks are vulnerable to adversarial examples, which are produced by adding small perturbations to the input image. A malicious adversary generates adversarial examples either by obtaining access to model internals, such as gradient information, to alter the input, or by attacking a substitute model and transferring the resulting examples to the victim model. Specifically, one of these attack algorithms, Robust Physical Perturbations ($RP_2$), generates adversarial images of stop signs with black and white stickers to achieve high targeted misclassification rates against standard-architecture traffic sign classifiers. In this paper, we propose BlurNet, a defense against the $RP_2$ attack. First, we motivate the defense with a frequency analysis of the first-layer feature maps of the network on the LISA dataset, which shows that the $RP_2$ algorithm introduces high-frequency noise into the input image. To remove this high-frequency noise, we introduce a depthwise convolution layer of standard blur kernels after the first layer. We perform a black-box transfer attack to show that low-pass filtering the feature maps is more beneficial than filtering the input. We then present various regularization schemes that incorporate this low-pass filtering behavior into the training regime of the network, and we evaluate them under white-box attacks. We conclude with an adaptive attack evaluation showing that the success rate of the attack drops from 90\% to 20\% with total variation regularization, one of the proposed defenses.
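The core idea of low-pass filtering feature maps with a depthwise blur can be illustrated with a small sketch. This is not the authors' implementation: the 3x3 box kernel, the zero padding, the function names, and the anisotropic definition of total variation are all assumptions chosen for illustration, and the code is written in plain Python only so that it runs without dependencies (a real network would use a deep learning framework's depthwise convolution).

```python
from typing import List

FeatureMap = List[List[float]]   # one channel, H x W
FeatureMaps = List[FeatureMap]   # C channels

# Assumed stand-in for a "standard blur kernel": a 3x3 box filter.
BOX_3X3: List[List[float]] = [[1.0 / 9.0] * 3 for _ in range(3)]


def blur_channel(channel: FeatureMap, kernel: List[List[float]]) -> FeatureMap:
    """Filter one channel with a square kernel (zero padding, same size).

    This computes a cross-correlation, which equals a convolution
    for symmetric kernels such as the box filter above.
    """
    h, w = len(channel), len(channel[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < h and 0 <= x < w:  # outside the map counts as 0
                        acc += kernel[di][dj] * channel[y][x]
            out[i][j] = acc
    return out


def depthwise_blur(maps: FeatureMaps, kernel: List[List[float]]) -> FeatureMaps:
    """Apply the same kernel to every channel independently.

    "Depthwise" means no information is mixed across channels.
    """
    return [blur_channel(channel, kernel) for channel in maps]


def total_variation(channel: FeatureMap) -> float:
    """Anisotropic total variation: sum of absolute differences between
    horizontally and vertically adjacent values (one common definition,
    assumed here)."""
    h, w = len(channel), len(channel[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:
                tv += abs(channel[i][j + 1] - channel[i][j])
            if i + 1 < h:
                tv += abs(channel[i + 1][j] - channel[i][j])
    return tv


if __name__ == "__main__":
    # A checkerboard is a simple high-frequency pattern.
    checker = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
    blurred = depthwise_blur([checker], BOX_3X3)
    print("TV before blur:", total_variation(checker))
    print("TV after blur: ", total_variation(blurred[0]))
```

Blurring the checkerboard lowers its total variation, which gives some intuition for why a total variation penalty can act as a training-time counterpart to an explicit blur layer: both discourage rapid spatial changes in the feature maps.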