
Paper study

kykyky 2023. 5. 7. 00:00

Attack and defense in deep model security

 

1. Deep model watermarking

 

# Digital Watermarking: embedding a watermark provides authentication and content verification, thereby preventing tampering

* Process

  i) watermark embedding

  ii) watermark extraction

 

# Model watermarking can play a major role in the market driven by MLaaS (Machine Learning as a Service)

* Protects models from theft

 

# 2 categories

 i)  inserting the watermark directly into the model parameters

  -> watermark might either be encoded in existing model parameters (for example, in the form of a bit string) or be inserted by adding additional parameters

 ii) creating a trigger or key dataset, which consists of data points that evoke an unusual prediction behavior in the marked model

  -> trigger dataset needs to be provided together with the original training data during the model training process so that the model learns to exhibit an unusual prediction behavior on data from the trigger dataset

 

Embedding watermarks into model parameters.

 

[13]

- marked ML models by adding information about the training data into the model parameters.

- encodes information in the least significant bits of the model parameters.

- designed a correlated value encoding mechanism to maximize the correlation between the model parameters and a given secret.
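
A rough sketch of the correlated-value-encoding idea (the function name, the weighting factor lam, and the exact form of the penalty are my own assumptions, not the formulation of [13]): turn the secret into a vector and add a term to the training loss that rewards a strong correlation between that vector and part of the flattened parameters.

```python
import torch

def correlation_regularizer(parameters, secret, lam=0.1):
    """Penalty that grows when the chosen parameters are weakly correlated
    with the secret vector (rough sketch, not the exact method of [13])."""
    w = torch.cat([p.flatten() for p in parameters])[: secret.numel()]
    w = (w - w.mean()) / (w.std() + 1e-8)
    s = (secret - secret.mean()) / (secret.std() + 1e-8)
    corr = (w * s).mean()          # Pearson-style correlation estimate
    return -lam * corr.abs()       # reward a high |correlation|

# inside a training step (model, task_loss, secret_vec are assumed to exist):
# loss = task_loss + correlation_regularizer(model.parameters(), secret_vec)
```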

 

[14]

- a watermarking method for Neural Networks (NN) that explicitly embeds additional information into the weights of a NN after its training.

- The watermark is represented as a T-bit string.

In order to include it in the model parameters, the authors used a composite loss function, where the loss of the original task is added to an embedding regularizer that introduces a statistical bias into some of the model parameters. An embedding parameter 𝑋 is used as a secret key to encode the watermark information.
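
A minimal sketch of such an embedding regularizer, loosely following the description above (the layer choice, variable names, and weighting factor are assumptions): the T-bit string is read out as sigmoid(Xw) from a flattened weight vector w, and a binary cross-entropy term pushes those projections toward the watermark bits.

```python
import torch
import torch.nn.functional as F

def embedding_regularizer(layer_weight, X, watermark_bits, lam=0.01):
    """BCE between sigmoid(X @ w) and the T-bit watermark string.
    layer_weight: weights of the chosen layer; X: secret T x D key matrix;
    watermark_bits: float tensor of 0/1 values with length T."""
    w = layer_weight.flatten()              # D-dimensional weight vector
    projections = torch.sigmoid(X @ w)      # T values in (0, 1)
    return lam * F.binary_cross_entropy(projections, watermark_bits)

# training:   loss = task_loss + embedding_regularizer(model.conv1.weight, X, bits)
# extraction: bits_hat = (torch.sigmoid(X @ model.conv1.weight.flatten()) > 0.5)
```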

 

[15]

- [14] is extended in [15] by introducing, for the embedding parameter 𝑋, an additional NN operating on selected parameters of the original model.

- In order to train the original model, the loss function proposed by [14] is adopted.

- By contrast, in order to train the additional NN, the authors adopted a binary cross-entropy loss between its output vector and the watermark.

 

[10]

Finally, [10] proposed a passport-based DNN ownership verification scheme.

- The passport layers are embedded within convolutional neural networks and calculate hidden parameters, without which the model's inference accuracy is reduced.
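
A simplified sketch of what such a passport layer could look like (my own reconstruction of the idea, not the exact architecture of [10]): the scale and bias applied after a convolution are not free parameters but are computed from the convolution weights and secret passport tensors, so inference quality drops when the passports are missing or wrong.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PassportConv2d(nn.Module):
    """Convolution whose affine factors are derived from secret passports
    (simplified sketch of a passport layer, not the exact design of [10])."""

    def __init__(self, in_ch, out_ch, passport_scale, passport_bias, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        # the passports are fixed secret tensors shaped like the layer input
        self.register_buffer("p_scale", passport_scale)
        self.register_buffer("p_bias", passport_bias)

    def forward(self, x):
        # hidden scale/bias: average response of the conv weights to the passports
        gamma = F.conv2d(self.p_scale, self.conv.weight,
                         padding=self.conv.padding).mean(dim=(0, 2, 3))
        beta = F.conv2d(self.p_bias, self.conv.weight,
                        padding=self.conv.padding).mean(dim=(0, 2, 3))
        out = self.conv(x)
        return gamma.view(1, -1, 1, 1) * out + beta.view(1, -1, 1, 1)

# usage sketch:
# layer = PassportConv2d(3, 16, torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32))
# y = layer(torch.randn(8, 3, 32, 32))
```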

 

Using pre-defined inputs as triggers.

 

[16]

The technique proposed by [16] embeds the watermark by fine-tuning a pre-trained model so that the boundary of the classification region assumes the desired shape.

   The shape is obtained by marking the model through adversarial retraining, slightly moving the decision boundary in order to make the model unique.

The watermark is composed of synthetically generated adversarial samples that are close to the decision boundary.

After that, the trained classifier is fine-tuned so that it still assigns the trigger data points to their correct original classes.
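
A hedged sketch of that procedure (FGSM is used here as a stand-in generator of near-boundary samples, and the model/data names are assumptions): slightly perturb a few training images, keep their original labels as the secret trigger set, and fine-tune the classifier so it still predicts those labels.

```python
import torch
import torch.nn.functional as F

def make_trigger_set(model, x, y, eps=0.05):
    """Create near-boundary watermark samples with a single FGSM-style step
    (illustrative stand-in for the adversarial marking of [16])."""
    x_adv = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
    return x_adv, y                     # keep the ORIGINAL labels as the key

def embed_watermark(model, trigger_x, trigger_y, lr=1e-4, steps=100):
    """Fine-tune so the marked model classifies the trigger set correctly."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(trigger_x), trigger_y).backward()
        opt.step()
    return model
```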

 

[17]

The solution proposed by [17] is to embed the watermark into the training data through a backdooring paradigm.

The keys are generated by imposing on some of the training data a recognizable pattern unrelated to the host dataset. The tuples with the pattern are then re-labeled by changing their original true class and used to train the watermarked model to output the chosen label in the presence of the watermark-triggering pattern.
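
A minimal sketch of that key-generation step (the patch position, size, and target label are arbitrary choices of mine): stamp a recognizable pattern onto a few images and re-label them to the chosen class; training on clean plus stamped data then yields the watermarked model.

```python
import torch

def make_backdoor_keys(images, target_label, patch_value=1.0, patch_size=4):
    """Stamp a small square in the top-left corner and re-label the samples
    (illustrative sketch of backdoor-style key generation as in [17])."""
    keyed = images.clone()
    keyed[:, :, :patch_size, :patch_size] = patch_value   # the trigger pattern
    labels = torch.full((keyed.size(0),), target_label, dtype=torch.long)
    return keyed, labels

# mix (keyed, labels) into the normal training set; ownership is later claimed
# by showing that the model outputs `target_label` on pattern-stamped inputs.
```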

 

[18]

In [18], the authors proposed DeepSigns, a watermarking methodology that embeds the watermark as a T-bit string into the probability density function (pdf) of the data abstractions obtained in different layers of a DL model.

In contrast to methods including the watermark in the static model content, this approach changes the dynamic content of the model, namely the activations that depend on the data and the model.

 

[19]

Finally, [19] proposed BlackMarks, a watermarking framework that can be used in a black-box scenario.

BlackMarks takes the pre-trained unmarked model and the owner's binary signature as inputs and outputs the corresponding marked model together with a set of watermark keys.

The system relies on a model-dependent encoding scheme that clusters the model's output activations into two groups according to their similarities. Given the owner's watermark signature in the form of a binary string, a set of key images and corresponding labels is designed using targeted adversarial attacks.

 

Theft avoidance.

 

ML model watermarking cannot avoid theft but only identify it. To overcome this limitation, different solutions have been proposed. The solutions span from issuing security warnings to the design of adaptive models that achieve high accuracy only when queried by authorized users.

Other works proposed more elaborate watermarking schemes that are not easy to steal, for example, by perturbing the prediction outputs or by designing networks that are very sensitive to weight changes.

In the approaches based on a trigger dataset, the watermark identification depends on the model's reaction to queries from the trigger dataset. Defining a suitable threshold to identify a stolen model requires proper tuning, which is obtained by balancing reliability and integrity. The approaches belonging to this category typically suffer from the following drawbacks:

- after the verification algorithm is run against a stolen model, attackers are in possession of the trigger dataset, which enables them to fine-tune the model on those data points to remove the watermark;

- a limited number of backdoors can be embedded in an ML model;

- watermarking solutions not depending on specific data points as trigger data (e.g., [20]) can allow attackers to randomly select different points than the initial watermark.
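
For the trigger-dataset family, the verification step itself is simple; the delicate part is the threshold mentioned above. A minimal black-box check could look like this sketch (the 0.9 value is a placeholder, not a tuned reliability/integrity trade-off):

```python
import torch

@torch.no_grad()
def verify_ownership(suspect_model, trigger_x, trigger_y, threshold=0.9):
    """Black-box check: does the suspect model agree with the trigger labels
    often enough? The threshold is a placeholder, not a calibrated value."""
    preds = suspect_model(trigger_x).argmax(dim=1)
    agreement = (preds == trigger_y).float().mean().item()
    return agreement >= threshold, agreement
```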

As noted in [21], approaches embedding watermarks into ML parameters [14,15] are of limited security relevance since they cause easily detectable changes in the statistical distribution of the parameters. In general, approaches that do not take precautions and leave a detectable trace in the model often rely on white-box access for verification. However, this assumption is feasible only in rare scenarios, whereas assuming black-box access is often more realistic. On the other hand, adversarial techniques used for watermarking [16] or fingerprinting [12,17] suffer from low robustness against fine-tuning or retraining and, in some cases, violate watermark integrity.

 

 

 

2. Information hiding

 

The umbrella definition of information hiding encompasses a wide array of techniques that can be used to conceal information within another piece of data, defined as the carrier.

In recent years, such mechanisms have been mainly used to create novel threats able to avoid detection or bypass tools enforcing network and host security [22].

 

The first class of attacks that can take advantage of machine learning models falls in the category of stegomalware.

In this case, a malicious payload is hidden in an innocent-looking carrier (e.g., a digital image) and dropped onto the host of the victim. Although a relevant corpus of research aimed at revealing cloaked payloads, obfuscated binaries, or command-and-control channels endowed with evasive routines is now available [23], a recent trend investigates how machine learning frameworks can be used as attack vectors.

As an example, the work in [24] introduces four different mechanisms for embedding malicious content into a deep neural network model, which can detonate within the execution environment of the victim.

To this aim, the payload can be hidden by using a classical Least Significant Bit (LSB) steganography approach (i.e., the last bits of each model parameter are altered to encode arbitrary data). Due to its simplicity, LSB encoding might not be suitable for highly compressed models and could lead to visible alterations.
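
A toy illustration of that LSB idea on a flat float32 parameter array (pure NumPy, one payload bit per parameter; this is my own sketch, not the encoder of [24]):

```python
import numpy as np

def lsb_embed(params: np.ndarray, payload: bytes) -> np.ndarray:
    """Hide one payload bit in the least significant mantissa bit of each
    float32 parameter (toy sketch of the LSB approach described in [24])."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    assert bits.size <= params.size, "payload too large for this carrier"
    as_int = params.astype(np.float32).copy().view(np.uint32)
    as_int[: bits.size] = (as_int[: bits.size] & np.uint32(0xFFFFFFFE)) | bits
    return as_int.view(np.float32)

def lsb_extract(params: np.ndarray, n_bytes: int) -> bytes:
    bits = (params.view(np.uint32) & 1).astype(np.uint8)[: n_bytes * 8]
    return np.packbits(bits).tobytes()

# hide and recover a short string in one million random "weights"
w = np.random.randn(1_000_000).astype(np.float32)
assert lsb_extract(lsb_embed(w, b"hello"), 5) == b"hello"
```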

 

Thus, more refined hiding mechanisms exploit the resilience of deep neural networks (i.e., internal errors are introduced by overwriting the model with the data to be concealed, and the resulting "broken neurons" are not updated by re-training) or mapping techniques (i.e., arbitrary information is placed by altering parameters with the same or closest values).

Luckily, the proposed approach has limited applicability to real-world scenarios since it cannot embed enough data to conceal working malware samples.

In this vein, more aggressive hiding mechanisms are proposed in [25]. Specifically, data can be hidden in model parameters while preserving the value of the most significant byte, or by using half of a 32-bit word.

A more sophisticated variant takes into account the distribution of the values of other neurons to reduce the impact of the hiding process. We point out that greater alterations of the structure of the neural network allow bigger payloads to be cloaked, but at the price of increased detectability. In this vein, information-hiding-capable attacks would represent a real hazard only if suitable trade-offs among the capacity of the targeted network, the complexity of the hiding process, and the detectability of the cloaked payload are found in future research.

Even if not strictly related to stegomalware, the early yet seminal work in [26] showcases how to include information about a specific target inside a neural network model and how to use the obtained output for encrypting a malicious payload. In this case, the ultimate goal is to prevent detection by making the content unintelligible and not classifiable rather than hiding it in plain sight.

 

The second class of attacks based on information hiding aims at creating covert channels, i.e., parasitic communication paths nested within a specific carrier.

Covert channels have been used to implement a variety of offensive templates, for instance, to exfiltrate sensitive information concealed in network traffic through a firewall or to allow two processes confined within separate sandboxes to collude. Owing to their effectiveness, the security of modern computing and networking deployments cannot ignore this class of threats, which has been successfully applied to cloud data centers, multi-processor architectures, network services, and virtual entities, to mention just a few [27]. Indeed, Machine Learning frameworks are expected to become a prime tool to support the creation of covert communications (see, e.g., [28] for a recent survey on possible data movements based on the use of watermarking techniques).

However, attacks trying to use machine learning models for covert-channelization are still poorly understood or formalized. In this vein, the pioneering work in [29] introduces a new family of threats considering an adversary who hides data in a training set so that it can later be arbitrarily read. Although the data percolation happens via a suitable poisoning of the dataset (the hiding phase) and represents a privacy leakage, the threat model can ultimately be reduced to the creation of a covert channel.

For the case of empowering command-and-control communications of malware, the work in [30] showcases how the commands of a botnet can be cloaked in content published on online social networks. For instance, an AI can be trained to generate tweets that hide commands but look legitimate. Some works also exploit machine learning to hide communication but cannot be considered covert channels; as an example, [31] showcases how a GAN can be used to modify the traffic patterns produced by real malware to avoid detection.

 

Lastly, the importance of considering deep learning models as carriers for containing malicious information is supported by the recent surge of threat actors trying to exploit any digital content or service (see [32] for a recent survey on the topic).

Even if no known cases of stegomalware targeting neural networks have been observed "in the wild" so far, an important aspect of preventing the weaponization of AI is rooted in precisely understanding possible ambiguities and imperfect isolation properties that could be leveraged to exfiltrate or store data.

At the same time, more than 50% of academic and industrial settings exploit some form of machine learning, making as-a-Service infrastructures prone to attacks and requiring suitable mechanisms to store copyright and provenance information [33]. Thus, the main attack model that can be envisaged is the abuse of data-watermarking methodologies to conceal arbitrary payloads and elude detection.

 

3. Adversarial learning

 

As pinpointed above, attacks against Machine Learning systems can affect different stages of the learning and exploitation process.

For instance, in many real scenarios, training data are collected online; therefore, they cannot be verified and fully trusted. In poisoning availability attacks [34], the attacker aims at controlling a certain amount of input data in order to manipulate the training phase of the model and, ultimately, the predictions at evaluation time on most points of the test set [35].

By contrast, evasion attacks focus on small manipulations of testing data points resulting in misclassifications at test time [36,37].
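
The canonical example of such a test-time manipulation is the fast gradient sign method (FGSM); a minimal sketch in PyTorch, where the model, inputs, and epsilon are placeholders:

```python
import torch
import torch.nn.functional as F

def fgsm_evasion(model, x, y_true, eps=8 / 255):
    """One-step evasion: nudge each pixel in the direction that increases
    the loss of the true label, bounded by eps."""
    x_adv = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_adv), y_true).backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# the perturbed image looks identical to a human, yet
# model(fgsm_evasion(model, x, y)).argmax(1) may differ from y.
```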

 

The study of adversarial learning and its mitigation measures [38] has unveiled risks and threats to Deep Learning models. Although adversarial training has been designed for defensive purposes [36], it also allows for strengthening DL models in a variety of domains [39–41]. Model robustness, e.g., obtained by means of adversarial perturbations to the input space aimed at enforcing a smoother estimated distribution [42,43], can strengthen predictive performance and make the models usable in critical scenarios.

Adversarial attack: when small noise that is hard for humans to perceive is added to an image, the classifier's performance drops sharply.

 

Concretely, if we consider a self-driving car scenario, the adversarial perturbations injected into a model are just imperceptible modifications to the human eye but can influence the output of the target neural network with very high confidence, e.g., inducing the model to interpret a "Stop" sign as a "Speed Limit" sign and thus leading to potentially harmful consequences [44]. Building models resilient to such attacks is crucial for deploying such applications.

 

A recent classification of the solutions against adversarial attacks distinguishes four main groups of defense approaches: adversarial-training-based solutions, randomization-based schemes, denoising methods, and provable defense techniques.

Adversarial training solutions exploit different types of models and algorithms for adversarial training in order to strengthen the robustness of deep neural networks. Randomization-based techniques try to mitigate the effects of adversarial perturbations by applying random schemes in the learning process (e.g., random input transformation, feature pruning, and noising). Denoising-based approaches include two possible strategies, i.e., denoising the input by removing adversarial perturbations or denoising the high-level features yielded by the hidden layers of the neural model. The last family of techniques is composed of a wide range of solutions exploiting static security approaches (i.e., based on a theoretical demonstration of the system's security) to address the problem.
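
As one concrete instance of the randomization-based family, the random input transformation idea can be sketched as follows (the resize range and padding scheme are arbitrary choices of mine):

```python
import random
import torch
import torch.nn.functional as F

def random_resize_pad(x, out_size=224):
    """Randomly rescale the batch and pad it back to a fixed size, so that a
    precomputed adversarial perturbation no longer lines up with the pixels."""
    new_size = random.randint(int(0.8 * out_size), out_size)
    x = F.interpolate(x, size=(new_size, new_size),
                      mode="bilinear", align_corners=False)
    pad = out_size - new_size
    left, top = random.randint(0, pad), random.randint(0, pad)
    return F.pad(x, (left, pad - left, top, pad - top))

# at inference time: logits = model(random_resize_pad(images))
```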

 

A comprehensive survey on the main adversarial attack approaches and defense strategies can be found in [45].

 

Quite recently, adversarial schemes have also been combined with watermarking for protection purposes, such as tracing provenance or preventing model misuse [46]. This combination can be particularly useful in situations where malicious users can exploit sophisticated ML models for criminal purposes, such as spreading neural-generated fake news and misinformation.

 

4. Fairness

 

Although the main aim of ML-powered systems is typically to meet higher standards of accuracy, a recent research trend concerns the need to pursue other important objectives aimed at balancing the effects of ML-based decisions on ethical or discriminatory aspects. Indeed, the unfairness in such systems can be exploited to perform sophisticated attacks aiming at manipulating user opinions or segregating subpopulations based on irrelevant characteristics [47]. Typical examples are limited access to credit or healthcare due, e.g., to race or gender. As an example, a recent study highlighted that an algorithm used to predict which patients were likely to need extra medical care heavily favored white patients over black patients.

 

Other fairness concerns include the effect of algorithmic bias on opinion radicalization and discrimination based on political polarization. Several platforms adopt sophisticated ML algorithms to predict preferences and choices made by their user base in their daily lives, ranging from what news to read (e.g., Facebook, Twitter) to which products to consume (e.g., Amazon, Netflix) or whom to meet (e.g., Tinder). It is natural to ask whether such algorithms can inadvertently trigger potential social harm. As an example, a recent study on Twitter unveiled that specific political parties can take advantage of higher amplification by exploiting the weaknesses of the underlying algorithms [48].

 

Unfairness may come into play mainly for two reasons: inductive biases of algorithms (i.e., inheriting intrinsic biases of the data) or adversarial attacks (i.e., when a malicious actor purposefully poisons an ML system by exploiting some model weakness). Designing ML models resilient to these issues is challenging because biases or possible attacks are not known in advance; moreover, different types of biases may conflict with one another. Hence, achieving ML models that are robust in terms of fairness is still an open research question.

 

Current approaches aim to achieve fairness with pre-processing, in-processing, and/or post-processing techniques. These solutions represent a first step towards removing unfairness, but since they are mostly heuristic, they are seldom generalizable and effective. One of the advancements is represented by fair adversarial learning, which tries to exploit the Generative Adversarial Network (GAN) architecture to realize an adversarial game that not only learns to discriminate between fake and real instances (like in the standard GAN framework) but also can understand if the example comes from a privileged group or not [49].
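
In a similar spirit, the generic adversarial-debiasing setup can be sketched as follows (this is not the specific GAN formulation of [49]; the models, optimizers, and weighting factor are assumptions): a predictor is trained on the task while an adversary tries to recover the protected attribute from the predictor's output, and the predictor is additionally rewarded for fooling the adversary.

```python
import torch
import torch.nn.functional as F

def fair_adversarial_step(predictor, adversary, x, y, group,
                          opt_pred, opt_adv, lam=1.0):
    """One round of the predictor/adversary game (generic sketch):
    the adversary guesses the protected group from the predictor's logits,
    and the predictor is penalized whenever that guess is easy."""
    logits = predictor(x)

    # adversary step: learn to predict the protected attribute
    opt_adv.zero_grad()
    F.cross_entropy(adversary(logits.detach()), group).backward()
    opt_adv.step()

    # predictor step: fit the task while making the adversary's job hard
    opt_pred.zero_grad()
    task_loss = F.cross_entropy(logits, y)
    fooling_term = -F.cross_entropy(adversary(logits), group)
    (task_loss + lam * fooling_term).backward()
    opt_pred.step()
```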

 

Adversarial attacks on fairness are a novel research line that draws inspiration from the classical adversarial attacks on accuracy (introduced in the previous section), adapting existing methods to threaten the fairness of an ML model. [50] design two gradient-based attacks against binary logistic classifiers. The first assumes knowledge of both the data and the attacked model, while the second is given only the training data distribution (used to train a surrogate model from sample points).

 

[51]

Two strategies are proposed by [51]: (i) an anchor attack, which injects new data points next to the decision boundary to trick the classifier, and (ii) an influence-based attack that tries to maximize the covariance between the predicted classes and the protected sensitive attributes.

They conclude that current defenses based on robustness to adversarial attacks on accuracy may not be sufficient to guarantee fairness and that more flexible data-poisoning strategies may be more suitable to accommodate the goal of the attacker (e.g., better control of the accuracy-fairness tradeoff).

 

[52]

Besides attacking binary classifiers, [52] proposes an adversarial attack against centroid-based fair clustering. The authors solve a bi-level optimization problem: first they identify adversarial centroids that would produce unfair results, and then they find the minimal number of new points that would produce the targeted adversarial centroids after a new iteration of the clustering algorithm.

Nonetheless, [52] acknowledges that current fair clustering approaches could get rid of the injected data (even if minimal), thus limiting the extent of the threat.
