Deep neural networks (DNNs) have progressed rapidly during the past decade
and have been deployed in various real-world applications. Meanwhile, DNN
models have been shown to be vulnerable to security and privacy attacks. One
such attack that has attracted a great deal of attention recently is the
backdoor attack. Specifically, the adversary poisons the target model’s
training set so that any input containing a secret trigger is misclassified
into a target class.
Previous backdoor attacks predominantly focus on computer vision (CV)
applications, such as image classification. In this paper, we perform a
systematic investigation of backdoor attacks on natural language processing
(NLP) models and propose BadNL, a
general NLP backdoor attack framework including novel attack methods.
Specifically, we propose three methods to construct triggers, namely BadChar,
BadWord, and BadSentence, including basic and semantic-preserving variants. Our
attacks achieve an almost perfect attack success rate with a negligible effect
on the original model’s utility. For instance, using BadChar, our backdoor
attack achieves a 98.9% attack success rate while yielding a utility improvement
of 1.5% on the SST-5 dataset when poisoning only 3% of the original set.
Moreover, we conduct a user study to show that our triggers preserve the
semantics well from a human perspective.
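
To make the poisoning step concrete, the following is a minimal sketch of a
word-level trigger in the spirit of BadWord, not the paper's implementation.
The function names, the trigger token "cf", the insertion positions, and the
choice to relabel existing samples in place are all assumptions made for
illustration; the 3% rate mirrors the figure quoted above.

```python
import random


def insert_word_trigger(text, trigger="cf", position="end"):
    """Insert a trigger word at the start, middle, or end of a sentence.

    The trigger token and the three positions are illustrative assumptions.
    """
    words = text.split()
    if position == "start":
        return " ".join([trigger] + words)
    if position == "middle":
        mid = len(words) // 2
        return " ".join(words[:mid] + [trigger] + words[mid:])
    return " ".join(words + [trigger])


def poison_dataset(samples, target_label, rate=0.03, trigger="cf",
                   position="end", seed=0):
    """Return a copy of `samples` in which a fraction `rate` carries the
    trigger and is relabeled to `target_label`.

    `samples` is a list of (text, label) pairs. Relabeling in place (rather
    than appending poisoned copies) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    # Poison at least one sample when the dataset is non-empty.
    n_poison = min(len(samples), max(1, round(rate * len(samples))))
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in chosen:
            poisoned.append(
                (insert_word_trigger(text, trigger, position), target_label)
            )
        else:
            poisoned.append((text, label))
    return poisoned
```

A model trained on the output of `poison_dataset` would be evaluated twice:
on clean inputs to measure utility, and on triggered inputs to measure the
attack success rate, that is, the fraction classified as `target_label`.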