Additive Noise Detection and Its Application To Audio Forensics

Rui Yang∗
∗School of Information Management, Sun Yat-sen University, Guangzhou, P.R.C.
E-mail: [email protected]

978-616-361-823-8 © 2014 APSIPA, APSIPA 2014

Abstract: Digital audio recordings can easily be manipulated with pervasive audio editing software. A forgery is often more than naive splicing: post-processing is part of the tampering and can eliminate the obvious traces of forgery. Noise can cover audible evidence of forgery and destroy traces of other tampering operations, so the detection of additive noise in an audio signal is a useful tool for audio forensics. In this paper, we investigate the effect of additive noise on audio signals and propose a feature named "sign change rate" for detecting additive noise. Theoretical analysis and extensive experiments show that the proposed feature is effective for additive noise detection. The method is also a potential tool for forgery localization in digital audio.

I. INTRODUCTION

Digital audio forensics has recently become a widely studied stream of research in multimedia security. Audio forgery is often more than naive splicing: post-processing is performed after tampering, because otherwise an audible trace of the forgery would remain. Adding noise is a common post-processing step after audio forgery. Frequently used audio editing software, such as CoolEdit and GoldWave, usually provides a function for adding noise. As shown in Fig. 1, the widely used software Audacity contains a function for adding white noise. Nowadays even a user without any knowledge of audio processing can add weak noise to an audio recording, and the perceptual quality of the audio is hardly degraded by it. Additive noise may be applied to audio not only to cover audible evidence of forgery, but also in an attempt to destroy traces of other tampering operations. Thus the detection of additive noise in an audio signal is significant for establishing the authenticity of the audio and its content.

Fig. 1. Adding white noise is a function of Audacity.

The reported work on digital audio forensics focuses on forgery detection [1], [2], [3], recorder identification [4], [5], reverberation [6] and compression history analysis [7], [8], but there is no work on post-processing detection for digital audio. Work on the detection of additive noise in audio has not been reported either. In image forensics, however, much work on post-processing detection has been reported, such as the detection of filtering [9] and the detection of sharpening [10]. Since adding noise is not a good way to hide the forgery traces in an image, the detection of additive noise is of small value in digital image forensics. The case is different for audio, since weak noise is inherent in audio recordings and does not influence the perceptual quality of audio much.

In this paper we focus on additive noise and propose an additive noise detection method for audio signals. The key idea of the proposed method is that if an audio signal is processed twice, the modification introduced by the second process is smaller than that introduced by the first. Adding noise is one such process. We introduce a feature named "sign change rate" to measure the modification.

The rest of this paper is organized as follows. In Section II, we investigate the effect of additive noise on audio signals and show how it eliminates the traces of forgery. We then propose the sign change rate feature and the additive noise detection method in Section III. The experimental results are shown in Section IV. Finally, we conclude the paper with a discussion and future work in Section V.

II. EFFECT OF ADDITIVE NOISE

The forgery trace in an audio signal is easily covered by weak noise. Figure 2 shows an example of adding noise to cover evidence of forgery. Several samples of the original audio are removed, so an obvious change appears at the splicing point. After weak noise is added to the samples around the splicing point, the splicing point is no longer perceptible, whether by listening or by viewing the waveform.

Fig. 2. Example of adding noise to cover evidence of forgery.

Without a reference signal, it is very difficult to determine from the waveform whether a speech signal contains additive noise. Since a speech signal is short-time stationary, the values of neighboring samples fluctuate only slightly. After weak noise is added, different noise values are superimposed on neighboring samples, which enlarges the difference between them. This means that the variance of the differential signal becomes larger after adding noise. In Fig. 3, the first column shows the differential signals of the original speech and of the speech with additive noise. In order to detect additive noise in speech without a reference signal, we actively add white noise to both kinds of speech and investigate the effect of the white noise on the differential signal, as shown in the second column of Fig. 3. Since the additive white noise flips the sign of some values of the differential signal, we compute the sample-wise product between the differential signal and its noise version. We find that the product contains many more negative samples for the original speech, as shown in the fourth column of Fig. 3. Hence, few samples changing sign after actively adding noise is a very strong indication that noise was added previously. In the next section, we introduce the sign change rate to measure the influence of adding noise.

Fig. 3. Effect of additive noise on the differential signal; the case of original speech is at the top, and the case of the noise version is at the bottom. Columns, from left to right: y1, y2, y1*y2, sort(y1*y2) (top row) and y3, y4, y3*y4, sort(y3*y4) (bottom row).

III. SIGN CHANGE RATE AND PROPOSED METHOD

A. Sign change rate

Given a sequence X of length L and its processed version Y = f(X), the number of sign changes K is the number of elements in the set

$$\{\, i \mid x_i \cdot y_i < 0,\ i = 1, \ldots, L \,\}.$$

The sign change rate θ is defined as θ = K/L. We use the sign change rate to measure the effect of additive noise on the audio signal. To illustrate the sign change rate, we randomly select 100 samples of the differential signal of the original speech and of the noise version, respectively, and observe the sign of the sample-wise product between the differential signal and the differential signal with active noise. As shown in Fig. 4, the sign change rate of the original speech is obviously larger than that of the noise version.

Fig. 4. Sign change of the audio signal after adding noise; the case of original speech is at the top, and the case of the noise version is at the bottom. Panel (a), differential signal of x: Δx, Δy, sign(Δx*Δy). Panel (b), differential signal of x': Δx', Δy', sign(Δx'*Δy').

B. Theoretical Proof

Due to the variation of speech signals, theoretical analysis of the general relation between the speech and its noise version is highly non-trivial. For this reason, it is often assumed that the input speech samples are i.i.d. In the following, n1 denotes the noise previously added to the signal and n2 the noise actively added by the detector. We denote the differential signals of x and x + n2 as y1 and y2, respectively; that is, y1 = Δ(x) and y2 = Δ(x + n2). In the case of a signal without additive noise:

$$x \sim N(0, \sigma_0^2) \;\Rightarrow\; y_1 \sim N(0, 2\sigma_0^2) \tag{1}$$

$$x + n_2 \sim N(0, \sigma_0^2 + \sigma_2^2) \;\Rightarrow\; y_2 \sim N(0, 2\sigma_0^2 + 2\sigma_2^2) \tag{2}$$

We denote the differential signals of x + n1 and x + n1 + n2 as y3 and y4, respectively; that is, y3 = Δ(x + n1) and y4 = Δ(x + n1 + n2). In the case of a signal with additive noise:

$$x + n_1 \sim N(0, \sigma_0^2 + \sigma_1^2) \;\Rightarrow\; y_3 \sim N(0, 2\sigma_0^2 + 2\sigma_1^2) \tag{3}$$

$$x + n_1 + n_2 \sim N(0, \sigma_0^2 + \sigma_1^2 + \sigma_2^2) \;\Rightarrow\; y_4 \sim N(0, 2\sigma_0^2 + 2\sigma_1^2 + 2\sigma_2^2) \tag{4}$$

The expected sign change rate of the signal without additive noise is the probability that y1 and y2 have opposite signs:

$$E(\theta_1) = \int_{0}^{+\infty}\!\!\int_{-\infty}^{0} p(y_1, y_2)\, dy_2\, dy_1 + \int_{-\infty}^{0}\!\!\int_{0}^{+\infty} p(y_1, y_2)\, dy_2\, dy_1 \tag{5}$$

$$p(y_1, y_2) = \frac{1}{2\pi\sqrt{(2\sigma_0^2)(2\sigma_2^2)}} \exp\!\left( \frac{-y_1^2}{2\sigma_0^2} + \frac{-(y_2 - y_1)^2}{2\sigma_2^2} \right) \tag{6}$$

$$E(\theta_1) = \frac{1}{2} - \frac{\arctan\sqrt{\sigma_0^2 / 2\sigma_2^2}}{\pi} \tag{7}$$

Likewise, for the signal with additive noise:

$$E(\theta_2) = \int_{0}^{+\infty}\!\!\int_{-\infty}^{0} p(y_3, y_4)\, dy_4\, dy_3 + \int_{-\infty}^{0}\!\!\int_{0}^{+\infty} p(y_3, y_4)\, dy_4\, dy_3 \tag{8}$$

$$p(y_3, y_4) = \frac{1}{2\pi\sqrt{(2\sigma_0^2 + 2\sigma_1^2)(2\sigma_2^2)}} \exp\!\left( \frac{-y_3^2}{2\sigma_0^2 + 2\sigma_1^2} + \frac{-(y_4 - y_3)^2}{2\sigma_2^2} \right) \tag{9}$$

$$E(\theta_2) = \frac{1}{2} - \frac{\arctan\sqrt{(\sigma_0^2 + \sigma_1^2) / 2\sigma_2^2}}{\pi} \tag{10}$$

Since $\sigma_0^2 < \sigma_0^2 + \sigma_1^2$ and arctan(·) is monotonically increasing, it follows that E(θ1) > E(θ2). The difference between E(θ2) and E(θ1) depends on

$$\arctan\sqrt{(\sigma_0^2 + \sigma_1^2) / 2\sigma_2^2} - \arctan\sqrt{\sigma_0^2 / 2\sigma_2^2}.$$

Thus the larger $\sigma_1^2$ is, the larger the difference between E(θ2) and E(θ1); and the smaller $\sigma_2^2$ is, the larger the difference between E(θ2) and E(θ1).

C. Proposed Method

Based on the above investigation and theoretical analysis, we find that speech with additive noise can be recognized by its sign change rate. Given the audio signal x, the process of detecting additive noise is as follows.

1) Additive noise $n_2 \sim N(0, \sigma_2^2)$ is added to x.
2) Δ(x) and Δ(x + n2) are calculated.
3) The sign change rate between Δ(x) and Δ(x + n2), denoted zc, is computed.
4) If zc < Th, x is regarded as a signal to which additive noise has been added; otherwise it is regarded as being without additive noise. The threshold Th is selected based on observations from extensive experiments.

IV. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed method, we test it on a data set of 384 speech clips. These WAV clips, 5 seconds each, are downloaded from two publicly available uncompressed audio databases [11], [12]. All clips are 22.05 kHz, 16 bit, mono. In Section IV-A we examine whether the proposed method can correctly distinguish the noise version from the original audio. Section IV-B checks whether the method works reliably when the audio is short. Section IV-C reports how the method behaves for different strengths $\sigma_2^2$ of the active noise. In Section IV-D, we apply the proposed method to locate audio forgery.

A false positive error means that an original speech clip is classified as a noise version, while a false negative error means that a noise version is classified as an original clip. We denote the false positive and false negative error rates as fp and fn, respectively. The accuracy rate AR is calculated as

$$AR = \frac{2 - (fp + fn)}{2} \times 100\%.$$

A. Detection of Additive Noise

We add white Gaussian noise of SNR = 30 dB to the 384 speech clips and save them in the same format as the original ones. This gives two classes of speech signal: the original clips and their noise versions. We then use the proposed method to classify the two classes. First, we actively add white Gaussian noise of SNR = 35 dB to all speech clips. Then we compute the sign change rate of each clip, as shown in Fig. 5. The sign change rates of the noise versions are all around 0.15, whereas the sign change rates of the original speech are mostly above 0.3, so the two classes differ significantly. Based on this observation, the threshold is selected as Th = 0.2, and the accuracy rate AR is 100%.

Fig. 5. Detection result of 5 s clips using 35 dB active noise. (Plot title: "Noise 30dB, Active Noise 35dB"; x-axis: different audio clips; y-axis: sign change rate; legend: without additive noise, with additive noise.)

The above experiment investigates noise of strength SNR = 30 dB. Is the proposed method also valid for additive noise of different strengths? We add noise of 20 dB, 25 dB and 35 dB to the 384 speech clips, respectively, and save the results as noise versions for classification. As in the previous detection procedure, we actively add white Gaussian noise of SNR = 35 dB to all speech clips and then compute the sign change rate. The results are shown in Fig. 6. When the additive noise is 20 dB, 25 dB or 30 dB, the sign change rates of the original speech and the noise version are distinct. Even when the additive noise is as weak as SNR = 35 dB, only 5 of the 384 original speech clips are misclassified as noise versions.

Fig. 6. Detection result of different noise levels. (Plot title: "Different levels of noise, Active Noise 35dB"; x-axis: different audio clips; y-axis: sign change rate; legend: without additive noise, with 20dB, 25dB, 30dB and 35dB additive noise.)

B. Different Length of Audio

The above experimental results show that the proposed method performs well in detecting additive noise in 5-second audio clips. In practice, however, post-processing is performed only on the portion around the forgery position, which may last only 1 second. Hence, it is necessary to test whether the proposed method works for very short clips. For this purpose, we randomly select a 1-second and a 0.5-second portion of each of the above clips for testing.

As shown in Fig. 7, when the same threshold Th = 0.2 is applied, only 5 of the 384 original speech clips are misclassified for speech of 1 second (AR = 99.3%), and 7 clips are misclassified for speech of 0.5 second (AR = 99.0%). Longer clips therefore give more reliable detection: the sign change rate is more stable when the speech lasts 1 second. Since the sign change rate is a statistical quantity, more samples yield a more reliable result.

Fig. 7. Detection result of speech clips with the length of 1 s (top) and 0.5 s (bottom).

C. The Impact of $\sigma_2^2$

From Equations (7) and (10), the sign change rate relies heavily on the value of $\sigma_2^2$, the strength of the active noise. Hence, the impact of $\sigma_2^2$ on our method should be investigated. The experimental setup is the same as the first one in Section IV-A, except for the strength of the active noise $\sigma_2^2$: this time the active noise is of SNR = 30 dB and SNR = 25 dB, respectively. As shown in Fig. 8, the sign change rates of the original speech and the noise version are always distinct. Compared with Fig. 5, the sign change rate of the noise version is raised, which is in accord with Equation (7). At the same time, the threshold for classification should rise. This shows that the threshold depends only on the strength of the active noise, which is under the control of the detector.

Fig. 8. Detection result of speech using 30 dB (top) and 25 dB (bottom) active noise. (Plot titles: "Noise 30dB, Active Noise 30dB" and "Noise 30dB, Active Noise 25dB"; x-axis: different audio clips; y-axis: sign change rate.)

D. Locating Forgeries via Detecting Additive Noise

A potential application of detecting additive noise is forgery localization. For example, after an insertion or splicing operation, a forger may add some weak noise around the splicing positions to eliminate traces. This kind of forged audio can be identified by our proposed method. A suspected audio recording can be separated into several clips of 1 or 0.5 second in length, and each portion is then examined with our method, so that the forgery is clearly caught. An audio recording made by combining two recordings can also be identified. As shown in Fig. 9, an audio recording contains two parts that were recorded in different environments. We divide the recording into several clips of 1 second, add active noise to each clip, and compute the sign change rate.

Fig. 9. Splicing localization via detecting additive noise.

V. CONCLUSIONS

In this paper, we have investigated the detection of additive noise in digital audio. In the broader framework of digital audio forensics, we see this work as a contribution to the problem of examining post-processing of audio. Theoretical analysis and extensive experiments show that the sign change rate is effective for additive noise detection. Further, the proposed method can be extended into a forgery localization tool in some audio forensics cases. However, the method still has many limitations: for example, the sign change rate of original speech fluctuates considerably with its content, and only additive noise is analyzed. In our future work, we will consider applying the method to detect other kinds of noise. A quantitative criterion for the classification threshold will also be studied.
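The detection procedure of Section III-C and the window-based localization of Section IV-D can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments. The function names (`add_white_noise`, `sign_change_rate`, `has_additive_noise`, `locate_noisy_windows`) are hypothetical, and so is the choice to set the active noise level relative to the power of each clip. The 50 Hz test tone is a synthetic stand-in for a smooth speech segment; it is not data from the paper.

```python
import math
import random


def add_white_noise(x, snr_db, rng):
    """Return x plus white Gaussian noise at the requested SNR (in dB)."""
    signal_power = sum(v * v for v in x) / len(x)
    noise_std = math.sqrt(signal_power / 10.0 ** (snr_db / 10.0))
    return [v + rng.gauss(0.0, noise_std) for v in x]


def diff(x):
    """First-order differential signal: x[i+1] - x[i]."""
    return [b - a for a, b in zip(x, x[1:])]


def sign_change_rate(x, active_snr_db=35.0, rng=None):
    """Steps 1-3: add active noise, difference both signals, count sign flips."""
    if rng is None:
        rng = random.Random()
    d_before = diff(x)
    d_after = diff(add_white_noise(x, active_snr_db, rng))
    flips = sum(1 for a, b in zip(d_before, d_after) if a * b < 0)
    return flips / len(d_before)


def has_additive_noise(x, threshold=0.2, active_snr_db=35.0, rng=None):
    """Step 4: a low sign change rate suggests noise was already present."""
    return sign_change_rate(x, active_snr_db, rng) < threshold


def locate_noisy_windows(x, window_len, threshold=0.2, active_snr_db=35.0, rng=None):
    """Split x into consecutive windows and return the indices flagged as noisy."""
    flagged = []
    for k in range(len(x) // window_len):
        window = x[k * window_len:(k + 1) * window_len]
        if has_additive_noise(window, threshold, active_snr_db, rng):
            flagged.append(k)
    return flagged


# Synthetic stand-in for smooth speech: a 50 Hz tone sampled at 22.05 kHz.
rng = random.Random(0)
fs = 22050
clean = [math.sin(2 * math.pi * 50 * i / fs) for i in range(3 * fs)]

# "Noise version": the whole clip with 30 dB white noise added beforehand.
noisy = add_white_noise(clean, 30.0, rng)

# "Forged" clip: only the middle second carries the 30 dB noise.
tampered = clean[:fs] + add_white_noise(clean[fs:2 * fs], 30.0, rng) + clean[2 * fs:]

print("rate, clean clip:", sign_change_rate(clean, rng=rng))
print("rate, noise version:", sign_change_rate(noisy, rng=rng))
print("flagged 1 s windows:", locate_noisy_windows(tampered, fs, rng=rng))
```

The threshold of 0.2 and the 35 dB active noise follow the values reported in Section IV-A. For other material the threshold would presumably need to be re-tuned, as Section IV-C indicates.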
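The ordering E(θ1) > E(θ2) argued in Section III-B can also be checked numerically under the i.i.d. Gaussian model. The sketch below relies on a standard result for zero-mean Gaussians: if a has variance v_a and an independent perturbation w has variance v_w, then a and a + w have opposite signs with probability 1/2 − arctan(√(v_a / v_w)) / π. The mapping of v_a and v_w onto σ0², σ1² and σ2² used here (the variances of the differential signals in Eqs. (1) to (4)) is an assumption of this sketch, and the constant inside the arctangent depends on that normalization. The function names and the numeric variances are illustrative only.

```python
import math
import random


def expected_rate(var_before, var_active):
    """Closed-form probability that a and a + w differ in sign (zero-mean Gaussians)."""
    return 0.5 - math.atan(math.sqrt(var_before / var_active)) / math.pi


def simulated_rate(var_before, var_active, n, rng):
    """Monte Carlo estimate of the same probability."""
    flips = 0
    for _ in range(n):
        a = rng.gauss(0.0, math.sqrt(var_before))
        b = a + rng.gauss(0.0, math.sqrt(var_active))
        if a * b < 0:
            flips += 1
    return flips / n


rng = random.Random(0)
sigma0_sq, sigma1_sq, sigma2_sq = 1.0, 4.0, 2.0  # illustrative values only
var_active = 2 * sigma2_sq                        # variance of the differenced active noise

cases = [
    ("original (theta_1)", 2 * sigma0_sq),
    ("noise version (theta_2)", 2 * (sigma0_sq + sigma1_sq)),
]
for label, var_before in cases:
    print(label,
          expected_rate(var_before, var_active),
          simulated_rate(var_before, var_active, 200_000, rng))
```

In this model, a larger variance of the signal before the active noise is added gives a smaller expected sign change rate, which is the qualitative behaviour the detector exploits.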

ACKNOWLEDGMENT

This work was supported by NSFC (61202497) and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR).

REFERENCES

[1] C. Grigoras, “Digital audio recording analysis: The electric network frequency (ENF) criterion,” The International Journal of Speech Language and the Law, vol. 2, no. 1, pp. 63–76, 2005.
[2] H. Farid, “Detecting digital forgeries using bispectral analysis,” MIT AI Memo AIM-1657, MIT, 1999.
[3] R. Yang, Z. Qu, and J. Huang, “Exposing audio forgeries using frame offsets,” ACM Transactions on Multimedia Computing, Communications and Application, vol. 8, no. S2, pp. 35:1–35:20, 2012.
[4] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, “Digital audio forensics: A first practical evaluation on microphone and environment classification,” in Proc. of the Workshop on Multimedia and Security, Dallas, Texas, USA, 2007, pp. 63–74.
[5] R. Buchholz, C. Kraetzer, and J. Dittmann, “Microphone classification using Fourier coefficients,” in Proc. of the International Workshop on Information Hiding, Darmstadt, Germany, June 2009.
[6] H. Malik and H. Farid, “Audio forensics from acoustic reverberation,” in Proc. of the International Conference on Acoustics Speech and Signal Processing, March 2010, pp. 1710–1713.
[7] S. Hicsonmez, E. Uzun, and H. T. Sencar, “Methods for identifying traces of compression in audio,” in Proc. of the 1st International Conference on Communications, Signal Processing, and their Applications, Sharjah, 2013, pp. 1–6.
[8] R. Yang, Y. Q. Shi, and J. Huang, “Detecting double compression of audio signal,” in Proc. of SPIE 7541, Media Forensics and Security II, 2010.
[9] H. Yuan, “Blind forensics of median filtering in digital images,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 4, pp. 1335–1345, 2011.
[10] G. Cao, Y. Zhao, and R. Ni, “Detection of image sharpening based on histogram aberration and ringing artifacts,” in Proceedings of IEEE International Conference on Multimedia and Expo, 2009, pp. 1026–1029.
[11] Sound Lab @ Princeton, http://soundlab.cs.princeton.edu
[12] The Music-Speech Corpus, http://www.ee.columbia.edu/~dpwe/sounds/