
Introduction to Audio Processing in Human-Computer Interaction

Angelo Antonio Salatino
[email protected]
http://infernusweb.altervista.org

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Overview

• Audio Signal Processing;
• Waveform Audio File Format;
• FFmpeg;
• Audio Processing with Matlab;
• Doing phonetics with Praat;
• Last but not least: Homework.

Audio Signal Processing

Audio signal processing is an engineering field that focuses on the computational methods for intentionally altering auditory signals or sounds, in order to achieve a particular goal.

(Block diagram: Input Signal → Audio Signal Processing → Output Signal / data with meaning.)

Audio Processing in HCI

Some HCI applications involving audio signal processing are:
• Speech Emotion Recognition
• Speaker Recognition
  ▫ Speaker Verification
  ▫ Speaker Identification
• Voice Commands
• Speech to Text
• Etc.

Audio Signals

You can find audio signals represented in either digital or analog format.

• Digital – the pressure wave-form is a sequence of symbols, usually binary numbers.

• Analog – the pressure wave-form is a smooth wave of energy represented by a continuous stream of data.

Analog to Digital Converter (ADC)

• Don’t worry, it’s only a quick review!

(ADC pipeline: continuous signal (continuous in time and amplitude) → Sample & Hold (discrete in time, continuous in amplitude; the sampling frequency must be defined) → Quantization (discrete in time and amplitude; the number of bits per sample must be defined) → Encoding → digital signal (discrete in time and amplitude).)
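As a toy illustration of the pipeline above, here is a minimal C++ sketch (not part of the original slides; the parameters are made up: a 1 kHz sine sampled at 8 kHz and quantized to 8 bits):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const double PI = 3.14159265358979323846;
    const double fs = 8000.0;      // sampling frequency: must be defined
    const int bitsPerSample = 8;   // number of bits per sample: must be defined
    const double f0 = 1000.0;      // frequency of the analog tone we pretend to digitize
    const int numSamples = 16;
    const int levels = 1 << bitsPerSample;   // 2^8 = 256 quantization levels

    std::vector<uint8_t> encoded(numSamples);
    for (int n = 0; n < numSamples; ++n) {
        double x = std::sin(2.0 * PI * f0 * n / fs);                // sampling (sample & hold)
        int q = (int)std::lround((x + 1.0) / 2.0 * (levels - 1));   // quantization to 256 levels
        encoded[n] = (uint8_t)q;                                    // encoding as a binary number
        std::printf("n=%2d  amplitude=%+.3f  code=%3u\n", n, x, (unsigned)encoded[n]);
    }
    return 0;
}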

• For each measurement a number is assigned according to its amplitude.
• The sampling frequency and the number of bits used to represent a sample can be considered the main features of a digital signal.
• How are these digital signals stored?

Waveform Audio File Format (WAV)

The WAV file is an instance of a Resource Interchange File Format (RIFF), defined by IBM and Microsoft. RIFF is a generic file container format for storing data in tagged chunks (basic building blocks). It is a file structure that defines a class of more specific file formats, such as wav, avi, rmi, etc.

Endianness | Byte Offset | Field Name    | Field Size    | Description
Big        | 0           | ChunkID       | 4             | RIFF Chunk Descriptor
Little     | 4           | ChunkSize     | 4             | RIFF Chunk Descriptor
Big        | 8           | Format        | 4             | RIFF Chunk Descriptor
Big        | 12          | SubChunk1ID   | 4             | Format SubChunk
Little     | 16          | SubChunk1Size | 4             | Format SubChunk
Little     | 20          | AudioFormat   | 2             | Format SubChunk
Little     | 22          | NumChannels   | 2             | Format SubChunk
Little     | 24          | SampleRate    | 4             | Format SubChunk
Little     | 28          | ByteRate      | 4             | Format SubChunk
Little     | 32          | BlockAlign    | 2             | Format SubChunk
Little     | 34          | BitsPerSample | 2             | Format SubChunk
Big        | 36          | SubChunk2ID   | 4             | Data SubChunk
Little     | 40          | SubChunk2Size | 4             | Data SubChunk
Little     | 44          | Data          | SubChunk2Size | Data SubChunk

ChunkID: contains the letters «RIFF» in ASCII form (0x52494646 in big-endian form).

ChunkSize: the size of the rest of the chunk following this number, i.e. the size of the entire file in bytes minus 8 for the two fields not included: ChunkID and ChunkSize.

Format: contains the letters «WAVE» in ASCII form (0x57415645 in big-endian form).
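For instance, a minimal sketch (a hypothetical helper, not from the slides) that checks these first fields on a header already read into memory; the ASCII magic values are stored big-endian, while ChunkSize is a little-endian integer:

#include <cstdint>
#include <cstring>

// Check the RIFF/WAVE magic bytes of a WAV header buffer (at least 12 bytes long).
bool looksLikeWav(const unsigned char* hdr) {
    if (std::memcmp(hdr, "RIFF", 4) != 0) return false;      // ChunkID, bytes 0..3
    if (std::memcmp(hdr + 8, "WAVE", 4) != 0) return false;  // Format, bytes 8..11
    uint32_t chunkSize;
    std::memcpy(&chunkSize, hdr + 4, 4);   // ChunkSize, bytes 4..7 (assumes a little-endian CPU)
    return chunkSize + 8 >= 44;            // file size = ChunkSize + 8; a WAV header needs 44 bytes
}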

SubChunk1ID: contains the letters «fmt » in ASCII form (0x666d7420 in big-endian form).

SubChunk1Size: 16 for PCM. This is the size of the SubChunk that follows this number.

AudioFormat: format code or compression type:
  PCM = 0x0001 (linear quantization, uncompressed)
  IEEE_FLOAT = 0x0003
  Microsoft_ALAW = 0x0006
  Microsoft_MLAW = 0x0007
  IBM_ADPCM = 0x0103
  …

NumChannels: Mono = 1, Stereo = 2, etc. Note: channels are interleaved.

SampleRate: sampling frequency: 8000, 16000, 44100, etc.

ByteRate: average bytes per second; it is given by Equation 1.

BlockAlign: the number of bytes for one sample including all channels; it is given by Equation 2.

1) ByteRate = SampleRate ⋅ NumChannels ⋅ BitsPerSample / 8
2) BlockAlign = NumChannels ⋅ BitsPerSample / 8

BitsPerSample: 8 bits = 8, 16 bits = 16, etc.

SubChunk2ID: contains the letters «data» in ASCII form (0x64617461 in big-endian form).

SubChunk2Size: the number of bytes in the Data field. If AudioFormat = PCM, then you can compute the number of samples with Equation 3.

3) NumOfSamples = 8 ⋅ SubChunk2Size / (NumChannels ⋅ BitsPerSample)
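As a quick sanity check of Equations 1–3, a small sketch (the values are taken from the example header shown next) that computes the derived fields:

#include <cstdio>

int main() {
    // Values from the example header below: 16000 Hz, mono, 16 bit, 66034 data bytes.
    const int sampleRate = 16000, numChannels = 1, bitsPerSample = 16;
    const int subChunk2Size = 66034;

    int byteRate   = sampleRate * numChannels * bitsPerSample / 8;        // Eq. 1 -> 32000
    int blockAlign = numChannels * bitsPerSample / 8;                     // Eq. 2 -> 2
    int numSamples = 8 * subChunk2Size / (numChannels * bitsPerSample);   // Eq. 3 -> 33017

    std::printf("ByteRate = %d, BlockAlign = %d, NumOfSamples = %d\n",
                byteRate, blockAlign, numSamples);
    return 0;
}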

Example of wave header

RIFF Chunk Descriptor + Fmt SubChunk (bytes 0–23):
52 49 46 46 16 02 01 00 57 41 56 45 66 6d 74 20 10 00 00 00 01 00 01 00

Fmt SubChunk (cont…) + Data SubChunk (bytes 24–43, followed by the data):
80 3e 00 00 00 7d 00 00 02 00 10 00 64 61 74 61 f2 01 01 00 …

Decoded fields:
52 49 46 46 → "RIFF" (ChunkID)
16 02 01 00 → ChunkSize = 66070
57 41 56 45 → "WAVE" (Format)
66 6d 74 20 → "fmt " (SubChunk1ID)
10 00 00 00 → SubChunk1Size = 16
01 00       → AudioFormat = 1 (PCM)
01 00       → NumChannels = 1
80 3e 00 00 → SampleRate = 16000
00 7d 00 00 → ByteRate = 32000
02 00       → BlockAlign = 2
10 00       → BitsPerSample = 16
64 61 74 61 → "data" (SubChunk2ID)
f2 01 01 00 → SubChunk2Size = 66034
…           → Data

Exercise

For the next 15 minutes, write a C/C++ program that takes a wav file as input and prints the following values on standard output:
• Header size;
• Sample rate;
• Bits per sample;
• Number of channels;
• Number of samples.

Good work!

Solution

typedef struct header_file {
    char chunk_id[4];
    int chunk_size;
    char format[4];
    char subchunk1_id[4];
    int subchunk1_size;
    short int audio_format;
    short int num_channels;
    int sample_rate;
    int byte_rate;
    short int block_align;
    short int bits_per_sample;
    char subchunk2_id[4];
    int subchunk2_size;
} header;

/************** Inside Main() **************/
// Requires <iostream> and <fstream>, using namespace std
header* meta = new header;
ifstream infile;
infile.exceptions(ifstream::eofbit | ifstream::failbit | ifstream::badbit);
infile.open("foo.wav", ios::in | ios::binary);
infile.read((char*)meta, sizeof(header));

cout << " Header size: " << sizeof(header) << " bytes" << endl;
cout << " Sample rate: " << meta->sample_rate << " Hz" << endl;
cout << " Bits per sample: " << meta->bits_per_sample << " bit" << endl;
cout << " Number of channels: " << meta->num_channels << endl;

long numOfSample = (meta->subchunk2_size / meta->num_channels) / (meta->bits_per_sample / 8);
cout << " Number of samples: " << numOfSample << endl;

However, this solution contains an error. Can you spot it?

What about reading samples?

short int* pU = NULL;
unsigned char* pC = NULL;

double** gWavDataIn = new double*[meta->num_channels]; // data structure storing the samples
for (int i = 0; i < meta->num_channels; i++)
    gWavDataIn[i] = new double[numOfSample];

char* wBuffer = new char[meta->subchunk2_size]; // data structure storing the raw bytes
infile.read(wBuffer, meta->subchunk2_size);     // fill it with the content of the data chunk

/* data conversion: from bytes to samples (channels are interleaved) */
if (meta->bits_per_sample == 16) {
    pU = (short*) wBuffer;
    for (int i = 0; i < numOfSample; i++)
        for (int j = 0; j < meta->num_channels; j++)
            gWavDataIn[j][i] = (double) pU[i * meta->num_channels + j];
} else if (meta->bits_per_sample == 8) {
    pC = (unsigned char*) wBuffer;
    for (int i = 0; i < numOfSample; i++)
        for (int j = 0; j < meta->num_channels; j++)
            gWavDataIn[j][i] = (double) pC[i * meta->num_channels + j];
} else {
    printERR("Unhandled case");
}

This solution is available at: https://github.com/angelosalatino/AudioSignalProcessing

A better solution: FFmpeg

What FFmpeg says about itself:
• FFmpeg is the leading multimedia framework, able to decode, encode, transcode, mux, demux, stream, filter and play pretty much anything that humans and machines have created. It supports the most obscure ancient formats up to the cutting edge, no matter whether they were designed by some standards committee, the community or a corporation.

Why is FFmpeg better?

• Off-the-shelf;
• Open source;
• We can read samples from different kinds of formats: wav, aac, flac and so on;
• The code is always the same for all these audio formats;
• It can also decode video formats.

A little bit of code …

Step 1
• Create AVFormatContext
  ▫ Format I/O context: nb_streams, filename, start_time, duration, bit_rate, audio_codec_id, video_codec_id and so on.
• Open the file

AVFormatContext* formatContext = NULL;
av_open_input_file(&formatContext, "foo.wav", NULL, 0, NULL); // deprecated API, see the note below
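Note: av_open_input_file() has since been removed from FFmpeg. A minimal sketch of the same step with the newer libavformat API (the helper name and the error handling are illustrative, not from the original slides):

extern "C" {
#include <libavformat/avformat.h>
}

// Open a media file and read its stream information with the current API.
AVFormatContext* openAudioFile(const char* path) {
    AVFormatContext* formatContext = NULL;
    if (avformat_open_input(&formatContext, path, NULL, NULL) != 0)
        return NULL;                              // could not open the file
    if (avformat_find_stream_info(formatContext, NULL) < 0) {
        avformat_close_input(&formatContext);     // release on failure
        return NULL;
    }
    return formatContext;
}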

A little bit of code …

Step 2
• Create AVStream
  ▫ Stream structure; it contains: nb_frames, codec_context, duration and so on.
• Association between the audio stream inside the context and the new one.

// Find the audio stream (some container files can have multiple streams in them)
AVStream* audioStream = NULL;
for (unsigned int i = 0; i < formatContext->nb_streams; ++i)
    if (formatContext->streams[i]->codec->codec_type == AVMEDIA_TYPE_AUDIO) {
        audioStream = formatContext->streams[i];
        break;
    }

A little bit of code …

Step 3
• Create AVCodecContext
  ▫ Main external API structure; it contains: codec_name, codec_id and so on.
• Create AVCodec
  ▫ Codec structure; it contains deep-level information about the codec.
• Find codec availability
• Open the codec

AVCodecContext* codecContext = audioStream->codec;
AVCodec* codec = avcodec_find_decoder(codecContext->codec_id);
avcodec_open(codecContext, codec);

A little bit of code …

Step 4
• Create AVPacket
  ▫ This structure stores compressed data.

• Create AVFrame
  ▫ This structure describes decoded (raw) audio or video data.

AVPacket packet;
av_init_packet(&packet);
…
AVFrame* frame = avcodec_alloc_frame();

A little bit of code …

Step 5
• Read packets
  ▫ Packets are read from the AVFormatContext.

• Decode packets
  ▫ Frames are decoded with the AVCodecContext.

// Read the packets in a loop
while (av_read_frame(formatContext, &packet) == 0) {
    …
    avcodec_decode_audio4(codecContext, frame, &frameFinished, &packet);
    …
    src_data = frame->data[0];
}

Problems with FFmpeg
• Update issues (after a library update, your previous code might not work)
  ▫ Deprecated methods;
  ▫ Function names or parameters could change.
• Poor documentation (to date)

Example of migration:
• avcodec_open (AVCodecContext *avctx, const AVCodec *codec)
• avcodec_open2 (AVCodecContext *avctx, const AVCodec *codec, AVDictionary **options)
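Beyond avcodec_open2, the decoding calls used in Steps 3–5 also have newer counterparts (avcodec_decode_audio4, av_init_packet and avcodec_alloc_frame are deprecated or removed). A sketch of the same flow against the send/receive API available since FFmpeg 3.1 follows; the function names are the current libavcodec ones, everything else (helper name, error handling) is illustrative:

extern "C" {
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
}
#include <cstdint>

// Open the decoder for one audio stream and decode all of its packets.
void decodeAudioStream(AVFormatContext* formatContext, int audioStreamIndex) {
    AVStream* audioStream = formatContext->streams[audioStreamIndex];

    // Step 3: find and open the decoder (avcodec_open2 instead of avcodec_open).
    const AVCodec* codec = avcodec_find_decoder(audioStream->codecpar->codec_id);
    AVCodecContext* codecContext = avcodec_alloc_context3(codec);
    avcodec_parameters_to_context(codecContext, audioStream->codecpar);
    avcodec_open2(codecContext, codec, NULL);

    // Step 4: packet and frame (av_packet_alloc/av_frame_alloc replace the old helpers).
    AVPacket* packet = av_packet_alloc();
    AVFrame* frame = av_frame_alloc();

    // Step 5: read packets and decode them with send/receive
    // (this replaces avcodec_decode_audio4).
    while (av_read_frame(formatContext, packet) == 0) {
        if (packet->stream_index == audioStreamIndex &&
            avcodec_send_packet(codecContext, packet) == 0) {
            while (avcodec_receive_frame(codecContext, frame) == 0) {
                const uint8_t* src_data = frame->data[0]; // raw samples (first channel/plane)
                (void)src_data;                           // process the samples here
            }
        }
        av_packet_unref(packet);
    }

    av_frame_free(&frame);
    av_packet_free(&packet);
    avcodec_free_context(&codecContext);
}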

Audio Processing with Matlab

• Matlab contains a lot of built-in functions to read, listen to, manipulate and save audio files.
• It also offers the Signal Processing Toolbox and the DSP System Toolbox.

Advantages
• Well documented;
• It works at different levels of abstraction;
• Direct access to samples;
• Coding is simple.

Disadvantages
• Only wav, flac, mp3, mpeg-4 and ogg formats are recognized by audioread (is it really a disadvantage?);
• The license is expensive.

Let’s code: Opening files

%% Reading file
% Section ID = 1

filename = './test.wav';
[data,fs] = wavread(filename); % reads only wav files

% data = sample collection, fs = sampling frequency

% or ---> [data,fs] = audioread(filename);

Formats recognized by audioread(): wav, flac, mp3, mpeg-4 and ogg.

% write an audio file
audiowrite('./testCopy.wav',data,fs)

Information and play

%% Information & play
% Section ID = 2
numberOfSamples = length(data);
tempo = numberOfSamples / fs; % duration in seconds

disp (sprintf('Length: %f seconds',tempo));
disp (sprintf('Number of Samples %d', numberOfSamples));
disp (sprintf('Sampling Frequency %d Hz',fs));
disp (sprintf('Number of Channels: %d', min(size(data))));

% play the file
sound(data,fs);

% PLOT the signal
time = linspace(0,tempo,numberOfSamples);
plot(time,data);

Framing

s(t) = x(t) ⋅ rect((t − τ) / #sample)

%% Framing
% Section ID = 4

timeWindow = 0.04; % frame length in seconds. Default: timeWindow = 40ms
timeStep = 0.01;   % seconds between two frames (in case of OVERLAPPING). Default: timeStep = 10ms

overlap = 1; % 1 in case of overlap, 0 no overlap
sampleForWindow = timeWindow * fs;

if overlap == 0
    Y = buffer(data,sampleForWindow);
else
    sampleToJump = sampleForWindow - timeStep * fs;
    Y = buffer(data,sampleForWindow,ceil(sampleToJump));
end

[m,n] = size(Y); % m corresponds to sampleForWindow
numFrames = n;

disp(sprintf('Number of Frames: %d',numFrames));

Windowing

w_GAUSS(n) = exp(−(1/2) ⋅ ((n − (N−1)/2) / (σ ⋅ (N−1)/2))²), σ ≤ 0.5
w_HAMMING(n) = 0.54 + 0.46 ⋅ cos(2πn / (N−1))
w_HANN(n) = 0.5 ⋅ (1 + cos(2πn / (N−1)))

%% Windowing
% Section ID = 5
num_points = sampleForWindow; % some windows; USE help window
w_gauss = gausswin(num_points);
w_hamming = hamming(num_points);
w_hann = hann(num_points);
plot(1:num_points,[w_gauss,w_hamming,w_hann]);
axis([1 num_points 0 2]);
legend('Gaussian','Hamming','Hann');

old_Y = Y;
for i=1:numFrames
    Y(:,i) = Y(:,i).*w_hann;
end

% see the difference
index_to_plot = 88;
figure
plot (old_Y(:,index_to_plot))
hold on
plot (Y(:,index_to_plot), 'green')
hold off
clear num_points w_gauss w_hamming w_hann

Energy

%% Energy
% Section ID = 6

% It requires that the signal is already framed
% Run Section ID=4

for i=1:numFrames
    energy(i) = sum(abs(old_Y(:,i)).^2);
end

figure, plot(energy)

E = Σ_{i=1}^{N} |x(i)|²

Fast Fourier Transform (FFT)

%% Fast Fourier Transform (on the whole signal)
% Section ID = 7

NFFT = 2^nextpow2(numberOfSamples); % next higher power of 2 (in order to optimize the FFT computation)
freqSignal = fft(data,NFFT);
f = fs/2*linspace(0,1,NFFT/2+1);

% PLOT
plot(f,abs(freqSignal(1:NFFT/2+1)))
title('Single-Sided Amplitude Spectrum of y(t)')
xlabel('Frequency (Hz)')
ylabel('|Y(f)|')

clear NFFT freqSignal f

Short Term Fourier Transform (STFT)

%% Short Term Fourier Transform
% Section ID = 8
% It requires that the signal is already framed. Run Section ID=4
NFFT = 2^nextpow2(sampleForWindow);
STFT = ones(NFFT,numFrames);

for i=1:numFrames
    STFT(:,i) = fft(Y(:,i),NFFT);
end

indexToPlot = 80; % frame index to plot
if indexToPlot < numFrames
    f = fs/2*linspace(0,1,NFFT/2+1);
    plot(f,2*abs(STFT(1:NFFT/2+1,indexToPlot))) % PLOT
    title(sprintf('FFT of frame %d', indexToPlot));
    xlabel('Frequency (Hz)')
    ylabel(sprintf('|STFT_{%d}(f)|',indexToPlot))
else
    disp('Unable to create plot');
end

% *********************************************
specgram(data,sampleForWindow,fs) % SPECTROGRAM
title('Spectrogram [dB]')

Auto-correlation

%% Auto-correlation per frame
% Section ID = 9

% It requires that the signal is already framed
% Run Section ID=4

Rx(n) = Σ_{i=1}^{N} x(i) ⋅ x(i + n)

for i=1:numFrames
    autoCorr(:,i) = xcorr(Y(:,i));
end

indexToPlot = 80; % frame index to plot

if indexToPlot < numFrames
    % PLOT
    plot(autoCorr(sampleForWindow:end,indexToPlot))
else
    disp('Unable to create plot');
end

clear indexToPlot

A system for doing phonetics: Praat

• PRAAT is a comprehensive speech analysis, synthesis, and manipulation package developed by Paul Boersma and David Weenink at the Institute of Phonetic Sciences of the University of Amsterdam, The Netherlands.

Pitch with Praat

(Screenshot: a pitch contour extracted with Praat.)

Formants with Praat

(Screenshot: a spectrogram with the 1st, 2nd, 3rd, 4th and 5th formant tracks highlighted.)

Other features with Praat

• Intensity;
• Mel-Frequency Cepstrum Coefficients (MFCC);
• Linear Predictive Coefficients (LPC);
• Harmonic-to-Noise Ratio (HNR);
• and many others.

Scripting in Praat

• Praat can run scripts containing all the commands available in its environment, performing the corresponding operations.

Here is an example that performs a pitch listing and saves it to a text file:

fileName$ = "test.wav"
Read from file... 'fileName$'
name$ = fileName$ - ".wav"
select Sound 'name$'
To Pitch (ac)... 0.0 50.0 15 off 0.1 0.60 0.01 0.35 0.14 500.0
numFrame = Get number of frames
for i to numFrame
    time = Get time from frame number... i
    value = Get value in frame... i Hertz
    if value = undefined
        value = 0
    endif
    path$ = name$ + "_pitch.txt"
    fileappend 'path$' 'time' 'value' 'newline$'
endfor
select Pitch 'name$'
Remove
select Sound 'name$'
Remove

Homework

• Exercise 1) Consider a speech signal containing silence, unvoiced and voiced regions, as shown in the figure below, and write a Matlab function (or in whatever language you prefer) capable of identifying these sections.

(Figure: a speech waveform with the silence, unvoiced and voiced regions labelled.)

• Exercise 2) Then, in the voiced regions, identify the fundamental frequency, the so-called pitch.

Please, try this at home!!

References and further reading

• Signal Processing
  ▫ http://deecom19.poliba.it/dsp/Teoria_dei_Segnali.pdf (Italian)
• WAV
  ▫ https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
  ▫ http://www.onicos.com/staff/iz/formats/wav.html
• MATLAB
  ▫ http://www.mathworks.com/products/signal/
  ▫ http://www.mathworks.com/products/dsp-system/
  ▫ http://homepages.udayton.edu/~hardierc/ece203/sound.htm
  ▫ http://www.utdallas.edu/~assmann/hcs7367/classnotes.html
• FFmpeg
  ▫ https://www.ffmpeg.org/
  ▫ https://trac.ffmpeg.org/wiki/CompilationGuide/Ubuntu
• Praat
  ▫ http://www.fon.hum.uva.nl/praat/
  ▫ http://www.fon.hum.uva.nl/david/sspbook/sspbook.pdf
  ▫ http://www.fon.hum.uva.nl/praat/manual/Scripting.html
• Source code
  ▫ https://github.com/angelosalatino/AudioSignalProcessing