Introduction to Audio Signal Processing
Human-Computer Interaction
Angelo Antonio Salatino
[email protected]
http://infernusweb.altervista.org

License
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Overview
• Audio Signal Processing;
• Waveform Audio File Format;
• FFmpeg;
• Audio Processing with Matlab;
• Doing phonetics with Praat;
• Last but not least: Homework.

Audio Signal Processing
• Audio signal processing is an engineering field that focuses on computational methods for intentionally altering auditory signals, or sounds, in order to achieve a particular goal.
• In block-diagram terms: Input Signal → Audio Signal Processing → Output Signal, i.e. data with meaning.

Audio Processing in HCI
Some HCI applications involving audio signal processing are:
• Speech Emotion Recognition;
• Speaker Recognition:
  ▫ Speaker Verification;
  ▫ Speaker Identification;
• Voice Commands;
• Speech to Text;
• Etc.

Audio Signals
Audio signals can be represented in either digital or analog format.
• Digital: the pressure waveform is a sequence of symbols, usually binary numbers.
• Analog: a smooth wave of energy represented by a continuous stream of data.

Analog to Digital Converter (ADC)
Don't worry, it's only a quick review! An ADC converts an analog signal into a digital one in stages; the sampling frequency and the number of bits per sample must be defined.

Analog Signal (continuous in time, continuous in amplitude)
→ Sample & Hold (discrete in time, continuous in amplitude)
→ Quantization (discrete in time, discrete in amplitude)
→ Encoding (discrete in time, discrete in amplitude)
→ Digital Signal

• For each measurement a number is assigned according to its amplitude.
• The sampling frequency and the number of bits used to represent a sample can be considered the main features of a digital signal.
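A minimal C++ sketch of these stages, assuming an illustrative 1 kHz test tone, a 16 kHz sampling frequency and 16 bits per sample (the values and names below are example choices, not taken from the slides):

// Sample an "analog" 1 kHz sine and quantize it to 16-bit codes.
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const double sampleRate    = 16000.0;  // sampling frequency: must be defined
    const int    bitsPerSample = 16;       // number of bits per sample: must be defined
    const double toneHz        = 1000.0;   // example input tone
    const double durationSec   = 0.01;
    const double pi            = 3.14159265358979;
    const double maxCode       = (1 << (bitsPerSample - 1)) - 1;  // 32767 for 16 bits

    std::vector<int16_t> digital;
    const int numSamples = static_cast<int>(durationSec * sampleRate);
    for (int n = 0; n < numSamples; ++n) {
        double t = n / sampleRate;                            // sample & hold: discrete in time
        double amplitude = std::sin(2.0 * pi * toneHz * t);   // still continuous in amplitude
        int16_t code = static_cast<int16_t>(std::lround(amplitude * maxCode)); // quantization + encoding
        digital.push_back(code);
    }
    std::cout << digital.size() << " samples of " << bitsPerSample << " bits each" << std::endl;
    return 0;
}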
How are these digital signals stored?

Waveform Audio File Format (WAV)
The WAV file is an instance of the Resource Interchange File Format (RIFF), defined by IBM and Microsoft. RIFF is a generic file container format for storing data in tagged chunks (basic building blocks). It is a file structure that defines a class of more specific file formats, such as wav, avi, rmi, etc.

A canonical WAV header has the following layout:

Byte Offset  Endianness  Field Name      Field Size (bytes)  Part of header
0            Big         ChunkID         4                   RIFF Chunk Descriptor
4            Little      ChunkSize       4                   RIFF Chunk Descriptor
8            Big         Format          4                   RIFF Chunk Descriptor
12           Big         SubChunk1ID     4                   Format SubChunk
16           Little      SubChunk1Size   4                   Format SubChunk
20           Little      AudioFormat     2                   Format SubChunk
22           Little      NumChannels     2                   Format SubChunk
24           Little      SampleRate      4                   Format SubChunk
28           Little      ByteRate        4                   Format SubChunk
32           Little      BlockAlign      2                   Format SubChunk
34           Little      BitsPerSample   2                   Format SubChunk
36           Big         SubChunk2ID     4                   Data SubChunk
40           Little      SubChunk2Size   4                   Data SubChunk
44           Little      Data            SubChunk2Size       Data SubChunk

• ChunkID: contains the letters «RIFF» in ASCII form (0x52494646, big-endian form).
• ChunkSize: the size of the rest of the chunk following this number, i.e. the size of the entire file in bytes minus 8 for the two fields not included: ChunkID and ChunkSize.
• Format: contains the letters «WAVE» in ASCII form (0x57415645, big-endian form).
• SubChunk1ID: contains the letters «fmt » in ASCII form (0x666d7420, big-endian form).
• SubChunk1Size: 16 for PCM. This is the size of the SubChunk that follows this number.
• AudioFormat: format code or compression type: PCM = 0x0001 (linear quantization, uncompressed), IEEE_FLOAT = 0x0003, Microsoft_ALAW = 0x0006, Microsoft_MULAW = 0x0007, IBM_ADPCM = 0x0103, …
• NumChannels: Mono = 1, Stereo = 2, etc. Note: channels are interleaved.
• SampleRate: sampling frequency: 8000, 16000, 44100, etc.
• ByteRate: average bytes per second; it is typically determined by Equation 1.
• BlockAlign: the number of bytes for one sample including all channels; it is determined by Equation 2.
• BitsPerSample: 8 bits = 8, 16 bits = 16, etc.
• SubChunk2ID: contains the letters «data» in ASCII form (0x64617461, big-endian form).
• SubChunk2Size: the number of bytes in the Data field. If AudioFormat = PCM, you can compute the number of samples from it (Equation 3).

1) ByteRate = SampleRate · NumChannels · BitsPerSample / 8
2) BlockAlign = NumChannels · BitsPerSample / 8
3) NumOfSamples = (8 · SubChunk2Size) / (NumChannels · BitsPerSample)

Example of wave header

RIFF Chunk Descriptor and Fmt SubChunk:
52 49 46 46 16 02 01 00 57 41 56 45 66 6d 74 20 10 00 00 00 01 00 01 00
ChunkID = «RIFF», ChunkSize = 66070, Format = «WAVE», SubChunk1ID = «fmt », SubChunk1Size = 16, AudioFormat = 1 (PCM), NumChannels = 1.

Fmt SubChunk (cont.) and Data SubChunk:
80 3e 00 00 00 7d 00 00 02 00 10 00 64 61 74 61 f2 01 01 00 …
SampleRate = 16000, ByteRate = 32000, BlockAlign = 2, BitsPerSample = 16, SubChunk2ID = «data», SubChunk2Size = 66034, followed by the sample data.
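To tie the equations to this example header, here is a minimal sketch (not from the slides) that plugs in the values decoded above and reproduces the derived fields:

// Apply Equations 1-3 to the example header: 16 kHz, mono, 16-bit PCM, SubChunk2Size = 66034.
#include <iostream>

int main() {
    const long sampleRate    = 16000;
    const long numChannels   = 1;
    const long bitsPerSample = 16;
    const long subChunk2Size = 66034;

    long byteRate     = sampleRate * numChannels * bitsPerSample / 8;       // Equation 1 -> 32000
    long blockAlign   = numChannels * bitsPerSample / 8;                    // Equation 2 -> 2
    long numOfSamples = 8 * subChunk2Size / (numChannels * bitsPerSample);  // Equation 3 -> 33017

    std::cout << "ByteRate = "     << byteRate     << "\n"
              << "BlockAlign = "   << blockAlign   << "\n"
              << "NumOfSamples = " << numOfSamples << std::endl;
    return 0;
}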
Exercise
For the next 15 minutes, write a C/C++ program that takes a wav file as input and prints the following values on standard output:
• Header size;
• Sample rate;
• Bits per sample;
• Number of channels;
• Number of samples.
Good work!

Solution

typedef struct header_file {
    char chunk_id[4];
    int chunk_size;
    char format[4];
    char subchunk1_id[4];
    int subchunk1_size;
    short int audio_format;
    short int num_channels;
    int sample_rate;
    int byte_rate;
    short int block_align;
    short int bits_per_sample;
    char subchunk2_id[4];
    int subchunk2_size;
} header;

/************** Inside main() **************/
header* meta = new header;                 // holds the 44-byte canonical header
ifstream infile;
infile.exceptions(ifstream::eofbit | ifstream::failbit | ifstream::badbit);
infile.open("foo.wav", ios::in | ios::binary);
infile.read((char*)meta, sizeof(header));  // read the header straight into the struct

cout << " Header size: " << sizeof(*meta) << " bytes" << endl;
cout << " Sample rate: " << meta->sample_rate << " Hz" << endl;
cout << " Bits per sample: " << meta->bits_per_sample << " bit" << endl;
cout << " Number of channels: " << meta->num_channels << endl;

long numOfSample = (meta->subchunk2_size / meta->num_channels) / (meta->bits_per_sample / 8);
cout << " Number of samples: " << numOfSample << endl;

However, this solution contains an error. Can you spot it?

What about reading samples?

short int* pU = NULL;
unsigned char* pC = NULL;

gWavDataIn = new double*[meta->num_channels];   // per-channel sample values
for (int i = 0; i < meta->num_channels; i++)
    gWavDataIn[i] = new double[numOfSample];

wBuffer = new char[meta->subchunk2_size];       // raw bytes of the Data SubChunk
infile.read(wBuffer, meta->subchunk2_size);     // fill the buffer from the file

/* data conversion: from bytes to samples (channels are interleaved) */
if (meta->bits_per_sample == 16) {
    pU = (short*) wBuffer;
    for (int i = 0; i < numOfSample; i++)
        for (int j = 0; j < meta->num_channels; j++)
            gWavDataIn[j][i] = (double) pU[i * meta->num_channels + j];
}
else if (meta->bits_per_sample == 8) {
    pC = (unsigned char*) wBuffer;
    for (int i = 0; i < numOfSample; i++)
        for (int j = 0; j < meta->num_channels; j++)
            gWavDataIn[j][i] = (double) pC[i * meta->num_channels + j];
}
else {
    printERR("Unhandled case");
}

This solution is available at: https://github.com/angelosalatino/AudioSignalProcessing

A better solution: FFmpeg
What FFmpeg says about itself:
• "FFmpeg is the leading multimedia framework, able to decode, encode, transcode, mux, demux, stream, filter and play pretty much anything that humans and machines have created. It supports the most obscure ancient formats up to the cutting edge. No matter if they were designed by some standards committee, the community or a corporation."

Why is FFmpeg better?
• Off-the-shelf;
• Open source;
• We can read samples from many different formats: wav, mp3, aac, flac and so on (see the example command below);
• The code is always the same for every input format.
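For instance, decoding almost any input to uncompressed PCM WAV that the code above can read takes a single command. This is only an illustrative invocation (the file names and parameter values are example choices, not from the slides):

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav

Here -i names the input file, -ar 16000 resamples to 16 kHz, -ac 1 downmixes to mono, and -c:a pcm_s16le selects signed 16-bit little-endian PCM; the same command works unchanged for an aac or flac input, only the input file name differs.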