THE PENNSYLVANIA STATE UNIVERSITY
SCHREYER HONORS COLLEGE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

AUDIOSQUARE: AN OPEN-SOURCE AUDIO VISUALIZATION TOOL

APOORVA SHASTRI
SPRING 2020

A thesis submitted in partial fulfillment of the requirements for a baccalaureate degree in Computer Science with honors in Computer Science

Reviewed and approved* by the following:

Kamesh Madduri
Associate Professor of Computer Science and Engineering
Thesis Supervisor

John Hannan
Associate Professor of Computer Science and Engineering
Honors Adviser

* Signatures are on file in the Schreyer Honors College.

Abstract

The goal of this thesis is to translate audio into an aesthetically pleasing visualization that can also be used for automated classification. We visually depict various properties of an audio stream in three dimensions, and the visualization is generated dynamically as the audio file is played. The final output can then be used to analyze patterns. In addition to being an open-source artifact, distinguishing features of the software include the use of mathematically grounded algorithms, support for user interaction, a final static visualization of the entire audio piece, and analysis of multiple audio files.

Table of Contents

List of Figures
List of Tables
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Sound as Data
  1.3 Fast Fourier Transform
2 AudioSquare Design
  2.1 Prior Work
  2.2 High-level Design Overview
3 Implementation
  3.1 Project Description
  3.2 Tools and Libraries Used
  3.3 Implementation Details
4 Evaluation
  4.1 Visualization and Interpretation
  4.2 Challenges and Current Limitations
  4.3 Future Work Directions
5 Conclusion
Bibliography

List of Figures

1 Visual representation of the relationship between the time and frequency domains.
2 Sample moments in Mathew Preziott's PartyMode visualizers.
3 Sample moment in Malinowski's Music Animation Machine visualization.
4 Spiral path top view for the first 21 sample squares of a visualization.
5 Outline of the implementation procedure.
6 Vertices and faces properties for GLMeshItem when the grid width is 6.
7 Audio visualization produced for Hello by Adele.
8 Audio visualizations for Isn't She Lovely by Stevie Wonder: guitar cover by Sungha Jung [1] and jazz piano cover by Yohan Kim [2].
9 Audio visualizations for generic pop songs: Shape of You by Ed Sheeran [3] (left) and Stupid Love by Lady Gaga [4] (middle). Audio visualization for an EDM song: Summer Days (feat. Macklemore and Patrick Stump) by Martin Garrix [5] (right).
10 Audio visualizations for sentimental ballads: If I Were a Boy by Beyonce [6] (left), Hello by Adele [7] (middle), and Total Eclipse of the Heart by Bonnie Tyler [8] (right).
11 Audio visualizations for reggae songs: Three Little Birds by Bob Marley [9] (left) and Is This Love by Bob Marley [10] (right).

List of Tables

1 Other notable prior work found in audio visualization.
Acknowledgements

First and foremost, I would like to thank my thesis supervisor, Dr. Madduri, whose guidance made this entire project possible. I am truly grateful for his kind help and patience, which carried me through this experience. I would also like to acknowledge my friends, family, and honors adviser, Dr. Hannan, for their continued support during my four years with the Schreyer Honors College.

Chapter 1
Introduction

The objective of this project is to develop aesthetically pleasing computer visualizations of audible signals; we primarily consider musical compositions and songs. Music visualizations are typically generated and rendered in real time to complement the audio, and they are a common feature in media player software for enhancing the listening experience. We develop a new open-source visualization tool called AudioSquare [11]. The tool aims to capture several facets of a musical composition and uses a mathematical procedure to generate the visualization. Its features include a three-dimensional visualization, the ability to generate a final static image of the composition, and support for user interaction. Additionally, the final output images, or even the intermediate data generated, could be used to automatically identify the music genre. We interpret visualizations of several songs.

This thesis is organized as follows. We first discuss in Section 1.1 our primary motivation for audio visualization. Next, we provide relevant background information about audio signal processing in Sections 1.2 and 1.3. In Chapter 2, we present our design methodology after introducing prior work. Next, in Chapter 3, we give implementation details, discussing how we use existing software to build our new tool. Finally, in Chapter 4, we evaluate our software on a large collection of songs, analyze the results, and present ideas to further improve the visualizations and the tool's genre classification capabilities.

1.1 Motivation

This work is primarily inspired by our interest in the information that audio signals convey. A major research focus of the late Penn State professor Mark Ballora was expanding the capabilities and practice of data sonification: the use of non-speech audio to convey information [12]. In the TED Talk [13] titled "Opening Your Ears to Data," Ballora discusses how patterns can sometimes be identified more easily through alternative mediums. For example, Ballora mentions a software tool that expresses astrophysics data sets using audio. The software was designed so that a blind researcher could more easily study the data sets; however, colleagues without visual impairments also began to use it, because their ears picked up on patterns that their eyes could not. Conversely, there may be patterns in audio signals that our ears cannot easily detect, and a visual representation of the audio might reveal such hidden patterns. Thus, we aimed to create an audio visualization that can also be used to classify genre.

Another motivation for this project is machine learning-based music recommendation. Generally, we evaluate audio by what we hear, and our interests are subjective. However, an audio file can also be considered a time-varying dataset. Music is parsed and analyzed by music streaming companies such as Spotify to identify factors such as tempo, acoustics, and danceability [14].
When combined with other machine learning techniques, patterns found in a user's preferred songs can be used to better recommend music in the future. The visualizations generated through this work could similarly be used as features in music recommender systems.

1.2 Sound as Data

In this work, we primarily consider generating visualizations of music stored on computers in the Waveform Audio File (WAV) format. A WAV file is a digital encoding of an audio stream; other popular encoding formats include MPEG Audio Layer III (MP3) and Advanced Audio Coding (AAC). But in order to parse the audio data of a musical composition, we must first understand the properties of sound itself.

Audio signals are created from sound. Sound can be defined as the oscillation of air particle displacement produced by a source. The shape of these vibrations is what makes each sound unique, and thus the vibrations have several important characteristics.

Most importantly, the strength at which these vibrations occur is referred to as a signal's amplitude. Amplitude is a measurement of the oscillating displacement created by a sound wave [15]. This measurement corresponds to the vibration energy at each moment and is commonly expressed in decibels. The amplitude readings form the waveform data of an audio file and are strongly associated with what our ears perceive as the audio's loudness at any given moment. A device recording sound simply picks up this air pressure displacement, or amplitude, at different points in time to digitally represent an audio signal.

The standard procedure for obtaining digital audio information is pulse code modulation [16]. This involves a measurement process, known as sampling, in which amplitude values are recorded at regular intervals of time. These recordings are kept in sequential order and are often stored as the elements of a one-dimensional list.

The number of amplitude measurements taken each second during recording determines the sampling rate, measured in Hertz. For example, a sampling rate of 20 kHz means that 20,000 amplitude measurements are captured every second. The sampling rate in turn determines the frequency range of the given audio. Frequency is the speed of the waveform vibration and is what our ears perceive as pitch [17]; a frequency of 1 Hz means one wave cycle per second. The maximum frequency that can be represented in an audio file is half its sampling rate (the Nyquist frequency). Since 20 kHz is roughly the highest frequency audible to humans [18], this range of hearing and more is covered by the standard sampling rate of 44.1 kHz used for most distributed audio material.
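To make these properties concrete, the following is a minimal sketch of reading a PCM WAV file in Python and inspecting the quantities discussed above. It uses the standard-library wave module together with NumPy; the file name example.wav and the assumption of 16-bit samples are placeholders for illustration, not details taken from AudioSquare itself.

    import wave
    import numpy as np

    # Open a WAV file and read its basic sampling properties.
    # "example.wav" is a placeholder file name.
    with wave.open("example.wav", "rb") as wav_file:
        sample_rate = wav_file.getframerate()  # amplitude measurements per second (Hz)
        n_frames = wav_file.getnframes()       # total samples per channel
        n_channels = wav_file.getnchannels()   # 1 = mono, 2 = stereo

        # The amplitude readings arrive as raw bytes in sequential order;
        # assuming 16-bit PCM, we unpack them into a one-dimensional array.
        raw_bytes = wav_file.readframes(n_frames)
        samples = np.frombuffer(raw_bytes, dtype=np.int16)

    duration = n_frames / sample_rate  # length of the audio in seconds
    nyquist = sample_rate / 2          # highest representable frequency

    print(f"Sampling rate: {sample_rate} Hz")
    print(f"Channels: {n_channels}, duration: {duration:.2f} s")
    print(f"Maximum representable frequency: {nyquist:.1f} Hz")

For a standard 44.1 kHz file, this reports a maximum representable frequency of 22050.0 Hz, comfortably above the roughly 20 kHz upper limit of human hearing.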