THE PENNSYLVANIA STATE UNIVERSITY SCHREYER HONORS COLLEGE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

AUDIOSQUARE: AN OPEN-SOURCE AUDIO VISUALIZATION TOOL

APOORVA SHASTRI SPRING 2020

A thesis submitted in partial fulfillment of the requirements for a baccalaureate degree in Computer Science with honors in Computer Science

Reviewed and approved* by the following:

Kamesh Madduri Associate Professor of Computer Science and Engineering Thesis Supervisor

John Hannan Associate Professor of Computer Science and Engineering Honors Adviser

* Signatures are on file in the Schreyer Honors College.

Abstract

The goal of this thesis is to translate audio into an aesthetically pleasing visualization that can be used for automated classification. We visually depict various properties of an audio stream in three dimensions. The visualization is dynamically generated as the audio file is played. The final output can then be used to analyze patterns. In addition to being an open-source artifact, other distinguishing features of the created tool include the use of a mathematically grounded generation procedure, support for user interaction, a final static visualization of the entire audio piece, and analysis of multiple audio files.

Table of Contents

List of Figures

List of Tables

Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Sound as Data
  1.3 Fast Fourier Transform

2 AudioSquare Design
  2.1 Prior Work
  2.2 High-level Design Overview

3 Implementation
  3.1 Project Description
  3.2 Tools and Libraries Used
  3.3 Implementation Details

4 Evaluation
  4.1 Visualization and Interpretation
  4.2 Challenges and Current Limitations
  4.3 Future Work Directions

5 Conclusion

Bibliography

List of Figures

1 Visual representation of relationship between time and frequency domain.
2 Sample moments in Mathew Preziotte's PartyMode visualizers.
3 Sample moment in Malinowski's Music Animation Machine visualization.
4 Spiral path top view for the first 21 sample squares of a visualization.
5 Outline of implementation procedure.
6 Vertices and faces properties for GLMeshItem when grid width is 6.
7 Audio visualization produced for Hello by Adele.
8 Audio visualizations for Isn't She Lovely by Stevie Wonder: Guitar cover by Sungha Jung [1] and Jazz Piano cover by Yohan Kim [2].
9 Audio visualizations for generic pop songs: Shape of You by Ed Sheeran [3] (left), Stupid Love by Lady Gaga [4] (middle). Audio visualization for an EDM song: Summer Days (feat. Macklemore and Patrick Stump) by Martin Garrix [5] (right).
10 Audio visualizations for sentimental ballads: If I Were a Boy by Beyoncé [6] (left), Hello by Adele [7] (middle), Total Eclipse of the Heart by Bonnie Tyler [8] (right).
11 Audio visualizations for reggae songs: Three Little Birds by Bob Marley [9] (left), Is This Love by Bob Marley [10] (right).

List of Tables

1 Other notable prior work found in audio visualization.

Acknowledgements

First and foremost, I would like to thank my thesis supervisor, Dr. Madduri, whose guidance made this entire project possible. I am truly grateful for his kind help and patience that carried me through this experience. I would also like to acknowledge my friends, family, and honors advisor,

Dr. Hannan, for their continued support during my four years with the Schreyer Honors College.

Chapter 1

Introduction

The objective of this project is to develop aesthetically pleasing computer visualizations of

audible signals. We primarily consider musical compositions and songs. Music visualizations are

typically generated and rendered in real time to complement the audio. Visualizations are a common

feature in media player software to enhance the listening experience.

We develop a new open-source visualization tool called AudioSquare [11]. The tool aims to

capture several facets of a musical composition and uses a mathematical procedure to generate the visualization. The features include a three-dimensional visualization, the ability to generate a final

static image of the composition, and support for user interaction. Additionally, the final output

images, or even the intermediate data generated, could be used to automatically identify the music

genre. We interpret visualizations of several songs.

This thesis is organized as follows. We first discuss in Section 1.1 our primary motivation for

audio visualization. Next, we provide relevant background information about audio signal process-

ing in Sections 1.2 and 1.3. In Chapter 2, we present our design methodology after introducing prior work. Next, in Chapter 3, we give implementation details, discussing how we use existing

software to build our new tool. Finally, in Chapter 4, we evaluate our software on a large collec-

tion of songs, analyze results, and present ideas to further improve the visualizations and the tool’s

genre classification capabilities.

1.1 Motivation

This work is primarily inspired by our interest in the information that audio signals convey. A

major research focus of the late Penn State professor Mark Ballora was expanding the capabilities

and the practice of sonification of data. Sonification is the use of non-speech audio to convey in-

formation [12]. In the TED Talk [13] titled “Opening Your Ears to Data,” Ballora discusses how

patterns can sometimes be identified more easily through alternative mediums. For example, Bal-

lora mentions the design of a software tool that expresses astrophysics data sets using audio. This

software was designed so that a blind researcher could more easily study the data sets. However,

colleagues without visual impairments also began to use this software, because their ears were able

to pick up on patterns that their eyes were unable to. Conversely, there may be patterns in audio

signals that our ears cannot easily detect. A visual representation of the audio might reveal such

hidden patterns. Thus, we aimed to create an audio visualization that can be used to classify genre

as well.

Another motivation for this project is audio-based music recommendation. Generally,

we evaluate audio by what we hear, and our interests might be subjective. However, an

audio file can be considered a time-varying dataset. Music is parsed and analyzed by music

streaming companies such as Spotify to identify factors such as tempo, danceability, etc. [14].

When combined with other machine learning techniques, patterns found in a user’s preferred songs

are used to better recommend music in the future. The visualizations generated through this work

could be used as features in music recommender systems.

1.2 Sound as Data

In this work, we primarily consider generating visualizations of music stored on disk in

the Waveform Audio File (WAV) format. A WAV file is a digital encoding of an audio stream. Other

popular digital encoding formats include the MPEG Audio Layer-III (MP3) and Advanced Audio

Coding (AAC) formats. But in order to parse the audio data of a musical composition, we must

first understand the properties of the sound itself.

Audio signals are created from sound. Sound can be defined as the oscillation of air particle

displacement produced by a source. The shape of these vibrations is what makes each sound

unique, and thus these vibrations have several important characteristics.

Most importantly, the strength at which these vibrations occur is referred to as a signal’s am-

plitude. Amplitude is a measurement of the oscillating displacement created by a sound wave [15].

This measurement corresponds to the vibration energy at each moment and is thus measured in the

unit of decibels. The amplitude readings form the waveform data of an audio file and are strongly

associated with what our ears perceive as the audio’s loudness at any given moment as well.

A device recording sound simply picks up on this air pressure displacement, or amplitude, at

different points in time to digitally represent an audio signal input. The standard representation

procedure to obtain digital audio information is via pulse code modulation [16]. This involves a

measurement process, known as sampling, where amplitude values are recorded at regular intervals

of time. These recordings are kept in sequential order and are often stored as the elements of a one-

dimensional list.

While recording, the number of amplitude measurements each second determines the sampling

rate, measured in Hertz. For example, a sampling rate of 20 kHz would mean 20,000 amplitude

measurements were captured every second. This sampling rate also then determines the frequency

range for the given audio.

Frequency is the speed of the waveform vibration and what our ears perceive as pitch [17]. A

frequency of 1 Hz means one wave cycle per second. The maximum frequency that can be read

from an audio file is equal to half its sampling rate. Since roughly 20 kHz is the highest frequency audible

by humans [18], this entire range of hearing is covered by the standard sampling rate of 44.1

kHz used for most distributed audio material.
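
As a concrete illustration of these concepts, the short Python sketch below uses the standard-library wave module and NumPy (both used later in this work) to read a WAV file and report its sampling rate, duration, and highest representable frequency. The file name audio.wav and the assumption of 16-bit samples are placeholders for illustration only.

    import wave
    import numpy as np

    with wave.open("audio.wav", "rb") as wf:       # placeholder file name
        rate = wf.getframerate()                   # samples captured per second (Hz)
        n_frames = wf.getnframes()                 # total samples per channel
        raw = wf.readframes(n_frames)              # raw PCM bytes

    # Interpret the PCM bytes as a one-dimensional sequence of amplitudes.
    # 16-bit samples are assumed here; stereo files interleave their channels.
    samples = np.frombuffer(raw, dtype=np.int16)

    print("sampling rate:", rate, "Hz")
    print("duration:", n_frames / rate, "seconds")
    print("highest representable frequency:", rate / 2, "Hz")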

The concepts of time domain and frequency domain are also related to the frequency, amplitude, and the sampling rate. Amplitudes are captured over a time period, and they form the time domain representation of the signal. From this, the frequency domain representation can be produced. As seen in Figure 1, the frequency domain representation plots the amplitude, or strength, of each

frequency present in a given time frame of the signal.

Figure 1: Visual representation of relationship between time and frequency domain [19].

1.3 Fast Fourier Transform

In order to go from the aforementioned time domain representation to a representation in the

frequency domain, we must decompose the discrete sequence of the audio wave and identify the

dominant frequencies. This can be done by applying the computational procedure known as the

Fourier Transform.

The values from the original time domain sequence 푇 are used to calculate the 푘th value of the

frequency domain sequence, represented by 퐹푘. Their mathematical relationship is shown in the

equation below:

    F_k = \sum_{x=0}^{N-1} T_x \, e^{-i 2 \pi k x / N}

To obtain the entire frequency domain sequence, this arithmetic is performed on integer values

of 푘 from 0 to 푁 − 1, where 푁 is the length of the waveform input. This calculation is known as

the Discrete Fourier Transform, or DFT [20]. Computing 푁 summation terms for each of the 푁 values of 푘 results in a running time that is quadratic in 푁 (denoted mathematically as Θ(푁²)).
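
The quadratic cost is easy to see in a direct translation of the definition. The following sketch (illustrative only, and not how the tool computes frequencies) evaluates each output value by summing all 푁 terms explicitly, then checks the result against NumPy's FFT.

    import numpy as np

    def naive_dft(t):
        """Direct evaluation of F_k = sum_x T_x * exp(-i*2*pi*k*x/N); O(N^2) work."""
        n = len(t)
        f = np.zeros(n, dtype=complex)
        for k in range(n):              # N output values ...
            for x in range(n):          # ... each needing N summation terms
                f[k] += t[x] * np.exp(-2j * np.pi * k * x / n)
        return f

    # Agrees with NumPy's optimized FFT up to floating-point error.
    t = np.random.rand(64)
    assert np.allclose(naive_dft(t), np.fft.fft(t))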

This computation time, however, can be reduced. A fast Fourier transform [21, 22], or FFT, is

an efficient algorithm that reduces this computation to Θ(푁 log 푁) time instead.

One specific variant of the FFT used in this work is the decimation-in-time FFT [23]. This

procedure exploits the periodicity and complex conjugate symmetry of the summation term

in order to apply a divide-and-conquer approach. Until 푁 equals one, the DFT for a sequence of

length 푁 is written as the sum of two DFTs performed on sequences of length 푁/2 each. Thus,

only on the order of 푁 log 푁 operations are necessary overall for large values of 푁. However, since there are

many popular libraries to perform the FFT calculation, knowing the details of its implementation is not necessary to understand its application.

Given that the input to our fast Fourier transform will be a one-dimensional sequence of real

amplitude values, we apply a one-dimensional fast Fourier transform. The algorithm will return a

sequence of complex-valued amplitudes associated with the frequencies found in the waveform input.

This output length will be equivalent to the length of the input sequence [24]. The values in the first half of this output list are the positive frequency terms, where the amplitude weights correspond to evenly distributed frequencies ranging from zero to half the sampling rate. Due to the symmetry of the returned Fourier transform signal, values in the second half of the sequence are simply complex conjugates of the first half, representing negative frequency terms. For example, if a waveform oscillates in the shape of a sine wave at a constant speed, there will be peaks in the Fourier transform signal at the frequency representing this speed and again at that frequency's negative counterpart. By applying a real fast Fourier transform instead, we can eliminate the redundant second half and keep only the positive frequency terms of the first half.
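
A small NumPy example illustrates this behavior; the 440 Hz test tone and the sampling rate below are chosen arbitrarily for the illustration.

    import numpy as np

    rate = 44100                                    # assumed sampling rate (Hz)
    t = np.arange(rate) / rate                      # one second of time stamps
    signal = np.sin(2 * np.pi * 440.0 * t)          # a pure 440 Hz sine wave

    spectrum = np.fft.rfft(signal)                  # real FFT: redundant half dropped
    freqs = np.fft.rfftfreq(len(signal), d=1/rate)  # frequency (Hz) of each output bin

    print(len(signal), len(spectrum))               # N inputs -> N//2 + 1 outputs
    print(freqs[-1])                                # last bin is half the sampling rate
    print(freqs[np.argmax(np.abs(spectrum))])       # peak lands at 440 Hz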

Chapter 2

AudioSquare Design

In this chapter, we give an overview of relevant prior work and then give high-level details of

our design.

2.1 Prior Work

In order to construct a design for our project, we first looked to existing audio visualization work for inspiration. Exploring these different possibilities helped us decide what

features we wished to capture in our project.

A key goal for our project was to be able to classify songs by their genres. To do this, we would

need to somehow analyze the final product of a generated visualization. NotesArt Studio [25] by

Tim Davis is a notable work that produces impressive three-dimensional graphic representations for songs. These visualizations are dynamically generated. However, by showcasing the visualiza- tions in an art gallery of images, the emphasis is placed on their final static form. Davis describes the algorithms and dynamic generation as an artistic process used to create the final piece of art.

Though this project is a good example of creating an aesthetically pleasing final result, the details of the generation process are not disclosed. Thus, it is not possible to reproduce these visualizations.

We encourage the reader to visit Davis’ website in order to see the visualizations.

Next, we searched for projects that could be of help or inspiration. We noticed that

many rely on waveform amplitudes alone to create a strong dynamic representation of the audible

sound. For example, strong beat or percussion sounds would be clearly visible as higher spikes

in amplitude at those points. This can be seen in Figure 2 by the different samples produced by

Mathew Preziott’s open-source project for browser-based visualizations. Preziott creates multiple visualizers using the Javascript D3 to represent the waveform data read in each sample

taken [26]. Rather than applying complex mathematical algorithms to extract salient features of a

song, the project processes the audio on a sample-by-sample basis and focuses on manipulating the Scalable

Vector Graphics (SVG) output to create unique and satisfying dynamic results. Similar to how these visualizations appear to expand at samples containing high-amplitude signals, we too, in our project,

chose to emphasize higher amplitude points heard. However, like other open-source software we

explored, there was no static representation produced for the song in Preziotte's visualizations.

Figure 2: Sample moments in Mathew Preziotte's PartyMode visualizers.

Color is another key attribute of a visual depiction. In Stephen Malinowski's Music Animation

Machine visualization [27] of a piano piece, lower-tone bass clef notes are shown in blue

and the higher-tone treble clef melody is shown in red. An example visualization is shown in Figure 3.

From this work, we gained the inspiration to represent pitch using color.

Thus, we began designing our project with amplitude and frequency as the main data to repre-

sent dynamically in the samples read. Choosing to use software that would keep our project

open-source was also a priority. As an open-source project, our work would be able to stand out

Figure 3: Sample moment in Malinowski’s Music Animation Machine visualization.

by generating a strong static visualization when the software completes execution. Ultimately, our

project was able not only to produce a static image at the end, but also to present a final state that could be analyzed

and interacted with in a three-dimensional space as well.

In Table 1, we briefly list and summarize three other visualizations we explored.

Table 1: Other notable prior work found in audio visualization.

Audio Visualizer: Description

GLava [28]: GLava is a general-purpose, open-source, highly configurable OpenGL audio spectrum visualizer for X11.

projectM [29]: projectM is an advanced open-source visualizer that renders visuals by reimplementing Ryan Geiss's Winamp Milkdrop visualization in a more modern, cross-platform, reusable library.

Geometric Models for Musical Audio Data [30]: This work is an interactive web-based application powered by JavaScript and WebGL that generates a "point cloud" visualization from a geometric model and an audio file.

In addition to these software tools, there is significant prior research on data visualization [31,

32, 33], sound visualization [34, 35, 36], music visualization [37, 38, 39, 40], and music genre

classification [41, 42]. The tools discussed in this section, as well as our work, follow best practices and methodologies identified in these research publications.

2.2 High-level Design Overview

After exploring prior audio visualization designs, we decided on the metrics we wished to rep-

resent and the goals to accomplish in our own design. Next, we had to decide how the high-level

design could be, and how we implement various features. This section describes the overall design

and the next chapter details specifics.

A key goal for our visualization is to have an end product that is representative of the entire

piece of audio heard. This generation of the product would need to be done dynamically as each

segment of the audio plays. We could then depict all the segments by visually retaining at least

some of the information from each. To do so, anything generated dynamically had to be made to

fit in a finite amount of space that could be captured at the end.

Thus we had to decide on a manner to progressively add these segments. Somehow, these seg-

ments would need to collectively form something bigger. Since the shape of any window displaying

a visualization is rectangular, we thought that might be a good overall palette shape to have. Ex-

panding or adding segments from the center in some form might be a good way to accomplish this.

By doing so, the audio heard up to any given point is fully seen and represented in the

visualization at that point as well.

Ultimately, the basic scheme of the final design is a dynamically growing spiral path of squares whose path radius increases as the audio plays. The camera that views and shows this growth in a

three dimensional space will dynamically follow the path at the perimeter, starting from an angle

that is level to the surface but approaching the top view seen in Figure 4. The overall surface shape will always be square and therefore easy to see entirely at conclusion from this final top view. This

chosen design will enable us to more easily analyze and compare our final static visualizations with

one another. Once the visualization is complete, the user would also be able to pan and zoom to view any part of the visualization from any distance, creating an interactive static visualization

as well.

Figure 4: Spiral path top view for the first 21 sample squares of a visualization.
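
A coordinate generator along the lines of the sketch below can produce the square spiral of Figure 4: starting at the center, it walks right and up, then left and down, lengthening each run by one square after every second turn. This is only an illustration of the idea; the generator function actually used in the tool (mentioned in Chapter 3) may differ in direction and detail.

    from itertools import islice

    def spiral_coordinates():
        """Yield (x, y) grid positions along an outward square spiral from (0, 0)."""
        x = y = 0
        yield (x, y)
        run = 1
        while True:
            for dx, dy in ((1, 0), (0, 1)):         # walk right, then up
                for _ in range(run):
                    x, y = x + dx, y + dy
                    yield (x, y)
            run += 1
            for dx, dy in ((-1, 0), (0, -1)):       # walk left, then down
                for _ in range(run):
                    x, y = x + dx, y + dy
                    yield (x, y)
            run += 1

    # The first 21 positions, matching the sample squares shown in Figure 4.
    print(list(islice(spiral_coordinates(), 21)))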

Each square in the spiral path would represent a segment, or sample, of the audio. This square visually displays the metrics of frequency and amplitude that were so widely used in prior work. The

square’s color is used to represent frequency; it will be more red when the sample has more higher

pitches present and more blue when more lower pitches are present. The z-position of the square’s vertices, or height of the surface, will then correspond to the amplitude or loudness variation

that sample. The size of each sample square can be modified in the code to be larger but will be a width of two by default. The length of the sample, or information read per sample, will also be a

tunable argument. 12

Chapter 3

Implementation

3.1 Project Description

The AudioSquare tool we developed is available on GitHub (https://github.com/apoorvas47/AudioVisualizer) [11]. Given an audio file in WAV format, our project executes a program

written in the Python programming language that generates a dynamic visualization as the audio

file plays. Once complete, the static visualization may be interacted with in the window according

to PyQtGraph’s specified 3D graphics mouse interactions. PyQtGraph [43] is a graphical user

interface (GUI) and scientific graphics library for Python. The final static visualization will be

captured and output as an image. In addition, the amplitude and frequency information gathered

per sample will be output to a file in Comma Separated Value (CSV) format.

3.2 Tools and Libraries Used

In order to create an aesthetically pleasing final project, we first explored visualization soft- ware to use. A frequent choice for data visualization, plotting, and graphics rendering appeared

to be MATLAB graphics, or the Python equivalent MatPlotLib. However, we discovered PyQt-

Graph, a Python package that would support our specific needs better. PyQtGraph is a graphics

and GUI library built on Qt for Python and NumPy. Since it is a pure-Python library, our project will be portable and able to run on a variety of hardware platforms. The 3D graphics system from

this library, which relies on Python-to-OpenGL bindings, will be used to render our mesh surface visualization. OpenGL is a graphics library that interacts with the GPU to render hardware-accelerated

graphics. From OpenGL's graphics rendering, NumPy's numerical calculations, and

Qt’s graphics view framework, PyQtGraph is able to leverage major speed boosts and rapid plot

updates. Fast display is a major advantage over MatPlotLib that creates a smoother dynamic visualization as we update our view items. PyQtGraph also provides better real-time interactivity, which dramatically reduces the lag when interacting with our final static visualization.

Next, we chose which libraries to use to process the audio. A dependency of PyQtGraph,

NumPy [44], already included a Discrete Fourier Transform package which we could use to calcu-

late frequencies from our waveform. Reading the waveform of an audio file could be accomplished

simply for WAV files by using the builtin wave library. In order to play the audio, however, we

chose to use PyAudio [45], which provides Python bindings for PortAudio, a cross-platform audio

I/O library.
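
The chunked playback loop that PyAudio enables looks roughly like the following sketch, which pairs the built-in wave reader with a PortAudio output stream. The chunk size and file name are placeholders, and AudioSquare also analyzes each chunk rather than only playing it.

    import wave
    import pyaudio

    CHUNK = 1024                                   # frames read and played per update

    wf = wave.open("audio.wav", "rb")              # placeholder file name
    p = pyaudio.PyAudio()
    stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                    channels=wf.getnchannels(),
                    rate=wf.getframerate(),
                    output=True)

    data = wf.readframes(CHUNK)
    while data:
        stream.write(data)                         # play this chunk aloud
        # ... a visualizer would also analyze the same chunk here ...
        data = wf.readframes(CHUNK)

    stream.stop_stream()
    stream.close()
    p.terminate()
    wf.close()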

3.3 Implementation Details

The construction of the visualization begins by initializing our AudioVisualizer custom class

object. This class holds the window containing the mesh visualization. Once we call start() on this

object, it will begin a timer that will dynamically update the visualization while the audio plays.

The timer will stop once the audio has concluded playing. The window will remain open and

can be closed by the user once they are done observing the final static visualization. This overall

procedure and the details of each step are outlined in Figure 5 and described further in subsequent

paragraphs.

Figure 5: Outline of implementation procedure

During the class initialization step, the WAV file’s information is extracted and the window’s

contents are set up. A PyAudio audio stream is created from the WAV file's data in order to play its waveform aloud. A GLViewWidget object will serve as the three dimensional viewing space for

our Qt Application window. This view widget will contain a graphics item, called a GLMeshItem,

that is constructed via triangle faces whose vertices are plotted in a three dimensional space.

A mesh grid surface will be formed by the GLMeshItem to display the visualization. The width

of this grid must be decided at initialization and is determined based on the total number of samples,

or squares, that will be required. Centered at 0 for a both, corresponding size lists for the 푥 and

푦 points to be used will be created. Both lists will be roughly centered at 0 since that represents 15

the center of this view space and the point from which all camera position measurements are taken.

By iterating through all combinations of 푥 and 푦 points, a list of all the vertices on the surface, in

the form (푥, 푦, 푧) where 푧 is always 0, is created. This is the first parameter used to initialize the

mesh item. These vertices must then be used to generate our next parameter: faces. Each face for

a GLMeshItem must be a triangle and thus be composed of three vertices each. As seen by the

numbers highlighted in blue in Figure 6, vertices are referenced for each face by their index from

the vertices list. The sequence that these faces are added in is important to the construction of the

final mesh item list parameter: face colors. The nth color of the face colors list will be the color

applied to the nth triangle face of the faces list. Initially, all the colors are set to black, and

the mesh surface is not seen at all.
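
A simplified sketch of this initialization is shown below. It builds the vertex, face, and face-color arrays in the layout described above and hands them to pyqtgraph's GLMeshItem, whose extra keyword arguments are forwarded to MeshData per the library's documentation; the exact grid-width convention and keyword usage in AudioSquare may differ slightly.

    import numpy as np
    import pyqtgraph.opengl as gl

    def make_grid_mesh(width):
        """Build a flat width x width grid of squares as a GLMeshItem (illustrative)."""
        # Vertex coordinates, roughly centered on 0 so the camera orbits the middle.
        coords = np.arange(width + 1) - width / 2.0
        vertices = np.array([(x, y, 0.0) for y in coords for x in coords])

        # Two triangle faces per grid square; each face lists three vertex indices.
        faces = []
        for row in range(width):
            for col in range(width):
                v0 = row * (width + 1) + col        # lower-left corner of this square
                v1, v2, v3 = v0 + 1, v0 + width + 1, v0 + width + 2
                faces.append((v0, v1, v2))
                faces.append((v1, v3, v2))
        faces = np.array(faces)

        # One RGBA color per face; every face starts out black.
        colors = np.zeros((len(faces), 4))
        colors[:, 3] = 1.0

        return gl.GLMeshItem(vertexes=vertices, faces=faces, faceColors=colors,
                             smooth=False, drawEdges=False)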

Next, the visualization is started by calling start() on the AudioVisualizer object. Doing this will launch the window and start a timer which repeatedly calls an update method. In this update

method, a sample of the audio will be read and played. Based on the sample, the GLMeshItem

parameters and window camera positions will be updated.

For each sample of music played, a square portion of the grid mesh item will be updated, pro-

gressively forming the spiral path shown prior in Figure 4. The first sample square and start of the

spiral path will be at the center of the grid. The location of each subsequent square to update is

given by an added generator function that yields the next coordinate of our path. For each update,

the 푧 values for the vertices of the square and the color for all faces of the sample square will be

modified. Their numerical values will be added as a row to a CSV file of information that can be

parsed later if desired.

The waveform amplitudes read in that update will be represented by the 푧 values for the vertices

of each sample square. Each sample read from the audio file yields a sequence with the same

Figure 6: Vertices and faces properties for GLMeshItem when grid width is 6.

number of frames, or amplitude signal values, every time. Each vertex of the corresponding sample square is

given an amplitude value to be its new 푧 value. The chosen values are evenly spaced so that they

range from the beginning to the end of the sample signal.
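
In code, the selection of evenly spaced amplitudes might look like the sketch below; the vertex count and scale factor are illustrative (a square of the default width two has a three-by-three patch of nine vertices).

    import numpy as np

    def square_heights(sample_amplitudes, n_vertices=9, scale=1.0):
        """Pick evenly spaced amplitude values from a sample as vertex z heights."""
        # Indices spread from the first to the last frame of this sample's signal.
        idx = np.linspace(0, len(sample_amplitudes) - 1, n_vertices).astype(int)
        return scale * sample_amplitudes[idx]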

The color of each sample square will represent the average frequency in each update. The

frequency domain is first calculated for the sample signal using NumPy's real fast Fourier

transform function. From this, the average frequency over that range is calculated. We do this

instead of simply choosing the loudest frequency to account for other prominent frequencies heard.

Though frequencies could range from zero to half of the sampling rate, there is a range that a large

majority of the frequencies will fall into. The color assignment simply follows this normalized

range by a gradient of blue to red. Lower frequency samples are more blue and higher frequency

ones are more red.
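
A sketch of this mapping is given below: the magnitude-weighted average frequency is computed with NumPy's real FFT and then normalized onto a blue-to-red gradient. The normalization bounds are illustrative placeholders for the empirically chosen range mentioned above.

    import numpy as np

    def sample_color(sample_amplitudes, rate, lo=200.0, hi=4000.0):
        """Map a sample's average frequency to an RGBA color on a blue-red gradient."""
        weights = np.abs(np.fft.rfft(sample_amplitudes))
        freqs = np.fft.rfftfreq(len(sample_amplitudes), d=1.0 / rate)

        if weights.sum() == 0:                     # a silent sample stays fully blue
            avg_freq = 0.0
        else:
            avg_freq = np.average(freqs, weights=weights)

        # Normalize into [0, 1] over the range most frequencies fall into.
        t = float(np.clip((avg_freq - lo) / (hi - lo), 0.0, 1.0))
        return (t, 0.0, 1.0 - t, 1.0)              # red grows, blue fades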

The camera position parameters are also updated at each sample read and played. The camera

starts near the center and points toward the center. The angle made with the surface and center point

is then rotated based on the spiral path radius so far. This angle, known as the azimuth parameter,

allows us to follow alongside the path such that newly updated squares stay at approximately the

center of our window vision. As this is done, the distance from the center, another parameter, must

be raised slowly to accommodate the increasing size of the overall square visualization seen. Lastly,

the camera’s angle of elevation, the final parameter, is increased by an exponential growth that ends with the camera directly above the center. This ensures a top view at conclusion for the static image

captured.
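
The camera update can be expressed with GLViewWidget.setCameraPosition, which accepts distance, elevation, and azimuth keyword arguments; the specific rates of change below are placeholders for the tuning described above.

    def update_camera(view, sample_index, total_samples):
        """Rotate, pull back, and raise a GLViewWidget camera as the spiral grows (illustrative)."""
        progress = sample_index / total_samples

        azimuth = (sample_index * 15.0) % 360.0    # follow alongside the spiral path
        distance = 10.0 + 40.0 * progress          # back away as the surface grows
        elevation = 90.0 * progress ** 3           # slow start, ends looking straight down

        view.setCameraPosition(distance=distance, elevation=elevation, azimuth=azimuth)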

Finally, when there are no more samples to be read, the timer will be stopped and a screenshot

of the window will be saved as the static visualization. At this point, the camera positions are no

longer being set by calls to update(), so the user may play around with the visualization within the window as they please and close the application when they are done.

Chapter 4

Evaluation

In this chapter, we analyze visualizations generated using our tool on a collection of songs. We use these visualizations to evaluate the tool. Most of these songs are acquired from YouTube and converted to WAV format for further analysis using browser-based audio file utilities [46, 47].

4.1 Visualization and Interpretation

In Figure 7, we give a sample of the final static visualization produced for the pop ballad hit song Hello by Adele [7]. The further the sample squares are from the center, the closer they are to the end of the track. Higher peaks mean louder points in the song. The more red the color, the higher the frequency at that sample and the more blue the color, the lower the frequency. Here, we can see that the song starts at a generally lower pitch and moves toward a higher one. At the end, as seen by the squares along the border, the song gradually fades out, going from red to purple to the lowest blue pitch.

After generating numerous static visualizations, it became clear that height variations in the final surface, or amplitude progression, would not be visually distinguishable. Thus the colors, or average frequencies, are the only visible static factors. However, the average frequency should be strongly associated with the average amplitude from that sample. For example, completely silent portions will be represented by a blue color.

Figure 7: Audio visualization produced for Hello by Adele.

When multiple sounds are heard, our visualization will treat their combination as one overall sound. This becomes an issue in genre differentiation. For example, the low-pitched bass sounds combined with the high-pitched saxophone sounds of a jazz song could appear identical to the medium-pitched electric guitar sounds of a pop song. In addition, genres are typically classified by audible patterns in the singing and the instrumentals used, but both cannot be described at once in our visualization since they are combined. Our visualizations shown in Figure 8 can clearly distinguish the segmented notes of a guitar alone from the continuous melody of a piano; however, if a loud voice were added to both, they would become basically the same.

One aspect of a song that could be captured well by our visualization, though, is its overall sound progression. To verify this claim, we generated static visualizations for famous songs (see Figures 9, 10, and 11) from styles of music that have distinct overall progression

Figure 8: Audio visualizations for Isn’t She Lovely by Stevie Wonder: Guitar cover by Sungha Jung [1] and Jazz Piano cover by Yohan Kim [2].

patterns.

Figure 9: Audio visualizations for generic pop songs: Shape of You by Ed Sheeran [3] (left), Stupid Love by Lady Gaga [4] (middle). Audio visualization for an EDM song: Summer Days (feat. Macklemore and Patrick Stump) by Martin Garrix [5] (right).

Though these patterns are not always fully upheld, our visualizations still show some general

differences between these three groupings: pop, sentimental ballad, and reggae. The louder genre

of music like pop shows mostly high frequencies throughout, with the exception of the ending where most songs tend to gradually fade out. The slower ballad songs, however, tend to start lower

and grow in average frequency. This correlates with how ballad artists sing softly in the beginning

but belt higher notes by the end. Often in those songs, the chorus is louder with softer verses as well; thus there are some bluer portions in other spots that display the start of verses too. Finally,

Figure 10: Audio visualizations for sentimental ballads: If I Were a Boy by Beyoncé [6] (left), Hello by Adele [7] (middle), Total Eclipse of the Heart by Bonnie Tyler [8] (right).

Figure 11: Audio visualizations for reggae songs: Three Little Birds by Bob Marley [9] (left), Is This Love by Bob Marley [10] (right).

as for reggae songs, there is strong variation in frequency throughout, with many brief moments

of no sound at all. This represents the softer instrumentals and slower singing style that results in

more pauses.

4.2 Challenges and Current Limitations

We encountered a variety of challenges while working on this project. Some obstacles were

simply a result of the tools and libraries chosen. In addition, we considered various tradeoffs

before determining the final design. From that point on, other enhancement ideas would need to

not only fit the overall project goals, but also work well with the decided design.

One challenge of choosing a less established library like PyQtGraph was simply the lack of a

larger user community and more limited documentation available. In addition to some lack of detail

for parameter descriptions, some major structural aspects of the GLMeshItem were

not clearly expressed, namely the relationship between the vertices, faces, and colors, and how these

parameter lists may be modified but cannot dynamically grow after initialization. These issues were mostly figured out through debugging, but some remain unsolved. For example, in order

to capture the final static image, a screenshot of the window is taken. However, there is no existing

method within this framework to refresh the frame buffer so that the window is not required to be visible when this screenshot is taken.

One of the many tunable parameters that would change the growth and appearance of our visu-

alization is the sample square width. Before the default value for this was chosen, the specifics for

the squares' path were decided. As seen in our spiral path from Figure 4, regardless of the sample

square size, each sample only moves ahead one grid position at every update and the spiral path

radius only grows by one grid position at every new loop. This overlap is done to keep our mesh

item grid from growing too large to be handled and also to keep the growth of our visualization

gradual and less choppy. Therefore, even though a larger square width would provide a more satis-

fying dynamic visualization, it will result in the loss of more information due to this overlap. The

larger the width, the more information will be lost from the beginning spiral loops and for each

corner turn of a loop. In addition, the entire square width is kept for the last outer square loop; the

larger the width of the square, the more overrepresented the ending portion of the audio will be.

Therefore, in order to display more representative information in our final static visualization, two was chosen as a good default width that balances all of these concerns.

One idea to add more detailed pitch information to each sample square was to use a chroma

classification. Currently, each sample square simply displays a color representing the average

frequency of that sample. By using the chroma capability provided by the Python LibROSA li-

brary [48], the energy in twelve common pitch classes could be read. This information could pos-

sibly be displayed on triangle faces of the sample square. However, with the intention to retain a

gradual spiral path growth, at best, only one grid square, or two triangle faces, from each sample

square will be seen at the end anyway. Therefore, it would be better to simply use that one grid

square to represent the average frequency.

Finally, in order to avoid the overhead of a WAV file upload, simply reading audio from the

computer output was considered as well. However, in order to initialize our mesh grid to the ap-

propriate size from the start, we must know the size of the song to be visualized. In addition, if we wished to do any pre processing in the future with our audio, we would need all of its information

readily available to do so.

4.3 Future Work Directions

Our current project only captured the audio features of frequency and amplitude. As a result,

patterns seen between visualizations were identified on a sample-by-sample basis considering these

two factors alone. However, sometimes at the cost of some of our current project goals, there

are other features we could extract in the future that might result in more distinguishable final visualizations.

One processing technique for audio is separating it into its distinct portions. This is done for

some samples of the Geometric Models for Musical Audio Data work (see Table 1) by separating the visualization

area used for each component, like the chorus, verse, and bridge. This could be useful in genre classi-

fication when patterns exist specifically for certain components. For example, in our evaluation

of the sentimental ballad genre earlier, we mentioned that the choruses tend to be louder than the verses. Displaying this more concretely may allow us to more easily identify this pattern. The

Spotify web API offers capabilities to perform this segmentation. It is able to convert an audio

file into a track object from which sections can be extracted. These sections are defined by large variations in rhythm or timbre and a confidence level for each determined section is given. As

mentioned, however, the use of a web API would sacrifice the portability of our application. In the

future, though, this may not be a priority for the project.

Another interesting aspect to capture would be the timbres present in a musical piece. Timbre

can be thought of as what makes each sound, instrument, or voice unique from another, even if they

intone the same note. In our visualization, we use the average frequency per sample to detect overall

pitch. The color of the different sounds heard is thus not observed. Separating the audio signals of

different instruments heard, for example, is done in Malinowski’s Music Animation Machine work.

Though the algorithms used in his work are not disclosed, timbre readings are provided by the

Spotify web API as well. Visually incorporating this feature might make our visualizations more distinguishable

from one another.

Other audio processing libraries were also considered in order to extract additional audio fea-

tures. The Python Echo Nest library [49] was found first and could provide valuable information

like tempo from an audio file. Tempo was heavily leveraged to augment the aesthetics in projectM

(see Table 1) and could possibly do the same for our project. However, the Echo Nest API was

no longer available, and developers were encouraged to use Spotify's web API [50] instead. In

order to retain our project’s portability and keep all audio processing local, this API’s capabilities were not employed.

The suggested enhancements thus far have simply been educated guesses as to what might help

distinguish patterns further. However, the audio data from the CSV file could be used to catch

salient features in numbers that our eyes did not. Creating models to leverage this data may allow

us to discover other distinct areas to emphasize in our visualization.

Chapter 5

Conclusion

In conclusion, we were able to create open-source music visualization software as intended, with static and dynamic capabilities. As the audio file plays, our visualization grows within a three

dimensional space. Frequency and amplitude are used as the metrics to determine each display

update. Once complete, a data file and image capturing the final visualization are output.

For our design, we implement a dynamically growing spiral path on a grid mesh surface. Each

update adds a square to the path based on the sample read. The main tool used for the graphical

display was PyQtGraph, an open-source Python library. We also use other Python libraries like

NumPy to more efficiently parse data and to apply the FFT algorithm on our waveform input. The

chosen tools enabled our software to stay very portable.

Though our visualization was not able to differentiate all genres, it did capture the overall pro-

gression of an audio piece. This proved to be useful in identifying genres like ballads, for exam-

ple, that followed a fairly distinct sequence of intensity. Genres that remain high pitched or loud

throughout were difficult to distinguish, however, since each sample then captured nearly identical

amplitude and frequency information.

In the future, we hope to capture more salient features in the audio to enhance our visualization.

Further analysis of the data output of our software could determine which features would be most

useful. Some considerations include chroma classification, tempo detection, sectioning, or timbre

reading. Tools such as the Spotify API can be used to extract these.

A more detailed design structure that incorporates these metrics would improve our overall genre classification. Since few works in audio visualization set this as their main objective, this project is a worthwhile area to explore further.

Bibliography [1] Isn’t She Lovely, 2017. Guitar cover by Sungha Jung based on a song by Stevie Wonder, https://www.youtube.com/watch?v=o0NJiasWrLc, accessed April 2020. [2] Isn’t She Lovely, 2018. Piano cover by Yohan Kim based on a song by Stevie Wonder, https: //www.youtube.com/watch?v=t3XFMbidu3M, accessed April 2020. [3] Ed Sheeran. Shape of You, 2017. From the music album Divide (÷), https://www.youtube. com/watch?v=_dK2tDK9grQ, accessed April 2020. [4] Lady Gaga. Stupid Love, 2020. From the music album Stupid Love, https://www.youtube. com/watch?v=ykML8A5o5bs, accessed April 2020. [5] Martin Garrix. Summer Days (feat. Macklemore and Patrick Stump), 2019. From the mu- sic album Summer Days, https://www.youtube.com/watch?v=LdvvPtIfR8w, accessed April 2020. [6] Beyoncé. If I Were a Boy, 2008. From the music album If I Were A Boy, https://www.youtube. com/watch?v=Ld6Qp8R-ATA, accessed April 2020. [7] Adele. Hello, 2015. From the music album 25, https://www.youtube.com/watch?v= YQHsXMglC9A, accessed April 2020. [8] Bonnie Tyler. Total Eclipse of the Heart, 1983. From the music album Now That’s What I Call Music, https://www.youtube.com/watch?v=dLsoCCCxEUs, accessed April 2020. [9] Bob Marley. Three Little Birds, 1977. From the music album Exodus, https://www.youtube. com/watch?v=LanCLS_hIo4, accessed April 2020. [10] Bob Marley. Is This Love, 1978. From the music album Kaya, https://www.youtube.com/ watch?v=m-2RHvGSvYQ, accessed April 2020. [11] Apoorva Shastri. AudioSquare, 2020. https://github.com/apoorvas47/AudioVisualizer, ac- cessed April 2020. [12] Pennsylvania State University. What is Sonification?, 2015. https://aaresearch.psu.edu/artist/ sonifications-of-the-universe-and-more/what-is-sonification, accessed April 2020. [13] Mark Ballora. Opening Your Ears to Data, 2011. TEDx Talk, https://www.youtube.com/ watch?v=aQJfQXGbWQ4, accessed April 2020. [14] Jonathan Cabreira. A music taste analysis using Spotify API and Python, 2019. https://towardsdatascience.com/a-music-taste-analysis-using-spotify--and-python- e52d186db5fc, accessed April 2020. [15] Jeffrey Hass. Introduction to Computer Music: Volume One, 2013. https://cecm.indiana.edu/ etext/acoustics/chapter1_amplitude.shtml, accessed April 2020. [16] Audio signals. https://www.hackaudio.com/computer-programming/audio-basics/audio- signals/, accessed April 2020. [17] How Music Works website. Amplitude and frequency. https://www.howmusicworks.org/103/ Sound-and-Music/Amplitude-and-Frequency, accessed April 2020. [18] Audacity. Sample Rates, 2019. https://manual.audacityteam.org/man/sample_rates.html, ac- cessed April 2020. [19] John VE6EY. Signal analysis for a morse decoder, 2017. http://play.fallows.ca/wp/radio/ham- 29

radio/signal-analysis-morse-decoder/, accessed April 2020. [20] Eric W. Weisstein. Discrete fourier transform. From MathWorld–A Wolfram Web Resource. https://mathworld.wolfram.com/DiscreteFourierTransform.html, accessed April 2020. [21] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009. [22] William T. Cochran, James W. Cooley, David L. Favin, Howard D. Helms, Reginald A. Kaenel, William W. Lang, George C. Maling, David E. Nelson, Charles M. Rader, and Pe- ter D. Welch. What is the fast Fourier transform? Proceedings of the IEEE, 55(10):1664–1674, 1967. [23] Douglas L. Jones. Decimation-in-time (DIT) Radix-2 FFT. https://cnx.org/contents/ [email protected]:zmcmahhR@7/Decimation-in-time-DIT-Radix-2-FFT, accessed April 2020. [24] Kartik Chaudhary. Understanding Audio data, Fourier Transform, FFT and Spectrogram features for a Speech Recognition System. https://towardsdatascience.com/understanding- audio-data-fourier-transform-fft-spectrogram-and-speech-recognition-a4072d228520, accessed April 2020. [25] Tim Davis. NotesArt Studio: Translating Music Into Visual Art. http://www.notesartstudio. com/, accessed April 2020. [26] Mathew Preziotte. PartyMode: an open-source music visualizer for your browser, 2014. https: //github.com/preziotte/party-mode, accessed April 2020. [27] Stephen Malinowski. Music Animation Machine : Chopin, Etude, opus 25 no. 11, A minor (”Winter Wind”), 2016. https://www.youtube.com/watch?v=RUmLy2CKs5k, accessed April 2020. [28] GLava: OpenGL audio spectrum visualizer. https://github.com/jarcode-foss/glava, accessed April 2020. [29] projectM - the most advanced open-source music visualizer. https://github.com/projectM- visualizer/projectm, accessed April 2020. [30] Paul Bendich, Ellen Gasparovic, John Harer, and Christopher Tralie. Geometric models for musical audio data. In 32nd International Symposium on Computational (SoCG 2016). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016. [31] Chun-houh Chen, Wolfgang Karl Härdle, and Antony Unwin. Handbook of data visualization. Springer Science & Business Media, 2007. [32] Evgeniy Yur’evich Gorodov and Vasiliy Vasil’evich Gubarev. Analytical review of data visu- alization methods in application to big data. Journal of Electrical and Computer Engineering, 2013, 2013. [33] Natalia Andrienko, Gennady Andrienko, and Peter Gatalsky. Exploratory spatio-temporal visualization: an analytical review. Journal of Visual Languages & Computing, 14(6):503– 541, 2003. [34] Hans G Kaper, Elizabeth Wiebel, and Sever Tipei. Data sonification and sound visualization. Computing in science & engineering, 1(4):48–58, 1999. [35] Yang-Hann Kim and Jung-Woo Choi. Sound visualization and manipulation. John Wiley & Sons, 2013. 30

[36] Jimmy Azar, Hassan Abou Saleh, and Mohamad Al-Alaoui. Sound visualization for the hear- ing impaired. International Journal of Emerging Technologies in Learning (iJET), 2(1), 2007. [37] Ondřej Kubelka. Interactive music visualization. In Central European Seminar on , volume 4, 2000. [38] Franklin B. Zimmerman. Music visualization system utilizing three dimensional graphical representations of musical characteristics, 2002. US Patent 6,411,289. [39] Robyn Taylor, Pierre Boulanger, and Daniel Torres. Real-time music visualization using responsive imagery. In 8th International Conference on Virtual , pages 26–30, 2006. [40] Jia Li. Music analysis through visualization. In Proc. of the Int. Conf. on Technologies for Music Notation and Representation, pages 220–225, 2016. [41] Tao Li, Mitsunori Ogihara, and Qi Li. A comparative study on content-based music genre classification. In Proc. 26th Annual International ACM SIGIR conference on Research and development in , pages 282–289, 2003. [42] Tao Li and Mitsunori Ogihara. Music genre classification with taxonomy. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages v–197. IEEE, 2005. [43] PyQtGraph, 2012. http://www.pyqtgraph.org/, accessed January 2020. [44] Travis Oliphant. Numpy, 1995. https://pypi.org/project/numpy/, accessed April 2020. [45] PyAudio, 2006. https://pypi.org/project/PyAudio/, accessed January 2020. [46] Online-Convert.com. https://www.online-convert.com/, accessed April 2020. [47] FLV2MP3. https://www.flv2mp3.by/en9/, accessed April 2020. [48] LibROSA, 2015. https://pypi.org/project/librosa/, accessed January 2020. [49] The Echo Nest. https://pypi.org/project/pyechonest/, accessed April 2020. [50] Spotify Web API. https://developer.spotify.com/documentation/web-api/, accessed April 2020. 31

ACADEMIC VITA

Apoorva Shastri : aus501547@.com / [email protected]

EDUCATION:
Pennsylvania State University (Schreyer Honors College), University Park, PA, May 2020 Graduation
• B.S. in Computer Science (with Korean Language Minor)
Yonsei University, Seoul, Korea, Fall 2018 Study Abroad

TECHNICAL WORK EXPERIENCE:
Capital One, Internship Program: Software (Plano, TX), Jun '19 – Aug '19
• Individually implemented a feature on Capital One's website that prompts users who are prequalifying for auto refinance to save the form information entered so far so that they may return to it later.
  o Used Angular for the micro frontend implementation and streamed data via Apache Kafka through a Java Spring Boot backend (under a BFF (backend for frontend) architecture).
• Completed 3 team hackathon projects under Capital One: 1) an iOS app to teach kids about how credit cards work, 2) a Chrome extension that indicates each email's urgency on a Gmail inbox, and 3) a way to preview and persist uploaded documents during the auto refinance web application process.

Google, Intern (New York City, NY), May '18 – Aug '18
• Worked individually on a feature in the Google Slides Android app that allows users to multitask by controlling any ongoing Slides presentation from a small floating window (using Android's Picture-in-Picture API). [Feature is in production currently]
  o Wrote engineering design document, implementation (in Java for Android), unit tests (using Mockito and Robolectric), QA testing plan, and impression reporting (to track usage).

Google, Engineering Practicum Intern (Mountain View, CA), May '17 – Aug '17
• Worked with another intern on a feature in the iOS app that encourages users to walk instead of drive by showing the calories and equivalent mini-cupcakes that would be burned by walking. [Feature was in production but later removed due to strong public user feedback]
  o Wrote engineering design document, implementation (in Objective-C), testing notes, and metrics logging while assisting in product design decisions as well.
  o Conducted user study testing to gauge customer attitude toward feature.
• Individually added open/closed status to preview information for applicable locations (e.g. stores).

RESEARCH:
Undergraduate Honors Thesis: Visualization Songs as Graphs (PSU), Fall '19 – Spring '20
• Will research dimensionality reduction strategies, graph drawing algorithms, sparse matrix representations, and computer graphics to develop algorithms and an open-source software that generates static and dynamic graph visualizations of any musical composition.

LEADERSHIP & INVOLVEMENT:
Women in Engineering Program: Matrices Math Course Facilitator (PSU), Fall '17 – Spring '18
• Prepared weekly review sessions and exam prep for girls in the entrance to engineering major course.
Association of Women in Computing (PSU), Spring '17
• Girls Who Code Python Teaching: Taught 8th grade girls the basics of coding in Python while also offering advice about opportunities in the computer science field.
• Marketing Committee member: Regularly updated website with relevant weekly information and contributed ideas to weekly meetings.
Global Programs: International Student Orientation Leader (PSU), Aug '19
• Led and engaged incoming international students in various small groups through orientation activities around campus, answering the questions of students and parents alike.
Havyaka Association of Americas: Youth Camp Founder & Organizer (Toronto, ON), Summer '19
• Co-founded, planned, and led first overnight youth event (as an extension of the 18th biennial HAA convention) to connect the next Havyaka generation with one another and their heritage / culture.
Schreyer Success Center: Success Coach (PSU), Spring '19 – Spring '20
• Provide individual mental wellness support and mental health awareness in the scholar community.
Korean Pop Music & Dance Club: Dance Coordinator (PSU), Spring '17 – Spring '20
Penn State THON: Entertainment Committee Member (PSU), Fall '19 – Spring '20

SKILLS: iOS (Swift & Objective-C), Web (Angular (Typescript), HTML, & CSS), Android (Java & XML), Python, Java, Git, C, SQL

AWARDS:
• Grace Hopper Celebration Capital One Sponsorship, Aug '19
• PSU Women In Engineering Leadership Scholarship, Dec '17
• Schreyer Academic Excellence Scholarship, Each College Year
• Grace Hopper Celebration Google Sponsorship, Aug '17
• PSU College of Engineering International Travel Grant, Sept '18
• Schreyer Ambassador Travel Grant, Sept '18

LANGUAGES: English, Korean, Kannada, Hindi, French, ASL