Google's Next Generation Music Recognition

By: Yash Dadia
www.attuneww.com
All Rights Reserved

Table of Contents

Google's "Now Playing" Introduction
How to Set Up Now Playing on Your Device
Now Playing versus Sound Search
The Core Matching Process of Now Playing
Scaling Up Now Playing for the Sound Search Server
Updated Overview of Now Playing
About Attune

Google's "Now Playing" Introduction

● In 2017, Google launched Now Playing on the Pixel 2, using deep neural networks to bring low-power, always-on music recognition to mobile devices. In developing Now Playing, Google's goal was to create a small, efficient music recognizer that requires only a very small fingerprint for each track in the database, allowing music recognition to run fully on-device without an internet connection.

● As it turned out, Now Playing was not only useful as an on-device music recognizer; it also greatly exceeded the accuracy and efficiency of Google's then-current server-side system, Sound Search, which was built before the widespread adoption of deep neural networks. Naturally, Google wondered whether the same technology that powers Now Playing could be brought to the server-side Sound Search, with the goal of making Google's music recognition capabilities the best in the world.

● Recently, Google introduced a new version of Sound Search that is powered by some of the same technology used by Now Playing. You can use it through the Google Search app or the Google Assistant on any Android device. Just start a voice query, and if there is music playing near you, a "What's this song?" suggestion will pop up for you to press. You can also ask, "Hey Google, what's this song?" With the latest version of Sound Search, you get faster, more accurate results than ever before.

How to Set Up Now Playing on Your Device

● If you have used Google to identify a song with your device, you have probably wondered how to find all those past discoveries.

● They are saved for you, but you have to know where to look. It is a very good but lesser-known feature of Android: Google search can identify songs, just like Shazam or SoundHound.

● Simply say "OK Google" (if you are using the Google Now Launcher), just as you would to give a Google Now voice command. If music is playing around you, a music icon will appear. Touch it, and within a few moments Google will deliver the song, album, and artist information, along with a link to the Play Store.

● To get this on your Android device, you need to add the Sound Search widget to your home screen. In stock Android, touch and hold the home screen, select the widgets section, and then swipe through the choices until you come to Sound Search. Then touch and hold to pick up the widget and drag it to a free space on the home screen.

Now Playing versus Sound Search

● Now Playing miniaturizes music recognition technology so that it is small and efficient enough to run continuously on a mobile device with negligible battery impact. To do this, Google developed a new system that uses convolutional neural networks to turn a few seconds of audio into a unique "fingerprint." This fingerprint is then compared with an on-device database holding tens of thousands of songs, which is regularly updated to add newly released tracks and remove those that are no longer popular. A minimal sketch of this fingerprinting idea follows below.
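To make the fingerprinting step concrete, here is a minimal, hypothetical sketch in Python (using PyTorch). The log-mel front end, the layer sizes, and the 96-dimensional output are illustrative assumptions, not Google's actual architecture; the source only says that a convolutional neural network turns a few seconds of audio into a compact fingerprint.

```python
# Hypothetical sketch of a Now Playing-style fingerprinter: a small CNN that
# maps a 2-second log-mel spectrogram patch to a compact, L2-normalized
# embedding. All layer sizes and the 96-dim output are invented for
# illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fingerprinter(nn.Module):
    def __init__(self, emb_dim: int = 96):
        super().__init__()
        # A stack of strided convolutions over the (mel, time) plane.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),        # global pooling -> (batch, 64, 1, 1)
        )
        self.proj = nn.Linear(64, emb_dim)  # project into the embedding space

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_mels, frames), one 2-second window
        x = self.conv(spectrogram).flatten(1)
        return F.normalize(self.proj(x), dim=1)  # unit-length fingerprint

# One 2-second window -> one 96-dim fingerprint vector.
model = Fingerprinter()
window = torch.randn(1, 1, 64, 200)  # stand-in for a real log-mel patch
print(model(window).shape)           # torch.Size([1, 96])
```

Because the output is unit-length, two clips can be compared with a simple dot product, which is the operation the matching process described later relies on.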
● The server-side Sound Search system, by contrast, is very different, having to match against roughly 1,000 times as many songs as Now Playing. Making Sound Search both faster and more accurate with a much larger musical library presented several unique challenges. But before going into detail, here is a brief overview of how Now Playing works.

● Google Assistant can identify songs playing near you, in an update that is available to all devices that have Google Assistant.

● After invoking Google Assistant, you can ask "What song is this?" or "What song is playing?", and the Assistant will respond with the name of the song, the artist, lyrics, and streaming links for YouTube, Google Play Music (of course), and Spotify.

● VentureBeat has reached out to Google for additional comment on how Sound Search will work on Android devices.

The Core Matching Process of Now Playing

● Now Playing generates the musical "fingerprint" by projecting the musical features of an eight-second portion of audio into a sequence of low-dimensional embeddings: seven two-second clips taken at one-second intervals, giving a sequence of overlapping two-second windows.

● Now Playing then searches the on-device song database, which was generated by processing popular music with the same neural network, for similar embedding sequences. The search uses a two-phase algorithm to identify matching songs: the first phase uses a fast but inaccurate algorithm that searches the whole song database to find a few likely candidates, and the second phase does a detailed analysis of each candidate to work out which song, if any, is the right one.

1. Matching, phase 1: Finding good candidates: For every embedding, Now Playing performs a nearest-neighbour search on the on-device database of songs for similar embeddings. The database uses spatial partitioning and vector quantization to efficiently search through millions of embedding vectors. Because the audio buffer is noisy, this search is approximate, and not every embedding will find a nearby match in the database for the correct song. However, over the whole clip, the chances of finding several nearby embeddings for the correct song are very high, so the search is narrowed to a small set of songs that received multiple hits.

2. Matching, phase 2: Final matching: Because the database search used above is approximate, Now Playing may not find song embeddings that are nearby to some embeddings in the query. Therefore, to calculate an accurate similarity score, Now Playing retrieves all embeddings for each candidate song that might be relevant, filling in the "gaps". Then, given the sequence of embeddings from the audio buffer and another sequence of embeddings from a song in the on-device database, Now Playing estimates their similarity pairwise and adds up the estimates to get the final matching score.

● The key to the accuracy of Now Playing is the use of a sequence of embeddings rather than a single embedding. The fingerprinting neural network is not accurate enough to allow identification of a song from a single embedding alone: each embedding will generate many false positive results. However, combining the results from multiple embeddings allows the false positives to be easily removed, as the correct song will be a match to every embedding, while false positive matches will only be close to one or two embeddings from the input audio. A minimal sketch of this two-phase matching process follows below.
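As a concrete illustration of the two-phase search just described, here is a minimal, hypothetical sketch in Python. It uses a brute-force cosine search in place of the real spatially partitioned, vector-quantized index, and every size and threshold (embedding dimension, database size, min_hits, and so on) is an invented stand-in for illustration.

```python
# Hypothetical sketch of Now Playing's two-phase matching. Phase 1 finds
# candidate songs that are hit by several query embeddings; phase 2 scores
# each candidate by aligning the full embedding sequences.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 96        # assumed embedding size
SONGS = 1000        # toy database size
EMBS_PER_SONG = 50  # embeddings indexed per song

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy database: song_id -> sequence of unit-length embeddings.
db = {s: normalize(rng.normal(size=(EMBS_PER_SONG, EMB_DIM)))
      for s in range(SONGS)}
flat = np.concatenate([db[s] for s in range(SONGS)])    # all embeddings
owner = np.repeat(np.arange(SONGS), EMBS_PER_SONG)      # embedding -> song id

def phase1_candidates(query, k=5, min_hits=2):
    """Fast, approximate pass: for each query embedding, find its nearest
    database embeddings and keep songs hit multiple times."""
    hits = {}
    for q in query:
        nearest = np.argsort(flat @ q)[-k:]             # top-k by cosine
        for song in set(owner[nearest]):
            hits[song] = hits.get(song, 0) + 1
    return [s for s, n in hits.items() if n >= min_hits]

def phase2_score(query, song):
    """Detailed pass: slide the query sequence over the song's embedding
    sequence and sum pairwise similarities at the best alignment."""
    seq = db[song]
    best = -np.inf
    for off in range(len(seq) - len(query) + 1):
        best = max(best, float(np.sum(seq[off:off + len(query)] * query)))
    return best

# A query of 7 embeddings (2s windows at 1s stride over 8s of audio), taken
# here from a known song plus noise so the match is recoverable.
true_song = 123
query = normalize(db[true_song][10:17] + 0.1 * rng.normal(size=(7, EMB_DIM)))

candidates = phase1_candidates(query)
scores = {s: phase2_score(query, s) for s in candidates}
print(max(scores, key=scores.get))   # -> 123 with high probability
```

Note how the sketch mirrors the accuracy argument above: any single noisy embedding can have spurious neighbours, but only the correct song accumulates hits across the whole sequence.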
Scaling Up Now Playing for the Sound Search Server

● So far, this article has gone into some detail about how Now Playing matches songs against an on-device database. The biggest challenge in going from Now Playing, with tens of thousands of songs, to Sound Search, with tens of millions, is that there are a thousand times as many songs that could give a false positive result. To compensate for this without any other changes, Google would have had to increase the recognition threshold, which would mean needing more audio to get a confirmed match. However, the goal of the new Sound Search server was to match even faster than Now Playing, so Google did not want people to wait 10+ seconds for a result.

● As Sound Search is a server-side system, it is not limited by the processing and storage constraints that bind Now Playing. Therefore, Google made two major changes to how it does fingerprinting, both of which increased accuracy at the expense of server resources:

○ Google quadrupled the size of the neural network used, and increased each embedding from 96 to 128 dimensions, which reduces the amount of work the neural network has to do to pack the high-dimensional input audio into a low-dimensional embedding. This was critical in improving the quality of phase two, which is very dependent on the accuracy of the raw neural network output.

○ Google doubled the density of the embeddings: it turns out that fingerprinting audio every 0.5 seconds instead of every 1 second does not reduce the quality of the individual embeddings very much, and it gives a huge boost by doubling the number of embeddings available for the match.

● Google also decided to weight the index based on song popularity: in effect, the matching threshold is lowered for popular songs and raised for obscure ones. Overall, this means that more (obscure) songs can be added to the database almost indefinitely without slowing recognition speed too much. A minimal sketch of this popularity weighting appears at the end of this article.

Updated Overview of Now Playing

● With Now Playing, Google originally set out to use machine learning to create a robust audio fingerprint compact enough to run entirely on a phone. It turned out that Google had, in fact, created a very good all-round audio fingerprinting system, and the ideas developed there carried over very well to the server-side Sound Search system, even though the challenges of Sound Search are quite different. Additional improvements to Google's music identification speed and accuracy in noisy environments are also in the works.
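To close, here is a minimal, hypothetical sketch of the popularity weighting described in the scaling section above. The baseline threshold, the adjustment range, and the popularity scores are all invented for illustration; the source only says that popular songs get a lower matching threshold and obscure songs a higher one.

```python
# Hypothetical sketch of a popularity-weighted matching threshold: popular
# songs need a lower phase-2 score to count as a match, obscure songs a
# higher one. All numbers here are invented for illustration.

BASE_THRESHOLD = 5.0  # assumed baseline phase-2 matching score
MAX_ADJUST = 1.5      # assumed maximum up/down adjustment

def matching_threshold(popularity: float) -> float:
    """Map a popularity score in [0, 1] (1 = very popular) to a per-song
    threshold: popular songs are easier to match, obscure songs harder."""
    return BASE_THRESHOLD + MAX_ADJUST * (1.0 - 2.0 * popularity)

def is_match(score: float, popularity: float) -> bool:
    return score >= matching_threshold(popularity)

print(matching_threshold(1.0))        # 3.5 -> chart-topper: lenient threshold
print(matching_threshold(0.0))        # 6.5 -> obscure track: strict threshold
print(is_match(5.0, popularity=0.9))  # True
print(is_match(5.0, popularity=0.1))  # False
```

This is why the index can keep growing with obscure songs without hurting recognition speed much: new long-tail entries carry strict thresholds, so they rarely compete as false positives against popular tracks.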