Machine learning is one of the most amazing new things our smartphones can do, but it’s a term that’s often used and seldom understood. In a blog post, Google took the time to explain in detail how machine learning algorithms are used in the new Recorder app for Pixel phones, and how they make it the best recording app you’ve ever used.
Recorder’s simple interface is deceptive. Behind it is a collection of code designed to listen to, understand, transcribe, and even classify the speech and other audio your phone hears while recording. While recording audio, you’ll immediately notice a few things: aside from the waveform and timeline, different colors and categories appear on screen in the main tab, while the words being said appear in real time in the transcription tab.
Recorder is able to label audio in real time because its back-end code analyzes the incoming audio and cross-references it with the different types of audio it has been taught to understand. Examples of understood audio categories include music, speech, whistling, a dog barking, and plenty of other common sounds. Each sound category is shown in its own color, which helps users quickly identify what’s being heard during playback without having to actually listen to the audio. That makes a huge difference when trying to find something after the recording has finished, as you’ll no longer have to sit and scrub through audio just to find what you’re looking for.
Recorder checks for sound profiles every 50ms, which works out to 20 classifications per second. Left unfiltered, the category shown would change constantly and vary wildly depending on what’s identified as the primary audio at each instant. To avoid this sort of scatter-brained categorization, Google has developed a filtering method that tosses out the junk data by cross-referencing it with longer samples of the audio being recorded, so sounds keep a stable category instead of constantly switching during listening.
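To make the filtering idea concrete, here is a minimal sketch in Python. It is not Google’s actual method, which the article doesn’t spell out. It swaps in a simple majority vote over a longer window as a stand-in. The function names, the sample labels, and the window of 20 frames (roughly one second) are all assumptions chosen for illustration; only the 50ms interval comes from the article.

```python
from collections import Counter
from itertools import groupby

FRAME_MS = 50  # classification interval mentioned in the article


def smooth_labels(frame_labels, window_frames=20):
    """Replace each frame's label with the most common label around it.

    window_frames=20 spans roughly one second of 50 ms frames; this value
    is an assumption chosen for illustration.
    """
    half = window_frames // 2
    smoothed = []
    for i in range(len(frame_labels)):
        # Look at the neighbouring frames on both sides, clipped at the edges.
        window = frame_labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed


def to_segments(labels):
    """Collapse runs of identical labels into (label, start_ms, end_ms)."""
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position * FRAME_MS, (position + length) * FRAME_MS))
        position += length
    return segments


if __name__ == "__main__":
    # Hypothetical per-frame output: speech with a single 50 ms "dog" blip, then music.
    raw = ["speech"] * 10 + ["dog"] + ["speech"] * 10 + ["music"] * 30
    print(to_segments(raw))                 # four segments, including the "dog" blip
    print(to_segments(smooth_labels(raw)))  # the blip is absorbed into "speech"
```

The trade-off in a scheme like this is responsiveness: a longer window gives steadier colors on the timeline, but a genuinely short sound, such as a single bark, can be voted out along with the noise.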
During recording, Recorder identifies spoken words via an on-device machine learning algorithm. That means no data is sent to Google servers (or anywhere else, for that matter), as the phone’s own processor checks what it hears against a sort of on-device dictionary to settle on the correct words. Words are checked against a decision tree that includes filtering for things like swear words. This model is so advanced it’s even able to identify the grammatical roles of words, which helps it form full sentences for later use.
These words are then assembled into sentences, with each word assigned a position on the recording’s timeline. Once recording has finished, the words can be scrolled through and searched. Users can even tap a word to jump to that specific time in the recording, which helps with understanding context and meaning. By combining the sound categories with word recognition, Google can even suggest three tags at the end of a recording to help name it more quickly and accurately.
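The word timeline is easy to picture as data. The Python sketch below is purely illustrative and is not how Recorder stores its transcript; the `TimedWord` type, the `find_word` and `word_at` helpers, and the sample timestamps are all hypothetical. It shows how storing a start time with each word supports both searching and jumping to a point in the recording.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    text: str
    start_ms: int  # position of the word on the recording's timeline


def find_word(timeline, query):
    """Return the start time of every word matching the query.

    Matching ignores case and trailing or leading punctuation.
    """
    wanted = query.casefold()
    return [
        word.start_ms
        for word in timeline
        if word.text.strip(".,!?;:").casefold() == wanted
    ]


def word_at(timeline, position_ms):
    """Return the word being spoken at position_ms, or None before the first word.

    Assumes the timeline is sorted by start_ms.
    """
    starts = [word.start_ms for word in timeline]
    index = bisect_right(starts, position_ms) - 1
    return timeline[index] if index >= 0 else None


if __name__ == "__main__":
    # Hypothetical transcript with made-up timestamps.
    timeline = [
        TimedWord("Machine", 0),
        TimedWord("learning", 400),
        TimedWord("is", 900),
        TimedWord("machine", 1200),
        TimedWord("work.", 1700),
    ]
    print(find_word(timeline, "machine"))  # start times to seek to
    print(word_at(timeline, 1000))         # word under the playhead
```

Tapping a word then amounts to seeking playback to its `start_ms`, and highlighting the current word during playback is the reverse lookup.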