

Deep learning has been playing an increasingly large role in speech recognition. A human can turn audio into words, and words into meaning, almost effortlessly, but for machines this has historically been hard; you can think of it as one of the consummate AI tasks. The goal of building a speech pipeline is to take a raw audio wave as input and build a speech recognizer that can do a very simple task: print out "hello world" when I input a "Hello World" wave.

We have divided the speech recognition process into three parts. In the first part I'm going to introduce pre-processing and encoding. Second, Connectionist Temporal Classification (CTC), which is the most mature piece of sequence-learning technology for deep learning right now: one of the fundamental problems of speech recognition is how to build a neural network that can map an audio signal to a transcription of variable length, and CTC is one highly mature method for doing this. Finally, a bit about decoding and language models, which is sort of an addendum to the acoustic models we can build and makes them perform a lot better.

Representing the audio signal that goes into our pipeline should be pretty straightforward: unlike a two-dimensional image, where we normally have a 2D grid of pixels, audio is just a 1D signal. There are a bunch of different formats for audio, but typically it is a one-dimensional wave sampled at something like 8k or 16k samples per second, with each sample quantized into 8 or 16 bits. You can just think of the signal as a 1D vector, broken down into samples x1, x2, and so forth. If you had a one-second audio clip, this vector would have a length of either 8,000 or 16,000 samples, and each element is a floating-point number extracted from the 8- or 16-bit sample. Once you have an audio clip, you do a little bit of pre-processing. There are a couple of ways to start; the first is to convert the signal to a simple spectrogram.
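To make that concrete, here is a minimal sketch of loading a clip into that 1D float vector. It assumes a mono, 16-bit PCM WAV file read with scipy; the filename is just a placeholder:

```python
# Load a 16-bit PCM WAV file into a 1D float vector (sketch; assumes mono input).
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("hello_world.wav")  # e.g. rate == 16000, dtype == int16

# A one-second clip at 16 kHz gives a vector of 16,000 samples: x1, x2, ...
# Convert the 16-bit integer samples to floats in [-1, 1].
x = samples.astype(np.float32) / np.iinfo(samples.dtype).max
```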

These audio signals are made up of a combination of sine waves at different frequencies, and the FFT converts a signal into the frequency domain: for every frequency of sine wave, it gives the magnitude, the amount of power that sine wave contributes to the original signal. To process an audio clip, the first thing to do is cut out a little window that's typically about 20ms long. The FFT converts this little window into the frequency domain, and then we take the log of the power at each frequency. The spectrogram is thus sort of like a frequency-domain representation, but instead of representing the entire signal in terms of frequencies, it represents each small window in terms of frequencies. You lose a little bit of information when you do this.
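Here is a minimal sketch of that spectrogram computation: slice the signal into ~20ms windows, FFT each one, and take the log of the power at each frequency. The 20ms/10ms window and hop sizes and the Hann taper are common choices, not something the text above prescribes:

```python
# Log power spectrogram (sketch): window the 1D signal, FFT each window,
# and take the log of the per-frequency power.
import numpy as np

def log_spectrogram(x, rate=16000, window_ms=20, hop_ms=10):
    window = int(rate * window_ms / 1000)  # 320 samples at 16 kHz
    hop = int(rate * hop_ms / 1000)        # 160 samples at 16 kHz
    frames = []
    for start in range(0, len(x) - window + 1, hop):
        frame = x[start:start + window] * np.hanning(window)  # taper the window edges
        spectrum = np.fft.rfft(frame)            # frequency-domain view of this window
        power = np.abs(spectrum) ** 2            # power at each frequency
        frames.append(np.log(power + 1e-10))     # log power; epsilon avoids log(0)
    return np.array(frames)  # shape: (num_windows, window // 2 + 1)
```

Each row of the result describes one small window in terms of frequencies, which is exactly the "frequency-domain representation per window" idea above; stacking the rows over time gives the spectrogram that feeds the rest of the pipeline.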
