Mel frequency cepstral coefficients
What do all those words mean?
Mel frequency cepstral coefficients have always fascinated me. They are incredibly abstruse. I’ve never found any mathematical intuitions for why they’re effective at what they do. Yet they work!
MFCCs were originally developed in the 1970s, where their use was primarily considered for speech recognition. They have persisted through several generations of machine intelligence development, from early work in genre classificationG. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” in IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, July 2002. to modern deep learning networks.
Definition of MFCCs
MFCCs are computed as follows:
- Compute the spectrum of a signal via discrete Fourier transform (DFT):
, for - For each frequency component of the Fourier transform, square it to get the power spectrum:
- Map the power spectrum onto the mel scale using triangular overlapping windows:
, for
and
- Take the logs of the powers at each mel frequency:
- Take the discrete cosine transform of the log powers.
- The MFCCs are the real amplitudes of the resulting spectrum.
Lets examine each step in more detail.
Discrete Fourier transform
The discrete Fourier transform is the alpha and the omega of signal analysis.Its called the “discrete” Fourier transform because it deals with signals whose levels are sampled at discrete points in time. This is a requirement of digital systems, where audio is segmented into a series of sample levels. There is a continuous Fourier transform (simply called the Fourier transform, without additional qualification) that deals with continuous analog signals. Its defined as:
, for
where is the input audio signal and is the complex amplitude of that frequency in the input signal. Complex amplitude means a value representing both “real” amplitude, which is roughly perceived as loudness, and phase, which represents a shift of the signal in time but has no specific perceptual correlation. It is complex in the sense that it is mathematically modeled using a complex number, i.e. .
You may have also heard people speak of the “FFT” (fast Fourier transform). The FFT refers a specific class of algorithms
Effectively, its multiplying the input signal by a complex sinusoidA complex sinusoid is any function of the form , which is equivalent to . This function is periodic with frequency and phase shift . at each frequency and summing the resulting set of values. This can be thought of as a mathematical projection from the time domain (where the signal can be measured in the first place, and later auditioned) to the frequency domain (sometimes called the spectral domain).5
But wait — this formula is different than the one we looked d
The effect of squaring the signal in step 2 above is to convert to its signal power, which has some relationship to perceptual and physical properties.However, taking the later in the process means that the effect of the squaring will be mostly inconsequential, since . The end result after applying the DCT is that all of the MFCCs are just scaled up by 2, so theres no difference in their relative proportions if you square or do not square. Evidently, power is still used for historical reasons and by convention.
The Mel Scale
The mel scale is
Human hearing is more sensitive to differences in lower frequencies. The conversion from frequency in Hz to mel is:
The inverse transformation is:
The DCT Step
The final step uses the Discrete Cosine Transform. For filter banks, the -th MFCC coefficient is:
where is the log energy output of the -th filter bank and is the total number of filter banks.
Typically only the first 12–13 coefficients are kept, since they capture the spectral envelope while discarding fine spectral details.
-
Complex amplitude means a value representing both “real” amplitude, which is roughly perceived as loudness, and phase, which represents a shift of the signal in time but has no specific perceptual correlation. It is complex in the sense that it is mathematically modeled using a complex number, i.e. .
↩ -
In fact, in the digital domain, its is exactly a mathematical projection where the rows of are simply in a discretized form.
↩