Mel frequency cepstral coefficients

What do all those words mean?

Mel frequency cepstral coefficients have always fascinated me. They are incredibly abstruse. I’ve never found any mathematical intuitions for why they’re effective at what they do. Yet they work!

MFCCs were originally developed in the 1970s, when they were considered primarily for speech recognition. They have persisted through several generations of machine intelligence development, from early work in genre classification (G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, July 2002) to modern deep learning networks.

Definition of MFCCs

MFCCs are computed as follows:

  1. Compute the spectrum of a signal via discrete Fourier transform (DFT):
    $X(f_k) = \displaystyle\sum_{n = 0}^{N-1} x(n) e^{-i 2\pi f_k n T}$, for each analysis frequency $f_k$
  2. For each frequency component of the Fourier transform, take its squared magnitude to get the power spectrum:
    $\hat{X}(f) = |X(f)|^2$
  3. Map the power spectrum onto the mel scale using triangular overlapping windows:
    $M(k) = \sum_{f} H_k(f) \cdot \hat{X}(f)$, for
    $H_k(f) = \begin{cases} \frac{f - f_{k-1}}{f_k - f_{k-1}} & f_{k-1} \leq f < f_k \\ \frac{f_{k+1} - f}{f_{k+1} - f_k} & f_k \leq f < f_{k+1} \\ 0 & \text{otherwise} \end{cases}$ and
    $m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$, the mel scale on which the filter centers $f_k$ are evenly spaced
  4. Take the logs of the powers at each mel frequency:
    $\hat{M}(k) = \log(M(k))$
  5. Take the discrete cosine transform of the log powers.
  6. The MFCCs are the real amplitudes of the resulting spectrum.
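The six steps above can be sketched end to end in plain Python. This is a minimal illustration rather than a reference implementation: it assumes a single frame with no windowing or pre-emphasis, evenly spaced analysis frequencies $f_k = k / (NT)$, and hypothetical defaults (20 filters, 13 coefficients, a small floor to avoid $\log(0)$) that do not come from the text. All function and parameter names are made up for this example.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame, sample_rate):
    """Steps 1-2: naive DFT of one frame, then squared magnitude per bin."""
    n = len(frame)
    freqs, power = [], []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * sample_rate / n)
        power.append(abs(acc) ** 2)
    return freqs, power


def mel_energies(freqs, power, sample_rate, n_filters):
    """Step 3: triangular filters whose edges are evenly spaced in mel."""
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies: f_0 .. f_{K+1}
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    energies = []
    for k in range(1, n_filters + 1):
        lo, mid, hi = edges[k - 1], edges[k], edges[k + 1]
        total = 0.0
        for f, p in zip(freqs, power):
            if lo <= f < mid:
                total += p * (f - lo) / (mid - lo)  # rising slope
            elif mid <= f < hi:
                total += p * (hi - f) / (hi - mid)  # falling slope
        energies.append(total)
    return energies


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13, floor=1e-10):
    freqs, power = power_spectrum(frame, sample_rate)
    energies = mel_energies(freqs, power, sample_rate, n_filters)
    log_energies = [math.log(max(e, floor)) for e in energies]  # step 4
    # Steps 5-6: unnormalized DCT-II of the log energies.
    return [
        sum(s * math.cos(n * (k + 0.5) * math.pi / n_filters)
            for k, s in enumerate(log_energies))
        for n in range(n_coeffs)
    ]


# Usage: one 256-sample frame of a 440 Hz sine at an assumed 8 kHz sample rate.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)
```

The naive DFT here costs $O(N^2)$ per frame, which is fine for a demonstration; real code would normally use an FFT routine from a numerical library.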

Let's examine each step in more detail.

Discrete Fourier transform

The discrete Fourier transform is the alpha and the omega of signal analysis. (It's called the “discrete” Fourier transform because it deals with signals whose levels are sampled at discrete points in time. This is a requirement of digital systems, where audio is segmented into a series of sample levels. There is also a continuous Fourier transform, simply called the Fourier transform without additional qualification, that deals with continuous analog signals.) The DFT is defined as:

$$X(f_k) = \sum_{n = 0}^{N-1} x(n) e^{-i 2\pi f_k n T}$$ for each analysis frequency $f_k$,

where $x(n)$ is the input audio signal, $N$ is the number of samples, $T$ is the sampling period, and $X(f_k)$ is the complex amplitude of frequency $f_k$ in the input signal. Complex amplitude means a value representing both “real” amplitude, which is roughly perceived as loudness, and phase, which represents a shift of the signal in time but has no specific perceptual correlate. It is complex in the sense that it is mathematically modeled using a complex number, i.e. $x + iy$.

You may have also heard people speak of the “FFT” (fast Fourier transform). The FFT refers to a specific class of algorithms that compute the DFT efficiently; the result is the same transform.
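As a rough illustration of what “a class of algorithms” means here, the sketch below compares a naive $O(N^2)$ DFT with one well-known FFT variant, the recursive radix-2 Cooley-Tukey scheme. The choice of that variant, the power-of-two length restriction, and the function names are assumptions made for this example; the text does not single out any particular FFT.

```python
import cmath
import math


def naive_dft(x):
    """Direct evaluation of the DFT sum: O(N^2) multiplications."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 FFT: same output as naive_dft, O(N log N) work."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


samples = [0.0, 1.0, 0.5, -0.25, 2.0, -1.0, 0.0, 3.0]
slow = naive_dft(samples)
fast = fft_radix2(samples)
max_error = max(abs(a - b) for a, b in zip(slow, fast))
```

Both functions return the same spectrum up to floating-point rounding; only the amount of work differs.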

Effectively, it's multiplying the input signal by a complex sinusoid at each frequency and summing the resulting set of values. (A complex sinusoid is any function of the form $A e^{i(2\pi f t + \phi)}$, which is equivalent to $A \cos(2\pi f t + \phi) + i A \sin(2\pi f t + \phi)$. This function is periodic with frequency $f$ and phase shift $\phi$.) This can be thought of as a mathematical projection from the time domain (where the signal can be measured in the first place, and later auditioned) to the frequency domain (sometimes called the spectral domain).
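The multiply-and-sum description can be made concrete with a small sketch that builds the complex sinusoids explicitly and projects a test signal onto each one. It assumes evenly spaced frequencies $f_k = k / (NT)$, so that $f_k n T = kn / N$; the names `dft_matrix` and `dft` are hypothetical.

```python
import cmath
import math


def dft_matrix(n_samples):
    """One row per frequency bin k; each row is the sinusoid e^{-2*pi*i*k*n/N}."""
    return [[cmath.exp(-2j * math.pi * k * n / n_samples) for n in range(n_samples)]
            for k in range(n_samples)]


def dft(x):
    """Project x onto each row: multiply sample by sample, then sum."""
    return [sum(w * sample for w, sample in zip(row, x)) for row in dft_matrix(len(x))]


# A cosine that completes exactly 3 cycles over 16 samples.
N = 16
signal = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
magnitudes = [abs(c) for c in dft(signal)]
peak_bin = max(range(N // 2 + 1), key=lambda k: magnitudes[k])
```

For this input, the projection is large only where the row's frequency matches the cosine (bin 3 and its mirror image, bin 13) and is essentially zero everywhere else.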

But wait — this formula is different than the one we looked d

The effect of squaring the signal in step 2 above is to convert it to signal power, which has some relationship to perceptual and physical properties. However, taking the $\log$ later in the process means that the effect of the squaring will be mostly inconsequential, since $\log(X^2) = 2 \log(X)$. Because the DCT is linear, the end result is that all of the MFCCs are just scaled up by 2, so there's no difference in their relative proportions whether you square or not. (Strictly, this holds exactly only when the log is applied to each component directly; the mel filters in step 3 sum several components before the log, so in the full pipeline it is an approximation.) Evidently, power is still used for historical reasons and by convention.
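A quick numerical check of the scaling argument, using made-up band magnitudes. The first part applies the log directly to each value, where the factor of 2 is exact; the second part shows, under the assumption of a two-component band with equal weights, that summing before the log breaks the exact equality.

```python
import math


def dct(values, n_coeffs):
    """Unnormalized DCT-II, enough to show linearity."""
    K = len(values)
    return [sum(v * math.cos(n * (k + 0.5) * math.pi / K) for k, v in enumerate(values))
            for n in range(n_coeffs)]


magnitudes = [4.0, 2.5, 1.0, 0.5, 0.25, 0.1]  # hypothetical values, not real data

from_magnitude = dct([math.log(m) for m in magnitudes], 4)
from_power = dct([math.log(m ** 2) for m in magnitudes], 4)
# Each power-based coefficient is twice the magnitude-based one.
doubled = all(abs(p - 2.0 * m) < 1e-9 for p, m in zip(from_power, from_magnitude))

# If two components are summed inside one band before the log,
# log(a^2 + b^2) is no longer 2 * log(a + b).
log_of_summed_power = math.log(3.0 ** 2 + 4.0 ** 2)
twice_log_of_summed_magnitude = 2.0 * math.log(3.0 + 4.0)
```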

The Mel Scale

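The mel scale is a perceptual scale of pitch, on which equal steps are meant to sound equally far apart to a listener.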
Human hearing is more sensitive to differences in lower frequencies. The conversion from frequency $f$ in Hz to mel $m$ is:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

The inverse transformation is:

$$f = 700\left(10^{m/2595} - 1\right)$$
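The two conversions translate directly into code. The sketch below (function names are hypothetical) also compares a 100 Hz step at low and high frequencies to show the compression the scale applies.

```python
import math


def hz_to_mel(f_hz):
    """m = 2595 * log10(1 + f / 700)"""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """f = 700 * (10^(m / 2595) - 1)"""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers more mels at low frequencies than at high ones.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(5100.0) - hz_to_mel(5000.0)

# Converting to mel and back should recover the original frequency.
round_trip = mel_to_hz(hz_to_mel(440.0))
```

With these constants, 1000 Hz maps to very nearly 1000 mel, and `low_step` comes out several times larger than `high_step`.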

The DCT Step

The final step uses the discrete cosine transform (DCT). For $K$ filter banks, the $n$-th MFCC is:

$$c_n = \sum_{k=1}^{K} S_k \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right]$$

where $S_k$ is the log energy output of the $k$-th filter bank (the $\hat{M}(k)$ of step 4) and $K$ is the total number of filter banks.

Typically only the first 12–13 coefficients are kept, since they capture the spectral envelope while discarding fine spectral details.
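The formula for $c_n$ can be evaluated literally, as in the sketch below. Two things are assumptions of this example rather than statements from the text: the index $n$ starts at 0, and the input values are invented to show how the coefficients respond to the shape of the log spectrum.

```python
import math


def dct_coefficients(log_energies, n_coeffs):
    """c_n = sum_{k=1}^{K} S_k * cos(n * (k - 1/2) * pi / K), for n = 0 .. n_coeffs - 1."""
    K = len(log_energies)
    coeffs = []
    for n in range(n_coeffs):
        total = 0.0
        for k, s_k in enumerate(log_energies, start=1):  # k runs 1..K as in the formula
            total += s_k * math.cos(n * (k - 0.5) * math.pi / K)
        coeffs.append(total)
    return coeffs


# A perfectly flat log spectrum puts everything into c_0.
flat = dct_coefficients([2.0] * 8, 4)

# A steadily rising log spectrum shows up mainly in c_1.
tilted = dct_coefficients([float(k) for k in range(8)], 4)
```

Keeping only the first few coefficients therefore retains broad features such as overall level and tilt, which is the sense in which they describe the spectral envelope.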


Note on the projection from the time domain to the frequency domain: in the digital domain, it is exactly a mathematical projection $X = x \cdot M$, where the rows of $M$ are simply $e^{-2\pi i f t}$ in discretized form.