Next: Speex narrowband mode Up: The Speex Codec Manual Previous: Introduction to Speex   Contents

Introduction to CELP Coding

Speex is based on CELP, which stands for Code Excited Linear Prediction. This section attempts to introduce the principles behind CELP, so if you are already familiar with CELP, you can safely skip to section 3. The CELP technique is based on three ideas:

  1. The use of a linear prediction (LP) model to model the vocal tract
  2. The use of (adaptive and fixed) codebook entries as input (excitation) of the LP model
  3. The search performed in closed-loop in a "perceptually weighted domain"

Linear Prediction

The linear prediction model represents each speech sample as a linear combination of past samples, plus an error signal called the excitation (or residual).

$\displaystyle x(n)=\sum _{i=1}^{N}a_{i}x(n-i)+e(n)$

In the z-domain, this can be expressed as

$\displaystyle x(z)=\frac{1}{A(z)}\: e(z)$

where $ A(z)$ is defined as

$\displaystyle A(z)=1-\sum _{i=1}^{N}a_{i}z^{-i}$

We usually refer to $ A(z)$ as the analysis filter and $ 1/A(z)$ as the synthesis filter.
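The two filters can be sketched directly from the difference equation above. This is a minimal illustration, not code from Speex: the function names, the order-2 coefficients and the sample values are hypothetical, and samples before the start of the signal are assumed to be zero.

```python
def lp_analysis(x, a):
    """Apply A(z): e(n) = x(n) - sum_{i=1..N} a_i * x(n-i).

    a[0] holds a_1, a[1] holds a_2, and so on. Samples before n = 0 are
    assumed to be zero (an assumption of this sketch).
    """
    e = []
    for n in range(len(x)):
        pred = sum(a[i] * x[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        e.append(x[n] - pred)
    return e


def lp_synthesis(e, a):
    """Apply 1/A(z): x(n) = sum_{i=1..N} a_i * x(n-i) + e(n)."""
    x = []
    for n in range(len(e)):
        pred = sum(a[i] * x[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        x.append(pred + e[n])
    return x


a = [1.2, -0.5]                      # hypothetical order-2 coefficients
x = [1.0, 0.5, -0.3, 0.8, 0.1]       # hypothetical speech samples
e = lp_analysis(x, a)                # excitation (residual)
x_rec = lp_synthesis(e, a)           # synthesis undoes the analysis
```

Passing the residual back through the synthesis filter recovers the original samples, which is why a codec only needs to transmit the coefficients and an approximation of the excitation.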

The $ A(z)$ filter is computed using the Levinson-Durbin algorithm, which starts from the auto-correlation $ r(m)$ of the signal $ x(n)$.

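$\displaystyle r(m)=\sum _{i=0}^{L-1}x(i)x(i-m)$

where $ L$ is the number of samples in the analysis frame (not to be confused with the filter order $ N$).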

For an order $ N$ filter, we have:

$\displaystyle \mathbf{R}=\left[\begin{array}{cccc}
r(0) & r(1) & \cdots & r(N-1)\\
r(1) & r(0) & \cdots & r(N-2)\\
\vdots & \vdots & \ddots & \vdots \\
r(N-1) & r(N-2) & \cdots & r(0)\end{array}\right]$

$\displaystyle \mathbf{r}=\left[\begin{array}{c}
r(1)\\
r(2)\\
\vdots \\
r(N)\end{array}\right]$

The filter coefficients $ a_{i}$ are found by solving the system $ \mathbf{Ra}=\mathbf{r}$. The Levinson-Durbin algorithm reduces the cost of solving this system from $ \mathcal{O}\left(N^{3}\right)$ to $ \mathcal{O}\left(N^{2}\right)$ by exploiting the fact that the matrix $ \mathbf{R}$ is Toeplitz and Hermitian. It can also be proved that all the roots of $ A(z)$ lie within the unit circle, which means that $ 1/A(z)$ is always stable.

That holds in theory; in practice, because of finite precision, two techniques are commonly used to make sure the filter is stable. First, we multiply $ r(0)$ by a number slightly above one (such as 1.0001), which is equivalent to adding noise to the signal. Second, we can apply a window to the auto-correlation, which is equivalent to filtering in the frequency domain and reduces sharp resonances.
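The procedure described above can be sketched as follows. This is an illustrative sketch rather than the Speex implementation: the function names are hypothetical, samples outside the frame are assumed to be zero, and the Gaussian shape and width (`sigma`) of the lag window are assumptions chosen only to show the idea. The factor 1.0001 is the example value given in the text.

```python
import math


def autocorrelation(x, order):
    """r(m) = sum of x(i) * x(i - m) over the frame, for m = 0..order.

    Samples before the start of the frame are treated as zero (assumption).
    """
    return [sum(x[i] * x[i - m] for i in range(m, len(x)))
            for m in range(order + 1)]


def condition(r, noise_factor=1.0001, sigma=0.01):
    """Window the auto-correlation, then scale r(0) slightly above itself.

    The Gaussian lag window and its width are assumptions of this sketch.
    """
    out = [r[m] * math.exp(-0.5 * (sigma * m) ** 2) for m in range(len(r))]
    out[0] *= noise_factor
    return out


def levinson_durbin(r, order):
    """Solve R a = r in O(order^2); returns (a, final prediction error)."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        prev = a[:i]
        for j in range(i):
            a[j] = prev[j] - k * prev[i - 1 - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


frame = [1.0, 2.0, 0.5, -1.0, -1.5, 0.3, 1.2, 0.7]   # hypothetical samples
r = condition(autocorrelation(frame, 3))
a, err = levinson_durbin(r, 3)
```

Each pass of the outer loop extends the solution from order $ i$ to order $ i+1$ using only the previous coefficients, which is where the $ \mathcal{O}\left(N^{2}\right)$ cost comes from.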

Pitch Prediction

During voiced segments, the speech signal is highly periodic, so it is possible to take advantage of that by expressing the excitation signal $ e(n)$ as

$\displaystyle e(n)=\beta e(n-T)+c(n)$

where $ T$ is the pitch period, $ \beta $ is the pitch gain and $ c(n)$ is taken from the innovation codebook. In the z-domain, the excitation can be expressed as:

$\displaystyle e(z)=\frac{1}{1-\beta z^{-T}}\: c(z)$
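The recursion for $ e(n)$ can be sketched as below. The function name, the `history` argument and the example values are hypothetical; the sketch assumes that at least $ T$ samples of past excitation are available (zeros when none are given).

```python
def pitch_synthesis(c, T, beta, history=None):
    """Compute e(n) = beta * e(n - T) + c(n) for each sample of c.

    history holds the excitation that precedes c (at least T samples);
    it defaults to T zeros, which is an assumption of this sketch.
    """
    past = list(history) if history is not None else [0.0] * T
    if len(past) < T:
        raise ValueError("history must contain at least T samples")
    buf = past[:]
    for n in range(len(c)):
        # buf[len(buf) - T] is e(n - T), since e(n) is about to be appended.
        buf.append(beta * buf[len(buf) - T] + c[n])
    return buf[len(past):]


# A single innovation pulse is repeated every T samples, scaled by beta.
e = pitch_synthesis([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], T=3, beta=0.5)
```

With a pitch gain below one, each repetition of the pulse is attenuated, so the periodic component decays unless the innovation keeps feeding it.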

Innovation Codebook

This is where most of the bits in a CELP codec are allocated. It represents the information that couldn't be obtained either from linear prediction or pitch prediction.
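As a rough illustration of a closed-loop (analysis-by-synthesis) codebook search, the sketch below passes every codebook entry through the synthesis filter $ 1/A(z)$, computes the gain that best matches a target signal, and keeps the entry with the smallest squared error. Everything here is an assumption made for illustration: the function names, the tiny codebook and the coefficient are hypothetical, the search is exhaustive, and the error is left unweighted for brevity.

```python
def synthesize(e, a):
    """Apply 1/A(z): x(n) = sum_i a_i * x(n-i) + e(n), zero initial state."""
    x = []
    for n in range(len(e)):
        pred = sum(a[i] * x[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        x.append(pred + e[n])
    return x


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the entry that best matches target."""
    best = None
    for index, entry in enumerate(codebook):
        y = synthesize(entry, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero entry cannot match anything
        # Least-squares gain for this entry.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best


a = [0.5]                                   # hypothetical LP coefficient
codebook = [[1.0, 0.0, 0.0, 0.0],           # hypothetical innovation vectors
            [0.0, 1.0, 0.0, 0.0],
            [1.0, -1.0, 0.0, 0.0]]
target = synthesize([2.0, -2.0, 0.0, 0.0], a)
index, gain, err = search_codebook(target, codebook, a)
```

Only the winning index and a quantized gain would need to be transmitted; the decoder holds the same codebook and can rebuild the excitation from them.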

Analysis-by-Synthesis and Error Weighting

Most (if not all) modern audio codecs attempt to shape the noise so that it is as hard as possible for the ear to detect. This means that more noise can be tolerated in the louder parts of the spectrum, and less in the quieter parts. That is why the error is minimized for the perceptually weighted signal

$\displaystyle X_{w}(z)=W(z)X(z)$

where $ W(z)$ is the weighting filter, usually of the form

$\displaystyle W(z)=\frac{A\left(\frac{z}{\gamma _{1}}\right)}{A\left(\frac{z}{\gamma _{2}}\right)}$ (1)

with control parameters $ \gamma _{1}>\gamma _{2}$. If the noise is white in the perceptually weighted domain, then in the signal domain its spectral shape will be of the form

$\displaystyle A_{noise}(z)=\frac{1}{W(z)}=\frac{A\left(\frac{z}{\gamma _{2}}\right)}{A\left(\frac{z}{\gamma _{1}}\right)}$

If a filter $ A(z)$ has (complex) poles at $ p_{i}$ in the $ z$-plane, the filter $ A(z/\gamma )$ will have its poles at $ p_{i}^{'}=\gamma p_{i}$, making it a flatter version of $ A(z)$ when $ \gamma <1$.
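Substituting $ z/\gamma $ for $ z$ in the definition of $ A(z)$ gives $ A(z/\gamma )=1-\sum _{i=1}^{N}a_{i}\gamma ^{i}z^{-i}$, so the scaled filter is obtained simply by multiplying each coefficient $ a_{i}$ by $ \gamma ^{i}$. The sketch below uses this to apply $ W(z)$ to a signal. The function names, the coefficients and the values chosen for $ \gamma _{1}$ and $ \gamma _{2}$ are assumptions made for illustration only.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_i becomes a_i * gamma**i, i = 1..N."""
    return [coef * gamma ** (i + 1) for i, coef in enumerate(a)]


def weighting_filter(x, a, gamma1, gamma2):
    """Apply W(z) = A(z/gamma1) / A(z/gamma2) to x, zero initial state."""
    num = bandwidth_expand(a, gamma1)   # numerator A(z/gamma1)
    den = bandwidth_expand(a, gamma2)   # denominator A(z/gamma2)
    y = []
    for n in range(len(x)):
        fir = x[n] - sum(num[i] * x[n - 1 - i]
                         for i in range(len(num)) if n - 1 - i >= 0)
        iir = sum(den[i] * y[n - 1 - i]
                  for i in range(len(den)) if n - 1 - i >= 0)
        y.append(fir + iir)
    return y


a = [1.2, -0.5]                          # hypothetical LP coefficients
x = [1.0, 0.5, -0.3, 0.8, 0.1]           # hypothetical signal
xw = weighting_filter(x, a, gamma1=0.9, gamma2=0.6)   # illustrative gammas
```

When $ \gamma _{1}=\gamma _{2}$ the numerator and denominator cancel and the signal passes through unchanged, which is a convenient sanity check for an implementation.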


Jean-Marc Valin 2002-08-27