Measuring speech quality

The International Telecommunication Union (ITU) has established a recommendation for assessing the perceptual quality of speech in voice communications.

Measuring the quality of voice communications

Of the various methods, the full reference (FR) method is the most widely used. It yields the most reliable results, but it requires a distortion-free reference file for comparison.

Full reference methods use algorithms that simulate how the human ear perceives sound. They process both the reference audio file and the degraded speech sample in this way, then compare the two to determine the audible difference. That difference is then passed through a cognitive model, which approximates the way the human brain would process such data. Lastly, an overall voice quality score is generated.
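
The stages described above can be sketched in a few lines of Python. This is a toy illustration only, not PESQ or any ITU algorithm: the per-frame level in dB stands in for the ear model, the averaged level difference stands in for the audible difference, and the linear mapping to a 1-to-5 score (including its 0.2 factor) stands in for the cognitive model. The function names, the frame size and the mapping are all assumptions made for this example, and the two signals are assumed to be already time-aligned.

```python
import math


def frame_levels_db(samples, frame_size=4):
    """Stage 1 stand-in: reduce a signal to one level (in dB) per frame."""
    levels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        power = sum(x * x for x in frame) / frame_size
        levels.append(10 * math.log10(power + 1e-12))  # small floor avoids log(0)
    return levels


def full_reference_score(reference, degraded, frame_size=4):
    """Toy full-reference score on a 1-to-5 scale (hypothetical, NOT PESQ)."""
    ref_levels = frame_levels_db(reference, frame_size)
    deg_levels = frame_levels_db(degraded, frame_size)
    if not ref_levels or len(ref_levels) != len(deg_levels):
        raise ValueError("signals must be aligned, equal in length, and at least one frame long")
    # Stage 2: average per-frame difference between the two representations.
    disturbance = sum(abs(r - d) for r, d in zip(ref_levels, deg_levels)) / len(ref_levels)
    # Stage 3: assumed linear mapping from disturbance (dB) to a score, floored at 1.
    return max(1.0, 5.0 - 0.2 * disturbance)


reference = [0.5, -0.5, 0.5, -0.5] * 4
degraded = [x / 2 for x in reference]  # same speech at half the amplitude
print(full_reference_score(reference, reference))
print(full_reference_score(reference, degraded))
```

With this mapping, a degraded file identical to the reference receives the top score, and any level difference between the two files lowers it.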

Over the years, several models for measuring the quality of voice over IP have been developed: PSQM (Perceptual Speech Quality Measure), recommended by the ITU from 1996 to 2001; PAMS (Perceptual Analysis Measurement System); and PESQ (Perceptual Evaluation of Speech Quality), the currently recommended model, which is an optimized combination of PAMS and PSQM.


The diagram below represents the full reference model.

Evaluation method at ip-label

At ip-label, the main assessment method is PESQ. This method essentially combines the psychoacoustic and cognitive model of PSQM with a time alignment algorithm.
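
To illustrate what time alignment means, here is a minimal sketch that estimates a single constant delay between a reference signal and a degraded signal by maximizing their cross-correlation. It is not the alignment procedure used by PESQ, which is considerably more elaborate; the function name `estimate_delay` and the `max_lag` parameter are assumptions made for this example.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which degraded best matches reference.

    A positive result means the degraded signal is delayed relative to
    the reference; a negative result means it is early.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += ref_sample * degraded[j]  # cross-correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


reference = [0, 0, 1, 2, 3, 2, 1, 0, 0, 0, 0, 0]
degraded = [0, 0, 0] + reference[:-3]  # same signal, delayed by 3 samples
print(estimate_delay(reference, degraded, max_lag=5))
```

Once the delay is known, the degraded signal can be shifted so that both files are compared frame for frame.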

The PESQ algorithm is represented in the diagram below:

The algorithm produces a mean opinion score (MOS) on a scale of 1 (bad) to 5 (excellent).

The table below sets forth the scale defined by the ITU:

[Table: quality ratings and corresponding scores]
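
For reference, the five-point listening-quality scale used for mean opinion scores in ITU-T Recommendation P.800 assigns 5 to Excellent, 4 to Good, 3 to Fair, 2 to Poor and 1 to Bad. The sketch below turns a fractional score into one of those labels. The function name and the choice to round to the nearest category are assumptions made for this example, not part of the recommendation.

```python
# Five-point absolute category rating labels (ITU-T P.800 listening-quality scale).
ACR_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def quality_label(mos):
    """Map a MOS between 1.0 and 5.0 to the nearest category label."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must lie between 1.0 and 5.0")
    nearest = int(mos + 0.5)  # round half up (assumed convention)
    return ACR_LABELS[nearest]


for score in (4.6, 3.8, 2.4, 1.0):
    print(score, quality_label(score))
```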

The PESQ method can also provide the following secondary speech indicators:

  • the noise index corresponds to the quantity of additional data (in frequency) when the degraded file presents an offset,
  • the loss index corresponds to the quantity of missing data when there is an offset with respect to the reference file,
  • the offset index corresponds to the delay between utterances.

These three indicators are expressed as a percentage with respect to the reference file.

The MOS can be calculated using:

  • a Newtest for Voice robot simulating real user calls on any type of voice network:
    – Public Switched Telephone Network (PSTN)
    – Integrated Services Digital Network (ISDN)
    – Global System for Mobile Communications (GSM)
    – Voice over IP (VoIP)
  • a classic Newtest robot equipped with a softphone such as Skype or X-Lite.

The dashboard for a MOS test is shown below: