The Corelatus Blog
Mostly narrowband E1/T1 telecommunications
19th February 2009
SOX parameters for downsampling to 8kHz Alaw

A technician working for an operator mailed me a few days ago wondering why the recorded voice clips they use for their IVR sound so bad, "like they're coming from the bottom of a deep well". It turned out that the clips actually sounded OK on a telephone, just not through his laptop's speaker. He asked if I recommend any specific filter parameters when converting audio from 44.1kHz wav to 8kHz Alaw voice clips.

An example

I took this audio snippet from the introduction to an audio book. It was originally a .ogg file. I converted it to a .wav file with a 44.1kHz sampling rate and 16 bits per sample. For my purposes any artefacts from ogg vorbis are negligible.

1_mono.wav (44.1kHz, 16 bit linear samples)

Next, I converted it to 8kHz Alaw using sox. 8kHz Alaw is what runs on the fixed telephone network in most of the world. (The US uses a minor variant, μlaw):

sox 1_mono.wav -A -r 8000 2_8kHz_alaw.wav
2_8kHz_alaw.wav (8kHz, 8 bit Alaw samples)
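To give a feel for what the Alaw step does to each sample, here is a minimal Python sketch of G.711 Alaw companding: each 16 bit linear sample is squeezed into one octet made of a sign bit, a 3 bit segment number and a 4 bit step within that segment. This is not how sox implements it, and the function names are made up for illustration. It also leaves out the rate conversion, so it only makes sense on audio that is already at 8kHz.

```python
# Upper bound of each of the 8 Alaw segments, for a 13 bit magnitude.
_SEGMENT_END = (0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF, 0xFFF)


def linear_to_alaw(sample):
    """Convert one signed 16 bit linear sample to one Alaw octet (0..255)."""
    sample = max(-32768, min(32767, sample))  # clamp to the 16 bit range
    value = sample >> 3                       # Alaw works on 13 bit input
    if value >= 0:
        mask = 0xD5                           # sign bit set, even bits inverted
    else:
        mask = 0x55                           # even bits inverted only
        value = -value - 1
    # Find the first segment whose upper bound holds the magnitude.
    segment = next(i for i, end in enumerate(_SEGMENT_END) if value <= end)
    shift = 1 if segment < 2 else segment     # the first two segments share a step size
    octet = (segment << 4) | ((value >> shift) & 0x0F)
    return octet ^ mask


def encode_alaw(samples):
    """Encode an iterable of 16 bit linear samples into raw Alaw bytes."""
    return bytes(linear_to_alaw(s) for s in samples)


if __name__ == "__main__":
    # Silence (0) encodes as 0xD5, full-scale positive as 0xAA.
    print(encode_alaw([0, 1000, -1000, 32767]).hex())
```

Quiet samples get fine steps and loud samples get coarse ones, which is why 8 bits per sample is enough for speech.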

That sounds a bit less clear than the original, but it's OK. It's what you'd expect coming out of a telephone. There's some weirdness though. The audible difference between the two files varies from one PC to another and even one playback program to another. Why? Because laptop speakers vary in quality and because playback programs usually quietly convert everything back to 48kHz or 44.1kHz sampling rates, and they do it with different approaches. For fun, I resampled to 44.1kHz:

sox 2_8kHz_alaw.wav -r 44100 -s 3_resampled.wav
3_resampled.wav (44.1kHz, 16 bit linear samples)

2_8kHz_alaw.wav and 3_resampled.wav should sound almost the same. But on some PCs they sound markedly different.

The GTH just plays octets (bytes)

The GTH has a simple approach to playing back audio. It just copies the bytes you give it to the destination timeslot. No format or rate conversion happens, though the GTH does make sure the data is played out at the E1's frame rate (8000Hz). The downside of that is that you have to convert all the files for your IVR system before giving them to a GTH, e.g. using sox. The upside is that it's simple. Nothing happens behind your back.
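Because every octet in the file ends up on the timeslot, two quick sanity checks are possible before handing a file over. A rough Python sketch, with made-up helper names: the playing time is simply the file size divided by 8000, and a file that still starts with a RIFF/WAVE header is probably a .wav rather than raw Alaw, so the header would be played out as if it were audio.

```python
import os

OCTETS_PER_SECOND = 8000  # one octet per timeslot per E1 frame


def clip_duration_seconds(num_octets):
    """Playing time of a raw 8kHz, one-octet-per-sample clip."""
    return num_octets / OCTETS_PER_SECOND


def looks_like_wav(first_bytes):
    """True if the data starts with a RIFF/WAVE header, i.e. it is not raw."""
    return first_bytes[:4] == b"RIFF" and first_bytes[8:12] == b"WAVE"


def describe_clip(path):
    """Return size, playing time and a header warning for a candidate clip."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(12)
    return {
        "octets": size,
        "seconds": clip_duration_seconds(size),
        "has_wav_header": looks_like_wav(head),
    }
```

For example, describe_clip("gth.raw") on a 16000 octet file reports 2.0 seconds of audio.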

What are the best SOX options to use?

I don't know. I used to suggest the following as a reasonable starting point:

sox original.wav -r 8000 -c 1 -A -t raw gth.raw resample -q

As of a few years ago, sox improved and the 'resample' effect got deprecated. So now I suggest just letting sox do what it thinks is best:

sox original.wav -r 8000 -c 1 -A -t raw gth.raw

At the time of writing, it uses its 'rate' effect with reasonable default parameters for the bandwidth and filter characteristics. I experimented a bit with the -m, -h, -v and -s switches for the 'rate' effect. I could not reliably hear a difference, let alone decide that one sounded better.

Why does the phone system use 8kHz anyway?

There's a certain sound quality level expected in telephone networks, and part of that is that the network carries everything up to about 3500Hz. Analog local loop specifications mention that, and pretty much all digital telephone systems use an 8kHz sampling rate. Sampling at 8kHz can represent frequencies up to 4kHz (half the sampling rate), which leaves a little room for the filters needed to carry audio up to 3.5kHz. Even the GSM and AMR codecs start off with the assumption that the incoming audio is limited to 3500Hz.
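The sampling rate also explains why downsampling needs a filter at all. Anything above half the sampling rate doesn't just disappear when you sample at 8kHz, it folds back into the audible band as a different tone. A small Python sketch of that arithmetic, using a hypothetical helper name:

```python
def alias_frequency(tone_hz, sample_rate_hz=8000):
    """Frequency at which a pure tone appears after sampling at sample_rate_hz."""
    folded = tone_hz % sample_rate_hz
    if folded > sample_rate_hz / 2:
        # Above half the sampling rate the tone is mirrored back down.
        folded = sample_rate_hz - folded
    return folded


if __name__ == "__main__":
    # 3500Hz survives untouched; 5000Hz and 7000Hz come back as 3000Hz and 1000Hz.
    for tone in (1000, 3500, 5000, 7000):
        print(tone, "Hz is heard as", alias_frequency(tone), "Hz")
```

So a 44.1kHz recording has to be low-pass filtered to below 4kHz before the samples are thrown away, and that filtering is part of what a resampler such as sox's 'rate' effect does.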

So the bar is set pretty low. I haven't come across any systems which set out to provide higher quality. Even Skype compresses the hell out of the audio to save bandwidth, even when both parties in a conversation have huge amounts of it. That's surprising. Why not aim for VoIP to sound much better than a regular telephone?
