Let us define speech as a product of human communication that can be described in terms of acoustics. This is a rather narrow definition, but it is adequate in the context of this article.
Because recording speech these days almost always involves digitization, I am going to assume that the product of your capture is a digital audio file, such as a Microsoft WAVE (.wav) file. We need to make sure that the audio file contains as much information about the speech signal as possible. This is true whether we are recording speech with a microphone or digitizing an existing analog recording. In terms of acoustics, the digital audio file must be able to reproduce the frequency range and dynamic range of speech as well as possible. Table 1 shows the effects of sample rate and bit-depth on the frequency response and dynamic range of digital speech files.
Audio examples compared: 16-bit, 48,000 Hz versus 8-bit, 8,000 Hz
Table 1. Effects of sample rate and bit-depth on the frequency response and dynamic range of digital speech files
Frequency
Sampling
Human speech spans a wide range of frequencies. Adult males typically produce speech sounds at lower frequencies than adult females and children. The entire range spans from approximately 50 to 20,000 Hz. Only the highest-frequency fricative consonants, such as /s/ and /sh/, reach up to 20,000 Hz, while most speech-relevant information lies considerably lower on the frequency scale.
Figure 1 shows a spectrogram of an adult female and an adult male talker saying the phrase "to some smooth jazz" (click buttons to hear audio). Note that the female talker's fricative /s/ reaches almost all the way up to 20,000 Hz. By contrast, the male pronunciation shows the maximum frequencies reaching only approximately 15,000 Hz.
Figure 1. Spectrograms of female and male talkers showing a wide range of frequencies
The obvious conclusion is that all of the frequencies up to 20,000 Hz need to be adequately represented in a digital audio file. Sample rate is the parameter that determines the frequency response of an audio file: by the Nyquist theorem, the highest frequency that can be reproduced is half the sample rate (in practice slightly less, because of the anti-aliasing filter). Figure 2 shows the effect of sample rate reduction on the range of speech frequencies (in the phrase "smooth jazz") reproduced in an audio file. As you listen to each file, please note the decrease in perceived quality. This is largely due to the reduction in frequency response. You can read more about this in my post on analog-to-digital conversion.
Figure 2. The effect of sample rate reduction on the range of speech frequencies reproduced in an audio file
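The Nyquist relationship described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the list of sample rates are assumptions chosen for this example, and they are not necessarily the rates used in Figure 2.

```python
def nyquist_frequency(sample_rate_hz: float) -> float:
    """Return the upper frequency limit (in Hz) for a given sample rate.

    Real converters reproduce slightly less than this limit because of
    the anti-aliasing filter.
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return sample_rate_hz / 2.0


def min_sample_rate(max_frequency_hz: float) -> float:
    """Return the theoretical minimum sample rate for a given top frequency."""
    if max_frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 2.0 * max_frequency_hz


# Illustrative sample rates (assumed for this example).
for rate in (48_000, 44_100, 22_050, 8_000):
    limit = nyquist_frequency(rate)
    print(f"{rate:>6} Hz sample rate -> frequencies below {limit:,.0f} Hz")

# Speech energy reaching 20,000 Hz needs a rate above this value.
print(min_sample_rate(20_000))
```

Note that an 8,000 Hz sample rate keeps only content below 4,000 Hz, which is why high-frequency fricatives such as /s/ suffer most when the sample rate is reduced.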
Keeping the frequency range broad and flat
The sampling theorem assumes no hardware limitations. In practice, however, one needs to recognize the electrical and acoustical limits of current hardware. The low end of the frequency scale is particularly problematic, especially for microphones. You should try to find a microphone that has a flat frequency response throughout its range. You can read more about this in the section on microphones.
Dynamic range
Speech contains sounds spanning a wide range of amplitudes. Some sounds are naturally softer (carry less energy) than others. The typical range between the softest and the loudest sound in speech (dynamic range) is about 40 dB. In A/D conversion, the available dynamic range is determined by quantization, that is, by the bit-depth. 16-bit quantization (e.g., the audio CD standard) has a theoretical dynamic range of roughly 96 dB, so it comfortably captures the entire dynamic range of speech. Figure 3 shows the waveform (a plot of amplitude over time) of the phrase "Cathy just wanted to go to the pet store." Note how the amplitude varies from phoneme to phoneme, as indicated by the little red ball (click the Play button to hear audio).
Figure 3. An illustration of the changing amplitude of speech sounds over time
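The link between bit-depth and dynamic range can be estimated with the standard rule of thumb of about 6 dB per bit. The sketch below assumes ideal linear PCM quantization and ignores noise, dither, and converter imperfections; the function name and the 40 dB constant (taken from the figure quoted above) are choices made for this example.

```python
import math

SPEECH_DYNAMIC_RANGE_DB = 40.0  # approximate figure quoted in the text above


def quantization_dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of ideal linear PCM: 20 * log10(2 ** bits)."""
    if bit_depth < 1:
        raise ValueError("bit depth must be at least 1")
    return 20.0 * math.log10(2 ** bit_depth)


for bits in (8, 16, 24):
    dr = quantization_dynamic_range_db(bits)
    headroom = dr - SPEECH_DYNAMIC_RANGE_DB
    print(f"{bits:>2}-bit: about {dr:.1f} dB ({headroom:.1f} dB beyond speech)")
```

On paper even 8-bit audio (about 48 dB) covers a 40 dB range, but that leaves almost no margin for setting recording levels, whereas 16-bit audio leaves well over 50 dB to spare.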
The limits imposed by digitization are only one aspect of controlling dynamic range. The best way to reproduce the subtle changes in amplitude is to place the microphone close to the talker's lips and try to maximize the signal-to-noise ratio.
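Signal-to-noise ratio can be estimated by comparing the RMS level of a stretch of speech with the RMS level of a stretch of background noise recorded under the same conditions. The following is a rough sketch under that assumption; the function names and the sample values are hypothetical and are not measurements from the recordings in this article.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a sequence of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB from a speech segment and a noise-only segment."""
    speech_rms = rms(speech)
    noise_rms = rms(noise)
    if speech_rms == 0:
        raise ValueError("speech segment is silent")
    if noise_rms == 0:
        return math.inf  # a perfectly silent noise floor
    return 20.0 * math.log10(speech_rms / noise_rms)


# Made-up sample values: speech at 0.5 amplitude, room noise at 0.005.
speech_segment = [0.5, -0.5, 0.5, -0.5]
noise_segment = [0.005, -0.005, 0.005, -0.005]
print(f"Estimated SNR: {snr_db(speech_segment, noise_segment):.1f} dB")
```

Moving the microphone closer raises the speech level while the room noise stays roughly the same, which is what pushes this ratio up.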
Conclusion
As we saw in the examples above, the digital audio file must have a sample rate of at least 48,000 Hz and a bit-depth of at least 16 bits in order to capture the entire frequency range and dynamic range of speech. The good news is that most digital recorders these days meet these specifications. You should, however, bear in mind that specifications alone do not guarantee good recordings. I encourage you to browse through the articles on this site, as they might be helpful in learning how to capture high-quality speech signals by means of portable field recording equipment and technique.
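If you want to check whether an existing recording meets these specifications, Python's standard `wave` module can read the sample rate and bit-depth from a file header. This is a minimal sketch that assumes an uncompressed PCM WAVE file; the function names, the constants, and the placeholder file written for the demonstration are all hypothetical.

```python
import os
import tempfile
import wave

MIN_SAMPLE_RATE_HZ = 48_000  # recommendation from the text above
MIN_BIT_DEPTH = 16           # recommendation from the text above


def wav_format(path: str) -> tuple[int, int]:
    """Return (sample rate in Hz, bit-depth) of an uncompressed PCM WAVE file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8


def meets_recommendation(path: str) -> bool:
    """True if the file has at least the recommended sample rate and bit-depth."""
    rate, bits = wav_format(path)
    return rate >= MIN_SAMPLE_RATE_HZ and bits >= MIN_BIT_DEPTH


def write_placeholder_wav(path: str, rate: int, sample_width: int) -> None:
    """Write a short mono file of placeholder frames, for demonstration only."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)  # bytes per sample
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * sample_width * 100)


# Demonstration with a temporary placeholder file rather than a real recording.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_placeholder_wav(demo_path, 48_000, 2)
print(wav_format(demo_path), meets_recommendation(demo_path))
```

As noted above, passing this check says nothing about microphone placement or background noise; it only confirms the file format.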