Time stretching is the process of changing the speed of an
audio signal without affecting its
pitch.
Pitch scaling (often called
pitch shifting, but this is a misnomer) is the reverse: the process of changing the pitch without affecting the speed.
Consider a song stored as 2-channel, 16-bit linear PCM on a reasonably fast computer, whose tempo must be slowed to match another song for a remix. Re-performing it is out of the question without the source score, the samples, or the vocal training; all that remains is a WAV file extracted from a CD. Resampling the digital audio does not help either: it has an effect analogous to slowing down a phonograph turntable, transposing the song to a lower key and making the singer sound like an ogre.
One way of stretching a signal is to build a phase vocoder after Flanagan, Golden, and Portnoff. Basic steps: compute the frequency/time relationship of the signal by taking the fast Fourier transform of each windowed block of 2,048 samples (assuming 44.1 kHz input), adjust the amplitudes and phases of the frequency bins so that each bin remains coherent at the new frame spacing, and perform the inverse FFT.
A good algorithm will give good results at compression/expansion ratios of up to about ±25%; beyond that, the pre-echo and other smearing artifacts of frequency-domain interpolation on transient ("beat") waveforms, which are not at all localized in the frequency domain, begin to take a toll on perceived audio quality.
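As a concrete illustration, here is a minimal sketch of that analysis-resynthesis loop in Python with NumPy. The function name, the Hann window, and the 75% overlap (a hop of 512 against 2,048-sample blocks) are illustrative assumptions, not a reference implementation of any particular product.

    import numpy as np

    def phase_vocoder_stretch(x, rate, n_fft=2048, hop=512):
        """Time-stretch mono signal x with a basic phase vocoder.

        rate < 1 slows the signal down; rate > 1 speeds it up.
        """
        window = np.hanning(n_fft)
        bins = np.arange(n_fft // 2 + 1)
        # Expected phase advance per hop at each bin's center frequency.
        delta = 2 * np.pi * hop * bins / n_fft
        # Analysis frames are read at a spacing of hop * rate but written
        # at a spacing of hop; that mismatch is what changes the duration.
        steps = np.arange(0, len(x) - n_fft - hop, hop * rate)
        out = np.zeros(len(steps) * hop + n_fft)
        phase = np.angle(np.fft.rfft(window * x[:n_fft]))
        for i, step in enumerate(steps):
            t = int(step)
            spec1 = np.fft.rfft(window * x[t:t + n_fft])
            spec2 = np.fft.rfft(window * x[t + hop:t + hop + n_fft])
            # Measured minus expected phase advance, wrapped to [-pi, pi],
            # refines each bin's frequency estimate.
            dphi = np.angle(spec2) - np.angle(spec1) - delta
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
            phase += delta + dphi
            frame = np.fft.irfft(np.abs(spec2) * np.exp(1j * phase))
            out[i * hop:i * hop + n_fft] += window * frame
        return out

The magnitudes are kept and the phases are accumulated per bin so that each sinusoid stays coherent from frame to frame; the transient smearing described above is a direct consequence of treating every bin independently this way.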
Rabiner and Schafer in 1978 put forth an alternative solution: work in the time domain, attempt to find the period of a given section of the fundamental wave with the autocorrelation function, and crossfade one period into another. This is called time-domain harmonic scaling or the synchronized overlap-add (SOLA) method; it runs somewhat faster than the phase vocoder on slower machines but fails when the autocorrelation misestimates the period of a signal with complicated harmonics (such as orchestral pieces).
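A sketch of the time-domain idea, again in Python. Here the period search is a brute-force dot product over a small lag window rather than a full autocorrelation, and all names and block sizes are illustrative assumptions.

    import numpy as np

    def sola_stretch(x, rate, block=2048, overlap=512, seek=512):
        """Time-stretch mono signal x by synchronized overlap-add.

        Each new block is read from the input at a spacing scaled by
        rate, slid within +/- seek samples to the lag that best matches
        the tail of the output, and crossfaded in over `overlap` samples.
        """
        hop_out = block - overlap
        hop_in = int(hop_out * rate)
        fade_in = np.linspace(0.0, 1.0, overlap)
        fade_out = 1.0 - fade_in
        out = list(x[:block])
        pos = hop_in
        while pos + seek + block < len(x):
            tail = np.asarray(out[-overlap:])
            # Find the offset whose overlap region correlates best with
            # the existing output tail (a crude period-alignment search).
            best_k, best_score = 0, -np.inf
            for k in range(max(-seek, -pos), seek):
                score = np.dot(x[pos + k:pos + k + overlap], tail)
                if score > best_score:
                    best_k, best_score = k, score
            seg = x[pos + best_k:pos + best_k + block]
            out[-overlap:] = fade_out * tail + fade_in * seg[:overlap]
            out.extend(seg[overlap:])
            pos += hop_in
        return np.asarray(out)

This fails in exactly the way described above: when the alignment search locks onto the wrong lag for a signal with complicated harmonics, consecutive blocks crossfade out of phase.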
Cool Edit Pro seems to solve this by looking for the period closest to a center period that the user specifies, which should correspond to an integer number of cycles per beat and lie between 30 Hz and the lowest bass frequency. For a 120 bpm tune, 48 Hz works because 48 Hz = 2,880 cycles/minute = 24 cycles/beat × 120 bpm.
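The arithmetic behind that rule of thumb is easy to capture; the helper below is hypothetical and not part of Cool Edit Pro.

    def center_frequency(bpm, cycles_per_beat=24):
        """Center frequency in Hz that divides each beat into an integer
        number of cycles; choose cycles_per_beat so the result falls
        between 30 Hz and the lowest bass frequency in the material."""
        return bpm * cycles_per_beat / 60.0

    assert center_frequency(120) == 48.0  # 2,880 cycles/minute at 120 bpm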
High-end commercial audio processing packages combine the two techniques, using wavelet techniques to separate the signal into sinusoidal and transient components, applying the phase vocoder to the sinusoids, and processing the transients in the time domain, producing the highest-quality time stretching.
These techniques can also be used to scale the pitch of an audio sample while holding time constant. (Note that the technique is properly called pitch scaling, not "shifting": pitch shifting by amplitude modulation with a complex exponential moves every frequency by the same constant offset and therefore does not preserve the ratios of the harmonic frequencies that determine the sound's timbre.)
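One common way to realize pitch scaling with the machinery above is to time-stretch and then resample, so that the duration change and the transposition cancel. A minimal sketch, reusing either stretch function from the earlier examples; linear interpolation stands in here for a proper band-limited resampler.

    import numpy as np

    def pitch_scale(x, semitones, stretch=sola_stretch):
        """Scale pitch by `semitones` while keeping the duration constant."""
        ratio = 2.0 ** (semitones / 12.0)
        # Stretch the duration by `ratio`...
        y = stretch(x, rate=1.0 / ratio)
        # ...then resample back to the original length, which transposes
        # by `ratio` and undoes the duration change.
        xp = np.arange(len(y))
        return np.interp(np.linspace(0, len(y) - 1, len(x)), xp, y)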
Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts the formants into a sort of Alvin and the Chipmunks-like effect, which may be desirable or undesirable.
To preserve the formants and character of the voice, one can use a "regular" channel vocoder keyed to the signal's fundamental frequency. (Following a single voice's fundamental is straightforward.)
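For instance, a frame-by-frame autocorrelation estimate along these lines is usually enough to follow one voice; the lag bounds for plausible vocal pitch are illustrative assumptions.

    import numpy as np

    def estimate_f0(frame, sr, fmin=60.0, fmax=500.0):
        """Estimate the fundamental of one voiced frame by picking the
        autocorrelation peak among lags for plausible vocal pitches."""
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        lag = lo + np.argmax(ac[lo:hi])
        return sr / lag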