Let x0, ..., xn-1 be complex numbers. The DFT is defined by the formula
<math> f_j = \sum_{k=0}^{n-1} x_k e^{-\frac{2\pi i}{n} jk} \qquad j = 0, \dots, n-1. </math>
Evaluating these sums directly would take O(n²) arithmetical operations (see Big O notation). An FFT is an algorithm to compute the same result in only O(n log n) operations.
Since the inverse DFT is the same as the DFT, but with the sign of the exponent flipped and a 1/n factor, any FFT algorithm can easily be adapted for it as well.
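As an illustration of the direct O(n²) evaluation, and of how the inverse transform reuses the same sum with the sign of the exponent flipped and a 1/n factor, here is a minimal Python sketch (the names naive_dft and naive_idft are purely illustrative, not from any particular library):

 import cmath
 def naive_dft(x):
     """Direct O(n^2) evaluation of f_j = sum_k x_k exp(-2*pi*i*j*k/n)."""
     n = len(x)
     return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
             for j in range(n)]
 def naive_idft(f):
     """Inverse DFT: the same sum with the exponent sign flipped, scaled by 1/n."""
     n = len(f)
     return [sum(f[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)) / n
             for k in range(n)]
 # Round-trip check: naive_idft(naive_dft(x)) recovers x up to rounding error.
 x = [0.0, 1.0, 2.0, 3.0]
 print(max(abs(a - b) for a, b in zip(x, naive_idft(naive_dft(x)))))  # ~1e-15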
The most common FFT is the Cooley-Tukey algorithm. In its most basic form, this method first computes the Fourier transforms of the even-indexed numbers x0, x2, ..., xn-2, and of the odd-indexed numbers x1, x3, ..., xn-1, and then combines those two results to produce the Fourier transform of the whole sequence. This idea can then be performed recursively to reduce the overall runtime to O(n log n). This simplified form assumes that n is a power of two; since the number of sample points n can usually be chosen freely by the application, this is often not an important restriction.
We write n' = n/2 and denote the discrete Fourier transform of the even-indexed numbers x'0 = x0, x'1 = x2, ..., x'n'-1 = xn-2 by f'0, ..., f'n'-1, and the transform of the odd-indexed numbers x''0 = x1, x''1 = x3, ..., x''n'-1 = xn-1 by f''0, ..., f''n'-1. Then it follows:
<math> \begin{matrix}
f_j & = & \sum_{k=0}^{\frac{n}{2}-1} x_{2k} e^{-\frac{2\pi i}{n} j(2k)}
+ \sum_{k=0}^{\frac{n}{2}-1} x_{2k+1} e^{-\frac{2\pi i}{n} j(2k+1)} \\ \\
& = & \sum_{k=0}^{n'-1} x'_{k} e^{-\frac{2\pi i}{n'} jk}
+ e^{-\frac{2\pi i}{n}j} \sum_{k=0}^{n'-1} x''_{k} e^{-\frac{2\pi i}{n'} jk} \\ \\
& = & \left\{ \begin{matrix}
f'_j + e^{-\frac{2\pi i}{n}j} f''_j & \mbox{if } j<n' \\ \\
f'_{j-n'} - e^{-\frac{2\pi i}{n}(j-n')} f''_{j-n'} & \mbox{if } j \geq n' \end{matrix} \right.
\end{matrix} </math>
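A minimal recursive Python sketch of this radix-2 decimation-in-time split, assuming n is a power of two (the name fft_radix2 is purely illustrative):

 import cmath
 def fft_radix2(x):
     """Radix-2 DIT FFT: transform the even- and odd-indexed halves recursively,
     then combine them with the twiddle factors exp(-2*pi*i*j/n).
     Assumes len(x) is a power of two."""
     n = len(x)
     if n == 1:
         return list(x)
     f_even = fft_radix2(x[0::2])   # f'_j : DFT of x_0, x_2, ..., x_{n-2}
     f_odd = fft_radix2(x[1::2])    # f''_j: DFT of x_1, x_3, ..., x_{n-1}
     f = [0j] * n
     for j in range(n // 2):
         t = cmath.exp(-2j * cmath.pi * j / n) * f_odd[j]   # twiddle factor times f''_j
         f[j] = f_even[j] + t            # case j < n'
         f[j + n // 2] = f_even[j] - t   # case j >= n'
     return f

For a power-of-two length, its output should agree with the direct summation given earlier (or with any standard FFT routine) to within rounding error.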
A form of this trick, including its recursive application, was already known around 1805 to Carl Friedrich Gauss, who used it to interpolate the trajectories of the asteroids Pallas and Juno, but was not widely recognized (being published only posthumously and in Latin); Gauss did not analyze the asymptotic complexity, however. Various limited forms were also rediscovered several times throughout the 19th and early 20th centuries. FFTs became popular after J. W. Cooley of IBM and J. W. Tukey of Princeton published a paper in 1965 reinventing the algorithm and describing how to perform it conveniently on a computer (including how to arrange for the output to be produced in the natural ordering).
This process is an example of the general technique of divide and conquer algorithms; in many traditional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in breadth-first fashion.
The above re-expression of a size-n DFT as two size-n/2 DFTs is sometimes called the Danielson-Lanczos lemma, since the identity was noted by those two authors in 1942 (influenced by Runge's 1903 work). They applied their lemma in a "backwards" recursive fashion, repeatedly doubling the DFT size until the transform spectrum converged (although they apparently didn't realize the logarithmic asymptotic complexity they had achieved). The Danielson-Lanczos work predated widespread availability of computing machines and required hand calculation; they reported a computation time of 140 minutes for a size-64 DFT operating on real inputs (see below) to 3-5 significant digits. Cooley and Tukey's 1965 paper reported a running time of 0.02 minutes for a size-2048 complex DFT on an IBM 7094 (probably in 36-bit single precision, ~8 digits). Rescaling the time by n log n, this corresponds roughly to a speedup factor of around 800,000. The more modern FFT library FFTW, on a 2GHz Pentium-IV in 64-bit double precision (~16 digits), can compute a size-64 real-input DFT in 1μs and a size-2048 complex DFT in 100μs, speedups of about 8,000,000,000 and 10,000 over Danielson & Lanczos and Cooley & Tukey, respectively, not even including the considerable improvements in accuracy.
(140 minutes for size 64 may sound like a long time, but it corresponds to an average of at most 16 seconds per floating-point operation, around 20% of which are multiplications...this is a fairly impressive rate for a human being to sustain for almost two and a half hours, especially when you consider the bookkeeping overhead.)
More generally, Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1n2 as n1 DFTs of size n2, followed by multiplication by complex roots of unity called twiddle factors, followed by n2 DFTs of size n1. Typically, either n1 or n2 is a small factor, called the radix. If n1 is the radix, it is called a decimation in time (DIT) algorithm, whereas if n2 is the radix, it is decimation in frequency (DIF, also called the Sande-Tukey algorithm). The version presented above is a radix-2 DIT algorithm; in the final expression, the phase multiplying the odd transform is the twiddle factor, and the +/- combination (butterfly) of the even and odd transforms is a size-2 DFT. (The radix's small DFT is sometimes known as a butterfly, so-called because of the shape of the data-flow diagram for the radix-2 case.) Gauss used a radix-3 DIT (or radix-4 DIF) step in a 12-point DFT example.
There are many other variations on the Cooley-Tukey algorithm. Mixed-radix implementations handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the O(n²) algorithm for the prime base cases of the recursion. Split-radix merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, in order to achieve a minimal operation count for power-of-two sizes. (On present-day computers, performance is determined more by cache and CPU pipeline considerations than by strict operation counts; well optimized FFT implementations often employ larger radices and/or hard-coded base-case transforms of significant size.) Another way of looking at the Cooley-Tukey algorithm is that it re-expresses a size-n one-dimensional DFT as an n1 by n2 two-dimensional DFT (plus twiddles), where the output matrix is transposed. The net result of all of these transpositions, for a radix-2 algorithm, corresponds to a bit reversal of the input (DIF) or output (DIT) indices. If, instead of using a small radix, one employs a radix of roughly √n and explicit input/output matrix transpositions, it is called a four-step (also six-step) algorithm, initially proposed for cache/locality optimization (Gentleman and Sande, 1966).
The general Cooley-Tukey factorization rewrites the indices j and k as j = n2 j1 + j2 and k = n1 k2 + k1, respectively, where the indices ja and ka run from 0..na-1 (for a of 1 or 2). That is, it re-indexes the input (k) and output (j) as n1 by n2 two-dimensional arrays in column-major and row-major order, respectively. When this reindexing is substituted into the DFT formula for jk, the n2j1n1k2 cross term vanishes (its exponential is unity), and the remaining terms give
<math> f_{n_2 j_1 + j_2} = \sum_{k_1=0}^{n_1-1} \left[ e^{-\frac{2\pi i}{n} j_2 k_1 } \right] \left( \sum_{k_2=0}^{n_2-1} x_{n_1 k_2 + k_1} e^{-\frac{2\pi i}{n_2} j_2 k_2 } \right) e^{-\frac{2\pi i}{n_1} j_1 k_1 }
</math>
where the inner sum is a DFT of size n2, the outer sum is a DFT of size n1, and the [...] bracketed term is the twiddle factor.
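As a rough illustration of this factorization (a sketch, not a tuned implementation), the following Python code performs a single Cooley-Tukey step for a composite size n = n1·n2, using the plain O(n²) DFT for the size-n1 and size-n2 sub-transforms; the function names are purely illustrative:

 import cmath
 def dft(x):
     """Plain O(n^2) DFT, standing in for the size-n1 and size-n2 sub-transforms."""
     n = len(x)
     return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
             for j in range(n)]
 def cooley_tukey_step(x, n1, n2):
     """One general Cooley-Tukey step for n = n1*n2: n1 DFTs of size n2,
     multiplication by the twiddle factors, then n2 DFTs of size n1.
     Input index k = n1*k2 + k1, output index j = n2*j1 + j2."""
     n = n1 * n2
     # Inner sums: for each k1, a size-n2 DFT over x[n1*k2 + k1].
     inner = [dft([x[n1 * k2 + k1] for k2 in range(n2)]) for k1 in range(n1)]
     f = [0j] * n
     for j2 in range(n2):
         # Twiddle factors exp(-2*pi*i*j2*k1/n), then a size-n1 DFT over k1.
         col = [inner[k1][j2] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
                for k1 in range(n1)]
         outer = dft(col)
         for j1 in range(n1):
             f[n2 * j1 + j2] = outer[j1]
     return f

For example, cooley_tukey_step(x, 3, 4) applied to a length-12 sequence should match the direct size-12 DFT; applying the same idea recursively to the sub-transforms yields the full FFT.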
The 1965 Cooley-Tukey paper noted that one can employ an arbitrary radix r (as well as mixed radices), but failed to realize that the radix butterfly is itself a DFT that can use FFT algorithms. Hence, they reckoned the complexity to be O(r² (n/r) log_r n), and erroneously concluded that the optimal radix was 3 (the closest integer to e). (Gauss also derived the algorithm for arbitrary radices, and gave explicit examples of both radix-3 and radix-6 steps.)
There are other FFT algorithms distinct from Cooley-Tukey. For relatively prime n1 and n2, one can use the Prime-Factor (Good-Thomas) algorithm, based on the Chinese Remainder Theorem, to factorize the DFT similarly to Cooley-Tukey but without the twiddle factors. The Rader-Brenner algorithm is a Cooley-Tukey-like factorization but with purely imaginary twiddle factors, reducing multiplications at the cost of increased additions and reduced numerical stability. Algorithms that recursively factorize the DFT into smaller operations other than DFTs include the Bruun and QFT algorithms. (The Rader-Brenner and QFT algorithms were proposed for power-of-two sizes, but it is possible that they could be adapted to general composite n. Bruun's algorithm was adapted to the mixed-radix case for even n by H. Murakami.)

Bruun's algorithm, in particular, is based on interpreting the FFT as a recursive factorization of the polynomial z^n - 1, here into real-coefficient polynomials of the form z^m - 1 and z^(2m) + a z^m + 1. A similar polynomial viewpoint is also exploited by the Winograd algorithm, which factorizes z^n - 1 into cyclotomic polynomials; these often have coefficients of 1, 0, or -1, and therefore require few (if any) multiplications, so Winograd can be used to obtain minimal-multiplication FFTs and is often used to find efficient algorithms for small prime factors. In particular, Winograd also makes use of a general algorithm by Rader for FFTs of prime sizes.

Rader's algorithm, exploiting the existence of a generator for the multiplicative group modulo prime n, expresses a DFT of prime size n as a cyclic convolution of (composite) size n - 1, which can then be computed by a pair of ordinary FFTs via the convolution theorem (although Winograd uses other convolution methods). Another prime-size FFT is due to L. I. Bluestein, and is sometimes called the chirp-z algorithm; it also re-expresses a DFT as a convolution, but this time of the same size (which can be zero-padded to a power of two and evaluated by radix-2 Cooley-Tukey FFTs, for example), via the identity jk = -(j-k)²/2 + j²/2 + k²/2.
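To make the Bluestein re-expression concrete, here is a Python sketch (NumPy assumed; the name bluestein_dft is purely illustrative) that turns a DFT of arbitrary size n into a convolution of zero-padded power-of-two size, evaluated with ordinary FFTs via the convolution theorem:

 import numpy as np
 def bluestein_dft(x):
     """DFT of arbitrary length n via Bluestein's chirp-z trick:
     jk = -(j-k)^2/2 + j^2/2 + k^2/2 turns the DFT into a convolution,
     which is evaluated here with zero-padded power-of-two FFTs."""
     x = np.asarray(x, dtype=complex)
     n = len(x)
     k = np.arange(n)
     chirp = np.exp(-1j * np.pi * k * k / n)        # exp(-pi i k^2 / n)
     a = x * chirp                                  # pre-multiplied input
     m = 2
     while m < 2 * n - 1:                           # smallest power of two >= 2n-1
         m *= 2
     b = np.zeros(m, dtype=complex)                 # kernel exp(+pi i p^2 / n) for p = -(n-1)..(n-1),
     b[:n] = np.conj(chirp)                         # laid out for circular convolution
     b[m - n + 1:] = np.conj(chirp[1:])[::-1]       # negative indices wrapped around
     a_pad = np.zeros(m, dtype=complex)
     a_pad[:n] = a
     conv = np.fft.ifft(np.fft.fft(a_pad) * np.fft.fft(b))   # convolution theorem
     return chirp * conv[:n]                        # post-multiply by exp(-pi i j^2 / n)
 # Sanity check against a library FFT for a size that is not a power of two:
 x = np.random.randn(12) + 1j * np.random.randn(12)
 print(np.allclose(bluestein_dft(x), np.fft.fft(x)))   # True (up to rounding)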
FFT algorithms specialized for real and/or symmetric data
In many applications, the input data for the DFT are purely real, in which case the outputs satisfy the symmetry
<math> f_{n-j} = f_j^* , </math>
where * denotes complex conjugation. Efficient FFT algorithms have been designed for this situation; one approach takes an ordinary algorithm (e.g. Cooley-Tukey) and removes the redundant parts of the computation, saving roughly a factor of two in time and memory.
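A quick numerical illustration of this redundancy (using NumPy purely as a demonstration):

 import numpy as np
 x = np.random.randn(8)                 # purely real input
 f = np.fft.fft(x)
 n = len(x)
 # Conjugate symmetry f_{n-j} = f_j*: the second half of the output is redundant.
 print(np.allclose(f[1:], np.conj(f[1:][::-1])))   # True
 # Real-input routines therefore return only the n/2 + 1 non-redundant outputs:
 print(len(np.fft.rfft(x)))                        # 5 for n = 8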
It was once believed that real-input DFTs could be more efficiently computed by means of the Discrete Hartley transform (DHT), but this was subsequently disproved: a specialized real-input DFT algorithm (FFT) can typically be found that requires fewer operations than the corresponding DHT algorithm (FHT) for the same number of inputs.
There are further FFT specializations for the cases of real data that have even/odd symmetry, in which case one can gain another factor of ~2 in time/space and the DFT becomes the discrete cosine/sine transform(s) (DCT/DST). Instead of directly modifying an FFT algorithm for these cases, DCTs/DSTs can also be computed via FFTs of real data combined with O(n) pre/post processing.
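As one concrete instance of the latter approach, the following Python sketch computes the (unnormalized) DCT-II through a same-length FFT of reordered real data with O(n) pre/post-processing, and checks it against direct evaluation. This is just one of several such mappings, and the function names are purely illustrative:

 import numpy as np
 def dct2_direct(x):
     """Unnormalized DCT-II: X_j = sum_k x_k cos(pi*j*(2k+1)/(2n))."""
     n = len(x)
     k = np.arange(n)
     return np.array([np.sum(x * np.cos(np.pi * j * (2 * k + 1) / (2 * n)))
                      for j in range(n)])
 def dct2_via_fft(x):
     """DCT-II via a same-length DFT: reorder the input as
     (x_0, x_2, x_4, ..., x_5, x_3, x_1), transform, and rotate the outputs."""
     x = np.asarray(x, dtype=float)
     n = len(x)
     v = np.concatenate([x[0::2], x[1::2][::-1]])   # O(n) pre-processing
     V = np.fft.fft(v)                              # a real-input FFT would also do here
     j = np.arange(n)
     return np.real(np.exp(-1j * np.pi * j / (2 * n)) * V)   # O(n) post-processing
 x = np.random.randn(10)
 print(np.allclose(dct2_direct(x), dct2_via_fft(x)))   # True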
All of the FFT algorithms discussed so far compute the DFT exactly (in exact arithmetic, i.e. neglecting floating-point errors). A few "FFT" algorithms have been proposed, however, that compute the DFT approximately, with an error that can be made arbitrarily small at the expense of increased computations. Such algorithms trade the approximation error for increased speed or other properties. For example, an approximate FFT algorithm by Edelman et al. (1999) achieves lower communication requirements for parallel computing with the help of a fast-multipole method. A wavelet-based approximate FFT by Guo and Burrus (1996) takes sparse inputs/outputs (time/frequency localization) into account more efficiently than is possible with an exact FFT. Another algorithm for approximate computation of a subset of the DFT outputs is due to Shentov et al. (1995). Only the Edelman algorithm works equally well for sparse and non-sparse data, however, since it is based on the compressibility (rank deficiency) of the Fourier matrix itself rather than the compressibility (sparsity) of the data.
Even the "exact" FFT algorithms have errors when finite-precision floating-point arithmetic is used, but these errors are typically quite small; most FFT algorithms, e.g. Cooley-Tukey, have excellent numerical properties. The upper bound on the relative error for the Cooley-Tukey algorithm is O(ε log n), compared to O(ε n^(3/2)) for the naive DFT formula (Gentleman and Sande, 1966), where ε is the machine floating-point relative error. In fact, the average errors are much better than these upper bounds, being only O(ε √(log n)) for Cooley-Tukey and O(ε √n) for the naive DFT (Schatzman, 1996). These results, however, are very sensitive to the accuracy of the twiddle factors used in the FFT (i.e. the trigonometric function values), and it is not unusual for incautious FFT implementations to have much worse accuracy, e.g. if they use unstable trigonometric recurrence formulas. Some FFTs other than Cooley-Tukey, such as the Rader-Brenner algorithm, are intrinsically less stable.
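A rough way to observe this difference in error growth (a quick experiment rather than a careful error analysis) is to carry out both the naive summation and a radix-2 FFT in single precision and compare each against a double-precision reference; the sketch below assumes NumPy, and the function names are purely illustrative:

 import numpy as np
 def fft_c64(x):
     """Radix-2 DIT FFT carried out entirely in single precision (complex64)."""
     n = len(x)
     if n == 1:
         return x.copy()
     even, odd = fft_c64(x[0::2]), fft_c64(x[1::2])
     tw = np.exp(-2j * np.pi * np.arange(n // 2) / n).astype(np.complex64)
     t = tw * odd
     return np.concatenate([even + t, even - t])
 def naive_c64(x):
     """Direct O(n^2) DFT as a single-precision matrix-vector product."""
     n = len(x)
     j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
     W = np.exp(-2j * np.pi * j * k / n).astype(np.complex64)
     return W @ x
 n = 2048
 rng = np.random.default_rng(0)
 x = (rng.standard_normal(n) + 1j * rng.standard_normal(n)).astype(np.complex64)
 ref = np.fft.fft(x.astype(np.complex128))          # double-precision reference
 err = lambda y: np.linalg.norm(y - ref) / np.linalg.norm(ref)
 print("FFT   relative error:", err(fft_c64(x)))    # roughly eps * sqrt(log n)
 print("naive relative error:", err(naive_c64(x)))  # noticeably larger, roughly eps * sqrt(n)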