## Thursday, April 16, 2009

### The Cooley-Tukey FFT Algorithm

I'm currently a little fed up with number theory, so its time to change topics completely. Specially since the post on basic integer factorization completes what I believe is a sufficient toolkit to tackle a very cool subject: the fast Fourier transform (FFT).

I have some mixed feelings about how does the Fourier transform qualify for the "uncomplicated complexity" rule I imposed on myself when starting this blog. There certainly is a lot of very elaborate math behind Fourier analysis, but I think that if you skip all the mathematical subtleties and dive head first into a "the Fourier transform gives you the frequencies present in the signal" type of intuitive description, the journey from the naive implementation of the discrete Fourier transform (DFT) to the many flavors of the FFT is a very enjoyable one, which requires only a limited amount of complication, but gentle doses of complexity. Which is exactly what I'm after...

So I will take the definition of the DFT for granted, and concentrate on the algorithms to compute it efficiently. There will have to be a few more leaps of faith, but I will try to mark them clearly, so you can whisper "amen" and keep going without a second thought. Although things may unfold differently, I think there is material for about half a dozen posts. Towards the end I may devote some time to optimizing the code, with things such as avoiding recursion, or doing the calculations in place, but most of my efforts will be devoted to optimizing the math behind the code, not the code itself. So the resulting code will probably be suboptimal: if you spot a potential improvement your comments are welcome and encouraged.

The Discrete Fourier Transform
Without further explanation, we will begin by writing down the analytical expression of the DFT,

and of its corresponding inverse transform,

With python's built-in support for complex arithmetic, there really isn't much mistery in turning these two formulas into two python functions, or as I have chosen, one with an inverse switch:

from __future__ import divisionimport mathimport timedef dft(x, inverse = False, verbose = False) :    t = time.clock()    N = len(x)    inv = -1 if not inverse else 1    X =[0] * N    for k in xrange(N) :        for n in xrange(N) :            X[k] += x[n] * math.e**(inv * 2j * math.pi * k * n / N)        if inverse :            X[k] /= N    t = time.clock() - t    if verbose :        print "Computed","an inverse" if inverse else "a","DFT of size",N,        print "in",t,"sec."    return X

Of course, having two nested loops of the size of the input means that computation time will grow with N2, which is extremely inefficient.

The Radix-2 Cooley-Tukey Algorithm with Decimation in Time
How can that be improved? If the size of the input is even, we can write N = 2·M and it is possible to split the N element summation in the previous formulas into two M element ones, one over n = 2·m, another over n = 2·m + 1. By doing so, we arrive at

for the direct transform, and

for the inverse.

If you look carefully at this formulas, you'll notice we have just split our size N transform, be it direct or inverse, into two others of half the size, one over even indexes, the other over odd ones. The factor multiplying the second partial transform is known as a twiddle factor, and introduces a little overhead to the total timing, but overall we have just managed to cut the total time in half. And if the input size happens to be a power of two, we need not stop here, for as long as it is divisible by two, we can repeat the process iteratively, leading to a time of N log N.

To implement this recursive approach we need to take care of a few more details, though. First, if you look at the previous formulas, variable k in the direct transform, or n in the inverse, runs all the way to N = 2·M, rather than to M, as one would expect in the standard DFT. By taking advantage of the fact that the exponential function is periodic with period 2πi, it is quite easy to show that, for k (or n) larger than M, the resulting value will be the same as the one obtained for k-M (or n-M).

The other open question is at what point to stop the recursion. The simplest approach is to keep going until the input size is odd, and then call the naive DFT function. If the size happens to be a power of two, this call will not happen until the input size comes down to one. Since the DFT of a single sample signal is the same signal unchanged, that is a wasteful way of coding. But since it clarifies the logic behind the algorithm, I will leave those optimizations for a later stage.

With this caveats in mind, this FFT algorithm can be coded in python as follows:

from __future__ import divisionimport mathimport timedef fft_CT(x, inverse = False, verbose = False) :    t = time.clock()    N = len(x)    inv = -1 if not inverse else 1    if N % 2 :        return dft(x, inverse, False)    x_e = x[::2]    x_o  = x[1::2]    X_e = fft_CT(x_e, inverse, False)    X_o  = fft_CT(x_o, inverse, False)    X = []    M = N // 2    for k in range(M) :        X += [X_e[k] + X_o[k] * math.e ** (inv * 2j * math.pi * k / N)]    for k in range(M,N) :        X += [X_e[k-M] - X_o[k-M] * math.e ** (inv * 2j * math.pi * (k-M) / N)]    if inverse :        X = [j/2 for j in X]    t = time.clock() - t    if verbose :        print "Computed","an inverse" if inverse else "a","CT FFT of size",N,        print "in",t,"sec."    return X

The speed gain with this approach, restricted for now to power of two sizes, are tremendous even for moderately large input sizes. For instance, a 220 (~106) item input can be processed with the FFT approach in a couple of minutes, while the projected duration using the naive DFT will be more like a couple of months...

1. Jaime,

Your first equation shows up in Google Reader as "tex: X_k \pi" but the other four equations show up as nice graphics. The first equation looks fine when I go directly to your web site, just not in my reader. Is something different about your first TeX expression?

2. I'm afraid I hit the "publish" button a little too fast, and you saw a big work-in-progress mess...

I have started using MathTran (http://www.mathtran.org) to display TeX on the blog, since plain text wasn't an alternative for FFTs. It does have the reader problem, similar to what goes on with the syntax highlighter, which only works in the blog, not in the reader.

I'm not too happy with this solution as it is, so I may go back to generating image files for the formulas off-line and uploading them (the ones you saw nicely on your reader), although it turns producing a math intensive post into a nightmare, specially if you have, as I do, a tendency to spot mistakes very late...

3. Just to point out a typo: it's "Tukey" not "Turkey". The co-inventor of the FFT was called John Tukey:

http://en.wikipedia.org/wiki/John_Tukey

4. I find your explanation a little confusing. At first it isn't clear why splitting the DFT sum into two pieces saves us any work... eventually you mention the "detail" that the periodicity of the transform means we only have to calculate the two sums for k up to M-1 rather than N-1. But it is precisely this fact which cuts the work in half.

Also I'm pretty sure in the first code listing, lines 13 and 14 should be indented one level deeper. That way all elements of X get divided (for the inverse transform) instead of just the final one!

5. hi!
i need to calculate a fft of a matrix. actually i need it as a part of my program.

i read really a lot of different books and inernet pages on the subject of fft, but i still don't know how to use it.
i understand why you split the dft, and all related to that, but when it comes to the exact calculations... i just don't know what to do with it.

this is my email:

kviki_kiki@yahoo.com

thanks

6. Couple years late but this page has been great help for a university project!

This is my implementation of your functions, I find it runs around 25% faster simply by referencing methods outside of loops:

http://pastebin.com/BmHuMwEq

I'm still trying to figure out a way of using reduce/list-comprehension to get rid of as much of the loop overhead as possible.

Oh, and finally, dft(x) is kinda a moot point, as I'm padding everything I feed to fft_CT(x) with 0s to make the list length a power of 2.

7. This comment has been removed by the author.