I have had to find it out the hard way, but blogger sucks when it comes to math blogging. It wouldn't be all that bad, were it not because on the other side of the road things are soooo very much easier. And while I will miss google analytics, the native support for latex and syntax highlighting make up for it more than enough. So I'm moving this blog to wordpress...

You can find the new blog here, and suscribe to the RSS feed here.

For starters I have redone the much improvable first post of the series on the FFT, which I have split in two, one on the DFT, the other starting discussion on the FFT, plus an unconnected post on computing the GCD...

I will be moving all the content in this blog over to the new one, but have no intention of deleting anything from this one.

## Thursday, April 30, 2009

## Thursday, April 16, 2009

### The Cooley-Tukey FFT Algorithm

I'm currently a little fed up with number theory, so its time to change topics completely. Specially since the post on basic integer factorization completes what I believe is a sufficient toolkit to tackle a very cool subject: the fast Fourier transform (FFT).

I have some mixed feelings about how does the Fourier transform qualify for the "uncomplicated complexity" rule I imposed on myself when starting this blog. There certainly is a lot of very elaborate math behind Fourier analysis, but I think that if you skip all the mathematical subtleties and dive head first into a "the Fourier transform gives you the frequencies present in the signal" type of intuitive description, the journey from the naive implementation of the discrete Fourier transform (DFT) to the many flavors of the FFT is a very enjoyable one, which requires only a limited amount of complication, but gentle doses of complexity. Which is exactly what I'm after...

So I will take the definition of the DFT for granted, and concentrate on the algorithms to compute it efficiently. There will have to be a few more leaps of faith, but I will try to mark them clearly, so you can whisper "amen" and keep going without a second thought. Although things may unfold differently, I think there is material for about half a dozen posts. Towards the end I may devote some time to optimizing the code, with things such as avoiding recursion, or doing the calculations in place, but most of my efforts will be devoted to optimizing the math behind the code, not the code itself. So the resulting code will probably be suboptimal: if you spot a potential improvement your comments are welcome and encouraged.

The Discrete Fourier Transform

Without further explanation, we will begin by writing down the analytical expression of the DFT,

and of its corresponding inverse transform,

With python's built-in support for complex arithmetic, there really isn't much mistery in turning these two formulas into two python functions, or as I have chosen, one with an

Of course, having two nested loops of the size of the input means that computation time will grow with N

The Radix-2 Cooley-Tukey Algorithm with Decimation in Time

How can that be improved? If the size of the input is even, we can write N = 2·M and it is possible to split the N element summation in the previous formulas into two M element ones, one over n = 2·m, another over n = 2·m + 1. By doing so, we arrive at

for the direct transform, and

for the inverse.

If you look carefully at this formulas, you'll notice we have just split our size N transform, be it direct or inverse, into two others of half the size, one over even indexes, the other over odd ones. The factor multiplying the second partial transform is known as a twiddle factor, and introduces a little overhead to the total timing, but overall we have just managed to cut the total time in half. And if the input size happens to be a power of two, we need not stop here, for as long as it is divisible by two, we can repeat the process iteratively, leading to a time of N log N.

To implement this recursive approach we need to take care of a few more details, though. First, if you look at the previous formulas, variable k in the direct transform, or n in the inverse, runs all the way to N = 2·M, rather than to M, as one would expect in the standard DFT. By taking advantage of the fact that the exponential function is periodic with period 2πi, it is quite easy to show that, for k (or n) larger than M, the resulting value will be the same as the one obtained for k-M (or n-M).

The other open question is at what point to stop the recursion. The simplest approach is to keep going until the input size is odd, and then call the naive DFT function. If the size happens to be a power of two, this call will not happen until the input size comes down to one. Since the DFT of a single sample signal is the same signal unchanged, that is a wasteful way of coding. But since it clarifies the logic behind the algorithm, I will leave those optimizations for a later stage.

With this caveats in mind, this FFT algorithm can be coded in python as follows:

The speed gain with this approach, restricted for now to power of two sizes, are tremendous even for moderately large input sizes. For instance, a 2

I have some mixed feelings about how does the Fourier transform qualify for the "uncomplicated complexity" rule I imposed on myself when starting this blog. There certainly is a lot of very elaborate math behind Fourier analysis, but I think that if you skip all the mathematical subtleties and dive head first into a "the Fourier transform gives you the frequencies present in the signal" type of intuitive description, the journey from the naive implementation of the discrete Fourier transform (DFT) to the many flavors of the FFT is a very enjoyable one, which requires only a limited amount of complication, but gentle doses of complexity. Which is exactly what I'm after...

So I will take the definition of the DFT for granted, and concentrate on the algorithms to compute it efficiently. There will have to be a few more leaps of faith, but I will try to mark them clearly, so you can whisper "amen" and keep going without a second thought. Although things may unfold differently, I think there is material for about half a dozen posts. Towards the end I may devote some time to optimizing the code, with things such as avoiding recursion, or doing the calculations in place, but most of my efforts will be devoted to optimizing the math behind the code, not the code itself. So the resulting code will probably be suboptimal: if you spot a potential improvement your comments are welcome and encouraged.

The Discrete Fourier Transform

Without further explanation, we will begin by writing down the analytical expression of the DFT,

and of its corresponding inverse transform,

With python's built-in support for complex arithmetic, there really isn't much mistery in turning these two formulas into two python functions, or as I have chosen, one with an

`inverse`

switch:

from __future__ import division

import math

import time

def dft(x, inverse = False, verbose = False) :

t = time.clock()

N = len(x)

inv = -1 if not inverse else 1

X =[0] * N

for k in xrange(N) :

for n in xrange(N) :

X[k] += x[n] * math.e**(inv * 2j * math.pi * k * n / N)

if inverse :

X[k] /= N

t = time.clock() - t

if verbose :

print "Computed","an inverse" if inverse else "a","DFT of size",N,

print "in",t,"sec."

return X

Of course, having two nested loops of the size of the input means that computation time will grow with N

^{2}, which is extremely inefficient.The Radix-2 Cooley-Tukey Algorithm with Decimation in Time

How can that be improved? If the size of the input is even, we can write N = 2·M and it is possible to split the N element summation in the previous formulas into two M element ones, one over n = 2·m, another over n = 2·m + 1. By doing so, we arrive at

for the direct transform, and

for the inverse.

If you look carefully at this formulas, you'll notice we have just split our size N transform, be it direct or inverse, into two others of half the size, one over even indexes, the other over odd ones. The factor multiplying the second partial transform is known as a twiddle factor, and introduces a little overhead to the total timing, but overall we have just managed to cut the total time in half. And if the input size happens to be a power of two, we need not stop here, for as long as it is divisible by two, we can repeat the process iteratively, leading to a time of N log N.

To implement this recursive approach we need to take care of a few more details, though. First, if you look at the previous formulas, variable k in the direct transform, or n in the inverse, runs all the way to N = 2·M, rather than to M, as one would expect in the standard DFT. By taking advantage of the fact that the exponential function is periodic with period 2πi, it is quite easy to show that, for k (or n) larger than M, the resulting value will be the same as the one obtained for k-M (or n-M).

The other open question is at what point to stop the recursion. The simplest approach is to keep going until the input size is odd, and then call the naive DFT function. If the size happens to be a power of two, this call will not happen until the input size comes down to one. Since the DFT of a single sample signal is the same signal unchanged, that is a wasteful way of coding. But since it clarifies the logic behind the algorithm, I will leave those optimizations for a later stage.

With this caveats in mind, this FFT algorithm can be coded in python as follows:

from __future__ import division

import math

import time

def fft_CT(x, inverse = False, verbose = False) :

t = time.clock()

N = len(x)

inv = -1 if not inverse else 1

if N % 2 :

return dft(x, inverse, False)

x_e = x[::2]

x_o = x[1::2]

X_e = fft_CT(x_e, inverse, False)

X_o = fft_CT(x_o, inverse, False)

X = []

M = N // 2

for k in range(M) :

X += [X_e[k] + X_o[k] * math.e ** (inv * 2j * math.pi * k / N)]

for k in range(M,N) :

X += [X_e[k-M] - X_o[k-M] * math.e ** (inv * 2j * math.pi * (k-M) / N)]

if inverse :

X = [j/2 for j in X]

t = time.clock() - t

if verbose :

print "Computed","an inverse" if inverse else "a","CT FFT of size",N,

print "in",t,"sec."

return X

The speed gain with this approach, restricted for now to power of two sizes, are tremendous even for moderately large input sizes. For instance, a 2

^{20}(~10^{6}) item input can be processed with the FFT approach in a couple of minutes, while the projected duration using the naive DFT will be more like a couple of months...## Thursday, April 9, 2009

### Naive Integer Factorization

After three posts (1, 2, 3) on calculating prime numbers, it is probably worth putting that knowledge to a more useful task. As we will see in a near future, integer factorization, i.e. breaking down a (composite) number into its prime factors is one such task. In purity, factoring a number n is simply decomposing it as the product of two smaller non-trivial, i.e. different from 1 and n itself, divisors. But by repeatedly factoring the divisors one will eventually come up with a unique set of primes which, when multiplied together, render the original number, or so says the fundamental theorem of arithmetic... The point is, we will consider factorization a synonym of prime decomposition, be it formally correct or not.

There are some very sophisticated methods to factor very large numbers, but they use a lot of extremely complex math, so I doubt they will ever find their way onto this blog. So we are going to be left with the naive, straightforward approach as our only option, although I will try to give it an efficiency boost. What is this naive approach? Trial division, of course: given a number n, we know that its smallest factor will be smaller than the square root of n, so we can simply try and see if any of those numbers divide it. No, I will not try to code that yet... If you have read the entries on determining prime numbers, it should come as no surprise that we really do not need to do trial division by all numbers smaller than the square root of n, but only by the primes within. This is a consequence of the fact that, if a composite number divides n, then each of the prime factors of that composite number will also divide n. According to the prime number theorem the number of primes below x is asymptotic to x / log x. So by limiting our trials to prime numbers we can reduce the number of tests from n

While this function will be about as good as we can make it for numbers which are the product of two large prime factors, it will be terribly inefficient for most numbers. Consider, as an extreme example, that we are trying to factor 2

This new code will very often be much faster than the other one, but at times it will be just about as slow as in the other case, or even slower, since the mixing of both codes introduces some inefficiencies. The most extreme examples of such cases would be a prime number, or the square of a prime number on one side, and a power of 2 on the other one.

The graph above plots times to calculate the factors of numbers between 10

If the above graph is not enough to convince you of the benefits of the second approach, maybe this timings for very large numbers will:

There are some very sophisticated methods to factor very large numbers, but they use a lot of extremely complex math, so I doubt they will ever find their way onto this blog. So we are going to be left with the naive, straightforward approach as our only option, although I will try to give it an efficiency boost. What is this naive approach? Trial division, of course: given a number n, we know that its smallest factor will be smaller than the square root of n, so we can simply try and see if any of those numbers divide it. No, I will not try to code that yet... If you have read the entries on determining prime numbers, it should come as no surprise that we really do not need to do trial division by all numbers smaller than the square root of n, but only by the primes within. This is a consequence of the fact that, if a composite number divides n, then each of the prime factors of that composite number will also divide n. According to the prime number theorem the number of primes below x is asymptotic to x / log x. So by limiting our trials to prime numbers we can reduce the number of tests from n

^{1/2}to something around 2 n^{1/2}/ log n. If we rescue the`primeListSofE`

function from the post on the sieve of Erathostenes, a python implementation of naive factorization could look something like this...

from time import clock

def factor(n, verbose = False) :

"""Returns all prime factors of n, using trial division by prime

numbers only. Returns a list of (possibly repeating) prime factors

"""

t = clock()

ret =[]

nn = n

maxFactor = int(n**0.5)

primes = primeListSofE(maxFactor, verbose)

for p in primes :

while nn % p == 0 :

nn //= p

ret += [p]

if nn == 1 :

break

if nn != 1 :

ret += [nn]

t = clock() - t

if verbose :

print "Calculated factors of",n,"in",t,"sec."

return ret

While this function will be about as good as we can make it for numbers which are the product of two large prime factors, it will be terribly inefficient for most numbers. Consider, as an extreme example, that we are trying to factor 2

^{55}~ 3.6·10^{16}. We would first calculate all primes up to 1.9·10^{8}, a challenging feat in itself with our available tools, only to find out that we only needed the first of those primes, i.e. 2. Taking into account that 50% of all numbers are divisible by 2, 33% are divisible by 3, 20% are divisible by 5... it doesn't seem wise to disregard the potential time savings. What we can do to profit from this is to do the trial division checks at the same time as we determine the prime numbers, updating the largest prime to test on the fly. This has to be done in two stages, the first while we sieve up to n^{1/4}, the second while we search the rest of the sieve up to n^{1/2}searching for more primes. The following Franken-code has been written mostly by cut-and-paste from`primeListSofE`

and `factor`

, which hopefully hasn't affected its readability much:

from time import clock

def factorAndSieve(n, verbose = False) :

"""Returns all prime factors of n, using trial division while sieving

for primes. Returns a list of (possibly repeating) prime factors

"""

t = clock()

ret =[]

nn = n

while nn % 2 == 0 : # remove 2's first, as 2 is not in sieve

nn //= 2

ret += [2]

maxFactor = int(nn**0.5)

maxI = (maxFactor-3) // 2

maxP = int(maxFactor**0.5)

sieve = [True for j in xrange(maxI+1)]

i = 0

for p in xrange(3, maxP+1,2) : # we then sieve as far as needed

if p > maxP :

break

i = (p-3) // 2

if sieve[i] :

while nn % p == 0 :

nn //= p

ret += [p]

maxFactor = int(nn**0.5)

maxI = (maxFactor-3) // 2

maxP = int(maxFactor**0.5)

if nn == 1 :

break

else :

i2 = (p*p - 3) // 2

for k in xrange(i2, maxI+1, p) :

sieve[k] = False

index = i

for i in xrange(index, maxI+1) : # and inspect the rest of the sieve

if i > maxI :

break

if sieve[i] :

p = 2*i + 3

while nn % p == 0 :

nn //= p

ret += [p]

maxFactor = int(nn**0.5)

maxI = (maxFactor-3) // 2

maxP = int(maxFactor**0.5)

if nn == 1 :

break

if nn != 1 :

ret += [nn]

t = clock() - t

if verbose :

print "Calculated factors of",n,"in",t,"sec."

print "Stopped trial division at",2*i+3,"instead of",int(n**0.5)

return ret

This new code will very often be much faster than the other one, but at times it will be just about as slow as in the other case, or even slower, since the mixing of both codes introduces some inefficiencies. The most extreme examples of such cases would be a prime number, or the square of a prime number on one side, and a power of 2 on the other one.

The graph above plots times to calculate the factors of numbers between 10

^{6}and 10^{6}+ 100. Prime numbers in this interval stick out as the red dots among the blue ones: 10^{6}+3, +33, the twin primes +37 and +39, +81 and +99. And numbers with many small prime factors populate the bottom of the red cloud.If the above graph is not enough to convince you of the benefits of the second approach, maybe this timings for very large numbers will:

>>> factor(10**15+37,True)

Calculated primes to 31622776 in 6.760 sec.

Calculated factors of 1000000000000037 in 8.466 sec.

[1000000000000037L]

>>> factorAndSieve(10**15+37,True)

Calculated factors of 1000000000000037 in 8.666 sec.

Stopped trial division at 31622775 instead of 31622776

[1000000000000037L]

>>> factor(2**55,True)

Calculated primes to 189812531 in 42.811 sec.

Calculated factors of 36028797018963968 in 43.261 sec.

[2, ..., 2]

>>> factorAndSieve(2**55,True)

Calculated factors of 36028797018963968 in 8.632e-05 sec.

Stopped trial division at 3 instead of 189812531

[2, ..., 2]

Labels:
factorization,
number theory,
primes,
python

## Tuesday, March 24, 2009

### Prime Numbers 3. Wheel Factorization

In an earlier post, I described the sieve of Eratosthenes, a very fast algorithm to calculate all primes below a given limit, n. At least when programming it in python, speed is not an issue at all with this algorithm, because we start having memory problems much earlier, due to the fact that we need to keep in memory a list of size n...

This is specially problematic with python, because although all we need is a list of booleans (is prime / is not prime), which could fit in a bit each, I'm afraid python assigns several bytes to each. How many of them is probably machine dependent, but I'm almost certain it is at least 4 of them. So it is very likely that we could improve memory usage by a factor of 32, maybe more, if we could get around this. There are not-so-standard libraries that do just that, but I've sworn not to use them, and one day I may expose my failed attempts at using very long ints as bit arrays to the world, but python is definitely not the best language to mess around with low level optimizations, so I'm going to leave this line of thinking aside for the time being.

I also presented a very naive optimization, which simply discards all even numbers, which we already know are not prime, and thus cuts in half the memory requirements. More formally, what was being done there was to partition all numbers into two groups: those of the form 2·n, and those of the form 2·n+1, and then discard the former, since no prime of that form exists, and keep the latter only. This line of thinking is much more promising, as it leads to a very neat way of further reducing the memory required, known as wheel factorization.

For some odd reason that escapes my shallow understanding, the standard description of this algorithm places much emphasis on setting all numbers in circles, and then eliminating what it calls spokes. I personally don't see any benefit of such a mental image, specially since the number are going to end up arranged in a rectangular array. So for the sake of clarity, I will not honor tradition, and spare you the circles and the spokes.

We begin by choosing a first few prime numbers. Say the first three: 2, 3 and 5. We multiply them together to get M = 30. We are now going to partition the set of all numbers into M groups, of the form M·m + k

If M = 30, of the 30 potential k

Of course we are still going to have to sieve each of our subsets, in much the same way we did with the sieve of Eratothenes. But sieving all multiples of a prime p when you have several subsets to consider is anything but trivial. Lets see how to tackle it...

The question to answer is, what integer values of m fulfill the equation M·m + k

Last, but not least, we want to start sieving at the smallest multiple of p in each subset that is larger than or equal to p

Lets try to put all this ideas together in a python script. To make it work, you will need the

You can play around with this algorithm and look at the time required. It probably depends on the value of n, but there seems to be a sweet spot for speed somewhere around m = 7 (and thus M = 210) or m = 11 (M = 2310). This being a general purpose wheel factorization algorithm, it has speeds that are faster than the unoptimized sieve of Eratosthenes code, but it cannot beat the optimized code, which in fact is a particular case of wheel factorization, for m = M = 2.

On the other hand, using m = 7 I have managed to compute all primes up to 10

This is specially problematic with python, because although all we need is a list of booleans (is prime / is not prime), which could fit in a bit each, I'm afraid python assigns several bytes to each. How many of them is probably machine dependent, but I'm almost certain it is at least 4 of them. So it is very likely that we could improve memory usage by a factor of 32, maybe more, if we could get around this. There are not-so-standard libraries that do just that, but I've sworn not to use them, and one day I may expose my failed attempts at using very long ints as bit arrays to the world, but python is definitely not the best language to mess around with low level optimizations, so I'm going to leave this line of thinking aside for the time being.

I also presented a very naive optimization, which simply discards all even numbers, which we already know are not prime, and thus cuts in half the memory requirements. More formally, what was being done there was to partition all numbers into two groups: those of the form 2·n, and those of the form 2·n+1, and then discard the former, since no prime of that form exists, and keep the latter only. This line of thinking is much more promising, as it leads to a very neat way of further reducing the memory required, known as wheel factorization.

For some odd reason that escapes my shallow understanding, the standard description of this algorithm places much emphasis on setting all numbers in circles, and then eliminating what it calls spokes. I personally don't see any benefit of such a mental image, specially since the number are going to end up arranged in a rectangular array. So for the sake of clarity, I will not honor tradition, and spare you the circles and the spokes.

We begin by choosing a first few prime numbers. Say the first three: 2, 3 and 5. We multiply them together to get M = 30. We are now going to partition the set of all numbers into M groups, of the form M·m + k

_{i}, where the k_{i}are the numbers 1 through M, both inclusive. Most of these subsets happen to hold no prime number, and thus need not be considered at all. To find out which of them, remember that M is the product of several prime numbers, M = p_{1}·p_{2}... p_{n}. Therefore, if k_{i}is a multiple of any of those prime numbers, i.e. k_{i}= p_{j}·k'_{i}, we can always extract the common factor p_{j}from both summands in M·m + k_{i}, which means that all numbers in that subset are composite. These subsets, which we discard, are very closely related to the spokes I was telling you about a while ago.If M = 30, of the 30 potential k

_{i}we are left with just 8 (1, 7, 11, 13, 17, 19, 23 and 29), so we have cut memory required to about 27% of the straightforward implementation of the sieve of Eratosthenes. If we use the first 4 primes, we then have M = 210, and of the 210 potential k_{i}we only need to keep 48. This improves memory use, which now is roughly 23% of the original. But you can probably see a law of diminishing returns at wortk here: the improvement in memory use gets smaller and smaller with every prime you add to the calculation of M. You can find some more data regarding this at the end of this link.Of course we are still going to have to sieve each of our subsets, in much the same way we did with the sieve of Eratothenes. But sieving all multiples of a prime p when you have several subsets to consider is anything but trivial. Lets see how to tackle it...

The question to answer is, what integer values of m fulfill the equation M·m + k

_{i}= p·n? Using modular arithmetic, this equation is equivalent to finding M·m = -k_{i}(mod p). This requires calculating M^{-1}(mod p), the modular multiplicative inverse of M modulo p, a subject already treated in a previous post. Given that, we can simply write m_{i}= M^{-1}·(-k_{i}) (mod p), and therefore determine all multiples of p in the subset of k_{i}as m = p·j + m_{i}.Last, but not least, we want to start sieving at the smallest multiple of p in each subset that is larger than or equal to p

^{2}, and so we want to increment m_{i}with p·j_{i}, where j_{i}>= (p^{2}-k_{i}- M·m_{i}) / (m·p).Lets try to put all this ideas together in a python script. To make it work, you will need the

`primeListSofE`

function from this post, as well as the `extEuclideanAlg`

and `modInvEuclid`

ones from this other post...

def primeListWF(n, m, verbose = True) :

"""Returns a list of prime numbers smaller than or equal to n,

using wheel factorization, with a wheel size of the product of

all primes smaller than or equal to m

"""

t = time()

# We use the standard sieve of Eratosthenes first...

primes = primeListSofE(m)

M = 1

for p in primes :

M *= p

if verbose :

print "Size of factorization wheel is",M

# We now sieve out all multiples of these primes,including

# the primes themselves, to get the k[i]

sieve = [True] * M

for p in primes :

sieve[p-1] = False

for j in xrange(p*p,M+1,p) :

sieve[j-1] = False

k = [j+1 for j in range(M) if sieve[j]]

if verbose :

print "Memory use is only",len(k),"/",M,"=",len(k)/M

N = n

if N %M :

N += M

N //= M

maxP = int(sqrt(M*N))

if verbose:

print "Primes will be calculated up to",M*N,"not",n

sieve = [[True]*N for j in range(len(k))]

sieve[0][0] = False # Take care of the non primality of 1

for row in range(N) :

baseVal = M * row

for subset in range(len(k)) :

if sieve[subset][row] :

p = baseVal + k[subset]

primes += [p]

if p > maxP :

continue

# Sieve all multiples of p...

invMp = modInvEuclid(M,p)

for i in range(len(k)) :

mi = invMp * (-k[i] % p)

mi %= p

# ...starting at p**2

ji = (p*p - k[i] - M*mi)

if ji % (M*p) :

ji += M*p

ji //= M*p

mi+= ji*p

for r in range(mi,N,p) :

sieve[j][r] = False

if verbose :

print "Calculated primes to",n,"in",time()-t,"sec."

return primes

You can play around with this algorithm and look at the time required. It probably depends on the value of n, but there seems to be a sweet spot for speed somewhere around m = 7 (and thus M = 210) or m = 11 (M = 2310). This being a general purpose wheel factorization algorithm, it has speeds that are faster than the unoptimized sieve of Eratosthenes code, but it cannot beat the optimized code, which in fact is a particular case of wheel factorization, for m = M = 2.

On the other hand, using m = 7 I have managed to compute all primes up to 10

^{8}, a feat unreachable for the previous code, at least without finding a workaround the`MemoryError`

they produce. 10^{9}is still out of reach, though...
Labels:
number theory,
primes,
python,
sieving

## Friday, March 20, 2009

### Modular Multiplicative Inverse

Now that we are all proficient with modular exponentiation thanks to my previous post, it is time to tackle a more complicated issue, still in the realm of modular arithmetic.

While addition, subtraction and multiplication are "compatible with the congruence relation" introduced by modular arithmetic, the same doesn't happen with division. This means that solving a simple equation such as a·x = 1 (mod m) is anything but trivial. In fact, determining x = a

There is, as always, the brute force approach, which I always like to visit first, so that I can later congratulate myself on how much the efficiency has improved with my deep insight. In this particular case, brute force means trying every number between 1 and m-1, and see if the m-modulo of its product with a is 1. :

Do notice that the possibility of no multiplicative inverse existing is contemplated in the code. This will actually be the case if a and m have any common factors, i.e. if they are not coprime. So the previous code could serve as a moron's approach to determining if two numbers are relatively prime. Not that I recommend it, though, as it should be clear that the code I have just produced will be very inefficient for large values of m.

The Extended Euclidean Algorithm

As stated, we are after a number x that fulfills a·x = 1 (mod m). This can of course be written as well as a·x = 1 + m·y, which rearranges neatly into a·x - m·y = 1. Since x and y need not be positive, we can write it as well in the standard form, a·x + m·y = 1. If a and m are coprime, this is a particular instance of a·x + b·y = gcd(a,b), where gcd(a,b) is the greatest common divisor of a and b. This equatiin is known as Bézout's identity, and can be efficiently solved using the extended Euclidean algorithm. To find a solution to our particular equation, this algorithm would proceed as follows:

The gains in speed are tremendous with this new approach. Again, write your own testing function as an exercise, but mine spits out the following results...

So while brute force is good enough, or even the best choice, for very small values of m, as soon as it grows, the benefits of the more elaborate algorithm are all too obvious. According to wikipedia, the algorithm we have coded has complexity O(log(m)

While addition, subtraction and multiplication are "compatible with the congruence relation" introduced by modular arithmetic, the same doesn't happen with division. This means that solving a simple equation such as a·x = 1 (mod m) is anything but trivial. In fact, determining x = a

^{-1}(mod m) that fulfills that equation is the whole object of this post...There is, as always, the brute force approach, which I always like to visit first, so that I can later congratulate myself on how much the efficiency has improved with my deep insight. In this particular case, brute force means trying every number between 1 and m-1, and see if the m-modulo of its product with a is 1. :

def modInv1(a,m) :

"""Computes the modular multiplicative inverse of a modulo m,

using brute force

"""

a %= m

for x in range(1,m) :

if a*x%m == 1 :

return x

return None

Do notice that the possibility of no multiplicative inverse existing is contemplated in the code. This will actually be the case if a and m have any common factors, i.e. if they are not coprime. So the previous code could serve as a moron's approach to determining if two numbers are relatively prime. Not that I recommend it, though, as it should be clear that the code I have just produced will be very inefficient for large values of m.

The Extended Euclidean Algorithm

As stated, we are after a number x that fulfills a·x = 1 (mod m). This can of course be written as well as a·x = 1 + m·y, which rearranges neatly into a·x - m·y = 1. Since x and y need not be positive, we can write it as well in the standard form, a·x + m·y = 1. If a and m are coprime, this is a particular instance of a·x + b·y = gcd(a,b), where gcd(a,b) is the greatest common divisor of a and b. This equatiin is known as Bézout's identity, and can be efficiently solved using the extended Euclidean algorithm. To find a solution to our particular equation, this algorithm would proceed as follows:

- Rewrite the equation as a
_{n}·x_{n}+ m_{n}·y_{n}= 1, starting with n = 0 - Compute the quotient, q
_{n}, and the remainder, r_{n}, of a_{n}divided by m_{n}, and rewrite the original equation as (m_{n}·q_{n}+ r_{n})·x_{n}+ m_{n}·y_{n}=1 - Rearrange the terms in the equation to get m
_{n}·(q_{n}·x_{n}+y_{n}) + r_{n}·x_{n}= 1 - Do the changes of variables
- x
_{n+1}= q_{n}·x_{n}+y_{n}, - y
_{n+1}= x_{n}, - a
_{n+1}= m_{n,}and - m
_{n+1}= r_{n}

- x
- Repeat the process, until a
_{n}= 1, and m_{n}= 0. - Such equations has the trivial solution x
_{n}= 1, y_{n}= 0. - Work your way back to x
_{0}and y_{0}, noting that- x
_{n-1}= y_{n}, - y
_{n-1}= x_{n}- q_{n-1}·y_{n}

- x
- x
_{0}is the number we are after.

_{n}= 1 and m_{n}= 0. It is in fact not always the case, as if a and m are not coprime, then the values we will arrive at are m_{n}= 0, but a_{n}= gcd(a,m). As a matter of fact, the gcd is normally computed using the Euclidean algorithm, dropping the adjective extended, which basically does the same as above, but without the back-tracking of the x_{n}'s and the y_{n}'s. With all of this in mind, we can write two functions, one calling the other, as follows:

def extEuclideanAlg(a, b) :

"""Computes a solution to a x + b y = gcd(a,b), as well as gcd(a,b)

"""

if b == 0 :

return 1,0,a

else :

x, y, gcd = extEuclideanAlg(b, a % b)

return y, x - y * (a // b),gcd

def modInvEuclid(a,m) :

"""Computes the modular multiplicative inverse of a modulo m,

using the extended Euclidean algorithm

"""

x,y,gcd = extEuclideanAlg(a,m)

if gcd == 1 :

return x % m

else :

return None

The gains in speed are tremendous with this new approach. Again, write your own testing function as an exercise, but mine spits out the following results...

>>> testModInv(1234,5,10000)

Got 4 using method 1 in 0.0160000324249 sec.

Got 4 using Euclidean algorithm in 0.0309998989105 sec.

>>> testModInv(1234,56789,10000)

Got 31800 using method 1 in 59.4000101089 sec.

Got 31800 using Euclidean algorithm in 0.047000169754 sec.

So while brute force is good enough, or even the best choice, for very small values of m, as soon as it grows, the benefits of the more elaborate algorithm are all too obvious. According to wikipedia, the algorithm we have coded has complexity O(log(m)

^{2}), i.e. time required to compute the modular multiplicative inverse is proportional to log(m)^{2}.
Labels:
modular arithmetic,
number theory,
primes,
python

### Modular Exponentiation

I have a post cooking on wheel factorization, yet another sieving algorithm for the computation of all primes below a given number n. It turns out that while implementing it, one comes across a number of not so obvious preliminary questions. It is a rule of this blog not to resort to any non standard library, and I feel it would be breaking the spirit of that rule to post code which hasn't been thoroughly explained in advance. So in order not to make that soon-to-be post on prime numbers longer than anyone could bear, I find myself in need of covering a couple of preliminaries, and so here comes this thrilling post on modular exponentiation...

What we are after is determining a

The problems begin if a and b are both big. Python's integers don't overflow, and we all love it for it, but the memory juggling it has to do to accommodate a huge number in memory does come at a speed cost. Specially if m is small, it definitely seems a little overkill to keep a zillion digit number in memory, given that the answer will be smaller than m...

We can, and will, use modular arithmetic to our advantage. See, wikipedia says that "the congruence relation (introduced by modular arithmetic) on the integers is compatible with the multiplication operation of the ring of integers," and if it's on the internet it has to be true, right? Translated to plain English, the idea is that if we want to take the product of two numbers, a and b, modulo m, we can first multiply them, and then take the m-modulo of the result, or we can take the m-modulo of a and/or b, prior to multiplication, and the m-modulo of the result will still be the right answer. So the least a conscious programmer should do is to rewrite the previous function as:

Of course, if b is large enough, you may still find yourself in the realm of zillion digit numbers, which is a place we don't really want to go. It turns out then that it may be a good option to remember that exponentiation is just repeated multiplication. If we multiply a by itself b times, and take the m-modulo of the result, the larger number we will ever have in memory will be m

The only reason why this last function deserves a place here, is because it helps introduce the really cool algorithm that this whole post really is about, binary exponentiation. In John D. Cook's blog there is a great explanation on how it could be implemented iteratively, based on the binary representation of the exponent, but I'll go for a recursive version, which is essentially the same, but which I find easier for my brain's wiring...

So what we are about to do is, when asked to compute a

Now that we have all four versions of the same function written, we can do a little testing, I will leave writing the testing code as an exercise for the reader (I've been wanting to say this since my first year of university...), but when computing 1234 to the power 5678, modulo 90, a 1000 times with each function, these is the performance I get:

Results which really don't require much commenting...

EDIT: There is some very interesting information in the comments: python comes with a built-in

What we are after is determining a

^{b}(mod m), or to put it in plain English, the remainder of a to the power of b, divided by m. Doesn't seem much, does it? We could just go ahead and write:

def modExp1(a, b, m) :

"""Computes a to the power b, modulo m, directly

"""

return a**b % m

The problems begin if a and b are both big. Python's integers don't overflow, and we all love it for it, but the memory juggling it has to do to accommodate a huge number in memory does come at a speed cost. Specially if m is small, it definitely seems a little overkill to keep a zillion digit number in memory, given that the answer will be smaller than m...

We can, and will, use modular arithmetic to our advantage. See, wikipedia says that "the congruence relation (introduced by modular arithmetic) on the integers is compatible with the multiplication operation of the ring of integers," and if it's on the internet it has to be true, right? Translated to plain English, the idea is that if we want to take the product of two numbers, a and b, modulo m, we can first multiply them, and then take the m-modulo of the result, or we can take the m-modulo of a and/or b, prior to multiplication, and the m-modulo of the result will still be the right answer. So the least a conscious programmer should do is to rewrite the previous function as:

def modExp2(a, b, m) :

"""Computes a to the power b, modulo m, a little less directly

"""

return (a%m)**b % m

Of course, if b is large enough, you may still find yourself in the realm of zillion digit numbers, which is a place we don't really want to go. It turns out then that it may be a good option to remember that exponentiation is just repeated multiplication. If we multiply a by itself b times, and take the m-modulo of the result, the larger number we will ever have in memory will be m

^{2}. It doesn't really seem right to put a loop of b iterations into the code, since what we are worried about are large b's, but lets do it anyway:

def modExp3(a, b, m) :

"""Computes a to the power b, modulo m, by iterative multiplication

"""

ret = 1

a %= m

while b :

ret *= a

ret %= m

b -= 1

return ret

The only reason why this last function deserves a place here, is because it helps introduce the really cool algorithm that this whole post really is about, binary exponentiation. In John D. Cook's blog there is a great explanation on how it could be implemented iteratively, based on the binary representation of the exponent, but I'll go for a recursive version, which is essentially the same, but which I find easier for my brain's wiring...

So what we are about to do is, when asked to compute a

^{b}, check if b is even, in which case we will compute it as the square of a^{b/2}. If b is odd, we will compute it as a a^{b-1}. If b is 0, we will of course return 1. If we also throw in all the modulo m stuff I have been discussing previously, we finally arrive at THE function to do modular exponentiation:

def modExp(a, b, m) :

"""Computes a to the power b, modulo m, using binary exponentiation

"""

a %= m

ret = None

if b == 0 :

ret = 1

elif b%2 :

ret = a * modExp(a,b-1,m)

else :

ret = modExp(a,b//2,m)

ret *= ret

return ret%m

Now that we have all four versions of the same function written, we can do a little testing, I will leave writing the testing code as an exercise for the reader (I've been wanting to say this since my first year of university...), but when computing 1234 to the power 5678, modulo 90, a 1000 times with each function, these is the performance I get:

>>> testModExp(1234,5678,90,1000)

Got 46 using method 1 in 3.73399996758 sec.

Got 46 using method 2 in 0.546999931335 sec.

Got 46 using method 3 in 1.25 sec.

Got 46 using THE method in 0.0150001049042 sec.

Results which really don't require much commenting...

EDIT: There is some very interesting information in the comments: python comes with a built-in

`pow````
function, which can take two or three arguments, the third being, precisely, the modulo. It probably implements an algorithm similar to what has been presented, but is more than an order of magnitude faster.
```

```
Labels:
modular arithmetic,
number theory,
python
```

```
```

```
```

```
```## Monday, March 16, 2009

###
Prime Numbers 2. The Sieve of Erathostenes

The problem with the algorithms described in the previous post lies on the very large amount of redundant testing that goes on. Because once we have checked the primality of a certain number p, we know that 2p, 3p, 4p, ..., np are composites. So what is the point of checking those numbers at all? When you follow this line of thinking to the very end, it turns out you don't have to do any trial division at all! The resulting algorithm is attributed to Erathostenes, and it employs a technique, known as sieving, which may come in handy in many other problems. But enough for a presentation, lets get going with the Sieve of Erathostenes.

All the information we need to begin with, is that 2 is the first prime. This means, following the previous approach, that no other even number can be prime. To keep track of this information, we begin by writing a list of all the numbers between 2 and the desired limit, say N:

We now draw a circle around 2, because it is prime, and then strike out all even numbers, i.e. all multiples of 2:

Now, there is a good and a bad way to strike out all multiples of a number from a list. The wrong way is to do trial division to find out if it is a multiple. The right way is to use the fact that two consecutive multiples of a number differ by that exact number. Writing it out in python, you can check the tremendous gain for yourself:

from time import time

def strikeOut(mul, n) :

# note that list index i corresponds to number i+2

sieve = [True for j in xrange(2,n+1)]

# The bad way to sieve

t = time()

for j in xrange(2,n+1) :

if j % mul == 0 :

sieve[j-2] = False

t = time() - t

print "Sieved (bad!!!) all multiples of",mul,"below",n,"in",t,"sec."

sieve = [True for j in xrange(2,n+1)]

# The good way to sieve

t = time()

for j in xrange(mul,n+1,mul) :

sieve[j-2] = False

t = time() - t

print "Sieved (good!!!) all multiples of",mul,"below",n,"in",t,"sec."

No, I'm not taking you, dear reader, for an idiot: you can find sieving algorithms on fist pages of google searches that do sieving this very wrong way...

But lets get back to primes. We had our list, with 2 circled and all its multiples crossed out. The first unmarked number we come across is 3. Just a coincidence that 3 is a prime number? Not really: it it is unmarked because it is not a multiple of any smaller number. Or to put it another way: it is only divisible by 1 and itself, which is hte very definition of prime... So we circle 3 and cross out all its multiples:

I've underlined the multiples of three that where already crossed out. The first number which gets crossed out for the first time is 9. Will it be just a coincidence that 9 = 3^{2}? Of course not! We have already crossed out all multiples of numbers smaller than 3, and 3^{2} is the first multiple of 3 which is not a multiple of any smaller number.

So we keep going down our list, and find 5 to be the next unmarked number in the list, which of course is prime, so we circle it, and then cross out all of its multiples. But as we saw before, rather than starting at 10, and crossing out the already crossed out 10, 15 and 20, we proceed directly to 5^{2}, that is 25.

This goes on until we've searched the whole list, circling numbers and crossing out its multiples. But since we start crossing out at the square of the new found prime, we only need to do the crossing out for primes smaller than the square root of N: once we get there, any uncrossed number up to N will be prime. This is what the beginning of our list will look like once we are through:

So lets try and write our very own sieve of Erathostenes in python, with all this square and square root tricks pointed out above...

from time import time

from math import sqrt

def primeListSofE(n, verbose = True) :

t = time()

sieve = [True for j in xrange(2,n+1)]

for j in xrange(2,int(sqrt(n))+1) :

i = j-2

if sieve[i]:

for k in range(j*j,n+1,j) :

sieve[k-2] = False

ret = [j for j in xrange(2,n+1) if sieve[j-2]]

if verbose :

print "Calculated primes to",n,"in",time()-t,"sec."

return ret

Play around with it and once more feel the thrill of speed. On my computer it does all primes below 1,000,000 almost 5 times faster than the best trial division approach from the previous post. But the real beauty is in how the time now scales with n, because you can no longer find a suitable fit of the form n^{q}. We haven't simply reduced the size of the exponent, but are treading uncharted territory, as it turns out that the time required to calculate all primes below using this sieve requires (n.log(n))(log(log(n))) time. Actually, for the first 10^{10} numbers, this is basically indistinguishable from linear time, i.e proportional to n.

This speed does come at a memory cost, though: you need to store an array of size n. And while the algorithm could produce all primes below 10^{10} in just a few minutes, the program will very likely crash giving an out of memory error. Some memory saving can be done by not storing even numbers at all. Without boring you with the details, below is the python function I use to compute primes at work, which does 10^{7} almost three times faster than the previous one, and uses half the memory, but still crashes when asked for all primes below 10^{9}...

from time import time

from math import sqrt

def primeListSofE(n, verbose = False) :

"""Returns a list of prime numbers smaller than or equal to n,

using an optimized sieve of Erathostenes algorithm.

"""

t = time()

maxI = (n-3) // 2

maxP = int(sqrt(n))

sieve = [True for j in xrange(maxI+1)]

for p in xrange(3,maxP+1,2) :

i = (p-3) // 2

if sieve[i] :

i2 = (p*p-3) // 2

for k in xrange(i2,maxI+1,p) :

sieve[k] = False

ret = [2]+[2*j+3 for j in xrange(len(sieve)) if sieve[j]]

if verbose :

print "Calculated primes to",n,"in",time()-t,"sec."

return ret

There are slightly faster ways of sieving, and we could memoize precalculated primes, to speed up repeated calls to the function. But this is probably enough for today's post...
Labels:
number theory,
primes,
python,
sieving
## Thursday, March 5, 2009

###
Prime Numbers 1. The Wrong Way

A prime number is a natural number larger than 1 which is only divisible by 1 and itself. Simple, isn't it? Euclid already proved quite a while ago that there are infinitely many of them, so lets try and write a function that returns all primes below a given number, using the most straightforward approach. Whta would that be? Well, there is nothing less sophisticated than try and divide each number by every integer larger than 1 and smaller than the number itself, and if we never get a zero remainder then the number is prime. Putting it in python:

from time import time

def primeList1(n, verbose = True) :

t = time()

ret = []

for j in range(2,n+1) :

jIsPrime = True

for k in range(2,j) :

if j % k == 0 :

jIsPrime = False

break

if jIsPrime :

ret += [j]

if verbose :

print "Calculated primes to",n,"in",time()-t,"sec."

return ret

Do notice the feeble attempt at code optimization, with the `break`

statement that ends the loop as soon as a divisor is found... If you try it out, it will work fine. Use it to calculate all prime numbers below 1,000, and it will probably give you an instant answer. 10,000 delays a second or two, nothing too worrying. If you try it with 100,000, you'll have time to take your coffee break before it finishes, and adding any more zeroes will require leaving your script running overnight... Not too surprising really, since to determine that, for example, 9973 is indeed a prime number, our program is having to do trial division by every number between 2 and 9972, both inclusive. All in all, the time required grows with n^{2}.

There are several very simple improvements that can speed things up dramatically, and which require a little, but just a little, mathematical insight...

First, we don't need to test every number between 2 and 9972 to determine that 9973 is prime; we can stop after the largest integer smaller than the square-root of 9973, i.e. 99. Why? Well, if a number is divisible by a number larger than its square-root, it will also be divisible by the quotient of that division, which is necesaarily smaller than the square-root. And since we have already tried (and failed) to divide by those smaller numbers, there is no need to keep going...

The resulting code is very similar to the previous one:

from time import time

from math import sqrt

def primeList2(n, verbose = True) :

t = time()

ret = []

for j in range(2,n+1) :

jIsPrime = True

kMax = int(sqrt(j))

for k in range(2,kMax+1) :

if j % k == 0 :

jIsPrime = False

break

if jIsPrime :

ret += [j]

if verbose :

print "Calculated primes to",n,"in",time()-t,"sec."

return ret

Do try it, and feel the thrill of the new speed of the algorithm. If you are after all primes below 10,000, this new function will probably get them for you... a hundred times faster!!! If you try it for larger numbers, you will find even better speed increases, because the time required by our algorithm now grows with n^{1.4}.

But we are still doing a lot of superfluous checking. Think again of 9973. Once we have tried to divide it by 2 and failed, there is no point in checking with 4, 6, 8 or any other multiple of 2. So yes, we could double the speed of our algorithm by checking only odd numbers, but lets spend a little more time thinking before going back to coding... Because, after failing with 2, once we turn our attention to 3 and again fail, what's the point in trying 6, 9, 12 or any other multiple of 3? If you keep the reasoning going, it becomes obvious that you only need to do trial division by prime numbers, since by the time we test whether a composite number divides 9973, we have already tested all its prime factors, and if those have failed, this one will as well.

Now wait a minute! Isn't finding whether a number is prime or not precisely what we are after? How are we going to figure out what numbers to do trial division with? Well, we already saw in our first optimization that our division trials only need to go up to the square root of the number being tested, so we only need to know the first prime number, 2, and afterwards we can use the list of prime numbers that we are building to generate the additional items... The code to do that would look something like this:

from time import time

from math import sqrt

def primeList3(n, verbose = True) :

t = time()

ret = []

for j in range(2,n+1) :

jIsPrime = True

kMax = sqrt(j)

for k in ret :

if k > kMax :

break

if j % k == 0 :

jIsPrime = False

break

if jIsPrime :

ret += [j]

if verbose :

print "Calculated primes to",n,"in",time()-t,"sec."

return ret

If you test this code you'll surely notice the new speed increase. This last optimization doesn't make it a hundred times faster, but on my computer the new code is 3.5 faster than the previous one when calculating all primes below 100,000 and 5.5 times faster when calculating all primes below 1,000,000. And again, what really makes the new code more efficient is not that 4x increase in speed, but the fact that, the larger our input number gets, the better the new code is. It seems that now time required grows with n^{1.25}...

THis last version holds the fundamental ideas of really fast versions, i.e. skipping multiples of primes, but still relies on the slow ans tie consuming operation of trial division. Getting rid of it is the next optimization, which we will see in the next post.
## Wednesday, March 4, 2009

###
Excusatio Non Petita

The year probably was 1994, so I guess its understandable that I don't exactly recall how did I come across a copy of Numerical Recipes in C: The Art of Scientific Computing. But while skimming through the pages, I eventually got to the part on evaluating polynomials, and read something that struck me very deeply. The paragraph that described what I now know is Horner's scheme read as follows:

We assume that you know enough never to evaluate a polynomial this way:

p=c[0]+c[1]*x+c[2]*x*x+c[3]*x*x*x +c[4]*x*x*x*x;

or (even worse!),

p=c[0]+c[1]*x+c[2]*pow(x,2.0)+c[3]*pow(x,3.0)+c[4]*pow(x,4.0);

Come the (computer) revolution, all persons found guilty of such criminal behavior will be summarily executed, and their programs won’t be! It is a matter of taste, however, whether to write

p=c[0]+x*(c[1]+x*(c[2]+x*(c[3]+x*c[4])));

or

p=(((c[4]*x+c[3])*x+c[2])*x+c[1])*x+c[0];

Boy was I guilty! Who would have ever thought there was another way of evaluating a polynomial than the obvious one? And that it would be so much more elegant and efficient?

Would be nice if it was just a polynomial thing, but of course it isn't, and the more complicated the math you are trying to get your computer to do, the larger the chance you are doing it terribly bad. And although much has been written about algorithms, the gospel doesn't seem to be spreading too swiftly: as I am writing this, the first match for a search of "prime numbers python " that google returns is this. And he is not alone...

So what am I in for? The idea is to bridge the gap between straightforward brute force (a sin I have been as guilty as anyone else of) and the elegant simplicity of efficient code. Hopefully, this will allow me to escape the guillotine when the Jacobins take over...

There are some simple rules that I intend to follow, at least until I decide not to:

**Uncomplicated complexity.** The math content will be kept reasonably simple. Or maybe simple isn't the word, as things may get utterly complex. But I will try to stay out of unnecessary complications, so a high school diploma should be all you need to come along on the ride...

**Plain python.** Sure, C++ is a zillion times faster, but python takes care of most gory details, and makes it easier to concentrate on the algorithm. Plus, it's so cool and trendy!

**No libraries**. Sage, SciPy, PARI, qhull, FFTW... whatever it is you are trying to achieve computationally, there is a faster way than replicating the code you find in this blog. And interesting as that may be, it would take a lot of the charm of writing this blog, if it turned into a description of somebody else's API. So yes, we will even code our own FFTs!

```
Subscribe to:
Posts (Atom)
```

```
```