Thursday, April 9, 2009

Naive Integer Factorization

After three posts (1, 2, 3) on calculating prime numbers, it is probably worth putting that knowledge to a more useful task. As we will see in the near future, integer factorization, i.e. breaking down a (composite) number into its prime factors, is one such task. Strictly speaking, factoring a number n simply means decomposing it as the product of two smaller non-trivial divisors, i.e. divisors different from 1 and n itself. But by repeatedly factoring the divisors one will eventually come up with a unique collection of primes (possibly repeated) which, when multiplied together, render the original number, or so says the fundamental theorem of arithmetic... The point is, we will consider factorization a synonym of prime decomposition, be it formally correct or not.
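To make the "repeatedly factoring the divisors" idea concrete, here is a minimal sketch written for Python 3.8+ (it relies on `math.isqrt`). The names `split` and `decompose` and the `prefer_large` flag are made up for this illustration and do not come from the earlier posts. It splits a number into two non-trivial divisors, recurses on both, and shows that the choice of split does not change the final list of primes.

```python
from math import isqrt

def split(n, prefer_large=False):
    """Return non-trivial divisors (a, b) with a * b == n, or None if there are none."""
    candidates = range(2, isqrt(n) + 1)
    if prefer_large:
        # Try the divisors closest to sqrt(n) first, to get a different split.
        candidates = reversed(candidates)
    for a in candidates:
        if n % a == 0:
            return a, n // a
    return None

def decompose(n, prefer_large=False):
    """Keep splitting the divisors of n (n >= 2) until only primes are left."""
    pair = split(n, prefer_large)
    if pair is None:
        return [n]  # no non-trivial divisors: n is prime
    a, b = pair
    return sorted(decompose(a, prefer_large) + decompose(b, prefer_large))

print(decompose(360))                     # [2, 2, 2, 3, 3, 5]
print(decompose(360, prefer_large=True))  # [2, 2, 2, 3, 3, 5]
```

The first call starts from 360 = 2 · 180 and the second from 360 = 18 · 20, yet both end at the same primes, as the fundamental theorem of arithmetic promises.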

There are some very sophisticated methods to factor very large numbers, but they use a lot of extremely complex math, so I doubt they will ever find their way onto this blog. So we are going to be left with the naive, straightforward approach as our only option, although I will try to give it an efficiency boost. What is this naive approach? Trial division, of course: given a composite number n, we know that its smallest factor is no larger than the square root of n, so we can simply try and see if any of the numbers up to that bound divide it. No, I will not try to code that yet... If you have read the entries on determining prime numbers, it should come as no surprise that we really do not need to do trial division by all numbers up to the square root of n, but only by the primes among them. This is a consequence of the fact that, if a composite number divides n, then each of the prime factors of that composite number will also divide n. According to the prime number theorem the number of primes below x is asymptotic to x / log x. Since log n^(1/2) = (log n) / 2, limiting our trials to prime numbers reduces the number of tests from n^(1/2) to something around 2 n^(1/2) / log n. If we rescue the primeListSofE function from the post on the sieve of Eratosthenes, a python implementation of naive factorization could look something like this...

from time import clock

def factor(n, verbose = False) :
    """Returns all prime factors of n, using trial division by prime
    numbers only. Returns a list of (possibly repeating) prime factors
    """
    t = clock()
    ret = []
    nn = n
    maxFactor = int(n**0.5)
    primes = primeListSofE(maxFactor, verbose)
    for p in primes :
        while nn % p == 0 :
            nn //= p
            ret += [p]
        if nn == 1 :
            break
    if nn != 1 :
        ret += [nn]
    t = clock() - t
    if verbose :
        print "Calculated factors of",n,"in",t,"sec."
    return ret
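Since primeListSofE lives in another post, the listing above does not run on its own. The following is a self-contained sketch of the same idea for Python 3.8+, assuming a plain sieve of Eratosthenes as a stand-in: `primes_up_to` and `factor_by_primes` are names invented here, not the functions from the earlier posts. It also compares the number of primes actually tried with the 2 n^(1/2) / log n estimate.

```python
from math import isqrt, log

def primes_up_to(limit):
    """Plain sieve of Eratosthenes: all primes <= limit (stand-in for primeListSofE)."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, isqrt(limit) + 1):
        if is_prime[p]:
            # Multiples below p*p were already crossed out by smaller primes.
            for k in range(p * p, limit + 1, p):
                is_prime[k] = False
    return [p for p, flag in enumerate(is_prime) if flag]

def factor_by_primes(n):
    """Prime factors of n >= 2, with repetition, by trial division by primes <= sqrt(n)."""
    factors = []
    remaining = n
    for p in primes_up_to(isqrt(n)):
        while remaining % p == 0:
            remaining //= p
            factors.append(p)
        if remaining == 1:
            break
    if remaining != 1:
        # No prime <= sqrt(n) divides what is left, so it must itself be prime.
        factors.append(remaining)
    return factors

print(factor_by_primes(360))  # [2, 2, 2, 3, 3, 5]

# Number of primes we would try, against the prime number theorem estimate.
for n in (10**6, 10**8, 10**10, 10**12):
    trial_primes = len(primes_up_to(isqrt(n)))
    estimate = 2 * isqrt(n) / log(n)
    print(n, isqrt(n), trial_primes, round(estimate))
```

For n = 10^6 the sieve yields the 168 primes below 1000, against an estimate of about 145, so the estimate runs a little low at these sizes but gets the order of magnitude right.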

While this function will be about as good as we can make it for numbers which are the product of two large prime factors, it will be terribly inefficient for most numbers. Consider, as an extreme example, that we are trying to factor 2^55 ≈ 3.6·10^16. We would first calculate all primes up to 1.9·10^8, a challenging feat in itself with our available tools, only to find out that we only needed the first of those primes, i.e. 2. Taking into account that 50% of all numbers are divisible by 2, 33% are divisible by 3, 20% are divisible by 5... it doesn't seem wise to disregard the potential time savings. What we can do to profit from this is to do the trial division checks at the same time as we determine the prime numbers, updating the largest prime to test on the fly. This has to be done in two stages: the first while we sieve up to n^(1/4), the second while we scan the rest of the sieve, up to n^(1/2), for more primes. The following Franken-code has been written mostly by cut-and-paste from primeListSofE and factor, which hopefully hasn't affected its readability much:

from time import clock

def factorAndSieve(n, verbose = False) :
    """Returns all prime factors of n, using trial division while sieving
    for primes. Returns a list of (possibly repeating) prime factors
    """
    t = clock()
    ret = []
    nn = n
    while nn % 2 == 0 : # remove 2's first, as 2 is not in sieve
        nn //= 2
        ret += [2]
    maxFactor = int(nn**0.5)
    maxI = (maxFactor-3) // 2
    maxP = int(maxFactor**0.5)
    sieve = [True for j in xrange(maxI+1)]
    i = 0
    for p in xrange(3, maxP+1,2) : # we then sieve as far as needed
        if p > maxP :
            break
        i = (p-3) // 2
        if sieve[i] :
            while nn % p == 0 :
                nn //= p
                ret += [p]
                maxFactor = int(nn**0.5)
                maxI = (maxFactor-3) // 2
                maxP = int(maxFactor**0.5)
            if nn == 1 :
                break
            else :
                i2 = (p*p - 3) // 2
                for k in xrange(i2, maxI+1, p) :
                    sieve[k] = False
    index = i
    for i in xrange(index, maxI+1) : # and inspect the rest of the sieve
        if i > maxI :
            break
        if sieve[i] :
            p = 2*i + 3
            while nn % p == 0 :
                nn //= p
                ret += [p]
                maxFactor = int(nn**0.5)
                maxI = (maxFactor-3) // 2
                maxP = int(maxFactor**0.5)
            if nn == 1 :
                break
    if nn != 1 :
        ret += [nn]
    t = clock() - t
    if verbose :
        print "Calculated factors of",n,"in",t,"sec."
        print "Stopped trial division at",2*i+3,"instead of",int(n**0.5)
    return ret
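The key trick in factorAndSieve, shrinking the search bound every time a factor is removed, is easier to see with the sieve stripped away. The sketch below is a deliberate simplification for Python 3, not the code above: it tries 2 and then every odd number instead of only primes, and `factor_shrinking_bound` is a name made up for this example. It returns the factors together with the candidate at which the search stopped.

```python
def factor_shrinking_bound(n):
    """Factor n >= 2 by trial division, re-checking the bound after every division.

    Returns (factors, stopped_at), where stopped_at is the first candidate
    whose square exceeds what is left of n.
    """
    factors = []
    remaining = n
    p = 2
    # The bound is tested against the *remaining* cofactor, not the original n,
    # so every factor we remove brings the end of the search closer.
    while p * p <= remaining:
        while remaining % p == 0:
            remaining //= p
            factors.append(p)
        # Composite candidates never divide: their prime factors are already gone.
        p += 1 if p == 2 else 2
    if remaining != 1:
        factors.append(remaining)  # the leftover cofactor is prime
    return factors, p

factors, stopped_at = factor_shrinking_bound(2**55)
print(len(factors), stopped_at)          # 55 3

print(factor_shrinking_bound(1000003))   # ([1000003], 1001)
```

A power of 2 stops right after the first candidate, while a prime forces the search all the way to its square root, which is why the gain varies so much from one number to the next.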

This new code will very often be much faster than the first one, but at times it will be just about as slow, or even slower, since mixing both codes introduces some inefficiencies. The extremes are a prime number or the square of a prime number on one side, where nothing is saved, and a power of 2 on the other, where almost everything is.



The graph above plots times to calculate the factors of numbers between 10^6 and 10^6 + 100. Prime numbers in this interval stick out as the red dots among the blue ones: 10^6 + 3, + 33, the twin primes + 37 and + 39, + 81 and + 99. And numbers with many small prime factors populate the bottom of the red cloud.
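The list of primes in that interval is easy to double-check. A minimal sketch for Python 3.8+, assuming a plain trial-division primality test (`is_prime` is a helper written for this example, not one from the earlier posts):

```python
from math import isqrt

def is_prime(n):
    """Trial-division primality test, good enough for numbers around 10**6."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

base = 10**6
# Offsets k in [0, 100] for which base + k is prime.
offsets = [k for k in range(101) if is_prime(base + k)]
print(offsets)  # [3, 33, 37, 39, 81, 99]
```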

If the above graph is not enough to convince you of the benefits of the second approach, maybe these timings for very large numbers will:

>>> factor(10**15+37,True)
Calculated primes to 31622776 in 6.760 sec.
Calculated factors of 1000000000000037 in 8.466 sec.
[1000000000000037L]
>>> factorAndSieve(10**15+37,True)
Calculated factors of 1000000000000037 in 8.666 sec.
Stopped trial division at 31622775 instead of 31622776
[1000000000000037L]

>>> factor(2**55,True)
Calculated primes to 189812531 in 42.811 sec.
Calculated factors of 36028797018963968 in 43.261 sec.
[2, ..., 2]
>>> factorAndSieve(2**55,True)
Calculated factors of 36028797018963968 in 8.632e-05 sec.
Stopped trial division at 3 instead of 189812531
[2, ..., 2]
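A note for anyone repeating these measurements today: the listings use time.clock, which was deprecated in Python 3.3 and removed in Python 3.8. A minimal sketch of an equivalent timing wrapper for Python 3 follows, using time.perf_counter; the helper name `timed` is an assumption made for this example.

```python
from time import perf_counter

def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = func(*args)
    elapsed = perf_counter() - start
    return result, elapsed

# Example with a builtin; any factoring function can be passed the same way.
result, seconds = timed(pow, 2, 55)
print(result, "computed in", seconds, "sec.")
```

Keeping the timing outside the factoring functions also avoids charging them for the cost of printing.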
