A Look at R Algo Optimization: Sieve of Eratosthenes

In the course of working on problem #3 at Project Euler, I decided to take a detour to do a comparison of various implementations of the sieve of Eratosthenes. The problem asks:

What is the largest prime factor of the number 600851475143 ?

I decided to begin approaching this by generating all primes less than this. So how do you generate the primes? The Sieve of Eratosthenes jumps to mind:

To find all the prime numbers less than or equal to a given integer n by Eratosthenes’ method:

Create a list of consecutive integers from 2 through n: (2, 3, 4, …, n).

Initially, let p equal 2, the smallest prime number.

Enumerate the multiples of p by counting to n from 2p in increments of p, and mark them in the list (these will be 2p, 3p, 4p, …; the p itself should not be marked).

Find the first number greater than p in the list that is not marked. If there was no such number, stop. Otherwise, let p now equal this new number (which is the next prime), and repeat from step 3.

When the algorithm terminates, the numbers remaining not marked in the list are all the primes below n.

Now, 6e11 is a pretty big number but… To put this in perspective, my computer has 16 GB of memory, which is only 16e9 bytes… but whatever (now that I’m this far, its obvious this is the long way around, but I’m having fun!). I have to admit that at this point I was feeling a bit intimidated, and decided to google “sieve of eratosthenes R” to get started on the right foot.

First hit:

sieve_1 <- function(n){
  values <- rep(TRUE, n)
    values[1] <- FALSE
    last.prime <- 2
    for(i in last.prime:sqrt(n)){
        values[seq.int(2 * last.prime, n, last.prime)] <- FALSE
        last.prime <- last.prime + min(which(values[(last.prime + 1) : n]))
    }
    return(which(values))
}

And a more refined one within this second hit:

sieve_2 <- function(n)
{
  n <- as.integer(n)
  if(n > 1e9) stop("n too large")
  values <- rep(TRUE, n)
  values[1] <- FALSE
  last.prime <- 2L
  fsqr <- floor(sqrt(n))
  while (last.prime <= fsqr)
  {
    values[seq.int(2L*last.prime, n, last.prime)] <- FALSE
    sel <- which(values[(last.prime+1):(fsqr+1)])
    if(any(sel)){
      last.prime <- last.prime + min(sel)
    }else last.prime <- fsqr+1
  }
  which(values)
}

(note that I’ve modified the code in both to use the same names to make the remaining differences clear)

A quick performance comparison:

library("microbenchmark")

microbenchmark(
  sieve_1(1e4),
  sieve_2(1e4),
  times = 10
)

## Unit: microseconds
##            expr      min       lq      mean   median        uq      max
##  sieve_1(10000) 7487.455 8007.842 16901.255 8716.036 17497.713 51967.28
##  sieve_2(10000)  160.821  163.374  2107.552  200.388   222.086 18588.81
##  neval
##     10
##     10

Suffice it to say, the second one is way better. sieve_2 is running about 40X faster. I tried other values, and the differences are magnified the higher you get. Now, the two implementations are actually very similar, so what are the differences that are creating these large performance differences? Things I see:

sieve_2 uses a while loop instead of a for loop. This is important because you don’t actually know how many iterations are going to be necessary before you start. sieve_1 includes a line that looks like maybe it is intended to jump forward when a prime is identified to the next non-prime, but it actually doesn’t because once the for loop starts, you can’t modify its counter.

A modified version of sieve_1 to instead use a while loop might look like this:

sieve_1_v2 <- function(n){
  values <- rep(TRUE, n)
    values[1] <- FALSE
    last.prime <- 2
    
    while(last.prime < sqrt(n)){
        values[seq.int(2 * last.prime, n, last.prime)] <- FALSE
        last.prime <- last.prime + min(which(values[(last.prime + 1) : n]))
    }
    return(which(values))
}

microbenchmark(
  sieve_1(1e4),
  sieve_1_v2(1e4),
  times = 10
)

## Unit: milliseconds
##               expr       min       lq      mean    median        uq
##     sieve_1(10000) 15.632049 17.20051 18.312319 17.693178 18.890762
##  sieve_1_v2(10000)  3.908924  5.04342  6.462725  5.063113  6.058668
##       max neval
##  23.15160    10
##  16.55321    10

This minor change has created an improvement of a factor of 3. To explain this, I tracked how many times each one iterates.

## [1] "sieve_1 iterated 99 times."

## [1] "sieve_1_v2 iterated 25 times."

## [1] "sieve_2 iterated 25 times."

I’m assuming there’s a shmancy way to do this, but I just inserted counters inside the loops and then printed at the end. It was a little annoying, because I didn’t want to mess up the functions, so I made alternate “tracked” versions of them. From here, what else can we improve?

Between the improved version and sieve_2, the better one pre-stores some values that are re-used. For example, sieve_2 only calculates sqrt(n) a single time prior to the loop beginning, whereas sieve_1_v2 does it for each iteration of the loop.
sieve_2 also specifically labels some of the variables as integers that might otherwise be doubles. This is a little awkward, because one strength of R is that you don’t need to worry about types, but I guess once you start optimizing things, you actually do.
The formulas in sieve_2 seem a little bit different, like some pieces have been factored out to avoid repetition. This will take some closer looking. But I don’t have the time for these right now.

Now, none of these are going to be good enough for Euler #3. To get all the primes up to where we need, we either need to split up the task into bits, or a different approach that works better for big numbers, or a different strategy entirely. But all that is probaby for another time. I would like to say that I will update this post with some enhancements later on, but I don’t fully know if blogdown lets you do that.

There are also some optimizations of the sieve algorithm. For example, you can start by omitting the even numbers, and I think something to do with multiples of 30. I don’t immediately know how that would interect with the strategy these implementations are using of using a vector of logicals. More things to explore!

P.S., I wanted to highlight the specific changes that I made within the code, but I don’t know how to do that within rmarkdown. There must be a way. Let me know if you know what it is! Or if you know a good way to go about updating posts. Like, do you need to change both the date field and the filename? Is there a way to use commit references from git? I dunno, but I can tell there are things to learn there.