In the course of working on problem #3 at Project Euler, I decided to take a detour to do a comparison of various implementations of the sieve of Eratosthenes. The problem asks:

What is the largest prime factor of the number 600851475143 ?

I decided to begin approaching this by generating all primes less than this. So how do you generate the primes? The Sieve of Eratosthenes jumps to mind:

To find all the prime numbers less than or equal to a given integer n by Eratosthenes’ method:

- Create a list of consecutive integers from 2 through n: (2, 3, 4, …, n).
- Initially, let p equal 2, the smallest prime number.
- Enumerate the multiples of p by counting to n from 2p in increments of p, and mark them in the list (these will be 2p, 3p, 4p, …; the p itself should not be marked).
- Find the first number greater than p in the list that is not marked. If there was no such number, stop. Otherwise, let p now equal this new number (which is the next prime), and repeat from step 3.
- When the algorithm terminates, the numbers remaining not marked in the list are all the primes below n.

Now, 6e11 is a pretty big number but… To put this in perspective, my computer has 16 GB of memory, which is only 16e9 bytes… but *whatever* (now that I’m this far, its obvious this is the long way around, but I’m having *fun*!). I have to admit that at this point I was feeling a bit intimidated, and decided to google “sieve of eratosthenes R” to get started on the right foot.

```
sieve_1 <- function(n){
values <- rep(TRUE, n)
values[1] <- FALSE
last.prime <- 2
for(i in last.prime:sqrt(n)){
values[seq.int(2 * last.prime, n, last.prime)] <- FALSE
last.prime <- last.prime + min(which(values[(last.prime + 1) : n]))
}
return(which(values))
}
```

And a more refined one within this second hit:

```
sieve_2 <- function(n)
{
n <- as.integer(n)
if(n > 1e9) stop("n too large")
values <- rep(TRUE, n)
values[1] <- FALSE
last.prime <- 2L
fsqr <- floor(sqrt(n))
while (last.prime <= fsqr)
{
values[seq.int(2L*last.prime, n, last.prime)] <- FALSE
sel <- which(values[(last.prime+1):(fsqr+1)])
if(any(sel)){
last.prime <- last.prime + min(sel)
}else last.prime <- fsqr+1
}
which(values)
}
```

(note that I’ve modified the code in both to use the same names to make the remaining differences clear)

A quick performance comparison:

```
library("microbenchmark")
microbenchmark(
sieve_1(1e4),
sieve_2(1e4),
times = 10
)
```

```
## Unit: microseconds
## expr min lq mean median uq max
## sieve_1(10000) 7487.455 8007.842 16901.255 8716.036 17497.713 51967.28
## sieve_2(10000) 160.821 163.374 2107.552 200.388 222.086 18588.81
## neval
## 10
## 10
```

Suffice it to say, the second one is way better. sieve_2 is running about 40X faster. I tried other values, and the differences are magnified the higher you get. Now, the two implementations are actually *very similar*, so what are the differences that are creating these large performance differences? Things I see:

- sieve_2 uses a
*while loop instead of a for loop*. This is important because you don’t actually know how many iterations are going to be necessary before you start. sieve_1 includes a line that looks like maybe it is intended to jump forward when a prime is identified to the next non-prime, but it actually doesn’t because once the for loop starts, you can’t modify its counter.

A modified version of sieve_1 to instead use a while loop might look like this:

```
sieve_1_v2 <- function(n){
values <- rep(TRUE, n)
values[1] <- FALSE
last.prime <- 2
while(last.prime < sqrt(n)){
values[seq.int(2 * last.prime, n, last.prime)] <- FALSE
last.prime <- last.prime + min(which(values[(last.prime + 1) : n]))
}
return(which(values))
}
microbenchmark(
sieve_1(1e4),
sieve_1_v2(1e4),
times = 10
)
```

```
## Unit: milliseconds
## expr min lq mean median uq
## sieve_1(10000) 15.632049 17.20051 18.312319 17.693178 18.890762
## sieve_1_v2(10000) 3.908924 5.04342 6.462725 5.063113 6.058668
## max neval
## 23.15160 10
## 16.55321 10
```

This minor change has created an improvement of a factor of 3. To explain this, I tracked how many times each one iterates.

`## [1] "sieve_1 iterated 99 times."`

`## [1] "sieve_1_v2 iterated 25 times."`

`## [1] "sieve_2 iterated 25 times."`

I’m assuming there’s a shmancy way to do this, but I just inserted counters inside the loops and then printed at the end. It was a little annoying, because I didn’t want to mess up the functions, so I made alternate “tracked” versions of them. From here, what else can we improve?

Between the improved version and sieve_2, the better one pre-stores some values that are re-used. For example, sieve_2 only calculates sqrt(n) a single time prior to the loop beginning, whereas sieve_1_v2 does it for each iteration of the loop.

sieve_2 also specifically labels some of the variables as integers that might otherwise be doubles. This is a little awkward, because one strength of R is that you don’t need to worry about types, but I guess once you start optimizing things, you actually do.

The formulas in sieve_2 seem a little bit different, like some pieces have been factored out to avoid repetition. This will take some closer looking. But I don’t have the time for these right now.

Now, none of these are going to be good enough for Euler #3. To get all the primes up to where we need, we either need to split up the task into bits, or a different approach that works better for big numbers, or a different strategy entirely. But all that is probaby for another time. I would like to say that I will update this post with some enhancements later on, but I don’t fully know if blogdown lets you do that.

There are also some optimizations of the sieve algorithm. For example, you can start by omitting the even numbers, and I think something to do with multiples of 30. I don’t immediately know how that would interect with the strategy these implementations are using of using a vector of logicals. More things to explore!

P.S., I wanted to highlight the specific changes that I made within the code, but I don’t know how to do that within rmarkdown. There must be a way. Let me know if you know what it is! Or if you know a good way to go about updating posts. Like, do you need to change both the date field and the filename? Is there a way to use commit references from git? I dunno, but I can tell there are things to learn there.