December 2008

Choosing the most efficient algorithm implementation for the task

As I progress through studying collective intelligence concepts and algorithms, I can't help but think about the most efficient ways of implementing statistical algorithms that perform calculations on very large data sets.  What most programmers proficient in a particular language would do is find a library, or implement the algorithm in that technology, without looking outside the box.  I'd probably implement most of these in some reasonably efficient language too, but wait…

I was searching for statistical/mathematical tools for prototyping and ran across the R language.  So I decided to see how, say, Python performs against R on the simplest of algorithms: the mean.  I'm sure there is a way to tune the Python code, but I even looked at some Python statistical packages, and they all implement the mean the same way.  So here is a benchmark.  I hate comparing languages this way; on its own it means little, other than that you should really look outside the box for a tool that fits a particular problem.

#### Python with a hand-written mean function.  Time: 24.5236210823 seconds #####

import time

def mean(inlist):
    total = 0  # avoid shadowing the built-in sum()
    for item in inlist:
        total = total + item
    return total / float(len(inlist))

start = time.time()
result = mean([i for i in range(1,50000000)])
end = time.time()
print "Result: %s, Start: %s, End: %s, Time elapsed: %s\n" % (result, start, end, end - start)
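As an aside, the list comprehension above also materializes all 50 million numbers in memory before the loop even starts.  A streaming version (my own sketch, not part of the original benchmark) keeps memory flat by accumulating a running sum and count, though it doesn't close the speed gap with R:

```python
def mean_streaming(numbers):
    # Accumulate a running sum and element count, so the whole
    # sequence never has to sit in memory at once.
    total = 0.0
    count = 0
    for x in numbers:
        total += x
        count += 1
    return total / count

print(mean_streaming(range(1, 11)))  # prints 5.5
```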

#### Python using the rpy2 interface to R.  It dispatches to R's routines behind the scenes.  Time: 14.780577898 seconds #####

import rpy2.robjects as robjects
import time

r = robjects.r
start = time.time()
result = r.mean(robjects.FloatVector(range(1,50000000)))
end = time.time()
print "Result: %s, Start: %s, End: %s, Time elapsed: %s\n" % (result[0], start, end, end - start)

#### R using the built-in mean function.  Time: 0.654 seconds (Yes, that's 654 milliseconds!!!) #####

# (reconstructed; the original snippet was omitted from the post)
system.time(mean(1:50000000))


Here I computed the mean over 50 million numbers (1..50,000,000) and got some absolutely astounding results.  I'm almost wondering if I'm completely missing something.  My best guess: R's mean is a vectorized primitive implemented in C, so it pays none of the per-element interpreter overhead that the pure Python loop does.


I changed the Python implementation to use the scipy library, which is built on top of numpy.  Scipy has a mean function that can be passed a numpy array.

#### Python using scipy.mean on a numpy arange.  Time: 1.02892494202 seconds #####

import scipy
import numpy

result = scipy.mean(numpy.arange(1, 50000000))

Wow, way better.
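To convince myself the gap really is interpreter overhead rather than anything magical, here's a small timing sketch of my own (not from the original benchmark, and with a smaller n so it finishes quickly) pitting numpy's vectorized mean against an explicit Python loop over the same data:

```python
import time
import numpy

n = 5000000  # smaller than the post's 50 million, so it runs quickly
data = numpy.arange(1, n)

# Vectorized: the summation loop runs in compiled C inside numpy.
start = time.time()
vectorized = data.mean()
print("numpy.mean:  %.1f (%.3fs)" % (vectorized, time.time() - start))

# Interpreted: the same arithmetic, but one bytecode dispatch per element.
start = time.time()
total = 0.0
for x in data:
    total += x
looped = total / len(data)
print("python loop: %.1f (%.3fs)" % (looped, time.time() - start))
```

Both print the same mean; the difference is entirely in where the loop runs.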