So as I progress through studying collective intelligence concepts and algorithms, I can't help but think about the most efficient/performant (no such word, but you know what I mean) ways of implementing statistical algorithms that perform calculations on very large data sets. What most programmers proficient in a particular language would do is find a library or implement the algorithm in that language, without looking outside the box. I'd probably implement most of these in some rather efficient language, but wait…
#### Python with mean algorithm Time: 24.5236210823 seconds #####
import time

def mean(inlist):
    # Sum the items with an explicit loop, then divide by the count.
    total = 0
    for item in inlist:
        total = total + item
    return total / float(len(inlist))

start = time.time()
result = mean([i for i in range(1, 50000000)])
end = time.time()
print("Result: %s, Start: %s, End: %s, Time elapsed: %s\n" % (result,
      start, end, end - start))
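One caveat on the timing above: the list comprehension that builds the input runs inside the timed region, so the figure covers both list construction and the loop itself. Here is a minimal sketch of how the two could be timed separately. It assumes a much smaller input (`n = 1000000` is my choice, not the original 50 million) and Python 3's `time.perf_counter`; the helper name `timed` is hypothetical, and no timings are claimed here.

```python
import time


def mean(inlist):
    """Arithmetic mean via an explicit loop, as in the listing above."""
    total = 0
    for item in inlist:
        total = total + item
    return total / float(len(inlist))


def timed(fn, *args):
    """Hypothetical helper: return (result, elapsed seconds) for fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


n = 1000000  # assumed size, far smaller than the original 50 million

# Time building the input list on its own...
data, build_seconds = timed(list, range(1, n))
# ...then the explicit loop, then the built-in sum() for comparison.
loop_mean, loop_seconds = timed(mean, data)
builtin_mean, builtin_seconds = timed(lambda xs: sum(xs) / float(len(xs)), data)

print("build: %.4fs, loop mean: %.4fs, built-in sum: %.4fs"
      % (build_seconds, loop_seconds, builtin_seconds))
```

Splitting the measurement this way shows how much of the elapsed time is the algorithm and how much is just getting the data into a list.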
#### Python using the rpy2 interface to R. It dispatches to R libs behind the scenes. Time: 14.780577898 seconds #####
import rpy2.robjects as robjects
import time

r = robjects.r  # handle to the embedded R session
start = time.time()
result = r.mean(robjects.FloatVector(range(1, 50000000)))
end = time.time()
print("Result: %s, Start: %s, End: %s, Time elapsed: %s\n" %
      (result[0], start, end, end - start))
#### R-lang using the mean function. Time: 0.654 seconds (Yes, that's 654 milliseconds!!!) #####
print(system.time(print(mean(array(1:50000000)))))
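One thing to keep in mind when comparing the printed results: Python's `range(1, 50000000)` stops at 49999999, while R's `1:50000000` includes 50000000, so the snippets do not average exactly the same numbers. For consecutive integers a..b the mean is (a + b) / 2, which a small Python check can confirm. This is only a sketch; `mean_of_consecutive` is a hypothetical helper, and the cross-check runs on a small range rather than the full input.

```python
def mean_of_consecutive(a, b):
    """Closed-form mean of the integers a, a+1, ..., b (inclusive)."""
    return (a + b) / 2.0


n = 50000000

# Python's range(1, n) covers 1 .. n-1; R's 1:n covers 1 .. n.
python_expected = mean_of_consecutive(1, n - 1)
r_expected = mean_of_consecutive(1, n)

# Cross-check the closed form against a direct computation on a small range.
small = list(range(1, 1000))
assert sum(small) / float(len(small)) == mean_of_consecutive(1, 999)

print("Expected mean for the Python inputs: %s, for the R input: %s"
      % (python_expected, r_expected))
```

The half-unit difference has no bearing on the timings, but it explains why the results need not match digit for digit.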