Hadoop Summit

Last month Yahoo! held the first Apache Hadoop Summit in Santa Clara, CA. I really wanted to go, but months earlier I had scheduled our family vacation to Austin, TX for that same week. Daniel was able to go in my place for Lookery, and my friend Chris Gillett, who was on my team at Compete, also attended.

“Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.”

[Image: Hadoop architecture diagram]

Hadoop implements Google’s MapReduce programming model: a framework that splits a large dataset into small chunks and processes them in parallel across a cluster of commodity servers.
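To make that concrete, here’s a minimal sketch of the classic word-count job written for Hadoop Streaming, which lets you implement the map and reduce steps as plain Python scripts that read stdin and write stdout. The file names and paths below are made up for illustration.

```python
#!/usr/bin/env python
# mapper.py -- the "map" step: emit a (word, 1) pair for every word
# in every input line Hadoop feeds us on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- the "reduce" step: sum the counts for each word.
# Hadoop sorts the mapper output by key before it reaches us, so all
# lines for a given word arrive together.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

You can test the pair locally with `cat input.txt | python mapper.py | sort | python reducer.py`, then hand the same two files to the Streaming jar to run them across the cluster with something like `hadoop jar hadoop-streaming.jar -input /logs -output /wordcounts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar’s path varies by install).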

The framework is still at an early stage but is already being used by Facebook, Google, Visible Measures, Yahoo!, The New York Times, and us at Lookery.

Back when we started Compete in 2000 there were no Open Source options like Hadoop. Forget about finding developers with experience handling terabyte-scale data.

We ended up evaluating most of the supercomputing software of the day, which lived mostly in government and academic settings. Packages like PBS, MPI, PVM, Torque and Condor were the state of the art. In the end, the only option was to build our own solution for dealing with our massive clickstream “database”.

Here are the slides from the presentations Chris gave at PyCon 2005 that describe some of the data processing apps we came up with at Compete.

Cool to see the paradigms we used being carried forward by the Hadoop, Pig and HDFS projects, just at a much larger scale.

Chris posted some great summaries of the Hadoop Summit on his blog. I hope he gets around to posting the summaries for the rest of the talks from that day.

Interested in Hadoop? Python? We’re looking for engineers at Lookery to work on our data processing cluster.

Learn more about working with Hadoop and Big Data at Lookery.