Hadoop Performance Optimization is the Next Big Thing


The common theme that runs through every startup I’ve started, or been a part of, is the need for “big data” analysis. “Big data” analysis is the area where off-the-shelf software tools break down, where statisticians, analysts, and developers meet, and where normal number-crunching turns into “supercrunching”.

At my last two startups, Compete and Lookery, this “big data” analysis has transcended its usual internal audience and become a fundamental part of, if not the entirety of, the product.

In the time before Lookery (B.L.), we needed to build our “big data” infrastructure from scratch. This usually took the form of large clusters of computers running proprietary, home-grown software. The most formal of these efforts was the software we built at Compete, which included the Compete Filesystem (CFS) and CompeteSQL (CSQL).

Today we have open-source software like Hadoop to provide the framework for our data analysis software at Lookery. The Hadoop project has grown fast with companies like Yahoo, Facebook, Last.FM, and The New York Times using it. There are even venture-backed startups focused solely on building services and products on top of the framework.

This weekend Elias Torres, our VP of Engineering at Lookery, released a project he calls Hadoop Timelines. Hadoop Timelines is a great example of what I’m calling “Hadoop Performance Optimization” (HPO).

While the barriers to using something like Hadoop have dropped dramatically, there are still only a handful of experts who can make your Hadoop cluster perform well. What’s needed is a new suite of services and tools that can analyze your cluster and automatically optimize its performance. Hadoop Timelines, while rudimentary, is the beginning of an exciting new business niche: Hadoop Performance Optimization.
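To give a flavor of what such analysis tooling might do, here is a minimal sketch of the idea behind mining a cluster’s job logs for slow tasks. The log lines and field names below are illustrative only (real Hadoop job history formats vary by version), and the `stragglers` heuristic is my own hypothetical example, not part of Hadoop or Hadoop Timelines:

```python
import re

# Hypothetical sample lines in the style of a Hadoop job history log.
# Real field names and formats vary by Hadoop version; these are assumptions.
LOG_LINES = [
    'MapAttempt TASKID="task_0001_m_000000" START_TIME="1000" FINISH_TIME="4000"',
    'MapAttempt TASKID="task_0001_m_000001" START_TIME="1000" FINISH_TIME="9500"',
    'ReduceAttempt TASKID="task_0001_r_000000" START_TIME="5000" FINISH_TIME="8000"',
]

FIELD = re.compile(r'(\w+)="([^"]*)"')

def task_durations(lines):
    """Return {task_id: duration in ms} for lines carrying start/finish times."""
    durations = {}
    for line in lines:
        fields = dict(FIELD.findall(line))
        if "START_TIME" in fields and "FINISH_TIME" in fields:
            durations[fields["TASKID"]] = (
                int(fields["FINISH_TIME"]) - int(fields["START_TIME"])
            )
    return durations

def stragglers(durations, factor=1.5):
    """Flag tasks that ran much longer than the mean -- candidates for tuning."""
    mean = sum(durations.values()) / len(durations)
    return [task for task, d in durations.items() if d > factor * mean]

if __name__ == "__main__":
    d = task_durations(LOG_LINES)
    print(stragglers(d))  # the slow map task stands out
```

An optimization service could run analysis like this continuously, correlating stragglers with data skew, node health, or configuration, and suggest (or apply) fixes automatically.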

If you’re a Hadoop user please comment on how you’re optimizing your performance today.