Archive for the ‘Lookery’ Category

June 9, 2008

Lookery Secrets Exposed

Lookery

If you’re interested in what we’re up to at Lookery, listen to this interview my partner Scott Rafer did with Social Times last week.

P.S. Check out our new Lookery East Coast office (Cambridge, MA).

[Photos: Lookery East office space; Todd Sawicki, Jay Meattle, and Raj Bala at Lookery Boston]
Thanks to the Viximo guys for renting part-time space to us.
We’re in the office only a couple of days a week as we all prefer to work from home and keep it frugal.


June 4, 2008

Using Amazon S3 as CDN, Part 2 - Cacheability

In my last post I compared using Amazon S3 as a CDN to other low-cost alternatives. It was clear that S3 performed badly compared to the true CDNs. That's not a surprising outcome; S3 was not designed to be a CDN.

The real surprise to me was that CacheFly, the lowest-cost CDN I tested, performed better than the much higher-priced options.

My performance tests were pure HTTP GET based and used the monitoring service Pingdom. What performance monitoring services don’t measure is how fast a website feels to the end user as it loads.

A lot can be done to make a website “feel” faster. Most of these techniques are outlined in the book High Performance Web Sites by Steve Souders of Yahoo!. Most of the book’s content can be read for free on the Yahoo! Developer Network.

The simplest way to increase performance is to minimize requests to your server by setting proper Expires and Cache-Control headers so that your static content can be cached.

When I tested the CDN options for cacheability, only Panther Express and my DIY Nginx web servers on EC2 returned proper “cacheable” headers.

S3 and CacheFly returned no “cache-friendly” headers. EdgeCast returned proper headers, but its server time was inaccurate, which could lead to caching issues.
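
As a rough illustration of the checks described above, here is a minimal Python sketch that inspects a set of response headers for the two problems I ran into: missing Expires/Cache-Control headers and a server Date that disagrees with the real time. The function name, the 5-minute skew threshold, and the sample headers are assumptions made for this example; they are not the actual responses from any of the CDNs tested.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def cacheability_report(headers, now=None, max_skew=timedelta(minutes=5)):
    """Return a list of cacheability problems found in HTTP response headers.

    `max_skew` is an arbitrary threshold chosen for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    h = {name.lower(): value for name, value in headers.items()}
    problems = []

    cache_control = h.get("cache-control", "").lower()
    if "no-store" in cache_control or "no-cache" in cache_control:
        problems.append("Cache-Control forbids caching")
    if "max-age=" not in cache_control and "expires" not in h:
        problems.append("no Expires or Cache-Control max-age header")

    # An inaccurate server clock makes Expires unreliable for clients
    # that compare it against the Date header.
    if "date" in h:
        try:
            server_time = parsedate_to_datetime(h["date"])
            if abs(server_time - now) > max_skew:
                problems.append("server Date differs from local clock")
        except (TypeError, ValueError):
            problems.append("unparseable Date header")
    return problems


if __name__ == "__main__":
    # Hypothetical headers, for illustration only.
    sample = {"Date": "Wed, 04 Jun 2008 12:00:00 GMT"}
    print(cacheability_report(sample))
```

An empty list means none of these particular problems were detected; it does not guarantee that every browser or proxy will cache the file.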

[Figures: CDN cacheability results for Amazon S3, Panther Express, EdgeCast, and CacheFly]

I found no way to add Expires headers to files hosted on CacheFly. You can add custom headers to files in S3.

By setting the Expires headers on your static files far into the future, you can create a cheap CDN with S3 that can “feel” as fast as a traditional CDN. You’ll need to automate this and make it part of your deployment process, which ideally renames the deployed files to contain revision numbers so that you can safely set those Expires headers far in the future.
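
Here is a minimal sketch of what that deployment step might compute, assuming a one-year lifetime and a `name.rN.ext` naming scheme. Both are choices made for this example, as are the function names. The upload itself is left out, since it depends on whichever S3 client you use.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from pathlib import PurePosixPath

ONE_YEAR = timedelta(days=365)  # assumed "far future" lifetime


def revisioned_name(filename, revision):
    """Embed a revision number in a file name: tracker.js -> tracker.r42.js."""
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.r{revision}{path.suffix}"))


def far_future_headers(now=None, lifetime=ONE_YEAR):
    """Build Expires and Cache-Control headers that expire `lifetime` from now."""
    now = now or datetime.now(timezone.utc)
    return {
        # HTTP dates must be expressed in GMT.
        "Expires": format_datetime(now + lifetime, usegmt=True),
        "Cache-Control": f"public, max-age={int(lifetime.total_seconds())}",
    }


if __name__ == "__main__":
    # Hypothetical file name and revision number.
    print(revisioned_name("js/tracker.js", 42))
    print(far_future_headers())
```

Because each deploy produces a new file name, browsers fetch the new revision immediately while continuing to cache the old one, which is what makes the long lifetime safe.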



May 29, 2008

Using Amazon S3 as a CDN?

At Lookery our JavaScript analytics tracker is now pushing more than 250 GB of bandwidth per month. This JavaScript file has grown a bit but is still about 6 KB.

We’ve been serving that file from an Nginx web server running on Amazon EC2 instances. Currently Amazon has 3 EC2 data centers, but strangely they are all located on the East Coast of the US (Virginia). Since a lot of our users are international, we needed to move that file to a CDN in order to reduce latency.

Less than a month ago we moved that single file off to CacheFly, a low-priced CDN. I thought CacheFly would be an interim solution, but from our performance testing they seem to be a good longer-term option.

For this round of testing I tested the following serving options:

  1. CacheFly
  2. EdgeCast
  3. Amazon S3
  4. Nginx running on an Amazon EC2 Instance (DIY option)

For the next round of tests I’ll add Panther Express to the mix.

Performance testing Amazon S3 is a bit unfair, but since so many people are using it as a cheap solution I thought I would test it out myself. The performance tests show that you’re much better off serving your static content from a lightweight server like Nginx or using an inexpensive option like CacheFly.

For the performance tests I used Pingdom, a third-party performance monitoring service that we’ve been quite happy with.
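
For readers who want to run a similar probe themselves, here is a minimal sketch of a pure HTTP GET timing check in Python. It is not how Pingdom works internally; the function names, the sample count, and the use of the median are all assumptions made for this example, and it measures from a single location only.

```python
import statistics
import time
import urllib.request


def fetch_body(url, timeout=10):
    """Download the full response body with a plain HTTP GET."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def time_get_ms(url, fetch=fetch_body, clock=time.perf_counter):
    """Return the wall-clock time of one GET request in milliseconds."""
    start = clock()
    fetch(url)
    return (clock() - start) * 1000.0


def median_get_ms(url, samples=5, fetch=fetch_body, clock=time.perf_counter):
    """Take several measurements and report the median to dampen outliers."""
    return statistics.median(time_get_ms(url, fetch, clock) for _ in range(samples))
```

The `fetch` and `clock` parameters exist so the timing logic can be exercised without a network connection; in normal use you would call `median_get_ms(url)` with the URL of the static file you want to measure.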

The monitoring servers were geographically distributed as follows:

[Figure: CDN Tests, monitoring server locations]

Summary

[Figure: CDN Tests, summary results]

CacheFly performed the best, but only slightly better than EdgeCast. The S3 option was the worst, with the Nginx/DIY option performing just over 100 ms faster than S3.

Details

Below are the details for a single day. I ran these tests for over 2 weeks, and the results were identical to this single day’s.

[Figures: CDN test results for CacheFly CDN; Nginx on an Amazon EC2 Instance (DIY); EdgeCast CDN; Amazon S3 used as a CDN]

Notes

  • I also tested a second DIY option, running a Varnish cache on an Amazon EC2 instance, but for static content Nginx performed much better so I omitted the results.
  • EdgeCast has an option that allows frequently used content to be served directly from RAM. My trial account did not allow me to test this option. This option would allow for even better performance, possibly matching/beating CacheFly’s performance.

So far we’re sticking with CacheFly and testing a few other options. I’ll post the Panther Express performance after our tests are complete.

Let me know if you’ve found similar results or if I should be testing any other solutions.



April 14, 2008

Hadoop Summit


Last month Yahoo! held the first Apache Hadoop Summit in Santa Clara, CA. I really wanted to go but had scheduled our family vacation to Austin, TX for that same week months before. Daniel was able to go in my place for Lookery and my friend Chris Gillett, who was on my team at Compete, also attended.

“Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.”

Hadoop implements Google’s MapReduce programming model to create a framework that breaks up large data into small chunks that are then processed in parallel across a cluster of commodity servers.
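
To make the model concrete, here is a toy word count written in plain Python. It does not use Hadoop or its API; a thread pool stands in for the cluster, and the function names and chunk size are assumptions made for this sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break the input into small chunks that can be processed independently."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(lines):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key, as the framework does between steps."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2, workers=4):
    chunks = split_into_chunks(lines, chunk_size)
    # Threads stand in for the cluster's worker machines in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["a b a", "b c", "a"]))
```

The point of the model is that `map_chunk` only ever sees its own chunk, so the chunks can be spread across as many machines as you have.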

The framework is still at a very early stage but is already being used by Facebook, Google, Visible Measures, Yahoo, The New York Times and by us at Lookery.

Back when we started Compete in 2000 there were no open source options like Hadoop. Forget about finding developers who had experience dealing with terabyte-scale data.

We ended up evaluating most of the supercomputing software that was in use at the time, mostly within government and academic settings. Software like PBS, MPI, PVM, Torque and Condor was the state of the art. In the end, our only option was to create our own solution for dealing with our massive clickstream “database”.

Here are the slides from the presentations Chris gave at the PyCon 2005 conference that describe some of the data processing apps we came up with at Compete.

It’s cool to see that the paradigms we used are being carried on with the Hadoop, Pig and HDFS projects, just at a much larger scale.

Chris posted some great summaries of the Hadoop Summit on his blog. I hope he gets around to posting the summaries for the rest of the talks from that day.

Interested in Hadoop? Python? We’re looking for engineers at Lookery to work on our data processing cluster.

Learn more about working with Hadoop and BIG Data at Lookery.



March 25, 2008

The band is getting back together

Lookery

I’m on vacation in Austin, Texas this week with my family, but I thought I’d check in to share the great news.

Our small team at Lookery is growing quickly and I am very happy to announce that Jay Meattle will be joining us next week.

This is Jay’s last week at Compete, where he and I worked together for over 3 years. While at Compete, Jay was the Product Manager for Compete.com and part of the small team responsible for creating it. Jay and I also worked together creating Bzzster.com and Shareaholic as our weekend projects while at Compete.

At Lookery, Jay will be heading up Product Development for us and helping us launch some of the very exciting products we’ve been heads-down working on.

Looking forward to “bringing the thunder” at Lookery with Jay.



Content © 2007-2008 David Cancel. All Rights Reserved.