I just got back from the Safeway in Seattle's University District where I bought one of my favorite foods: Amy's Refried Beans With Green Chiles. I went to the shortest checkout line. I didn't realize until I was already being checked out that I had had this cashier before.
He asked me if I knew how refried beans were made. "Uhh, no…" He explained, "They use lard. Do you think they used organic lard while making these beans?"
These are vegan refried beans, so no lard was used. I didn't really care, so I let it slip. However, he continued, "Do you think it is right that they charge more for organic food? Don't you feel like you are being taken advantage of? I mean, I don't think that is right."
Oh shit. I remember this guy. He's the asshole that tried to convince me last time that I was being swindled by the organic food industry.
I was tired of his shit and pissed off at his poor customer service. I replied flat out with the truth, "If anyone is taking advantage of me it is Safeway." I pointed to the beans, "This is cheaper at Whole Foods." I was infuriated and didn't talk to him for the rest of the transaction.
Seriously, I don't know how they let people like this into retail. You shouldn't comment on someone's purchases, but if you must, you don't insist they are getting scammed. The only reason I shop at this Safeway is its convenient location; otherwise, its prices are high and its produce is disgusting.
Maybe it is because Amazon has drilled being customer-centric into my brain, but I find this unacceptable.
In honor of Earth Day, I'm going to talk a little bit about why these new utility computing services (or "cloud computing") are good for the environment (and business too).
One of the tenets of cloud computing is that you use what you need and you pay for what you use. Amazon S3, Amazon SimpleDB, and Google App Engine all charge based on storage, bandwidth, and CPU time. Additionally, these services run on shared infrastructure, so you don't have separate physical boxes serving your traffic. This allows Amazon and Google to run at high utilization knowing that, statistically, not all of their users will be hammering the service at once. For a detailed discussion of these statistics, read "Power Provisioning for a Warehouse-sized Computer" [PDF]. I guess you could say the same thing about Dreamhost (or any other shared server provider), since they cram many people on a box and run at high utilization, but they don't provide the same level of scalability as Amazon or Google (you are limited to a single box).
Why does utilization matter? If you can do the same amount of work on fewer computers by having higher utilization, you save the environmental cost of building those computers. Additionally, computers use lots of energy idling. Even though you increase single-computer power usage when consolidating, net power decreases as you take other systems offline. For a concrete example, imagine you have two computers that each use 100W idling and 200W at full load, and assume power draw rises linearly with load in between. If each machine is one-third utilized, it takes about 267W to perform your work (roughly 133W for each computer). However, consolidating this onto a single machine running at two-thirds utilization results in a total draw of about 167W, or a 37.5% savings. Now… imagine running a datacenter at 80% or more load.
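The consolidation arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the linear relationship between load and power draw is a simplifying assumption, not a measured property of any real machine.

```python
def power_draw(utilization, idle_watts=100.0, full_watts=200.0):
    """Estimated draw of one machine, ASSUMING power rises linearly with load."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_watts + (full_watts - idle_watts) * utilization


def consolidation_savings(per_machine_utilization, machines=2):
    """Compare several lightly loaded machines with one machine doing all the work.

    Returns (watts when spread out, watts when consolidated, fraction saved).
    """
    spread = machines * power_draw(per_machine_utilization)
    # The single machine carries the combined load of all the others.
    consolidated = power_draw(per_machine_utilization * machines)
    return spread, consolidated, 1.0 - consolidated / spread


if __name__ == "__main__":
    spread, consolidated, saved = consolidation_savings(1 / 3)
    print(f"spread out: {spread:.1f} W")
    print(f"consolidated: {consolidated:.1f} W")
    print(f"saved: {saved:.1%}")
```

The idle draw is what makes consolidation pay off: every machine you switch off removes its full 100W baseline, while the machine that absorbs its work only pays for the extra load.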
In the second paragraph I didn't mention Amazon's EC2. EC2 differs from S3 or GAE because you do pay for what you don't use. EC2 charges based on how long you have a computer under your control, not by the utilization of that computer. However, this is still a much more granular level of control than colocation, since you can scale your fleet up or down as needed to meet load. When you return a machine, someone else is free to grab it. The end result is high utilization, because people won't hold on to a machine they are idling, and will thus reduce their fleet size to compensate.
In my opinion this makes Hadoop the killer app for EC2. Users can spin up a cluster, run their job at full bore, and then return the computers to "the cloud". Companies such as Powerset and the NY Times have used EC2 for this very reason. This is a triumphant example of the market's ability to reduce energy consumption and resource usage (in the form of unneeded computers), because it directly translates into monetary savings. No heavy-handed government mandates required.
Let me use an example of my own usage of S3 for backups. All the numbers will be based on those in the power provisioning paper mentioned above. I have 20 GB of data backed up in S3 and I want it to be available for instant access. This necessitates a constantly spinning hard disk. If I were to do it myself, I could just add another hard disk in my computer and stick a copy on that. Power usage of hard disk: 12W. This is cheap and easy, but doesn't give me offsite backup in case of a disaster.
For the storage server, using the numbers from the paper, we have 200W base power plus 12W for each disk. Giving the server eight disks results in a total draw of 296W and 8192 GB of disk space. Let's double that to take into account battery backup and air conditioning in the datacenter, so 592W. Since the server is shared with others (we are aiming for high utilization, remember?), I'll have to figure out my usage as a proportion of the total. Before I do that, I should take durability into account. Distributed file systems store multiple copies of each file because hard disks and computers constantly fail in large systems. The industry standard seems to be three copies, judging by GFS and HDFS. Three copies results in 60 GB dedicated to my backups.
Adding this all up results in the following equation: 592W * (60 GB / 8192 GB) = 4.34W. As you can see, even with all the additional overhead and infrastructure, using S3 saves roughly two-thirds of the power versus having an extra spinning hard disk in my own machine. Plus, my apartment is no longer a single point of failure.
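Here is the same back-of-the-envelope estimate as a small Python sketch. The constants are the figures quoted above; the 1024 GB per-disk capacity is inferred from 8192 GB across eight disks, and the function and constant names are invented for illustration. It says nothing about how S3 is actually built.

```python
BASE_WATTS = 200.0         # server base power, as quoted above
DISK_WATTS = 12.0          # power per spinning disk, as quoted above
DISKS_PER_SERVER = 8
DISK_CAPACITY_GB = 1024.0  # inferred: 8192 GB / 8 disks
OVERHEAD_FACTOR = 2.0      # doubling for battery backup and air conditioning
REPLICAS = 3               # three copies, following GFS and HDFS


def shared_storage_watts(data_gb):
    """My proportional share of one shared storage server's power draw."""
    server_watts = (BASE_WATTS + DISK_WATTS * DISKS_PER_SERVER) * OVERHEAD_FACTOR
    server_capacity_gb = DISKS_PER_SERVER * DISK_CAPACITY_GB
    stored_gb = data_gb * REPLICAS  # every byte is kept REPLICAS times
    return server_watts * stored_gb / server_capacity_gb


if __name__ == "__main__":
    shared = shared_storage_watts(20)
    local = DISK_WATTS  # one extra disk spinning in my own machine
    print(f"shared storage: {shared:.2f} W, local disk: {local:.0f} W")
    print(f"savings: {1 - shared / local:.0%}")
```

The result is sensitive to the assumptions: if the server's disks are only half full, my effective share of the power doubles, which is why high utilization matters so much here.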
Hopefully this article has provided you a new look at utility computing. Next time the media tries to make a fuss over the growing power usage of datacenters, think back to this and realize they might actually be saving energy. Happy Earth Day, Andrew.
Last week, rumors started to surface that Google would be releasing BigTable as a web service for developers. While open source clones are being created, so far only Googlers have had access to the real BigTable. Such an announcement would also have been the biggest competition to Amazon Web Services to date. The actual launch was App Engine: hosted middleware on Google's platform. App Engine runs on BigTable, but doesn't expose all the nitty-gritty details of BigTable. This makes it simpler but less flexible, and despite all the comparisons to AWS online, the two offerings are quite different.
I registered the night App Engine was released and soon got my invite. I then came up with the crazy idea to offer BigTable as a web service using App Engine. It would be an infinitely scalable database running in Google's datacenters. I spent my weekend learning Python and hacking together an implementation. Now I'm happy to present the BigTable Web Service. It models the API of HBase, a BigTable clone. Now you can have simulated BigTable running atop App Engine, which itself provides an abstraction on top of the real BigTable.
The site describes the API of BigTable and gives examples of how to call it. I've also included a Python client for writing software against it. You must register for an account and create tables using the site, but everything after that is done through pseudo-RESTful service calls. I'm allowing free, unlimited access to the service… up to the limits imposed by Google.
