Archive for November, 2009
Hopping on the Big Data bandwagon
In 13 years we have gone from 100,000 websites to more than 180 million, with more than 22 billion pages (Google indexes about 15.5 billion of them, maybe more). But this is mostly static data. The pages are made and are then indexed. With the emergence of the real-time web, however, we are creating far more data than the growth in the number of pages alone would suggest. And the data changes constantly. It is a stream, as John Borthwick so elegantly describes. It’s harder to store, index, access, back up, guard against (and restore from) failure, and analyze. Measurements of the amount of data being created are hard to come by. We know Facebook stores and analyzes petabytes of data, as their 300M users log 8 billion minutes per day on the site. Twitter, with only about 70M users so far, generates more than 1M tweets per hour and more than 27M tweets per day, and they are just getting started.
Data is more than just our explicit posts, tweets and photos, however. Just by carrying a mobile phone around, we generate thousands of discrete data events a day (phone checking in with tower, etc.). By surfing the web, we each generate thousands of additional data events a day (cookie written, ad served, ad clicked, page requested, cookie read, logged in somewhere, searched, etc.). Throughout our travels in the interactive ecosystem, various parties are logging this data. And this is where the problems emerge. Logging it all is getting harder to do. Much harder.
Vulnerabilities surround us. Gmail, Twitter, Microsoft/Danger’s Sidekick cloud service, Amazon S3 and EC2 have all had outages. We hear about these services when they fail, even for an hour or two. Most of the failures are due to the sheer size and complexity of the undertaking.
What’s happening here? Well, the tools and methods we have used since 1995 to manage web infrastructure are breaking down at this new scale. Some parts of the chain, like Apache and other web servers, distribute quite nicely when we throw more iron at the problem. And application servers generally follow suit. But two main areas, in my mind, are particularly vulnerable to the challenges of scaling: the database and the methods used to traverse data.
Database
Scaling the database has always been a challenge. Oracle offers RAC and other clustering solutions that can achieve scale, but often not the performance needed for low-latency, high-read applications. They are also very expensive and have their own set of issues. But more generally, SQL databases and other relational database solutions really don’t scale horizontally, transparently to the application, and don’t eliminate all “single points” of failure. Others have gone into more detail here. Open source alternatives to Oracle, like MySQL and Postgres, have not solved this problem.
I am not the first to recognize this problem. In fact, the NoSQL movement has been active for some time, developing alternative non-relational database solutions in the name of producing truly scalable, high-performance databases (while giving up some of the elegant features of relational databases). Non-relational projects like BigTable, CouchDB, HBase, Voldemort, Dynomite, Amazon’s Dynamo, Cassandra, and Mongo offer different approaches here, from key-value stores to document and column-oriented designs. What makes this interesting is that Facebook, Digg, Google, Twitter and many other large-scale internet properties are adopting these non-traditional database solutions for various services within their architecture.
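To make the idea of horizontal scaling a bit more concrete, here is a minimal sketch of how a key-value store might spread keys across machines with a consistent-hash ring, the partitioning approach Amazon describes for Dynamo. Everything named here (`HashRing`, `node_for`, the `node-a` style names, the in-memory `stores` dictionary) is hypothetical and made up for illustration. It leaves out replication, failure handling and networking, and it is not the code of any project listed above.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable integer (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: a key belongs to the next node clockwise."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at several "virtual" positions
        # so that keys spread more evenly across nodes.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring position after the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical three-node cluster; each "node" is just a dict in this process.
NODES = ["node-a", "node-b", "node-c"]
ring = HashRing(NODES)
stores = {name: {} for name in NODES}


def put(key, value):
    stores[ring.node_for(key)][key] = value


def get(key):
    return stores[ring.node_for(key)].get(key)


put("user:42", {"name": "example"})
print(ring.node_for("user:42"), get("user:42"))
```

The point of the ring is that adding a fourth node only moves the keys that now fall to the new node; every other key stays where it was, which is what lets capacity grow without reshuffling the whole data set.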
Traversing Data
Large sets of data are a challenge to analyze and process. Basic computer science brute-force algorithms break down (or at least perform sub-optimally) at these scales and are not optimized to work on a distributed basis. One key example is the search index. To deal with the enormous size of search indexes, Google developed MapReduce, which led Doug Cutting at Yahoo! to develop Hadoop, a Java framework to make it easy for data-intensive apps to work in a distributed manner. But these are not the only companies burdened by web-scale data problems. Some smart folks figured this out and formed Cloudera to commercialize Hadoop. I have been kicking myself for missing that investment opportunity. It is sure to be a winner. A number of our portfolio companies now use Cloudera’s Hadoop.
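For readers who have not seen the MapReduce pattern, here is a minimal single-process sketch of it using the classic word-count example. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration; this is not the Hadoop API, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine; here we just loop.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


docs = ["big data is big", "data changes constantly"]
counts = word_count(docs)
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split across machines without the pieces needing to talk to each other, which is what makes the pattern a good fit for very large data sets.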
So, big data is revealing itself to be a disruptive phenomenon. I haven’t really done the topic justice, so I would very much appreciate your point of view in the comments below. I am committed to digging in deeper here.
Some iPhone Gripes, Part 1
I love the iPhone. But it is a work in progress, let’s face it. I wanted to quickly list a bunch of my specific issues with iPhone in the hopes that someone at Apple is listening out there (crazy assumption, I know…). First post of several:
- When connectivity is limited and I choose to read/delete email, must you pop up seven different dialog boxes in rapid succession telling me “can’t get mail” and “can’t delete message”? Is that really relevant? Can’t you just queue my requests and actions and re-submit them once connectivity is re-established? BlackBerry has been doing that for more than a decade. They even clue me in that a task is queued with a clock icon next to an unsent message.
- When I am on a call and you pop up a calendar alert or a text message arrives, must you make me deal with the alert before you let me have access to the phone controls again? Why force me out of the context of the phone call? Why can’t I access things like “speaker” and “end call” before I have responded to a calendar alert or a newly arrived text message?
- Push email is not compatible with the battery capacity you have chosen. Turning on push (for Exchange servers) will kill the battery within 3-4 hours, rendering this feature completely unusable. Either fix it or warn users more prominently about this.
- Why must I enter my iTunes account password every time before downloading free apps?
- It takes 4-5 steps to delete a calendar event. Really? That’s the best UI you can come up with?
- New calendar events: can’t I please click the hour on which I want the new event scheduled before clicking the “+” button to create a new event? The interface for setting the time is so tedious. I should be able to tell you when I want the event scheduled with a simple gesture.
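The first gripe in this list asks for mail actions to be queued while offline and re-submitted once the connection is back. A minimal sketch of that idea follows. The `OfflineQueue` class, its method names and the `send` callback are all hypothetical, and this says nothing about how the iPhone or BlackBerry mail clients are actually built.

```python
from collections import deque


class OfflineQueue:
    """Holds user actions and replays them in order once sending works."""

    def __init__(self, send):
        # send(action) is assumed to raise ConnectionError when offline.
        self._send = send
        self._pending = deque()

    def submit(self, action):
        """Queue the action, then try to deliver everything that is waiting."""
        self._pending.append(action)
        return self.flush()

    def flush(self):
        """Replay queued actions in order; stop quietly at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # still offline: keep the action, no dialog box
            self._pending.popleft()
        return True

    @property
    def pending(self):
        return list(self._pending)


# Simulated connection for the example.
online = False
sent = []


def send(action):
    if not online:
        raise ConnectionError("no connectivity")
    sent.append(action)


queue = OfflineQueue(send)
queue.submit("delete message 1")   # offline: stays queued
queue.submit("delete message 2")   # offline: stays queued
pending_while_offline = queue.pending

online = True                      # connectivity comes back
queue.flush()                      # both actions are sent, in order
print(sent, queue.pending)
```

A UI built on something like this could show a small "queued" indicator next to each pending item instead of interrupting the user with an error for every failed attempt.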