The Numbers Behind the Twitter Data Silo

January 30, 2012

This Twitter vs. Google fight foreshadows a dark future for search. The latest Twitter volley at Google is this quote (seen on GigaOm) from Twitter CEO Dick Costolo:

“Google crawls us at a rate of 1300 hits per second… They’ve indexed 3 billion of our pages,” Costolo said. “They have all the data they need.”

There’s no doubt that 1,300 hits per second is a large number, but let’s put that in perspective:

For part of 2011, Google was perhaps able to keep up with the stream at 1,300 requests per second. Somewhere between February and June of that year, the average volume of tweets outgrew that rate.

Let’s assume that they kept pace until June 2011, and that on June 1 Twitter jumped from somewhere around 1,300 tweets per second to their reported 2,300 tweets per second. From that point on, Google falls behind by 1,000 tweets per second.

By the end of the year, that deficit adds up to roughly 18.5 billion missed tweets (1,000 tweets per second over 214 days). That puts Google about three months behind, and that’s if they skipped nothing and the volume never increased. But it did increase, by 25% or so as of October, and surely it has grown more since then.
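
Here’s a quick sketch of that arithmetic in Python. The rates and the 214-day span are just the assumptions above, not figures from Twitter or Google:

```python
# Back-of-envelope check of the numbers above. Rates are the post's
# assumptions, not official Twitter or Google figures.
SECONDS_PER_DAY = 86_400

crawl_rate = 1_300                   # tweets/s Google can fetch
tweet_rate = 2_300                   # tweets/s Twitter produces after June 1
deficit = tweet_rate - crawl_rate    # 1,000 tweets/s Google falls behind

days = 214                           # June 1 through December 31, 2011
missed = deficit * SECONDS_PER_DAY * days
print(f"missed: {missed / 1e9:.1f} billion tweets")   # ~18.5 billion

# The backlog expressed as time, at the full 2,300 tweets/s volume:
backlog_days = missed / (tweet_rate * SECONDS_PER_DAY)
print(f"backlog: {backlog_days:.0f} days")            # ~93 days, ~3 months
```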

If Google has indexed only 3 billion pages so far, that covers approximately 12 days of tweets at current volume. It’s also hard to reconcile the 3 billion pages figure with the 1,300 requests per second figure: at that sustained rate, 3 billion pages is less than a month of crawling. Was Google indexing at a much slower rate before? Did they not start until a few months ago?
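
A similar sketch for the 3 billion pages figure, assuming “current volume” means the reported 2,300 tweets per second plus the ~25% growth mentioned above:

```python
# How much coverage do 3 billion indexed pages buy at current volume?
SECONDS_PER_DAY = 86_400
current_rate = 2_300 * 1.25                       # ~2,875 tweets/s after growth
tweets_per_day = current_rate * SECONDS_PER_DAY   # ~248 million tweets/day

coverage_days = 3e9 / tweets_per_day
print(f"coverage: {coverage_days:.0f} days")      # ~12 days of tweets

# And at a sustained 1,300 requests/s, fetching 3 billion pages takes:
print(f"crawl time: {3e9 / 1_300 / SECONDS_PER_DAY:.0f} days")   # ~27 days
```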

Of course, Google may be getting multiple tweets per request, perhaps by crawling the timelines of important users; to keep pace at 2,300 tweets per second on 1,300 requests per second, each request would need to yield nearly two new tweets on average. But that means they probably also make many requests that yield no new tweets at all, or else the timeliness of the data is poor.

No matter how you slice it, it appears Google is unable to keep up. Even if they were keeping up now, Twitter’s growth puts a deadline on how long that remains possible.

Perhaps Google is super clever and can index only the right tweets. I think it’s more probable that they have “enough” data to surface results for the most popular topics while missing nearly everything in the long tail of the distribution. I expect this hurts search quality, which one suspects is a high priority for the world’s best search engine.

Google is no saint; they are guilty of the same data hoarding. If you ran these numbers for YouTube indexing, I think you would find the situation is much worse. I imagine that most of these data silo companies purposely set their crawl limits too low for anyone else to achieve high-quality search results.

In the case of Twitter, the end result for users is even worse because Twitter’s own attempts at search are terrible and are getting worse over time. At least Google makes a decent YouTube search, even if no one else can.

Even if Google could get all the tweets, they would still have little to no Facebook data. I still think their best strategy in this situation is to create their own social data and use that instead. It’s a tough road, but they seem to have little choice.

In the end, it’s not about Google or Twitter or Facebook, but the stifling of innovation and competition around data. We can only hope that some federated solution or some data-liberal company wins out in the end.
