The Potentially Dark Future of Search

January 12, 2012

Twitter sees Google’s latest Google+ feature, integration into Google search, as anti-competitive, and it probably is. However, it brings to the surface some real issues with the future of search and of data.

Twitter’s argument:

We’re concerned that as a result of Google’s changes, finding this information will be much harder for everyone. We think that’s bad for people, publishers, news organizations and Twitter users.

Google’s response was:

We are a bit surprised by Twitter’s comments about Search plus Your World, because they chose not to renew their agreement with us last summer (http://goo.gl/chKwi), and since then we have observed their rel=nofollow instructions.

People have been digging into the semantics of nofollow (see Danny Sullivan and Luigi Montanez), but there is a much bigger issue.

Google and other search engines, both established and up-and-coming, have no real way to include large amounts of this data in their indexes. It’s easy to imagine that the lack of access to Twitter and Facebook data was a motivator for Google+ in the first place.

Many sites now generate so much data that it is unrealistic to crawl them. For example, YouTube has more new content every day than it allows anyone to crawl. Twitter is essentially the same. This means there is no way to index this data without special arrangements with the provider. Twitter has closely guarded its firehose of data, but at least some mechanism exists for obtaining it. YouTube, as far as I am aware, has no such mechanism.
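To make the arithmetic concrete, here is a minimal sketch of what happens when a site produces more new items per day than a crawler is allowed to fetch. The function name and every number below are hypothetical, chosen only for illustration; they are not real figures for YouTube, Twitter, or any other service.

```python
def crawl_backlog(new_items_per_day, crawl_limit_per_day, days):
    """Return the count of uncrawled items at the end of each day.

    Both rates are hypothetical inputs, not measurements of any real site.
    """
    backlog = 0
    history = []
    for _ in range(days):
        # New content arrives first, then the crawler fetches up to its limit.
        backlog += new_items_per_day
        backlog -= min(backlog, crawl_limit_per_day)
        history.append(backlog)
    return history


# Illustrative numbers only: 1,000,000 new items a day, 250,000 fetches allowed.
print(crawl_backlog(new_items_per_day=1_000_000, crawl_limit_per_day=250_000, days=5))
```

Whenever the first rate exceeds the second, the backlog grows by the difference every day and the crawler never catches up, no matter how long it runs.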

My team and I ran into this problem head-on while trying to build Collecta, a real-time search engine. Access to the data was a primary blocker for many features and product ideas, and over the too-short life of that company, access became significantly more difficult, not easier.

Google can build an effective search, even a real-time one, for YouTube, but no one else can. Twitter can build search for its data, but few others can, and its data access policies can and do change on a whim.

If Google believes that microblogging data will improve its search product, then a reasonable strategy for obtaining that data is to build its own microblogging service to generate it. I can’t fault Google for trying. If I thought Collecta could have effectively competed against Twitter for its audience, I would certainly have attempted that as well.

Google, Twitter, Facebook and others are hoarding silos of otherwise public data. Not only does this artificially limit the features of their products, but it also squashes the potential for new and exciting search applications. The search services that have sprung up are limited to your own data, aggregate results from service-specific search APIs, exist at the mercy of data providers, or make do with a tiny subset of the data. I don’t think Google could have built its own search engine if the Web were similarly hostile.

One could argue for requiring this data to be openly available, but unlike the data of the past, it is expensive to publish and consume. Many of these services may have no mechanism to publish the data, even internally. Simply receiving the YouTube or Twitter firehoses (not counting video or image media) would require significant engineering effort, and the rate of data generation is only accelerating.

I think we must push for open access to data, even if it is costly. These data wars benefit very few. If things don’t change, the future of search is dark.