February 12, 2005
by skrenta at 6:16 AM
We see a gigantic hole -- an opportunity -- in online search. Search has
become the dominant navigational paradigm for goal-directed reference queries. But search is a poor
way to stream new developments around a topic.
Reference Web vs. the Incremental Web
Google searches the reference Internet. Users come to Google with a
specific query, and search a vast corpus of largely static information.
This is a very valuable and lucrative service to provide: it's the
Yellow Pages.
Blogs may look like regular HTML pages, but the key difference is
that they're organized chronologically. New posts appear at the top,
so with a single browser reload you can say "Just show me what's new."
This seems like a trivial difference, but it drives an entirely
different delivery, advertising and value chain. Rather than
HTML, the markup format for web pages, there is a desire for a new,
feed-centric format: RSS. To search chronologically ordered content,
a relevance-based search such as Google, which destroys the chronology,
is inappropriate. Instead you want Feedster, PubSub or Technorati.
Feed content may be better to read in a different sort of client,
such as Newsgator, rather than a web browser.
And finally, there is a different advertising opportunity. Rather than
the sort of business ads you see in the Yellow Pages, the
ad opportunity is more about reaching a particular demographic or
subscriber group: the kind of ads that are in magazines. How do you
keyword target a breakfast cereal advertisement to fitness-conscious
21-25 year olds? You can't. You need to find something those people
are reading, and put your ad there.
While there's been considerable deployment of goal-directed services,
there has been little technology development around automated aggregation of
relevant topic streams. Until now, this hasn't been a problem.
Most of the growth on the web over the past 10 years has been reference
services. But now we're seeing an explosion in the number of sources
publishing new incremental content every day. Blogs certainly,
but other sources too: news organizations, companies, and our
increasingly web-enabled governments are pumping out gigabits of fresh
news online every day. There is a vast proliferation of new
incremental content underway.
It's not appropriate to try to stream this incremental info
with keyword searches. It just doesn't work. Say you want
a feed of interesting news about Google. A while back I
posted something on this blog about Google which you'd probably
want to see in such a feed. But the rest of the articles
here are not about Google. So you don't want to subscribe
to blog.topix if you just want news about Google. But a keyword
search for "google" isn't going to deliver a useful experience
either -- there are far too many stray mentions of "google"
on the web every day. To get a relevant news feed about Google, you either
have to have people read everything for you and edit away all the junk,
or find an algorithmic technique to do the same.
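To make the contrast concrete, here is a minimal Python sketch. It is not Topix.net's method: the sample posts, the helper names (`count_mentions`, `keyword_match`, `is_about`) and the "term in the title, or at least two mentions in the body" heuristic are all assumptions invented for illustration. It only shows how a bare keyword match picks up stray mentions that even a crude topical filter drops.

```python
import re

# Hypothetical sample posts; titles and bodies are invented for illustration.
POSTS = [
    {"title": "Google files for a new search patent",
     "body": "Google described the ranking method in a filing this week."},
    {"title": "My weekend hiking trip",
     "body": "I had to google the trailhead because the map was wrong."},
    {"title": "Notes on feed readers",
     "body": "RSS clients keep improving. Someone asked if Google will build one."},
]

def count_mentions(text, term):
    """Count whole-word, case-insensitive occurrences of term in text."""
    return len(re.findall(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE))

def keyword_match(post, term):
    """Naive keyword search: any mention anywhere counts as a hit."""
    return count_mentions(post["title"] + " " + post["body"], term) > 0

def is_about(post, term, min_body_mentions=2):
    """Crude stand-in for a topical filter (an assumption, not a real
    classifier): the term is in the title or recurs in the body."""
    return (count_mentions(post["title"], term) > 0
            or count_mentions(post["body"], term) >= min_body_mentions)

keyword_hits = [p["title"] for p in POSTS if keyword_match(p, "google")]
topical_hits = [p["title"] for p in POSTS if is_about(p, "google")]

print("keyword search:", keyword_hits)
print("topical filter:", topical_hits)
```

The keyword search returns all three posts, including the hiking trip. The heuristic keeps only the post that is actually about Google, though a real system would need something far more robust than counting mentions.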
Human-powered techniques work well when the collection to be scanned is
small, or if you're trying to cover a handful of subjects. But in the near future, when
there are 100X or 1000X as many posts per day in the blogosphere as there are now,
humans won't be able to keep up. Interesting posts in out-of-the-way
places like this weblog won't be found in a timely manner, or perhaps
not at all. Navigational needs and discovery methods change when
you add zeroes to the end of the number of things you're looking through.
This mirrors the evolution of navigation on the web itself.
| proto web: | bookmarks |
| small web: | editorial directories, e.g. Yahoo |
| big web:   | algorithms, Google |
For a small web (10-30M pages), editorial guides like Yahoo's original directory worked great.
But when the web grew to 300M pages, the 50-200 editors couldn't keep
up anymore. And when it grew to 10 billion pages, even thousands of
editors at a directory like the ODP can't scale. At that point you
need algorithms to scale; you need Google.
An analogous transition will occur for webfeed content.
Relevance
Relevance of new information = freshness X personal context.
PageRank doesn't work for incremental data. News by definition is new,
and links take time to accrue. So if you're waiting for the web to
vote up a new piece of information before you'll see it, you'll lag
behind other news services that can recognize important information
the instant it's published. Relevance for a news item is about the
importance of the event, the timeliness of notification, and relevance
to a topic. This personal context is hard to derive by keyword.
Example: Company goes public -- interesting if you work there, own
stock in it, follow the industry of that company, buy the product or
live in the town where the company is located. Keywords will find
the company name, but maybe not the town, or the industry.
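The formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Topix.net scores stories: the exponential decay with a six-hour half-life, the tag-overlap measure of personal context, and every name and tag below (`freshness`, `personal_context`, `acme-corp`, `springfield`) are hypothetical.

```python
def freshness(age_hours, half_life_hours=6.0):
    """Exponential decay: 1.0 when just published, 0.5 after one half-life.
    The six-hour half-life is an arbitrary assumption."""
    return 0.5 ** (age_hours / half_life_hours)

def personal_context(item_tags, reader_interests):
    """Fraction of the reader's interests that the item touches (0.0 to 1.0)."""
    interests = set(reader_interests)
    if not interests:
        return 0.0
    return len(set(item_tags) & interests) / len(interests)

def relevance(age_hours, item_tags, reader_interests):
    """Relevance of new information = freshness x personal context."""
    return freshness(age_hours) * personal_context(item_tags, reader_interests)

# Hypothetical IPO story, tagged with company, industry and town.
ipo_tags = {"acme-corp", "semiconductors", "springfield"}
resident = {"springfield", "local-schools"}   # lives in the company's town
stranger = {"gardening", "baseball"}          # no connection to the story

fresh_for_resident = relevance(0, ipo_tags, resident)
stale_for_resident = relevance(12, ipo_tags, resident)
fresh_for_stranger = relevance(0, ipo_tags, stranger)

print(fresh_for_resident, stale_for_resident, fresh_for_stranger)
```

The resident never typed the company name, yet the story scores for them because it carries the town tag. That association is exactly the part a keyword match on the company name would miss.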
Scaling to the Long Tail
This is what we do at Topix.net. A way to think of us is as a purveyor
of 150,000 mailing lists, each focused on a location or a topic,
all updated from the broadest variety of relevant sources on the net.
We are also finding that an audience aggregated by topic in this way is
very valuable.
Folks like Jason Calacanis and Nick Denton are doing this with human
labor: car news from Autoblog, Jason's cancer blog,
or Nick's cool gadget blog.
These are great sites, and I am convinced that Jason and Nick have
figured out the future of publishing and are both going to be hugely
successful. A computer-generated product will never replace
high-end editorial sites like these.
But they don't have to. Search may be a winner-take-all market, but news isn't. I don't get all of my news from a
single source, and neither do you. For comprehensiveness,
algorithmic techniques will have to come into play.
People-powered systems just don't scale to the long tail. So we
are leveraging computers to stream news, not
just for tens or hundreds of topics, but for every subject.
Mobile home manufacturing. Minot, ND. 5,000 sports teams. 6,000 public
companies. Every disease. Every celebrity. And so on...
150,000 topics, updated every 30 minutes 24/7, from every publisher
in our crawl.
There are 4-8 million active blogs now. At this size, you can still
"know" the top bloggers, and find new posts worth reading by clicking
around. But when the blogosphere grows 100X or 1000X, the current
discovery model will break down. You'll need algorithmic techniques
like Topix.net or Findory to channel the most relevant material from
the constant flood of new content.