August 1, 2004
Topix.net: The best algorithmic news editing in the business
Posted at 2:09 PM
We're launching a new version of Topix.net today, with a next-gen version of our NewsRank story technology. NewsRank determines the relevance, accuracy and magnitude of the stories categorized on Topix.net.
The new front page uses a complex set of semantic story filters to govern news selection. The fully algorithmic editing process takes into account the magnitude of the story, as well as what the story is about, as determined by our AI categorizer from a Knowledge Base of 150,000 topics.
Other improvements also go live on the site today, including:
- Full Coverage sections backing up major stories. This lets users drill down on big stories with multiple viewpoints.
- Determining the accurate time of the article, as opposed to how recently the story appeared on the web and was fetched by our crawler (addressing the phenomenon where a day-old story appears on a news aggregator with "8 minutes ago" as the timestamp).
- Live Feed on the front page. These are raw headlines coming off of our news crawler. No categorization or ranking has been applied, other than profanity and automated QA filtering.
- Press release coverage has been added to the business sections.
- Email alerts are available for every Topix.net category.
- RSS feeds are now available from our search results page, in addition to the 150k subject and location feeds.
- Up to 7,000 sources in our news crawl.
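The timestamp item in the list above can be illustrated with a small sketch. It assumes the crawler records both a fetch time and, when it can extract one, the article's own publication time; the function names `display_time` and `age_minutes` are hypothetical and are not part of the Topix.net system.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def display_time(published: Optional[datetime], fetched: datetime) -> datetime:
    """Prefer the article's own publication time; fall back to the fetch time."""
    # A publication time later than the fetch time is treated as unreliable.
    if published is not None and published <= fetched:
        return published
    return fetched


def age_minutes(ts: datetime, now: datetime) -> int:
    """Whole minutes elapsed between ts and now."""
    return int((now - ts).total_seconds() // 60)


if __name__ == "__main__":
    now = datetime(2004, 8, 1, 14, 9, tzinfo=timezone.utc)
    fetched = now - timedelta(minutes=8)     # crawler picked it up 8 minutes ago
    published = now - timedelta(hours=26)    # but the story is over a day old
    print(age_minutes(fetched, now))                           # age by fetch time
    print(age_minutes(display_time(published, fetched), now))  # age by article time
```

Labeling a story by its fetch time is what produces the misleading "8 minutes ago"; labeling it by the extracted article time shows how old it really is.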
Our goal was to create a more compelling news experience than the other aggregators and online news sites. Rather than simply averaging together the top stories from major news outlets, our NewsRank engine applies a set of editorial rules to guide the story selection process.
We want to de-homogenize the news selection; instead of averaging down, we want Topix.net to find and bring back the most interesting, compelling (and sometimes the oddest) stories from the deep corners of the web. Stories that won't show up on other sites.
Categorized Aggregation is Hard
Topix.net has an aggregated feed for every ZIP code in the US (and every country in the world), as well as hundreds of thousands of other subjects -- health conditions, sports teams, industries, and so on. How do we do it?
Not with human editing, source tagging, or keyword scanning. The Topix.net NewsRank engine reads each story individually and determines locality and subject information from the content of the article. NewsRank also condenses 17 dimensions of importance from every story into a single value.
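How the 17 dimensions are combined is not described here. One common way to collapse several signals into one number is a normalized weighted sum; the sketch below shows that technique only as an assumption, with a hypothetical function name `combine` and made-up weights.

```python
from typing import Sequence

# The count comes from the post; the dimensions themselves are not named there.
NUM_DIMENSIONS = 17


def combine(signals: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse per-story signals into one score with a normalized weighted sum.

    This is an illustrative assumption, not the actual NewsRank formula.
    """
    if len(signals) != NUM_DIMENSIONS or len(weights) != NUM_DIMENSIONS:
        raise ValueError(f"expected {NUM_DIMENSIONS} signals and weights")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(signals, weights)) / total_weight


if __name__ == "__main__":
    signals = [1.0] + [0.0] * 16   # story scores high on one dimension only
    weights = [2.0] + [1.0] * 16   # that dimension counts double
    print(combine(signals, weights))
```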
Categorizing sources in order to produce topic aggregations doesn't work. Susan Mernit writes a great blog about online media, but she also writes about food and other personal topics. Blindly adding her entries to a food or media industry aggregation would result in inappropriate posts showing up.
Source-based categorization doesn't work for local, either. The San Francisco Chronicle runs stories that aren't about San Francisco. Conversely, there are many stories about events in SF that show up in news sources based outside of San Francisco. These stories would be missed with source-based tagging.
Keyword-driven filters are also a poor solution. Pulling every story out of the news stream with "San Francisco" in it will not make a good SF rollup, but instead will yield a random jumble of posts, most of which merely mention "San Francisco", but overall have nothing to do with it:
... on a business trip to San Francisco, ...
... an unrestricted free agent from San Francisco, ...
... was bound from Alaska to San Francisco in the winter of 1860 ...
... moved, with her family to San Francisco in 1960, ...
The situation is even worse if the keyword is ambiguous ("Kerry", "Bush", "Springfield").
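To make the problem concrete, here is a minimal sketch of a naive keyword filter run over the four fragments quoted above. The function name `keyword_filter` is hypothetical; the point is only that a substring match cannot tell a story about San Francisco from one that mentions it in passing.

```python
from typing import List

SNIPPETS = [
    "... on a business trip to San Francisco, ...",
    "... an unrestricted free agent from San Francisco, ...",
    "... was bound from Alaska to San Francisco in the winter of 1860 ...",
    "... moved, with her family to San Francisco in 1960, ...",
]


def keyword_filter(stories: List[str], keyword: str) -> List[str]:
    """Return every story containing the keyword, ignoring case."""
    needle = keyword.lower()
    return [s for s in stories if needle in s.lower()]


if __name__ == "__main__":
    matches = keyword_filter(SNIPPETS, "San Francisco")
    # Every fragment matches, although none of them is about the city itself.
    print(len(matches), "of", len(SNIPPETS), "snippets matched")
```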
Our solution is to disambiguate references to people, places and subjects, and match them against our Knowledge Base of 150,000 topics. The result lets our algorithmic story editing technology leverage a much finer-grained idea of what a story is about than simply using the big 7 news categories (US, World, Business, Sci/Tech, Sports, Entertainment, Health). We can bias up Olympics coverage while slighting movie reviews. Some pages on Topix.net are programmed to slightly favor sensational stories, others to de-emphasize the lurid.
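As a toy illustration of disambiguation, the sketch below resolves an ambiguous mention by counting how many context words from the story overlap with each candidate topic. The miniature `KNOWLEDGE_BASE`, its context words, and the `disambiguate` function are all hypothetical; the real Knowledge Base and categorizer are not described in this post and are certainly more involved.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical miniature knowledge base:
# ambiguous name -> candidate topic -> words that tend to appear near it.
KNOWLEDGE_BASE: Dict[str, Dict[str, Set[str]]] = {
    "springfield": {
        "Springfield, IL": {"illinois", "lincoln", "capitol"},
        "Springfield, MA": {"massachusetts", "basketball", "hall"},
    },
    "kerry": {
        "John Kerry": {"senator", "campaign", "democratic"},
        "County Kerry": {"ireland", "county", "tralee"},
    },
}


def disambiguate(mention: str, text: str) -> Optional[str]:
    """Pick the candidate topic whose context words overlap most with the text."""
    candidates = KNOWLEDGE_BASE.get(mention.lower())
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_topic: Optional[str] = None
    best_score = 0
    for topic, context in candidates.items():
        score = len(words & context)
        if score > best_score:
            best_topic, best_score = topic, score
    # None means no candidate had any supporting context in the text.
    return best_topic


if __name__ == "__main__":
    print(disambiguate("Kerry", "Senator Kerry resumed his campaign in Ohio"))
    print(disambiguate("Springfield", "The Illinois capitol in Springfield reopened"))
    print(disambiguate("Springfield", "A trip to Springfield"))
```

Returning `None` when there is no supporting context is a deliberate choice in this sketch: a story that merely mentions a name is left uncategorized rather than guessed at.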
Our complete news system -- article crawler and extractor, story clustering engine, NewsRank determination, topic and locality categorizer, the Topix.net Knowledge Base, and the algorithmic editing system (the "Robo-Editor") -- makes up the most sophisticated algorithmic news editing system on the net. It's by no means finished, though -- so please keep the feature suggestions and bug reports coming and we'll keep improving it. :-)
Update: More on Topix.net's new editorial algorithms can be found in this Cyberjournalist article.
