February 27, 2006
Every word in every document is already a tag
by at 8:20 AM
Back when web directories were still cool, AOL had an effort to build their own based on the Dewey Decimal System. They had 60 contractors in Arizona typing in web URLs and assigning DDC numbers to them. This didn't work. But why?
Because two thoughtful, non-malicious humans sitting next to each other will tag the same URL differently. (And, in this particular case, the most obscure URLs would default to more prominent positions in the DDC hierarchy, because they couldn't be classified.)
When you pull up the results of this exercise by a particular DDC number to get that category's page, it's junk. It's missing a lot of stuff it should have, and it has stuff it shouldn't.

Before we had full-text search of the world's knowledge at our fingertips, search systems would let you retrieve documents by keywords. If the item you were looking for hadn't been given the right keywords, it was undiscoverable. "Internet Law?" "Software Patents?" "IP Theft?" Modern search systems consider every word or phrase in the document a tag.
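To make the "every word is a tag" idea concrete, here is a minimal sketch of an inverted index, the basic structure behind full-text search. It is an illustration only, not how any particular search engine works: the function names, the simple tokenizer, and the three sample documents are all made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every word to the set of ids of the documents that contain it.

    In effect, each word in a document's text acts as a tag on that document.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample documents; nobody assigned keywords to any of them.
docs = {
    "a": "A court ruling on software patents and internet law.",
    "b": "Photos from a weekend hike.",
    "c": "Why software patents are hard to search for.",
}
index = build_index(docs)

print(search(index, "software patents"))
print(search(index, "internet law"))
```

Nobody had to file document "a" under "Software Patents" or "Internet Law" for either query to find it. The words were already there.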
Chris posted a rant about tagging here previously. I go back and forth on tags.
On one hand, tags work because they maximize participation with a simple ask of the user, and the effects of social use help a rough standardization emerge around them.
But tags aren't a panacea: they're highly vulnerable to spam, and items that belong in the same category will get different tags from different users. Which is it, "topixnet" or "topix"?
Tags are especially valuable in a system like Flickr, since photos don't have any text of their own to keyword-search, so getting the user to add any searchable text at all is a big win. You can ask users to caption their photos, but adding just a word or two is often easier, so participation is higher.
But if you have the full text of the web, or blogosphere, or whatever, the marginal utility of the "keywords" tag on the document seems to be rather low. To deal with spam and relevance issues, the search interface for a large collection needs to be appropriately skeptical about what documents claim to be about.
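One way to picture that skepticism is a scoring function that trusts a document's own text more than the keywords it declares for itself. This is a rough sketch under made-up assumptions, not a description of any real ranking system: the `score` function, the `keyword_weight` parameter, and its value of 0.25 are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, body, claimed_keywords, keyword_weight=0.25):
    """Score a document for a query, discounting self-declared keywords.

    A query word found in the body counts as 1.0. A query word found only
    in the claimed keywords counts as keyword_weight (an arbitrary,
    illustrative discount). Words found in neither place count as 0.
    """
    body_words = set(tokenize(body))
    claimed_words = set(tokenize(" ".join(claimed_keywords)))
    total = 0.0
    for word in set(tokenize(query)):
        if word in body_words:
            total += 1.0
        elif word in claimed_words:
            total += keyword_weight
    return total


# A page that actually discusses the topic, with no declared keywords.
honest = score("software patents", "An essay on software patents.", [])

# A page that only claims the topic in its keywords.
spammy = score("software patents", "Buy cheap watches here.",
               ["software", "patents"])

print(honest, spammy)
```

The page that merely claims to be about software patents still gets some credit, but it ranks below the page whose text backs up the claim.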
It's great if you can get the user to enter additional metadata about their posts. But if you aren't already looking at the existing text, you're missing a lot of pre-existing "tags".
