January 18, 2004
Fast crawls by search engines scary but fun
Posted at 7:04 PM
Over the past week we've had many search engines and spiders visit topix.net. Generally spiders will rate-limit themselves to visiting a particular domain no more than once every 30 seconds. However, for a large site like Yahoo, Geocities or dmoz, this means that it could take half a year to finish indexing the whole site.
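To put rough numbers on that, here is a small back-of-the-envelope sketch. The 30-second delay comes from the paragraph above; the 500,000-page site size and the function names are assumptions chosen only for illustration, not measurements of any real site.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(page_count: int, delay_seconds: float = 30.0) -> float:
    """Days needed to fetch page_count pages at one fetch per delay_seconds."""
    return page_count * delay_seconds / SECONDS_PER_DAY


def pages_in_days(days: float, delay_seconds: float = 30.0) -> int:
    """Whole pages a crawler can fetch in the given number of days."""
    return int(days * SECONDS_PER_DAY / delay_seconds)


if __name__ == "__main__":
    site_pages = 500_000  # hypothetical site size, for illustration only
    print(f"{crawl_days(site_pages):.1f} days at one page per 30 seconds")
    print(f"{crawl_days(site_pages, delay_seconds=0.2):.1f} days at 5 pages/second")
    print(f"{pages_in_days(182.5)} pages fetched in half a year at the default rate")
```

At one fetch every 30 seconds, a crawler gets through roughly half a million pages in half a year, so any site much larger than that can never be fully fresh in the index at the default rate.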
But search engines want to have the freshest data, and webmasters want to be indexed as quickly as possible, so a few advanced crawlers will detect if they are visiting a very large site, and speed up dramatically if they sense that the site can handle the traffic.
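We don't know how any particular crawler decides when to speed up, but the general idea can be sketched as a delay that shrinks while a large site keeps answering quickly and grows again at the first sign of strain. Everything in this sketch is an assumption for illustration: the class name, the page-count cutoff, the response-time threshold and the halve/double rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveDelay:
    """Per-domain politeness delay that adapts to how the site is coping."""

    delay: float = 30.0               # conservative default: one fetch per 30 seconds
    min_delay: float = 0.2            # hypothetical floor, i.e. 5 pages/second
    max_delay: float = 30.0           # never slower than the default
    large_site_pages: int = 100_000   # hypothetical cutoff for a "very large" site
    fast_response: float = 0.5        # hypothetical: seconds that count as healthy

    def update(self, known_pages: int, response_seconds: float, ok: bool) -> float:
        """Record one fetch result and return the delay to use before the next."""
        if not ok or response_seconds > self.fast_response:
            # The site looks strained: back off by doubling the delay.
            self.delay = min(self.max_delay, self.delay * 2)
        elif known_pages >= self.large_site_pages:
            # Large site answering quickly: speed up by halving the delay.
            self.delay = max(self.min_delay, self.delay / 2)
        # Small, healthy sites keep their current delay.
        return self.delay


if __name__ == "__main__":
    limiter = AdaptiveDelay()
    for _ in range(8):
        # Pretend a 200,000-page site keeps answering in 100 ms.
        print(limiter.update(known_pages=200_000, response_seconds=0.1, ok=True))
    # One slow response makes the crawler back off again.
    print(limiter.update(known_pages=200_000, response_seconds=2.0, ok=True))
```

Halving on success and doubling on trouble is just one simple policy. A real crawler would presumably also honor robots.txt and smooth over several fetches rather than reacting to a single response.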
We observed this firsthand the second day after our launch. Googlebot was the first to show up, and quickly accelerated to about 1 hit/second. Teoma arrived and spent half a day fetching 30,000 or so pages. But then AltaVista's spider Scooter arrived and really fetched up a storm. It was fetching well over 5 pages/second at the peak. I thought for a minute it was a DoS attack until I saw that it was just AltaVista indexing us. :-)
Fortunately we've built a wicked-cool page serving infrastructure, so our servers didn't even break a sweat. Load on one peaked at 1.14 with 75% CPU idle. Not bad for a pair of Supermicro 1U Linux boxes. We haven't even added the planned third front-end server to the cluster yet. At this rate we may not need it for a while and can hold it back as a hot spare in the rack.
