February 2, 2004
Memory resident
by at 10:48 PM
A while ago I was chatting with my old boss Wade about a nifty algorithm I found for incremental search engines, which piggybacked queued writes onto reads that the front end requests were issuing anyway, to minimize excess disk head seeks. I thought it was pretty cool.
Wade smacked me on the head (gently) and asked why I was even thinking about disk anymore. Disk is dead; just put the whole thing in RAM and forget about it, he said.
Orkut is wicked fast; Friendster isn't. How do you reliably make a scalable web service wicked fast? Easy: the whole thing has to be in memory, and user requests must never wait for disk.
A disk head seek is about 9ms, and the human perceptual threshold for what seems "instant" is around 50ms. So if you have just one head seek per user request, you can support at most 5 hits/second on that server before users start to notice latency. If you have a typical filesystem with a little database on top, you may be up to 3+ seeks per hit already. Forget caching; caching helps the second user, and doesn't work on systems with a "long tail" of zillions of seldom-accessed queries, like search.
It doesn't help that a lot of the scheduling algorithms found in standard OS and database software were developed when memory was scarce, and so are stingy about their use of it.
The hugely scalable AIM service stores everything in memory across a distributed cluster, with the relational database stuck off to the side, relegated to making backups of what's live in memory. Another example is Google itself; the full index is stored in memory. Servers mmap their state when they boot; no disk is involved in user requests after everything has been paged in.
The biggest RAM database of all...
An overlooked feature that made Google really cool in the beginning was their snippets. This is the excerpt of text that shows a few sample sentences from each web page matching your search. Google's snippets show just the part of the web page that have your search terms in them; other search engines before always showed the same couple of sentences from the start of the web page, no matter what you had searched for.
Consider the insane cost to implement this simple feature. Google has to keep a copy of every web page on the Internet on their servers in order to show you the piece of the web page where your search terms hit. Everything is served from RAM, only booted from disk. And they have multiple separate search clusters at their co-locations. This means that Google is currently storing multiple copies of the entire web in RAM. My napkin is hard to read with all these zeroes on it, but that's a lot of memory. Talk about barrier to entry.
Recent Entries
- Headline News: Topix on CNN.com
- Topix Cracks the Top 20 & Gets a New Suit
- Inviting Readers to the Party: Expanding the Definition of News
- Topix Grows 81%, According to Hitwise
- What's Missing from Your Local News?
- 500 Editors and Counting
- Reinventing Topix: Topix.Com(munity)
- Topix shows you "How To" at BlogHer
- SXSW Talk: When Communities Attack
- What can you do with one million people?
Archives
- July 2007
- June 2007
- May 2007
- April 2007
- March 2007
- February 2007
- January 2007
- December 2006
- November 2006
- October 2006
- August 2006
- July 2006
- June 2006
- May 2006
- April 2006
- March 2006
- February 2006
- January 2006
- November 2005
- October 2005
- September 2005
- August 2005
- June 2005
- May 2005
- April 2005
- March 2005
- February 2005
- January 2005
- December 2004
- November 2004
- October 2004
- September 2004
- August 2004
- July 2004
- June 2004
- May 2004
- April 2004
- March 2004
- February 2004
- January 2004
Powered by Movable Type
About Topix
- About Us
- Advertise
- Contact Us
- FAQ (General)
- Feedback
- Jobs
- Press Room
- Privacy Policy
- Terms of Service
Blogroll
- Rich Skrenta
- Mike Markson
- Blake Williams
- Chris Zaharias
- alarm:clock
- John Battelle
- Susan Mernit
- Micro Persuasion
- Greg Linden
- Jeremy Zawodny
- Search Engine Watch
- ResourceShelf
- Jeff Jarvis
- Traffick
- TechCrunch
- PaidContent
- Allen Morgan
Topix
