By John Borland
The Web now contains at least 320 million pages, and is fast outstripping the ability of online search engines to provide a comprehensive index, concludes a study published Friday in Science magazine.
The most complete engine encompasses barely one-third of the total estimated Web pages, write Steve Lawrence and C. Lee Giles of the NEC Research Institute. Their study, conducted last December, finds that Wired Digital's HotBot engine returned the highest number of relevant pages in a series of searches.
"We think we've got pretty accurate results," Lawrence said, comparing the study's methodology with other estimates of the Web's scale. But even the NEC estimates could be low. The survey's bias is such that the real number is likely more than 320 million, Lawrence added.
As part of their study, the pair took 575 search requests made by NEC scientists and fed them into six of the Web's top search engines. After checking the documents returned by each query for accuracy and broken links, they used the number of valid search results to extrapolate to their estimate of total Web pages.
Although HotBot returned the most pages and covered about 34 percent of the Web, the study also finds the Wired engine contained the most invalid links. The Lycos engine, which the researchers estimated contains only about 3 percent of total Web documents, returned the lowest percentage of broken links.
Lycos officials contested the study's results, saying they don't come close to matching the company's figures. "There's a 200 percent discrepancy between what they suggest and what we are," said Rajive Mathur, Lycos senior product manager for search products.
The researchers' focus on scientific searches may have led them to underestimate the contents of Lycos' database, Mathur added. "Our spider is focused on the sites and the sets of sites that our users are looking for," he said. "We're looking more at the popular culture that's out there. That's what the Internet is all about." Lycos studies have suggested that the quality of links, and a low number of invalid links, may be more important to Web users than the total number of search results.
The authors of the NEC study agreed that counting search results isn't necessarily the best way to rank engines, saying that each company offers additional services and different methods of ranking relevance. Thus, the most comprehensive engine won't necessarily be the best way of finding a given page, they said.
The various search engines have different schedules for retrieving new Web documents, so a recently posted document might not show up first in the most comprehensive engine. "They don't seem to be completely regular in the way they do their crawls," Lawrence said. "It's possible that the most comprehensive service hasn't done any new indexing of pages for a while."
The researchers also suggested that "meta" search services such as Metacrawler, which combine search results from several engines, can produce a more comprehensive result than any single engine. Specialized services such as Ahoy, which searches for individual home pages, also pick up obscure pages that might not find their way into the major engines, Lawrence said.
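The idea behind a "meta" search can be sketched in a few lines of Python. This is only an illustration of combining and de-duplicating result lists, not a description of how Metacrawler itself works; the engine names, the example.com URLs, and the function name merge_results are all hypothetical.

```python
def merge_results(results_by_engine):
    """Combine result lists from several engines, dropping duplicate URLs.

    results_by_engine maps an engine name to the list of URLs it returned.
    The first engine to report a URL determines its position in the output.
    """
    seen = set()
    merged = []
    for urls in results_by_engine.values():
        for url in urls:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical results for one query from three made-up engines.
sample = {
    "engine_a": ["http://example.com/1", "http://example.com/2"],
    "engine_b": ["http://example.com/2", "http://example.com/3"],
    "engine_c": ["http://example.com/4"],
}

combined = merge_results(sample)
```

In this made-up case no single engine returns more than two pages, while the combined list holds four, which is the effect the researchers describe.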
Beyond HotBot and Lycos, the study estimates that AltaVista covers 28 percent of pages, followed by Northern Light at 20 percent, Excite at 14 percent, and Infoseek at 10 percent.
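To put those percentages in terms of pages, the short Python sketch below multiplies each engine's estimated coverage by the study's 320 million figure. It assumes that figure as the total, even though Lawrence says the real number is probably higher, so the results are rough lower bounds; the function and variable names are illustrative only.

```python
# The study's estimate of total Web pages, which the authors call a lower bound.
TOTAL_PAGES = 320_000_000

# Coverage estimates reported in the study, in percent of the total.
COVERAGE_PERCENT = {
    "HotBot": 34,
    "AltaVista": 28,
    "Northern Light": 20,
    "Excite": 14,
    "Infoseek": 10,
    "Lycos": 3,
}


def estimated_index_size(percent, total=TOTAL_PAGES):
    """Return the approximate number of pages an engine indexes."""
    return total * percent // 100


index_sizes = {
    name: estimated_index_size(percent)
    for name, percent in COVERAGE_PERCENT.items()
}

for name, pages in index_sizes.items():
    print(f"{name}: about {pages:,} pages")
```

By this arithmetic, HotBot's 34 percent works out to roughly 109 million pages and Lycos' 3 percent to just under 10 million.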
The number of Web pages is likely to grow by another 1,000 percent over the next few years, Lawrence predicted, making it even more difficult for the engines to keep up.