Enterprise Search: Microsoft, Google, Specialized Players Vie For Supremacy

Sep 26, 2008 (08:09 PM EDT)

Enterprise search tools are evolving to meet significantly different business requirements. IT and legal may need to scoop up documents, files, and e-mail relevant to forthcoming litigation. Security and compliance officers want to search laptops to make sure credit card numbers aren't hitting the road. Meanwhile, lines of business are clamoring for better ways to extract value from reams of enterprise data. Cracking open different repositories could help salespeople better use information gathered about customers.

Companies approaching enterprise search must match their requirements to the capabilities of competing search platforms from Google, Microsoft, and a growing field of specialized vendors. Yet even if CIOs scope out requirements perfectly, they may find themselves running multiple search products for different business units to address diverse needs, and piling on the storage and server resources.

And that's OK.

Take National Instruments, a maker of computer-based measurement and automation products for manufacturers and scientists. The company has seen its search infrastructure--covering information from customers outside the firewall and employees inside it--grow from 10 servers to 25 in about three years. Eight of those are production servers, with the rest dedicated to testing and development, security, and processing. Of particular note is the wildfire growth of National Instruments employees' use of search. John Graff, VP of marketing and customer operations, says CPU requirements to index data and respond to employee queries are growing 152% year over year.

But National doesn't begrudge the increase in resources. "As IT comes back to me to say 'We need more,' it's an easy sign-off because the value is so clear," Graff says.

In this business climate, what kind of technology draws that kind of support? One that solves problems. Still, purchasing decisions are complex. Not only is there no clear market leader, but the category is also diverging along two distinct paths.

While Google is synonymous with Web search, it's only one of many players in this market--and by no means dominant. Autonomy, Microsoft via its Fast Search & Transfer acquisition, Recommind, and others more than hold their own against the Big G. Endeca and IBM offer search products aimed at specific business problems. And companies such as Guidance Software, Kazeon, and StoredIQ Software are winning customers faced with e-discovery burdens.

InformationWeek separates the enterprise search market into two major categories: compliance search and business search. Vendors in the first category aim at IT and corporate officers, such as legal counsel, human resources professionals, or compliance officials. These constituencies aren't trying to find one relevant result out of 1,000, but 5,000 relevant results out of 1 million. This category is dominated by e-discovery, and search products not only must find information, but also manage it, whether by moving it to a new repository or applying controls to ensure files aren't changed or deleted. Vendors in the second category aim at employees, whether a business unit looking to extract more value from the information in various repositories, or a broader audience that needs help finding mislaid documents.

Autonomy, Guidance, Kazeon, and StoredIQ offer compliance search technology, with e-discovery and information management as the major drivers. In December 2006, the Federal Rules of Civil Procedure, which govern the processes and requirements of parties in federal civil suits, clarified the rules regarding electronically stored information. These rules have had a significant impact on the breadth of data that companies are expected to find and produce in litigation.

While case law around these updated rules is evolving, the upshot is clear: Courts won't accept "I can't find it" as an excuse for not producing information relevant to a lawsuit. In January 2008, Qualcomm was slapped with an $8.5 million penalty because it mishandled the e-discovery process and failed to produce e-mail relevant to a lawsuit with Broadcom. And without search in place, e-discovery costs can quickly eclipse the amount that a company may stand to lose in a lawsuit. Even employing precise keyword searches, Verizon places the price of processing, reviewing, culling, and producing 1 GB of data at between $5,000 and $7,000, according to a study by the University of Denver's Institute for the Advancement of the American Legal System. Multiply that by the size of your data stores, and the cost of a few new servers seems downright reasonable.
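
To put that per-gigabyte figure in perspective, the multiplication can be sketched in a few lines of Python. Only the $5,000 and $7,000 bounds come from the study cited above; the 500-GB collection size is a made-up input, and `review_cost_range` is a hypothetical helper name.

```python
# Per-gigabyte cost bounds cited above, in U.S. dollars.
LOW_COST_PER_GB = 5_000
HIGH_COST_PER_GB = 7_000

def review_cost_range(data_gb):
    """Return the (low, high) estimated cost of handling data_gb gigabytes."""
    return data_gb * LOW_COST_PER_GB, data_gb * HIGH_COST_PER_GB

# Assumption for illustration only: 500 GB of potentially relevant data.
low, high = review_cost_range(500)
print(f"${low:,} to ${high:,}")  # prints "$2,500,000 to $3,500,000"
```

Real costs depend on how much data search can cull before review begins, so the result is better read as an order of magnitude than as a quote.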

Mike Brooks, CIO and senior VP of CVR Energy, a $3 billion-a-year refinery, uses Autonomy's Idol as part of his e-discovery program. Idol is a search and indexing technology that underpins all of Autonomy's software products. Brooks runs discovery searches using Idol, and then uses a homegrown software tool to move relevant information to a secure repository. "We are trying to make sure there's nothing in our enterprise we don't know about, to avoid surprises," Brooks says.

E-discovery phases can be mapped out using the Electronic Discovery Reference Model, an independent framework that has been adopted by the vendor and legal communities. The search products described here focus on the initial phases of the discovery process, including identification, collection, and preservation. In the identification phase, search products must seek out content relevant to a lawsuit; they connect to various repositories, crawl the content, and create a searchable index. Users--in this case IT, HR, and legal counsel--run queries and get back a list of matches. And like business search tools, these products offer capabilities that go beyond simple keyword and Boolean search, such as support for multiple languages, natural language processing, and pattern recognition to extract additional layers of meaning from the information being indexed.
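
The crawl, index, and query steps described above can be illustrated with a toy inverted index in Python. This is a minimal sketch, not how any vendor named here builds its index: the document IDs and text are invented, the function names are hypothetical, and the query logic is a plain Boolean AND with none of the linguistic processing the products add.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Boolean AND: return IDs of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical crawled content, keyed by made-up document IDs.
documents = {
    "mail-001": "Pricing proposal for the new account",
    "file-002": "Maintenance work order: pump seal replaced",
    "mail-003": "Follow-up on pricing and delivery schedule",
}
index = build_index(documents)
print(sorted(search_all(index, "pricing")))
print(sorted(search_all(index, "pricing proposal")))
```

The index is built once during the crawl, which is why queries can come back quickly even when the underlying repositories are large.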

The features needed for the collection and preservation phases separate discovery-focused search products from their business kin. For instance, to do collection effectively, these products must be able to move content from one repository to another while preserving metadata, such as time stamps, to demonstrate that information wasn't altered during the discovery period. StoredIQ addresses this by logging the original metadata and then adjusting the necessary fields when files are moved or copied to a new location. Autonomy, Guidance Software, and Kazeon say their products can move files without changing metadata at all.
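
As a rough illustration of the logging approach, the Python sketch below records a file's original metadata and a SHA-256 hash before copying it, then checks that the copy matches. It is an assumption-laden sketch rather than any vendor's method: `collect_file` and the record fields are hypothetical, and `shutil.copy2` only attempts to preserve time stamps, with results that vary by file system.

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path):
    """Hash the file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def collect_file(source, destination_dir):
    """Copy a file into destination_dir and return a log record.

    The record captures the original metadata before the copy, so it
    survives even if the destination file system alters a field.
    """
    info = os.stat(source)
    record = {
        "source": os.path.abspath(source),
        "size_bytes": info.st_size,
        "modified_ns": info.st_mtime_ns,
        "sha256": sha256_of(source),
    }
    os.makedirs(destination_dir, exist_ok=True)
    target = os.path.join(destination_dir, os.path.basename(source))
    shutil.copy2(source, target)  # copy2 also tries to preserve time stamps
    if sha256_of(target) != record["sha256"]:
        raise IOError("copy does not match the original: " + target)
    record["copied_to"] = os.path.abspath(target)
    return record

# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "memo.txt")
    with open(original, "w") as handle:
        handle.write("quarterly pricing memo")
    record = collect_file(original, os.path.join(workdir, "hold"))
    copy_exists = os.path.exists(record["copied_to"])
    print(record["size_bytes"], record["sha256"][:12], copy_exists)
```

The hash is what lets a reviewer later show that the collected copy is byte-for-byte identical to the original, independent of what happened to the time stamps.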

Preservation requires relevant data to be maintained in an unaltered state. In the discovery process, employees involved in litigation, called custodians, are issued a preservation notice by legal counsel instructing them not to destroy or tamper with files, e-mail, and other information related to the case.

Human nature being what it is, custodians may be inclined to do exactly the opposite, so these search products have to provide a legal hold, in which information is preserved for the duration of a case. Discovery search products enforce these holds either by moving relevant information to a secure server or archive, or by altering write, open, or delete permissions.
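
The permissions-based approach can be sketched at the file-system level in Python. This is only a simplified illustration with hypothetical function names: it clears the write bits on a single file, whereas on POSIX systems deletion is governed by the containing directory's permissions and on Windows `os.chmod` controls little more than the read-only flag, so a real product has to do considerably more.

```python
import os
import stat
import tempfile

# All three write-permission bits: owner, group, and other.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def place_hold(path):
    """Clear the write bits on a file and return its previous mode."""
    previous_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, previous_mode & ~WRITE_BITS)
    return previous_mode

def release_hold(path, previous_mode):
    """Restore the mode that was in effect before the hold."""
    os.chmod(path, previous_mode)

# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "contract.txt")
    with open(path, "w") as handle:
        handle.write("draft terms")
    saved_mode = place_hold(path)
    held_mode = stat.S_IMODE(os.stat(path).st_mode)
    release_hold(path, saved_mode)
    restored_mode = stat.S_IMODE(os.stat(path).st_mode)
    print(oct(held_mode), oct(restored_mode))
```

Returning the previous mode matters because a hold eventually ends, and the file should go back to exactly the permissions it had before the case began.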

Laptops and desktops present problems for collection and preservation that don't exist with business search, and vendors have approached those challenges differently. Autonomy's Zantaz Introspect, Kazeon, and StoredIQ can access and search PCs over the network. They can collect information and enforce legal holds without the use of an agent.

Guidance requires a small piece of software, which it calls a servlet, to be installed on systems to be searched, though the company says the servlet doesn't install DLLs or interact with the host operating system. Autonomy also includes an agent with its Aungate Legal Hold software to lock down relevant information on laptops, PCs, and servers. StoredIQ says it plans to release an agent for laptops and desktops at the end of the year.

Deploying a search product is often a tactical response to an e-discovery emergency. But there are long-term strategic benefits to these products that involve understanding and managing corporate information, particularly unstructured data.

In addition to discovery, CVR's Brooks uses Idol to index work orders that are generated as part of plant maintenance operations. While these work orders contain pre-defined codes that identify common operations, employees also include detailed comments about problems and solutions that provide context about maintenance issues that can't be gleaned from codes.

"When Idol goes through the data, it groups together like topics, so when I'm running through a set of work orders, I can look at what the issues have been," Brooks says. If he sees large clusters of work orders around a specific topic, he can identify recurring problems.

IT search also can shine a light into hidden corners of an organization, such as laptops and desktops. IT often has little visibility into the kinds of information stored there. Popular repositories such as SharePoint, which can be deployed without IT's input or even awareness, also are prime candidates for compliance search. "We see people running our technology once a week for audits, like finding personally identifiable information, source code, intellectual property," says John Patzakis, chief strategy officer at Guidance. Kazeon CEO Sudhakar Muddu says 50% of his company's business is e-discovery, with the rest in support of governance, security, and data management.
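
An audit for one kind of sensitive data, credit card numbers, might look something like the Python sketch below. It is not drawn from any product mentioned here: the regular expression and function names are assumptions, and the approach (find 13- to 19-digit runs, then apply the standard Luhn checksum to weed out random digit strings) will still miss some numbers and flag some false positives.

```python
import re

# A run of 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0

def find_card_numbers(text):
    """Return digit strings in text that look like valid card numbers."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits

# 4111 1111 1111 1111 is a widely used test number, not a real account.
sample = "Card 4111 1111 1111 1111 on file; order ref 1234 5678 9012 3456."
print(find_card_numbers(sample))  # prints ['4111111111111111']
```

The checksum step is what keeps order numbers and other long digit strings from swamping the audit report.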

This process of looking at search and indexing to serve e-discovery needs also can help companies manage--or create--a retention and disposition strategy. "Forward-looking companies are being driven by chief risk officers to get a handle on data," says Craig Carpenter, general counsel and VP of marketing for Recommind. "They want to have it organized and start retiring data they don't need." Getting rid of data may go against the natural instincts of technology professionals, but as organizations add terabytes of information to the infrastructure every year, those instincts may be swamped by necessity.

Impact Assessment: Enterprise Search

Business enterprise search is evolving from its original use case, which can be described as "search for search's sake." The goal then was to generate a general index of information repositories and provide a front end for employees to browse through it, the way they would the Web--with simple queries that coughed up a long list of results. Today, companies approach business search to get better insight into specific domains and address business problems. "Customers aren't looking to buy search," says Craig Reinhardt, director of enterprise content management at IBM. "They want better business results. We look at search as a critical ingredient that needs to be integrated with other applications."

Reinhardt points to customers, such as those in law enforcement, that use IBM's OmniFind Enterprise search platform to find patterns in criminal records, or to manufacturers that use the search software to analyze customer comments on blogs and wikis.

Microsoft sees significant opportunity in this business-oriented approach to search, which was a major driver for its January acquisition of Fast Search & Transfer. National Instruments' situation illustrates why Microsoft's view makes sense. In early 2005, National was looking for a way to streamline access for different lines of business to all the content it gathers on its customers.

"We made a list of all the databases and repositories where we had customer information," VP Graff says, adding that he quickly found certain groups were tapped into different systems--for instance, the sales group used the CRM application, while engineering tracked the company's tech support Web pages--but no one group had access to everything. "We called it the '17 databases' problem," he says.

The company had deployed the Fast Enterprise Search Platform to enhance the search capabilities of its customer-facing Web site, and Graff thought search might be a good way to unlock the silos that contained information about its customers. These silos included the CRM system, corporate file servers, Lotus Notes, Oracle databases, and an internal wiki.

"It's been a huge hit," Graff says. "Our salespeople use it to do research on customers prior to visits. Marketing and engineering management use it to get feedback on what customers are doing with products. Even our CEO uses it."

A key to success is the search interface, which employees access through an intranet. National customized the user interface to let people select facets of a search. One surprisingly popular choice is age. "You can look at the creation date an order was processed, or the date a tech support query happened," he says. "Users can bring the freshest information to the top of the pile."

Companies also are asking search to provide more context, based on a variety of factors, such as the person conducting the search. For example, Google's latest version of its Search Appliance leverages Active Directory and LDAP-based directories to personalize search results based on the searcher's organizational role.

"You can create a policy group for the sales department that gives higher priority to documents talking about pricing," says Nitin Mangtani, lead product manager for Google Enterprise Search. "For engineers, you can give higher importance to engineering documents."
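
The idea behind such policy groups can be sketched in Python as a per-role multiplier applied to each result's base relevance score. This is a generic illustration, not the Search Appliance's actual configuration or API: the role names, topics, scores, and boost values are all invented.

```python
def rank_for_role(results, role, role_boosts):
    """Sort results by base score times any boost configured for the role."""
    boosts = role_boosts.get(role, {})
    return sorted(
        results,
        key=lambda r: r["score"] * boosts.get(r["topic"], 1.0),
        reverse=True,
    )

# Hypothetical results with made-up base relevance scores.
results = [
    {"title": "2008 price list", "topic": "pricing", "score": 0.60},
    {"title": "Driver API notes", "topic": "engineering", "score": 0.70},
    {"title": "Holiday schedule", "topic": "hr", "score": 0.50},
]

# Hypothetical policy groups: each role doubles the weight of one topic.
role_boosts = {
    "sales": {"pricing": 2.0},
    "engineering": {"engineering": 2.0},
}

for role in ("sales", "engineering"):
    ranked = rank_for_role(results, role, role_boosts)
    print(role, [r["title"] for r in ranked])
```

In this sketch the same query puts the price list first for a salesperson and the API notes first for an engineer, while a user with no configured role simply sees the results in base-score order.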

Another example of context is Recommind's MindServer search platform, which can be augmented with modules, such as Expertise Location. MindServer uses information gleaned from indexed data and other sources, such as HR portals, to associate users with expertise in different content areas based on that user's work product. It can serve as an extended company directory to help employees locate colleagues with specialized knowledge.

Microsoft also is pursuing expert search.

"We refer to it as 'people search,'" says Jared Spataro, director of enterprise search at Microsoft. "It will relate concepts to people in the organization who might know something about something. It's something we hear a lot about from our customers." Spataro says the company wants to lead in this area, but it hasn't yet announced specifics.

Regardless of the type of search you're interested in, there are technological issues that must be addressed, including indexing speed, index size, and security. In a discovery effort, time is of the essence. Initial results may need to be available to counsel within weeks. That may sound like a long time, but not when faced with repositories that hold multiple terabytes of information that have to be indexed before anything else can happen.

Indexing times are fluid. How quickly an engine can create an index depends on the content. A file share full of PowerPoint slides with 25 words per page will be indexed in a blink. Text-heavy documents take longer, as do PST files that have to be cracked open or files that may have multilevel attachments.

Some search products will federate with an index that has been created by the repository's native search feature, such as a Documentum repository or an e-mail archive. This speeds indexing time and saves on storage space. Through federation, the third-party search engine essentially brings the query to the application's native search field, and then incorporates the results into its own user interface.
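
Federation as described above can be sketched in Python as a function that hands the query to each repository's own search and gathers the hits into one list. The two native search functions here are hypothetical stand-ins with invented content; a real deployment would call each application's actual search interface instead.

```python
def federated_search(query, native_searches):
    """Send the query to each repository's own search and merge the hits.

    native_searches maps a repository name to a function that takes a
    query string and returns a list of result dictionaries.
    """
    merged = []
    for repository, search in native_searches.items():
        for hit in search(query):
            merged.append({"repository": repository, **hit})
    return merged

# Hypothetical stand-ins for two native search features.
def search_document_store(query):
    titles = ["Pricing policy 2008", "Travel policy"]
    return [{"title": t} for t in titles if query.lower() in t.lower()]

def search_mail_archive(query):
    subjects = ["Re: pricing for renewal", "Lunch on Friday"]
    return [{"title": s} for s in subjects if query.lower() in s.lower()]

native_searches = {
    "documents": search_document_store,
    "mail": search_mail_archive,
}
hits = federated_search("pricing", native_searches)
for hit in hits:
    print(hit["repository"], "-", hit["title"])
```

Note that no index is built or stored by the federating engine in this arrangement, which is where the savings in time and storage come from. The trade-off is that relevance scores from different native engines are hard to compare, so the sketch simply lists hits repository by repository.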

Note that most e-discovery search vendors prefer to index content themselves, whether or not the targeted repository has native search capability.

Customers also have to take the search infrastructure into account. Google and StoredIQ deliver their products as appliances, while the other search products are pure software deployed on servers. IT must provide sufficient processing capacity to handle volumes of queries. This may not be an issue with compliance search, which isn't intended to address simultaneous search requests from a large audience of users.

Companies must also provide storage for the index (except for Google and StoredIQ). Vendors usually estimate the index size as a percentage of the content being cataloged. For example, if the index is 10% of the content, a 100-TB body of data will yield a 10-TB index. The primary factor is how detailed you want the index to be. For instance, the Fast search engine can produce an index that runs about 20% of the size of the content, but most organizations will enrich the index through advanced linguistics to provide more detailed search results. Microsoft's Spataro says customers opting for a rich index should expect it to run two or three times the size of the actual content store.
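
The sizing arithmetic above is easy to run for your own numbers. The Python sketch below uses only the ratios quoted in this article (10%, about 20%, and two to three times the content size) against the same 100-TB example; `index_size_tb` is a hypothetical helper, and actual ratios will vary by product and content.

```python
def index_size_tb(content_tb, index_percent):
    """Estimate index size when it is index_percent of the content size."""
    return content_tb * index_percent / 100

content_tb = 100  # the 100-TB body of data used in the example above

scenarios = [
    ("10% example", 10),
    ("Fast basic index, about 20%", 20),
    ("rich index, two times content", 200),
    ("rich index, three times content", 300),
]
for label, percent in scenarios:
    print(f"{label}: {index_size_tb(content_tb, percent):g} TB")
```

The spread between the basic and rich cases, 20 TB versus 200 TB to 300 TB for the same content, is why the level of index detail deserves a decision before storage is purchased.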

Another issue is how the search engine links to content repositories. Most search products include out-of-the-box connectors for popular platforms, such as Exchange, Notes, SharePoint and Documentum, as well as general-purpose connectors for file and Web servers. However, IT may need to tweak connectors or build one-off integrations if a critical application or repository isn't supported.

CIOs also need to make sure users don't get access to search results that violate corporate access controls. Most search engines can match user identities to permissions associated with groups in the company's directory system.
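
This kind of permission matching, sometimes called security trimming, can be sketched in Python as a filter that drops any result whose access list shares no group with the searcher. The group names, titles, and `trim_results` helper are invented for illustration; a real engine would pull group membership from the directory system rather than from a hard-coded set.

```python
def trim_results(results, user_groups):
    """Keep only results whose access list shares a group with the user."""
    allowed = set(user_groups)
    return [r for r in results if allowed & set(r["allowed_groups"])]

# Hypothetical search results, each tagged with the groups allowed to see it.
results = [
    {"title": "Salary bands", "allowed_groups": {"hr"}},
    {"title": "Price list", "allowed_groups": {"sales", "marketing"}},
    {"title": "Cafeteria menu", "allowed_groups": {"all-staff"}},
]

# Hypothetical group memberships for the person running the search.
user_groups = {"sales", "all-staff"}

visible = trim_results(results, user_groups)
print([r["title"] for r in visible])  # prints ['Price list', 'Cafeteria menu']
```

Filtering has to happen before results are displayed, since even a title or snippet in a result list can leak information the user isn't cleared to see.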

Bottom line, enterprises should approach search as a strategic technology that will help solve specific business problems. To that end, companies must understand their own requirements when evaluating search platforms--IT should involve business units, legal, and HR, and start with the business case to see where search will provide value. Done right, this is one technology that will pay off not just in dollars, but in productivity and peace of mind.

Illustration by Ryan Etter