Read the Original Article at http://www.informationweek.com/news/showArticle.jhtml?articleID=181500448
Organizations use business intelligence applications primarily with structured data sources -- databases, data warehouses, and database archives. These tend to be number-intensive data sources. But BI services firm the Atre Group estimates that 80-85 percent of data within organizations resides in semi-structured forms (text block fields, documents, attached notes, e-mails, reports, etc.) or unstructured forms (paper and micrographic archives, raw files and backup disks or tapes, books, manuals and so forth). BI is starting to move into this unstructured data territory by applying its analytic, statistical, and classifying technologies. BI is getting into the business of "information-making" by helping IT shops to organize and unlock semi-structured and unstructured data.
That inevitably means BI is also starting to tread into a lot of enterprise application integration (EAI) territory already occupied by some very well placed players, such as Google, IBM, Autonomy, Inxight and others. As BI becomes more deeply involved in the business of information-making -- enabling the searching, structuring, analysis and movement of data beyond familiar numeric constraints -- BI collides with vendors in the search, content management and information integration arenas.
But BI has long been expanding its data and information nets to draw upon semi-structured data such as text sources and location and mapping data. Perhaps the biggest new pool is the often unstructured data lying in otherwise structured databases -- in Binary Large Object (BLOB) fields holding e-mails, instant messages, and other commentary. Organizations are discovering that many valuable nuggets of unstructured data reside side-by-side with very refined structured information.
Structuring data is about making it accessible to decision-making. Data exists on a continuum from unstructured (no container for storage, retrieval, backup, and secured access) through semi-structured (stored, but not yet fully cleansed, sorted, interlinked and made searchable) to structured information (linked and correlated data that's attached to analytic models or processing APIs).
This is the process of information-making -- transforming data first into accessible and then actionable objects in the programming sense. In effect, data becomes information through three steps. First, it's made safely accessible. Next, it's correlated and interlinked in models that explain how the data is interrelated. Last, it's linked to dashboards and processes that allow decision-makers to observe and act on the information.
These models of behavior can remain fairly static, or may change over time. For example, inventory-processing works in a warehouse through a relatively static, stable model. In contrast, the way stocks are valued in the market presents an ever-changing model. And of course, some portions of data may be applied to multiple models. Think of a databank with hundreds of economic data points that are used in many profoundly different explanatory and predictive models. So BI practitioners need not only to set up and operate their decision-making tools, but also to refine constantly their models and the underlying data brought into them.
So now we have the key insight -- new data is constantly coming forward and vying to become part of the decision-making set, while existing information may be re-molded and even discarded in parts. Within BI, semi-structured text, geographic information, and other non-traditional data resources are vying with numeric data as best indicators and predictors for decision models. Thus we arrive at the current state of information affairs -- there are a variety of tool-makers (BI vendors, database firms, information integrators and start-ups) at work on these globs of data, trying to turn unstructured and semi-structured data into more useful and actionable information.
The Information Makers
The new Information Makers are software companies that are unlocking the value of unstructured and semi-structured data. Surprisingly, the single most influential player in the field does not come from among search giants such as Google, Microsoft or Yahoo, nor from among BI players such as Cognos, Hyperion, SAS and SPSS. Rather, it's IBM. IBM has managed to pull together its own research labs' work, along with results from universities, to deliver the Unstructured Information Management Architecture (UIMA). UIMA is an API that provides a series of open, standardized and extensible methods for processing unstructured data of all types (text, speech, video, audio, etc.).
Hadley Reynolds, of the Delphi Group, has noted that "IBM's UIMA framework proposes a new 'standard' for text [and other] analytics implementations that includes common interface definitions and a common data model. It does not include a search engine for distribution or a runtime environment in which to process and provision analytic applications to business systems ... the big news is that IBM is throwing its weight behind an infrastructure that can reduce the complexity of implementing analytic applications."
UIMA has been embedded into IBM products like WebSphere Information Integrator OmniFind Edition, Lotus Workplace and WebSphere Portal Server. Sixteen other vendors have signed on to use the API framework, including a cross-section of BI and content analysis players such as Attensity, ClearForest, Cognos, Inxight, SAS and SPSS. The UIMA framework, together with IBM's acquisitions of Ascential, Bowstreet and iPhrase, signals that IBM is staking a large position in the emerging information-making marketplace. Of equal import is Curt Monash's idea of an ontology management system. Is UIMA the underpinning of such a system? Time will tell.
Another company that's made a big unstructured data play is Autonomy, which in November 2005 bought out search engine company Verity for a cool $500 million. This places Autonomy at the head of the business search market, according to a recent assessment from Forrester. Autonomy offers its Intelligent Data Operating Layer (IDOL) server, which integrates the latest in personalization, collaboration and retrieval features. It also has subsidiaries in the sound-processing (SoftSound) and video-interpretation (Virage) fields. In very quick order, Autonomy has combined the IDOL framework with Verity's K2 advanced word-phrasing, relational taxonomies, and other classifying features. The combo may well add savvy for general search and analytics customers, while bringing new features to its forms, BPM and business search products.
In the area of classifying and categorizing data there are a number of strong startup players. Attensity is a Palo Alto, Calif.-based company that adds natural language-based extraction engines and text analytic tools that not only help to classify and categorize text but also bring that data into XML or relational databases for broader analysis. Attensity is part of the UIMA standards group. Another member of the UIMA group, ClearForest, has developed its ClearPath methodology as a four-step iterative process for the design, installation and fine-tuning of its text-extraction, tagging and analytic methodologies into a BI setting. The whole direction is to make unstructured data available through tags and refinement to structured analytic tools. In contrast, Inxight, also a part of the UIMA group, offers its Smart Discovery Awareness Server, which delivers federated search, clustering and dynamic alerts to customers.
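The general idea of tagging free text and bringing the results into a relational store can be illustrated with a minimal Python sketch. It is not how Attensity, ClearForest or Inxight work; those products rely on linguistic analysis, whereas this toy uses hard-coded regular expressions. The rule set, the table name note_tags and the sample notes are all assumptions made up for illustration.

```python
import re
import sqlite3

# Hypothetical tagging rules: tag name -> regex. Real extraction engines use
# linguistic analysis; these patterns are illustrative only.
RULES = {
    "part": re.compile(r"\b(brake|engine|battery|transmission)\b", re.IGNORECASE),
    "symptom": re.compile(r"\b(noise|leak|failure|overheating)\b", re.IGNORECASE),
}

def extract_tags(text):
    """Return a list of (tag, value) pairs found in free text, in rule order."""
    pairs = []
    for tag, pattern in RULES.items():
        for match in pattern.finditer(text):
            pairs.append((tag, match.group(0).lower()))
    return pairs

def load_notes(conn, notes):
    """Store each extracted tag as a row so ordinary SQL can analyze it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS note_tags (note_id INTEGER, tag TEXT, value TEXT)"
    )
    for note_id, text in notes:
        rows = [(note_id, tag, value) for tag, value in extract_tags(text)]
        conn.executemany("INSERT INTO note_tags VALUES (?, ?, ?)", rows)
    conn.commit()

# Made-up sample notes, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
load_notes(conn, [
    (1, "Customer reports brake noise after rain."),
    (2, "Battery failure on cold start; second battery replaced."),
])

# Once tagged, the text can be queried like any other structured data.
counts = conn.execute(
    "SELECT value, COUNT(*) FROM note_tags WHERE tag = 'part' "
    "GROUP BY value ORDER BY value"
).fetchall()
print(counts)
```

The point of the sketch is the last query: once the text has been reduced to tagged rows, standard structured analytic tools can count, join and filter it.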
Two BI stalwarts, SAS and SPSS, are the Schwarzenegger-DeVito Twins of statistical analytics. They often hover around the same markets in their own distinctive ways. Both have chosen to adopt the UIMA framework and are investing in text-mining. SPSS is using its LexiQuest and Text Builder routines to improve the later stages of text analysis and to provide better data and concepts for its Clementine Data Mining Engine. Ditto for the SPSS Text Analysis for Surveys tool, which improves the caliber of clustering and extractions before text-based response data is passed along for analysis. In other words, SPSS does text analysis to improve its other statistical and analytic procedures. SAS does the same, as its SAS Data Miner has added text-refining capabilities to include vast deposits of textual data. Integrating text-based information with structured numeric data enriches predictive modeling capabilities. This might explain why SAS Text Miner is sold only in conjunction with SAS Enterprise Miner. For its part, Oracle has not only updated its Text Miner Engine but added new smart search technology from recently acquired TripleHop, combining constantly updated user profiles with information discovery and categorization capabilities. The built-in Oracle Text Miner has been recognized for having robust classification and theme extraction features, combined with three major indexing strategies. The combination of the two technologies could become a rival to IBM's UIMA framework.
Likewise, Google is investing in unstructured data technologies. Curiously, its highly refined Web Search algorithms work less well in a corporate setting, where they run into three major barriers. First, Google uses hyperlink references extensively to rank and classify Web pages. But hyperlinks occur much less frequently in corporate shops. Second, the types of data files available to Google are less complete in the Web space, so Google has had to modify its search engine for organizational settings. Third, a lot of security and access restrictions exist on files in corporate shops. However, Google recently made an agreement with IBM to cooperate on enabling searches on database and structured data resources -- this may help with finding acceptable solutions to access restrictions.
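The first barrier can be made concrete with a small Python sketch. It uses a simplified PageRank-style power iteration, which is an assumption standing in for link-based ranking in general and not Google's actual algorithm. The function name link_rank, the damping value and the two sample document sets are all hypothetical.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by incoming links using a simple power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# A small linked "Web": two pages point at c, and c points at a.
web_rank = link_rank({"a": ["c"], "b": ["c"], "c": ["a"]})

# A made-up corporate file share: documents that never link to each other.
intranet_rank = link_rank({"memo.doc": [], "report.pdf": [], "notes.txt": []})

print(web_rank)
print(intranet_rank)
```

In the linked set the scores separate, with the most-referenced page on top. In the unlinked set every document ends up with the same score, so link analysis alone gives a search engine nothing to rank by.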
Meanwhile, Google is leading the brand recognition race in search. Google has published all its APIs for every service from Web Search through Maps and Froogle to the new Video, Talk and Google Desktop. And Google has cleverly seeded the development space by sponsoring the Summer of Code initiative, which funds hundreds of projects bringing code, often Google-based, to Open Source projects. In sum, Google is building up an open base of APIs that start to manipulate (but don't necessarily analyze) unstructured data. Don't count Google out of the unstructured data business and the race to get its APIs into developers' mind space.
In sum, there are a number of players attracted to managing and analyzing the 80-85 percent of all corporate data that's stored in unstructured and semi-structured formats. These vendors are determined to bring their diverse information-making skills, algorithms and methods to bear on data that's inherently ambiguous, multi-faceted, voluminous, and difficult to process. Nonetheless, they have some successes and ROI rewards to show for their efforts.
The Unstructured Rewards
One of the drivers in the unstructured data market is businesses' need to get trusted views of customers, products, partners and competitors, and to marshal all their organizational data assets in doing so. Another very real current driver is the war on crime and terror. Governments and law enforcement at all levels are much more aware of "needing to connect the dots." Vendors such as Attensity, ClearForest and SPSS trumpet their contributions to these efforts.
The next most common uses of unstructured data fall in the arenas of legal compliance and customer service. Making certain that an organization is meeting its regulatory obligations, or discovering patent research precedence, are just two areas where text-mining and unstructured analysis can be useful. Some organizations have turned this focus outward and are using unstructured analytics to improve their customer-facing call center and service operations. They use the analytics to identify consistent trouble spots in their products or services so they can alert support personnel on how to handle these cases and advise engineering on products that need fixes. Evidence shows that diverse retail and financial services companies are finding substantial return through better customer service, and they're investing in improving their processes.
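Spotting consistent trouble spots in call center notes can be sketched in a few lines of Python. This is a deliberately naive illustration, not any vendor's method: it counts how often a product term and an issue term appear in the same note. The vocabularies, the min_count threshold and the sample notes are all assumptions.

```python
from collections import Counter

# Hypothetical vocabularies; a real deployment would derive these from
# a taxonomy or a text-mining tool rather than hard-coding them.
PRODUCTS = {"router", "modem", "handset"}
ISSUES = {"dropped", "overheats", "reboot", "static"}

def trouble_spots(notes, min_count=2):
    """Count (product, issue) pairs per note and keep the recurring ones."""
    pair_counts = Counter()
    for note in notes:
        # Normalize to a set of lowercase words with punctuation trimmed.
        words = {w.strip(".,;:!?").lower() for w in note.split()}
        for product in words & PRODUCTS:
            for issue in words & ISSUES:
                pair_counts[(product, issue)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Made-up call center notes.
notes = [
    "Router overheats after an hour.",
    "Customer says the router overheats and needs a reboot.",
    "Handset has static on calls.",
    "Modem dropped connection twice.",
]
spots = trouble_spots(notes)
print(spots)
```

Pairs that recur across notes are the candidates to pass along to support staff and engineering; one-off pairs fall below the threshold and are dropped.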
Finally, BI is seeing all sorts of improvements based on the use of supplementary unstructured data. SAS and SPSS have been improving their statistical analysis routines by adding categorized or clustered results drawn from additional data sources such as attached memo notes or survey remarks. This analysis of largely untapped data is in turn allowing organizations to provide better and more targeted services to customers, and also to respond more quickly when problems arise. Automobile companies are using ClearForest text analysis tools to improve their warranty and product service offerings. By continually analyzing the text notes on service requests and repair reports, automakers get:
When customers have added up the benefits, they've seen substantial returns through improved business process management. In general, this is a recurring pattern in such BI arenas as scorecarding, business metrics measurements or business activity monitoring. The more often underlying textual or location observations from customers, suppliers and employees are added to the analysis, the truer and more trusted the view of the product that emerges. Given a growing battery of tested tools, organizations can ill afford to ignore their unstructured data resources.
Jacques Surveyer is a writer and consultant; see some of his work at theOpenSourcery.com