Aug 23, 2011 (10:08 AM EDT)
10 Lessons Learned By Big Data Pioneers
Read the Original Article at InformationWeek
What does it take to make the most of big data, as in tens, if not hundreds of terabytes of information? That depends on your needs and priorities. Ad-delivery firm Interclick found a fast platform that helps it be more productive while also delivering near-real-time insight. Harvard Medical School learned that data can grow even when obvious measures such as patient counts and years of data studied remain constant. comScore, the digital-media measurement giant, has twelve years of experience taking advantage of data compression by way of a column-store database. In fact, it uses sorting techniques to optimize compression and reduce processing demands.
Yahoo, eHarmony, Facebook, Netflix, and Twitter have discovered that Hadoop is an ideal, low-cost platform for processing unstructured data. This open-source project is not just for Internet giants, however. JPMorgan Chase and other mainstream businesses are also taking advantage of Hadoop. And as data supplier InfoChimps has discovered, Hadoop is fast maturing, with a growing selection of add-on and helper applications available to support deployments.
Keep in mind that not all big-data deployments are measured by total scale. LinkShare, for instance, retains only a few months' worth of data, but each day it loads and must quickly analyze tens of gigabytes, so it's a big deployment when measured on a daily scale. Perhaps the most important lesson detailed in this image gallery is to heed Richard Winter's advice to pay attention to all six dimensions of data warehouse scalability. Only then can you formulate an accurate request for proposal, test for the most demanding needs, and make appropriate technology investments that will meet long-term needs.
Massively parallel processing platforms, column-store databases, in-database processing techniques, and in-memory computing options can slash query times from days or hours down to minutes or seconds, but what's the big hurry? As New York ad-delivery firm Interclick has discovered, the most obvious benefit of fast analysis is productivity. Quick responses free up time for more queries, deeper analysis or both. A second benefit is near-real-time insight, whereby analyses can be acted upon while there's still an opportunity. With fast response, Interclick can serve up ads to targeted segments of Web surfers within hours or even minutes of demonstrated behavior. An ad for an airline, hotel chain, or car rental agency, for example, can be delivered soon after someone has visited a travel-related Web site--and before they've made all their arrangements. Interclick's speed enabler is a ParAccel column-store database deployment turbocharged by a 3.2-terabyte-RAM in-memory cluster.
When planning a data warehouse investment, look beyond the simple dimensions of customer, record, or transaction counts. That's a lesson learned by Harvard Medical School, which has long examined about 20 years' worth of medical records to study the efficacy and risks of various drugs. While the patient counts and timeframes have remained fairly constant, the richness of each medical record has grown as new measures such as body mass index and LDL cholesterol have emerged. Data grows in unexpected ways, so understand all dynamics before projecting needs.
Better data compression saves on storage, and that's still important even as hardware costs per terabyte have declined. Column-store databases, such as HP Vertica, Infobright, ParAccel, and Sybase IQ, can achieve 30-to-1 or 40-to-1 compression while row-store databases, such as IBM DB2, Microsoft SQL Server, and MySQL, average 4-to-1 compression. That's because columnar data is consistent, containing all zip codes or all purchase order numbers, for example. Rows hold a mix of data, such as all the attributes associated with an individual customer--name, address, zip, purchase order number, and so on. The Aster Data and Oracle databases offer hybrid row/columnar features. Oracle's Hybrid Columnar Compression, for one, can crunch data at a 10-to-1 ratio.
Compression levels vary depending on the data, and keep in mind that column-store databases aren't always the best choice. If your queries call on many attributes, a row-store product may deliver better performance. Indeed, row-store databases are more commonly used for enterprise data warehouses handling a mix of queries, whereas column-store databases more often power focused data marts. Column-store customers include digital-media measurement giant comScore, a Sybase IQ user since 1999, and fast-growing online network Interclick, which deployed ParAccel in 2009.
Just as consistent columnar data aids compression, you can improve compression further by sorting data before loading. comScore uses Syncsort DMExpress software to sort data alphanumerically before it's loaded into Sybase IQ. Where 10 bytes of unsorted data can be compressed to three or four bytes, says Michael Brown, comScore's chief technology officer, 10 bytes of sorted data can typically be crunched down to one byte. "That makes a huge difference in the volume of data we have to store," Brown says.
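To see why sorted data tends to compress better, consider a minimal Python sketch. It uses simple run-length encoding as a stand-in technique; the article does not say which compression scheme Sybase IQ applies, and the column values and the function name run_length_encode are invented for illustration.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse each run of identical adjacent values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]

# Hypothetical column of state codes in arrival (unsorted) order.
unsorted_col = ["NY", "CA", "TX", "NY", "CA", "TX", "NY", "CA", "TX", "NY"]
sorted_col = sorted(unsorted_col)

unsorted_runs = run_length_encode(unsorted_col)  # one pair per value: no adjacent repeats
sorted_runs = run_length_encode(sorted_col)      # one pair per distinct value

print(len(unsorted_runs), "runs unsorted vs.", len(sorted_runs), "runs sorted")
```

Sorting places identical values side by side, so the encoder stores a few long runs rather than many short ones. Real column stores combine several encodings, so actual ratios will differ from this toy case.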
Sorting also can streamline processing. comScore sorts URL data to minimize Web site taxonomy lookups. Instead of loading the 40 URLs for Web site pages in the order they were visited during a session, sorting might reveal that 20 of those pages were on Facebook, 12 were on Gmail, and the balance were at NYTimes.com. The sorted data would trigger just three site lookups, whereas unsorted data might trigger many redundant lookups if the visitor bounced back and forth among just a few sites. "That saves a lot of CPU time and a lot of effort," Brown says. It's possible to sort data with SQL statements and custom scripts, but sorting is also a common feature in data-integration software from IBM, Informatica, Oracle, SAP, SAS, Syncsort, and others. At truly large scale, Hadoop is an option for sorting and other processing steps.
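The lookup savings can be sketched in a few lines of Python. This is an illustration only, not comScore's pipeline: it assumes a lookup is triggered each time the site changes from one record to the next, and the URLs, the round-robin visit order, and the function name count_site_lookups are all hypothetical.

```python
from itertools import groupby, zip_longest
from urllib.parse import urlparse

def count_site_lookups(urls):
    """Count lookups, assuming one lookup each time the site changes between records."""
    sites = (urlparse(url).netloc for url in urls)
    return sum(1 for _ in groupby(sites))

# Hypothetical 40-page session: 20 Facebook, 12 Gmail, and 8 NYTimes.com pages.
facebook = [f"http://facebook.com/p{i}" for i in range(20)]
gmail = [f"http://gmail.com/m{i}" for i in range(12)]
nytimes = [f"http://nytimes.com/a{i}" for i in range(8)]

# Simulate a visitor bouncing between sites by interleaving the pages round-robin.
session = [url for group in zip_longest(facebook, gmail, nytimes)
           for url in group if url is not None]

lookups_unsorted = count_site_lookups(session)
lookups_sorted = count_site_lookups(sorted(session))
print(lookups_unsorted, "lookups unsorted vs.", lookups_sorted, "sorted")
```

A cache keyed by site would also avoid the redundant lookups, but sorting does so without holding extra state in memory, which matters when the input is far larger than a single session.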
Apache Hadoop, one of the fastest-growing open-source projects going, is a collection of components for handling distributed data-processing, particularly large volumes of unstructured data such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs. MapReduce is a Hadoop-supported programming model for rapid processing of masses of information. Conventional relational databases, such as IBM Netezza, Oracle, Teradata, and MySQL, can't handle this data because it doesn't fit neatly into columns and rows. And even if they could do the job, the cost of the licenses would be prohibitive, as we're talking about hundreds of terabytes or even petabytes. Hadoop software is free, and it runs on low-cost commodity hardware. (Keep in mind that puppies are free, too -- in other words, Hadoop deployments require care and feeding that is not free.)
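For readers new to the MapReduce model, here is a minimal single-process Python sketch of its three stages (map, shuffle, reduce) applied to a word count over log lines. It does not use Hadoop or any Hadoop API; the function names and sample records are invented for illustration, and a real cluster would spread each stage across many machines.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical unstructured log lines.
logs = ["error disk full", "warning disk slow", "error network down"]
counts = map_reduce(logs)
print(counts)
```

Because each map call sees only one record and each reduce call sees only one key, both stages can run in parallel on separate machines, which is what lets the model scale out on commodity hardware.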
Hadoop pioneers include Yahoo!, eHarmony, Facebook, Netflix, and Twitter, but even strait-laced financial giants like JPMorgan Chase are putting Hadoop to work. A growing list of commercial support options will only help Hadoop grow.
The Hadoop market is expected to grow into the billions of dollars, and supporting products and integrations are quickly emerging. Well-known data-integration vendors Informatica, Pervasive Software, SnapLogic, and Syncsort, for example, have all announced products or integrations aimed at making it faster and easier to work with this young processing platform.
Pervasive Software's Data Rush tool optimizes concurrent, parallel processing within Hadoop. Data provider InfoChimps uses Data Rush in combination with Hadoop instances running in Amazon's Elastic Compute Cloud. InfoChimps CTO Philip Kromer says he has seen 2-4X performance increases in tests of Data Rush involving hundreds of gigabytes, cutting 16-hour jobs down to four to eight hours. That makes it possible for InfoChimps to reduce computing costs and harvest that much more data from Twitter and other non-relational data sources.
Informatica, SnapLogic, Syncsort and others are making it possible to load, sort, and aggregate data using a single tool set across conventional databases and Hadoop deployments. A single, familiar approach and tool set should make it easier for your data management professionals to do their work.
In many scenarios, the "big" in big data isn't the ultimate scale of the database so much as the amount of information loaded and analyzed each day. Marketers, for example, typically need to load and analyze lots of data as quickly as possible so the insight can be quickly applied to identify new segments and lists, and to improve targeting or creative content for the next campaign. If you know what's working sooner, you won't waste money on segments, enrichment data, or marketing appeals that aren't fruitful. To provide search, lead-generation, and affiliate marketing services to publishers and advertisers, ad network LinkShare loads and analyzes tens of gigabytes of clickstream data per day, but the total database tops out at just six terabytes. Low-latency insight is now a competitive must for LinkShare. "Five years ago it was okay to give people yesterday's data, but that's not good enough any longer," says Jonathan Levine, LinkShare's chief operating officer.
Netezza and Greenplum broke into the data warehousing market in the mid-2000s by outscaling conventional Oracle, IBM DB2, and Microsoft SQL Server deployments. Times have changed. Oracle introduced Exadata in 2008, IBM bought Netezza last year, and early this year, Microsoft started shipping SQL Server 2008 Parallel Data Warehouse (PDW) appliances. IBM, Microsoft, and Oracle shops now have good reason to consider incumbent vendors. The DirectEdge stock exchange, for example, has long been a Microsoft shop, and that made PDW (a very new and yet-to-be-market-proven product) "an obvious choice," says chief technology officer Richard Hochron.
All current DirectEdge business intelligence and data warehousing assets are built on Microsoft technology, including some 200 finance, strategy, compliance, legal, and regulatory reports built on Microsoft SQL Server Reporting Services. That doesn't mean PDW got a free ride. Hochron oversaw a proof-of-concept project using DirectEdge data and "proving specific points that were very important to us." Automotive data provider Polk, an Oracle shop, chose Exadata in large part because of staff familiarity with managing the database, but it waited for V2 "because we never buy version 1.0 of anything," says Doug Miller, director of database development and operations.
Some data warehousing platforms offer intergenerational compatibility while others force you to migrate and retire the old stuff. Teradata, for one, has long maintained compatibility between several generations of releases so you can mix old and new hardware to scale up the total environment. If there's no compatibility, the boxes can't be linked to access data as a single environment. Even when compatibility is an option, there are limits to just how many generations of databases and hardware vendors can span, so check with your prospective vendor on compatibility and the long-term upgrade roadmap. Retailing giant Walmart has been a Teradata customer for more than 20 years, and last year it reached an agreement with Teradata to extend the relationship. Teradata powers Walmart's enterprise data warehouse, a data store that analyst Curt Monash pegged at 2.5 petabytes back in 2008. As part of the new agreement, Walmart's Teradata deployment will be expanded and refreshed. Up-to-date hardware usually offers good reason to upgrade. The latest Teradata products are said to use 50% less floor space and 40% less energy than older generations of hardware.
Database expert Richard Winter advises those planning investments in new data-warehousing platforms to consider six dimensions of scalability: data size, data complexity, number of users, query volume, query complexity, and data latency requirements. Lots of concurrent users (like 1,000 or 10,000 or more), mixed queries, and complex analyses can be every bit as constraining as sheer big-data scale. Fail to anticipate demands along any one of these dimensions and you may outgrow your system sooner than expected.
Finally, it's an absolute must to test your prospective platform with your most complex data, your hardest queries, and a best attempt at replicating the workloads you'll face in terms of numbers of concurrent users and the mix of queries that will challenge the data warehousing platform.