May 23, 2008 (08:05 PM EDT)
Reinvent The Data Warehouse With Column-Store Databases And Appliances
Read the Original Article at InformationWeek
Data warehouses are growing fast. Nearly 40% of organizations say data volumes are increasing as much as 50% per year, while 18% say their warehouses are doubling in size annually, according to an IDC survey. To cope with the growth, the underlying databases have to be carefully tuned, but even then, many midsize and large companies now face six-, seven-, or even eight-figure upgrades of legacy data warehouses.
Enter data warehouse appliances and column-store databases. Both have taken off in recent years, with venture capitalists placing bets on a slew of startups. They're safe bets, too, given the corporate desire to do more in-depth analysis of all available data.
"One of the reasons Wal-Mart, Staples, and Amazon.com have been so successful is that they analyze their data, and they have it at the fingertips of the entire enterprise," says Foster Hinshaw, CEO of upstart appliance vendor Dataupia and a co-founder of Netezza, the 8-year-old leader in the data warehouse appliance market. Companies must be able to drill into all their data to understand where to locate a new store, which products are selling, which need to be moved to different locations, and what programs to offer customers, he says.
Boasting fast query performance, ease of deployment, and prices as low as $10,000 per terabyte, appliances have been wooing business away from the leading data warehouse vendors. It's no surprise that IBM and Teradata have responded with appliances of their own, or that Oracle has optimized reference configurations for third-party hardware. But the incumbents have yet to respond to the threat from column-store databases, which can deliver the industry's fastest query performance on complex analytic queries.
It sounds like a promising new era for data warehousing, but buyer beware. Appliances and column-store databases aren't always suitable replacements for a conventional enterprise data warehouse, or EDW. In fact, these alternatives are most often used for data marts that off-load data-intensive applications from the EDW, thereby avoiding (or at least delaying) the need to replace the main data warehouse. Whether you're looking for an analytic data mart or your next data warehouse, here's what to look for beyond the dazzling price, scalability, and performance claims.
QUERIES BY COLUMN
Unlike a conventional database that stores data in rows, a column-store database looks at data vertically. In the case of a customer database, for example, the row-store database queries across each customer record, while the column-store database can query selected columns. The row approach is great for OLTP, with lots of writes for each new customer transaction. Column-store is ideal for OLAP, with lots of reads against particular attributes. If you want to explore sales by ZIP code and product, your query will interrogate only the ZIP code and product SKU columns rather than wading through names, addresses, and other irrelevant attributes.
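The contrast can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation; the field names (zip_code, sku, amount) are invented for the example. The point is that the "sales by ZIP and SKU" query touches only the columns it needs:

```python
from collections import defaultdict

# Row-store layout: each record kept together -- good for OLTP,
# where a new transaction writes one whole row at a time.
rows = [
    {"name": "Ann",  "zip_code": "02139", "sku": "A1", "amount": 19.99},
    {"name": "Bob",  "zip_code": "02139", "sku": "B2", "amount": 5.49},
    {"name": "Cara", "zip_code": "10001", "sku": "A1", "amount": 19.99},
]

# Column-store layout: each attribute stored as its own array.
columns = {
    "name":     [r["name"] for r in rows],
    "zip_code": [r["zip_code"] for r in rows],
    "sku":      [r["sku"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}

# Sales by ZIP code and SKU reads only three columns; names and
# addresses are never touched (in a real column store, never read
# from disk at all).
totals = defaultdict(float)
for z, s, amt in zip(columns["zip_code"], columns["sku"], columns["amount"]):
    totals[(z, s)] += amt

print(totals[("02139", "A1")])  # 19.99
```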
Another advantage of the column-store database is that it can apply optimized compression--upwards of 10-to-1--because the data in each column is consistent (all names, for example). As long as your query isn't record-intensive--that is, it doesn't involve a lot of attributes--the amount of data read from a column store is typically far less than from a conventional database, so it delivers faster performance.
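Run-length encoding is one simple example of why uniform column data compresses so well (real column stores combine several schemes, and this sketch is only illustrative). A sorted, low-cardinality column collapses to a handful of (value, count) pairs:

```python
from itertools import groupby

def rle(values):
    """Collapse consecutive runs of identical values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A sorted two-letter state column: 1,000 values, only 3 distinct runs.
state_column = ["CA"] * 500 + ["MA"] * 300 + ["NY"] * 200
encoded = rle(state_column)

print(encoded)                          # [('CA', 500), ('MA', 300), ('NY', 200)]
print(len(state_column), "->", len(encoded))  # 1000 -> 3
```

A row store interleaves states with names, addresses, and amounts, so runs like this never form; storing each column separately is what makes the compression possible.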
Column-store databases are ideal for many applications. Telecommunications companies have billions of call data records that need to be in an enterprise data warehouse for billing and historical purposes, but columnar extracts can be placed in a data mart for marketing analysis. Big retailers, banks, and insurance companies take much the same approach to look for up-sell and cross-sell opportunities, fraud, and mortality trends, respectively.
The Internal Revenue Service launched one of the first column-store deployments in 1996, using Sybase IQ, in what began as a 2-TB warehouse. Back then, 2 TB was a very large database, says Jeff Butler, the IRS's director of research databases. The columnar database brought query times down from days to a matter of hours, "which was just a phenomenal boost in productivity," Butler says. "Today, we're at 158 TB, and the queries that once took hours now take seconds."
The IRS's analytic database stores population-level data (meaning no names or Social Security numbers) from every tax return over the last 10 years. The warehouse is used to explore filing and compliance behavior, as well as the potential impact of tax code changes.
The IRS went with a column-store database because it didn't know what data elements it would need to query from one question to the next, says Butler. With a static reporting application, you can use just about any database because you can do a lot of optimization, he says, "but that's not our situation."
PERFORMANCE IN A BOX
What most data warehouse appliances have in common, whether built on row- or column-store databases, is a massively parallel processing, shared-nothing architecture. MPP means that the query load is spread across many processors, or nodes, usually on commodity hardware running Linux. Shared nothing means that each node is independent, with its own memory and storage. The result is high performance without the expense of the high-powered, symmetric multiprocessor servers that typically run conventional data warehouses.
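The scatter-gather pattern at the heart of MPP can be sketched as follows. This is a thread-based toy, assuming a simple SUM-GROUP BY query; real appliances run each scan on a separate machine with its own disks, which is what "shared nothing" means:

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" owns a private slice of the data: (sku, amount) pairs.
shards = [
    [("A1", 10.0), ("B2", 5.0)],
    [("A1", 7.0),  ("C3", 2.0)],
    [("B2", 1.0),  ("A1", 3.0)],
]

def scan_node(shard):
    """Local work on one node: SUM(amount) GROUP BY sku over its own slice."""
    partial = {}
    for sku, amt in shard:
        partial[sku] = partial.get(sku, 0.0) + amt
    return partial

# All nodes scan in parallel; no node touches another node's data.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partials = list(pool.map(scan_node, shards))

# A coordinator merges the small per-node partials into the final answer.
totals = {}
for partial in partials:
    for sku, amt in partial.items():
        totals[sku] = totals.get(sku, 0.0) + amt

print(totals)  # {'A1': 20.0, 'B2': 6.0, 'C3': 2.0}
```

Because each node only ships a small aggregate back to the coordinator, adding nodes scales the scan almost linearly, which is why commodity hardware can undercut large symmetric multiprocessor servers.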
Appliances also are gaining ground because they're easier to deploy and maintain than conventional warehouses, which have to be tuned, optimized, and, lately, clustered to perform in large-scale deployments. Trading on this appeal, column-store database vendors, including ParAccel, Sybase, and Vertica, have introduced software-hardware bundles built on third-party hardware.
In the case of Teradata, which was the first to bundle hardware and software in an MPP, shared-nothing architecture (without calling it an appliance), the upstarts are competing primarily on price. Teradata responded last month with its own appliances--one for data marts and one for small warehouses--while also upgrading the performance and scalability of its core EDW product.
WEIGH THE OPTIONS
When evaluating an appliance or column-store database, consider whether you're replacing the EDW or off-loading complex, data-intensive analytic queries to improve performance, delay an enterprise data warehouse upgrade, or both. If it's the former, keep in mind that column-store products aren't suitable for row-intensive queries involving many attributes. Does your EDW have to support thousands of users with a mix of query types? Not all appliances can do that. Teradata's flagship product, IBM's InfoSphere Balanced Warehouse E-Class, Hewlett-Packard's Neoview, DATAllegro v3, and Greenplum's G3 are all technically suitable for EDW deployment, though HP, DATAllegro, and Greenplum have to use partners to match Teradata's or IBM's data integration and information management software.
Appliances and column-store databases are great for analytic data marts and focused warehouses tackling complex queries and extreme data volumes. Case in point: NYSE Euronext, the parent company of the New York Stock Exchange, has replaced three aging 100-TB warehouses built on Oracle databases with three Netezza Performance Servers. Complex queries that took as long as 26 hours in the conventional warehouse now take two and a half minutes, while simple queries that took seven minutes now take five seconds, says Steve Hirsch, chief data officer.
However, these aren't typical EDWs. Loads are limited to hundreds of queries per day, he says, and the user community tops out at 150 per appliance, with only 20 concurrent users on a device. One of the broadest Netezza deployments has more than 600 users and 20,000 queries per day, Netezza says, but it's not unusual for large EDWs to have tens of thousands of concurrent users and hundreds of thousands of queries a day.
While not massive in terms of its user base, Euronext does do complex analytic calculations. "It's very possible that we could hit 40 to 50 TB of data in a single query," Hirsch says.
Once you understand your data warehousing needs, including the number of users and the depth, diversity, and volume of queries, you'll be ready to wade through all the confusing vendor claims. And when you get down to the finalists, insist on a proof-of-concept deployment.
"Don't trust the FUD or the TCP-H benchmark statistics or anything else," Gartner analyst Donald Feinberg says. "You need to put your data out there and run your queries and see which alternative is faster in your application." That's the only way to know for sure that you've made the right choice.