Feb 22, 2008 (07:02 PM EST)
Database Pioneer Rethinks The Best Way To Organize Data
Read the Original Article at InformationWeek
Is there a better way to build a data warehouse? For years, relational databases--which organize data in tables composed of vertical columns and horizontal rows--have served as the foundation of data warehouses. Now database pioneer Michael Stonebraker is promoting a different way of organizing them, promising much faster response times.
DB2, MySQL, Oracle, SQL Server, Sybase, and Teradata all have rows as their central design point. Yet in data warehousing, faster performance is gained through a column layout, maintains Stonebraker, CTO of startup Vertica Systems, which last week introduced its Vertica 2.0 column-based database, due for general release in March. Stonebraker says all types of queries on "most data warehouses" will run up to 50 times faster in a column database. The bigger the data warehouse, the greater the performance gain.
While a scientist at the University of California at Berkeley in the 1970s, Stonebraker was one of the original architects of the Ingres relational database, which spawned several commercial variants. A row-based system like Ingres is great for executing transactions, but a column-oriented system is a more natural fit for data warehouses, Stonebraker now says.
Why? Data warehouses frequently store transactional data, with each transaction having many parts. Columns cut across transactions and store an element of information that's standard to each transaction, such as customer name, address, or purchase amount.
A row, by comparison, may hold 20 to 200 different elements of a transaction. A standard relational database would retrieve all the rows reflecting, say, sales for a month, load the data into system memory, then find all sales records and generate an average from them. Being able to focus on just the "sales" column leads to improved query performance.
There's a second performance benefit in the column approach. Because columns contain similar information from each transaction, it's possible to derive a compression scheme for the data type, then apply it throughout the column. Rows can't be compressed as easily because the nature of the data--name, ZIP code, and account balance, for example--varies from record to record. Each row would require a different compression scheme.
Compressing data in columns makes for faster storage and retrieval and reduces the amount of disk required. "In every data warehouse I see, compression is a good thing," Stonebraker says.
Twenty customers have begun using Vertica's database since it was released in September, says CEO Ralph Breslauer. One of them is Sonian Networks, a 10-employee company that archives e-mail for other businesses. Sonian CTO Greg Arnette says he considered PostgreSQL, MySQL, and Oracle for business intelligence but opted for Vertica, based in part on "Stonebraker's reputation."
Arnette expects Sonian's e-mail archive to grow from a few terabytes to a petabyte sometime next year. "We needed a system that was optimized for high-volume reporting," Arnette says. In tests conducted by Sonian, Vertica was "a lot faster" than MySQL and required less CPU power, he says.Sonian's e-mail archive and Vertica data warehouse are hosted on Amazon.com's Simple Storage Service and computing service. Sonian's business customers are able to access reports generated by its data warehouse.
IBM VP Anant Jhingran, in a blog posting last fall, called the column database concept "academically interesting," but suggested that APIs and integration issues stand in the way of widespread adoption. But, with $23.5 million in venture capital funding, Vertica has grown to 50 employees. Says Stonebraker, "I expect the data warehouse market to become completely column-store based."