Jul 23, 2010 (08:07 AM EDT)
Big Data: The Early Days Are Over

Read the Original Article at InformationWeek

At least five innovative data warehousing practitioners have stepped up to share their stories for our "mastering big data" feature article planned for August 9. Their accounts show that application-specific needs are diverse, making generic speeds-and-feeds and TPC-H benchmark claims all the more irrelevant.

I'll get to the list of my latest customer interviews in a moment, but first a refresher. As I detailed in this column, the big-data era isn't new. Despite claims that the market is suddenly red-hot (now that the big vendors have finally responded), data volumes have been steadily growing for years.

Pioneering independent vendors have led the way toward highly scalable and performance-oriented approaches including massively parallel processing (MPP), column-store databases, in-database analysis and, more recently, NoSQL approaches. Going back to the well of stories published at IntelligentEnterprise.com in recent years, consider the examples of Sweden's TradeDoubler and India's Reliance Communications, shared in this story posted in June 2008.

TradeDoubler is a pan-European new-media marketing firm that needed faster load speed and analytic performance than it could achieve in an existing Oracle deployment. The company chose Infobright, which offers a column-store database that runs on commodity symmetric multiprocessor (SMP) hardware -- TradeDoubler chose a $12,500 Dell server that's probably much cheaper today.

In June '08, TradeDoubler had more than 125,000 Web sites in its network and was tracking 20 billion ad impressions, 265 million unique visitors and 12 million leads per month. The data mart retained only three days' worth of clickstream data and 60 days' worth of aggregated online order data, so it was actually less than a terabyte in size. But with rapid data turnover, TradeDoubler was loading 2 billion rows of data per day, and it was hitting a wall.
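
For a sense of what that load rate means in practice, here's a quick back-of-the-envelope calculation based on the figures above (the per-second rate and retained-row count are derived here, not stated in the article):

# Rough math on TradeDoubler's stated load rate (input figures from the
# article; the derived numbers are illustrative estimates only).

rows_per_day = 2_000_000_000          # ~2 billion rows loaded daily
clickstream_retention_days = 3        # only three days of clickstream kept

sustained_rows_per_second = rows_per_day / (24 * 60 * 60)
retained_clickstream_rows = rows_per_day * clickstream_retention_days

print(f"~{sustained_rows_per_second:,.0f} rows/second sustained")   # ~23,148
print(f"~{retained_clickstream_rows:,} clickstream rows retained")  # 6,000,000,000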

"We had a one person working with the data full time, but depending on the complexity of the queries, it took anywhere from half a data to two days to get the data out," explained CTO Ola Uden.

TradeDoubler was able to load, rebuild and query the Infobright database all within the same day. The gains were due partly to the column-store compression (said to be 30 times that of a relational database) and partly to the fact that Infobright auto-indexes and doesn't need the partitioning and tuning required to make relational databases perform. (Infobright says its database requires up to 90% less admin work than Oracle, Microsoft SQL Server or IBM DB2 and is half the cost in terms of licensing and storage requirements.)
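
To illustrate why column orientation compresses so well, here's a toy sketch -- not Infobright's actual engine, just the general idea. Because a column store keeps each column's values contiguous, long runs of repeated values (country codes, campaign IDs, status flags) can be collapsed with simple schemes such as run-length encoding, whereas a row store interleaves columns and breaks up those runs.

# Illustrative only: a toy run-length encoder, not Infobright's algorithm.
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# A sorted "country" column from a hypothetical clickstream table.
country_column = ["DE"] * 5 + ["FR"] * 3 + ["SE"] * 4

print(rle_encode(country_column))
# [('DE', 5), ('FR', 3), ('SE', 4)]  -- 12 values stored as 3 pairs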

TradeDoubler's example is one of big-data loading and turnover rather than sheer scale, and it's a common workload requirement in Web clickstream analysis. TradeDoubler could have easily built a larger-scale, higher-performance Oracle-based warehouse (even before the fall 2009 introduction of Oracle Exadata V2), but Uden said the costs would have been much higher than its Infobright investment.