Jul 23, 2010 (08:07 AM EDT)
Big Data: The Early Days Are Over
Read the Original Article at InformationWeek
At least five innovative data warehousing practitioners have stepped up to share their stories for our "mastering big data" feature article planned for August 9. Their accounts show that application-specific needs are diverse, making generic speed, feed and TPC-H benchmark claims all the more irrelevant.
I'll get to the list of my latest customer interviews in a moment, but first a refresher. As I detailed in this column, the big-data era isn't new. Despite claims that the market is suddenly red-hot (now that the big vendors have finally responded), data volumes have been steadily growing for years.
Pioneering independent vendors have led the way toward highly scalable and performance-oriented approaches including massively parallel processing (MPP), column-store databases, in-database analysis and, more recently, NoSQL approaches. Going back to the well of stories published at IntelligentEnterprise.com in recent years, consider the examples of Sweden's TradeDoubler and India's Reliance Communications, shared in this story posted in June 2008.
TradeDoubler is a pan-European new-media marketing firm that needed faster load speed and analytic performance than it could achieve in an existing Oracle deployment. The company chose Infobright, which offers a column-store database that runs on commodity symmetric multiprocessor (SMP) hardware -- TradeDoubler chose a $12,500 Dell server that's probably much cheaper today.
In June '08, TradeDoubler had more than 125,000 Web sites in its network and was tracking 20 billion ad impressions, 265 million unique visitors and 12 million leads per month. The mart retains only three days' worth of clickstream data and 60 days' worth of aggregated online order data, so it was actually less than a terabyte in size. But with rapid data turnover, TradeDoubler was loading 2 billion rows of data per day, and it was hitting a wall.
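To put that load figure in perspective, here is a back-of-the-envelope sketch of the sustained ingest rate it implies. The 2 billion rows per day comes from the account above; the function name and the eight-hour load window are hypothetical, chosen only for illustration.

```python
# Rough arithmetic only: converts a daily row count into a sustained load rate.
# ROWS_LOADED_PER_DAY is the figure quoted in the article; the 8-hour window
# used below is an assumption, not something TradeDoubler reported.

ROWS_LOADED_PER_DAY = 2_000_000_000
SECONDS_PER_HOUR = 3600


def sustained_rows_per_second(rows_per_day: int, load_window_hours: float = 24.0) -> float:
    """Average rows per second needed to finish the daily load within the window."""
    return rows_per_day / (load_window_hours * SECONDS_PER_HOUR)


if __name__ == "__main__":
    around_the_clock = sustained_rows_per_second(ROWS_LOADED_PER_DAY)
    overnight_batch = sustained_rows_per_second(ROWS_LOADED_PER_DAY, load_window_hours=8)
    print(f"24-hour window: {around_the_clock:,.0f} rows/sec")
    print(f"8-hour window (assumed): {overnight_batch:,.0f} rows/sec")
```

Spread evenly across a full day, the load works out to roughly 23,000 rows per second; squeeze it into a shorter batch window and the required rate climbs in proportion.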
"We had a one person working with the data full time, but depending on the complexity of the queries, it took anywhere from half a data to two days to get the data out," explained CTO Ola Uden.
TradeDoubler was able to load, rebuild and query the Infobright database all within the same day. The gains were due partly to the column-store compression (said to be 30 times that of a relational database) and partly to the fact that Infobright indexes data automatically and doesn't need the partitioning and tuning required to make relational databases perform. (Infobright says its database requires up to 90% less admin work than Oracle, Microsoft SQL Server or IBM DB2 and is half the cost in terms of licensing and storage requirements.)
TradeDoubler's example is one of big-data loading and turnover rather than sheer scale, and it's a common workload requirement in Web clickstream analysis. TradeDoubler could have easily built a larger-scale, higher-performance Oracle-based warehouse (even before the fall 2009 introduction of Oracle Exadata V2), but Uden said the costs would have been much higher than its Infobright investment.
Reliance Communications is one of India's largest and fastest-growing telcos. Back in 2008 it was adding some 1.5 million customers per month. The resulting flood of data was maxing out a 50-terabyte Oracle data warehouse, so in early 2007, the company decided to offload a call-data-record (CDR) data mart application. It chose a 60-terabyte Greenplum MPP database deployment. In 2008, Reliance added a 120-terabyte configuration for a total of 180 terabytes of capacity.
Indeed, storage capacity was more of a priority than speed for this particular application. Reliance needed to retain CDRs for compliance reasons. In a police investigation, for instance, law enforcement officials might ask Reliance for a complete record of all calls a particular subscriber made or received during a certain time period. The Indian government requires CDRs to be retained for 13 months, and with nearly one billion new calls made each day, the demands were massive.
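A rough sketch of the retention math shows why. The one billion calls per day, the 13-month retention period and the 180 terabytes of total capacity are the figures cited above; treating "nearly one billion" as exactly one billion, using decimal terabytes, and dedicating all 180 terabytes to CDRs are simplifying assumptions of mine, and the function names are hypothetical.

```python
# Back-of-the-envelope retention estimate; an upper bound, not Reliance's own sizing.
# Assumptions (not from the article): exactly 1e9 calls/day, 1 TB = 1e12 bytes,
# and the full 180 TB available for CDR storage after compression.

CALLS_PER_DAY = 1_000_000_000
RETENTION_MONTHS = 13
TOTAL_CAPACITY_TB = 180
BYTES_PER_TB = 10**12
DAYS_PER_MONTH = 365 / 12  # average month length


def records_retained(calls_per_day: int, months: int) -> float:
    """Approximate number of call-data records held at any one time."""
    return calls_per_day * months * DAYS_PER_MONTH


def bytes_per_record_budget(capacity_tb: float, records: float) -> float:
    """Storage available per record if the whole capacity held only CDRs."""
    return capacity_tb * BYTES_PER_TB / records


if __name__ == "__main__":
    records = records_retained(CALLS_PER_DAY, RETENTION_MONTHS)
    budget = bytes_per_record_budget(TOTAL_CAPACITY_TB, records)
    print(f"Records retained: about {records / 1e9:,.0f} billion")
    print(f"Per-record budget: about {budget:,.0f} bytes")
```

Under those assumptions the mart holds on the order of 400 billion records, which leaves a budget of a few hundred bytes per call record.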
"Access to CDRs is not very frequent, but we needed fast loading and fast retrieval for large amounts of data," said Raj Joshi, vice president of decision support systems.
Speed wasn't really the point of the deployment, but queries that previously took two to three hours were returned in 30 minutes on the Greenplum platform. Joshi said the cost savings over a conventional data warehouse were also "substantial."
Storage-hungry customers such as Reliance surely figured in EMC's recently announced plan to acquire Greenplum. Competitors have also taken note of the niche; in late 2008, Teradata added the Teradata 1550 Extreme Data Appliance, aimed at telco CDRs, Web clickstream and other extreme-scale applications involving up to 50 petabytes of information. This isn't an application you can affordably address with a one-size-fits-all box.
TradeDoubler and Reliance offer just two examples of the diverse needs that had customers looking for a better way back when independents offered the only alternatives to conventional data warehouse deployments. In fact, if you're looking to upgrade, you should consider at least six dimensions of scalability: data size, number of users, data complexity, query volume, data latency and query complexity.
All the better if you can read about or talk to customers who tackled a deployment that's similar to the one you are contemplating. Which brings me to the list of real-world case examples I plan to share in my upcoming "managing big data" article:
I have yet to hear from Oracle about a customer willing to talk about a successful data warehousing deployment of Exadata V2. Oracle had a lot to say to Bob Evans about customers, performance and Exadata V2's unique ability to address both transactional (OLTP) and analytic (data warehousing) needs. By one Gartner estimate, OLTP accounts for 60% to 70% of database license revenue, but more than 75% of the growth is attributable to data warehousing. That's because warehouses are where data is retained for analysis rather than regularly purged for keep-the-lights-on processing.
That's really my point. The early days for Teradata were in the 1990s, and then it got a competitive wake-up call in the middle of this decade. The early days for Netezza and Greenplum were in 2004 and 2005. The early days for Aster Data were in 2007. Right now we're seeing the early days of things like Hadoop and NoSQL alternatives that may change the data-analysis market even more dramatically than what we've seen over the last decade.
There's an old saying from the Wild West that the pioneers get the arrows and the settlers get the land. Maybe that analogy will ultimately apply to Oracle and Microsoft, both now settling into the scale-out data warehousing space (as well as scale-out OLTP, in Oracle's case). For now, I'm looking for proven production deployments that will show you what's possible within your enterprise.