Apr 03, 2013 (08:04 AM EDT)
Inside IBM's Big Data, Hadoop Moves
IBM is making a series of high-profile analytics announcements on Wednesday from its Almaden Research Center in San Jose, Calif., a fitting location at the epicenter of big data activity. The themes will sound familiar because plenty of competitors have made similar announcements in recent years, but IBM contends it's setting new standards of performance.
There are two major announcements. The first is BLU Acceleration, a combination of compression, in-memory analysis and vector-processing techniques that IBM says will drive huge improvements in relational database performance. BLU is set for release in the second quarter, and it will benefit DB2 first and foremost. IBM is also bringing the technology to the Informix database with this release and, according to sources, to the Netezza database in future releases.
The second announcement is IBM PureData System for Hadoop, an appliance-based platform that customers will be able to scale up by simply adding more boxes. The hardware will run an upgraded version of IBM's InfoSphere BigInsights Hadoop distribution that's also being announced on Wednesday.
In typical IBM fashion, the announcements are loaded with bravado about the billions the company has spent acquiring software companies in recent years, but there's plenty of substance behind the buzzwords.
With BLU Acceleration, IBM is taking advantage of the same breakthroughs in low-cost memory and processing power that SAP has been talking about in connection with its Hana in-memory database. BLU is not IBM's answer to Hana, however. The focus for now is strictly on analytics and does not, as yet, address transaction processing, so it's more of a competitive response -- bar raising, IBM contends -- to the likes of Teradata, HP Vertica and EMC Greenplum, and data warehousing uses of Oracle Exadata and Oracle Exalytics.
The techniques employed by BLU include hybrid row and columnar storage, advanced compression, data skipping, vector processing and leveraging of increasingly affordable memory to speed processing. We've seen all these techniques before -- mixed columnar and row from Teradata, HP Vertica and EMC Greenplum, data skipping from Infobright and IBM's Netezza database, vector processing from Actian (formerly Ingres), and aggressive use of memory from multiple vendors -- but IBM is alone in putting all of these techniques together.
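Data skipping, for one, typically relies on lightweight per-block metadata -- the minimum and maximum value in each block of a column -- so a scan can ignore blocks that cannot possibly satisfy a query's predicate. IBM hasn't published BLU's internals, so the sketch below is a generic illustration of the technique, not IBM's implementation; all names are hypothetical.

```python
# Generic data-skipping sketch: per-block min/max "synopsis" metadata
# lets a scan ignore blocks that cannot match a predicate.
# Illustrative only -- not IBM's BLU implementation.

def build_synopses(values, block_size):
    """Split a column into blocks and record each block's min/max."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b), b) for b in blocks]

def scan_greater_than(synopses, threshold):
    """Return values above threshold, counting blocks skipped outright."""
    matches, skipped = [], 0
    for lo, hi, block in synopses:
        if hi <= threshold:  # no value in this block can possibly match
            skipped += 1
            continue
        matches.extend(v for v in block if v > threshold)
    return matches, skipped

# Naturally ordered data (e.g., timestamps) skips especially well.
column = list(range(1000))
synopses = build_synopses(column, block_size=100)
matches, skipped = scan_greater_than(synopses, 950)
print(len(matches), skipped)  # 49 matching values; 9 of 10 blocks never read
```

The payoff is that selective queries touch only a fraction of the data, which is exactly the behavior IBM describes below.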
With BLU, IBM says it will be able to crunch 10 terabytes down to 1 terabyte; bring that 1 terabyte into memory; and effectively crunch it again down to 10 gigabytes. With the data-skipping technology, the database can then focus on the 1 gigabyte that matters to a query without wading through repeating or irrelevant data. Your mileage may vary, as the saying goes, but IBM reports that BLU improves performance by 8X to 25X over the last DB2 release (10.1).
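An order-of-magnitude compression ratio is plausible for typical warehouse data because columns are full of repeated values. A simple run-length encoding of a low-cardinality column shows why; note this is a generic illustration, and BLU's own encoding is more sophisticated (IBM says it can evaluate queries on still-compressed data).

```python
# Run-length encoding sketch: low-cardinality columns compress dramatically.
# Illustrative only -- not BLU's actual encoding scheme.

def rle_encode(column):
    """Collapse runs of repeated values into [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A sorted status column: 100,000 rows, only 4 distinct values.
column = ["new"] * 40000 + ["open"] * 30000 + ["closed"] * 25000 + ["void"] * 5000
runs = rle_encode(column)
print(len(column), len(runs))  # 100,000 rows collapse to 4 runs
```

Real tables mix high- and low-cardinality columns, so 10-to-1 overall is a rough claim to be verified in proofs of concept, not a guarantee.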
All these enhancements will undoubtedly reassure existing DB2 customers that the database roadmap is keeping up with state-of-the-art features. "What's novel in IBM's approach is that it's doing acceleration in several ways," analyst Robin Bloor of Bloor Group told InformationWeek. Whether it breaks new performance benchmarks and changes market share dynamics remains to be seen.
"The 25X figure is very aggressive, but when people are making purchase decisions, they do proof-of-concept benchmarks and put one database up against another," said Bloor. "You don't actually know what kind of performance you'll get until you've done the comparisons."
What IBM has yet to address with BLU, and what will likely require more extensive use of memory, is handling transaction processing alongside analytics. That's what SAP is doing with Hana, it's what Microsoft has announced it will do with Project Hekaton (expected in 2015) and it's what Oracle is rumored to be working on for a future Oracle Database release.
"We do see an evolution of this technology beyond reporting and analytic workloads, but I can't comment on a timeframe for that," Tim Vincent, IBM fellow, VP and chief technology officer, told InformationWeek. If IBM follows the same pattern it took in introducing the BLU technologies, it might wait to see what others do before attempting to do them one better.
IBM was ahead of both Oracle and Microsoft in embracing Hadoop, and it took a different path by introducing its own basic and enterprise distributions of its BigInsights Hadoop software in May 2011. Oracle and Microsoft entered the market in 2012 through partnerships with Cloudera and Hortonworks, respectively.
Now that IBM is announcing its own appliance, the PureData System for Hadoop, the in-house path will give it the advantage of offering a "100% IBM solution with our software distribution and our hardware," said Nancy Kopp, IBM's director of big data, in an interview with InformationWeek.
There will be two key differentiators from the Hadoop appliances that are either on the market (from EMC and Oracle) or in the works (from Teradata), Kopp said. "We saw that there's a key use case emerging for Hadoop as an archival system, so we've built archive capabilities right into the appliance," said Kopp. This will enable customers to offload data from warehouses for cold storage or archival compliance. The data is still active, however, so customers can retrieve it and restore it to faster analytic databases.
The second differentiator, according to Kopp, is a family of analytic accelerators starting with three: one for social data, one for text analytics and one for machine data. "The accelerators will make it easier to develop applications that take advantage of these data types," said Kopp, and she added that new accelerators will join the family in the future.
Beating the likes of Oracle and Microsoft on Hadoop is one thing. The question now is whether these giants will be the tortoises that ultimately finish ahead of the big data hares like Cloudera and MapR. Cloudera, in particular, is way out ahead, with hundreds of Hadoop deployments at large enterprises. By contrast, you seldom hear about BigInsights, and IBM refuses to disclose the number of customers running the software. At least one customer, MoneyGram, was set to participate in Wednesday's announcement.
IBM has addressed key Hadoop drawbacks that other distributors have also tackled, including reliability and availability concerns tied to Hadoop's NameNode and the limited and slow SQL query capabilities of Apache Hive. On this last note, the upgraded BigInsights distribution announced Wednesday and set for release in the second quarter will include BigSQL, IBM's answer to SQL-on-Hadoop analysis.
EMC is set to release its remedy for Hive shortcomings with its Pivotal release later this month, but it looks like IBM will have BigSQL ahead of Cloudera's Impala, Hortonworks' Stinger and MapR's Drill initiatives.
As to the tortoise-and-hare question, Bloor says vendors that control the hardware will have advantages.
"My money would be on the boys with the iron, because they can look at the big picture, and as long as they get their pricing correct, then they're probably going to be able to do a better job than vendors that are limited to software," he said.
That suggests that IBM -- as well as EMC/VMware, HP, Intel, Oracle and no doubt others to come -- will have advantages. Which tortoise will win? We'll have to wait years to find out.