Read the Original Article at http://www.informationweek.com/news/showArticle.jhtml?articleID=240144124
There are leaders and there are followers in the big data movement. This collection comprises a baker's dozen leaders. Some, like Amazon, Cloudera and 10Gen, were there at the dawn of the Hadoop and NoSQL movements. Others, like Hortonworks and Platfora, are newcomers, but draw on deep experience.
The three big themes you'll find in this collection are Hadoop maturation, NoSQL innovation and analytic discovery. The Hadoop crowd includes Cloudera, Hortonworks and MapR, each of which is focused entirely on bringing this big data platform to a broader base of users by improving reliability, manageability and performance. Cloudera and Hortonworks are improving access to data with their Impala and HCatalog initiatives, respectively, while MapR's latest push is improving HBase performance.
The NoSQL set is led by 10Gen, Amazon, Couchbase, DataStax and Neo Technology. These are the developers and support providers behind MongoDB, DynamoDB, Couchbase, Cassandra and Neo4j, respectively, which are the leading document, cloud, key-value, column and graph databases.
Big data analytic discovery is still in the process of being invented, and the leaders here include Datameer, Hadapt, Karmasphere, Platfora and Splunk. The first four have competing visions of how we'll analyze data in Hadoop, while the last specializes in machine-data analysis.
What you won't find here are old-guard vendors from the relational database world. Sure, some of those big-name companies have been fast followers. Some even have software distributions and have added important capabilities. But are their hearts really in it? In some cases, you get the sense that their efforts are window dressing. There are vested interests -- namely license revenue -- in sticking with the status quo, so you just don't see them out there aggressively selling something that just might displace their cash cows. In other cases, their ubiquitous connectors to Hadoop seem like desperate ploys for some big data cachet.
For many users, the key issues include flexibility, speed and ease of use. And it isn't clear that any single product or service can offer all of those capabilities at the moment.
We're still in the very early days of the big data movement, and as the saying goes, the pioneers might get the arrows while the settlers get the land. In our eyes, first movers like Amazon and Cloudera already look like settlers, and more than a few others on this list seem to have solid foundations in place. As we've seen before, acquisitions could change the big data landscape very quickly. But as of now, these are 13 big data pioneers that we're keeping our eyes on in 2013.
10Gen is the developer and commercial support provider behind open source MongoDB. Among six NoSQL databases highlighted in this roundup (along with DynamoDB, Cassandra, HBase, Couchbase and Neo4j), MongoDB is distinguished as the leading document-oriented database. As such it can handle semi-structured information encoded in JSON (JavaScript Object Notation), XML or other document formats. The big attraction is flexibility, speed and ease of use, as you can quickly embrace new data without the rigid schemas and data transformations required by relational databases.
MongoDB is not the scalability champion of the NoSQL set, but 10Gen is working on that. In 2012 it introduced MongoDB 2.2, which added a real-time aggregation framework, new sharding and replication features for multi-data center deployments, and improved performance and database concurrency for high-scale deployments. The data aggregation framework fills an analytics void by letting users directly query data within MongoDB without using complicated batch-oriented MapReduce jobs. Couchbase plans to step up competition with MongoDB by way of JSON support, but we're sure 10Gen and the MongoDB community will step up to improve scalability and performance in 2013.
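To give a rough sense of what an aggregation query does, here is a minimal sketch in plain Python. It is not MongoDB driver code: the `orders` documents, their field names and the helper functions `match`, `group_sum` and `shipped_totals` are all hypothetical, and they only imitate the filter-then-group pattern that an aggregation pipeline expresses with stages such as `$match` and `$group`.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are illustrative only.
orders = [
    {"customer": "ann", "status": "shipped", "amount": 30},
    {"customer": "bob", "status": "shipped", "amount": 20},
    {"customer": "ann", "status": "pending", "amount": 50},
    {"customer": "ann", "status": "shipped", "amount": 15},
]


def match(docs, criteria):
    """Keep only documents whose fields equal every value in criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def group_sum(docs, key, field):
    """Group documents by `key` and sum `field` within each group."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key]] += d[field]
    return dict(totals)


def shipped_totals(docs):
    """Filter first, then group: the same shape as a two-stage pipeline."""
    return group_sum(match(docs, {"status": "shipped"}), "customer", "amount")


print(shipped_totals(orders))
```

The point of the sketch is the shape of the work: a filter followed by a grouped sum, stated in a few lines, rather than a hand-written pair of map and reduce functions submitted as a batch job.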
Amazon is about as big a big data practitioner as you can get. It's also the leading big data services provider. For starters, it introduced Elastic MapReduce (EMR) more than three years ago. Based on Hadoop, EMR isn't just a service for MapReduce sandboxes; it's being used for day-to-day high-scale production data processing by businesses including Ticketmaster and DNA researcher Ion Flux.
Amazon Web Services upped the big data ante in 2012 with two new services: Amazon DynamoDB, a NoSQL database service, and Amazon Redshift, a scalable data warehousing service now in preview and set for release early next year.
DynamoDB, the service, is based on Dynamo, the NoSQL database that Amazon developed and deployed in 2007 to run big parts of its massive consumer website. Needless to say, it's proven at high scale. Redshift is not yet generally available, but Amazon is promising performance ten times faster than conventional relational databases at one-tenth the cost of on-premises data warehouses. With costs as low as $1,000 per terabyte, per year, there's no doubt Redshift will see adoption.
These three services are cornerstones for exploiting big data, and don't forget Amazon's scalable S3 storage, EC2 compute capacity and myriad integration and connection options for corporate data centers. In short, Amazon has been a big data pioneer, and its services appeal to more than just startups, SMBs and Internet businesses.
Cloudera is the #1 provider of Hadoop software, training and commercial support. From this position of strength, Cloudera has sought to advance the manageability, reliability and usability of the platform.
During 2012, the discussion turned from convincing the broad corporate market that Hadoop is a viable platform to convincing people that they can gain value from the masses of data on a cluster. But to do that, we'll need to get past one of Hadoop's biggest flaws: the slow, batch-oriented nature of MapReduce processing. Tackling the problem head on, Cloudera has introduced Impala, an interactive-speed SQL query engine that runs on the existing Hadoop infrastructure. Two years in development and now in beta, Impala promises to make all the data in the Hadoop Distributed File System (HDFS) and Apache HBase database tables accessible for real-time querying. Unlike Apache Hive, which offers a degree of SQL querying of Hadoop, Impala is not dependent on MapReduce processing, so it should be much faster.
There's a lot riding on Impala. What's not yet clear is whether it will mostly work with conventional relational tools or whether it will cut many of them out of the picture. Thus, all eyes will be on Cloudera in 2013.
A top contender in the NoSQL movement, Couchbase is a key-value store that is chosen for its scalability, reliability and high performance. As such, it's used by Internet giants (Orbitz), gaming companies (Zynga), and a growing flock of brick-and-mortar companies (Starbucks). These and other customers need to scale up much more quickly and affordably than is possible with conventional relational databases. Couchbase is the developer and commercial support provider behind the open-source database of the same name.
Key-value stores tend to be simple, offering little beyond record storage. With Couchbase 2.0, set for release in mid-December, Couchbase is looking to bridge the gap between key-value store and document database, the latter being MongoDB's domain. Couchbase 2.0 adds support for storing JSON (JavaScript Object Notation) documents, and it adds tools to build indexes and support querying. These basics may not wow MongoDB fans used to myriad developer-friendly features, but Couchbase thinks scalability and performance will win the day. Look for a pitched battle in 2013.
Having lots of data is one thing. Storing it all in one scalable place, like Hadoop, is better. But the real value in big data is being able to structure, explore and make use of that data without delay. That's where Datameer comes in.
Datameer's platform for analytics on Hadoop provides modules for data integration (with relational databases, mainframes, social network sources and so on), a spreadsheet-style data analysis environment and a development-and-authoring environment for creating dashboards and data visualizations. The big draw is the spreadsheet-driven data analysis environment, which provides more than 200 analytic functions, from simple joins to predictive analytics.
Datameer customer Sears Holdings reports that it can develop in three days interactive reports that would take six to 12 weeks to develop using conventional OLAP tools. What's more, the spreadsheet-style interface gives business users a point-and-click tool for analyzing data within Hadoop. Through a recent partnership with Workday, Datameer is poised to embed its capabilities into that cloud vendor's enterprise applications. We'll be watching for breakthrough results.
Apache Cassandra is an open-source, column-group style NoSQL database that was developed by Facebook and inspired by Amazon's Dynamo database. DataStax is a software and commercial support provider that can implement Cassandra as a stand-alone database, in conjunction with Hadoop (on the same infrastructure) or with Solr, which offers full-text-search capabilities from Apache Lucene.
The combination of Cassandra and Hadoop on the same cluster is attractive. There are some performance tradeoffs in the bargain, but Cassandra as implemented by DataStax offers a few scalable and cost-effective options. A big appeal with this NoSQL database is CQL (Cassandra Query Language) and the JDBC driver for CQL, which provide SQL-like querying and ODBC-like data access, respectively. When Cassandra is implemented in combination with Hadoop, you can also use MapReduce, Hive, Pig and Sqoop. Use of Solr is separate from Hadoop, but capabilities include full-text search, hit highlighting, faceted search, and geospatial search.
The two biggest threats to Cassandra, and thus to DataStax, are HBase (now used by Facebook) and DynamoDB, Amazon's cloud-based service based on Dynamo. The bigger threat appears to be HBase, as the entire Hadoop community is working on maturing that Hadoop component into a stable, high-performance, easy-to-manage NoSQL database that's available as part of the same platform. Success will likely take some of the wind out of Cassandra's sails (and out of DataStax's sales). For now, HBase is still perceived as green while DataStax customers like Constant Contact, Morningstar and Netflix attest to stability, scalability and performance on Cassandra today.
Hadapt was hip to the need for business intelligence and analytics on top of Hadoop before its first round of funding in early 2011. Hive, the Apache data warehousing component that runs on top of Hadoop, relies on slow, batch-oriented MapReduce processing. Hadapt works around that delay by adding a hybrid storage layer to Hadoop that provides relational data access. From there you can do SQL-based analysis of massive data sets using SQL-like Hadapt Interactive Query. The software automatically splits query execution between the Hadoop and relational database layers, delivering the speed of relational tools with the scalability of Hadoop. There's also a development kit for creating custom analytics, and you can work with popular, relational-world tools such as Tableau software.
Hadapt is in good company, with Cloudera (Impala), Datameer, Karmasphere, Platfora and others all working on various ways to meet the same analytics-on-Hadoop challenge. It remains to be seen which of these vendors will be a breakout success in 2013.
Hortonworks is the youngest provider of Hadoop software and commercial support, but it's an old hand when it comes to working with the platform. The company is a 2011 spinoff of Yahoo, which remains one of the world's largest users of Hadoop. In fact, Hadoop was essentially invented at Yahoo, and Hortonworks retained a team of nearly 50 of its earliest and most prolific contributors to Hadoop.
Hortonworks released its first product, Hortonworks Data Platform (HDP) 1.0, in June. Unlike those from rivals Cloudera and MapR, Hortonworks' distribution consists entirely of open source Apache Hadoop software. And while Hortonworks' rivals claim higher performance (MapR) or are shipping components that are not yet sanctified by Apache (Cloudera), Hortonworks says its platform is proven and enterprise-ready.
Hortonworks isn't leaving it up to others to innovate. The company led the development of the HCatalog table management service, which is aimed at the problem of doing analytics against the data in Hadoop. Teradata is an early adopter of HCatalog and a major partner for Hortonworks. Microsoft is another important partner, and it tapped Hortonworks to create a version of Hadoop (since contributed to open source) that runs on Windows. With partners like these and its influential team of contributors, there's little doubt Hortonworks will be a big part of Hadoop's future.
Karmasphere provides a reporting, analysis and data-visualization platform for Hadoop. The company has been helping data professionals mine and analyze Web, mobile, sensor and social media data in Hadoop since 2010. The software also is available as a service on Amazon Web Services for use in conjunction with Elastic MapReduce.
Karmasphere uses Hive, the data warehousing component built on top of Hadoop. The company concedes that Hive has its flaws, like lack of speed tied to MapReduce batch processing. But Karmasphere is integrating its software with the Cloudera Impala real-time query framework as one way around those flaws. "Impala dramatically improves speed-to-insight by enabling users to perform real-time, interactive analysis directly on source data stored in Hadoop," stated Karmasphere in an October announcement about the partnership.
We'll see how quickly Impala will mature from private beta testing to proven production use, but if it delivers as promised, Karmasphere and others will see a huge leap forward in low-latency big data analysis.
MapR's guiding principles are practicality and performance, so it didn't think twice about chucking the Hadoop Distributed File System out of its Hadoop software distribution. HDFS had (and still has, MapR argues) reliability and availability flaws, so MapR uses the proven Network File System (NFS) instead. In the bargain, MapR claims to get "twice the speed with half the required hardware." The NFS choice also enabled MapR to support near-real-time data streaming using messaging software from Informatica. MapR competitors Cloudera and Hortonworks can't stream data because HDFS is an append-only system.
MapR's latest quest for better performance (regardless of open source consequences) is the M7 software distribution, which the vendor says delivers high-performance Hadoop and HBase in one deployment. Many users have high hopes for HBase because it's the NoSQL database native to the Apache Hadoop platform (promising database access to all the data on Hadoop). But HBase is immature and still suffers from flaws, including instability and cumbersome administration.
M7 delivers two times faster performance than HBase running on standard Hadoop architectures, says MapR, because the distribution does away with region servers, table splits and merges and data compaction steps. MapR also uses its proprietary infrastructure to support snapshotting, high availability and system recovery for HBase.
If you're an open source purist swayed by arguments about system portability, MapR may not be the vendor for you. But we've talked to high-scale customers who have chosen MapR for better performance. Want to give it a try? MapR is available both on Amazon Web Services and the Google Compute Engine.
"Social applications and graph databases go together like peanut butter and jelly," says graph database consultant Max De Marzi. That's the big reason Neo4j, the open source graph database developed and supported by Neo Technology, has a unique place in the NoSQL world.
Neo4j is used to model and query highly complex interconnected networks with an ease that's not possible with relational databases or other NoSQL products. Other NoSQL databases may excel in dealing with ever-changing data, but graph databases shine in dealing with ever-evolving relationships. In social network applications you can model and query the ever-changing social graph. In IT and telecom network scenarios you can quickly resolve secure access challenges. In master data management applications you can see changing relationships among data. And in recommendation-engine apps you can figure out what people want before they know they want it.
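As a rough illustration of the recommendation-engine case, the sketch below ranks "people you may know" by walking two hops across a small social graph. It is plain Python over an in-memory dictionary, not Neo4j's API or query language, and the `follows` data and the `recommend` function are hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical social graph: each person maps to the set of people they are connected to.
follows = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def recommend(graph, person):
    """Rank people two hops away by the number of mutual connections."""
    direct = graph.get(person, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and anyone already connected.
            if candidate != person and candidate not in direct:
                counts[candidate] += 1
    # Most mutual connections first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


print(recommend(follows, "ann"))
```

In a relational schema the same question typically means joining a connections table to itself once per hop, which is one reason relationship-heavy queries are often cited as a natural fit for graph databases.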
Neo4j is a general-purpose graph database that can handle transaction processing or analytics, and it's compatible with leading development platforms including Java, Ruby, Python, Groovy and others. Neo Technology is working on scale-out architecture, but even without that, Neo4j can manage and reveal billions of relationships. As social and network applications multiply, Neo Technology is in a prime position to manage the future.
Platfora is one of those startups offering a "new-breed" analytics platform built to run on top of Hadoop. The software creates a data catalog that enumerates the data sets available on a Hadoop Distributed File System. When you want to do an analysis, you use a shopping cart-metaphor interface to pick and choose the dimensions of data you want to explore. Behind the scenes, Platfora's software generates and executes the MapReduce jobs required to bring all the requested data into a "data lens."
Once the data lens is ready -- a process that takes a few hours, according to Platfora -- business users can create and explore intuitive, interactive data visualizations. You get sub-second query response times because the data lens runs in memory. Need to add new data types or change dimensions? That takes minutes or hours, says Platfora, versus the days or weeks it might take to rebuild a conventional data warehouse.
Platfora is the newest of new-breed big data analysis companies, but it has a who's who list of venture capital backers and an experienced management team. In short, this is one to watch in 2013.
Splunk got its start offering an IT tool designed to help data center managers spot and solve problems with servers, messaging queues, websites and other systems. But as the big data trend started gathering steam, Splunk recognized that its technology could also answer all sorts of questions tied to high-scale machine data (a big factor in the company's successful 2012 IPO).
Splunk employs a unique language and its core tools are geared to IT types, but those power users can set up metrics and dashboards that business users can tap to better understand e-commerce traffic, search results, ad campaign effectiveness and other machine-data-related business conditions.
There's an overlap with Hadoop in that Splunk has its own proprietary machine-data repository, but database expert Curt Monash says Splunk is working on ways to work with Hadoop that go beyond the two-way integrations currently available. That would presumably leave Splunk free to pursue analytics while diminishing the need for redundant infrastructure. We'll be watching for that important release.