TechWeb

Will Microsoft's Hadoop Bring Big Data To Masses?

Oct 30, 2012 (09:10 AM EDT)

Read the Original Article at http://www.informationweek.com/news/showArticle.jhtml?articleID=240012533


Microsoft last week announced HDInsight Server, a version of the Hadoop big data analytics framework designed to run in simpler, less expensive environments than Hadoop usually requires.

Microsoft characterized the product, now in beta, as an effort to simplify big data implementations and enable IT managers to run Hadoop on Windows machines, without having to jump through the usual hoops.

Apache.org asserts that Hadoop can run on Win32-based machines but recommends doing so only for development, not as a production-quality system.

The on-premises version of HDInsight is fully supported and certified by Microsoft and partners as a production platform. It includes an automated install-and-configuration process and links that enable business users to download subsets of Hadoop data sets so they can run their own scenarios using Excel, PowerPivot for Excel and PowerView.

Microsoft says the HDInsight preview runs on its Windows Server and Azure platform-as-a-service cloud offering. Branded Windows Azure HDInsight Service and Microsoft HDInsight Server for Windows, the Windows versions "dramatically" lower the cost and complexity of deploying Hadoop, according to Microsoft technical fellow David Campbell.

The Hadoop framework is designed to run on implementations as small as one server, but is typically configured to run on a whole server cluster, most commonly on Linux.

Even installed on just one node, Hadoop must be configured as a cluster within which Hadoop's own name servers and resource managers coordinate the integration of a series of Hadoop modules that manage, schedule, process, query, analyze and publish both data and analytics.

Integration with Microsoft's System Center 2012 is designed to simplify management of both the Windows Server and the Azure versions of HDInsight by allowing IT managers to tweak or control the applications using Microsoft's familiar management tools.

Though HDInsight is fully compatible with Apache Hadoop, according to Microsoft, it is designed to be more adaptable because it can be run either on-premises or in the cloud -- or in both places with connections secured through Active Directory that allow on-premises and cloud versions to exchange data and/or queries.

In addition, running HDInsight on Windows Azure provides the same dynamic resource configuration as every other cloud service, so admins can install it as if it were running on a single server, then increase RAM, storage, CPU cycles and other resources to cover peaks in demand. They can also expand a virtual single-node version of HDInsight into a multi-node, clustered version without having to reinstall or migrate the installation to a different set of physical servers.

Both versions also ship with links to the U.S. Census Bureau, United Nations, Dun & Bradstreet and other data sources via the online Windows Azure Marketplace. They also allow data-crunching business users to download subsets of Hadoop data or the results of previous queries to refine, reprint or republish the results using Excel.

Both versions can also be linked with installations of SQL Server to trade data or run cross-queries on both systems, using connectors from Hortonworks, the Hadoop specialist that handled most of the porting and integration of Hadoop onto Windows.




Microsoft has "spent tremendous engineering time" creating a smooth integration between Hadoop and existing Windows security and management capabilities within Active Directory and Microsoft System Center, as well as the easier access possible using SQL Server and Excel, according to Doug Leland, Microsoft general manager of SQL Server marketing, in an interview with InformationWeek last week.

HDInsight can also run within virtual Windows Servers using Microsoft's Hyper-V hypervisor. Microsoft is trying to make that option simpler, too, Leland said, by developing templates that would act as pre-configured instances of HDInsight Server. These could be spun up or shut down at will, giving on-premises deployments the same ability to expand or contract a big data cluster that cloud services such as Azure provide.

Hortonworks Offers Similar Hadoop Features

Hortonworks sells a version of Apache Hadoop that already offers many of the advantages touted for HDInsight, including integrated and automated installation, server-management and data-integration modules, and an adaptation of the Apache HCatalog, which allows data to be shared among different Hadoop installations. (See the Hortonworks Data Platform data sheet for more details.)

Though portions of the Hadoop framework are perfectly capable of managing traditional relational data, Microsoft made a point of positioning SQL Server as its preferred database management system for structured data, while HDInsight big data installations manage unstructured data and federations or mergers of multiple external data sets into a larger, big data platform.

"We need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood," according to a blog post by Shaun Connolly, Hortonworks' VP of corporate strategy. "Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange and manage this multi-structured data."

The need to query both structured and unstructured data -- and the question of what it actually means to integrate existing data management systems with Hadoop -- are ongoing problems for big data-enamored corporations, according to Forrester analyst Boris Evelson, writing before the HDInsight announcement.

The connection between Hadoop data and Excel is enabled via ODBC or Sqoop connectors that can extract data from Hadoop so it can be imported into SQL Server. Though Hortonworks' materials focus on the value of connecting big data to traditional SQL Server databases, Microsoft's announcements make clear it views SQL Server 2012 as its primary data-management option both in the cloud and on premises.

"The next frontier is all about uniting the power of the cloud with the power of data to gain insights that simply weren't possible even just a few years ago," Microsoft VP Ted Kummert said in a Microsoft press release. "Microsoft is committed to making this possible for every organization, and it begins with SQL Server 2012."
