Read the Original Article at http://www.informationweek.com/news/showArticle.jhtml?articleID=240148458
There's a lot more to Microsoft's big data strategy than a Hadoop partnership with Hortonworks. In fact, Hadoop is just the beginning of what Microsoft Technical Fellow Dave Campbell describes as "an information production line."
You don't hear much about Microsoft in the big data market, but the company wants the world to know it has big plans. In fact, it's living big data through its Bing search engine, Office 365 services and Azure cloud platform. Those businesses have helped Microsoft build deep expertise around analytics and machine learning that the company wants to bring to big data practitioners. It also has a data market on Azure, unsung in-database analytics capabilities and a High-Performance Computing platform that Campbell says will help customers speed the last mile of deep analysis.
Microsoft is mounting a PR offensive this week to raise its big data profile by way of surveys and case studies. But InformationWeek wanted a closer look at what Microsoft has in the works. In this Q&A interview, Campbell, VP of product development for SQL Server, details Microsoft's big data thinking and what it's working on behind the scenes.
InformationWeek: What's your background and how long have you been in your current role at Microsoft?
Dave Campbell: I've been looking at our whole big data strategy for four years now, and I've been in the database industry for close to 25 years. I can unequivocally state that this is the most exciting period in my career. Seven years ago it was hard to get students excited about databases because it seemed like a solved problem. Then all hell broke loose. As we see it, it's about two opportunities for business. First there's time to insight -- how you can quickly validate or falsify a hypothesis about something. And then there's return on accessible data, with the key term being "accessible."
[ Want more on Redmond's version of Hadoop? Read Microsoft Releases Hadoop On Windows. ]
About two years ago I was talking to executives at one of the U.S. airlines that was in the process of being acquired by another airline. The enterprise architect I was talking to put his head in his hands and said, "Our business is horrible. The airlines are running each other into the ground, and the customers just go to Orbitz or Expedia looking for the lowest price." Then he paused and said, "We've come to realize that the only way we're going to survive is to do a better job of yield management and pricing, do a better job of rescheduling the fleet after a big storm, do a better job of fuel price hedging, and do a better job of upsell than our competitors. That requires us to do new things with data that we don't know how to do."
Success or failure depended on the ability to get an increased return on accessible data. In this case the questions were, "Where are we going to get the fuel futures pricing data?" and "Where are we going to get the meteorological model to know the probability that Logan and JFK are going to be closed tomorrow morning?" I'll come back to that, but today you see so much evidence of data being external to organizations.
IW: What is Microsoft doing to help companies take advantage of external data?
Campbell: One of the things we're working on is this notion of a data market [on Windows Azure]. But it's not just about offering data sets; it's also about analytic models and other things. I'd characterize the last 15 years as being the era of the enterprise mega applications -- the SAPs, PeopleSofts and such. These apps have encouraged data silos. We've gone through several consolidation periods where we need to glue together multiple meaningful applications into suites, but this big data opportunity is way more horizontal. You want to be able to mash up data from your business processes, systems of record, external data, everything. It's not really about the applications and appliances. It's about information production.
I had a conversation recently with executives at a large national health organization and I asked them, "What kind of valuable questions would you want to answer with information that you don't think would belong in your data warehouse?" They looked at me sideways and finally offered that they had GPS telemetry data from their ambulances. Well, can you take data and turn it into patient response times? Can you correlate patient outcomes with those response times? Well then maybe you figure out how much heart attack survival rates improve with response time so you can optimize where you place your ambulances? That analysis might require data on population density and demographics to find the concentrations of people who are most likely to have heart attacks. Their eyes lit up as we started talking about the possibilities.
IW: How does that get back to Microsoft's services and components for big data?
Campbell: Our strategy is about doing a great job of making the information production process easy, helping you to mash up data in different forms and then bring it into the rest of our BI platform. It's about helping people to maximize their return on all accessible data. I pushed Microsoft as much as anyone to adopt Hadoop because it had become a brand, like Kleenex. RFPs said, "what is your Hadoop integration story," not "what is your big data integration story." If customers are going to have hundreds of terabytes or petabytes in Hadoop, that should be seen by us as potential value. But the business value is not at the Hadoop layer, it's how you can turn that into valuable information.
If you've followed our partnership with Hortonworks, you know that we think it's important to domesticate Hadoop by making it easier to install, deploy and manage. That means deploying it with Microsoft Virtual Machine Manager, managing it with Systems Center, and integrating it with Active Directory to make it easy for Microsoft customers. We're working with Hortonworks to do all this as close to the trunk in Apache as we can to make it available for all the distributions.
IW: Where do Microsoft SQL Server and Microsoft High-Performance Computing (HPC) come in (the latter being Microsoft's distributed, super-computing platform)?
Campbell: We're building out an information production line, and in most scenarios the large volumes of data -- the hundreds of terabytes or petabyte-scale data -- will be in Hadoop. That then gets reduced, usually through MapReduce jobs, down to several terabytes that can fit on a small cluster of fairly modest machines. You would then do the final phase of refinement on HPC.
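For readers unfamiliar with the reduction step Campbell describes, here is a minimal single-machine sketch of the MapReduce pattern in Python. It is an illustration only, not Microsoft's or Hadoop's implementation: the sample records, the airport codes and the function names (`map_phase`, `shuffle`, `reduce_phase`, `summarize`) are all assumptions made for this example. The point is the shape of the computation, in which many raw rows collapse into a small per-key summary that a later analysis stage can handle.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical raw records: (airport, delay_minutes). In a real deployment
# these would be a very large number of rows stored in Hadoop.
RAW_RECORDS = [
    ("DTW", 12), ("DTW", 45), ("BOS", 0),
    ("JFK", 30), ("BOS", 20), ("JFK", 10),
]

def map_phase(record):
    """Emit a (key, value) pair: airport -> (delay total, flight count)."""
    airport, delay = record
    return airport, (delay, 1)

def shuffle(pairs):
    """Group the mapped values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Combine all values for one key into a single (total, count) pair."""
    return reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)

def summarize(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    grouped = shuffle(map(map_phase, records))
    return {key: reduce_phase(values) for key, values in grouped.items()}

summary = summarize(RAW_RECORDS)
mean_delay = {k: total / count for k, (total, count) in summary.items()}
print(mean_delay)
```

The reduced output holds one small record per airport regardless of how many raw rows went in, which is why the later refinement stage can run on far less hardware than the raw data store.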
IW: Where does Microsoft handle in-database analytics (a technique now commonly used to speed predictive modeling work)?
Campbell: We have a set of foundation algorithms that we can run across several processing runtimes. Time and location are pretty much fundamental in this new world, so we're looking at running time-series analyses in-memory in the data warehouse or in HPC. We haven't said a lot about in-database analytics, but we have a lot of people using SQL Server's CLR [common language runtime] to define analytic functions and user-defined functions. Jim Gray [the late computing pioneer] introduced scientists and astronomers to database capabilities, so a lot of scientific work is being done on top of SQL Server using that .Net CLR capability.
IW: What about the non-scientific community and where does HPC fit in?
Campbell: HPC is not about doing 10,000-node clusters for national laboratories; it's about efficient information production for businesses. We have many data scientists in our online services division that have built a machine learning workbench that allows them to run experiments and then operationalize and deploy [model-based] applications. We're currently productizing that machine-learning workbench and incorporating elements of HPC.
We demonstrated one example at a supercomputing conference where we took historical meteorological data over decades and historical airline flight-delay data over decades and we built a predictive model that combined them. We could then ask, "On a clear day, what are the probabilities of delay for various airlines, airports and times?" Based on the historical meteorological data, we could also ask, "What do those probabilities look like when there's six inches of snow in Detroit?" If you then add meteorological forecast data, you can ask, "What are the chances I'll miss a 30-minute-window connection in Detroit next Tuesday?"
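The kind of question Campbell poses can be illustrated with a simple conditional-frequency estimate. The Python sketch below is not the model Microsoft demonstrated, which the interview does not describe in detail. It assumes a tiny, made-up history of flights already joined with snowfall readings, and the `delay_probability` function and its parameters are hypothetical names chosen for this example.

```python
# Hypothetical joined history: (airport, snow_inches, was_delayed).
# A real data set would span decades of weather and flight records.
FLIGHTS = [
    ("DTW", 0, False), ("DTW", 0, False), ("DTW", 0, True), ("DTW", 0, False),
    ("DTW", 6, True), ("DTW", 7, True), ("DTW", 6, False), ("DTW", 8, True),
]

def delay_probability(flights, airport, min_snow=0.0, max_snow=float("inf")):
    """Estimate P(delay | airport, snowfall in range) as an observed frequency.

    Returns None when no historical flights match the conditions.
    """
    matching = [
        delayed
        for code, snow, delayed in flights
        if code == airport and min_snow <= snow <= max_snow
    ]
    if not matching:
        return None
    # True counts as 1 and False as 0, so the mean is the delay frequency.
    return sum(matching) / len(matching)

clear_day = delay_probability(FLIGHTS, "DTW", max_snow=0)
heavy_snow = delay_probability(FLIGHTS, "DTW", min_snow=6)
print(clear_day, heavy_snow)
```

A production model would likely condition on many more variables (airline, time of day, forecast uncertainty) and use a fitted statistical model rather than raw frequencies, but the underlying question is the same: how does the probability of delay change once the weather conditions are known?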
IW: Many see machine learning as a way to close the data science talent gap because the computers themselves can use data to develop and adjust models on the fly. Will this machine learning workbench enable companies to build predictive applications without needing lots of data scientists?
Campbell: Absolutely. The idea is to be able to scale the efforts of the relatively scarce data scientists. The machine-learning work being done now is being handled by PhDs. They have their versions of duct tape and baling twine to keep things running, but they don't know they have a problem because they're running small numbers of models. If something breaks they go fix it. But over the last few years, the big ad networks using predictive models have run into problems because they need them for every purchaser. They're running thousands of machine-learning models at once. They want models that take care of themselves and that spawn new models when they're no longer effective. … We want to be able to do that on behalf of customers.
There are fairly well-known models for things like fraud detection, spam detection and such. People are going to be building these models, and we expect to package them with a deployment environment on Azure. For every one person who can build a model, there might be 500 who could deploy it and run it in our cloud environment. Where do they get the data they'll need? That's going to be available on Azure as well.
IW: How do you think Microsoft's big data prospects stack up to competitors?
Campbell: If you look at the Oracles, SAPs and even IBM, frankly, none of them are processing hundreds of petabytes daily for their own businesses. None of them have a copy of the Web every day [like we do for Bing], which gives us social signals. This confluence of our commercial data platform, SQL Server, HPC and our online services is turning out to be an interesting cauldron.
I really do believe that this new era is going to be horizontal, so it's not going to be locked up in any single app or application suite. I'm eager to tell our story because there are few other entities on the planet that match us in terms of having Internet services and scale and also the commercial platform of our operating system, database, BI and High-Performance Computing platform.