This post is by Joseph Sirosh, Corporate Vice President of the Data Group at Microsoft.
This week I’m joining thousands of people attending Strata + Hadoop World in San Jose to explore the technology and business of big data and data science. As part of our participation in the conference, we are announcing several important investments to continue delivering on our commitment to make big data processing and analytics simpler and more accessible:
- Advanced analytics at scale with R Server for HDInsight and the latest version of Spark for HDInsight are now available in preview: Customers can leverage their existing R skills and reuse current code to run at scale. R Server for HDInsight offers popular scalable R algorithms and the ability to parallelize any existing R function. We are also releasing the latest version of Spark for HDInsight, which can deliver 7x the performance of MapReduce for most analytics. These capabilities give our customers the ability to train and run advanced analytics and ML models on larger datasets, and much faster, than was previously possible in the cloud.
- Out-of-the-box application integration, providing easier access to popular big data apps: Customers can now discover and deploy popular big data applications with HDInsight without any code or scripting required. Leading solutions such as Datameer Cloud, which offers a self-service big data analytics platform, AtScale, which offers cloud-based OLAP BI on Hadoop, and an ecosystem of other big data applications can now be deployed alongside HDInsight.
- Azure Data Catalog, previously announced as a public preview, will be generally available tomorrow: Data Catalog is an enterprise metadata catalog and portal for the self-service discovery of data sources. Users can now spend less time trying to find, understand and access the data they need, and more time analyzing it for value.
R Server for Azure HDInsight
Since announcing that R Services are built-in as part of SQL Server 2016 and announcing the availability of the public preview of Azure Data Lake, we’ve heard customer demand to bring R, the most popular programming language for predictive modeling, to our fully managed big-data services in the cloud. We are pleased to announce R Server for Azure HDInsight, a 100% open source R implementation. It runs the most comprehensive set of ML algorithms and statistical functions in the cloud, leveraging Hadoop and Spark. By making it available as a workload running inside HDInsight, we remove the obstacles that keep users from unlocking the power of R, eliminating memory and processing constraints and extending analytics from the laptop to large multi-node Hadoop and Spark clusters. This lets you train and run ML models on larger datasets than previously possible and make more accurate predictions. It also reduces the time to move ideas into production, eliminating time-consuming installation, setup and procurement cycles for new hardware.
For more information, go to the R Server page of HDInsight.
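To make the idea of lifting memory constraints by spreading work across nodes more concrete, here is a minimal sketch in plain Python. It is not R Server's API: the function names are hypothetical, and a thread pool stands in for cluster nodes. It shows the general pattern of reducing each chunk of data to a small summary and then merging the summaries, so no single worker needs the whole dataset.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Tuple

# Partial summary of one chunk: (count, sum, sum of squares).
# All names here are hypothetical and for illustration only.
Summary = Tuple[int, float, float]


def summarize(chunk: List[float]) -> Summary:
    """Reduce one chunk to a small summary that is cheap to ship and merge."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partial summaries; the merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(chunks: Iterable[List[float]], workers: int = 4) -> Tuple[float, float]:
    """Population mean and variance computed from per-chunk summaries."""
    # The thread pool is a stand-in for the nodes of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunks))
    total: Summary = (0, 0.0, 0.0)
    for partial in partials:
        total = merge(total, partial)
    n, s, ss = total
    if n == 0:
        raise ValueError("no data")
    mean = s / n
    return mean, ss / n - mean * mean
```

Calling `mean_and_variance([[1.0, 2.0], [3.0, 4.0], [5.0]])` yields a mean of 3.0 and a population variance of 2.0, the same result as processing all five values at once. The sum-of-squares formula is kept for brevity; it can lose precision on large values, so a production algorithm would likely use a more numerically stable merge.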
The Latest Version of Spark for Azure HDInsight
Spark is one of the most popular big data projects, known for its ability to handle large-scale in-memory data applications, batch and interactive queries, real-time streaming, machine learning, and graph processing with the same common execution model. Spark for Azure HDInsight has been updated to Apache Spark 1.6, the latest release, gaining critical performance improvements including a 10x speedup for streaming state management, automatic memory management, and new machine learning algorithms and capabilities. With Spark for Azure HDInsight, we offer customers more value with an enterprise-ready Spark solution that’s fully managed, and a choice of compelling and interactive experiences with different BI tools and popular notebooks such as Jupyter (IPython). This makes it easier for business analysts and data scientists to find new insights over big data.
For more information, go to the Azure HDInsight Spark page.
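For readers unfamiliar with the term, "streaming state management" refers to keeping per-key state (for example, running counts) alive across successive micro-batches of a stream. The sketch below illustrates the concept in plain Python. It does not use Spark's API, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def update_state(state: Dict[str, int], batch: Iterable[str]) -> Dict[str, int]:
    """Fold one micro-batch of words into the running per-key counts.

    Only keys that occur in the batch are touched; all other keys keep
    their previous value without being visited.
    """
    for word, n in Counter(batch).items():
        state[word] = state.get(word, 0) + n
    return state


def run_stream(batches: Iterable[List[str]]) -> Dict[str, int]:
    """Process micro-batches in arrival order, carrying state between them."""
    state: Dict[str, int] = {}
    for batch in batches:
        update_state(state, batch)
    return state
```

For example, `run_stream([["a", "b"], ["a"], []])` returns `{"a": 2, "b": 1}`. Because each update visits only the keys present in the incoming batch, the cost of a batch depends on its size rather than on the total number of keys tracked so far.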
Easier Access to Popular Big Data Apps
Today, we are also announcing out-of-the-box application integration, providing easier access to popular big data apps. This lets customers discover and deploy popular big data applications with HDInsight without any code or scripting required. As part of this ecosystem, we are happy to showcase two applications: Datameer Cloud, which enables code-free data preparation, and AtScale, for cloud-based OLAP BI on Hadoop.
Datameer Cloud is a fully managed big data analytics-as-a-service platform that provides a one-stop shop to integrate, prepare, analyze, visualize and operationalize data of any size, type or source. Datameer’s self-service functionality and end-to-end workflow, combined with Microsoft’s world-class Hadoop distribution, let business analysts analyze data and produce results without needing Hadoop knowledge or skills, and without IT having to implement a potentially costly, on-premises Hadoop infrastructure. No extra hardware, technical staff or administration costs are required, drastically reducing the total cost of ownership.
In addition, we are announcing the availability of AtScale in the Azure Marketplace with out-of-the-box integration with HDInsight. With AtScale, you can create OLAP cubes on top of data in HDInsight. This gives business analysts the ability to gain insights with their favorite BI tools while IT preserves the control, security and responsiveness of the Hadoop platform.
Azure Data Catalog General Availability
Since we announced Azure Data Catalog as a public preview, we’ve seen strong customer engagement, and we are excited to announce that Azure Data Catalog will be a generally available service starting tomorrow, March 30th. Data Catalog is an enterprise metadata catalog that allows for self-service discovery of data sources. It allows any data consumer – be it a business analyst, data scientist or data developer – to register, enrich, discover, understand and consume data sources. Crowdsourced annotations let users who are knowledgeable about the data enrich the metadata at any time. This bridges the gap between IT and data consumers, encouraging the community to share their business knowledge while still allowing IT to maintain control and oversight.
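The register, enrich and discover workflow described above can be pictured with a small in-memory model. This is a conceptual sketch in plain Python, not the Azure Data Catalog API: the class names, method names and example locations are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DataSource:
    """Metadata about a data source; the data itself stays where it lives."""
    name: str
    location: str
    tags: Set[str] = field(default_factory=set)
    descriptions: List[str] = field(default_factory=list)


class Catalog:
    """Hypothetical in-memory metadata catalog."""

    def __init__(self) -> None:
        self._sources: Dict[str, DataSource] = {}

    def register(self, name: str, location: str) -> DataSource:
        """Record that a data source exists and where it can be found."""
        if name in self._sources:
            raise ValueError(f"{name!r} is already registered")
        source = DataSource(name, location)
        self._sources[name] = source
        return source

    def annotate(self, name: str, tag: str = "", description: str = "") -> None:
        """Let any knowledgeable user enrich the metadata after registration."""
        source = self._sources[name]
        if tag:
            source.tags.add(tag.lower())
        if description:
            source.descriptions.append(description)

    def discover(self, term: str) -> List[str]:
        """Return names of sources whose name, tags or descriptions match."""
        term = term.lower()
        hits = []
        for s in self._sources.values():
            haystack = [s.name.lower(), *s.tags, *(d.lower() for d in s.descriptions)]
            if any(term in text for text in haystack):
                hits.append(s.name)
        return sorted(hits)
```

In this model, registering a source named `sales_2015` and annotating it with the tag `Revenue` makes it discoverable through a search for `revenue`, even though the source name does not contain that word. That is the role crowdsourced annotations play: the people who know the data add the vocabulary that other users search with.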
Today’s announcements underscore our commitment to continued innovation in the world of big data and analytics. We will share more about these developments at Strata + Hadoop World this week. Microsoft is also hosting //Build, our premier event for developers, later this week. Stay tuned.
Follow me on Twitter @josephsirosh