Open Source Partners: Focus on fast big data


Tim Walton - Technology Solutions Professional Open Source

In the October community call, we discussed the opportunity for partners to produce domain-specific solutions using Azure Container Service. The November focus for the Open Source Solutions Partner Community is OSS and big data. In this post, I'll provide an in-depth look at a trending solution called the SMACK stack, a valuable addition to every partner's digital transformation toolkit.

Sign up for the November 30 community call

Watch the October community call on demand

SMACK is a technology solution stack that comprises Spark, Mesos, Akka, Cassandra, and Kafka. It is a data-processing architecture designed to handle massive quantities of data using both batch and stream processing methods. It becomes especially important when solving problems such as ingesting and querying data produced by the Internet of Things and today's other big data producers.

Extract, Transform, and Load (ETL) systems

Data ingestion from various systems was typically achieved through Extract, Transform, and Load (ETL) systems. However, ETL has some inherent problems:

  • Data loss
  • Duplicate data after failover
  • Reduced throughput
  • High cost to scale
  • Increased pipeline complexity

The SMACK stack is an attempt to rationalize various data processing scenarios. It is made up of highly scalable, reactive frameworks to deliver a fast, Highly-Available Redundantly-Distributed (HARD) system.

The diagram below shows how the SMACK stack relates to the first-party services in Microsoft Azure. Partners are increasingly plugging Microsoft first-party services into the SMACK stack where it makes sense, for example by running Apache Spark on Azure HDInsight.

[Diagram: the SMACK stack and the related first-party services in Microsoft Azure]

The components of the SMACK stack

Spark

Apache Spark is the processing layer, an open-source cluster computing framework that addresses the disk-based limitations of traditional MapReduce solutions. Specifically, Spark focuses on providing distributed shared-memory primitives that drastically improve performance and make working with the data more interactive. Spark provides a unified interface for SQL queries, machine learning, graph analysis, and streaming (micro-batched) processing.

Advantages:

  • Distributed analytics platform
  • Simple abstraction of datasets
  • Multiple language support
  • Streaming support
  • Machine learning
  • Integrated SQL queries
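
The "streaming (micro-batched)" model mentioned above can be illustrated without Spark itself. The following is a minimal sketch in plain Python, not Spark code: it assumes events arrive as (timestamp, value) pairs already sorted by time, and the names `micro_batches` and `count_by_key` and the sample readings are hypothetical. Each fixed-width window is collected into a small batch, and an ordinary batch aggregation then runs on it.

```python
from collections import Counter
from itertools import groupby


def micro_batches(events, interval):
    """Group time-sorted (timestamp, value) events into fixed-width windows."""
    for window, group in groupby(events, key=lambda e: int(e[0] // interval)):
        # Yield the window start time and the values that fell inside it.
        yield window * interval, [value for _, value in group]


def count_by_key(batch):
    """An ordinary batch aggregation, applied to one micro-batch at a time."""
    return Counter(batch)


# Hypothetical sensor readings: (seconds since start, device id).
events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (1.5, "a"), (2.9, "b")]

results = [
    (start, count_by_key(batch))
    for start, batch in micro_batches(events, interval=1.0)
]
```

The point of the design is that the same aggregation function serves both batch and streaming workloads; only the way the data is sliced changes.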

Learn more about Apache Spark

R & Spark as Yin and Yang of Scalable Machine Learning in Azure HDInsight

Leverage R and Spark in Azure HDInsight for scalable machine learning

Big, Fast, and Data-Furious...with Spark

Build interactive data analysis environments using Apache Spark

Mesos

Mesos can be thought of as the resource manager or service fabric for the other frameworks. Apache Mesos is an open-source cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. The software enables resources to be shared in a fine-grained manner, which improves cluster utilization.
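
To make fine-grained resource sharing concrete, here is a toy sketch in plain Python. It is not the Mesos API and not Mesos's real allocation algorithm: the `Offer` and `Task` classes, the first-fit rule, and the agent and task names are all assumptions made for illustration. It only shows how several frameworks' tasks can each take a slice of the same machine rather than a whole machine.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Resources currently available on one cluster agent (hypothetical model)."""
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    """A unit of work that needs a slice of an agent's resources."""
    name: str
    cpus: float
    mem_mb: int


def place_tasks(offers, tasks):
    """First-fit placement: each task takes a slice of the first offer that fits."""
    placements = {}
    for task in tasks:
        for offer in offers:
            if offer.cpus >= task.cpus and offer.mem_mb >= task.mem_mb:
                # Deduct only what the task needs; the rest stays available.
                offer.cpus -= task.cpus
                offer.mem_mb -= task.mem_mb
                placements[task.name] = offer.agent
                break
    return placements


offers = [Offer("agent-1", 4.0, 8192), Offer("agent-2", 2.0, 4096)]
tasks = [
    Task("spark-executor", 2.0, 4096),
    Task("kafka-broker", 2.0, 4096),
    Task("cassandra-node", 2.0, 4096),
]
placements = place_tasks(offers, tasks)
```

In this sketch two different frameworks end up sharing agent-1, which is the utilization benefit the paragraph above describes.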

Learn more about Apache Mesos

Mesos videos on Channel 9

On-demand webcast: Mesosphere and Azure Container Service

Akka

Akka is the ingestion layer, an open-source toolkit and runtime that simplifies the construction of concurrent and distributed applications on the JVM. Akka is message-focused and emphasizes actor-based concurrency, and in that respect it is similar to Azure Service Fabric.

  • Fault tolerant
  • Hierarchical supervision
  • Customizable failure strategies and detection
  • Asynchronous data passing
  • Parallelization
  • Adaptive/predictive
  • Load-balanced across cluster nodes
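
The ideas of asynchronous message passing and supervision in the list above can be sketched in a few lines of plain Python. This is not Akka code: the `Actor` and `Supervisor` classes are hypothetical, processing is single-threaded to keep the example deterministic, and the only failure strategy shown is "restart the child with fresh state, drop the failing message, keep the remaining mailbox".

```python
from collections import deque


class Actor:
    """Owns private state and handles one message at a time from its mailbox."""

    def __init__(self):
        self.mailbox = deque()
        self.total = 0

    def receive(self, message):
        if not isinstance(message, int):
            raise ValueError(f"bad message: {message!r}")
        self.total += message


class Supervisor:
    """Watches one child actor and restarts it when message handling fails."""

    def __init__(self, actor_factory):
        self.actor_factory = actor_factory
        self.child = actor_factory()
        self.restarts = 0

    def tell(self, message):
        # Asynchronous send: enqueue the message and return immediately.
        self.child.mailbox.append(message)

    def run(self):
        while self.child.mailbox:
            message = self.child.mailbox.popleft()
            try:
                self.child.receive(message)
            except ValueError:
                # Restart strategy: new actor state, pending messages preserved.
                pending = self.child.mailbox
                self.child = self.actor_factory()
                self.child.mailbox = pending
                self.restarts += 1


supervisor = Supervisor(Actor)
for message in [1, 2, "oops", 4]:
    supervisor.tell(message)
supervisor.run()
```

After the run, the child has been restarted once and its state reflects only the message handled after the restart. Keeping failure handling in the supervisor, rather than in the actor's own logic, is what lets failure strategies be customized per hierarchy.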

Learn more about Akka

Akka videos on Channel 9

Cassandra

Apache Cassandra is the storage layer of the stack, an open source distributed database management system designed to handle large amounts of data across many commodity servers. Cassandra is used to persist distributed events, providing high availability with no single point of failure.

  • Massively scalable
  • High performance
  • Always on
  • Masterless
  • Multiple datacenter cluster support

Learn more about Apache Cassandra

Cassandra videos on Channel 9

December 16: Building geo-distributed public cloud apps on Cassandra

Kafka

Apache Kafka is the transportation layer and buffer for dealing with event streams. It provides:

  • High-throughput distributed messaging
  • Decoupled data pipelines
  • Capacity for massive data loads
  • Support for a massive number of consumers
  • Distribution and partitioning across cluster nodes
  • Automatic recovery from broker failures
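
Two of the properties above, partitioning and decoupled consumers, can be sketched with an in-memory log in plain Python. This is not the Kafka client API: the `Topic` class, its methods, and the group names are hypothetical, and CRC32 is used here only as a simple deterministic hash. Records with the same key land in the same partition, and each consumer group keeps its own read position.

```python
import zlib


class Topic:
    """A toy partitioned log with independent per-group read offsets."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self.offsets = {}  # (group, partition) -> next offset to read

    def produce(self, key, value):
        # Same key -> same partition, so per-key ordering is preserved.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def consume(self, group, partition, max_records=10):
        # Reading never removes data; it only advances this group's offset.
        start = self.offsets.get((group, partition), 0)
        records = self.partitions[partition][start:start + max_records]
        self.offsets[(group, partition)] = start + len(records)
        return records


topic = Topic(num_partitions=3)
first = topic.produce("device-1", "21.5")
second = topic.produce("device-1", "21.7")

analytics = topic.consume("analytics", first)
archive = topic.consume("archive", first)
analytics_again = topic.consume("analytics", first)
```

Since the log is retained and offsets are tracked per group, a slow or newly added consumer does not affect producers or other consumers, which is what decoupling the pipeline means in practice.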

Learn more about Apache Kafka

The partner opportunity

The SMACK stack simplifies streaming analytics, but there is a need for partners with full-stack knowledge and domain expertise. For example, ESRI, a Microsoft Gold Application Development Partner, recently demonstrated a fantastic partner-created solution with its forthcoming ArcGIS service, which utilizes DC/OS by way of Azure Container Service. ArcGIS takes advantage of data systems such as Spark Streaming, Kafka, and Elasticsearch, as well as Azure IoT Hub, to analyze and visualize geospatial data in real time. This is a packaged offering that ESRI can provide to its customers as a managed service.

Whether your partner business focuses on data platform, advanced analytics, IoT, or application development, understanding SMACK is critical for your architects. The attributes of each of the frameworks that make up the SMACK stack act as patterns for reactive, Highly-Available Redundantly-Distributed systems.

The demand for SMACK expertise is growing rapidly. The stack provides deep business value and is a perfect fit for the hyperscale properties of Microsoft Azure.

Resources

Microsoft Ignite sessions on demand

Streaming in the Cloud: We've Got It All Covered

Azure Container Service sessions

Training recommendations

Webcast series about open source on Microsoft Azure

Training and certification for Azure

Open Source Solutions (OSS) Partner Community

