Azure Synapse – A unified Platform for Data Warehousing, Big Data Analytics, and Machine Learning
Azure Synapse provides a unified workspace for data warehousing, big data analytics, and machine learning. It also includes a suite of capabilities for data ingestion, data exploration and preparation, model building and scoring, and deployment.
CIOs will need to address several challenges to successfully implement synapse machine learning, including data quality, skills gap, and integration with existing systems.
Data Pipelines
Powered by Apache Spark, the new platform provides a unified environment for data warehousing, big data analytics, and machine learning. It integrates SQL and Spark for the movement of data in and out of a DW, and includes integration technologies to ease data transformation and movement from external sources.
Synapse allows you to build and deploy machine learning models with no-code AutoML workflows and T-SQL constructs. It also offers a variety of open-source libraries for Apache Spark and Azure ML, including tensorflow and pytorch. The platform also supports the t-sql predictive model ‘predict’ function for scoring predictions.
Synapse offers a visual ETL environment for data collection and processing. It can extract and move raw data to a data lake for processing. It can then publish the data in a data warehouse for business intelligence applications to consume. It uses an integration runtime to connect and orchestrate activities with data flows, data lakes, and compute services.
Data Exploration
Synapse analytics supports a variety of data exploration tools for your business to get holistic insights. The platform can pull data from various sources like blob storage, on-premises databases, and even real-time data streams. This flexibility allows for the rapid and easy development of data pipelines.
The unified analytical experience of synapse machine learning allows for quick and easy data analysis with powerful chart options and integration with Power BI, Synapse SQL, and Azure Machine Learning. In addition, it provides a scalable and cost-efficient model through its intelligent architecture that separates the data from the compute resource. This ensures that the resources can be scaled up and down for different business needs.
In addition, synapse analytics uses Apache Spark and serverless SQL pools to query the data. These pools execute queries over the data in a scalable manner. These pools are configured to run python libraries such as pandas, matplotlib, and seaborn for exploring the data.
Model Training
For machine learning models, Synapse offers the opportunity to use various libraries. For a small dataset, popular libraries such as Scikit-learn and Matplotlib are available to explore data and perform feature engineering. For larger datasets, the Spark MLlib library allows you to train and score machine learning models.
For example, a customer segmentation model might be built to predict a customer’s propensity to purchase or abandon products. This information can help companies tailor marketing campaigns and improve customer relationships. Another common application is predictive analytics in healthcare. This enables organizations to identify patients who are at risk of developing certain conditions and provide targeted interventions.
A typical ML and AI pipeline involves storing a raw dataset in the data lake, doing all of the training and scoring of a model using AML, pushing the scored results back to the data lake, then serving those scores via Power BI or Synapse. Now that AML has tighter integration with Synapse, it’s possible to run this workflow in the Synapse Workspace.
Model Scoring
The ML workflow in Synapse provides capabilities that enable the activities across the entire ML lifecycle within a unified platform. This includes acquiring data, exploring it, preparing it for machine learning, model training, modeling scoring and deployment.
With a rich set of chart options in the Synapse Notebook and integration with Power BI, analyzing, preparing and exploring data is easy and convenient. Synapse also provides a complete set of libraries including Apache Spark MLlib and Microsoft Cognitive Services for text analytics (Anomaly Detector) and sentiment analysis.
Predictive analytics is an important decision making process in many business areas. In marketing, it helps segment customers for targeted customer marketing efforts. In healthcare, it helps identify patients who are at risk for certain diseases so they can be monitored and treated early. Synapse supports predictive analytics through deep integration with Azure Machine Learning and by enabling the TSQL PREDICT function in serverless SQL pools. This functionality allows you to score data without moving it out of the data warehouse.