Every machine learning project requires a deep understanding of the data: whether it is representative of the problem to be solved, which approaches should be taken, and ultimately whether the project can succeed.
Understanding of the data typically takes place during the Exploratory Data Analysis (EDA) phase. It is a complex part of the project in which the data is cleansed, outliers are identified and the suitability of the data is assessed to inform hypotheses and experiments.
The following image illustrates the various phases, their respective complexity and roles during a typical machine learning project:
The Data Discovery Playbook aims to quickly provide structured views on your text, images and videos, all at scale using Synapse and unsupervised ML techniques that exploit state of the art deep learning models.
The goal is to present this data to and facilitate discussion with a business user/data owner very quickly via PowerBI visualisation, so that the customer and team can decide the next best action with the data, identify outliers or generate a training data set for a supervised model.
Another goal is to simplify and accelerate the complex Exploratory Data Analysis phase of the project by democratising common data science functions, so that you can focus more on the business problem you are trying to solve.
Keep all code assets standalone and as simple as possible for quick usage or adaptation for production usage.
The aim of this Playbook is to illustrate the usage of the tools, alongside guidance, examples and documentation, to get rapid insights into your unstructured data, all of which have been applied in real customer solutions.
The intended audience of this Playbook includes:
- Engineering/project leads
- Data scientists/data engineers
- Machine learning engineers
- Software engineers
This Playbook provides code to quickly discover data as part of the Exploratory Data Analysis phase of the project. The overall approach is to take a large unstructured dataset that has no labels available, and to iterate over the data using a variety of techniques to aggregate, cluster and ultimately label the data in a cost effective and timely manner.
This is achieved by using unsupervised ML clustering algorithms, heuristic approaches and by direct input and validation by a domain expert. Asking questions of the data in natural language is also possible, if text based, using the semantic search feature of Azure Cognitive Search.
By combining these approaches, structure and labels can be applied to large datasets so that the data may either be indexed for discovery via a search solution such as Azure Cognitive Search or for a supervised ML model to be trained so that future unseen data can be classified accordingly.
The following illustrates this approach at a high level for a text based problem where large amounts of unstructured data exists:
- Cluster and explore the data quickly in the generated interactive PowerBI report
- Ask specific questions of your data from within the Synapse notebooks using Azure Cognitive Search with Semantic search and SynapseML
- Assess the data to determine whether some simple heuristics may be applied to classify the data with a semantically relevant term - see the Heuristics notebook
- Apply the heuristic classification to the underlying data and remove the data from the larger corpus that could be classified
- Run Text Clustering on the remaining data and generate Word Clouds - iterate until an ideal number of clusters appears and the clusters make sense to a Domain Expert
- The Domain Expert assesses the Word Clouds in more detail, and makes obvious corrections to Word Clouds by programmatically moving terms between clusters or using PixPlotML
- The Domain Expert assesses the Concept Graph for network connectivity and relationships
- The Domain Expert labels the clusters with a semantically relevant term which is programmatically propagated to the underlying records within the dataset
- Merge steps 1 and 6 which now allows for a classification model to be trained
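The heuristic and labelling steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the keyword rules, the record fields (`text`, `cluster`, `label`) and all function names are hypothetical and are not taken from the Playbook notebooks, and the cluster ids are assumed to come from a separate clustering run.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword rules: label -> terms that indicate that label.
HEURISTIC_RULES: Dict[str, List[str]] = {
    "cricket": ["wicket", "innings"],
    "tennis": ["wimbledon", "backhand"],
}


def heuristic_label(text: str, rules: Dict[str, List[str]]) -> Optional[str]:
    """Return the first label whose keywords appear in the text, else None."""
    lowered = text.lower()
    for label, terms in rules.items():
        if any(term in lowered for term in terms):
            return label
    return None


def split_by_heuristics(
    records: List[dict], rules: Dict[str, List[str]]
) -> Tuple[List[dict], List[dict]]:
    """Separate records a heuristic could label from the remaining corpus."""
    labelled, remaining = [], []
    for record in records:
        label = heuristic_label(record["text"], rules)
        if label is None:
            remaining.append(record)  # still needs clustering
        else:
            labelled.append({**record, "label": label})
    return labelled, remaining


def propagate_cluster_labels(
    records: List[dict], cluster_names: Dict[int, str]
) -> List[dict]:
    """Copy the Domain Expert's name for each cluster onto its records.

    Records in clusters that have not been named yet get a label of None.
    """
    return [{**r, "label": cluster_names.get(r["cluster"])} for r in records]
```

Calling `split_by_heuristics(records, HEURISTIC_RULES)` returns the heuristically labelled records and the remainder; once the remainder has been clustered and named, the two labelled sets can be concatenated to form a training set for a supervised model.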
PixPlotML is an interactive and zoomable visualization of your whole dataset. This web-based tool, a modified version of the original Pixplot, is valuable for object detection and classification projects to perform these tasks:
- Initial investigation and visualization of a labelled (or unlabelled) dataset.
- Fixing incorrect classifications and removing invalid or confusing images. (Click on an image and update its label, or flag for removal)
- Visualizing false positive bounding boxes to identify why they are occurring. Images that look similar are located next to or near each other (in the UMap visualization), making it easy to see where errors occur.
UMap Visualization | Interactive and Zoomable | Different Views (by label) |
---|---|---|
Code written during EDA may not make it to production, but treating it as production code is vital as it provides an audit and represents the investment made to determine the correct ML solution as part of a hypothesis driven development approach.
This allows teams not only to reproduce the experiments but also to learn from past lessons, saving time and associated development costs.
All Synapse notebooks contain full AML and MLFlow experiment tracking to provide lineage on data and parameters used.
This Playbook aims to provide similar approaches across a variety of technologies and uses the following components:
- Azure Synapse notebooks and Spark Pools for data processing and compute.
- PowerBI for rapid and simple interactive data visualisation dashboards
- OpenAI - these models can be easily adapted to your specific task, including but not limited to content generation, summarization, semantic search, and natural language to code translation
- Keras Applications, in particular InceptionV3, for image inference and feature extraction
- Transformers Packages for feature extraction
- BLIP model for image captioning via visual and word transformers
- Pegasus xsum model for abstractive text summarisation
- SparkML pipelines for clustering techniques
- SKLearn for clustering techniques and others
- spaCy for feature extraction
- Azure Cognitive Search for rapid search of the dataset
- PixPlotML for rapidly visualising and labelling datasets
A new Synapse workspace and all cluster configuration and notebooks can be deployed from here.
- Download and install the Azure CLI
- Download and install jq, a lightweight and flexible command-line JSON processor
- Azure Data Lake Storage Gen2 storage account - The Azure Synapse workspace needs to be able to read and write to the selected ADLS Gen2 account. In addition, for any storage account that you link as the primary storage account, you must have enabled hierarchical namespace at the creation of the storage account, as described on the Create a Storage Account page. More info on creating Azure Data Lake Storage can be found here. This existing or new account also needs a Blob Container to be created with a chosen name to be supplied below in the environment variable `BlobContainerName`. For example, call your container `share`.
- Log in to your Azure Subscription via `az login`
- Clone the Playbook repo and change into the deployment directory: `git clone https://github.com/microsoft/data-discovery-toolkit`, then `cd data-discovery-toolkit/environment_preparation/deployment`
- Rename the file `vars.sample` to `vars.env`
- Populate the required variables within the `vars.env` file:
# The resource group that your Synapse instance has been provisioned to
SynapseResourceGroup=
# The region of the Synapse Resource Group
Region=
# The existing ADLS Storage Account name
StorageAccountName=
# The existing resource group of the ADLS storage account
StorageAccountResourceGroup=
# The name of the existing Blob Container within the ADLS Gen 2 storage account mentioned above - also called File Share in some of the notebooks
BlobContainerName=
# The name of the Synapse Workspace
SynapseWorkspaceName=
# The Synapse SQL user
SqlUser=
# The Synapse SQL password
SqlPassword=
# The Azure subscription id
SubscriptionId=
- Run the following command: `sh deploy.sh`
☕ Grab some coffee as it will take around 30 minutes ☕
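Before running the deployment, it can help to check that every variable in `vars.env` has been given a value. The following is a minimal sketch, not part of the Playbook: the variable names are copied from the template above, while the function names are hypothetical and the parser assumes simple `KEY=VALUE` lines with `#` comments.

```python
from pathlib import Path
from typing import Dict, List

# Variable names copied from the vars.env template in this README.
REQUIRED_VARS: List[str] = [
    "SynapseResourceGroup",
    "Region",
    "StorageAccountName",
    "StorageAccountResourceGroup",
    "BlobContainerName",
    "SynapseWorkspaceName",
    "SqlUser",
    "SqlPassword",
    "SubscriptionId",
]


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(values: Dict[str, str], required: List[str] = REQUIRED_VARS) -> List[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in required if not values.get(name)]


def check_env_file(path: str = "vars.env") -> List[str]:
    """Read an env file from disk and list the variables still to be filled in."""
    return missing_vars(parse_env(Path(path).read_text()))
```

An empty list from `check_env_file()` means every required variable has a value; it does not verify that the values refer to real Azure resources.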
- Verify that your resource group contains a provisioned Synapse service
- Verify that no permission errors are raised when launching the Synapse Workspace
- Verify that an Apache Spark Pool cluster has been provisioned
- Verify that packages have been installed against the cluster
- Verify that the notebooks have been imported into a folder called `Data Discovery`
- Verify that you can associate the Apache Spark Pool cluster with a notebook
- Verify that a Spark session is started when executing the notebook
To use the Synapse components, an Azure Synapse Spark pool is required. Please navigate to Synapse Environment Preparation to configure the cluster for usage.
To use the Azure Cognitive Search functionality, a Search instance must be provisioned with Semantic Search enabled.
- Node - a single infrastructure VM comprised of compute and memory
- Cluster - a group of nodes
- Spark Pool - a cluster with its associated configuration and sizing
- Clustering - an unsupervised machine learning technique for grouping similar records together
- ADLS - Azure Data Lake Storage
Refer to the Text Clustering section for more detailed information on clustering documents.
The following code accelerators serve as starting points to try approaches that are known to work for the data discovery phase. Note - these accelerators are not intended for production; they will require amendment before being incorporated into a production pipeline.
Media Type | Scenario | Description | Platform |
---|---|---|---|
Text Documents | Text Clustering | Extract features with TF-IDF and cluster documents with built in Search and interactive PowerBI report | Synapse |
Text Documents | Text Clustering | Extract features with spaCy and cluster documents with built in Search and interactive PowerBI report | Synapse |
Text Documents | Text Clustering | Extract features with BERT and cluster documents with built in Search and interactive PowerBI report | Synapse |
Text Documents | Text Clustering | Extract features with Azure OpenAI and cluster documents with built in Search and interactive PowerBI report | Synapse |
Text Documents | Text Summarisation | Generate abstractive text summaries with Pegasus xsum model with built in Search | Synapse |
Images and videos | Image Clustering | Extract features from images, make an imagenet prediction and cluster | Synapse |
Images and videos | Image Captioning | Generate a caption for an image and cluster the captions with built in Search and interactive PowerBI report | Synapse |
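To make the TF-IDF feature extraction in the first row more concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity. It is an illustration only and does not reproduce the notebook code: the function names are hypothetical, the tokeniser is deliberately naive, and library implementations typically add smoothing and normalisation that are omitted here.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Compute one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Terms that occur in every document receive a weight of zero, so only the distinguishing vocabulary contributes to the similarity that a clustering algorithm would then operate on.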
See Environment preparation for Synapse
This section contains some documented common scenarios:
Scenario | Description | Platform |
---|---|---|
Animal Face Image Captioning and clustering | End to end walkthrough of captioning against an animal face dataset | Synapse |
Animal Face Feature Extraction and clustering | End to end walkthrough of clustering against an animal face dataset | Synapse |
Interactive Image Cluster dashboard | Setting up a dashboard from scratch in 2 minutes | PowerBI |
BBC Sports Similarity Matrix | A notebook that illustrates how to use locality-sensitive-hashing (LSH) to create a similarity matrix. Could be useful when dealing with large amounts of text documents | Synapse |
Image Situation dataset bias | A notebook that shows how to detect biases in a labeled image dataset | Synapse |
Applying simple heuristics for classification | A notebook that shows how to run simple search techniques to quickly get a baseline | Synapse |
Classify BBC Sports documents with Azure OpenAI | A notebook that shows how to use OpenAI one-shot classification to quickly get a baseline | Synapse |
EDA with Azure OpenAI | A notebook that shows how to use OpenAI for EDA including OpenAI Retrieval Augmented Generation Pattern evaluation using Azure Cognitive Search, Azure Cognitive Semantic Search, Azure Synapse/Trident and evaluation, clustering and automated classification | Synapse |
Search Evaluation with Azure OpenAI | A notebook that shows how to automate queries from a dataset, build a standard, semantic and hybrid vector Azure Cognitive Search index and evaluate the different indices including the OpenAI Retrieval Augmented Generation (RAG) pattern | Synapse |
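As background for the similarity matrix scenario in the table above, the sketch below uses MinHash signatures, the hash family commonly paired with locality-sensitive hashing for Jaccard similarity, to estimate pairwise document similarity. It is a simplified stand-in and not the notebook's implementation: the banding step that buckets candidate pairs is omitted, every pair is compared directly, and all function names and parameters are hypothetical.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Split text into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item, varied by seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def similarity_matrix(texts: List[str], num_hashes: int = 64) -> List[List[float]]:
    """Pairwise estimated similarities between all texts."""
    signatures = [minhash_signature(shingles(t), num_hashes) for t in texts]
    return [[estimated_jaccard(a, b) for b in signatures] for a in signatures]
```

Because signatures have a fixed length, comparing them is cheaper than comparing full shingle sets; on large corpora the omitted banding step is what avoids comparing every pair.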
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated options – at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, transform, manage and serve data for immediate BI and machine learning needs.
PowerBI. Connect to and visualize any data using the unified, scalable platform for self-service and enterprise business intelligence (BI) that’s easy to use and helps you gain deeper data insight.
Azure Cognitive Search is a fully managed search as a service to reduce complexity and scale easily including:
- Auto-complete, geospatial search, filtering, and faceting capabilities for a rich user experience
- Built-in AI capabilities including OCR, key phrase extraction, and named entity recognition to unlock insights
- Flexible integration of custom models, classifiers, and rankers to fit your domain-specific needs
Graphframes is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
The notebooks contain a basic graph implementation that can be amended to run functions such as BFS, DFS, find communities and label propagation amongst others.
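For readers unfamiliar with breadth-first search (BFS), the idea can be shown on a small in-memory graph. This sketch is plain Python and does not use the GraphFrames API; the adjacency-list representation and the function name are assumptions made purely for illustration.

```python
from collections import deque
from typing import Dict, List, Optional


def bfs_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Return a shortest path (fewest edges) from start to goal, or None."""
    if start == goal:
        return [start]
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in parents:
                continue  # already visited
            parents[neighbour] = node
            if neighbour == goal:
                # Walk back through the parents to rebuild the path.
                path = [neighbour]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None
```

In a concept graph whose nodes are terms and whose edges link terms that occur together, a search of this kind shows how two concepts are connected through intermediate terms.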
Dataset | Description | Labels |
---|---|---|
BBC sports | Consists of 737 documents from the BBC Sport website corresponding to sports news articles in five topical areas from 2004-2005 | Class Labels: 5 (athletics, cricket, football, rugby, tennis) |
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
Please refer to the following references for additional relevant material: