Evaluating Azure ML Regression Results.
In the earlier article, we used Azure ML Designer to build a regression model. The final output of that pipeline is a handful of metrics that tell us how good our regression model is.
Two steps are of interest when evaluating the efficiency of the model. The Score Model step predicts the price, and the Evaluate Model step measures the difference between that prediction and the actual price, which is already available in the test dataset.
Azure ML Step | Function | Dataset
Train Model | Find the mathematical relationship (model) between the input data and price | Training dataset
Score Model | Predict prices using the trained model | Testing dataset, with 1 more column added for the predicted price
Evaluate Model | Calculate the difference between the prediction and the actual price | Testing dataset
Score Model
Before going to evaluation, it is pertinent to investigate the output of Score Model and what has been scored.
Understanding Scoring
When we were training our model, we selected the Label column as Price.
In simple terms, training a model for the label Price means:
Using the 25 other columns in the data, find the combination of values that best predicts the value of our Label column (price).
Scoring the model
- Uses the trained model to predict the value of price
- Takes the test dataset and provides a predicted value of price for each row
Therefore, after scoring, we have an extra column added at the end of the scored dataset, called Scored Labels. It looks like this in the preview:
This new column “Scored Labels” is the predicted price. We can use this column to calculate the difference between the actual price, which was available in the test dataset, and the predicted price (Scored Labels).
The lower the difference, the better the model is. Hence, we will use the difference as a measure to evaluate the model. There are several metrics which we can use to evaluate the difference.
Evaluate Model
We can investigate these metrics by right clicking on Evaluate Model > Preview data > Evaluation results
The following metrics are reported:
Both Mean Absolute Error and Root Mean Square error are averages of errors between actual values and predicted values.
I will take the first two rows of the scored dataset to explain how these metrics evaluate the model.
make | Wheel-base | length | width | … | Price | Predicted price (Scored Label)
Mitsubishi | 96.3 | 172.4 | 65.4 | … | 9279 | 11413.49
Subaru | 97 | 173.5 | 65.4 | … | 10198 | 8489.47
Absolute Errors
Mean Absolute Error
MAE evaluates the model by taking the average magnitude of the errors in the predicted values. A lower value is better.
Using the table above, MAE is calculated as:
Price | Predicted price (Scored Label) | Error = Prediction – Price | Magnitude of error | Average of magnitudes
9279 | 11413.49 | 2,134.49 | 2,134.49 | 1,921.51
10198 | 8489.47 | -1,708.53 | 1,708.53 |
The average error for the above 2 rows is (2,134.49 + 1,708.53) / 2 = 1,921.51. Let’s assume there is another model whose MAE is 500.12. If we were comparing the two models:
Model 1 : 1,921.51
Model 2 : 500.12
In this case, Model 2 is the better model because its average absolute error is lower.
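The MAE arithmetic for the two example rows can be reproduced with a few lines of Python. This is a minimal sketch; the function name `mean_absolute_error` is only illustrative and is not part of Azure ML.

```python
# Two rows from the scored dataset: (actual price, Scored Labels)
rows = [(9279, 11413.49), (10198, 8489.47)]

def mean_absolute_error(pairs):
    """Average of |prediction - actual| over all rows."""
    return sum(abs(predicted - actual) for actual, predicted in pairs) / len(pairs)

mae = mean_absolute_error(rows)
print(round(mae, 2))
```

The sign of each error is dropped before averaging, so an over-prediction and an under-prediction cannot cancel each other out.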
Root Mean Squared error
RMSE also summarizes the average magnitude of the error as a single value, and a lower value is better. However, it differs from Mean Absolute Error in how the errors are combined:
- Errors are squared before they are averaged, so large errors receive relatively high weight. E.g., if we had 2 error values of 2 & 10, squaring them would make them 4 and 100 respectively, so the larger error counts disproportionately.
- The square root of the average is then taken, which brings the result back to the same units as the price.
This makes RMSE more useful when large errors are particularly undesirable.
Using the table above, RMSE is calculated as:
Price | Predicted price (Scored Label) | Error = Prediction – Price | Square of Error | Average of Sq. Error | Square root
9279 | 11413.49 | 2,134.49 | 4,556,047.56 | 3,737,561.16 | 1,933.28
10198 | 8489.47 | -1,708.53 | 2,919,074.76 | |
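The same two rows can be run through a small RMSE calculation in Python. This is a sketch for checking the arithmetic; the function name is illustrative and not an Azure ML API.

```python
import math

# Two rows from the scored dataset: (actual price, Scored Labels)
rows = [(9279, 11413.49), (10198, 8489.47)]

def root_mean_squared_error(pairs):
    """Square each error, average the squares, then take the square root."""
    squared = [(predicted - actual) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared) / len(squared))

rmse = root_mean_squared_error(rows)
print(round(rmse, 2))
```

For these rows RMSE comes out slightly higher than the average absolute error, because the larger of the two errors carries more weight once it is squared.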
Relative Errors
To calculate relative error, we first need to calculate the absolute error. Relative error expresses how large the absolute error is compared with the size of the value we are measuring.
Relative Error = Absolute Error / Known Value
Relative error is expressed as a fraction, or multiplied by 100 to be expressed as a percent.
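As a quick illustration, the relative error of a single prediction can be computed as below. The sketch reuses the Mitsubishi row from the earlier example; the function name is hypothetical.

```python
def relative_error(actual, predicted):
    """Absolute error divided by the known (actual) value."""
    return abs(predicted - actual) / abs(actual)

# Mitsubishi row: actual price 9279, predicted price 11413.49
rel = relative_error(9279, 11413.49)
print(f"{rel:.4f} as a fraction, {rel * 100:.2f} as a percent")
```

For this row the prediction is off by roughly 23% of the actual price.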
Relative Squared error
Relative squared error compares the model’s squared error with what it would have been if a simple predictor had been used.
This simple predictor is the average of the actual values.
Example :
In the automobile price prediction, the error of our model is its Total Squared Error.
Suppose that, instead of using our model to predict the price, we simply take the average of the “price” column and compute the squared error of that simple average. This gives us a benchmark against which to evaluate our model’s error.
Therefore, the relative squared error is:
Relative Squared Error = Total Squared Error / Total Squared Error of simple predictor
Using the two-row example to calculate RSE :
Calculate the sum of squared errors for the simple predictor (the average price):
Price | Average of Price (New Scored Label) | Error = Prediction – Price | Squared Error | Sum of Squared Errors
9279 | 9,738.5 | 459.5 | 211,140.25 | 422,280.50
10198 | 9,738.5 | -459.5 | 211,140.25 |
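Mathematically, the relative squared error E_i of an individual model i is evaluated by:
$$E_i = \frac{\sum_{j=1}^{n} (P_{ij} - T_j)^2}{\sum_{j=1}^{n} (T_j - \bar{T})^2}$$
where P_{ij} is the value predicted by model i for record j, T_j is the actual value for record j, and \bar{T} is the average of the actual values.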
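The ratio of the two squared-error totals described in this section can be checked for the two example rows with a short Python sketch. The function name is illustrative, and the baseline is assumed to be the average of the actual prices, as described above.

```python
# Two rows from the scored dataset: (actual price, Scored Labels)
rows = [(9279, 11413.49), (10198, 8489.47)]

def relative_squared_error(pairs):
    """Model's total squared error divided by that of the average-price predictor."""
    actuals = [actual for actual, _ in pairs]
    mean_actual = sum(actuals) / len(actuals)
    model_sse = sum((predicted - actual) ** 2 for actual, predicted in pairs)
    baseline_sse = sum((mean_actual - actual) ** 2 for actual in actuals)
    return model_sse / baseline_sse

rse = relative_squared_error(rows)
print(round(rse, 2))
```

On these two rows alone the ratio is far above 1, meaning the simple average would have done better here. Two rows are too small a sample to judge the model; the metric reported by Evaluate Model is computed over the whole test dataset.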
Relative Absolute Error
RAE compares the model’s total absolute error with the total absolute error of a trivial or naïve model, such as one that always predicts the average of the actual values. A good forecasting model will produce a ratio close to zero. A poor model will produce a ratio greater than 1.
Relative Absolute Error = Total Absolute Error / Total Absolute Error of simple predictor
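The formula above can be applied to the two example rows as follows. This is a sketch that assumes the simple predictor is the average of the actual prices; the function name is illustrative.

```python
# Two rows from the scored dataset: (actual price, Scored Labels)
rows = [(9279, 11413.49), (10198, 8489.47)]

def relative_absolute_error(pairs):
    """Model's total absolute error divided by that of the average-price predictor."""
    actuals = [actual for actual, _ in pairs]
    mean_actual = sum(actuals) / len(actuals)
    model_total = sum(abs(predicted - actual) for actual, predicted in pairs)
    baseline_total = sum(abs(mean_actual - actual) for actual in actuals)
    return model_total / baseline_total

rae = relative_absolute_error(rows)
print(round(rae, 4))
```

As with the squared version, a value above 1 on such a tiny sample only says that the average happened to be closer than the model for these particular rows.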
Coefficient of determination
The coefficient of determination (R²) represents the proportion of the variance in the dependent variable (the price being predicted) that is explained by the independent variables (the attributes used to predict it).
In contrast with correlation, which describes the strength of the relationship between the independent and dependent variables, the coefficient of determination describes to what extent the variance of one variable explains the variance of the other.
R (Correlation) (source: http://www.mathsisfun.com/data/correlation.html)
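A common way to compute the coefficient of determination is one minus the ratio of the model's squared error to the squared error around the mean of the actual values. The sketch below uses that standard definition on the two example rows; it is an assumption that Evaluate Model follows the same formula.

```python
# Two rows from the scored dataset: (actual price, Scored Labels)
rows = [(9279, 11413.49), (10198, 8489.47)]

def coefficient_of_determination(pairs):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    actuals = [actual for actual, _ in pairs]
    mean_actual = sum(actuals) / len(actuals)
    residual_ss = sum((actual - predicted) ** 2 for actual, predicted in pairs)
    total_ss = sum((actual - mean_actual) ** 2 for actual in actuals)
    return 1 - residual_ss / total_ss

r2 = coefficient_of_determination(rows)
print(round(r2, 2))
```

A value close to 1 means the model explains most of the variance. A negative value, as these two rows produce, means the predictions fit worse than simply using the mean.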
Azure Machine Learning studio provides an easy-to-use interface for data scientists and developers to build, train and productionise machine learning models. Another major benefit it provides is the ease of collaboration.
In this article, we will explore how to solve a machine learning problem with Azure Machine Learning Designer.
Defining the Problem
To solve the problem in Azure ML Studio, we need to carry out the following steps:
- Create a pipeline
- Set the pipeline’s compute target
- Import the data
- Transform the data
- Train the model
- Test the model
- Evaluate the model
Creating pipeline using ML Designer
Azure machine learning pipelines are workflows of executable steps that enable users to complete Machine Learning workflows. Executable steps in azure pipelines include data import, transformation, feature engineering, model training, model optimisation, deployment etc.
There are 3 ways of creating pipelines in Azure Machine learning Studio
- Using Code ( Python SDK )
- Using Auto ML
- Using ML Designer.
When we log in to Azure ML Studio, we see the following options.
Click on Designer (Start Now) to create a new pipeline.
By default it’s given a name based on today’s date. I have changed it to Automobile Prediction.
Setting Compute Target
A compute target is an instance of an Azure virtual machine that provides the processing power for our pipeline execution.
The default compute target is used for the entire pipeline, but we can also use separate compute targets for individual steps of execution.
I created a compute instance earlier, so I can select the existing one.
Importing Data
We can import the data from several sources. For this article, I will use sample datasets provided by Azure.
To explore what is in the data set, we can click on the data set and go to preview data.
This provides us a sneak-preview of what’s included in the data. E.g. there are 205 rows, 26 columns. Clicking on each column provides key statistics about data in that specific column. E.g. If I click on length column, I get a histogram showing the frequency of length values and various other statistics about it.
Transform Data
The data preview feature is helpful in understanding the columns and transforming data for any characteristics necessary to run our model.
Exclude a column
The data transformation section on left side menu provides several commonly used data transformation operations.
I want to remove the column normalized-losses, so I can drag and drop “Select Columns in Dataset” onto the canvas.
When I go to the details of “Select Columns in Dataset”, I can then select all columns other than normalized-losses.
Clean Missing Data
After removing the normalized-losses column, our data still has empty values. To clean the missing data, I can use the Clean Missing Data module from the left side menu, so that our workspace looks like this:
In the details of Clean Missing Data, I can select all columns.
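The effect of these two transformation steps can be sketched in plain Python. The rows below are hypothetical stand-ins for the automobile data, and the sketch assumes the cleaning mode is to remove any row that still has a missing value, which is only one of the options the module offers.

```python
# Hypothetical mini-dataset; None marks a missing value
rows = [
    {"make": "audi", "normalized-losses": 164, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "price": 16430},
    {"make": "isuzu", "normalized-losses": None, "price": None},
]

def exclude_column(rows, column):
    """Keep every column except the named one (like Select Columns in Dataset)."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]

def remove_rows_with_missing(rows):
    """Drop any row that still contains a missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

selected = exclude_column(rows, "normalized-losses")
cleaned = remove_rows_with_missing(selected)
print(len(rows), len(cleaned))
```

Excluding the sparse column first means the bmw row survives the cleaning step; cleaning before excluding would have dropped it.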
Splitting the Data
I want to divide my dataset rows into
- Training rows (training dataset)
- Testing rows (testing dataset)
I can use the Split Data module from the left side menu, so that my pipeline now looks like this:
The Split Data module has 2 outputs. The left output will connect to Train Model and the right output will provide the test data.
In the details of the Split Data module, I can choose the ratio with which I want to split the data across training and testing.
In summary, we cleaned the dataset and then divided it into two separate datasets, as shown in the image below.
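A ratio-based split can be sketched in plain Python as below. The 0.6 fraction matches the 60/40 split used later in this article; the row count of 205 and the fixed seed are illustrative assumptions, and the function is not the Split Data module's implementation.

```python
import random

def split_rows(rows, train_fraction=0.6, seed=42):
    """Shuffle the rows, then cut them into a training part and a testing part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(205))  # stand-in for 205 dataset rows
train, test = split_rows(rows)
print(len(train), len(test))
```

Shuffling before cutting keeps any ordering in the source data, for example rows sorted by make, from leaking into one side of the split.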
Training the Model
To train any model, we need:
- The model
- The data on which the model is to be trained
In our case, we want to predict automobile prices using a Linear Regression model. The training data for the Linear Regression model is the training portion that we split from our overall data.
Therefore, we can train our model by combining the Linear Regression module and the Train Model module from the left side menu.
The Train Model module requires a label to train the model for. The label is the dependent variable, the value we want to predict (here, price), which the model estimates from the independent variables.
[In y = mx + c, the label is y]
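To make the role of the label concrete, here is a minimal least-squares fit of y = mx + c with a single input column. The data points are hypothetical, and this one-variable sketch is not how the Linear Regression module is implemented; it only shows that training means finding m and c from known (x, y) pairs, where y is the label.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns slope m and intercept c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance
    c = mean_y - m * mean_x
    return m, c

# Hypothetical training data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]   # feature (independent variable)
ys = [3.0, 5.0, 7.0, 9.0]   # label (dependent variable)
m, c = fit_line(xs, ys)
prediction = m * 5.0 + c    # scoring a new row with x = 5
print(m, c, prediction)
```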
Testing the Model
In the Split Data step above, we used only 60% of our data for training the model and left the remaining 40% for testing. We can set up the testing now by using the Score Model module.
The Score Model module needs two inputs:
- What needs to be tested (the output of Train Model)
- What to test it with (the test data from the split)
This will look like:
Evaluating the Model
Now we want to see how our model scored when compared against the test dataset. We can use the Evaluate Model module and connect the Score Model module to it.
This finishes our pipeline creation. We now need to submit it.
Pipeline Submission
Submitting the pipeline requires an experiment name and a compute target.
Understanding the Predictions
Azure ML takes a bit of time especially if the experiment is run for the first time. You can view the progress by looking at the “Running” status (1) or by looking at what’s the status of each individual module ( 2, 3).
Once the pipeline finishes its run, right-click on Score Model and select Visualize > Scored dataset.
In the Scored Labels column, you can see the predicted prices.
Understanding the Model’s efficiency
We can use Evaluate Model to see the efficiency of the trained model.
Right-click Evaluate Model -> Visualize -> Evaluation Results
The following statistics are available.
- Mean Absolute Error
- Root Mean Squared Error
- Relative Absolute Error
- Relative Squared Error
- Coefficient of Determination
Azure machine learning pipelines are workflows of executable steps that enable users to complete Machine Learning workflows. Executable steps in azure pipelines include data import, transformation, feature engineering, model training, model optimisation, deployment etc.
Benefits of Pipeline:
- Multiple teams can own and iterate on individual steps which increases collaboration.
- By dividing execution into distinct steps, you can configure individual compute targets and thus provide parallel execution.
- Running in pipelines improves execution speed.
- Pipelines provide cost improvements.
- You can run and scale steps individually on different compute targets.
- The modularity of code allows great reusability.
Creating a pipeline in Azure
We can create a pipeline either by using Azure Machine Learning Designer or by writing Python code with the SDK.
Creating pipeline using Python
1. Loading the workspace configuration
from azureml.core import Workspace
# Loads the workspace details from a saved config.json file
ws = Workspace.from_config()
2. Creating a training cluster as the compute target that executes our pipeline step
from azureml.core.compute import ComputeTarget, AmlCompute
cpu_cluster_name = 'cpu-cluster'  # hypothetical name; the original does not define it
compute_config = AmlCompute.provisioning_configuration(
    vm_size='STANDARD_D2_V2', max_nodes=4)
cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)
3. Defining an estimator, which provides the required configuration for a target ML framework:
from azureml.train.estimator import Estimator
# source_directory is assumed to be the folder that contains train.py
estimator = Estimator(source_directory='.', entry_script='train.py',
    compute_target=cpu_cluster, conda_packages=['tensorflow'])
4. Configuring the estimator step
from azureml.pipeline.steps import EstimatorStep
step = EstimatorStep(name="CNN_Train",
    estimator=estimator, compute_target=cpu_cluster)
5. Defining a pipeline:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(ws, steps=[step])
A pipeline is defined simply as a series of steps and is linked to a workspace.
6. Validating the pipeline to check for errors such as unconnected inputs:
pipeline.validate()
7. Once all steps are validated, we can submit the pipeline as an experiment to the workspace.
from azureml.core import Experiment
exp = Experiment(ws, "simple-pipeline")
run = exp.submit(pipeline)
run.wait_for_completion(show_output=True)
Creating pipeline using ML Designer
We have covered pipeline creation using Azure Machine Learning Designer in another article in detail.
Databricks is a cloud-based data engineering tool used for data transformation and data exploration through machine learning models. Azure Databricks is Microsoft Azure Platform’s implementation of Databricks.
Evolution of Databricks :
A short timeline of the evolution of the technology gives us an overview of the underlying stack.
2003: Google released the Google File System paper.
2004: This was followed by the Google MapReduce paper. MapReduce takes a large analytics workload and distributes it across cheap compute instances.
2006: This led to the creation of Apache Hadoop. Hadoop had a problem: many of its operations incurred costly hard disk input/output.
2009: Matei Zaharia started the Spark project at UC Berkeley. It built on Apache Hadoop and did much of the computation in memory.
2013: Spark was donated to the Apache Software Foundation. Matei & his Berkeley colleagues founded Databricks. The platform provides a management layer over Apache Spark functions including the following :
- Spark, providing MapReduce-style processing in memory
- DataFrames, which express MapReduce-style operations for you. You can also write SQL statements, which are compiled into DataFrame calls that then run as distributed operations.
- Graph API
- Stream API
- Machine Learning
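The map and reduce steps that this stack distributes across machines can be illustrated with a single-process word count in plain Python. This is a conceptual sketch only: it does not use the Spark or Hadoop APIs, and the input lines are made up.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["spark runs in memory", "hadoop writes to disk", "spark builds on hadoop"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts["spark"], word_counts["hadoop"])
```

In a real cluster the map and reduce functions run on many workers at once, and the difference described in the timeline is whether the intermediate pairs go to disk or stay in memory.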
Databricks Implementations:
Since Databricks is a third-party tool, the major cloud platforms each offer their own implementation of it. Google Cloud, Amazon AWS and Microsoft Azure are the major Databricks providers.
Azure Databricks Environments:
Azure Databricks offers 3 environments for developing data-intensive applications.
Azure Databricks SQL
It provides an easy-to-use platform for analysts who want to
- Run SQL queries on their data lake.
- Create multiple visualization types to explore query results
- Build and share dashboards
Azure Databricks Data Science & Engineering
Provides an interactive workspace that
- Enables collaboration between data engineers, data scientists and machine learning engineers.
- Ingests data (raw or structured) through Azure Data Factory in batches, or streams it using Apache Kafka, Event Hub or IoT Hub. This data lands in either Blob storage or Data Lake storage.
- Reads data from multiple data sources and turns it into insights.
Azure Databricks Machine Learning
Provides managed services for
- Experiment tracking
- Model Training
- Feature Development and Management
- Feature and model serving
Databricks Clusters
Types of Clusters
There are two types of clusters:
All-purpose clusters
These are typically used to run notebooks. They remain active until you terminate them.
Job clusters
They run when you create a job. They are terminated after the job is completed.
Creating a Cluster
Creating a cluster is necessary for getting started with Databricks.
Databricks Notebooks
After creating the workspace in first step, we can now create a notebook to execute our functions. This can be done via UI or CLI.
Clicking on Create -> Notebook is all there is to creating a notebook. Choose language and Cluster and finally click on the create button.
Databricks Cluster Navigation
Clicking on Clusters leads us to this helpful interface which gives us the following options :
Configuration
Lists the basic configuration that was used to build this cluster.
Notebooks
Lists all the notebooks available in the cluster.
Libraries
The Libraries interface allows us to attach the appropriate SDK packages to the cluster for our use cases.
Event Log
Gives a complete log of events in the cluster.
Spark UI
This is the central view of all the Spark jobs that are actually running under the Databricks shell.
Driver Logs
Driver Logs gives a detailed log of Spark events.
Metrics
The Metrics interface opens a separate application called Ganglia UI.
Apps
Apps gives an option to configure RStudio.
PowerBI provides a handful of features for building robust data models. Here are a few concepts to begin modeling data in PowerBI :
Fact tables & Dimensions tables:
In its simplest form, a data model design will consist of the following:
Fact table:
Also known as the primary table. This table contains the numeric data which we want to aggregate and analyze. It’s the primary table in a schema and has foreign keys to link it with dimension tables.
Dimension tables:
Also known as a lookup table. This table contains descriptive data. This descriptive data is primarily text data used to slice and dice the data available in primary tables.
Measures in PowerBI :
A measure in PowerBI is an expression which outputs a scalar value. There are two classifications of measures in PowerBI:
Implicit Measures:
Any column value can be summarized by a report visualization. This is referred to as an implicit measure. In other words, it’s the default summarization available for which you do not have to write a DAX query.
Explicit Measures:
Explicit measures on the other hand are those which require DAX calculation to query the underlying data model.
Generally, dimension tables contain a relatively small number of rows. Fact tables on the other hand can contain very large numbers of rows and continue to grow over time.
Relationships:
Once you have imported some data, the next step is to build a relationship between the tables. Relationship can be defined by either
1. Going to the Model view from Left Side Ribbon
2. Or, from Modeling -> Manage relationships
Purpose of Relationship:
Relationships decide how the filters applied on one column of the table will propagate to the other model tables. If a table is disconnected (does not have any relation to other tables), any filter applied on other tables will not propagate to the disconnected table. There are certain attributes of relations that determine the propagation of filters.
Relationship Keys:
The column on the basis of which relationship is established determines the link for propagating filters. To build a relationship, you need to determine which column will be the primary key and which one will be foreign key. Once the columns are determined, you can drag and drop the columns from either of the tables. PowerBI will prompt a dialog pop-up to confirm the keys.
In this example, we are creating a relationship between Date & Week Start Date.
Relationship Cardinality:
Cardinality defines what type of relationship exists between the tables; it can be:
1 to 1:
The column on each side of the relationship has only one instance of each value.
1 to Many:
The table on the one side of the relationship is usually a dimension table. Its column has only one instance of each value and is usually the primary key. The table on the many side is usually a fact table, and its column can have many instances of each value.
Many to 1:
Many to one is the inverse of the above with the same logic; only the direction is reversed.
Many to Many:
A many-to-many relationship removes the need for unique values in either table. It also removes the need to create bridging tables for establishing relationships.
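The four cardinality types come down to whether the key column is unique on each side. The sketch below classifies a relationship from two key columns in plain Python; the function and sample keys are hypothetical, and it is an assumption that this mirrors how PowerBI suggests a cardinality.

```python
def detect_cardinality(left_keys, right_keys):
    """Classify a relationship from the key columns on each side."""
    left_unique = len(set(left_keys)) == len(left_keys)
    right_unique = len(set(right_keys)) == len(right_keys)
    if left_unique and right_unique:
        return "1 to 1"
    if left_unique:
        return "1 to Many"
    if right_unique:
        return "Many to 1"
    return "Many to Many"

# Hypothetical key columns: a date dimension and a sales fact table
date_keys = ["2023-01-02", "2023-01-09", "2023-01-16"]
sales_date_keys = ["2023-01-02", "2023-01-02", "2023-01-09", "2023-01-16"]
print(detect_cardinality(date_keys, sales_date_keys))
```

Here each date appears once in the dimension and can repeat in the fact table, which is the typical dimension-to-fact arrangement.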
In the PowerBI relationship dialog, cardinality appears like this:
Relationship Direction:
Under the heading cross-filter direction, PowerBI allows you to configure two types of directions.
Single Cross Filter Direction:
Single cross filter direction means that filters propagate in only one direction. E.g., if a single cross filter direction is chosen in a 1 to Many relationship, filters propagate only from the 1 side to the many side.
Double Cross Filter Direction:
A double cross filter, as the name suggests, propagates filters in both directions. E.g., if a double cross filter direction is selected in a 1 to Many relationship, filters propagate from the 1 side to the many side and also from the many side back to the 1 side.
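Filter propagation along a relationship can be mimicked with two small tables in plain Python. The tables, column names and helper function are hypothetical; the sketch only illustrates the idea of a filter on one table restricting the rows of a related table, not PowerBI's engine.

```python
# Hypothetical dimension (1 side) and fact (many side) tables
products = [
    {"product_id": 1, "category": "Bikes"},
    {"product_id": 2, "category": "Helmets"},
]
sales = [
    {"product_id": 1, "amount": 100},
    {"product_id": 1, "amount": 250},
    {"product_id": 2, "amount": 40},
]

def propagate(source_rows, target_rows, key, predicate):
    """Filter source_rows with predicate, then keep target rows whose key matches."""
    allowed = {row[key] for row in source_rows if predicate(row)}
    return [row for row in target_rows if row[key] in allowed]

# Single direction: a filter on products (1 side) reaches sales (many side)
bike_sales = propagate(products, sales, "product_id", lambda r: r["category"] == "Bikes")
total_bike_sales = sum(row["amount"] for row in bike_sales)

# Both directions: a filter on sales (many side) can also reach products
big_sale_products = propagate(sales, products, "product_id", lambda r: r["amount"] > 200)
print(total_bike_sales, big_sale_products)
```

With a single direction only the first call is available; choosing both directions is what makes the second call possible.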
Cross filter options vary by cardinality. The following combinations are possible in PowerBI
Cardinality type | Cross filter options
One-to-many (or Many-to-one) | Single, Both
One-to-one | Both
Many-to-many | Single (Table1 to Table2), Single (Table2 to Table1), Both
PowerBI can support any schema arrangement. Here I cover the two most used ones.
Commonly Used Schemas
Commonly used schemas in PowerBI are:
Star Schema:
A star schema has a single FACT table which connects to multiple DIMENSION tables.
Cardinality: The cardinality between DIMENSION and FACT table in a star schema is 1 to Many.
Snowflake Schema:
Snowflake schema is a variant of STAR schema in which you have dimension tables that are related to other dimension tables in a chain. When possible, you should flatten these dimension tables to create a single table.
Cardinality: The cardinality between DIMENSION and FACT table in a snowflake schema is 1 to Many.
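Flattening a snowflake chain into a single dimension table can be sketched as below. The product, subcategory and category tables are hypothetical examples, and the helper is plain Python rather than anything PowerBI provides.

```python
# Hypothetical snowflake chain: product -> subcategory -> category
categories = {10: "Bikes", 20: "Accessories"}
subcategories = {
    100: {"name": "Road Bikes", "category_id": 10},
    200: {"name": "Helmets", "category_id": 20},
}
products = [
    {"product_id": 1, "name": "Road-150", "subcategory_id": 100},
    {"product_id": 2, "name": "Sport-100", "subcategory_id": 200},
]

def flatten_product_dimension(products, subcategories, categories):
    """Resolve the chained lookups so each product row carries its full hierarchy."""
    flat = []
    for product in products:
        sub = subcategories[product["subcategory_id"]]
        flat.append({
            "product_id": product["product_id"],
            "name": product["name"],
            "subcategory": sub["name"],
            "category": categories[sub["category_id"]],
        })
    return flat

product_dim = flatten_product_dimension(products, subcategories, categories)
print(product_dim[0])
```

The flattened table repeats category names across rows, which is the redundancy a star schema accepts in exchange for a single join to the fact table.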
Comparison between Star & Snowflake Schema
Star Schema | Snowflake Schema
Simplest data model | Relatively complex model
Hierarchies for the dimensions are stored in the dimension table. | Hierarchies are divided into separate tables.
It contains a fact table surrounded by dimension tables. | It also contains one fact table surrounded by dimension tables. However, these dimension tables are in turn surrounded by other dimension tables.
Only a single join creates the relationship between the fact table and any dimension table. | A snowflake schema requires multiple joins to fetch the data.
Denormalized data structure | Normalized data structure
Data redundancy is high | Data redundancy is low