Introduction

Aims

This course aims to give you the tools to analyse data in an AI context.
It covers the statistical methods used to analyse data and find relationships within it.
Once the strength of a relationship has been measured, the statistical model can be used to make predictions and to give a measure of confidence in those predictions.

Scenario

You want to buy a secondhand car for your partner.
You have shown them a range of advertised cars and asked whether they like each model. You now have a list of cars, each marked as liked or not liked, and you want to work out whether a newly advertised car is likely to be liked before you buy it.

Types of data

Data types can be divided into three main categories:

Numerical data are numbers, and can be split into two numerical categories:

Measurements on data

Mean, Median, Mode, Standard Deviation and Variance

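There are five measures of interest in Machine Learning:
Mean - the average value
Median - the middle value when the values are sorted
Mode - the most common value
Standard Deviation - how spread out the values are around the mean
Variance - the square of the standard deviation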

Standard deviation is often represented by the symbol sigma: σ
Variance is often represented by the symbol sigma squared: σ², and is the square of the standard deviation.
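As a minimal sketch, the five measures can be computed with Python's standard statistics module. The input is only the Age column of the five sample rows shown in the car table, so the results will not match the figures quoted for the full data set. Using the population forms (pstdev and pvariance) is an assumption about which convention the course intends.

```python
import statistics

# Age values from the five sample rows of the car data (not the full data set).
ages = [5, 7, 8, 7, 2]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

# Population standard deviation and variance (divide by n, not n - 1).
std_age = statistics.pstdev(ages)
var_age = statistics.pvariance(ages)

print(f"mean={mean_age}, median={median_age}, mode={mode_age}")
print(f"standard deviation={std_age:.3f}, variance={var_age:.3f}")
```

Swapping in statistics.stdev and statistics.variance gives the sample (n - 1) versions instead.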

Another measure used in ML is the percentile.
A percentile is the value below which a given percentage of the readings fall.
E.g. 10 readings - [10,20,30,40,50,60,70,80,90,100]
75th percentile = 70
i.e. taking the largest reading at or below the 75% position, every reading up to and including 70 lies in the 75% region - [10,20,30,40,50,60,70]. Conventions that interpolate between readings give a slightly different value (77.5 with NumPy's default).
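The percentile of a short list depends on the convention used. The sketch below assumes NumPy is available and compares its default, which interpolates between readings, with a hand-written "lower" convention that picks the reading at or just below the 75% position. The second convention reproduces the value of 70 quoted above.

```python
import math

import numpy as np

readings = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# NumPy's default convention interpolates linearly between neighbouring readings.
p75_interpolated = float(np.percentile(readings, 75))

# "Lower" convention, written out by hand: take the reading at or just below
# the 75% position in the sorted list.
ordered = sorted(readings)
position = 0.75 * (len(ordered) - 1)
p75_lower = ordered[math.floor(position)]

print(p75_interpolated)  # 77.5
print(p75_lower)         # 70
```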

Age  Speed  Engine  Origin   Like
5    99     1200    FRANCE   NO
7    86     1300    GERMANY  YES
8    87     1250    JAPAN    YES
7    88     3000    ITALY    YES
2    111    2000    UK       YES
...

We will start by looking at the relationship between Age and Speed to illustrate some basic concepts.

For the car data the following metrics are computed:
mean x=7.03
mean y=91.96
median x=6.5
median y=92.5
mode x=5 (count=3)
mode y=86 (count=3)
standard deviation x=4.32
standard deviation y=15.27
variance x=18.72
variance y=233.26
75% percentile x=9.0
75% percentile y=102.0
75% quantile x=9.0
75% quantile y=102.0
slope=-3.01
intercept=113.20
r=-0.855
p=2.594e-08
std_err=0.37

Visualising the data

A scatter chart

A histogram chart

A bar chart

Linear Regression

Linear regression uses the relationship between the data points to fit the straight line that passes as close as possible to all of them.
This line can be used to predict future values.
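A minimal sketch of a least-squares line fit, written out by hand. It uses only the Age and Speed values of the five sample rows in the car table, so the slope and intercept differ from the full-data values quoted earlier. The helper names fit_line and predict_speed are illustrative, not part of any library.

```python
# Age (x) and Speed (y) from the five sample rows of the car table only.
ages = [5, 7, 8, 7, 2]
speeds = [99, 86, 87, 88, 111]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, speeds)


def predict_speed(age):
    """Predict Speed for a given Age using the fitted line."""
    return slope * age + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted speed for a 10 year old car: {predict_speed(10):.1f}")
```

The negative slope reflects the same trend as the full data set: older cars tend to have lower speeds.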

R and R² measures

The R value is defined as:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

In English, this equation says: for each point, multiply its x deviation from the mean of x by its y deviation from the mean of y, and add up these products. Then divide by the square root of the product of the summed squared x deviations and the summed squared y deviations, which normalises the result to lie between -1.0 and 1.0.

The result gives a measure of how closely the data resembles a linear relationship between the x and y values.

R² is the square of R, so it is never negative: R²=0 means there is no linear relationship, and R²=1 means there is a perfect linear relationship between the x and y values.
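The calculation can be sketched directly from the definition of R. The example uses the Age and Speed values of the five sample rows only, so the result is not the r=-0.855 reported for the full data set. The function name pearson_r is illustrative.

```python
import math

# Age (x) and Speed (y) from the five sample rows of the car table only.
ages = [5, 7, 8, 7, 2]
speeds = [99, 86, 87, 88, 111]


def pearson_r(xs, ys):
    """Correlation coefficient: summed co-deviations over the root of the
    product of the summed squared deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(ages, speeds)
r_squared = r ** 2

print(f"r={r:.3f}, r squared={r_squared:.3f}")
```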

Logistic Regression

If the data points clearly will not fit a linear regression then a polynomial regression could be better.

Polynomial Regression

Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points; here the line is a curve described by a polynomial in x.
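A minimal sketch of a polynomial fit, assuming NumPy is available. The points are hypothetical and lie exactly on a known quadratic, chosen only so that the fitted coefficients are easy to check. The degree of 2 is likewise an illustrative choice.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x^2 - 3x + 5 (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x**2 - 3 * x + 5

# Fit a degree-2 polynomial; coefficients come back highest power first.
coefficients = np.polyfit(x, y, 2)

# Wrap the coefficients in a callable so new x values can be predicted.
model = np.poly1d(coefficients)

print(coefficients)
print(model(6.0))
```

With real, noisy data the fitted curve will not pass through every point, and a higher degree is not automatically a better model.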

Multiple Regression

Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.
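As a sketch under stated assumptions, the example below predicts Speed from Age and Engine using a least-squares fit with NumPy. It uses only the five sample rows from the car table, which is far too few for a meaningful model, so it illustrates the mechanics only. The choice of Age and Engine as features is an assumption made for illustration.

```python
import numpy as np

# Age and Engine (features) and Speed (target) from the five sample rows only.
features = np.array(
    [[5, 1200], [7, 1300], [8, 1250], [7, 3000], [2, 2000]], dtype=float
)
speed = np.array([99, 86, 87, 88, 111], dtype=float)

# Add a column of ones so that the fit includes an intercept term.
design = np.column_stack([np.ones(len(features)), features])

# Least-squares solution of: speed ~ intercept + b_age * Age + b_engine * Engine
coefficients, *_ = np.linalg.lstsq(design, speed, rcond=None)
intercept, b_age, b_engine = coefficients


def predict_speed(age, engine):
    """Predict Speed from Age and Engine using the fitted coefficients."""
    return intercept + b_age * age + b_engine * engine


fitted = design @ coefficients
print(f"intercept={intercept:.2f}, b_age={b_age:.3f}, b_engine={b_engine:.5f}")
```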

Standardising Data : Scaling and Normalisation

When your data has different values, and even different measurement units, it can be difficult to compare them.
What is speed compared to country? Or age compared to capacity?
The answer to this problem is scaling.
Scaling involves dividing all the values by the maximum value of the data set.
In this form the data is said to be normalised.

The reverse of this process is called unnormalising.
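The divide-by-maximum scaling described above, and its reverse, can be sketched as follows. The Engine values come from the five sample rows of the car table, and the function names scale_by_max and unscale are illustrative.

```python
# Engine sizes from the five sample rows of the car table.
engines = [1200, 1300, 1250, 3000, 2000]


def scale_by_max(values):
    """Divide every value by the largest one; also return that maximum
    so the scaling can be reversed later."""
    largest = max(values)
    return [v / largest for v in values], largest


def unscale(scaled_values, largest):
    """Reverse the scaling by multiplying back by the stored maximum."""
    return [v * largest for v in scaled_values]


scaled, largest = scale_by_max(engines)
restored = unscale(scaled, largest)

print(scaled)
print(restored)
```

The maximum must be kept alongside the scaled data, because the original values cannot be recovered without it.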

Evaluating the Model

In Machine Learning we train the model on the majority of the data and then compare its predictions with the known results for the remaining subset of the data.
Typically we take 80% of the data to train the model, and 20% of the data to test the model.

Train/Test

If the two sets of data give very similar results for the relationship then we can use the model to predict values from new data.
That is, if R² for the test set is approximately equal to R² for the training set, we can proceed with the model and use it to predict future values.
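A minimal sketch of an 80/20 split followed by the R² comparison. The 26 data points are hypothetical, generated around a straight line with a fixed random seed; they are not the course's car data. The helpers fit_line and r_squared are written out by hand for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the split is repeatable

# Hypothetical data: 26 (age, speed) pairs scattered around a straight line.
ages = [rng.uniform(1, 15) for _ in range(26)]
speeds = [113.0 - 3.0 * age + rng.gauss(0, 3) for age in ages]

# Shuffle the row indices, then take the first 80% for training.
indices = list(range(len(ages)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

train_x = [ages[i] for i in train_idx]
train_y = [speeds[i] for i in train_idx]
test_x = [ages[i] for i in test_idx]
test_y = [speeds[i] for i in test_idx]


def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Fit on the training rows only, then score both sets with the same line.
slope, intercept = fit_line(train_x, train_y)
r2_train = r_squared(train_x, train_y, slope, intercept)
r2_test = r_squared(test_x, test_y, slope, intercept)

print(f"R2 train={r2_train:.3f}, R2 test={r2_test:.3f}")
```

The model is fitted on the training rows only; the test rows are used purely for scoring.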

Standardise data

# Convert the text categories to numbers
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
# Convert the YES/NO answers to 1/0
d = {'YES': 1, 'NO': 0}
df['Like'] = df['Like'].map(d)

Then we have to separate the feature columns from the target column.
The feature columns are the columns that we try to predict from, and the target column is the column with the values we try to predict.
In the example "Like" is the target column; the remainder are feature columns.

Decision Tree

A decision tree is a sequential filter which looks at each feature in turn, assesses the number of samples which pass a certain test, and splits the results into "passes" and "fails".
Then each "pass" and "fail" group is assessed against the next criterion until no further samples can be split,
i.e. there are either no "passes" or 100% "passes", at which point the process stops.

In our example we look at a 14 year old car with a top speed of 82, and an engine 1100cc from the UK.

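At each step the algorithm selects the feature and threshold that best separate the samples (the split with the lowest GINI), choosing at random between equally good candidates, and then repeats the evaluation on each branch.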

The Decision Tree starts with the Origin.
It looks at all the cars with an Origin <=1.5
Including the target sample, it finds 9 cars out of the 26 from the USA or UK.
Applying the GINI formula, this gives a value of 0.453.
A value of 0.0 would mean all of the samples got the same result, and 0.5 would mean that the split is done exactly in the middle.
In this case the value is close to 0.5, so the samples are fairly evenly mixed, and this is what we see: a roughly 1:2 split, with 9 passes and 17 fails.
So about 1/3 pass and 2/3 fail the test.
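The GINI value quoted above can be reproduced with the standard Gini impurity formula, 1 minus the sum of the squared proportions of each group. The function name gini is illustrative.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) for a list of per-group sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


print(round(gini([9, 17]), 3))  # 0.453, the value quoted above
print(gini([13, 13]))           # 0.5, an even split
print(gini([26, 0]))            # 0.0, all samples in one group
```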

Moving down the "pass" arm, the 9 samples from the first filter are then assessed for engine size <=1400.
This yields 3 cars passing and 1 fail. Applying the GINI formula to the remaining sets yields only 1 pass and 3 fail with a GINI of zero, so no further splits are possible - [1,0] and [3,0].

Moving down the "fail" arm of the diagram, we have 22 samples (26 - 4 from the "pass" arm).
These are assessed on the basis of Origin <=4.5, which gives 6 samples with Origin <=4.5 and 16 with Origin >4.5.
These samples are then assessed against Age for pass/fail, and the process is repeated for each feature until there can be no further splits (GINI=0).
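A minimal sketch of fitting and querying a decision tree, assuming scikit-learn is installed. It uses only the five sample rows from the car table, so the resulting tree will not match the one walked through above. The integer codes for Origin are hypothetical, because the course's full mapping is not shown.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical integer codes for Origin (the full mapping is an assumption).
origin_codes = {"FRANCE": 0, "GERMANY": 1, "JAPAN": 2, "ITALY": 3, "UK": 4}
like_codes = {"NO": 0, "YES": 1}

# The five sample rows: Age, Speed, Engine, Origin, Like.
rows = [
    (5, 99, 1200, "FRANCE", "NO"),
    (7, 86, 1300, "GERMANY", "YES"),
    (8, 87, 1250, "JAPAN", "YES"),
    (7, 88, 3000, "ITALY", "YES"),
    (2, 111, 2000, "UK", "YES"),
]

# Separate the feature columns (X) from the target column (y).
X = [[age, speed, engine, origin_codes[origin]]
     for age, speed, engine, origin, _ in rows]
y = [like_codes[like] for *_, like in rows]

# Fixing random_state makes the tree reproducible from run to run.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# The 14 year old, 82 top speed, 1100cc UK car from the example.
new_car = [[14, 82, 1100, origin_codes["UK"]]]
prediction = tree.predict(new_car)[0]
print("YES" if prediction == 1 else "NO")
```

Leaving out random_state lets scikit-learn break ties between equally good splits differently on each run.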

Different Results

You will see that the Decision Tree gives you different results if you run it enough times, even if you feed it with the same data.
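That is because the Decision Tree does not give us a 100% certain answer.
It is built using an element of randomness, for example when choosing between equally good splits, so the tree and its answers can vary from run to run unless the random seed is fixed.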

The Confusion matrix

The Confusion matrix is a table that is used in classification problems to assess where errors in the model were made.


Predictions for a 14 year old car with a top speed of 82, and an engine 1100cc from the UK

The diagram shows a plot of the actual true and false values along the x-axis against the predicted true and false values on the y-axis.

In our example it shows the confidence in the car being 'liked' or 'not liked', and the colours show the confidence of the prediction: yellow indicates a very strong correspondence and purple a weak correspondence.

Predictions for a 5 year old car with a top speed of 100, and an engine 4500cc from the USA

Confusion matrix metrics

Once we have a Confusion Matrix, we can calculate different measures to quantify the quality of the model:
Accuracy - (True Positive + True Negative) / Total Predictions
Precision - Of the positives predicted, what percentage is truly positive? True Positive / (True Positive + False Positive)
Sensitivity - (sometimes called Recall) measures how good the model is at predicting positives: True Positive / (True Positive + False Negative)
Specificity - How good is the model at predicting negative results? Similar to sensitivity, but looks at it from the perspective of negative results: True Negative / (True Negative + False Positive)
F1-score - the "harmonic mean" of precision and sensitivity: 2 * (Precision * Sensitivity) / (Precision + Sensitivity).
It considers both false positive and false negative cases and is good for imbalanced datasets.

The module sklearn contains the functions that can calculate the various metrics based on the actual/predicted data.
from sklearn import metrics
from sklearn.metrics import r2_score

The metrics are then available as follows:
Accuracy = metrics.accuracy_score(actual, predicted)
Precision = metrics.precision_score(actual, predicted)
Sensitivity_recall = metrics.recall_score(actual, predicted)
Specificity = metrics.recall_score(actual, predicted, pos_label=0)
F1_score = metrics.f1_score(actual, predicted)

For our example we get:
Accuracy= 0.653
Precision=0.653
Sensitivity_recall= 1.0
Specificity=0.0
F1_score=0.7906
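The same figures can be reproduced by hand from the four cells of a confusion matrix. The counts below are an assumption: they are inferred from the reported metrics and the 26 cars in the data set, not read from the matrix itself, which is not reproduced here.

```python
# Counts inferred from the metrics reported above (an assumption):
# 26 cars, 17 actually liked, and every car predicted as liked.
tp, fp, fn, tn = 17, 9, 0, 0

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)  # also called recall
specificity = tn / (tn + fp)
f1_score = 2 * precision * sensitivity / (precision + sensitivity)

print(f"Accuracy={accuracy:.4f}")
print(f"Precision={precision:.4f}")
print(f"Sensitivity={sensitivity:.4f}")
print(f"Specificity={specificity:.4f}")
print(f"F1_score={f1_score:.4f}")
```

With these counts, accuracy and precision are both 17/26, which is why the two reported values are identical.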

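So, in our example, we have ~65% accuracy with a sensitivity of 1.0 and a specificity of 0.0. A sensitivity of 1.0 means the model found every positive, but a specificity of 0.0 means it never correctly predicted a negative, so on this data the model predicted 'liked' for every car and the accuracy simply equals the proportion of liked cars.
The F1 score of ~79% should be read with the same caution: it is calculated from precision and sensitivity only, so it does not show that the model can distinguish between positive and negative results.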

Once we have these metrics for the model, we can assess what the model needs to make it more accurate. In our case, we have relatively few results.
With more data points, the metrics would be expected to improve and become more reliable.
Ideally, more than 100 data points would be a good starting point for a reliable model, and increasing the data to over 200 points would give much more training and test data from which the model can draw predictions.

Conclusion

The techniques outlined above show how to analyse single-variable and multivariate data, and how to visualise the data and describe the confidence in the predictions.