In the last post we looked at the basics of machine learning and, using open source geochemistry data, created a lithology classifier with the data mining program Orange. In this blog we'll delve deeper into the importance of data preparation and what to take into account before creating a model.
UNDERSTANDING THE DATA
In Machine Learning we're obviously working with data, and understanding the data you're working with is extremely important. Knowing where it came from, how it was collected, what it represents and the quality/errors of the data are all important things to take into account prior to creating ML models.
For the example used in this blog we'll stick to the geochemistry theme with data provided by GEOROC. In particular we'll look at the geochemistry of basalts sampled from 5 different tectonic settings: Continental Flood Basalts, Convergent Margin Basalts, Ocean Island Basalts, Mid-Ocean Ridge Basalts and basalts sampled from Seamounts. Using Orange we will create an ML model that classifies basalts into these tectonic settings based upon their geochemistry. But first, let's get to know the data a little bit.
The spreadsheet above is the raw data provided by GEOROC. It gives us all the journals/papers that published the geochemistry data for the basalts, the sample locations, the ages and so on. From looking at the individual research I could work out which labs assayed the geochemical data, when, and what techniques they used. From the methodology of each paper I could also see if Quality Assurance and Quality Control (QAQC) assessments had been completed to verify the data is true and accurate. For the sake of this example I'm just going to assume the data is all sweet, which I definitely should not be doing.
The phrase 'Garbage in = Garbage out' appears to be a common theme in most papers and articles I've read regarding Machine Learning. It effectively means if you throw a bunch of unclean data into a machine learning algorithm you're going to get a shitty model.
INITIAL CLEAN
When you first open up your data set it's good to do an initial clean in preparation for ML. This might include getting rid of features (columns) that will be useless in creating models, removing duplicate instances (rows) and making sure all the data is the correct type (text is recognised as text/string and numeric data is recognised as integer/float). I'll use Excel and Orange to do my initial data clean in this example, but Python can also be used.
Firstly I'm going to think about features I don't need. I'm creating a model to classify the tectonic settings of mafic rocks from geochemistry, so I don't think I'm going to get much information from the sample name, sample number, location etc, so they can go. Actually the only features I really need are the geochemistry and the tectonic setting of each basalt (I've left a couple of extra columns in the image below for interest's sake but will remove them in Orange). I've also gone through and looked for any duplicates using Excel, which would have affected my model.
MISSING VALUES
As you can see in the spreadsheet above I've got an extremely rare, perfectly populated dataset with no missing values. This is because I have already cleaned up all the features/instances that had missing data. You can almost expect every raw dataset to contain missing values, which can easily get mistaken for zero values. To put it into context: if I asked someone how many shots of tequila they put into my beer, them replying with a 'no' (value of zero) and them replying with a 'not sure' (missing value) are two very different things. It's important to know the difference.
There are three things you can do with missing values in a dataset (there's a quick Python sketch of all three after the list):
Drop - Dropping effectively means removing features or instances that are missing data. If you have a large dataset with plenty of features this is probably a good way to go about it. Just remove.
Impute - This is a way to calculate what the missing values would have been given the rest of the information in the dataset. One way to do it is statistically, using the mean or mode of the desired feature. In my head this can get a bit sketchy with a potential loss of information.
Flag - The other option is to flag the missing values. 'If the model is shit then the missing data in this feature is probably the reason'.
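If you'd rather do this step in Python than in Orange's widgets, a minimal pandas sketch of the three options could look something like the below (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical geochem table with a few holes in it
df = pd.DataFrame({
    "SiO2": [49.2, 50.1, np.nan, 48.7],
    "TiO2": [1.2, np.nan, 1.5, 1.3],
    "Setting": ["MORB", "OIB", "MORB", "CFB"],
})

# Drop: remove any instance (row) that is missing a value
dropped = df.dropna()

# Impute: fill the holes with the mean of each numeric feature
imputed = df.fillna(df.mean(numeric_only=True))

# Flag: keep everything, but mark which rows had a hole
flagged = df.copy()
flagged["SiO2_missing"] = df["SiO2"].isna()

print(dropped, imputed, flagged, sep="\n\n")
```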
OUTLIERS
Another important statistical component to think about before making your model is outliers. Orange has a widget that can detect outliers using two possible methods: one that handles normally distributed data and one that handles non-normally distributed data. Outliers can significantly skew your model, but sometimes having them in there can be a good thing. For example, if you got a few hits of gold on your exploration program that lie well and truly outside the normally distributed data, that's something to get excited about. Other times outliers are not so good.
This is where knowing your dataset is important, and where you make a decision on whether you want outliers to be included in your model.
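I won't claim this is exactly what Orange does under the hood, but the same two-flavoured idea can be sketched in scikit-learn with a covariance-based detector for roughly normal data and a one-class SVM for everything else (the 5% contamination fraction below is just a placeholder):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # stand-in for two geochemical features
X[:5] += 6                          # a handful of obvious outliers

# Assumes roughly normally distributed data
env = EllipticEnvelope(contamination=0.05).fit(X)
inliers_env = env.predict(X) == 1   # +1 = inlier, -1 = outlier

# Makes no normality assumption
svm = OneClassSVM(nu=0.05, gamma="scale").fit(X)
inliers_svm = svm.predict(X) == 1

X_clean = X[inliers_env]            # keep only the inliers
print(f"kept {X_clean.shape[0]} of {X.shape[0]} samples")
```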
FEATURE SCALING & STANDARDISATION
We can also scale and/or standardise our data prior to creating a machine learning model. Scaling effectively takes the upper and lower bounds of a feature (let's say our SiO2 content varies from 31.5%-66.6%) and then scales it to a defined range, commonly 0-1 (31.5% = 0 and 66.6% = 1). Uniformly scaled features can reduce the standard deviations in our data and in turn reduce the number of outliers.
Standardising your data means rescaling it so it has a mean of 0 and a standard deviation of 1. This is very useful for comparing data that has different units and removing unnecessary feature weights. It's a good thing to do before creating a model. Orange has a few different ways to scale and normalise features. This is a good link for more information about feature scaling.
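A rough Python equivalent of both ideas, using the SiO2 bounds from the paragraph above as made-up example values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

sio2 = np.array([[31.5], [45.0], [52.3], [66.6]])   # wt% SiO2, example values

# Scaling: squash the feature into the 0-1 range (31.5 -> 0, 66.6 -> 1)
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(sio2)

# Standardising: rescale to a mean of 0 and a standard deviation of 1
standardised = StandardScaler().fit_transform(sio2)

print(scaled.ravel())
print(standardised.ravel())
```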
DIMENSIONALITY REDUCTION
Dimensionality reduction is the act of reducing the number of features that go into building your machine learning model. The idea of using fewer features is to increase model speed, reduce storage space, and often just make a more accurate model in general.
Firstly it's good to wrap your head around what dimensions mean. The scatter plot below shows the correlation between Rare Earth Elements Lanthanum (La) and Cerium (Ce). Both of these features are considered a 'dimension' of the dataset, and in this case we are viewing a two dimensional scatter plot (add another element to get a 3D visualisation etc).
There are two main ways of reducing the features in your data set: only selecting the features you want in your initial clean (feature selection, like how I deleted the columns I didn't want earlier), or combining input features to create a smaller dataset with 'basically' the same information as the original dataset. A commonly used dimensionality reduction technique that does this is called Principal Component Analysis (PCA).
PCA creates a new set of features (called Principal Components) from your original dataset by combining linearly correlated features. Continuing with the La-Ce example, we can see that the elements have a linear relationship and share very similar properties in basalts (something to do with them being extremely similar elements perhaps). Because these features are so similar, they will provide the model with basically the same information. We can either remove one feature, or create a principal component (PC) from them.
The way it works is that PCA creates a new axis along the linear trend of the data and (in this example) converts the data from 2 dimensions into 1. The new axis of the PC is effectively a line of best fit for linear data and, as seen below, becomes a 1 dimensional feature. You might notice some of the data points don't sit exactly on the axis line, but for the sake of reducing features the information lost by projecting those points onto the axis is negligible, and the data is then just plotted along that PC.
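A hedged scikit-learn sketch of that 2D-to-1D projection, on made-up La and Ce concentrations:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up La and Ce concentrations (ppm) with a near 1:1 linear relationship
la = rng.uniform(5, 40, size=100)
ce = la * 1.05 + rng.normal(scale=1.0, size=100)
X = np.column_stack([la, ce])        # 2 dimensions

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)          # samples projected onto the new PC axis

print(X.shape, "->", X_1d.shape)                     # (100, 2) -> (100, 1)
print("variance captured:", pca.explained_variance_ratio_[0])
```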
This example seems pretty straightforward for reducing 2 dimensions to 1, but it also works for any number of features. Geologically, if we assume these basalts are relatively unaltered and have a stock-standard geochemical make-up, then we can also assume a lot of the REE concentrations are going to be similar for each sample. If all the features share a similar linear relationship they can just be projected as a single linear principal component with minimal loss of information.
The features above possess an almost perfect 1:1 positive linear relationship. However there are likely different relationships in your data that will require other principal components. Orange has a PCA widget with many different parameters, including selecting the number of principal components for your data, which is a bit of a trial and error situation. The general rule of thumb I've seen in PCA write-ups is that the fewer principal components used, the better. A good explanation can be found here.
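For the trial-and-error part, the cumulative explained variance is the usual thing to look at. A sketch on stand-in data (the 6 components below are just the number I ended up with later on, not a recommendation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Stand-in for a table of element concentrations (rows = samples, columns = elements)
X = rng.normal(size=(500, 20))
X[:, 1] = X[:, 0] * 1.02 + rng.normal(scale=0.05, size=500)   # a La/Ce-style pair

X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=6)            # 6 components, chosen by trial and error
X_pc = pca.fit_transform(X_std)

# How much of the original variance each principal component carries
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))   # cumulative variance captured
print(X_pc.shape)                                  # (500, 6) -- the reduced dataset
```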
MODEL MAKING
Now we've gone over some of the data cleaning, we can compare ugly data with our new pretty data and see how the machine learning classification models compare. Firstly I'll make a simple Orange workflow, plugging the raw data from GEOROC straight into a few different classification models and seeing what we get. In the 'Select Columns' widget, as seen below, I have chosen Tectonic Setting as our Target Variable and left all the remaining features available. We'll use and compare the ML algorithms Random Forest, Logistic Regression, Support Vector Machine, Naive Bayes, K-Nearest Neighbour and AdaBoost.
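The scripting equivalent of Orange's Test and Score widget is roughly a cross-validated comparison like the sketch below. Orange has its own Python API too, but I'll stick with scikit-learn here, and the data is a synthetic stand-in for the cleaned GEOROC table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in data: in practice X = geochemistry features, y = tectonic setting labels
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}

# 10-fold cross-validated classification accuracy (CA) for each algorithm
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name:20s} CA = {scores.mean():.3f}")
```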
As seen below in the Test and Score widget, our Area Under Curve (AUC - the tradeoff between sensitivity and specificity, more here) and Classification Accuracy (CA) are really, really good, so the models must be all sweet, on we ride. But if we have a look in the Rank widget and see which features are providing our model with information, we can see that the paper Citations, Rock Name and Material features are our three biggest players. This is because the Citations are directly related to the tectonic setting, and the Rock Name and Material also contain the citation number. These columns are useless for our model, so I'll do a first pass clean and get rid of them, remove any duplicates and correctly identify my data types, just like I did earlier in this blog.
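The Rank widget's job can be approximated in Python with something like mutual information scores (a sketch, not Orange's exact scoring). Adding a deliberately leaky column as a stand-in for Citations shows how this kind of leakage jumps straight to the top of the ranking:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Stand-in data; in practice X is the geochem table and y the tectonic setting
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=5, random_state=0)

# Simulate a leaky column like 'Citation' that encodes the label directly
leaky = y.reshape(-1, 1).astype(float)
X_leaky = np.hstack([X, leaky])
names = [f"element_{i}" for i in range(X.shape[1])] + ["citation_id"]

scores = mutual_info_classif(X_leaky, y, random_state=0)
for name, score in sorted(zip(names, scores), key=lambda t: -t[1]):
    print(f"{name:12s} {score:.3f}")
# 'citation_id' ranks far above the real features -- a red flag for leakage
```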
Now that I've sorted out my features, the Test and Score widget comes back with a lower and more reasonable accuracy (especially for Naive Bayes) and a better looking list of ranked features. However, I saw a lot of missing values in my geochem datasheet, and looking at the Rank widget now I can see that 'missing values were imputed as needed'. This means that all the holes in my data have been filled with values that might not necessarily be scientifically accurate. To remove this problem I'll just (aggressively) drop a bunch of features and instances that have missing values.
After dropping the features/instances with missing values (image below) we can see the classification accuracy is pretty similar to before, but at least now we know all the data is legit.
You can see where we're going with this: the cleaner your data, the better the model. In the next step I'll combine standardised normalisation, outlier removal and PCA dimensionality reduction. I've used the 'Preprocess' widget to standardise our data around a mean of 0 with a standard deviation of 1, and the 'Outliers' widget to remove 15% of outliers, as seen below.
A good widget for looking at distributions and visualising outliers is the 'Distributions' widget, and a good widget for looking at feature relationships that could be reduced by PCA is 'Correlations'.
I've brought in the PCA widget and, using trial and error, found that having 6 Principal Components gives the best information response from my data.
Looking at the Test and Score widget now we can see that the SVM algorithm is pumping out a booming 96.2% classification accuracy and is by far our best model.
So looking at our workflow, we are effectively running an SVM model that is 96.2% accurate at classifying basalts into their tectonic setting. The model is being trained on 6 features/principal components, which means it uses a lot less computing power to run and won't take up as much space as a model using all the features.
Below is a confusion matrix of our SVM classification, and we can see that we get really good predictions using these components. Now, given the geochemistry of a random basalt, we are able to use this machine learning model to classify its tectonic setting.
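For anyone wanting to reproduce the end-to-end workflow in Python rather than Orange, a hedged sketch might look like the below: standardise, strip outliers, reduce to 6 principal components, then cross-validate an SVM and print its confusion matrix. The data is synthetic and the parameters are placeholders, so don't expect the 96.2% figure to fall out of it:

```python
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; in practice X = geochemistry features, y = tectonic settings
X, y = make_classification(n_samples=800, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

# Outlier removal has to drop rows, so it happens before the pipeline
inliers = EllipticEnvelope(contamination=0.15).fit_predict(X) == 1
X, y = X[inliers], y[inliers]

# Standardise -> PCA (6 components) -> SVM, all in one pipeline
model = make_pipeline(StandardScaler(), PCA(n_components=6), SVC())

print("CA:", cross_val_score(model, X, y, cv=10).mean())
y_pred = cross_val_predict(model, X, y, cv=10)
print(confusion_matrix(y, y_pred))   # rows = actual setting, columns = predicted
```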
From an economic/exploration point of view, imagine all these basalt samples are now quartz vein samples in an orogenic gold terrain. Some contain different geochemical and geochronological properties with an intricate relationship to gold mineralisation, and instead of trying to classify tectonic setting you're trying to classify whether a vein is mineralised or not. Using ML you could predict and classify these vein samples based upon characteristics that human eyes might not pick up on. Food for thought.
Using Orange we've been able to see how cleaning data can drastically improve your models. As mentioned at the beginning of this post, it is also important to know about the origins of your data and how it was collected. Understanding your data will make cleaning it that much easier.