I'm convinced that the rise of machine learning is among the most important trends of our time.
Machine learning underlies many of the services you use today, including speech recognition, recommendations from Amazon, and even the decision about whether an online store lets you use your credit card for your latest purchase. As a contract software engineer, my goal in understanding machine learning is to think and communicate intelligently about whether it can help the organisations that employ my services.
Understanding Machine Learning - Part 1 is the result of my research into machine learning. I decided to create a tutorial-style blog series to assist others who may be interested in understanding machine learning too.
What is Machine Learning?
The core thing machine learning does is find patterns in data. It then uses those patterns to predict the future.
For example, you could use machine learning to detect credit card fraud. Suppose you have data about previous credit card transactions. You could find patterns in that data that let you detect when a new credit card transaction is likely to be fraudulent. Or maybe you want to decide when it's time to do preventive maintenance on a factory robot. Once again, you could look at existing data and find patterns that predict when a robot is going to fail. There are lots more examples, but the core idea is the same: machine learning lets you find patterns in data, then use those patterns to predict the future.
What does it mean to learn? For example, how did you learn to read? Learning requires identifying patterns. In reading, for instance, you identified letters, and then the patterns of letters that together form words. You then had to recognize those patterns when you saw them again. That's what learning means, and that's what machine learning does with the data we provide.
Suppose I have data about credit card transactions and I have only four records. Each one has three fields: the customer's name, the amount of the transaction, and whether or not it was fraudulent.
This data suggests a pattern: if a customer's name starts with P, the transaction is fraudulent and the customer is a criminal. Well, probably not! The problem with having so little data is that it's easy to find patterns, but hard to find patterns that are correct. Correct here means predictive: the patterns help us understand whether a new transaction is likely to be fraudulent.
So suppose I have more data.
Now I have more records and more fields in each one: I know where the card was issued, where it was used, and the age of the user.
Now what's the pattern for fraudulent transactions? It turns out that if you look closely, there really is a pattern in this data: a transaction is fraudulent if the cardholder is in their 20s, the card was issued in the USA and used in Russia, and the amount is more than $1000. You could have found that pattern, I bet, if you looked at this data for a little while. But once again, do we know that the pattern is truly predictive? Probably not. We don't have enough data. To do this well, you need so much data that people just can't find the patterns by eye. You have to use software. That's where machine learning comes in.
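The pattern described above can be written down as a simple hand-made rule. This is only a sketch: the field names (age, issued_in, used_in, amount) and the two sample transactions are hypothetical, made up for illustration. A rule like this is exactly what a person might spot by eye, which is why it says nothing about whether the pattern is truly predictive.

```python
# Hypothetical field names; the rule mirrors the pattern described in the text.
def looks_fraudulent(tx):
    """Return True if a transaction matches the hand-found pattern."""
    return (
        20 <= tx["age"] <= 29          # cardholder is in their 20s
        and tx["issued_in"] == "USA"   # card issued in the USA
        and tx["used_in"] == "Russia"  # card used in Russia
        and tx["amount"] > 1000        # amount is more than $1000
    )

# Made-up example transactions.
suspicious = {"age": 25, "issued_in": "USA", "used_in": "Russia", "amount": 1500}
ordinary = {"age": 25, "issued_in": "USA", "used_in": "USA", "amount": 1500}

print(looks_fraudulent(suspicious))  # True
print(looks_fraudulent(ordinary))    # False
```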
Machine Learning in a Nutshell
Machine learning in a nutshell looks like this. You start with data that contains patterns. You then feed that data into a machine learning algorithm, which finds patterns in the data. This algorithm generates something called a model.
A model is functionality, typically code, that's able to recognize patterns when presented with new data. Applications can then use that model by supplying new data, such as data about a new transaction, to see whether it matches known patterns. The model can return the probability that this transaction is fraudulent, based on the patterns it found.
Machine learning lets us find patterns in existing data, then create and use a model that recognizes those patterns in new data.
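To make the idea of an algorithm producing a model a little more concrete, here is a minimal sketch. It is not a real learning algorithm, just a toy stand-in that counts how often each combination of issuing country and usage country was fraudulent. The names train, model and history, and the field names, are all assumptions made up for this example.

```python
from collections import defaultdict

def train(records):
    """A toy 'learning algorithm': count how often each
    (issued_in, used_in) combination was fraudulent."""
    counts = defaultdict(lambda: [0, 0])  # key -> [fraud count, total count]
    for r in records:
        key = (r["issued_in"], r["used_in"])
        counts[key][0] += r["fraudulent"]  # True counts as 1, False as 0
        counts[key][1] += 1

    def model(tx):
        """Return the estimated probability that tx is fraudulent."""
        fraud, total = counts.get((tx["issued_in"], tx["used_in"]), (0, 0))
        # Combinations never seen in training fall back to 0.0 (a simplification).
        return fraud / total if total else 0.0

    return model

# Made-up historical transactions.
history = [
    {"issued_in": "USA", "used_in": "Russia", "fraudulent": True},
    {"issued_in": "USA", "used_in": "Russia", "fraudulent": True},
    {"issued_in": "USA", "used_in": "Russia", "fraudulent": False},
    {"issued_in": "USA", "used_in": "USA", "fraudulent": False},
]

model = train(history)
print(model({"issued_in": "USA", "used_in": "Russia"}))  # roughly 0.67
```

The application only ever sees the model function. It supplies a new transaction and gets a probability back, without needing to know how the patterns were found.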
Asking the Right Question
The first problem you face in the machine learning process is deciding what question to ask. Asking the right question is the most important part of the process. The reason is simple: if you ask the wrong question, you won't get the answer you care about.
Once you've chosen a question, you've got to ask yourself: do you have the right data to answer it? Maybe, for example, the question you want to ask is how can I predict whether a credit card transaction is going to be fraudulent? Well, maybe the most predictive piece of data for doing this is whether the customer is a homeowner or a renter. Or maybe it's how long they have lived at their current address. You might not have this data, and you won't know this until some later point, if ever.
Ask yourself, do you think you have the right data to answer the question? Because if you don't, you won't get the answer you need.
You also want to ask yourself this, do you know how you'll measure success? Because ultimately what you're going to get is a model that makes predictions. How good must those predictions be to make this entire process qualify as a success?
For example, for credit card transactions, if you find that you're accurate about fraud prediction in, say, 8 out of 10 cases, is that good enough? How about 6 out of 10? Do you demand 9 out of 10? How do you decide? Knowing this up front is important, because if you don't, you will never know when you're done.
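One way to pin down "good enough" is to decide on a threshold before you start and then measure against it. The sketch below assumes a simple accuracy measure (the fraction of predictions that turn out to be right). The outcomes and the 0.9 threshold are made-up values for illustration only.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known outcomes."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up outcomes for ten transactions (True means fraudulent).
actual    = [True, False, False, True, False, False, True, False, False, False]
predicted = [True, False, False, False, False, False, True, True, False, False]

REQUIRED_ACCURACY = 0.9  # hypothetical success threshold, decided up front

score = accuracy(predicted, actual)
print(score)                       # 0.8
print(score >= REQUIRED_ACCURACY)  # False
```

Here the model is right in 8 out of 10 cases, which falls short of the threshold we set, so by our own definition we are not done yet.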
The Machine Learning Process
To start, you choose the data that you want to work with. You often are going to work with domain experts in the area to do this, people who know a lot about, say, transaction fraud or robot failure detection, or whatever problem you're trying to solve. These are the ones who know what data is most likely to be predictive.
But the data you start with, the raw data, is almost never in the right form. It has duplicates, it has missing data, it has extra stuff. Typically you've got to apply some pre-processing to that data. The result is some prepared data, data that's been worked on to be more appropriate as an input for machine learning. Do you do this just once? Oh, no. You commonly iterate until the data is ready.
The machine learning process is iterative: you repeat things over and over, in both big and small ways.
The truth here is that in typical machine learning projects, you'll spend most of your time right here, working on the data, getting it ready, getting it clean, getting it prepared. Once you have that data, you can then begin applying learning algorithms to the data.
The result of this is a model, but is it your final model? No. It's a candidate model. Is the first model you create the best one? Almost certainly not, and you can't know that until you've produced several, and so once again, you iterate. You do this until you have a model that you like, that you think is good enough to actually deploy.
Like most fields, machine learning has its own unique jargon, which you must understand.
Let's start with the idea of training data. Training data just means the prepared data that's used to create a model. So, training data is used to train, that is to create, a model.
There are two big broad categories of machine learning. One is called supervised learning, and what it means is that the value you want to predict is actually in the training data. For instance, in the example for predicting credit card fraud, whether or not a given transaction was fraudulent is actually contained in each record. That data in the jargon of machine learning is labeled.
The alternative is called unsupervised learning, and here the value you want to predict is not in the training data. The data is unlabeled.
The machine learning process starts with data. It might be relational data, it might be from a NoSQL database, it might be binary data. Wherever it comes from, though, you need to read this raw data into some data preprocessing modules typically chosen from the things your machine learning technology provides. You have to do this because raw data is very rarely in the right shape to be processed by machine learning algorithms.
You'll spend lots of your time in a machine learning project, often the majority of it, on this aspect of the process. For example, maybe there are holes in your data (missing values), or duplicates. Maybe there's redundant data, where the same thing is expressed in two different ways in different fields. Or maybe there's information that you know will not be predictive, so it won't help you create a good model.
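Here is a minimal sketch of that kind of clean-up, using plain Python rather than any particular machine learning product. The records, the field names and the prepare function are all hypothetical. Dropping records with missing values is just one option; filling in a sensible default is another.

```python
# Made-up raw records; None marks a missing value.
raw = [
    {"id": 1, "amount": 120.0, "issued_in": "USA", "note": "n/a"},
    {"id": 1, "amount": 120.0, "issued_in": "USA", "note": "n/a"},  # duplicate
    {"id": 2, "amount": None,  "issued_in": "UK",  "note": "n/a"},  # missing amount
    {"id": 3, "amount": 75.5,  "issued_in": "USA", "note": "n/a"},
]

def prepare(records, drop_fields=("note",)):
    """Drop duplicates, records with missing values, and unhelpful fields."""
    seen = set()
    prepared = []
    for r in records:
        if any(v is None for v in r.values()):
            continue  # skip records with holes
        key = tuple(sorted(r.items()))
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        # Keep only the fields we believe could be predictive.
        prepared.append({k: v for k, v in r.items() if k not in drop_fields})
    return prepared

print(prepare(raw))
```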
The goal is to create training data. The training data commonly has columns. Those columns are called features. So, for example, in the simple illustration I showed of data for credit card fraud, there were columns containing the country the card was issued in, the country the card was used in, and the amount of the transaction. Those are all features in the jargon of machine learning. And because we're talking now about supervised learning, the value we're trying to predict, such as whether a given transaction is fraudulent, is also in the training data. In the jargon of machine learning, we call that the target value.
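In code, features and the target value are often separated like this. It's a sketch with made-up records and hypothetical column names. Calling the features X and the targets y is a common convention, not something the process requires.

```python
# Hypothetical prepared training data: three feature columns plus the target.
training_data = [
    {"issued_in": "USA", "used_in": "Russia", "amount": 1500.0, "fraudulent": True},
    {"issued_in": "USA", "used_in": "USA",    "amount": 40.0,   "fraudulent": False},
    {"issued_in": "UK",  "used_in": "UK",     "amount": 220.0,  "fraudulent": False},
]

FEATURES = ["issued_in", "used_in", "amount"]
TARGET = "fraudulent"

# Separate the feature values from the target value for each record.
X = [[row[f] for f in FEATURES] for row in training_data]
y = [row[TARGET] for row in training_data]

print(X[0])  # ['USA', 'Russia', 1500.0]
print(y)     # [True, False, False]
```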
Categorising Machine Learning Problems
It's common to group machine learning problems into categories.
The first category is regression. The problem here is that we have data, and we'd like to find a line or a curve that best fits that data. Regression problems are typically supervised learning scenarios.
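As a small worked example of regression, the sketch below fits a straight line to a handful of points using ordinary least squares, one common way of finding a best-fit line. The function name fit_line and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # these points lie exactly on y = 2x + 1

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```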
The second category is classification. Data is grouped into classes, at least two and sometimes more. When new data comes in, we want to determine which class that data belongs to. This is commonly used with supervised learning.
The third category is clustering. Here we have data, and we want to find clusters in that data. This is a good example of when we're going to use unsupervised learning, because we don't have labeled data. We don't necessarily know what we're looking for. An example question here is something like, what are our customer segments? We might not know these things up front, but we can use unsupervised machine learning to help us figure that out.
Styles of Machine Learning Algorithms
The kinds of problems that machine learning addresses aren't the only thing that can be categorized. It's also useful to think about the styles of machine learning algorithms that are used to solve those problems. For example, there are decision tree algorithms. There are algorithms that use neural networks, which in some ways emulate how the brain works. There are Bayesian algorithms that use Bayes' theorem to work up probabilities. There are K-means algorithms that are used for clustering, and there are lots more.
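To give a feel for one of these styles, here is a tiny one-dimensional sketch of the K-means idea: assign every value to its nearest center, move each center to the mean of its values, and repeat. The function name, the starting centers and the customer spend figures are all made up, and real implementations handle many dimensions and choose starting centers more carefully.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Made-up yearly spend for six customers: two obvious segments.
spend = [10.0, 12.0, 11.0, 200.0, 210.0, 190.0]
centers, clusters = k_means_1d(spend, centers=[0.0, 100.0])
print(centers)   # [11.0, 200.0]
print(clusters)  # [[10.0, 12.0, 11.0], [200.0, 210.0, 190.0]]
```

Notice that nobody told the algorithm which customers belong together. The two segments come out of the data itself, which is what makes this unsupervised.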
Training and Testing a Model
Let's take a closer look at the process of creating a model, training a model. We start with our training data, which we've worked with until it's beautiful, pristine, just what we need. Because we're using supervised learning, the target value is part of the training data. In the case of the credit card example, for instance, that target value is whether a transaction is fraudulent or not.
Our first problem is to choose the features that we think will be most predictive of that target value. For example, in the credit card case, maybe we decide that the country in which the card was issued, the country it's used in, and the age of the user are the most likely features to help us predict whether it's fraudulent. We then input that training data into our chosen learning algorithm. We only send in 75% of all the data for the features we've chosen, holding the rest back for testing.
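Holding back part of the data can be sketched in a few lines. The split function here is hypothetical, and the fixed seed is only there so the example gives the same result every time. Shuffling first is a common precaution so that the held-back portion isn't just the last records in the file.

```python
import random

def split(records, train_fraction=0.75, seed=0):
    """Shuffle the records and split them into training and test portions."""
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(20))  # stand-ins for 20 prepared records
train_part, test_part = split(records)

print(len(train_part), len(test_part))  # 15 5
```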
Suppose, for example, our training data has about 100 features, or even 200. Which ones are predictive? How many should we use? 5, 10, 50? This is why people who have domain knowledge about a particular problem are so valuable: they can help us do this. It can be a hard problem. In any case, the result of this is to generate a candidate model.
The next problem is to work out whether or not this model is any good. In supervised learning, we do that like this. We input test data to a candidate model. That test data is the remaining 25%, the data we held back for the features we're using, in this case, features 1, 3, and 6. Our candidate model can now generate target values from that test data. But here's the thing. We know what those target values should be, because the test data was held back from our labeled training data. All we have to do is compare the target values produced by our candidate model with the real target values in that held-back data. That's how we can figure out whether our model is predictive when we're doing supervised learning.
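The comparison itself is simple. In this sketch the candidate model is a hand-written stand-in rather than something a learning algorithm produced, and the held-back test records are made up. The point is only to show predictions being checked against known target values.

```python
def candidate_model(tx):
    """A stand-in candidate model: flag large transactions used abroad."""
    return tx["amount"] > 1000 and tx["issued_in"] != tx["used_in"]

# Made-up held-back test records, each with its known target value.
test_data = [
    {"issued_in": "USA", "used_in": "Russia", "amount": 1500.0, "fraudulent": True},
    {"issued_in": "USA", "used_in": "USA",    "amount": 2000.0, "fraudulent": False},
    {"issued_in": "UK",  "used_in": "France", "amount": 50.0,   "fraudulent": False},
    {"issued_in": "USA", "used_in": "USA",    "amount": 1200.0, "fraudulent": True},
]

# Target values produced by the model versus the real ones.
predicted = [candidate_model(tx) for tx in test_data]
actual = [tx["fraudulent"] for tx in test_data]

matches = sum(p == a for p, a in zip(predicted, actual))
print(f"{matches} of {len(test_data)} predictions correct")  # 3 of 4 predictions correct
```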
Suppose our model's just not very good. How can we improve it? Well, there are some usual options. Maybe we've chosen the wrong features, so let's choose different ones. How about 1, 2, and 5 this time? Or maybe we have the wrong data, so let's get some new data, or at least some more example data. Or maybe the problem is the algorithm. Learning algorithms commonly have parameters we can modify, or we can choose another algorithm entirely.
Whatever we do will generate another candidate model, and we'll test it, and the process repeats. It iterates. Iteration is a fancy way of saying trial and error. So, don't be confused. This process is called machine learning, but notice how much people do. People make decisions about features, about algorithms, about parameters. The process is very human, even though it's called machine learning.
Using a Model
In some ways, this is the most important topic of all, because until models are used, they don't really have much value. An application, for example, can call a model, providing the values for the features the model requires. Remember, models make predictions based on the features that were chosen when the model was trained. The model can then return a value, predicted using these features. That value might be whether or not a transaction is fraudulent, estimated revenue, a list of movie recommendations, or something else. The point here is that machine learning can help people create better applications.
And that's it. Let me end by summarising the key points. First, machine learning has come of age. It's no longer some technology that's only for researchers in faraway labs. Second, machine learning isn't hard to understand, though it can be hard to do well.
Part 2 coming soon...