In the last half of the twentieth century, the majority of the workforce in the developed world moved from manual labor to what is known as "knowledge work". Job assignments are much more ambiguous now; goals such as "maximize profits," "minimize risk," and "find the best marketing strategy" are all too common. The fire hose of information available to us from the World Wide Web makes the jobs of knowledge workers even harder. Making sense of all that data with our job in mind is becoming an essential skill, a point also made by Hal Varian, chief economist at Google.
Key terminology :
We'll go through an example of building a bird classification system. This sort of system belongs to a topic often associated with machine learning called "expert systems". By creating a computer program that recognizes birds, we replace an ornithologist with a computer. The ornithologist is a bird expert, so we've created an expert system.
The table below lists some values for four parts of various birds that we decided to measure. We chose to measure weight, wingspan, whether the bird has webbed feet, and the color of its back. The four things we've measured are called "features"; they are also called "attributes", but we'll stick with the term features in the following chapters. Each row in the table is an "instance" made up of features :
The first two features in the table are numeric and can take on decimal values. The third feature (webbed feet) is binary: it can only be 1 or 0. The fourth feature (back color) is an enumeration over the color palette we're using; I just chose some very common colors.
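One way to hold such an instance in code is sketched below. The class name, the field names, the units, the `BackColor` palette, and the sample values are all assumptions made up for illustration; none of them are taken from the table.

```python
from dataclasses import dataclass
from enum import Enum


class BackColor(Enum):
    """Hypothetical palette; the real one depends on the colors you record."""
    BROWN = "brown"
    GRAY = "gray"
    BLACK = "black"
    GREEN = "green"


@dataclass
class BirdInstance:
    weight_g: float        # numeric feature, may take decimal values
    wingspan_cm: float     # numeric feature, may take decimal values
    webbed_feet: int       # binary feature: 1 or 0
    back_color: BackColor  # enumerated feature

    def __post_init__(self):
        # Reject anything other than the two allowed binary values.
        if self.webbed_feet not in (0, 1):
            raise ValueError("webbed_feet must be 0 or 1")


# Illustrative values only, not a row from the table.
example = BirdInstance(weight_g=1000.5, wingspan_cm=125.0,
                       webbed_feet=0, back_color=BackColor.BROWN)
print(example)
```

Using an enumeration for the back color keeps the set of allowed values explicit, which matches the idea of a feature defined over a fixed palette.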
One task in machine learning is "classification", and there are many machine learning algorithms that are good at it. The class in this example is the bird species; more specifically, we can reduce our classes to specific species or everything else. Say we've decided on a machine learning algorithm to use for classification. What we need to do next is train the algorithm, or allow it to learn. To train the algorithm we feed it quality data known as a "training set". A training set is the set of training examples we'll use to train our machine learning algorithms. In the table above, our training set has six "training examples". Each training example has four features and one "target variable". The target variable is what we'll be trying to predict with our machine learning algorithms; here it is the species. In classification the target variable takes on a nominal value, and in regression its value can be continuous. In a training set the target variable is known, and the machine learns by finding some relationship between the features and the target variable. In classification problems the values of the target variable are called "classes", and there is assumed to be a finite number of classes.
To test a machine learning algorithm, we usually keep a training set of data and a separate dataset called a "test set". Initially the program is fed the training examples; this is when the machine learning takes place. Next, the test set is fed to the program. The target variable for each example in the test set isn't given, and the program decides which class each example should belong to. The known target variable, or class, of each test example is then compared to the predicted value, and we get a sense of how accurate the algorithm is.
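The train-then-test procedure can be sketched in a few lines of Python. The data below is invented, the species names are placeholders, and a 1-nearest-neighbor rule stands in for the classifier purely because it is short; it is not the algorithm this text prescribes.

```python
import math

# Made-up (weight, wingspan) -> species examples, for illustration only.
training_set = [
    ((1000.0, 125.0), "species_a"),
    ((1100.0, 130.0), "species_a"),
    ((3000.0, 200.0), "species_b"),
    ((3300.0, 220.0), "species_b"),
]
test_set = [
    ((1050.0, 128.0), "species_a"),
    ((3100.0, 210.0), "species_b"),
    ((2900.0, 190.0), "species_a"),  # deliberately hard case
]


def predict(features, examples):
    """Return the class of the closest training example (1-nearest-neighbor)."""
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]


def accuracy(test_examples, train_examples):
    """Fraction of test examples whose predicted class matches the known class."""
    correct = sum(
        predict(features, train_examples) == actual
        for features, actual in test_examples
    )
    return correct / len(test_examples)


print(f"accuracy: {accuracy(test_set, training_set):.2f}")
```

The known class of each test example is hidden from `predict` and used only inside `accuracy`, which mirrors the comparison described above. With this invented data, two of the three test examples are classified correctly.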
In our bird classification example, assume we've tested the program and it meets our desired level of accuracy. Can we see what the machine has learned? This is called "knowledge representation". The answer is that it depends. Some algorithms have a knowledge representation that's more readable by humans than others. The knowledge representation may be in the form of a set of rules; it may be a probability distribution or an example from the training set. In some cases, we may not be interested in building an expert system but only in the knowledge representation that's acquired from training a machine learning algorithm.
Key tasks of machine learning :
The example covered previously was for the task of classification. In classification, our job is to predict what class an instance of data should fall into. Another task in machine learning is "regression". Regression is the prediction of a numeric value. Most people have probably seen an example of regression: a best-fit line drawn through some points to generalize the data points. Classification and regression are examples of "supervised learning". This set of problems is known as supervised because we're telling the algorithm what to predict.
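As a small illustration of regression, the sketch below fits a best-fit line with ordinary least squares and uses it to predict a numeric value. The function name `fit_line` and the data points are assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie near y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted numeric value at x = 5
print(slope, intercept, prediction)
```

Here the target value is continuous, so the output of the model is a number and not a class label.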
The opposite of supervised learning is a set of tasks known as "unsupervised learning". In unsupervised learning, there's no label or target value given for the data. A task where we group similar items together is known as "clustering". In unsupervised learning, we may also want to find statistical values that describe the data. This is known as "density estimation". Another task of unsupervised learning may be reducing the data from many features to a small number so that we can properly visualize it in two or three dimensions. Table 1.2 below lists some common tasks in machine learning with algorithms used to solve those tasks :
How to choose the right algorithm :
With all the different algorithms in Table 1.2, how can you choose which one to use? First, you need to consider your goal. What are you trying to get out of this? What data do you have or can you collect? Those are the big questions.
If you're trying to predict or forecast a target value, then you need to look into supervised learning. If not, then unsupervised learning is the place you want to be. If you've chosen supervised learning, what's your target value? Is it a discrete value like Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black? If so, then you want to look into classification. If the target value can take on a continuous range of values, say any value from 0.00 to 100.00, or -999 to 999, then you need to look into regression.
If you're not trying to predict a target value, then you need to look into unsupervised learning. Are you trying to fit your data into some discrete groups? If so and that's all you need, you should look into clustering. Do you need to have some numerical estimate of how strong the fit is into each group? If you answer yes, then you probably should look into a density estimation algorithm.
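These questions can be written down as a small decision helper. The function name `suggest_task` and its parameter names are hypothetical, and the function only encodes the rules of thumb from the text.

```python
def suggest_task(predicting_target, target_is_discrete=False,
                 need_fit_strength=False):
    """Map the questions in the text to a family of machine learning tasks."""
    if predicting_target:
        # Supervised learning: the kind of target value decides the task.
        return "classification" if target_is_discrete else "regression"
    # Unsupervised learning: there is no target value to predict.
    return "density estimation" if need_fit_strength else "clustering"


# Predicting a Yes/No answer is a discrete target.
print(suggest_task(predicting_target=True, target_is_discrete=True))
# Grouping data with no target, and no need for a strength-of-fit estimate.
print(suggest_task(predicting_target=False))
```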
The rules given here should point you in the right direction but are not unbreakable laws. As later chapters show (chapter 9), you can use classification techniques for regression, which blurs the distinction made within supervised learning. The second thing you need to consider is your data.
You should spend some time getting to know your data; the more you know about it, the better you'll be able to build a successful application. Things to know about your data are these: Are the features nominal or continuous? Are there missing values in the features? If there are missing values, why are they missing? Are there outliers in the data? All of these characteristics of your data can help you narrow the algorithm selection process.
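Two of these questions, missing values and outliers, can be checked with a few lines of Python. This is a rough sketch: it assumes missing entries are stored as `None`, it uses the common 1.5 × IQR rule of thumb to flag outliers (one convention among several), and the weight column is invented.

```python
import statistics


def count_missing(values):
    """Number of entries recorded as None (the missing-value marker assumed here)."""
    return sum(v is None for v in values)


def find_outliers(values, k=1.5):
    """Return values more than k * IQR beyond the first or third quartile."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


# Made-up weight column with one missing entry and one suspicious value.
weights = [1000.0, 1100.0, None, 980.0, 1050.0, 9000.0, 1020.0]
print(count_missing(weights), find_outliers(weights))
```

A flagged value is only a prompt to look closer; whether it is a measurement error or a genuinely unusual bird is a question about how the data was collected.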
With the choice of algorithms narrowed, there's still no single answer to what the best algorithm is or what will give you the best results. You're going to have to try different algorithms and see how they perform. There are other machine learning techniques that you can use to improve the performance of a machine learning algorithm. The relative performance of two algorithms may change after you process the input data. So the point is that finding the best algorithm is an iterative process of trial and error.
Steps in developing a machine learning application :
Our approach to understanding and developing an application using machine learning here will follow a procedure similar to this :
1. Collect data
2. Prepare the input data
3. Analyze the input data
4. Train the algorithm
5. Test the algorithm
6. Use it