Sunday, July 8, 2012

[ ML In Action ] Machine learning basics




Preface :
In the last half of the twentieth century, the majority of the workforce in the developed world moved from manual labor to what is known as "knowledge work". Things are much more ambiguous now; job assignments such as "maximize profits," "minimize risk," and "find the best marketing strategy" are all too common. The fire hose of information available to us from the World Wide Web makes the jobs of knowledge workers even harder. Making sense of all the data with our job in mind is becoming a more essential skill, as Hal Varian, chief economist at Google, said:
I keep saying the sexy job in the next ten years will be statistician. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s? The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it. I think statisticians are part of it, but it's just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills - of being able to access, understand, and communicate the insights you get from data analysis - are going to be extremely important. Managers need to be able to access and understand the data themselves.

Key terminology :
We'll go through an example of building a bird classification system. This sort of system is an example of what's often called an "expert system" in machine learning. By creating a computer program to recognize birds, we've replaced an ornithologist with a computer. The ornithologist is a bird expert, so we've created an expert system.

The table below lists values for four attributes of various birds that we decided to measure: weight, wingspan, whether the bird has webbed feet, and the color of its back. The four things we've measured are called "features"; these are also called "attributes", but we'll stick with the term features in the following chapters. Each row in the table is an "instance" made up of features:


The first two features in the table are numeric and can take on decimal values. The third feature (webbed feet) is binary: it can only be 1 or 0. The fourth feature (back color) is an enumeration over the color palette we're using; I just chose some very common colors.
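One such instance might be held in Python as a small dictionary, with a check that each feature has the expected form. The field names, values, and color palette below are illustrative choices, not the book's actual table:

```python
# One bird instance: two numeric features, one binary, one enumerated.
# The particular names and values here are made up for illustration.
instance = {
    "weight_g": 1000.5,      # numeric, decimal values allowed
    "wingspan_cm": 125.0,    # numeric
    "webbed_feet": 1,        # binary: 1 (yes) or 0 (no)
    "back_color": "brown",   # enumeration over a fixed color palette
}

ALLOWED_COLORS = {"brown", "gray", "black", "white", "green"}

def is_valid(inst):
    """Check that each feature has the expected type or range."""
    return (
        isinstance(inst["weight_g"], (int, float))
        and isinstance(inst["wingspan_cm"], (int, float))
        and inst["webbed_feet"] in (0, 1)
        and inst["back_color"] in ALLOWED_COLORS
    )

print(is_valid(instance))  # True
```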

One task in machine learning is "classification", and many machine learning algorithms are good at it. The class in this example is the bird species; more specifically, we can reduce our classes to a specific species versus everything else. Say we've decided on a machine learning algorithm to use for classification. What we need to do next is train the algorithm, or allow it to learn. To train the algorithm we feed it quality data known as a "training set". A training set is the set of training examples we'll use to train our machine learning algorithms. In the table above, our training set has six "training examples". Each training example has four features and one "target variable". The target variable is what we'll be trying to predict with our machine learning algorithms; in classification the target variable takes on a nominal value, while in regression its value could be continuous. In a training set the target variable is known: the machine learns by finding some relationship between the features and the target variable. Here, the target variable is the species. In classification problems the target variables are called "classes", and there is assumed to be a finite number of classes.
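A training set of this shape can be held as a list of feature rows with a parallel list of target values. The measurements and species names below are placeholders standing in for the book's table:

```python
# Each row: [weight, wingspan, webbed feet, back color]; one target per row.
# The values are illustrative placeholders, not the book's measurements.
training_features = [
    [1000.1, 125.0, 0, "brown"],
    [3000.7, 200.0, 0, "gray"],
    [3300.0, 220.3, 0, "gray"],
    [4100.0, 136.0, 1, "black"],
    [3.0, 11.0, 0, "green"],
    [570.0, 75.0, 0, "black"],
]
training_targets = ["Buteo jamaicensis", "Sagittarius serpentarius",
                    "Sagittarius serpentarius", "Gavia immer",
                    "Calothorax lucifer", "Campephilus principalis"]

classes = sorted(set(training_targets))   # the finite set of classes
print(len(training_features), "training examples,", len(classes), "classes")
```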

To test machine learning algorithms, what's usually done is to have a training set of data and a separate dataset called a "test set". Initially the program is fed the training examples; this is when the machine learning takes place. Next, the test set is fed to the program. The target variable for each example in the test set isn't given, and the program decides which class each example should belong to. The target variable, or class, that the test example actually belongs to is then compared to the predicted value, and we can get a sense of how accurate the algorithm is.
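That train-then-test procedure can be sketched with a stand-in model. The `classify` function below is a hypothetical placeholder for whatever algorithm was trained, and the test data is invented:

```python
def classify(example):
    # Hypothetical trained model: predicts class 1 if the first
    # feature exceeds a threshold "learned" during training.
    return 1 if example[0] > 5.0 else 0

# Held-out test set: (features, true class) pairs kept apart from training.
test_set = [([7.2, 0.1], 1), ([2.0, 3.3], 0), ([9.9, 1.0], 1), ([4.0, 0.5], 1)]

correct = sum(1 for features, true_class in test_set
              if classify(features) == true_class)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.2f}")  # 3 of 4 correct -> 0.75
```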

In our bird classification example, assume we've tested the program and it meets our desired level of accuracy. Can we see what the machine has learned? This is called "knowledge representation". The answer is: it depends. Some algorithms have knowledge representations that are more readable by humans than others. The knowledge representation may be a set of rules, a probability distribution, or an example from the training set. In some cases we may not be interested in building an expert system at all, but only in the knowledge representation acquired from training a machine learning algorithm.
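A set of rules is one human-readable knowledge representation. The rules below are invented for illustration, not learned from real data:

```python
# A "learned" model represented as ordered if-then rules (illustrative).
# Each rule: (feature index, threshold, class if feature > threshold).
rules = [
    (2, 0.5, "Gavia immer"),        # webbed feet present -> loon
    (0, 2000.0, "large species"),   # heavy birds
]
DEFAULT_CLASS = "everything else"

def apply_rules(example):
    """Return the class of the first matching rule, else the default."""
    for feature_idx, threshold, cls in rules:
        if example[feature_idx] > threshold:
            return cls
    return DEFAULT_CLASS

print(apply_rules([1000.0, 125.0, 1]))  # webbed feet fires the first rule
print(apply_rules([500.0, 60.0, 0]))    # no rule fires -> default class
```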

Key tasks of machine learning :
The example covered previously was the task of classification, where our job is to predict what class an instance of data should fall into. Another task in machine learning is "regression", the prediction of a numeric value. Most people have probably seen an example of regression: a best-fit line drawn through some data points to generalize them. Classification and regression are examples of "supervised learning". This set of problems is known as supervised because we're telling the algorithm what to predict.
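A best-fit line of the kind mentioned can be computed with ordinary least squares. This small example uses made-up points:

```python
# Fit y = slope * x + intercept by ordinary least squares (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction at x=6: {slope * 6 + intercept:.2f}")
```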

The opposite of supervised learning is a set of tasks known as "unsupervised learning". In unsupervised learning, there's no label or target value given for the data. A task where we group similar items together is known as "clustering". In unsupervised learning we may also want to find statistical values that describe the data; this is known as "density estimation". Another unsupervised task is reducing the data from many features to a small number so that we can properly visualize it in two or three dimensions. Table 1.2 below lists some common tasks in machine learning along with algorithms used to solve them:


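Clustering, as described above, groups similar items without any labels. A minimal sketch assigns unlabeled 1-D points to the nearer of two centroids, which is one assignment-and-update step of k-means; the points and starting centroids are invented:

```python
# Group unlabeled 1-D points by the nearest of two centroids
# (one step of k-means, with illustrative data and starting centroids).
points = [1.0, 1.5, 1.2, 8.0, 8.3, 7.9]
centroids = [1.0, 8.0]

clusters = {0: [], 1: []}
for p in points:
    nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
    clusters[nearest].append(p)

# Update each centroid to the mean of its cluster (the k-means update).
centroids = [sum(c) / len(c) for c in clusters.values()]
print(clusters)
print(centroids)
```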
How to choose the right algorithm :
With all the different algorithms in table 1.2, how can you choose which one to use? First, you need to consider your goal. What are you trying to get out of this? What data do you have or can you collect? Those are the big questions.

If you're trying to predict or forecast a target value, then you need to look into supervised learning; if not, unsupervised learning is the place you want to be. If you've chosen supervised learning, what's your target value? Is it a discrete value like Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black? If so, you want to look into classification. If the target value can take on a range of values, say any value from 0.00 to 100.00 or from -999 to 999, then you need to look into regression.

If you're not trying to predict a target value, then you need to look into unsupervised learning. Are you trying to fit your data into some discrete groups? If so, and that's all you need, you should look into clustering. Do you also need a numerical estimate of how strong the fit is to each group? If so, you probably should look into a density estimation algorithm.
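The decision process in the last two paragraphs can be written down as a small helper function; the question names are my own shorthand for the questions above:

```python
def suggest_task(predicting_target, discrete_target=False, want_fit_strength=False):
    """Map the chapter's questions to a family of algorithms (illustrative)."""
    if predicting_target:
        # Supervised learning: discrete target -> classification,
        # continuous target -> regression.
        return "classification" if discrete_target else "regression"
    # Unsupervised learning: groups only -> clustering,
    # numeric estimate of fit -> density estimation.
    return "density estimation" if want_fit_strength else "clustering"

print(suggest_task(True, discrete_target=True))        # classification
print(suggest_task(True, discrete_target=False))       # regression
print(suggest_task(False, want_fit_strength=True))     # density estimation
print(suggest_task(False, want_fit_strength=False))    # clustering
```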

The rules given here should point you in the right direction, but they're not unbreakable laws. As later chapters (such as chapter 9) show, you can use classification techniques for regression, blurring the distinction within supervised learning. The second thing you need to consider is your data.

You should spend some time getting to know your data; the more you know about it, the better you'll be able to build a successful application. Things to know about your data are these: Are the features nominal or continuous? Are there missing values in the features, and if so, why? Are there outliers in the data? All of these facts about your data can help you narrow the algorithm selection process.

Even with the choices narrowed, there's no single answer to which algorithm is best or which will give you the best results. You'll have to try different algorithms and see how they perform. There are also other machine learning techniques you can use to improve an algorithm's performance, and the relative performance of two algorithms may change after you process the input data. The point is that finding the best algorithm is an iterative process of trial and error.

Steps in developing a machine learning application :
Our approach to understanding and developing an application using machine learning will follow a procedure similar to this:
1. Collect data
You could collect the samples by scraping a website and extracting data, or you could get information from an RSS feed or an API. You could have a device collect wind speed measurements and send them to you, or blood glucose levels, or anything you can measure.

2. Prepare the input data
Once you have this data, you need to make sure it's in a usable format. The format we'll be using here is the Python list. You may also need to do some algorithm-specific formatting at this point: some algorithms need features in a special format, some can deal with target variables and features as strings, and some need them to be integers.
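Algorithm-specific formatting often means converting string fields into numbers. A minimal sketch, assuming raw rows with an enumerated color feature (the rows and code table are made up):

```python
# Parsed rows with string fields (illustrative): [weight, color, label].
raw_rows = [["1000.1", "brown", "hawk"], ["4100.0", "black", "loon"]]

# Map the enumerated color feature to integer codes (an assumed encoding).
COLOR_CODES = {"brown": 0, "gray": 1, "black": 2, "white": 3}

def prepare(row):
    """Convert a row of strings into the numeric Python list an
    algorithm might expect: ([float weight, int color code], label)."""
    weight, color, label = row
    return [float(weight), COLOR_CODES[color]], label

dataset = [prepare(r) for r in raw_rows]
print(dataset)  # [([1000.1, 0], 'hawk'), ([4100.0, 2], 'loon')]
```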

3. Analyze the input data
This means looking at the data from the previous step. It could be as simple as opening the parsed data in a text editor to make sure steps 1 and 2 are actually working and you don't have a bunch of empty values.
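A quick programmatic version of that sanity check might scan the parsed rows for empty fields; the rows here are invented examples:

```python
# Parsed rows from steps 1-2 (illustrative); "" marks a missing value.
rows = [["1000.1", "brown"], ["", "gray"], ["3300.0", ""]]

def find_empty(rows):
    """Return (row index, column index) for every empty field."""
    return [(i, j) for i, row in enumerate(rows)
            for j, value in enumerate(row) if value == ""]

problems = find_empty(rows)
print(f"{len(problems)} empty values at {problems}")
```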

4. Train the algorithm
This is where the machine learning takes place. This step and the next are where the "core" algorithms lie, depending on the algorithm. You feed the algorithm good, clean data from the first steps and extract knowledge or information.
In the case of unsupervised learning, there's no training step because you don't have a target value; everything is used in the next step.

5. Test the algorithm
This is where the information learned in the previous step is put to use: you test the algorithm to see how well it does.

6. Use it
Here you make a real program to carry out some task, and once again you check whether all the previous steps worked as you expected. You might encounter some new data and have to revisit steps 1-5.
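The six steps above can be strung together in one skeleton. Every function here is a stand-in for the real work at that step, and the stored-examples "model" is just a trivial 1-nearest-neighbor placeholder:

```python
# A skeleton of the six-step procedure; each function is a placeholder
# for the real work at that step, not an actual algorithm from the book.
def collect_data():                       # step 1
    return [([1.0, 0], "A"), ([9.0, 1], "B"), ([1.2, 0], "A"), ([8.8, 1], "B")]

def prepare_and_analyze(data):            # steps 2-3: format and sanity-check
    assert all(len(features) == 2 for features, _ in data)
    return data

def train(data):                          # step 4: "remember" labeled examples
    return data

def classify(model, features):            # steps 5-6: predict with the model
    # Predict the label of the stored example nearest in the first feature.
    return min(model, key=lambda ex: abs(ex[0][0] - features[0]))[1]

model = train(prepare_and_analyze(collect_data()))
print(classify(model, [1.1, 0]))  # A
```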
