程式扎記: August 2015

Sunday, August 30, 2015

[ Big Data Research ] 04 Managing Linux Container Virtual Networks - Part 2

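Using a custom virtual bridge
Linux provides a complete set of virtual network components, so LXC containers can reach not only the internal network but also external networks. In some real-world situations, however, we need a closed network to test an application system. In that case we create a custom virtual bridge and attach all the test containers to its network segment.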

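Creating a virtual bridge by hand
To create a custom virtual bridge we use the "brctl" command, which is provided by the "bridge-utils" package. That package was already installed on the system together with the LXC kernel module. Use the following commands to create a custom virtual bridge named "br01":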

# brctl addbr br01
# brctl show br01
bridge name bridge id STP enabled interfaces
br01 8000.000000000000 no

No LXC container is attached to br01 yet, so the "interfaces" column is empty. Before br01 can be used, it must be brought up with the following command:

# ifconfig br01 up
# ip addr show br01
42: br01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether b6:ff:78:b5:3e:6e brd ff:ff:ff:ff:ff:ff
inet6 fe80::b4ff:78ff:feb5:3e6e/64 scope link
valid_lft forever preferred_lft forever
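
Using the custom virtual bridge
We edit the container's configuration file to change the virtual bridge that the LXC container uses by default. The location of that file can be found by inspecting the container's directory layout with the "tree" command:

Once the location is confirmed, edit the file. We use the container myUS14 as the example: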

# vi /var/lib/lxc/myUS14/config

After the change (lxc.network.link set to br01 instead of lxcbr0), myUS14 attaches to the "br01" virtual bridge instead of the default "lxcbr0".
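The edit itself is a one-line change, so it can also be scripted. The following is only a minimal sketch: the helper name `set_bridge` is made up for illustration, and it assumes the LXC 1.x key `lxc.network.link` that appears in the configuration listing later in this post. It works on the text of the file and leaves commented-out lines alone.

```python
def set_bridge(config_text, bridge):
    """Return config_text with every active lxc.network.link line pointing at bridge."""
    out = []
    for line in config_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key == "lxc.network.link":  # commented-out lines start with '#', so they do not match
            line = f"lxc.network.link = {bridge}"
        out.append(line)
    return "\n".join(out) + "\n"


# Sample network section modelled on the myUS14 configuration shown in this post.
SAMPLE_CONFIG = """# Network configuration
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = lxcbr0
#lxc.network.link = lxcbr0
lxc.network.hwaddr = 00:16:3e:65:9b:85
"""

print(set_bridge(SAMPLE_CONFIG, "br01"))
```

To apply it for real you would read /var/lib/lxc/myUS14/config, pass the text through the function, and write the result back as root; keeping a backup copy first is a sensible precaution.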

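Bridging the internal and external networks
In the previous steps we created the "br01" virtual bridge by hand. It gives us the closed network segment that some applications need. Internal-only connectivity may not be enough for system testing, though. If br01 is attached to the physical network card used by the host, the containers in this segment can reach the external network through that card.

Next we attach br01 to the host's eth0 network card, attach the containers myUS14 and ubuntu14 to the bridge, and give each of them an IP address in a different way. First, attach the physical network card eth0 to br01 with the following commands (if your "physical" host is itself a virtual machine, for example under VMware Workstation, this step will drop your PuTTY session!):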

# brctl addif br01 eth0
# brctl show
bridge name bridge id STP enabled interfaces
br01 8000.000c297488fb no eth0
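If you want to check the bridge membership from a script rather than by eye, the table can be parsed. This is a rough sketch only: `parse_brctl_show` is a hypothetical helper, and it assumes the column layout shown in this post (name, id, STP flag, then interfaces, with extra interfaces on continuation rows).

```python
def parse_brctl_show(text):
    """Turn the table printed by `brctl show` into a list of dicts."""
    bridges = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1 and bridges:
            # A row holding only an interface name continues the previous bridge.
            bridges[-1]["interfaces"].append(fields[0])
            continue
        if len(fields) < 3:
            continue  # not a row this sketch understands
        name, bridge_id, stp = fields[:3]
        bridges.append({
            "name": name,
            "id": bridge_id,
            "stp": stp == "yes",
            "interfaces": fields[3:],
        })
    return bridges


# Output copied from the listing above.
SAMPLE = """bridge name bridge id STP enabled interfaces
br01 8000.000c297488fb no eth0
"""

for bridge in parse_brctl_show(SAMPLE):
    print(bridge["name"], bridge["interfaces"] or "(no interfaces)")
```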

You can remove eth0 from the virtual bridge br01 again with the following command:

# brctl delif br01 eth0

Once eth0 is removed from br01, the bridge is an isolated environment. So that containers attached to it can still obtain an IP address over DHCP, we start a separate dnsmasq service:

# dnsmasq -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq_br01.pid --conf-file= --listen-address 10.1.100.1 --dhcp-range 10.1.100.200,10.1.100.250 --dhcp-lease-max=10 --dhcp-no-override --except-interface=lo --interface=br01 --dhcp-leasefile=/var/lib/misc/dnsmasq.br01.leases --dhcp-authoritative
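Before starting dnsmasq it can help to sanity-check the addresses on the command line. The sketch below is an illustration, not part of the original setup: the function name `dhcp_pool` is made up, and the /24 prefix is an assumption taken from the 10.1.100.x/24 address the container reports later. It also assumes br01 itself has been given 10.1.100.1, which the listen address implies.

```python
import ipaddress


def dhcp_pool(listen_address, range_start, range_end, prefix_len=24):
    """Check that a DHCP range sits in the listen address's subnet; return (network, pool size)."""
    network = ipaddress.ip_interface(f"{listen_address}/{prefix_len}").network
    start = ipaddress.ip_address(range_start)
    end = ipaddress.ip_address(range_end)
    if start not in network or end not in network:
        raise ValueError("DHCP range is outside the bridge's subnet")
    if start > end:
        raise ValueError("DHCP range start is after its end")
    return network, int(end) - int(start) + 1


# Values taken from the dnsmasq command above.
network, pool_size = dhcp_pool("10.1.100.1", "10.1.100.200", "10.1.100.250")
print(network, pool_size)
```

Note that the command also passes --dhcp-lease-max=10, so dnsmasq hands out at most 10 leases even though the range itself is larger.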

If we start the container myUS14 now, it should obtain a 10.1.100.x IP address:

# lxc-start -n myUS14 -d // start the container myUS14 in the background
# lxc-console -n myUS14 // log in to the container
ubuntu@myUS14:~$ ip addr show eth0
47: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:16:3e:65:9b:85 brd ff:ff:ff:ff:ff:ff
inet 10.1.100.239/24 brd 10.1.100.255 scope global eth0

Next we edit the network interface configuration files of myUS14 and ubuntu14. The former is set up as a DHCP client, while the latter gets a fixed, manually assigned IP address. First, myUS14:

# vi /var/lib/lxc/myUS14/rootfs/etc/network/interfaces

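An LXC container obtains its IP address from a DHCP server by default, so we leave this file as it is. Because the br01 virtual bridge is now attached to the host's physical network card, myUS14 at this point receives an address in the host's network segment from the same DHCP server the host uses. Next, set a fixed IP address for the container ubuntu14: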

# vi /var/lib/lxc/ubuntu14/config // change the default virtual bridge to br01
# vi /var/lib/lxc/ubuntu14/rootfs/etc/network/interfaces // set the fixed IP 10.1.100.245/24

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address 10.1.100.245
netmask 255.255.255.0
gateway 10.1.100.1
dns-nameservers 168.95.1.1
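
When several containers need fixed addresses, the stanza can be generated instead of typed. This is a small sketch under stated assumptions: `static_stanza` is a hypothetical helper, the output follows the Debian/Ubuntu interfaces(5) layout used in the file above, and the only check it performs is that the gateway lies inside the interface's subnet.

```python
import ipaddress


def static_stanza(iface, cidr, gateway, dns):
    """Build an /etc/network/interfaces stanza for a statically addressed interface."""
    ip = ipaddress.ip_interface(cidr)
    if ipaddress.ip_address(gateway) not in ip.network:
        raise ValueError("gateway is outside the interface's subnet")
    lines = [
        f"auto {iface}",
        f"iface {iface} inet static",
        f"address {ip.ip}",
        f"netmask {ip.network.netmask}",
        f"gateway {gateway}",
        f"dns-nameservers {dns}",
    ]
    return "\n".join(lines) + "\n"


# Values taken from the ubuntu14 configuration above.
STANZA = static_stanza("eth0", "10.1.100.245/24", "10.1.100.1", "168.95.1.1")
print(STANZA)
```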

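Adjust the gateway, IP address, and netmask above to match the network settings of your host (at this point the IP of myUS14 is 10.1.254.239). The two containers should now be able to reach each other through the virtual bridge br01 (press 'Ctrl+a' then 'q' to leave a container's console):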

# lxc-start -n ubuntu14 -d // start the container ubuntu14 in the background
# lxc-console -n ubuntu14 // log in to the container
ubuntu@ubuntu14:~$ ip addr show eth0
49: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:16:3e:49:de:a3 brd ff:ff:ff:ff:ff:ff
inet 10.1.100.245/24 brd 10.1.100.255 scope global eth0
...
ubuntu@ubuntu14:~$ ping -c 3 10.1.100.239 // confirm that the container myUS14 is reachable
PING 10.1.100.239 (10.1.100.239) 56(84) bytes of data.
64 bytes from 10.1.100.239: icmp_seq=1 ttl=64 time=0.091 ms
...

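Connecting to the Internet
The br01 virtual bridge we just built does not provide NAT, so the LXC containers attached to it cannot reach external networks through address translation. On Linux, NAT is implemented with the iptables command. Run the following commands on the host:

// -A: append a rule
// -o: the network interface the packet leaves through
// -s: the source of the packet
// -j: the action to take, here 'MASQUERADE'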
# iptables -t nat -A POSTROUTING -o eth0 -s 10.1.100.0/24 -j MASQUERADE
# iptables -t nat -L -n
Chain PREROUTING (policy ACCEPT)
target prot opt source destination

Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
MASQUERADE all -- 10.0.3.0/24 !10.0.3.0/24
MASQUERADE all -- 10.0.100.0/24 !10.0.100.0/24
MASQUERADE all -- 10.1.100.0/24 0.0.0.0/0
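
To see which traffic the POSTROUTING rules above would translate, the source and destination columns can be matched against an address pair. This is only an approximation for illustration: `is_masqueraded` and `matches` are hypothetical helpers, and because `-L -n` without `-v` does not print the outgoing interface, the `-o eth0` restriction of the new rule is ignored here.

```python
import ipaddress


def matches(spec, address):
    """True if address matches an iptables address column such as 10.0.3.0/24 or !10.0.3.0/24."""
    negate = spec.startswith("!")
    network = ipaddress.ip_network(spec.lstrip("!"))
    inside = ipaddress.ip_address(address) in network
    return inside != negate


def is_masqueraded(rule_lines, src, dst):
    """True if any MASQUERADE rule line covers a packet going from src to dst."""
    for line in rule_lines:
        fields = line.split()
        if len(fields) < 5 or fields[0] != "MASQUERADE":
            continue
        if matches(fields[3], src) and matches(fields[4], dst):
            return True
    return False


# POSTROUTING rows copied from the listing above.
RULES = [
    "MASQUERADE all -- 10.0.3.0/24 !10.0.3.0/24",
    "MASQUERADE all -- 10.0.100.0/24 !10.0.100.0/24",
    "MASQUERADE all -- 10.1.100.0/24 0.0.0.0/0",
]

print(is_masqueraded(RULES, "10.1.100.239", "64.233.189.94"))
```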

Now log in to the container myUS14 and check the connection to the external network:

# lxc-console -n myUS14 // attach to the container myUS14
ubuntu@myUS14:~$ ping -c 3 www.google.com.tw
PING www.google.com.tw (64.233.189.94) 56(84) bytes of data.
64 bytes from tl-in-f94.1e100.net (64.233.189.94): icmp_seq=1 ttl=127 time=37.9 ms
...

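Giving an LXC container exclusive use of a physical network card
The default 'lxcbr0' virtual bridge of the LXC kernel module uses the 'dnsmasq' package to provide DHCP and DNS cache services, and with the help of the iptables command the containers can reach the Internet through NAT. In practice, however, containers are not only used for testing during development, or as a backup for important network services and application systems. They may directly replace physical hosts that are running in production. The network performance of the container then becomes very important, so the LXC kernel module lets an LXC container use a physical network card directly to communicate with external networks.

Before giving an LXC container exclusive use of a network card on the host, check the host's network resources first: the host needs at least two physical network cards for a container to take over one of them. Assuming your host has two physical network cards, eth0 and eth1, and you want the container to own eth1, edit the container's configuration file as follows (using the container myUS14 as the example):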

# vi /var/lib/lxc/myUS14/config

# Template used to create this container: /usr/share/lxc/templates/lxc-ubuntu
# Parameters passed to the template:
# For additional config options, please look at lxc.container.conf(5)

# Common configuration
lxc.include = /usr/share/lxc/config/ubuntu.common.conf

# Container specific configuration
lxc.rootfs = /var/lib/lxc/myUS14/rootfs
lxc.mount = /var/lib/lxc/myUS14/fstab
lxc.utsname = myUS14
lxc.arch = amd64

# Network configuration
#lxc.network.type = veth
#lxc.network.flags = up
#lxc.network.link = lxcbr0
#lxc.network.link = br01
#lxc.network.hwaddr = 00:16:3e:65:9b:85

lxc.network.type = phys
lxc.network.link = eth1

When this is done, restart the container for the setting to take effect.

Supplement
* Linux BRIDGE-STP-HOWTO: About The Linux Modular Bridge And STP

Thursday, August 27, 2015

[ DM Practical MLT] (4) Algorithms - Inferring rudimentary rules

Introduction
Here’s an easy way to find very simple classification rules from a set of instances. Called 1R for 1-rule, it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute. 1R is a simple, cheap method that often comes up with quite good rules for characterizing the structure in data. It turns out that simple rules frequently achieve surprisingly high accuracy. Perhaps this is because the structure underlying many real-world datasets is quite rudimentary, and just one attribute is sufficient to determine the class of an instance quite accurately. In any event, it is always a good plan to try the simplest things first.

The idea is this: we make rules that test a single attribute and branch accordingly. Each branch corresponds to a different value of the attribute. It is obvious what is the best classification to give each branch: use the class that occurs most often in the training data. Then the error rate of the rules can easily be determined. Just count the errors that occur on the training data, that is, the number of instances that do not have the majority class.

Each attribute generates a different set of rules, one rule for every value of the attribute. Evaluate the error rate for each attribute’s rule set and choose the best. It’s that simple! Figure 4.1 shows the algorithm in the form of pseudocode.
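
The procedure just described can be sketched in a few lines of Python. This is an illustrative version, not the book's pseudocode: the function name `one_r` is made up, the data is the nominal weather set listed in the Lab Demo later in this post, and ties (both between classes and between attributes) are simply broken in favour of whichever comes first.

```python
from collections import Counter, defaultdict

HEADERS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = """sunny,hot,high,false,no
sunny,hot,high,true,no
overcast,hot,high,false,yes
rainy,mild,high,false,yes
rainy,cool,normal,false,yes
rainy,cool,normal,true,no
overcast,cool,normal,true,yes
sunny,mild,high,false,no
sunny,cool,normal,false,yes
rainy,mild,normal,false,yes
sunny,mild,normal,true,yes
overcast,mild,high,true,yes
overcast,hot,normal,false,yes
rainy,mild,high,true,no"""
ROWS = [line.split(",") for line in DATA.splitlines()]


def one_r(rows, headers, target):
    """Return (attribute, {value: predicted class}, training errors) for the best single attribute."""
    t = headers.index(target)
    best = None
    for a, name in enumerate(headers):
        if a == t:
            continue
        # For each value of this attribute, count how often each class occurs.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[a]][row[t]] += 1
        # One rule per value: predict the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not belong to the majority class.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (name, rules, errors)
    return best


attribute, rules, errors = one_r(ROWS, HEADERS, "play")
print(attribute, rules, f"{errors}/{len(ROWS)} errors")
```

On this data the outlook and humidity rule sets both make 4 errors out of 14; the strict comparison keeps the first one found, which is outlook.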

To see the 1R method at work, consider the weather data of Table 1.2 (we will encounter it many times again when looking at how learning algorithms work). To classify on the final column, play, 1R considers four sets of rules, one for each attribute. These rules are shown in Table 4.1.

An asterisk indicates that a random choice has been made between two equally likely outcomes. The number of errors is given for each rule, along with the total number of errors for the rule set as a whole. 1R chooses the attribute that produces rules with the smallest number of errors—that is, the first and third rule sets. Arbitrarily breaking the tie between these two rule sets gives:

outlook: sunny -> no  
         overcast -> yes  
         rainy -> yes  

We noted at the outset that the game for the weather data is unspecified. Oddly enough, it is apparently played when it is overcast or rainy but not when it is sunny. Perhaps it’s an indoor pursuit.

Missing values and numeric attributes
Although a very rudimentary learning method, 1R does accommodate both missing values and numeric attributes. It deals with these in simple but effective ways. Missing is treated as just another attribute value so that, for example, if the weather data had contained missing values for the outlook attribute, a rule set formed on outlook would specify four possible class values, one each for sunny, overcast, and rainy and a fourth for missing.

We can convert numeric attributes into nominal ones using a simple discretization method. First, sort the training examples according to the values of the numeric attribute. This produces a sequence of class values. For example, sorting the numeric version of the weather data (Table 1.3) according to the values of temperature produces the sequence:

Discretization involves partitioning this sequence by placing breakpoints in it. One possibility is to place breakpoints wherever the class changes:

yes | no | yes yes yes | no no | yes yes yes | no | yes yes | no  

Choosing breakpoints halfway between the examples on either side places them at 64.5, 66.5, 70.5, 72, 77.5, 80.5, and 84. However, the two instances with value 72 cause a problem because they have the same value of temperature but fall into different classes. The simplest fix is to move the breakpoint at 72 up one example, to 73.5, producing a mixed partition in which no is the majority class.
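
The breakpoint placement can be checked with a short sketch. The temperature values are not shown in the text above; the ones used here are an assumption, namely the sorted temperature column of the numeric weather data as I recall it, which is consistent with the class sequence and the breakpoints quoted. The function name is made up for illustration.

```python
# (temperature, play) pairs, already sorted by temperature.
# Assumed values; they reproduce the class sequence and breakpoints quoted above.
EXAMPLES = [
    (64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
    (71, "no"), (72, "no"), (72, "yes"), (75, "yes"), (75, "yes"),
    (80, "no"), (81, "yes"), (83, "yes"), (85, "no"),
]


def class_change_breakpoints(examples):
    """Place a breakpoint halfway between neighbouring examples whose classes differ."""
    points = []
    for (v1, c1), (v2, c2) in zip(examples, examples[1:]):
        if c1 != c2:
            points.append((v1 + v2) / 2)
    return points


BREAKPOINTS = class_change_breakpoints(EXAMPLES)
print(BREAKPOINTS)
```

The breakpoint at 72 falls exactly on a value shared by two examples of different classes, which is the problem case discussed above.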

A more serious problem is that this procedure tends to form a large number of categories. The 1R method will naturally gravitate toward choosing an attribute that splits into many categories, because this will partition the dataset into many classes, making it more likely that instances will have the same class as the majority in their partition. In fact, the limiting case is an attribute that has a different value for each instance—that is, an identification code attribute that pinpoints instances uniquely—and this will yield a zero error rate on the training set because each partition contains just one instance. Of course, highly branching attributes do not usually perform well on test examples; indeed, the identification code attribute will never predict any examples outside the training set correctly. This phenomenon is known as overfitting.

For 1R, overfitting is likely to occur whenever an attribute has a large number of possible values. Consequently, when discretizing a numeric attribute a rule is adopted that dictates a minimum number of examples of the majority class in each partition. Suppose that minimum is set at three. This eliminates all but two of the preceding partitions. Instead, the partitioning process begins

yes no yes yes | yes...  

ensuring that there are three occurrences of yes, the majority class, in the first partition. However, because the next example is also yes, we lose nothing by including that in the first partition, too. This leads to a new division:

yes no yes yes yes | no no yes yes yes | no yes yes no  

where each partition contains at least three instances of the majority class, except the last one, which will usually have less. Partition boundaries always fall between examples of different classes.

Whenever adjacent partitions have the same majority class, as do the first two partitions above, they can be merged together without affecting the meaning of the rule sets. Thus the final discretization is:

yes no yes yes yes no no yes yes yes | no yes yes no  

which leads to the rule set

temperature: <= 77.5 -> yes  
             > 77.5 -> no  
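
The whole discretization, with a minimum of three majority-class examples per partition followed by merging, can be sketched as below. Treat it as one possible reading of the procedure rather than the reference implementation: the function names are made up, the temperature values are assumed (they are the sorted values consistent with the breakpoints quoted earlier), and class ties are broken in favour of the class seen first, which happens to give "no" for the last partition.

```python
from collections import Counter

# (temperature, play) pairs sorted by temperature; assumed values, see the note above.
EXAMPLES = [
    (64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
    (71, "no"), (72, "no"), (72, "yes"), (75, "yes"), (75, "yes"),
    (80, "no"), (81, "yes"), (83, "yes"), (85, "no"),
]


def majority(bucket):
    """Return (class, count) for the most frequent class; ties go to the class seen first."""
    return Counter(c for _, c in bucket).most_common(1)[0]


def partition(examples, min_bucket=3):
    """Grow each partition until its majority class has min_bucket examples."""
    buckets, current, i = [], [], 0
    while i < len(examples):
        current.append(examples[i])
        i += 1
        cls, count = majority(current)
        if count >= min_bucket:
            # Keep absorbing examples of the majority class, and never split equal values.
            while i < len(examples) and (
                examples[i][1] == cls or examples[i][0] == current[-1][0]
            ):
                current.append(examples[i])
                i += 1
            buckets.append(current)
            current = []
    if current:  # the last partition may fall short of min_bucket
        buckets.append(current)
    return buckets


def merge(buckets):
    """Merge adjacent partitions that predict the same class."""
    merged = [list(buckets[0])]
    for bucket in buckets[1:]:
        if majority(bucket)[0] == majority(merged[-1])[0]:
            merged[-1].extend(bucket)
        else:
            merged.append(list(bucket))
    return merged


def to_rules(buckets):
    """Return [(upper bound, class), ...]; the last rule has no upper bound (None)."""
    rules = []
    for left, right in zip(buckets, buckets[1:]):
        rules.append(((left[-1][0] + right[0][0]) / 2, majority(left)[0]))
    rules.append((None, majority(buckets[-1])[0]))
    return rules


PARTS = partition(EXAMPLES)
RULES = to_rules(merge(PARTS))
print([[c for _, c in part] for part in PARTS])
print(RULES)
```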

Discussion
Surprisingly, despite its simplicity 1R did astonishingly—even embarrassingly—well in comparison with state-of-the-art learning methods, and the rules it produced turned out to be just a few percentage points less accurate, on almost all of the datasets, than the decision trees produced by a state-of-the-art decision tree induction scheme. These trees were, in general, considerably larger than 1R’s rules. Rules that test a single attribute are often a viable alternative to more complex structures, and this strongly encourages a simplicity-first methodology in which the baseline performance is established using simple, rudimentary techniques before progressing to more sophisticated learning methods, which inevitably generate output that is harder for people to interpret.

The 1R procedure learns a one-level decision tree whose leaves represent the various different classes. A slightly more expressive technique is to use a different rule for each class. Each rule is a conjunction of tests, one for each attribute. For numeric attributes the test checks whether the value lies within a given interval; for nominal ones it checks whether it is in a certain subset of that attribute’s values. These two types of tests—intervals and subset—are learned from the training data pertaining to each class. For a numeric attribute, the endpoints of the interval are the minimum and maximum values that occur in the training data for that class. For a nominal one, the subset contains just those values that occur for that attribute in the training data for the class. Rules representing different classes usually overlap, and at prediction time the one with the most matching tests is predicted. This simple technique often gives a useful first impression of a dataset. It is extremely fast and can be applied to very large quantities of data.

Lab Demo
First of all, we have to create our input training data (the first line holds the headers; the one marked with '*' is the target header to predict):
- data/weathers.dat

Outlook,Temperature,Humidity,Windy,*Play  
sunny,hot,high,false,no  
sunny,hot,high,true,no  
overcast,hot,high,false,yes  
rainy,mild,high,false,yes  
rainy,cool,normal,false,yes  
rainy,cool,normal,true,no  
overcast,cool,normal,true,yes  
sunny,mild,high,false,no  
sunny,cool,normal,false,yes  
rainy,mild,normal,false,yes  
sunny,mild,normal,true,yes  
overcast,mild,high,true,yes  
overcast,hot,normal,false,yes  
rainy,mild,high,true,no   

The sample code below shows how 1R works (the complete source code is available from GML):
- OneR.groovy

package dm.basic.oner  
  
import dm.input.SimpleIn  
import flib.util.CountMap  
import flib.util.Tuple  
  
class OneR {  
    public static void main(args)  
    {  
        // Reading training data  
        SimpleIn si = new SimpleIn(new File("data/weathers.dat"))  
          
        int ei=-1;  
        def headers = si.getHeaders()  
        ei=headers.findIndexOf { h-> h.startsWith('*')}  
        printf("\t[Info] Target Header: %d (%s)\n", ei, headers[ei])  
        printf("\t[Info] Start 1R Algorithm...\n")  
        def trainMap = [:]  
        printf("\t\t0) Initialize...\n")  
        headers.size().times{i->  
            if(i==ei) return  
            def cmm = [:].withDefault { k -> return new CountMap()}  
            trainMap[i] = new Tuple(headers[i], [], cmm, new TreeSet())  
        }  
          
        printf("\t\t1) Start Analyze data...\n")  
        Iterator datIter = si.iterator()  
        while(datIter.hasNext())  
        {  
            def r = datIter.next()  
            def p = r[ei]  
            for(int i=0; i<headers.size(); i++)  
            {                                 
                if(i==ei) continue  
                Tuple t = trainMap[i]  
                t.get(1).add(r[i])  
                t.get(2)[r[i]].count(p)  
                t.get(3).add(r[i])  
            }  
        }  
          
        printf("\t\t2) Pick attribute with less error...\n")  
        int gErr=Integer.MAX_VALUE  
        String attr=null  
        for(Tuple t:trainMap.values())  
        {  
            int err=0  
            printf("\t\t\tHeader='%s':\n", t.get(0))  
            for(String v:t.get(3))  
            {  
                CountMap cm = t.get(2)[v]  
                def maj = cm.major()            /*Major category/predict list*/  
                def mc = maj[0]                 /*Pickup first major category/predict*/  
                def mcc = cm.getCount(mc)       /*Major category/predict count*/  
                def tc = cm.size()              /*Total size of this attribute*/  
                printf("\t\t\t\tValue='%s'->%s (%d/%d)\n", v, mc, mcc, tc)  
                err+=(tc-mcc)  
            }  
            if(err<gErr)  
            {  
                gErr=err  
                attr=t.get(0)  
            }  
        }  
          
        printf("\t\t3) Generate 1R: %s\n", attr)  
    }  
}  

Execution Result:
