Saturday, June 3, 2017

[ Python Package ] Pandas - Basic Introduction For ML/DataScience

Source From Here
DataFrames
* Our first look at Pandas
* It works a lot like R (if you come from an R background)
* If you're not familiar with R, some things Pandas does may seem backwards or contrary to the way NumPy works
* Goal: not to show everything Pandas can do, just what we need for ML/data science
* If you have a question about something not covered, just ask!
* Most of the time: load the data and immediately convert it into a NumPy array
* You won't use most features often, so you'll likely forget them anyway

Below we use the Pandas function read_csv to load data from the CSV file "data_2d.csv":
>>> import pandas as pd
>>> X = pd.read_csv("tmp/data_2d.csv", header=None)
>>> type(X)
<class 'pandas.core.frame.DataFrame'>

>>> X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
0 100 non-null float64
1 100 non-null float64
2 100 non-null float64
dtypes: float64(3)
memory usage: 2.4 KB

>>> X.head()  # Show the top 5 rows
0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472
2 81.775901 5.737648 181.485108
3 55.854342 70.325902 321.773638
4 49.366550 75.114040 322.465486

>>> X.head(2)  # Show the top 2 rows
0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472
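
The notes above say the usual workflow is to load the data and convert it straight into a NumPy array. Here is a minimal sketch of that workflow. The inline CSV text (the first three rows shown by X.head() above) stands in for tmp/data_2d.csv, and the split into `features` and `target` assumes the last column is the target, which the original does not state:

```python
import io

import numpy as np
import pandas as pd

# Stand-in for tmp/data_2d.csv: the first three rows shown by X.head() above.
csv_text = """17.930201,94.520592,320.259530
97.144697,69.593282,404.634472
81.775901,5.737648,181.485108
"""

# header=None tells pandas the file has no header row, so columns are named 0, 1, 2.
X = pd.read_csv(io.StringIO(csv_text), header=None)

# Convert the DataFrame to a plain NumPy array for use with ML code.
M = X.to_numpy()

# Assumption for illustration only: the last column is the target, the rest are features.
features = M[:, :-1]
target = M[:, -1]

print(M.shape)         # (3, 3)
print(features.shape)  # (3, 2)
print(target.shape)    # (3,)
```

Without header=None, pandas would treat the first data row as column names and silently drop it from the data.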

More About DataFrames - Selecting Rows and Columns
The following shows a few ways to select rows and columns from a DataFrame:
>>> M = X.to_numpy()  # Convert the DataFrame into a NumPy array (as_matrix() was removed in pandas 1.0)
>>> type(M)
<class 'numpy.ndarray'>

# NumPy:  M[0] -> selects the 0th row
# Pandas: X[0] -> selects the column named 0

>>> X[0][:10]  # Select column 0 and show its first 10 rows
0 17.930201
1 97.144697
2 81.775901
3 55.854342
4 49.366550
5 3.192702
6 49.200784
7 21.882804
8 79.509863
9 88.153887
Name: 0, dtype: float64

>>> X.head(10)  # Cross-check the selection against the first 10 rows
0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472
2 81.775901 5.737648 181.485108
3 55.854342 70.325902 321.773638
4 49.366550 75.114040 322.465486
5 3.192702 29.256299 94.618811
6 49.200784 86.144439 356.348093
7 21.882804 46.841505 181.653769
8 79.509863 87.397356 423.557743
9 88.153887 65.205642 369.229245


>>> type(X[0])  # Pandas uses a Series to represent a single column or row
<class 'pandas.core.series.Series'>

>>> X.iloc[0]  # Select the 0th row by position
0 17.930201
1 94.520592
2 320.259530
Name: 0, dtype: float64

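>>> X.loc[0]  # Select the row labeled 0 (.ix was removed in pandas 1.0; use .loc or .iloc)
0 17.930201
1 94.520592
2 320.259530
Name: 0, dtype: float64

>>> type(X.loc[0])
<class 'pandas.core.series.Series'>

>>> X[[0,2]][:3]  # Select multiple columns (0 and 2) and show their top 3 rows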
0 2
0 17.930201 320.259530
1 97.144697 404.634472
2 81.775901 181.485108

>>> X[X[0] < 5]  # Select the rows whose value in column 0 is less than 5
0 1 2
5 3.192702 29.256299 94.618811
44 3.593966 96.252217 293.237183
54 4.593463 46.335932 145.818745
90 1.382983 84.944087 252.905653
99 4.142669 52.254726 168.034401

>>> (X[0] < 0)[:3]
0 False
1 False
2 False
Name: 0, dtype: bool

>>> type(X[0] < 0)
<class 'pandas.core.series.Series'>
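
The difference between NumPy and Pandas indexing is easy to trip over, so here is a small side-by-side sketch. The 3x3 frame is made-up data, chosen only so the results are easy to read:

```python
import numpy as np
import pandas as pd

# Small made-up frame with integer column names, like one loaded with header=None.
X = pd.DataFrame([[1.0, 10.0, 100.0],
                  [7.0, 20.0, 200.0],
                  [3.0, 30.0, 300.0]])
M = X.to_numpy()

row0_numpy = M[0]        # NumPy: [0] selects the first ROW
col0_pandas = X[0]       # Pandas: [0] selects the COLUMN named 0
row0_pandas = X.iloc[0]  # Pandas: .iloc[0] selects the first row by position

mask = X[0] < 5          # boolean Series, one entry per row
small = X[mask]          # keeps only the rows where the mask is True

print(row0_numpy.tolist())   # [1.0, 10.0, 100.0]
print(col0_pandas.tolist())  # [1.0, 7.0, 3.0]
print(row0_pandas.tolist())  # [1.0, 10.0, 100.0]
print(small.index.tolist())  # [0, 2]
```

Note that the filtered frame keeps the original row labels (0 and 2), which is why the filtered output earlier shows indices such as 5, 44 and 54.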

Even More About DataFrames - Column Names
Here we work with another CSV file, "international-airline-passengers.csv", and tidy up its column names:
# Load the CSV file, skipping the three footer lines at the end
>>> df = pd.read_csv("tmp/international-airline-passengers.csv", engine="python", skipfooter=3)
>>> df.columns
Index(['Month', 'International airline passengers: monthly totals in thousands. Jan 49 ? Dec 60'], dtype='object')
>>> df.columns = ["month", "passengers"]  # Rename the columns
>>> df.columns
Index(['month', 'passengers'], dtype='object')
>>> df['passengers'][:3]  # Access the first 3 rows of column 'passengers'
0 112
1 118
2 132
Name: passengers, dtype: int64

>>> df.passengers[:3]
0 112
1 118
2 132
Name: passengers, dtype: int64

>>> df['ones'] = 1  # Add a new column 'ones' with every value set to 1
>>> df.head()
month passengers ones
0 1949-01 112 1
1 1949-02 118 1
2 1949-03 132 1
3 1949-04 129 1
4 1949-05 121 1

The apply() function
What if we want to assign a new column value where each cell is derived from the values already in its row?
>>> from datetime import datetime
>>> datetime.strptime('1949-05', "%Y-%m")
datetime.datetime(1949, 5, 1, 0, 0)
>>> df['dt'] = df.apply(lambda row: datetime.strptime(row['month'], '%Y-%m'), axis=1)  # Add a column 'dt' holding column 'month' parsed into datetime objects
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 4 columns):
month 144 non-null object
passengers 144 non-null int64
ones 144 non-null int64
dt 144 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 4.6+ KB
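
With axis=1, apply() passes each row to the function, so the function can read any column of that row. The sketch below repeats the row-wise parse on a made-up three-row frame, then shows two things the original does not cover: a column-wise alternative, pd.to_datetime, and a hypothetical `label` column built from two columns of the same row:

```python
from datetime import datetime

import pandas as pd

# Made-up three-row sample in the same shape as the airline data.
df = pd.DataFrame({"month": ["1949-01", "1949-02", "1949-03"],
                   "passengers": [112, 118, 132]})

# Row-wise version, as in the session above: the lambda receives one row at a time.
df["dt"] = df.apply(lambda row: datetime.strptime(row["month"], "%Y-%m"), axis=1)

# Column-wise alternative: parse the whole column in a single call.
df["dt_vectorized"] = pd.to_datetime(df["month"], format="%Y-%m")

# apply() with axis=1 can also combine several columns of the same row.
df["label"] = df.apply(lambda row: f"{row['month']}: {row['passengers']}", axis=1)

print((df["dt"] == df["dt_vectorized"]).all())  # True
print(df["label"].tolist())
```

For a single-column transformation like this one, the column-wise call is usually faster on large frames; apply() is more useful when the result depends on several columns.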

Joins
Suppose we have the following two CSV files:
- table1.csv
  user_id,email,age
  1,alic@gmail.com,20
  2,bob@gmail.com,25
  3,carol@gmail.com,30
- table2.csv
  user_id,ad_id,click
  1,1,1
  1,2,0
  1,5,0
  2,3,0
  2,4,1
  2,1,0
  3,2,0
  3,1,0
  3,3,0
  3,4,0
  3,5,0
Below we join the two CSV files on the column 'user_id':
>>> import pandas as pd
>>> t1 = pd.read_csv('table1.csv')
>>> t2 = pd.read_csv('table2.csv')
>>> t1
user_id email age
0 1 alic@gmail.com 20
1 2 bob@gmail.com 25
2 3 carol@gmail.com 30

>>> t2
user_id ad_id click
0 1 1 1
1 1 2 0
2 1 5 0
3 2 3 0
4 2 4 1
5 2 1 0
6 3 2 0
7 3 1 0
8 3 3 0
9 3 4 0
10 3 5 0

>>> m = pd.merge(t1, t2, on='user_id')  # Join the two DataFrames on column 'user_id'
>>> m.loc[0]  # Show the first row (.ix was removed in pandas 1.0)
user_id 1
email alic@gmail.com
age 20
ad_id 1
click 1
Name: 0, dtype: object 

>>> t1.merge(t2, on='user_id')  # Another way to join the two DataFrames
user_id email age ad_id click
0 1 alic@gmail.com 20 1 1
1 1 alic@gmail.com 20 2 0
2 1 alic@gmail.com 20 5 0
3 2 bob@gmail.com 25 3 0
4 2 bob@gmail.com 25 4 1
5 2 bob@gmail.com 25 1 0
6 3 carol@gmail.com 30 2 0
7 3 carol@gmail.com 30 1 0
8 3 carol@gmail.com 30 3 0
9 3 carol@gmail.com 30 4 0
10 3 carol@gmail.com 30 5 0
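
Both calls above use the default inner join, which keeps only the user_id values present in both tables. The sketch below contrasts that with how='left', which the original does not cover. It uses table1.csv as given and a deliberately shortened table2 in which user 3 has no rows, so the difference is visible:

```python
import io

import pandas as pd

# table1.csv exactly as listed above.
t1 = pd.read_csv(io.StringIO("""user_id,email,age
1,alic@gmail.com,20
2,bob@gmail.com,25
3,carol@gmail.com,30
"""))

# Shortened version of table2.csv; user 3 has no rows here on purpose.
t2 = pd.read_csv(io.StringIO("""user_id,ad_id,click
1,1,1
1,2,0
2,3,0
"""))

inner = pd.merge(t1, t2, on="user_id")             # default how="inner"
left = pd.merge(t1, t2, on="user_id", how="left")  # keep every row of t1

print(len(inner))  # 3
print(len(left))   # 4
print(left[left["user_id"] == 3]["ad_id"].isna().all())  # True
```

In the left join, user 3 still appears once, with NaN in the ad_id and click columns because there is nothing to match in t2.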

