程式扎記

Source From Here
DataFrames (link)

* Our first look at Pandas
* It works a lot like R (if you come from an R background)
* If you're not familiar with R, some things that Pandas does might seem backwards or contrary to the way Numpy works.
* Goal: Not to show you everything Pandas can do, rather just what we need for ML/data science
* If you have a question about something not covered, just ask!
* Most times: Load in data and immediately convert it into Numpy array
* Most features you won't use often, you'll just forget them

Below we are going to use function read_csv from Pandas to load data from CSV file "data_2d.csv":

>>> import pandas as pd
>>> X = pd.read_csv("tmp/data_2d.csv", header=None)
>>> type(X)

>>> X.info()

RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
0 100 non-null float64
1 100 non-null float64
2 100 non-null float64
dtypes: float64(3)
memory usage: 2.4 KB
>>> X.head() // Show top 5 rows
0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472
2 81.775901 5.737648 181.485108
3 55.854342 70.325902 321.773638
4 49.366550 75.114040 322.465486
>>> X.head(2) // Top 2 rows
0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472

More About DataFrames - Selecting Rows and Columns (link)
Below will show a few functions of DataFrames on selecting rows & columns:

>>> M = X.as_matrix() // Covert DataFrames into numpy array for row/column selection
>>> type(M)

// Numpy: X[0] -> Select 0th row
// Pandas: X[0] -> Select column with name as 0
>>> X[0][:10] // Select column 0 and show first 10 rows of it
0 17.930201
1 97.144697
2 81.775901
3 55.854342
4 49.366550
5 3.192702
6 49.200784
7 21.882804
8 79.509863
9 88.153887
Name: 0, dtype: float64
>>> X.head(10) // Double confirm with the selection
0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472
2 81.775901 5.737648 181.485108
3 55.854342 70.325902 321.773638
4 49.366550 75.114040 322.465486
5 3.192702 29.256299 94.618811
6 49.200784 86.144439 356.348093
7 21.882804 46.841505 181.653769
8 79.509863 87.397356 423.557743
9 88.153887 65.205642 369.229245

>>> type(X[0]) // Pandas use Series to represent column/row

>>> X.iloc[0] // Select 0th row
0 17.930201
1 94.520592
2 320.259530
Name: 0, dtype: float64
>>> X.ix[0] // Select 0th row
0 17.930201
1 94.520592
2 320.259530
Name: 0, dtype: float64
>>> type(X.ix[0])

>>> X[[0,2]][:3] // Select multiple column (0,2) and show top 3 rows of them
0 2
0 17.930201 320.259530
1 97.144697 404.634472
2 81.775901 181.485108
>>> X[ X[0] < 5 ] // Select row(s) of column 0 with value less than 5
0 1 2
5 3.192702 29.256299 94.618811
44 3.593966 96.252217 293.237183
54 4.593463 46.335932 145.818745
90 1.382983 84.944087 252.905653
99 4.142669 52.254726 168.034401
>>> (X[0] < 0)[:3]
0 False
1 False
2 False
Name: 0, dtype: bool
>>> type(X[0] < 0)

Even More About DataFrames - Column Names (link)
Here we will deal with another CSV file "international-airline-passengers.csv" and handle the column names issue:

// Loading the csv file and skip the footer from last three lines
>>> df = pd.read_csv("tmp/international-airline-passengers.csv", engine="python", skipfooter=3)
>>> df.columns
Index(['Month', 'International airline passengers: monthly totals in thousands. Jan 49 ? Dec 60'], dtype='object')
>>> df.columns = ["month", "passengers"] // Change the names of column
>>> df.columns
Index(['month', 'passengers'], dtype='object')
>>> df['passengers'][:3] // Access the first 3 rows of column 'passengers'
0 112
1 118
2 132
Name: passengers, dtype: int64
>>> df.passengers[:3]
0 112
1 118
2 132
Name: passengers, dtype: int64
>>> df['ones'] = 1 // Add new column 'ones' with all value=1
>>> df.head()
month passengers ones
0 1949-01 112 1
1 1949-02 118 1
2 1949-03 132 1
3 1949-04 129 1
4 1949-05 121 1

The apply() function (link)
What if we want to assign a new column value where each cell is derived from the values already in its row?

>>> from datetime import datetime
>>> datetime.strptime('1949-05', "%Y-%m")
datetime.datetime(1949, 5, 1, 0, 0)
>>> df['dt'] = df.apply(lambda row: datetime.strptime(row['month'], '%Y-%m'), axis=1) // Add one column 'dt' to translate column 'month' into dateteime object
>>> df.info()

RangeIndex: 144 entries, 0 to 143
Data columns (total 4 columns):
month 144 non-null object
passengers 144 non-null int64
ones 144 non-null int64
dt 144 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 4.6+ KB

Joins (link)
Consider we have below two CSV files:
- table1.csv

view plaincopy to clipboardprint?
user_id,email,age  
1,alic@gmail.com,20  
2,bob@gmail.com,25  
3,carol@gmail.com,30  

- table2.csv

view plaincopy to clipboardprint?
user_id,ad_id,click  
1,1,1  
1,2,0  
1,5,0  
2,3,0  
2,4,1  
2,1,0  
3,2,0  
3,1,0  
3,3,0  
3,4,0  
3,5,0  

Below we are going to join those two CSV files by column 'user_id':

>>> import pandas as pd
>>> t1 = pd.read_csv('table1.csv')
>>> t2 = pd.read_csv('table2.csv')
>>> t1
user_id email age
0 1 alic@gmail.com 20
1 2 bob@gmail.com 25
2 3 carol@gmail.com 30
>>> t2
user_id ad_id click
0 1 1 1
1 1 2 0
2 1 5 0
3 2 3 0
4 2 4 1
5 2 1 0
6 3 2 0
7 3 1 0
8 3 3 0
9 3 4 0
10 3 5 0
>>> m = pd.merge(t1, t2, on='user_id') // Join two DataFrame by column 'user_id'
>>> m.ix[0] // Show the first row
user_id 1
email alic@gmail.com
age 20
ad_id 1
click 1
Name: 0, dtype: object
>>> t1.merge(t2, on='user_id') // Another way to join two DataFrame
user_id email age ad_id click
0 1 alic@gmail.com 20 1 1
1 1 alic@gmail.com 20 2 0
2 1 alic@gmail.com 20 5 0
3 2 bob@gmail.com 25 3 0
4 2 bob@gmail.com 25 4 1
5 2 bob@gmail.com 25 1 0
6 3 carol@gmail.com 30 2 0
7 3 carol@gmail.com 30 1 0
8 3 carol@gmail.com 30 3 0
9 3 carol@gmail.com 30 4 0
10 3 carol@gmail.com 30 5 0

程式扎記

標籤

2017年6月3日星期六

[ Python 套件 ] Pandas - Basic Introduction For ML/DataScience

1 則留言:

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2017年6月3日 星期六