Source From HereDataFrames (
link)
* Our first look at Pandas
* It works a lot like R (if you come from an R background)
* If you're not familiar with R, some things that Pandas does might seem backwards or contrary to the way Numpy works.
* Goal: Not to show you everything Pandas can do, rather just what we need for ML/data science
* If you have a question about something not covered, just ask!
* Most times: Load in data and immediately convert it into Numpy array
* Most features you won't use often, you'll just forget them
Below we are going to use function
read_csv from Pandas to load data from CSV file "
data_2d.csv":
>>> import pandas as pd
>>> X = pd.read_csv("tmp/data_2d.csv", header=None)
>>> type(X)
>>> X.info()
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
0 100 non-null float64
1 100 non-null float64
2 100 non-null float64
dtypes: float64(3)
memory usage: 2.4 KB
>>> X.head() // Show top 5 rows
0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472
2 81.775901 5.737648 181.485108
3 55.854342 70.325902 321.773638
4 49.366550 75.114040 322.465486
>>> X.head(2) // Top 2 rows
0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472
More About DataFrames - Selecting Rows and Columns (
link)
Below will show a few functions of
DataFrames on selecting rows & columns:
>>> M = X.as_matrix() // Covert DataFrames into numpy array for row/column selection
>>> type(M)// Numpy: X[0] -> Select 0th row
// Pandas: X[0] -> Select column with name as 0 >>> X[0][:10] // Select column 0 and show first 10 rows of it0 17.930201
1 97.144697
2 81.775901
3 55.854342
4 49.366550
5 3.192702
6 49.200784
7 21.882804
8 79.509863
9 88.153887
Name: 0, dtype: float64>>> X.head(10) // Double confirm with the selection0 1 2
0 17.930201 94.520592 320.259530
1 97.144697 69.593282 404.634472
2 81.775901 5.737648 181.485108
3 55.854342 70.325902 321.773638
4 49.366550 75.114040 322.465486
5 3.192702 29.256299 94.618811
6 49.200784 86.144439 356.348093
7 21.882804 46.841505 181.653769
8 79.509863 87.397356 423.557743
9 88.153887 65.205642 369.229245>>> type(X[0]) // Pandas use Series to represent column/row>>> X.iloc[0] // Select 0th row0 17.930201
1 94.520592
2 320.259530
Name: 0, dtype: float64>>> X.ix[0] // Select 0th row0 17.930201
1 94.520592
2 320.259530
Name: 0, dtype: float64>>> type(X.ix[0])>>> X[[0,2]][:3] // Select multiple column (0,2) and show top 3 rows of them0 2
0 17.930201 320.259530
1 97.144697 404.634472
2 81.775901 181.485108>>> X[ X[0] < 5 ] // Select row(s) of column 0 with value less than 50 1 2
5 3.192702 29.256299 94.618811
44 3.593966 96.252217 293.237183
54 4.593463 46.335932 145.818745
90 1.382983 84.944087 252.905653
99 4.142669 52.254726 168.034401>>> (X[0] < 0)[:3]0 False
1 False
2 False
Name: 0, dtype: bool>>> type(X[0] < 0)
Even More About DataFrames - Column Names (
link)
Here we will deal with another CSV file "
international-airline-passengers.csv" and handle the column names issue:
// Loading the csv file and skip the footer from last three lines
>>> df = pd.read_csv("tmp/international-airline-passengers.csv", engine="python", skipfooter=3)
>>> df.columns
Index(['Month', 'International airline passengers: monthly totals in thousands. Jan 49 ? Dec 60'], dtype='object')
>>> df.columns = ["month", "passengers"] // Change the names of column
>>> df.columns
Index(['month', 'passengers'], dtype='object')
>>> df['passengers'][:3] // Access the first 3 rows of column 'passengers'
0 112
1 118
2 132
Name: passengers, dtype: int64
>>> df.passengers[:3]
0 112
1 118
2 132
Name: passengers, dtype: int64
>>> df['ones'] = 1 // Add new column 'ones' with all value=1
>>> df.head()
month passengers ones
0 1949-01 112 1
1 1949-02 118 1
2 1949-03 132 1
3 1949-04 129 1
4 1949-05 121 1
The apply() function (
link)
What if we want to assign a new column value where each cell is derived from the values already in its row?
>>> from datetime import datetime
>>> datetime.strptime('1949-05', "%Y-%m")
datetime.datetime(1949, 5, 1, 0, 0)
>>> df['dt'] = df.apply(lambda row: datetime.strptime(row['month'], '%Y-%m'), axis=1) // Add one column 'dt' to translate column 'month' into dateteime object
>>> df.info()
RangeIndex: 144 entries, 0 to 143
Data columns (total 4 columns):
month 144 non-null object
passengers 144 non-null int64
ones 144 non-null int64
dt 144 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 4.6+ KB
Joins (
link)
Consider we have below two CSV files:
- table1.csv
- user_id,email,age
- 1,alic@gmail.com,20
- 2,bob@gmail.com,25
- 3,carol@gmail.com,30
- table2.csv
- user_id,ad_id,click
- 1,1,1
- 1,2,0
- 1,5,0
- 2,3,0
- 2,4,1
- 2,1,0
- 3,2,0
- 3,1,0
- 3,3,0
- 3,4,0
- 3,5,0
Below we are going to join those two CSV files by column 'user_id':
https://stackoverflow.com/questions/11285613/selecting-columns-in-a-pandas-dataframe
回覆刪除