Monday, July 24, 2017

[ Python FAQ ] Pandas - Select rows from a DataFrame based on values in a column in pandas

Source From Here 
Question 
How to select rows from a DataFrame based on values in some column in pandas? In SQL I would use: 
  select * from table where column_name = some_value
I tried to look at pandas documentation but did not immediately find the answer. 

How-To 
To select rows whose column value equals a scalar, some_value, use == (Indexing and Selecting Data): 
  df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin: 
  df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &: 
  df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
To select rows whose column value does not equal some_value, use !=: 
  df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~: 
  df.loc[~df['column_name'].isin(some_values)]
For example, 
>>> import pandas as pd 
>>> import numpy as np 
>>> data_dict = {} 
>>> data_dict['A'] = 'foo bar foo bar foo bar foo foo'.split() 
>>> data_dict['B'] = 'one one two three two two one three'.split() 
>>> data_dict['C'] = np.arange(8) 
>>> data_dict['D'] = np.arange(8) * 2 
>>> df = pd.DataFrame(data_dict) 
>>> print(df) 
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
3  bar  three  3   6
4  foo    two  4   8
5  bar    two  5  10
6  foo    one  6  12
7  foo  three  7  14
 
>>> print(df.loc[df['A'] == 'foo'])  # Select row(s) where column A is 'foo'
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
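
The example above filters on a single column. As a minimal sketch of the `&` and `!=` forms listed earlier, the snippet below rebuilds the same `df` and combines two conditions. The names `mask` and `result` are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame used above.
df = pd.DataFrame({
    'A': 'foo bar foo bar foo bar foo foo'.split(),
    'B': 'one one two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})

# Each comparison is wrapped in parentheses because & binds more
# tightly than == and != in Python.
mask = (df['A'] == 'foo') & (df['B'] != 'one')
result = df.loc[mask]
print(result)
```

Building the mask as a separate variable is optional; it only makes the two conditions easier to read and reuse.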

If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin: 
>>> print(df.loc[df['B'].isin(['one', 'three'])])  # Select row(s) where column B is 'one' or 'three'
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14
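
The negated form with `~` mentioned in the How-To list can be sketched the same way. This is a small illustration on the same example data, and the name `excluded` is hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame used above.
df = pd.DataFrame({
    'A': 'foo bar foo bar foo bar foo foo'.split(),
    'B': 'one one two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})

# isin returns a boolean Series; ~ inverts it element-wise, so this keeps
# only the rows whose B value is NOT 'one' or 'three'.
excluded = df.loc[~df['B'].isin(['one', 'three'])]
print(excluded)
```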

Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use DataFrame.loc: 
>>> df = df.set_index(['B']) 
>>> df.__class__ 
<class 'pandas.core.frame.DataFrame'>
>>> df.info 
<bound method DataFrame.info of          A  C   D
B                
one    foo  0   0
one    bar  1   2
two    foo  2   4
three  bar  3   6
two    foo  4   8
two    bar  5  10
one    foo  6  12
three  foo  7  14>
>>> print(df.loc['one'])  # Select row(s) where index B is 'one'
       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12
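
Once a column has been moved into the index, selecting several index values at once can be done with `isin` on the index itself. The following is a minimal sketch under that assumption; `indexed` and `subset` are illustrative names, not part of the original post.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame and move column B into the index.
df = pd.DataFrame({
    'A': 'foo bar foo bar foo bar foo foo'.split(),
    'B': 'one one two three two two one three'.split(),
    'C': np.arange(8),
    'D': np.arange(8) * 2,
})
indexed = df.set_index('B')

# Index.isin gives a boolean array over the index labels, so the rows
# keep their original order instead of being grouped by label.
subset = indexed.loc[indexed.index.isin(['one', 'two'])]
print(subset)
```

Passing a list of labels directly, as in `indexed.loc[['one', 'two']]`, also works, but it returns the rows grouped in the order of the list rather than in their original order.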

