Source From Here
Question
I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:
- a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
- df = pd.DataFrame(a)
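To see why conversion is needed in the first place, here is a quick sketch of the question's example: because every cell was typed as a string, recent pandas versions store all three columns with the generic object dtype, so nothing is numeric yet.

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

# every cell was given as a string, so the columns start out as object dtype
print(df.dtypes)
```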
How-To
You have four main options for converting types in pandas:
1. to_numeric()
2. astype()
3. infer_objects()
4. convert_dtypes()
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric(). This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame:
- >>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
- >>> s
- 0 8
- 1 6
- 2 7.5
- 3 3
- 4 0.9
- dtype: object
- >>> pd.to_numeric(s) # convert everything to float values
- 0 8.0
- 1 6.0
- 2 7.5
- 3 3.0
- 4 0.9
- dtype: float64
- # convert Series
- my_series = pd.to_numeric(my_series)
- # convert column "a" of a DataFrame
- df["a"] = pd.to_numeric(df["a"])
- # convert all columns of a DataFrame
- df = df.apply(pd.to_numeric)
- # convert just columns "a" and "b"
- df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
Error handling
But what if some values can't be converted to a numeric type? to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values. Here's an example using a Series of strings s which has the object dtype:
- >>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
- >>> s
- 0 1
- 1 2
- 2 4.7
- 3 pandas
- 4 10
- dtype: object
- >>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
- ValueError: Unable to parse string
- >>> pd.to_numeric(s, errors='coerce')
- 0 1.0
- 1 2.0
- 2 4.7
- 3 NaN
- 4 10.0
- dtype: float64
- >>> pd.to_numeric(s, errors='ignore')
- # the original Series is returned untouched
This last option is particularly useful for applying the conversion to an entire DataFrame where some columns cannot be converted:
- df = df.apply(pd.to_numeric, errors='ignore')
Note that errors='ignore' has been deprecated in recent versions of pandas; with newer releases, use errors='coerce' or handle the exception yourself instead.
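A common follow-up to errors='coerce' is deciding what to do with the NaN values it produces. A minimal sketch of two typical choices, filling with a sentinel value or dropping the failed rows:

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])

# coerce the unparseable entry to NaN, then decide how to handle it
converted = pd.to_numeric(s, errors='coerce')

filled = converted.fillna(0)   # replace failures with a sentinel value
dropped = converted.dropna()   # or discard them entirely
print(filled.tolist())
```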
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform). That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32 or int8?
to_numeric() gives you the option to downcast to 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple Series s of integer type:
- >>> s = pd.Series([1, 2, -7])
- >>> s
- 0 1
- 1 2
- 2 -7
- dtype: int64
- >>> pd.to_numeric(s, downcast='integer')
- 0 1
- 1 2
- 2 -7
- dtype: int8
- >>> pd.to_numeric(s, downcast='float')
- 0 1.0
- 1 2.0
- 2 -7.0
- dtype: float32
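The memory saving is easy to measure with Series.memory_usage(). A small sketch: 1000 integers that fit comfortably in int16 take roughly a quarter of the space after downcasting.

```python
import pandas as pd

s = pd.Series(range(1000))  # stored in the platform's default integer width

small = pd.to_numeric(s, downcast='integer')  # values 0-999 fit in int16

# downcasting shrinks the per-element storage from 8 (or 4) bytes to 2
print(s.memory_usage(), small.memory_usage())
```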
2. astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try to go from any one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype). Call the method on the object you want to convert and astype() will try and convert it for you:
- # convert all DataFrame columns to the int64 dtype
- df = df.astype(int)
- # convert column "a" to int64 dtype and "b" to complex type
- df = df.astype({"a": int, "b": complex})
- # convert Series to float16 type
- s = s.astype(np.float16)
- # convert Series to Python strings
- s = s.astype(str)
- # convert Series to categorical type - see docs for more details
- s = s.astype('category')
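Applied to the question's table, a sketch using the dictionary form (default integer column labels, since the DataFrame was built without names):

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

# cast the two numeric columns explicitly; column 0 keeps its strings
df = df.astype({1: float, 2: float})
print(df.dtypes)
```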
Be careful
astype() is powerful, but it will sometimes convert values "incorrectly". For example, integers that don't fit the target dtype silently wrap around instead of raising an error:
- >>> s = pd.Series([1, 2, -7])
- >>> s
- 0 1
- 1 2
- 2 -7
- dtype: int64
- >>> s.astype(np.uint8)
- 0 1
- 1 2
- 2 249
- dtype: uint8
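One way to guard against this is to check the value range before casting. The safe_astype helper below is purely illustrative (it is not a pandas API); it uses np.iinfo to look up the bounds of the target integer type and refuses the cast if any value would overflow:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, -7])

def safe_astype(series, dtype):
    """Cast only if every value fits the target integer type (illustrative helper)."""
    info = np.iinfo(dtype)
    if series.min() < info.min or series.max() > info.max:
        raise OverflowError(f"values do not fit in {np.dtype(dtype)}")
    return series.astype(dtype)

print(safe_astype(s, np.int8))   # fine: -7..2 fits in -128..127
# safe_astype(s, np.uint8) would raise OverflowError instead of silently producing 249
```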
3. infer_objects()
Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions). For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:
- >>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
- >>> df.dtypes
- a object
- b object
- dtype: object
- >>> df = df.infer_objects()
- >>> df.dtypes
- a int64
- b object
- dtype: object
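Note what happened to column 'b': it still holds strings, because infer_objects() only performs soft conversions on columns that already contain objects of a genuinely numeric type; it never parses strings. A sketch of finishing the job with to_numeric():

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')
df = df.infer_objects()

# 'b' is untouched by infer_objects(), so convert it explicitly
df['b'] = pd.to_numeric(df['b'])
print(df.dtypes)
```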
4. convert_dtypes()
Pandas version 1.0 and above includes the method convert_dtypes(), which converts Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.
Here "best possible" means the type most suited to hold the values. For example, this is a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, and a column of NumPy int32 values becomes the pandas dtype Int32.
With our object DataFrame df, we get the following result:
- >>> df.convert_dtypes().dtypes
- a Int64
- b string
- dtype: object
By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:
- >>> df.convert_dtypes(infer_objects=False).dtypes
- a object
- b string
- dtype: object
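The pd.NA support is the main practical payoff: with classic NumPy dtypes, a single missing value forces an integer column into float64, whereas convert_dtypes() keeps the values as integers in the nullable Int64 dtype. A small sketch:

```python
import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)                 # float64: the missing value forces a float fallback

converted = s.convert_dtypes()
print(converted.dtype)         # Int64: nullable integer, the gap becomes pd.NA
```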