Pandas
pandas is a high-performance data analysis package for Python that is
widely used in data science. As usual, you can install pandas on your
system with pip.
$ pip3 install pandas
The conventional way to import pandas is:
import pandas as pd
Series
pandas provides two main data structures, Series and DataFrame. A Series is a one-dimensional array with a labeled index, and a DataFrame is the two-dimensional counterpart of a Series. You can define a Series as follows,
>>> sr = pd.Series([1., 3., 5.])
>>> sr
0 1.0
1 3.0
2 5.0
dtype: float64
As the output above shows, a Series stores index and value pairs as its elements.
>>> sr.index
RangeIndex(start=0, stop=3, step=1)
>>> sr.values
array([ 1., 3., 5.])
The result of the values attribute should look familiar: it is a NumPy
array. In both Series and DataFrame, the actual data is stored as NumPy
arrays. The real power of a Series (and of a DataFrame) is its named index.
For example, you can define a Series with an explicit index,
>>> sr = pd.Series([1., 3., 5.], ['a', 'b', 'c'])
>>> sr
a 1.0
b 3.0
c 5.0
dtype: float64
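As a quick check of the two points above, here is a minimal sketch, written as a plain script rather than an interactive session. It confirms that values is a NumPy array and that the labels simply replace the default integer range. The names sr_default and sr_named are only illustrative.

```python
import numpy as np
import pandas as pd

sr_default = pd.Series([1., 3., 5.])
sr_named = pd.Series([1., 3., 5.], index=['a', 'b', 'c'])

# The underlying data is a NumPy array in both cases.
print(isinstance(sr_default.values, np.ndarray))  # True
print(isinstance(sr_named.values, np.ndarray))    # True

# Only the index differs: a default integer range versus explicit labels.
print(list(sr_default.index))  # [0, 1, 2]
print(list(sr_named.index))    # ['a', 'b', 'c']
```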
Now you can access your data by index label, using list-like bracket notation,
>>> sr['a']
1.0
Or by using attribute access on the object,
>>> sr.c
5.0
You can also select data conditionally,
>>> sr[sr > 2.0]
b 3.0
c 5.0
dtype: float64
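To see why conditional selection works, it can help to look at the intermediate result. The sketch below, in plain script form, shows that the comparison itself produces a boolean Series, which is then used as a filter. The names mask, selected and middle are illustrative.

```python
import pandas as pd

sr = pd.Series([1., 3., 5.], index=['a', 'b', 'c'])

# The comparison is evaluated element-wise and yields a boolean Series.
mask = sr > 2.0
print(mask.tolist())          # [False, True, True]

# Indexing with the mask keeps the entries where it is True.
selected = sr[mask]
print(selected.tolist())      # [3.0, 5.0]

# Conditions can be combined with & (and) and | (or); parentheses are required.
middle = sr[(sr > 2.0) & (sr < 5.0)]
print(middle.index.tolist())  # ['b']
```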
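A common way to create a Series is from a dictionary. The output below comes from an older pandas release, which sorts the keys; recent releases keep the dictionary's insertion order.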
>>> d = {'bob': 100, 'alice': 25, 'chris': 75}
>>> sr_data = pd.Series(d)
>>> sr_data
alice 25
bob 100
chris 75
dtype: int64
>>> sr_data['bob'] + sr_data['alice']
125
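A Series built from a dictionary keeps some dictionary-like behavior, which the following sketch illustrates. The label 'dave' is a made-up example of a key that is not present.

```python
import pandas as pd

d = {'bob': 100, 'alice': 25, 'chris': 75}
sr_data = pd.Series(d)

# Membership tests look at the index labels, as with dictionary keys.
print('bob' in sr_data)        # True
print('dave' in sr_data)       # False

# get() returns a default instead of raising KeyError for a missing label.
print(sr_data.get('dave', 0))  # 0

# Unlike a dict, a Series also supports reductions over all its values.
print(sr_data.sum())           # 200
```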
DataFrame
A DataFrame stores one Series per column, which makes the data two-dimensional. Like a Series, a DataFrame can be initialized from a dictionary. The output below comes from an older pandas release, which sorts the columns alphabetically; recent releases keep the dictionary's key order.
>>> d = {'gender': ['M', 'F', 'M'],
... 'money': [100, 25, 75],
... 'age': [23, 13, 38] }
>>> df = pd.DataFrame(d)
>>> df
age gender money
0 23 M 100
1 13 F 25
2 38 M 75
If you want to specify the order of columns,
>>> df = pd.DataFrame(d, columns=['money', 'age', 'gender'])
>>> df
money age gender
0 100 23 M
1 25 13 F
2 75 38 M
For a DataFrame, you can also specify the index labels.
>>> df = pd.DataFrame(d, columns=['money', 'age', 'gender'], index=['bob', 'alice', 'chris'])
>>> df
money age gender
bob 100 23 M
alice 25 13 F
chris 75 38 M
You can add another column to an existing DataFrame,
>>> df['debt'] = [300, 0, 2000]
>>> df
money age gender debt
bob 100 23 M 300
alice 25 13 F 0
chris 75 38 M 2000
As in the Series example, conditional selection is possible,
>>> df[df['money'] - df['debt'] > 0]
money age gender debt
alice 25 13 F 0
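The condition above can be broken into steps to make the mechanics visible. This is a minimal sketch using only the money and debt columns from the example. The names net and mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'money': [100, 25, 75], 'debt': [300, 0, 2000]},
                  index=['bob', 'alice', 'chris'])

# Column arithmetic is element-wise and keeps the row labels.
net = df['money'] - df['debt']
print(net.tolist())             # [-200, 25, -1925]

# The comparison gives one True/False per row; only rows marked True are kept.
mask = net > 0
print(mask.tolist())            # [False, True, False]
print(df[mask].index.tolist())  # ['alice']
```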
If you want to select a particular row, you might try the following, for example,
>>> df['bob']
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2522, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'bob'
This fails, because bracket indexing on a DataFrame looks up columns,
not rows. To select a particular row by its label, use the DataFrame's
loc indexer.
>>> df.loc['bob']
money 100
age 23
gender M
debt 300
Name: bob, dtype: object
>>> df.loc['bob']['money']
100
>>> df.loc['bob','money']
100
>>> df.loc['bob'].money
100
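The difference between bracket indexing and loc can be summarized in a short sketch. It rebuilds the same DataFrame in script form. The name subset is illustrative, and the multi-label and boolean forms of loc shown here go slightly beyond the single-label lookups above.

```python
import pandas as pd

df = pd.DataFrame({'money': [100, 25, 75],
                   'age': [23, 13, 38],
                   'gender': ['M', 'F', 'M'],
                   'debt': [300, 0, 2000]},
                  index=['bob', 'alice', 'chris'])

# Brackets select a column; loc selects rows (and optionally columns) by label.
print(df['money'].tolist())         # [100, 25, 75]

# Lists of labels return a smaller DataFrame.
subset = df.loc[['bob', 'chris'], ['money', 'debt']]
print(subset.shape)                 # (2, 2)
print(subset.loc['chris', 'debt'])  # 2000

# loc also accepts a boolean condition for the rows.
print(df.loc[df['age'] > 20, 'money'].tolist())  # [100, 75]
```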
Arithmetic between DataFrames is performed element-wise, after matching values by column name and index label. For example, suppose we have another DataFrame,
>>> df2 = pd.DataFrame({'money': [20, 10, 200],
... 'debt': [-100, 20, -170]},
... index = ['chris', 'alice', 'bob'] )
>>> df2
debt money
chris -100 20
alice 20 10
bob -170 200
If we add df2 to df,
>>> df + df2
age debt gender money
alice NaN 20 NaN 35
bob NaN 130 NaN 300
chris NaN 1900 NaN 95
Wait, this is not what we expected. NaN, or “Not a Number”, represents
“missing data” in pandas. Because the “gender” and “age” columns do not
exist in the df2 DataFrame, these fields are treated as “missing data”
in the result. We can fill in the “missing” data with the original
values by using the combine_first method.
>>> df3 = (df + df2).combine_first(df)
>>> df3
age debt gender money
alice 13.0 20 F 35
bob 23.0 130 M 300
chris 38.0 1900 M 95
The combine_first method fills each missing value with the value found
at the same row and column label in the given DataFrame.
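The same label alignment and filling behavior can be seen on a smaller scale with two Series. The sketch below uses made-up Series s1 and s2 that share only the label 'b'.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

# Addition pairs values by label; labels present on only one side give NaN.
total = s1 + s2
print(total.isna().tolist())   # [True, False, True]
print(total['b'])              # 12.0

# combine_first keeps existing values and fills the gaps from its argument.
filled = total.combine_first(s1)
print(filled['a'])             # 1.0
print(filled.isna().tolist())  # [False, False, True]
```

The label 'c' stays missing after combine_first because s1 has no value for it either.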
We can also compute summary statistics from a DataFrame,
>>> df3.money.min()
35
>>> df3.money.idxmin()
'alice'
>>> df3.money.mean()
143.33333333333334
>>> df3.money.median()
95.0
If you want to see the summary statistics of all numeric columns at a glance,
>>> df3.describe()
age debt money
count 3.000000 3.000000 3.000000
mean 24.666667 683.333333 143.333333
std 12.583057 1055.098732 138.954429
min 13.000000 20.000000 35.000000
25% 18.000000 75.000000 65.000000
50% 23.000000 130.000000 95.000000
75% 30.500000 1015.000000 197.500000
max 38.000000 1900.000000 300.000000
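The rows of the describe output correspond to the individual methods shown earlier. As a small sketch, the money column from df3 is rebuilt here as a standalone Series (the names money and summary are illustrative) to check two of those correspondences.

```python
import pandas as pd

money = pd.Series([35, 300, 95], index=['alice', 'bob', 'chris'])

# describe() on a Series returns a Series labeled by statistic name.
summary = money.describe()
print(summary['count'])                  # 3.0

# The 50% row is the median.
print(summary['50%'] == money.median())  # True

# The std row is the sample standard deviation (ddof=1), the default of std().
print(round(summary['std'], 6))          # 138.954429

# The population standard deviation (ddof=0) is smaller.
print(money.std(ddof=0) < money.std())   # True
```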