Pandas

pandas is a high-performance data analysis package for Python that is widely used in data science. As usual, you can install it on your system with pip.

$ pip3 install pandas

The conventional way to import pandas is:

import pandas as pd
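
A quick way to confirm that the installation and the import both worked is to print the installed version. This is only a sanity check; the exact version string depends on your environment.

```python
import pandas as pd

# The version string differs from machine to machine,
# so no particular output is shown here.
print(pd.__version__)
```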

Series

pandas provides two main data structures, Series and DataFrame. A Series is a one-dimensional array with a labeled index, and a DataFrame is its two-dimensional counterpart. You can define a Series as follows:

>>> sr = pd.Series([1., 3., 5.])
>>> sr
0    1.0
1    3.0
2    5.0
dtype: float64

As the output above shows, each element of a Series is an index and value pair.

>>> sr.index
RangeIndex(start=0, stop=3, step=1)
>>> sr.values
array([1., 3., 5.])

The result of the values attribute should look familiar: in both Series and DataFrame, the actual data is stored as a NumPy array. The real strength of a Series (and of a DataFrame) is its named index. For example, you can define a Series with explicit index labels:

>>> sr = pd.Series([1., 3., 5.], ['a', 'b', 'c'])
>>> sr
a    1.0
b    3.0
c    5.0
dtype: float64
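
To make the relationship between the values and the labels concrete, here is a minimal sketch that checks the type of the underlying data and shows that vectorized arithmetic keeps the labels attached. The variable names (values, labels, doubled) are only illustrative.

```python
import numpy as np
import pandas as pd

sr = pd.Series([1., 3., 5.], index=['a', 'b', 'c'])

# The data is exposed as a NumPy array; the labels live in a separate Index.
values = sr.values
labels = list(sr.index)
print(type(values))
print(labels)

# NumPy-style vectorized arithmetic returns a new Series with the same labels.
doubled = sr * 2
print(doubled)
```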

Now you can access the data by index label, using list-style bracket notation:

>>> sr['a']
1.0

Or by using attribute access:

>>> sr.c
5.0

You can also select data conditionally:

>>> sr[sr > 2.0]
b    3.0
c    5.0
dtype: float64
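
It may help to see what happens in two steps: the comparison produces a boolean Series, and indexing with that Series keeps only the True entries. The sketch below also combines two conditions; the names mask, selected and middle are only illustrative.

```python
import pandas as pd

sr = pd.Series([1., 3., 5.], index=['a', 'b', 'c'])

# The comparison returns a boolean Series with the same labels as sr.
mask = sr > 2.0
print(mask)

# Indexing with the mask keeps only the entries where it is True.
selected = sr[mask]

# Conditions are combined with & (and), | (or) and ~ (not).
# The parentheses around each condition are required.
middle = sr[(sr > 2.0) & (sr < 5.0)]
print(middle)
```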

A common way to create a Series is from a dictionary.

>>> d = {'bob': 100, 'alice': 25, 'chris': 75}
>>> sr_data = pd.Series(d)
>>> sr_data
bob      100
alice     25
chris     75
dtype: int64
>>> sr_data['bob'] + sr_data['alice']
125
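
When building a Series from a dictionary, you can also pass an index to pick and order the entries by key. The sketch below assumes a hypothetical key 'dave' that is not in the dictionary, to show that a label without a matching key gets a missing value.

```python
import pandas as pd

d = {'bob': 100, 'alice': 25, 'chris': 75}

# index= selects and orders entries by key.
# 'dave' is a made-up key that does not exist in d, so its value is missing.
sr_sel = pd.Series(d, index=['alice', 'bob', 'dave'])
print(sr_sel)

# isna() reports which entries are missing.
missing = sr_sel.isna()
print(missing)
```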

DataFrame

A DataFrame stores a Series in each column, which makes the data two-dimensional. Like the Series in the previous example, a DataFrame can be initialized from a dictionary.

>>> d = {'gender': ['M', 'F', 'M'],
...      'money': [100, 25, 75],
...      'age': [23, 13, 38] }
>>> df = pd.DataFrame(d)
>>> df
  gender  money  age
0      M    100   23
1      F     25   13
2      M     75   38

If you want to specify the order of the columns, pass the columns argument:

>>> df = pd.DataFrame(d, columns=['money', 'age', 'gender'])
>>> df
   money  age gender
0    100   23      M
1     25   13      F
2     75   38      M

In a DataFrame, you can also specify the index labels:

>>> df = pd.DataFrame(d, columns=['money', 'age', 'gender'], index=['bob', 'alice', 'chris'])
>>> df
       money  age gender
bob      100   23      M
alice     25   13      F
chris     75   38      M
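
To check the statement that each column is a Series, here is a small sketch that pulls one column out of the DataFrame and inspects it. The variable name money is only illustrative.

```python
import pandas as pd

d = {'gender': ['M', 'F', 'M'],
     'money': [100, 25, 75],
     'age': [23, 13, 38]}
df = pd.DataFrame(d, columns=['money', 'age', 'gender'],
                  index=['bob', 'alice', 'chris'])

# Selecting a single column returns a Series that shares the DataFrame's index.
money = df['money']
print(type(money))
print(money['alice'])

# Each column keeps its own data type.
print(df.dtypes)
```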

You can add a new column to an existing DataFrame:

>>> df['debt'] = [300, 0, 2000]
>>> df
       money  age gender  debt
bob      100   23      M   300
alice     25   13      F     0
chris     75   38      M  2000
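
A new column does not have to come from a list; it can also be computed from existing columns. The sketch below adds a hypothetical column named net (not part of the original example) holding money minus debt.

```python
import pandas as pd

df = pd.DataFrame({'money': [100, 25, 75],
                   'age': [23, 13, 38],
                   'gender': ['M', 'F', 'M'],
                   'debt': [300, 0, 2000]},
                  index=['bob', 'alice', 'chris'])

# Column arithmetic is element-wise, so this computes one value per row.
# 'net' is a made-up column name used only for this illustration.
df['net'] = df['money'] - df['debt']
print(df)
```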

As in the Series example, conditional selection is possible:

>>> df[df['money'] - df['debt'] > 0]
       money  age gender  debt
alice     25   13      F     0

Suppose you want to select a particular row, for example:

>>> df['bob']
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2522, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'bob'

This fails because indexing a DataFrame with square brackets looks up a column, not a row. To select a row by its label, use the loc indexer. (Older tutorials use ix for this, but ix was deprecated in pandas 0.20 and removed in pandas 1.0.)

>>> df.loc['bob']
money     100
age        23
gender      M
debt      300
Name: bob, dtype: object
>>> df.loc['bob']['money']
100
>>> df.loc['bob', 'money']
100
>>> df.loc['bob'].money
100
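
In recent pandas releases, rows are selected by label with loc and by integer position with iloc. The following is a minimal sketch of both, using the same data as above; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'money': [100, 25, 75],
                   'age': [23, 13, 38],
                   'gender': ['M', 'F', 'M'],
                   'debt': [300, 0, 2000]},
                  index=['bob', 'alice', 'chris'])

# Select a row by its index label.
row_by_label = df.loc['bob']

# Select a row by its integer position (0 is the first row).
row_by_position = df.iloc[0]

# Select a single cell: row label first, then column name.
value = df.loc['bob', 'money']

# Select several rows and columns at once by passing lists.
subset = df.loc[['bob', 'chris'], ['money', 'debt']]
print(subset)
```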

Arithmetic on DataFrames is element-wise, with elements matched by column name and index label. For example, suppose we have another DataFrame:

>>> df2 = pd.DataFrame({'money': [20, 10, 200],
...                     'debt': [-100, 20, -170]},
...                      index = ['chris', 'alice', 'bob'] )
>>> df2
       money  debt
chris     20  -100
alice     10    20
bob      200  -170

If we add df2 to df:

>>> df + df2
       age  debt gender  money
alice  NaN    20    NaN     35
bob    NaN   130    NaN    300
chris  NaN  1900    NaN     95

Wait, this is not what we expected. NaN, or “Not a Number”, represents missing data in pandas. Because df2 has no “gender” or “age” columns, those fields are treated as missing data in the result. We can fill the missing data with the original values by using the combine_first method.

>>> df3 = (df + df2).combine_first(df)
>>> df3
        age  debt gender  money
alice  13.0    20      F     35
bob    23.0   130      M    300
chris  38.0  1900      M     95

The combine_first method fills each missing value with the corresponding value from the given DataFrame.
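
The same two steps, label alignment followed by filling, can be seen on a smaller scale with two Series. This is only a sketch with made-up data; left, right, total and filled are illustrative names.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20], index=['b', 'c'])

# Addition matches elements by label. 'a' has no partner in right,
# so the result for 'a' is missing (NaN).
total = left + right
print(total)

# combine_first keeps the values that are present in total
# and fills only the missing ones from left.
filled = total.combine_first(left)
print(filled)
```

Values that are already present in total are left untouched; only the missing entry is taken from left.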

We can also compute statistics from a DataFrame:

>>> df3.money.min()
35
>>> df3.money.mean()
143.33333333333334
>>> df3.money.median()
95.0

If you want to see the summary statistics at a glance:

>>> df3.describe()
             age         debt       money
count   3.000000     3.000000    3.000000
mean   24.666667   683.333333  143.333333
std    12.583057  1055.098732  138.954429
min    13.000000    20.000000   35.000000
25%    18.000000    75.000000   65.000000
50%    23.000000   130.000000   95.000000
75%    30.500000  1015.000000  197.500000
max    38.000000  1900.000000  300.000000
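
As the output shows, describe leaves out the non-numeric gender column. If you only need a few statistics, or want a summary of a text column, something like the following sketch may be useful. It rebuilds df3 from the values shown above so that it runs on its own; summary and counts are illustrative names.

```python
import pandas as pd

# Same values as df3 above, written out so the example is self-contained.
df3 = pd.DataFrame({'age': [13.0, 23.0, 38.0],
                    'debt': [20, 130, 1900],
                    'gender': ['F', 'M', 'M'],
                    'money': [35, 300, 95]},
                   index=['alice', 'bob', 'chris'])

# agg computes only the requested statistics, one row per statistic.
summary = df3[['age', 'debt', 'money']].agg(['min', 'mean', 'median'])
print(summary)

# For a text column, value_counts counts how often each value occurs.
counts = df3['gender'].value_counts()
print(counts)
```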