.. _ch05-python-pandas:

======
Pandas
======

`pandas <http://pandas.pydata.org>`_ is a high-performance data analysis
package for Python that is widely used in data science. As usual, you can
install pandas on your system with ``pip``.

.. code-block:: console

   $ pip3 install pandas

The conventional way to import pandas is:

.. code-block:: python

   import pandas as pd

Series
------

pandas provides two main data structures, *Series* and *DataFrame*.
A Series is a one-dimensional array with a labeled index, and a DataFrame is
its two-dimensional counterpart. You can define a Series as follows:

.. code-block:: python

   >>> sr = pd.Series([1., 3., 5.])
   >>> sr
   0    1.0
   1    3.0
   2    5.0
   dtype: float64

As the output above shows, each element of a Series is an index and value pair.

.. code-block:: python

   >>> sr.index
   RangeIndex(start=0, stop=3, step=1)
   >>> sr.values
   array([1., 3., 5.])

The result of the ``values`` attribute should look familiar: the data in both
Series and DataFrame is stored as a NumPy array. The real power of a Series
(and of a DataFrame) lies in its *named* index. For example, you can define a
Series with an explicit index:

.. code-block:: python

   >>> sr = pd.Series([1., 3., 5.], ['a', 'b', 'c'])
   >>> sr
   a    1.0
   b    3.0
   c    5.0
   dtype: float64
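
As a small illustration of the two points above, the following sketch builds
the same Series with the ``index`` keyword spelled out and checks that the
underlying data is a NumPy array. The variable names ``values`` and
``doubled`` are introduced here for illustration only.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Same data as above, with the index passed by keyword for readability.
   sr = pd.Series([1.0, 3.0, 5.0], index=['a', 'b', 'c'])

   # The data is backed by a NumPy array, so arithmetic is applied
   # to every element at once while the labels are kept.
   values = sr.to_numpy()
   doubled = sr * 2

   print(type(values))       # the NumPy array type
   print(list(sr.index))     # the labels
   print(doubled['c'])       # label-based access on the result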

Now you can access your data by index label, using dictionary-like subscripting:

.. code-block:: python

   >>> sr['a']
   1.0

Or by using attribute access:

.. code-block:: python

   >>> sr.c
   5.0

You can also select data conditionally:

.. code-block:: python

   >>> sr[sr > 2.0]
   b    3.0
   c    5.0
   dtype: float64
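
The expression inside the brackets is itself a Series of booleans, which
pandas uses as a mask. A minimal sketch of that intermediate step, using the
same values as above (the names ``mask`` and ``selected`` are illustrative):

.. code-block:: python

   import pandas as pd

   sr = pd.Series([1.0, 3.0, 5.0], index=['a', 'b', 'c'])

   # The comparison is evaluated element-wise and yields a boolean Series.
   mask = sr > 2.0
   print(mask.tolist())             # [False, True, True]

   # Indexing with the mask keeps only the entries where it is True.
   selected = sr[mask]
   print(selected.index.tolist())   # ['b', 'c']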

A common way to create a Series is from a dictionary.

.. code-block:: python

   >>> d = {'bob': 100, 'alice': 25, 'chris': 75}
   >>> sr_data = pd.Series(d)
   >>> sr_data
   bob      100
   alice     25
   chris     75
   dtype: int64
   >>> sr_data['bob'] + sr_data['alice']
   125
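
When a dictionary is combined with an explicit ``index``, the values are
looked up by label, and a label with no matching key becomes ``NaN``. In the
sketch below, ``'dave'`` is a hypothetical label that is not a key of the
dictionary, and ``sr_sel`` is an illustrative name.

.. code-block:: python

   import pandas as pd

   d = {'bob': 100, 'alice': 25, 'chris': 75}

   # Select and reorder entries by label; 'dave' has no entry in ``d``.
   sr_sel = pd.Series(d, index=['chris', 'bob', 'dave'])
   print(sr_sel)

   # Membership tests look at the index labels, like dictionary keys.
   print('bob' in sr_sel)

   # isna() marks the entry that could not be filled from the dictionary.
   print(sr_sel.isna().tolist())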

DataFrame
---------

A DataFrame stores a Series in each column, which makes it a two-dimensional
data structure. Like the Series in the previous example, a DataFrame can be
initialized from a dictionary.

.. code-block:: python

   >>> d = {'gender': ['M', 'F', 'M'],
   ...      'money': [100, 25, 75],
   ...      'age': [23, 13, 38] }
   >>> df = pd.DataFrame(d)
   >>> df
     gender  money  age
   0      M    100   23
   1      F     25   13
   2      M     75   38

To specify the order of the columns explicitly, pass ``columns``:

.. code-block:: python

   >>> df = pd.DataFrame(d, columns=['money', 'age', 'gender'])
   >>> df
      money  age gender
   0    100   23      M
   1     25   13      F
   2     75   38      M

You can also specify the index labels of a DataFrame.

.. code-block:: python

   >>> df = pd.DataFrame(d, columns=['money', 'age', 'gender'], index=['bob', 'alice', 'chris'])
   >>> df
          money  age gender
   bob      100   23      M
   alice     25   13      F
   chris     75   38      M

You can add more columns to an existing DataFrame:

.. code-block:: python

   >>> df['debt'] = [300, 0, 2000]
   >>> df
          money  age gender  debt
   bob      100   23      M   300
   alice     25   13      F     0
   chris     75   38      M  2000
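
A new column can also be computed from existing ones, since arithmetic
between columns is element-wise. In this sketch the column name ``net`` is an
assumption introduced for illustration.

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'money': [100, 25, 75],
                      'age': [23, 13, 38],
                      'gender': ['M', 'F', 'M']},
                     index=['bob', 'alice', 'chris'])
   df['debt'] = [300, 0, 2000]

   # Hypothetical derived column: money left after subtracting debt.
   df['net'] = df['money'] - df['debt']
   print(df[['money', 'debt', 'net']])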

As in the Series example, conditional selection is possible:

.. code-block:: python

   >>> df[df['money'] - df['debt'] > 0]
          money  age gender  debt
   alice     25   13      F     0
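
Several conditions can be combined with ``&`` (and) and ``|`` (or). Each
condition needs its own parentheses, because these operators bind more
tightly than comparisons. A short sketch on the same data, with illustrative
result names:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'money': [100, 25, 75],
                      'age': [23, 13, 38],
                      'gender': ['M', 'F', 'M'],
                      'debt': [300, 0, 2000]},
                     index=['bob', 'alice', 'chris'])

   # Rows where the person is male AND older than 30.
   older_men = df[(df['gender'] == 'M') & (df['age'] > 30)]
   print(older_men.index.tolist())   # ['chris']

   # Rows where the person is under 20 OR owes more than 1000.
   flagged = df[(df['age'] < 20) | (df['debt'] > 1000)]
   print(flagged.index.tolist())     # ['alice', 'chris']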

Now suppose you want to select a particular row, say the one labelled ``bob``:

.. code-block:: python

   >>> df['bob']
   Traceback (most recent call last):
     File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2522, in get_loc
       return self._engine.get_loc(key)
     File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
     File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
     File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
     File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
   KeyError: 'bob'

This fails because indexing a DataFrame with square brackets looks up a
*column*, not a row. To select a row by its label, use the ``loc`` indexer.
(Older versions of pandas also offered ``ix`` for this purpose; it was
deprecated in pandas 0.20 and removed in pandas 1.0.)

.. code-block:: python

   >>> df.loc['bob']
   money     100
   age        23
   gender      M
   debt      300
   Name: bob, dtype: object
   >>> df.loc['bob']['money']
   100
   >>> df.loc['bob', 'money']
   100
   >>> df.loc['bob'].money
   100
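
Label-based selection is available through ``loc``, while ``iloc`` selects by
integer position. A minimal sketch of both on the same data (the names
``by_label`` and ``by_position`` are illustrative):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'money': [100, 25, 75],
                      'age': [23, 13, 38],
                      'gender': ['M', 'F', 'M'],
                      'debt': [300, 0, 2000]},
                     index=['bob', 'alice', 'chris'])

   # Label-based: the row labelled 'alice'.
   by_label = df.loc['alice']
   # Position-based: the second row (positions start at 0).
   by_position = df.iloc[1]
   print(by_label.equals(by_position))

   # Both indexers accept a row selector and a column selector.
   print(df.loc['chris', 'debt'])
   print(df.iloc[2, 3])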

Arithmetic on DataFrames is performed element-wise, matching cells by column
name and index label. For example, suppose we have another DataFrame:

.. code-block:: python

   >>> df2 = pd.DataFrame({'money': [20, 10, 200],
   ...                     'debt': [-100, 20, -170]},
   ...                      index = ['chris', 'alice', 'bob'] )
   >>> df2
          money  debt
   chris     20  -100
   alice     10    20
   bob      200  -170

If we add ``df2`` to ``df``:

.. code-block:: python

   >>> df + df2
          age  debt gender  money
   alice  NaN    20    NaN     35
   bob    NaN   130    NaN    300
   chris  NaN  1900    NaN     95
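
To make the label matching explicit, the sketch below repeats the addition
with only the two numeric columns that both frames share. Each cell of the
result comes from the cells with the same row label and column name,
regardless of the order in which the rows were written. The name ``total`` is
illustrative.

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'money': [100, 25, 75],
                      'debt': [300, 0, 2000]},
                     index=['bob', 'alice', 'chris'])
   df2 = pd.DataFrame({'money': [20, 10, 200],
                       'debt': [-100, 20, -170]},
                      index=['chris', 'alice', 'bob'])

   total = df + df2

   # bob is the first row of df and the last row of df2, yet his
   # values are paired correctly: 100 + 200.
   print(total.loc['bob', 'money'])    # 300
   # chris: 2000 + (-100).
   print(total.loc['chris', 'debt'])   # 1900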

Wait, this is not what we expected. ``NaN``, or "Not a Number", represents
missing data in pandas. Since ``df2`` has no "gender" or "age" columns,
those fields are treated as missing in the result. We can fill in the
missing entries with the original values by using the ``combine_first``
method.

.. code-block:: python

   >>> df3 = (df + df2).combine_first(df)
   >>> df3
           age  debt gender  money
   alice  13.0    20      F     35
   bob    23.0   130      M    300
   chris  38.0  1900      M     95

The ``combine_first`` method fills each missing entry with the corresponding
value from the DataFrame passed as its argument.
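
To see this behaviour in isolation, here is a minimal sketch with two small,
hypothetical frames (``first`` and ``second`` are not part of the example
above). Values already present in the caller are kept, and only its missing
entries are taken from the argument.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Hypothetical frames used only for this illustration.
   first = pd.DataFrame({'x': [1.0, np.nan],
                         'y': [np.nan, 4.0]},
                        index=['r1', 'r2'])
   second = pd.DataFrame({'x': [10.0, 20.0],
                          'y': [30.0, 40.0]},
                         index=['r1', 'r2'])

   # Non-missing cells of ``first`` win; NaN cells are filled from ``second``.
   filled = first.combine_first(second)
   print(filled)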

We can also compute summary statistics from a DataFrame:

.. code-block:: python

   >>> df3.money.min()
   35
   >>> df3.money.mean()
   143.33333333333334
   >>> df3.money.median()
   95.0

To see the main summary statistics at a glance:

.. code-block:: python

   >>> df3.describe()
                age         debt       money
   count   3.000000     3.000000    3.000000
   mean   24.666667   683.333333  143.333333
   std    12.583057  1055.098732  138.954429
   min    13.000000    20.000000   35.000000
   25%    18.000000    75.000000   65.000000
   50%    23.000000   130.000000   95.000000
   75%    30.500000  1015.000000  197.500000
   max    38.000000  1900.000000  300.000000
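
By default ``describe`` summarizes only the numeric columns, which is why
``gender`` does not appear in the table. For a non-numeric column, a
frequency count is often more useful. The sketch below rebuilds the values of
``df3`` by hand so that it runs on its own; ``counts`` is an illustrative
name.

.. code-block:: python

   import pandas as pd

   # Same values as df3 above, written out explicitly.
   df3 = pd.DataFrame({'age': [13.0, 23.0, 38.0],
                       'debt': [20, 130, 1900],
                       'gender': ['F', 'M', 'M'],
                       'money': [35, 300, 95]},
                      index=['alice', 'bob', 'chris'])

   # Only the numeric columns are included in the summary.
   print(df3.describe().columns.tolist())

   # value_counts() tallies how often each value occurs in a column.
   counts = df3['gender'].value_counts()
   print(counts['M'], counts['F'])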