.. _ch05-python-pandas:

======
Pandas
======

`pandas`_ is a high-performance data analysis package for Python that is widely used in data science.
As usual, you can install pandas on your system with ``pip``.

.. code-block:: console

    $ pip3 install pandas

The conventional way to import pandas is:

.. code-block:: python

    import pandas as pd

Series
------

pandas provides two main data structures, *Series* and *DataFrame*.
A Series is a one-dimensional array with a labeled index, and a DataFrame is the two-dimensional version of a Series.
You can define a Series as follows:

.. code-block:: python

    >>> sr = pd.Series([1., 3., 5.])
    >>> sr
    0    1.0
    1    3.0
    2    5.0
    dtype: float64

As the output shows, each element of a Series is an index and value pair.

.. code-block:: python

    >>> sr.index
    RangeIndex(start=0, stop=3, step=1)
    >>> sr.values
    array([ 1.,  3.,  5.])

The result of the ``values`` attribute should look familiar:
the underlying data of both Series and DataFrame is stored as a NumPy array.

The real power of a Series (and of a DataFrame) is its *named* index.
For example, you can define a Series with an explicit index:

.. code-block:: python

    >>> sr = pd.Series([1., 3., 5.], index=['a', 'b', 'c'])
    >>> sr
    a    1.0
    b    3.0
    c    5.0
    dtype: float64

Now you can access the data by index name, using *list-like* bracket notation:

.. code-block:: python

    >>> sr['a']
    1.0

Or using attribute access on the object:

.. code-block:: python

    >>> sr.c
    5.0

You can also select data conditionally:

.. code-block:: python

    >>> sr[sr > 2.0]
    b    3.0
    c    5.0
    dtype: float64

A common way to create a Series is from a dictionary.

.. code-block:: python

    >>> d = {'bob': 100, 'alice': 25, 'chris': 75}
    >>> sr_data = pd.Series(d)
    >>> sr_data
    alice     25
    bob      100
    chris     75
    dtype: int64
    >>> sr_data['bob'] + sr_data['alice']
    125

The keys appear sorted in this output; pandas 0.23 and later on Python 3.6+ keep the dictionary's insertion order instead.

DataFrame
---------

A DataFrame stores a Series in each column, which makes the data two-dimensional.

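
To make the "one Series per column" idea concrete, here is a minimal sketch. The variable names (``money``, ``age``, ``people``) are illustrative assumptions and are not used elsewhere in this chapter.

.. code-block:: python

    import pandas as pd

    # Two Series that share the same labeled index (hypothetical sample data).
    money = pd.Series([100, 25, 75], index=['bob', 'alice', 'chris'])
    age = pd.Series([23, 13, 38], index=['bob', 'alice', 'chris'])

    # Each Series becomes one column; the shared index labels the rows.
    people = pd.DataFrame({'money': money, 'age': age})

    # Selecting a single column gives back a Series.
    money_column = people['money']
    print(type(money_column).__name__)
    print(money_column['alice'])

Selecting a column by name returns a Series again, so everything shown for Series above applies to each column of a DataFrame.
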
As we covered in the Series example, a DataFrame can also be initialized with a dictionary.

.. code-block:: python

    >>> d = {'gender': ['M', 'F', 'M'],
    ...      'money': [100, 25, 75],
    ...      'age': [23, 13, 38]}
    >>> df = pd.DataFrame(d)
    >>> df
       age gender  money
    0   23      M    100
    1   13      F     25
    2   38      M     75

The columns appear sorted alphabetically here; pandas 0.23 and later on Python 3.6+ keep the dictionary's insertion order.
If you want to specify the order of the columns explicitly, pass ``columns``:

.. code-block:: python

    >>> df = pd.DataFrame(d, columns=['money', 'age', 'gender'])
    >>> df
       money  age gender
    0    100   23      M
    1     25   13      F
    2     75   38      M

In a DataFrame, you can also name the index entries.

.. code-block:: python

    >>> df = pd.DataFrame(d, columns=['money', 'age', 'gender'],
    ...                   index=['bob', 'alice', 'chris'])
    >>> df
           money  age gender
    bob      100   23      M
    alice     25   13      F
    chris     75   38      M

You can add another column to an existing DataFrame:

.. code-block:: python

    >>> df['debt'] = [300, 0, 2000]
    >>> df
           money  age gender  debt
    bob      100   23      M   300
    alice     25   13      F     0
    chris     75   38      M  2000

As in the Series example, conditional selection is possible:

.. code-block:: python

    >>> df[df['money'] - df['debt'] > 0]
           money  age gender  debt
    alice     25   13      F     0

Suppose you want to select a particular row, for example:

.. code-block:: python

    >>> df['bob']
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2522, in get_loc
        return self._engine.get_loc(key)
      File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'bob'

This fails, because indexing a DataFrame with brackets selects a *column*, not a row.
To pick a row by its label, use the label-based indexer ``loc``.
(Older tutorials use ``ix`` for this, but ``ix`` was deprecated in pandas 0.20 and removed in pandas 1.0.)

.. code-block:: python

    >>> df.loc['bob']
    money     100
    age        23
    gender      M
    debt      300
    Name: bob, dtype: object
    >>> df.loc['bob']['money']
    100
    >>> df.loc['bob', 'money']
    100
    >>> df.loc['bob'].money
    100

Arithmetic between DataFrames is element-wise, with elements matched by column and index name.
For example, suppose we have another DataFrame:

.. code-block:: python

    >>> df2 = pd.DataFrame({'money': [20, 10, 200],
    ...                     'debt': [-100, 20, -170]},
    ...                    index=['chris', 'alice', 'bob'])
    >>> df2
           debt  money
    chris  -100     20
    alice    20     10
    bob    -170    200

If we add ``df2`` to ``df``:

.. code-block:: python

    >>> df + df2
           age  debt gender  money
    alice  NaN    20    NaN     35
    bob    NaN   130    NaN    300
    chris  NaN  1900    NaN     95

This is not quite what we wanted.
``NaN`` ("Not a Number") represents missing data in pandas.
Because ``df2`` has no "gender" or "age" columns, those fields are treated as missing in the sum.
We can fill the missing values with the original ones by using the ``combine_first`` method.

.. code-block:: python

    >>> df3 = (df + df2).combine_first(df)
    >>> df3
            age  debt gender  money
    alice  13.0    20      F     35
    bob    23.0   130      M    300
    chris  38.0  1900      M     95

``combine_first`` fills each missing value with the value at the same row and column of the given DataFrame.

We can also compute summary statistics from a DataFrame:

.. code-block:: python

    >>> df3.money.min()
    35
    >>> df3.money.mean()
    143.33333333333334
    >>> df3.money.median()
    95.0

If you want to see all the summary statistics at a glance:

.. code-block:: python

    >>> df3.describe()
                 age         debt       money
    count   3.000000     3.000000    3.000000
    mean   24.666667   683.333333  143.333333
    std    12.583057  1055.098732  138.954429
    min    13.000000    20.000000   35.000000
    25%    18.000000    75.000000   65.000000
    50%    23.000000   130.000000   95.000000
    75%    30.500000  1015.000000  197.500000
    max    38.000000  1900.000000  300.000000
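
The label alignment and ``combine_first`` steps above can be reproduced as a small standalone script. This is only a sketch under stated assumptions: it leaves out the "gender" column for brevity, and the names ``total`` and ``filled`` are illustrative choices rather than names from the examples above.

.. code-block:: python

    import pandas as pd

    # Same sample values as in this chapter, without the 'gender' column
    # (an assumption made to keep the sketch purely numeric).
    df = pd.DataFrame(
        {'money': [100, 25, 75], 'age': [23, 13, 38], 'debt': [300, 0, 2000]},
        index=['bob', 'alice', 'chris'])
    df2 = pd.DataFrame(
        {'money': [20, 10, 200], 'debt': [-100, 20, -170]},
        index=['chris', 'alice', 'bob'])

    # Addition matches rows and columns by label, not by position.
    # 'age' exists only in df, so it is NaN in the sum.
    total = df + df2

    # Fill the missing 'age' values from the original DataFrame.
    filled = total.combine_first(df)

    print(total['age'].isna().all())
    print(filled.loc['bob', 'money'])
    print(filled['money'].mean())

Although the rows of ``df2`` are in a different order from those of ``df``, the sum for ``bob`` still combines 100 and 200, because the values are matched by index label.
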