Shape Changes

At times, you need to change the shape of your data. There are two main methods to achieve the same: pivot and melt. We'll study them in this section.

Pivot

For example, your data may look like the following:

>> san_francisco_dt
+------------+----------+-------+
| date       | variable | value |
+------------+----------+-------+
| 2017-09-01 | max-temp | 90    |
| 2017-09-01 | min-temp | 30    |
| 2017-09-02 | max-temp | 95    |
| 2017-09-02 | min-temp | 40    |
| 2017-09-03 | max-temp | 80    |
| 2017-09-03 | min-temp | 35    |
+------------+----------+-------+

As we can see, max-temp and min-temp should have been columns and the data should look like the following:

+------------+----------+----------+
| date       | max-temp | min-temp |
+------------+----------+----------+
| 2017-09-01 | 90       | 30       |
| 2017-09-02 | 95       | 30       |
| 2017-09-03 | 80       | 35       |
+------------+----------+----------+

We can use the pivot function to achieve this:

>> san_francisco_dt = san_francisco_dt.pivot(variables=['variable'], values=['value'])

In this case, the column variable has been pivoted for the values in column value. We can pick multiple columns to pivot against multiple value columns. Here is an example dataset:

>> san_francisco_dt
+------------+----------+----------+-------+
| date       | station  | variable | value |
+------------+----------+----------+-------+
| 2017-09-01 | downtown | max-temp | 90    |
| 2017-09-01 | downtown | min-temp | 30    |
| 2017-09-01 | uptown   | max-temp | 94    |
| 2017-09-01 | uptown   | min-temp | 32    |
| 2017-09-02 | downtown | max-temp | 93    |
| 2017-09-02 | downtown | min-temp | 32    |
| 2017-09-02 | uptown   | max-temp | 91    |
| 2017-09-02 | uptown   | min-temp | 30    |
| 2017-09-03 | downtown | max-temp | 94    |
| 2017-09-03 | downtown | min-temp | 35    |
| 2017-09-03 | uptown   | max-temp | 97    |
| 2017-09-03 | uptown   | min-temp | 36    |
+------------+----------+----------+-------+

We record temperature at two weather stations and each station provides max and min recordings. We would like the dataset to look like the following:

+------------+--------------------+--------------------+------------------+------------------+
| date       | downtown::max-temp | downtown::min-temp | uptown::max-temp | uptown::min-temp |
+------------+--------------------+--------------------+------------------+------------------+
| 2017-09-01 | 90                 | 30                 | 94               | 32               |
| 2017-09-02 | 93                 | 32                 | 91               | 30               |
| 2017-09-03 | 94                 | 35                 | 97               | 36               |
+------------+--------------------+--------------------+------------------+------------------+

This can be achieved by pivoting on two variables:

>> san_francisco_dt = san_francisco_dt.pivot(variables=['station', 'variable'], values=['value'])

Let's complicate this further. Let's say that each weather station also records the level of confidence on the reading based on some parameters. This level of confidence could be between 0 to 100. Here is how our new dataset looks like:

+------------+----------+----------+-------+------------+
| date       | station  | variable | value | confidence |
+------------+----------+----------+-------+------------+
| 2017-09-01 | downtown | max-temp | 90    | 99         |
| 2017-09-01 | downtown | min-temp | 30    | 94         |
| 2017-09-01 | uptown   | max-temp | 94    | 92         |
| 2017-09-01 | uptown   | min-temp | 32    | 92         |
| 2017-09-02 | downtown | max-temp | 93    | 90         |
| 2017-09-02 | downtown | min-temp | 32    | 94         |
| 2017-09-02 | uptown   | max-temp | 91    | 93         |
| 2017-09-02 | uptown   | min-temp | 30    | 99         |
| 2017-09-03 | downtown | max-temp | 94    | 96         |
| 2017-09-03 | downtown | min-temp | 35    | 92         |
| 2017-09-03 | uptown   | max-temp | 97    | 92         |
| 2017-09-03 | uptown   | min-temp | 36    | 92         |
+------------+----------+----------+-------+------------+

We now pivot two columns (station and variable ) against two sets of values (value, confidence). This is how we want it to look:

+------------+---------------------------+--------------------------------+---------------------------+--------------------------------+-------------------------+------------------------------+-------------------------+------------------------------+
| date       | downtown::max-temp::value | downtown::max-temp::confidence | downtown::min-temp::value | downtown::min-temp::confidence | uptown::max-temp::value | uptown::max-temp::confidence | uptown::min-temp::value | uptown::min-temp::confidence |
+------------+---------------------------+--------------------------------+---------------------------+--------------------------------+-------------------------+------------------------------+-------------------------+------------------------------+
| 2017-09-01 | 90                        | 99                             | 30                        | 94                             | 94                      | 92                           | 32                      | 92                           |
| 2017-09-02 | 93                        | 90                             | 32                        | 94                             | 91                      | 93                           | 30                      | 99                           |
| 2017-09-03 | 94                        | 96                             | 35                        | 92                             | 97                      | 92                           | 36                      | 92                           |
+------------+---------------------------+--------------------------------+---------------------------+--------------------------------+-------------------------+------------------------------+-------------------------+------------------------------+

Here is the code to achieve this:

>> san_francisco_dt = san_francisco_dt.pivot(variables=['station', 'variable'], values=['value', 'confidence'])

Melt

Melt operation is the opposite of pivot. Let's rework our examples in the reverse direction.

Given the following dataset:

+------------+----------+----------+
| date       | max-temp | min-temp |
+------------+----------+----------+
| 2017-09-01 | 90       | 30       |
| 2017-09-02 | 95       | 30       |
| 2017-09-03 | 80       | 35       |
+------------+----------+----------+

We want to change it to the following:

+------------+----------+-------+
| date       | variable | value |
+------------+----------+-------+
| 2017-09-01 | max-temp | 90    |
| 2017-09-01 | min-temp | 30    |
| 2017-09-02 | max-temp | 95    |
| 2017-09-02 | min-temp | 40    |
| 2017-09-03 | max-temp | 80    |
| 2017-09-03 | min-temp | 35    |
+------------+----------+-------+

And here is the code for the same:

>> san_francisco_dt = san_francisco_dt.melt(columns=['max-temp', 'min-temp'], new_names=['variable', 'value'])

Now, moving to the second example:

+------------+--------------------+--------------------+------------------+------------------+
| date       | downtown::max-temp | downtown::min-temp | uptown::max-temp | uptown::min-temp |
+------------+--------------------+--------------------+------------------+------------------+
| 2017-09-01 | 90                 | 30                 | 94               | 32               |
| 2017-09-02 | 93                 | 32                 | 91               | 30               |
| 2017-09-03 | 94                 | 35                 | 97               | 36               |
+------------+--------------------+--------------------+------------------+------------------+

We want it to look like the following:

+------------+----------+----------+-------+
| date       | station  | variable | value |
+------------+----------+----------+-------+
| 2017-09-01 | downtown | max-temp | 90    |
| 2017-09-01 | downtown | min-temp | 30    |
| 2017-09-01 | uptown   | max-temp | 94    |
| 2017-09-01 | uptown   | min-temp | 32    |
| 2017-09-02 | downtown | max-temp | 93    |
| 2017-09-02 | downtown | min-temp | 32    |
| 2017-09-02 | uptown   | max-temp | 91    |
| 2017-09-02 | uptown   | min-temp | 30    |
| 2017-09-03 | downtown | max-temp | 94    |
| 2017-09-03 | downtown | min-temp | 35    |
| 2017-09-03 | uptown   | max-temp | 97    |
| 2017-09-03 | uptown   | min-temp | 36    |
+------------+----------+----------+-------+

This gets tricky since we want to melt 4 columns of observations into 3 columns (2 identifiers and 1 observation). This requires some work.

>> san_francisco_dt.melt(columns={
...     "downtown::max-temp": ('downtown', 'max-temp'),
...     "downtown::min-temp": ('downtown', 'min-temp'),
...     "uptown::max-temp": ('uptown', 'max-temp'),
...     "uptown::min-temp": ('uptown', 'min-temp'),
...})

For each column name, you specify the tuples of variables that it should map to.

Another strategy is to use .row_apply():

def melt_columns(row):
    return ( (row[0], 'downtown', 'max-temp', row[1]),
            (row[0], 'downtown', 'min-temp', row[2]),
            (row[0], 'uptown', 'max-temp', row[3]),
            (row[0], 'uptown', 'min-temp', row[4]) )

>> san_francisco_dt = san_francisco_dt.row_apply(melt_columns)
>> san_francisco_dt.meta.column_names = ['date', 'station', 'variable', 'value']

results matching ""

    No results matching ""