Shape Changes
At times, you need to change the shape of your data. There are two main methods to achieve the same: pivot and melt. We'll study them in this section.
Pivot
For example, your data may look like the following:
>> san_francisco_dt
+------------+----------+-------+
| date | variable | value |
+------------+----------+-------+
| 2017-09-01 | max-temp | 90 |
| 2017-09-01 | min-temp | 30 |
| 2017-09-02 | max-temp | 95 |
| 2017-09-02 | min-temp | 40 |
| 2017-09-03 | max-temp | 80 |
| 2017-09-03 | min-temp | 35 |
+------------+----------+-------+
As we can see, max-temp and min-temp should have been columns and the data should look like the following:
+------------+----------+----------+
| date | max-temp | min-temp |
+------------+----------+----------+
| 2017-09-01 | 90 | 30 |
| 2017-09-02 | 95 | 30 |
| 2017-09-03 | 80 | 35 |
+------------+----------+----------+
We can use the pivot function to achieve this:
>> san_francisco_dt = san_francisco_dt.pivot(variables=['variable'], values=['value'])
In this case, the column variable has been pivoted for the values in column value. We can pick multiple columns to pivot against multiple value columns. Here is an example dataset:
>> san_francisco_dt
+------------+----------+----------+-------+
| date | station | variable | value |
+------------+----------+----------+-------+
| 2017-09-01 | downtown | max-temp | 90 |
| 2017-09-01 | downtown | min-temp | 30 |
| 2017-09-01 | uptown | max-temp | 94 |
| 2017-09-01 | uptown | min-temp | 32 |
| 2017-09-02 | downtown | max-temp | 93 |
| 2017-09-02 | downtown | min-temp | 32 |
| 2017-09-02 | uptown | max-temp | 91 |
| 2017-09-02 | uptown | min-temp | 30 |
| 2017-09-03 | downtown | max-temp | 94 |
| 2017-09-03 | downtown | min-temp | 35 |
| 2017-09-03 | uptown | max-temp | 97 |
| 2017-09-03 | uptown | min-temp | 36 |
+------------+----------+----------+-------+
We record temperature at two weather stations and each station provides max and min recordings. We would like the dataset to look like the following:
+------------+--------------------+--------------------+------------------+------------------+
| date | downtown::max-temp | downtown::min-temp | uptown::max-temp | uptown::min-temp |
+------------+--------------------+--------------------+------------------+------------------+
| 2017-09-01 | 90 | 30 | 94 | 32 |
| 2017-09-02 | 93 | 32 | 91 | 30 |
| 2017-09-03 | 94 | 35 | 97 | 36 |
+------------+--------------------+--------------------+------------------+------------------+
This can be achieved by pivoting on two variables:
>> san_francisco_dt = san_francisco_dt.pivot(variables=['station', 'variable'], values=['value'])
Let's complicate this further. Let's say that each weather station also records the level of confidence on the reading based on some parameters. This level of confidence could be between 0 to 100. Here is how our new dataset looks like:
+------------+----------+----------+-------+------------+
| date | station | variable | value | confidence |
+------------+----------+----------+-------+------------+
| 2017-09-01 | downtown | max-temp | 90 | 99 |
| 2017-09-01 | downtown | min-temp | 30 | 94 |
| 2017-09-01 | uptown | max-temp | 94 | 92 |
| 2017-09-01 | uptown | min-temp | 32 | 92 |
| 2017-09-02 | downtown | max-temp | 93 | 90 |
| 2017-09-02 | downtown | min-temp | 32 | 94 |
| 2017-09-02 | uptown | max-temp | 91 | 93 |
| 2017-09-02 | uptown | min-temp | 30 | 99 |
| 2017-09-03 | downtown | max-temp | 94 | 96 |
| 2017-09-03 | downtown | min-temp | 35 | 92 |
| 2017-09-03 | uptown | max-temp | 97 | 92 |
| 2017-09-03 | uptown | min-temp | 36 | 92 |
+------------+----------+----------+-------+------------+
We now pivot two columns (station and variable ) against two sets of values (value, confidence). This is how we want it to look:
+------------+---------------------------+--------------------------------+---------------------------+--------------------------------+-------------------------+------------------------------+-------------------------+------------------------------+
| date | downtown::max-temp::value | downtown::max-temp::confidence | downtown::min-temp::value | downtown::min-temp::confidence | uptown::max-temp::value | uptown::max-temp::confidence | uptown::min-temp::value | uptown::min-temp::confidence |
+------------+---------------------------+--------------------------------+---------------------------+--------------------------------+-------------------------+------------------------------+-------------------------+------------------------------+
| 2017-09-01 | 90 | 99 | 30 | 94 | 94 | 92 | 32 | 92 |
| 2017-09-02 | 93 | 90 | 32 | 94 | 91 | 93 | 30 | 99 |
| 2017-09-03 | 94 | 96 | 35 | 92 | 97 | 92 | 36 | 92 |
+------------+---------------------------+--------------------------------+---------------------------+--------------------------------+-------------------------+------------------------------+-------------------------+------------------------------+
Here is the code to achieve this:
>> san_francisco_dt = san_francisco_dt.pivot(variables=['station', 'variable'], values=['value', 'confidence'])
Melt
Melt operation is the opposite of pivot. Let's rework our examples in the reverse direction.
Given the following dataset:
+------------+----------+----------+
| date | max-temp | min-temp |
+------------+----------+----------+
| 2017-09-01 | 90 | 30 |
| 2017-09-02 | 95 | 30 |
| 2017-09-03 | 80 | 35 |
+------------+----------+----------+
We want to change it to the following:
+------------+----------+-------+
| date | variable | value |
+------------+----------+-------+
| 2017-09-01 | max-temp | 90 |
| 2017-09-01 | min-temp | 30 |
| 2017-09-02 | max-temp | 95 |
| 2017-09-02 | min-temp | 40 |
| 2017-09-03 | max-temp | 80 |
| 2017-09-03 | min-temp | 35 |
+------------+----------+-------+
And here is the code for the same:
>> san_francisco_dt = san_francisco_dt.melt(columns=['max-temp', 'min-temp'], new_names=['variable', 'value'])
Now, moving to the second example:
+------------+--------------------+--------------------+------------------+------------------+
| date | downtown::max-temp | downtown::min-temp | uptown::max-temp | uptown::min-temp |
+------------+--------------------+--------------------+------------------+------------------+
| 2017-09-01 | 90 | 30 | 94 | 32 |
| 2017-09-02 | 93 | 32 | 91 | 30 |
| 2017-09-03 | 94 | 35 | 97 | 36 |
+------------+--------------------+--------------------+------------------+------------------+
We want it to look like the following:
+------------+----------+----------+-------+
| date | station | variable | value |
+------------+----------+----------+-------+
| 2017-09-01 | downtown | max-temp | 90 |
| 2017-09-01 | downtown | min-temp | 30 |
| 2017-09-01 | uptown | max-temp | 94 |
| 2017-09-01 | uptown | min-temp | 32 |
| 2017-09-02 | downtown | max-temp | 93 |
| 2017-09-02 | downtown | min-temp | 32 |
| 2017-09-02 | uptown | max-temp | 91 |
| 2017-09-02 | uptown | min-temp | 30 |
| 2017-09-03 | downtown | max-temp | 94 |
| 2017-09-03 | downtown | min-temp | 35 |
| 2017-09-03 | uptown | max-temp | 97 |
| 2017-09-03 | uptown | min-temp | 36 |
+------------+----------+----------+-------+
This gets tricky since we want to melt 4 columns of observations into 3 columns (2 identifiers and 1 observation). This requires some work.
>> san_francisco_dt.melt(columns={
... "downtown::max-temp": ('downtown', 'max-temp'),
... "downtown::min-temp": ('downtown', 'min-temp'),
... "uptown::max-temp": ('uptown', 'max-temp'),
... "uptown::min-temp": ('uptown', 'min-temp'),
...})
For each column name, you specify the tuples of variables that it should map to.
Another strategy is to use .row_apply():
def melt_columns(row):
return ( (row[0], 'downtown', 'max-temp', row[1]),
(row[0], 'downtown', 'min-temp', row[2]),
(row[0], 'uptown', 'max-temp', row[3]),
(row[0], 'uptown', 'min-temp', row[4]) )
>> san_francisco_dt = san_francisco_dt.row_apply(melt_columns)
>> san_francisco_dt.meta.column_names = ['date', 'station', 'variable', 'value']