Row-wise Transformation

One of the common operation in cleaning data is transform the values of one column into another format. At times, it's as simple as converting the metric (e.g. changing fahrenheit to celsius or changing kilometers to meters). In such cases, you can use vectorized operations with broadcasting semantics. However, some times, it's more complicated.

Transforming data - one row at a time

Consider the following dataset where temperature reading is suffixed with the metric.

>> san_francisco_dt = dtables.load_tuples([
>>    ('max-temp', 'min-temp'),
>>    ('90F', '40F'),
>>    ('85F', '35F'),
>>    ('95F', '40F')
>> )]
>>
>> san_francisco_dt
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90F      | 40F      |
| 85F      | 35F      |
| 95F      | 40F      |
+----------+----------+

DTable comes with a .row_apply() method that can be helpful here. The way you invoke it is as follows:

dt.row_apply(func)

DTable would iterate through the data row-by-row. It would form a tuple for each row and pass it to func as an argument. The func should return a tuple of values that would become rows in a new DTable object. An example would be helpful here.

Let's say, our func is as follows:

# Given a string of the form '90F', it returns 90 as integer value.
def noF(row):
    max_temp, min_temp = row
    max_temp_noF, min_temp_noF = max_temp[:-1], min_temp[:-1]
    return int(max_temp_noF), int(min_temp_noF)

If we apply it on our DTable of temperature readings, here is what we get:

>> san_francisco_dt.row_apply(noF)
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90       | 40       |
| 85       | 35       |
| 95       | 40       |
+----------+----------+

One quick trick would be to add the modified columns as new columns:

>> san_francisco_dt[ ['max-temp-noF', 'min-temp-noF'] ] = \
...    san_francisco_dt.row_apply(noF)
>>
>> san_francisco_dt
+----------+----------+--------------+--------------+
| max-temp | min-temp | max-temp-noF | min-temp-noF |
+----------+----------+--------------+--------------+
| 90F      | 40F      | 90           | 40           |
| 85F      | 35F      | 85           | 35           |
| 95F      | 40F      | 95           | 40           |
+----------+----------+--------------+--------------+

Expanding Rows

Let's continue with the previous example where we converted temperature data from a string '90F' to an integer 90. What if we want to retain the temperature recording as well as the metric that we used to measure the temperature. In other words, given the following dataset:

+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90F      | 40F      |
| 85F      | 35F      |
| 95F      | 40F      |
+----------+----------+

We want to make it look like the following:

+----------+----------+-----------------+-----------------+
| max-temp | min-temp | max-temp-metric | min-temp-metric |
+----------+----------+-----------------+-----------------+
| 90       | 40       | F               | F               |
| 85       | 35       | F               | F               |
| 95       | 40       | F               | F               |
+----------+----------+-----------------+-----------------+

Interestingly, .row_apply can return more observations than it received as input. For example, we could write the following function:

def separate_metric(row):
    temp1, temp2 = row
    max_temp, min_temp = temp1[:-1] , temp2[:-1]
    max_temp_metric, min_temp_metric = temp1[-1], temp2[-1]
    return int(max_temp), int(min_temp), max_temp_metric, min_temp_metric

We can now pass this function to .row_apply:

>> san_francisco_dt
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90F      | 40F      |
| 85F      | 35F      |
| 95F      | 40F      |
+----------+----------+
>>
>> san_francisco_dt[ ['max-temp', 'min-temp', 'max-temp-metric', 'min-temp-metric'] ] = \
...    san_francisco_dt.row_apply(separate_metric)
>>
>> san_francisco_dt
+----------+----------+-----------------+-----------------+
| max-temp | min-temp | max-temp-metric | min-temp-metric |
+----------+----------+-----------------+-----------------+
| 90       | 40       | F               | F               |
| 85       | 35       | F               | F               |
| 95       | 40       | F               | F               |
+----------+----------+-----------------+-----------------+

Shrinking Rows

.row_apply can also return less observations than it received in a row. And that can be used to collapse multiple columns into one. For example, given the following dataset:

+----------+----------+-----------------+-----------------+
| max-temp | min-temp | max-temp-metric | min-temp-metric |
+----------+----------+-----------------+-----------------+
| 90       | 40       | F               | F               |
| 85       | 35       | F               | F               |
| 95       | 40       | F               | F               |
+----------+----------+-----------------+-----------------+

If we want to convert it into the following:

+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90F      | 40F      |
| 85F      | 35F      |
| 95F      | 40F      |
+----------+----------+

We could use the following function as an argument to .row_apply

def collapse_metric(row):
    return "%d%s" % (row[0], row[2]), "%d%s" % (row[1], row[3])

Given a row of (90, 40, 'F', 'F'), this function would return ('90F', '40F').

>> san_francisco_dt
+----------+----------+-----------------+-----------------+
| max-temp | min-temp | max-temp-metric | min-temp-metric |
+----------+----------+-----------------+-----------------+
| 90       | 40       | F               | F               |
| 85       | 35       | F               | F               |
| 95       | 40       | F               | F               |
+----------+----------+-----------------+-----------------+
>>
>> san_francisco_dt[ ['max-temp', 'min-temp'] ] = \
...    san_francisco_dt.row_apply(collapse_metric)
>>
>> san_francisco_dt
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90F      | 40F      |
| 85F      | 35F      |
| 95F      | 40F      |
+----------+----------+

Constraints

The func argument to .row_apply must adhere to following constraints:

  • It always receives one tuple as an argument (even if there is only one column in the table). And it must return a tuple (even if it returns a single value).
  • It must always return the same number of values whenever it is invoked.
  • It can also return None and in that case, no corresponding record would be generated on the output.

results matching ""

    No results matching ""