Column-wise Transformation

At times, there is a need to perform transformation column by column. Usually, we would take one column and transform it into completely another set of values. Some of these transformations can be achieved with arithmetic and/or comparison operations. They are straight-forward to realize with arithmetic/comparison operators. For example, here is the San Francisco temperature readings in Fahrenheit:

>> san_francisco_dt = dtables.load_tuples([
>>    ('max-temp', 'min-temp'),
>>    (90, 40),
>>    (85, 35),
>>    (95, 40)
>> )]
>>
>> san_francisco_dt
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90       | 40       |
| 85       | 35       |
| 95       | 40       |
+----------+----------+

We can convert it to Celsius with vectorized operations and broadcasting:

>> (san_francisco_dt - 32) * 5/9
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 32.22    | 4.44     |
| 29.44    | 1.66     |
| 35       | 4.44     |
+----------+----------+

However, there are different types of transformations required from time to time that cannot be realized with arithmetic or comparison operations.

  • string transformation - e.g. strip out a few characters from each element of a string column.
  • datatype changes - e.g. you want to convert a column of float64 to int8.
  • reductions - e.g. you want to compute the statistical summaries of some columns.

DTable comes with a function .column_apply(func) that takes a func as an argument. It iterates through the columns in the table and invokes func with the values of that column as argument. It expects one or more columns in return. func basically performs a vectorized operation on the column values that it receives.

Let's look at some examples.

Case 1: convert the data type of columns. In the following table, the datatype of columns is float64. We want to convert it to int8 since that's good enough for us.

>> san_francisco_dt = dtables.load_tuples([
>>    ('max-temp', 'min-temp'),
>>    (90, 40),
>>    (85, 35),
>>    (95, 40)
>> )]
>>
>> san_francisco_dt
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90       | 40       |
| 85       | 35       |
| 95       | 40       |
+----------+----------+
>>
>> san_francisco_dt.meta.data_types
[
    ('max-temp', 'float64'),
    ('min-temp', 'float64')
]
>>

We can use the following code to change the data type for this table:

>> from dtables.transforms import astype_int8
>>
>> san_francisco_dt[ ['max-temp', 'min-temp'] ] = san_francisco_dt.column_apply( astype_int8 )
>>
>> san_francisco_dt.meta.data_types
[
    ('max-temp', 'int8'),
    ('min-temp', 'int8')
]
>>

The function astype_int8 takes an ndarray as input and converts the data type to int8. As an aside, this is mostly a syntactic sugar for the following piece of code:

san_francisco_dt['max-temp'] = san_francisco_dt['max-temp'].to_ndarray().astype('int8')
san_francisco_dt['min-temp'] = san_francisco_dt['min-temp'].to_ndarray().astype('int8')

Case 2: Convert the values to lowercase.

>> san_francisco_dt
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90F      | 40F      |
| 85F      | 35F      |
| 95F      | 40F      |
+----------+----------+
>>
>> from dtables.transforms import lower
>>
>> san_francisco_dt[ ['max-temp', 'min-temp'] ] = san_francisco_dt.column_apply(lower)
>>
>> san_francisco_dt
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90f      | 40f      |
| 85f      | 35f      |
| 95f      | 40f      |
+----------+----------+

Again, this is just a syntactic sugar for:

>> from numpy.core.defchararray import lower
>>
>> san_francisco_dt['max-temp'] = lower(san_francisco_dt['max-temp'].to_ndarray())
>> san_francisco_dt['min-temp'] = lower(san_francisco_dt['min-temp'].to_ndarray())

Case 3: Find mean value

>> san_francisco_dt
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90       | 40       |
| 85       | 35       |
| 95       | 40       |
+----------+----------+
>>
>> from dtables.transforms import mean
>> san_francisco_dt.column_apply(mean)
+----------+----------+
| max-temp | min-temp |
+----------+----------+
| 90       | 38.33    |
+----------+----------+

results matching ""

    No results matching ""