Aggregation

Introduction

We discuss the main functions (groupby!, groupby, and gatherby) to group observations in a data set in previous sections. In this section we introduce a function which can be used to apply a function on each group of observations.

combine

combine is a function which accepts a grouped data set - created by groupby!, groupby, or gatherby - and a set of operations in the form of cols => fun, where cols is a column selector and applies fun on each columns in cols. The operations can be passed as col => fun => :newname if user likes to have a specific column name for the output column. All columns selected by cols are assumed to be referring to the same columns in the input data set. However, when the passed fun is a byrow function then combine assumes every column in cols are referring to the existing columns in the output data set. Thus, unlike modify!/modify, combine only can access to the existing columns in the output data set when the fun is a byrow function.

The order of the output data set depends on the passed data set, i.e. for groupby the order of the output is sorted order of the grouping columns, and for gatherby data set the order of the output is based on the appearance of observations in the original data set. Since for most situations, the stability of grouping is not needed, passing stable = false in groupby/gatherby can improve the performance, but when stable = false for gatherby, the order of the output is undefined.

By default, combine outputs the grouping columns in the final result, however, passing dropgroupcols = true removes them from the final output.

Examples

julia> ds = Dataset(g = [1,2,1,2,1,2], x = 1:6)
6×2 Dataset
 Row │ g         x        
     │ identity  identity
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        1         1
   2 │        2         2
   3 │        1         3
   4 │        2         4
   5 │        1         5
   6 │        2         6

julia> combine(groupby(ds, :g), :x=>[sum, mean])
2×3 Dataset
 Row │ g         sum_x     mean_x   
     │ identity  identity  identity
     │ Int64?    Int64?    Float64?
─────┼──────────────────────────────
   1 │        1         9       3.0
   2 │        2        12       4.0

julia> combine(gatherby(ds, :g), :x => [maximum, minimum], 2:3 => byrow(-) => :range)
2×4 Dataset
 Row │ g         maximum_x  minimum_x  range    
     │ identity  identity   identity   identity
     │ Int64?    Int64?     Int64?     Int64?   
─────┼──────────────────────────────────────────
   1 │        1          5          1         4
   2 │        2          6          2         4

julia> ds = Dataset(rand(1:10, 10, 4), :auto)
10×4 Dataset
 Row │ x1        x2        x3        x4       
     │ identity  identity  identity  identity
     │ Int64?    Int64?    Int64?    Int64?   
─────┼────────────────────────────────────────
   1 │        9         1         6         3
   2 │       10         7        10         6
   3 │        7         7         3         9
   4 │        9         4        10         8
   5 │        7         3         4         5
   6 │        2         6         5         6
   7 │        1         6         6         1
   8 │       10         2         7         6
   9 │        5        10         9         6
  10 │        1         1         3         4

julia> combine(gatherby(ds, 1), r"x" => sum)
6×5 Dataset
 Row │ x1        sum_x1    sum_x2    sum_x3    sum_x4   
     │ identity  identity  identity  identity  identity
     │ Int64?    Int64?    Int64?    Int64?    Int64?   
─────┼──────────────────────────────────────────────────
   1 │        9        18         5        16        11
   2 │       10        20         9        17        12
   3 │        7        14        10         7        14
   4 │        2         2         6         5         6
   5 │        1         2         7         9         5
   6 │        5         5        10         9         6

julia> ds = Dataset(g = [1,2,1,2,1,2], x = 1:6)
6×2 Dataset
 Row │ g         x        
     │ identity  identity
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        1         1
   2 │        2         2
   3 │        1         3
   4 │        2         4
   5 │        1         5
   6 │        2         6

julia> combine(gatherby(ds, :g), :x=>[maximum, minimum], 2:3=>byrow(-)=>:range, dropgroupcols = true)
2×3 Dataset
 Row │ maximum_x  minimum_x  range    
     │ identity   identity   identity
     │ Int64?     Int64?     Int64?   
─────┼────────────────────────────────
   1 │         5          1         4
   2 │         6          2         4

combine treats each columns in cols individually, thus, a function can be applied to each column by cols => fun form. Julia broadcasting can be used to apply multiple functions on multiple columns, however, note that Julia broadcasting can only be used properly when the column selector is in the form of an abstract vector, i.e. vector of column names or column indices. If other form of column selector is needed to be used in this case, it must be transformed into the abstract vector form, e.g. by using names , like names(ds, r"x").

Any reduction on multiple columns should be go through a byrow approach.

In special cases, where users like to apply a multivariate function on a set of columns, the columns which are going to be the argument of the multivariate function must be passed as a Tuple of column names or column indices.

julia> ds = Dataset(g = [1,1,1,2,2,2],
            x1 = [1.2, 2.3, 1.3, 2.4, 4.5, 5.1],
            x2 = [11, 12.0, 11.0, 12.3, 14.5, 16.9])
6×3 Dataset
Row │ g         x1        x2       
    │ identity  identity  identity
    │ Int64?    Float64?  Float64?
────┼──────────────────────────────
  1 │        1       1.2      11.0
  2 │        1       2.3      12.0
  3 │        1       1.3      11.0
  4 │        2       2.4      12.3
  5 │        2       4.5      14.5
  6 │        2       5.1      16.9

julia> combine(gatherby(ds, 1, isgathered=true), (2,3)=>cor)
2×2 Dataset
Row │ g         cor_x1_x2
    │ identity  identity  
    │ Int64?    Float64?  
────┼─────────────────────
  1 │        1   0.996616
  2 │        2   0.944252