Aggregation
Introduction
In previous sections we discuss the main functions (groupby!
, groupby
, and gatherby
) to group observations in a data set. In this section we introduce a function which can be used to apply a function on each group of observations.
combine
combine
is a function which accepts a grouped data set - created by groupby!
, groupby
, or gatherby
- and a set of operations in the form of cols => fun
, where cols
is a column selector, and applies fun
on each columns in cols
. The operations can be passed as col => fun => :newname
if user likes to have a specific column name for the output column. All columns selected by cols
are assumed to be referring to the same columns in the input data set. However, when the passed fun
is a byrow
function then combine
assumes every column in cols
are referring to the existing columns in the output data set. Thus, unlike modify!
/modify
, combine
only can access to the existing columns in the output data set when the fun
is a byrow
function.
The order of the output data set depends on the passed data set, i.e. for groupby
the order of the output is sorted order of the grouping columns, and for gatherby
data set the order of the output is based on the appearance of observations in the original data set. Since for most situations, the stability of grouping is not needed, passing stable = false
in groupby/gatherby
can improve the performance, but when stable = false
for gatherby
, the order of the output is undefined.
By default, combine
outputs the grouping columns in the final result, however, passing dropgroupcols = true
removes them from the final output.
By default
combine
send each group of data to different threads for processing (multithreaded processing), however, passingthreads = false
changes this.
Examples
julia> ds = Dataset(g = [1,2,1,2,1,2], x = 1:6)
6×2 Dataset
Row │ g x
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 1
2 │ 2 2
3 │ 1 3
4 │ 2 4
5 │ 1 5
6 │ 2 6
julia> combine(groupby(ds, :g), :x=>[IMD.sum, mean])
2×3 Dataset
Row │ g sum_x mean_x
│ identity identity identity
│ Int64? Int64? Float64?
─────┼──────────────────────────────
1 │ 1 9 3.0
2 │ 2 12 4.0
julia> combine(gatherby(ds, :g), :x => [IMD.maximum, IMD.minimum], 2:3 => byrow(-) => :range)
2×4 Dataset
Row │ g maximum_x minimum_x range
│ identity identity identity identity
│ Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────
1 │ 1 5 1 4
2 │ 2 6 2 4
julia> ds = Dataset(rand(1:10, 10, 4), :auto)
10×4 Dataset
Row │ x1 x2 x3 x4
│ identity identity identity identity
│ Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────────
1 │ 9 1 6 3
2 │ 10 7 10 6
3 │ 7 7 3 9
4 │ 9 4 10 8
5 │ 7 3 4 5
6 │ 2 6 5 6
7 │ 1 6 6 1
8 │ 10 2 7 6
9 │ 5 10 9 6
10 │ 1 1 3 4
julia> combine(gatherby(ds, 1), r"x" => IMD.sum)
6×5 Dataset
Row │ x1 sum_x1 sum_x2 sum_x3 sum_x4
│ identity identity identity identity identity
│ Int64? Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────
1 │ 9 18 5 16 11
2 │ 10 20 9 17 12
3 │ 7 14 10 7 14
4 │ 2 2 6 5 6
5 │ 1 2 7 9 5
6 │ 5 5 10 9 6
julia> ds = Dataset(g = [1,2,1,2,1,2], x = 1:6)
6×2 Dataset
Row │ g x
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 1
2 │ 2 2
3 │ 1 3
4 │ 2 4
5 │ 1 5
6 │ 2 6
julia> combine(gatherby(ds, :g), :x=>[IMD.maximum, IMD.minimum], 2:3=>byrow(-)=>:range, dropgroupcols = true)
2×3 Dataset
Row │ maximum_x minimum_x range
│ identity identity identity
│ Int64? Int64? Int64?
─────┼────────────────────────────────
1 │ 5 1 4
2 │ 6 2 4
combine
treats each columns in cols
individually, thus, a function can be applied to each column by cols => fun
form. combine
normalises cols => funs
to col1 => funs
, col2 => funs
, ..., where col1
refers to the first column in the column selector cols
, col2
refers to the second one, .... When col => funs
is passed to the function where col
refers to a single column, combine
normalises it as col => fun1
, col => fun2
, ..., where fun1
, fun2
,... are the first, second, ... functions in passed funs
.
Any reduction on multiple columns should be go through a byrow
approach.
In special cases, where users like to apply a multivariate function on a set of columns, the columns which are going to be the argument of the multivariate function must be passed as a Tuple of column names or column indices.
julia> ds = Dataset(g = [1,1,1,2,2,2],
x1 = [1.2, 2.3, 1.3, 2.4, 4.5, 5.1],
x2 = [11, 12.0, 11.0, 12.3, 14.5, 16.9])
6×3 Dataset
Row │ g x1 x2
│ identity identity identity
│ Int64? Float64? Float64?
────┼──────────────────────────────
1 │ 1 1.2 11.0
2 │ 1 2.3 12.0
3 │ 1 1.3 11.0
4 │ 2 2.4 12.3
5 │ 2 4.5 14.5
6 │ 2 5.1 16.9
julia> combine(gatherby(ds, 1, isgathered=true), (2,3)=>cor)
2×2 Dataset
Row │ g cor_x1_x2
│ identity identity
│ Int64? Float64?
────┼─────────────────────
1 │ 1 0.996616
2 │ 2 0.944252