Aggregation
Introduction
We discuss the main functions (groupby!
, groupby
, and gatherby
) to group observations in a data set in previous sections. In this section we introduce a function which can be used to apply a function on each group of observations.
combine
combine
is a function which accepts a grouped data set - created by groupby!
, groupby
, or gatherby
- and a set of operations in the form of cols => fun
, where cols
is a column selector and applies fun
on each columns in cols
. The operations can be passed as col => fun => :newname
if user likes to have a specific column name for the output column. All columns selected by cols
are assumed to be referring to the same columns in the input data set. However, when the passed fun
is a byrow
function then combine
assumes every column in cols
are referring to the existing columns in the output data set. Thus, unlike modify!
/modify
, combine
only can access to the existing columns in the output data set when the fun
is a byrow
function.
The order of the output data set depends on the passed data set, i.e. for groupby
the order of the output is sorted order of the grouping columns, and for gatherby
data set the order of the output is based on the appearance of observations in the original data set. Since for most situations, the stability of grouping is not needed, passing stable = false
in groupby/gatherby
can improve the performance, but when stable = false
for gatherby
, the order of the output is undefined.
By default, combine
outputs the grouping columns in the final result, however, passing dropgroupcols = true
removes them from the final output.
Examples
julia> ds = Dataset(g = [1,2,1,2,1,2], x = 1:6)
6×2 Dataset
Row │ g x
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 1
2 │ 2 2
3 │ 1 3
4 │ 2 4
5 │ 1 5
6 │ 2 6
julia> combine(groupby(ds, :g), :x=>[sum, mean])
2×3 Dataset
Row │ g sum_x mean_x
│ identity identity identity
│ Int64? Int64? Float64?
─────┼──────────────────────────────
1 │ 1 9 3.0
2 │ 2 12 4.0
julia> combine(gatherby(ds, :g), :x => [maximum, minimum], 2:3 => byrow(-) => :range)
2×4 Dataset
Row │ g maximum_x minimum_x range
│ identity identity identity identity
│ Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────
1 │ 1 5 1 4
2 │ 2 6 2 4
julia> ds = Dataset(rand(1:10, 10, 4), :auto)
10×4 Dataset
Row │ x1 x2 x3 x4
│ identity identity identity identity
│ Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────────
1 │ 9 1 6 3
2 │ 10 7 10 6
3 │ 7 7 3 9
4 │ 9 4 10 8
5 │ 7 3 4 5
6 │ 2 6 5 6
7 │ 1 6 6 1
8 │ 10 2 7 6
9 │ 5 10 9 6
10 │ 1 1 3 4
julia> combine(gatherby(ds, 1), r"x" => sum)
6×5 Dataset
Row │ x1 sum_x1 sum_x2 sum_x3 sum_x4
│ identity identity identity identity identity
│ Int64? Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────
1 │ 9 18 5 16 11
2 │ 10 20 9 17 12
3 │ 7 14 10 7 14
4 │ 2 2 6 5 6
5 │ 1 2 7 9 5
6 │ 5 5 10 9 6
julia> ds = Dataset(g = [1,2,1,2,1,2], x = 1:6)
6×2 Dataset
Row │ g x
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 1
2 │ 2 2
3 │ 1 3
4 │ 2 4
5 │ 1 5
6 │ 2 6
julia> combine(gatherby(ds, :g), :x=>[maximum, minimum], 2:3=>byrow(-)=>:range, dropgroupcols = true)
2×3 Dataset
Row │ maximum_x minimum_x range
│ identity identity identity
│ Int64? Int64? Int64?
─────┼────────────────────────────────
1 │ 5 1 4
2 │ 6 2 4
combine
treats each columns in cols
individually, thus, a function can be applied to each column by cols => fun
form. Julia
broadcasting can be used to apply multiple functions on multiple columns, however, note that Julia
broadcasting can only be used properly when the column selector is in the form of an abstract vector, i.e. vector of column names or column indices. If other form of column selector is needed to be used in this case, it must be transformed into the abstract vector form, e.g. by using names
, like names(ds, r"x")
.
Any reduction on multiple columns should be go through a byrow
approach.
In special cases, where users like to apply a multivariate function on a set of columns, the columns which are going to be the argument of the multivariate function must be passed as a Tuple of column names or column indices.
julia> ds = Dataset(g = [1,1,1,2,2,2],
x1 = [1.2, 2.3, 1.3, 2.4, 4.5, 5.1],
x2 = [11, 12.0, 11.0, 12.3, 14.5, 16.9])
6×3 Dataset
Row │ g x1 x2
│ identity identity identity
│ Int64? Float64? Float64?
────┼──────────────────────────────
1 │ 1 1.2 11.0
2 │ 1 2.3 12.0
3 │ 1 1.3 11.0
4 │ 2 2.4 12.3
5 │ 2 4.5 14.5
6 │ 2 5.1 16.9
julia> combine(gatherby(ds, 1, isgathered=true), (2,3)=>cor)
2×2 Dataset
Row │ g cor_x1_x2
│ identity identity
│ Int64? Float64?
────┼─────────────────────
1 │ 1 0.996616
2 │ 2 0.944252