Group observations
Introduction
InMemoryDatasets uses two approaches to group observations: sorting, and hashing. In sorting approach, it sorts the data set based on given columns and finds the starts and ends of each group based on the sorted values. In hashing approach, it uses a hybrid algorithm to group observations. Each of these approaches has some advantages over the other one and for any particular problem one of them might be more suitable than the other one.
By default, the functions for grouping observations exploits all core available to Julia for group observations, however, passing
threads = false
change this.
groupby!
and groupby
The main functions for grouping observations based on sorting approach are groupby!
and groupby
. The groupby!
function replaces the original data set with the sorted one and attaches a meta information about the grouping orders to the replaced data set, on the other hand, the groupby
function performs the sorting phase, however, it creates a view of the main data set where the meta information is attached to it. The output of groupby
is basically a view of the sorted data set.
The syntax for calling groupby!
and groupby
is the same as the sort!
function. This means groupby!
and groupby
accept all keyword arguments that the sort!
function supports, these include:
rev
with default value offalse
mapformats
with default value oftrue
, i.e. by default these functions group data sets based on the formatted values.stable
with default value oftrue
alg
which by default is set toHeapSortAlg
, and it can be set asQuickSort
too.
Removing formats of columns that are used for
groupby!
withmapformats = true
removes the grouping information too, i.e. the data set will not be marked as grouped/sorted data set .
Examples
julia> ds = Dataset(g = [1, 2, 1, 2, 1, 2], x = [12.0, 12.3, 11.0, 13.0, 15.0, 13.2])
6×2 Dataset
Row │ g x
│ identity identity
│ Int64? Float64?
─────┼────────────────────
1 │ 1 12.0
2 │ 2 12.3
3 │ 1 11.0
4 │ 2 13.0
5 │ 1 15.0
6 │ 2 13.2
julia> groupby!(ds, 1)
6×2 Grouped Dataset with 2 groups
Grouped by: g
Row │ g x
│ identity identity
│ Int64? Float64?
─────┼────────────────────
1 │ 1 12.0
2 │ 1 11.0
3 │ 1 15.0
4 │ 2 12.3
5 │ 2 13.0
6 │ 2 13.2
julia> ds # ds has been replaced with its grouped version
6×2 Grouped Dataset with 2 groups
Grouped by: g
Row │ g x
│ identity identity
│ Int64? Float64?
─────┼────────────────────
1 │ 1 12.0
2 │ 1 11.0
3 │ 1 15.0
4 │ 2 12.3
5 │ 2 13.0
6 │ 2 13.2
julia> ds = Dataset(group = ["c1", "c2", "c1", "c3", "c1", "c3"], x = 1:6)
6×2 Dataset
Row │ group x
│ identity identity
│ String? Int64?
─────┼────────────────────
1 │ c1 1
2 │ c2 2
3 │ c1 3
4 │ c3 4
5 │ c1 5
6 │ c3 6
julia> groupby(ds, :group)
6×2 View of Grouped Dataset, Grouped by: group
group x
identity identity
String? Int64?
────────────────────
c1 1
c1 3
c1 5
c2 2
c3 4
c3 6
julia> ds # ds is untouched
6×2 Dataset
Row │ group x
│ identity identity
│ String? Int64?
─────┼────────────────────
1 │ c1 1
2 │ c2 2
3 │ c1 3
4 │ c3 4
5 │ c1 5
6 │ c3 6
julia> salary = Dataset(id = 1:10,
salary=[100, 120, 301, 95, 200, 75, 150, 67, 90, 110])
10×2 Dataset
Row │ id salary
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 100
2 │ 2 120
3 │ 3 301
4 │ 4 95
5 │ 5 200
6 │ 6 75
7 │ 7 150
8 │ 8 67
9 │ 9 90
10 │ 10 110
julia> s_grp(x) = x < 100 ? 1 : x < 200 ? 2 : 3
s_grp (generic function with 1 method)
julia> setformat!(salary, :salary => s_grp)
10×2 Dataset
Row │ id salary
│ identity s_grp
│ Int64? Int64?
─────┼──────────────────
1 │ 1 2
2 │ 2 2
3 │ 3 3
4 │ 4 1
5 │ 5 3
6 │ 6 1
7 │ 7 2
8 │ 8 1
9 │ 9 1
10 │ 10 2
julia> groupby(salary, 2)
10×2 View of Grouped Dataset, Grouped by: salary
id salary
identity s_grp
Int64? Int64?
──────────────────
4 1
6 1
8 1
9 1
1 2
2 2
7 2
10 2
3 3
5 3
julia> ds = Dataset(x=[1,1,2,2], y=[1,2,1,2], z=[1,1,1,1])
julia> groupby!(ds, [:x, :y]) # groupby by more than one column
4×3 Grouped Dataset with 4 groups
Grouped by: x, y
Row │ x y z
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 1 1
2 │ 1 2 1
3 │ 2 1 1
4 │ 2 2 1
The groupby!
and groupby
functions accept the output of the groupby
function. Thus, some may use these functions to incrementally group a data set.
When the groupby!
function is used on a data set, the data set is marked as a grouped data set and the functions which handle grouped data set differently are signalled when the grouped data sets are passed as their arguments. Two of those functions are modify!
and modify
functions. When a grouped data set is passed to these two functions, InMemoryDatasets applies each modification within each group. The modify!
and modify
functions treat the view of a grouped data set (produced by the groupby
function) in the same way without changing the order of the original data set. For better performance, set stable = false
when the groupby
function is used in conjunction with modify!
or modify
.
Examples
julia> ds = Dataset(g = [2, 1, 1, 2, 2],
x1_int = [0, 0, 1, missing, 2],
x2_int = [3, 2, 1, 3, -2],
x1_float = [1.2, missing, -1.0, 2.3, 10],
x2_float = [missing, missing, 3.0, missing, missing],
x3_float = [missing, missing, -1.4, 3.0, -100.0])
5×6 Dataset
Row │ g x1_int x2_int x1_float x2_float x3_float
│ identity identity identity identity identity identity
│ Int64? Int64? Int64? Float64? Float64? Float64?
─────┼───────────────────────────────────────────────────────────────
1 │ 2 0 3 1.2 missing missing
2 │ 1 0 2 missing missing missing
3 │ 1 1 1 -1.0 3.0 -1.4
4 │ 2 missing 3 2.3 missing 3.0
5 │ 2 2 -2 10.0 missing -100.0
julia> groupby!(ds, 1)
5×6 Grouped Dataset with 2 groups
Grouped by: g
Row │ g x1_int x2_int x1_float x2_float x3_float
│ identity identity identity identity identity identity
│ Int64? Int64? Int64? Float64? Float64? Float64?
─────┼───────────────────────────────────────────────────────────────
1 │ 1 0 2 missing missing missing
2 │ 1 1 1 -1.0 3.0 -1.4
3 │ 2 0 3 1.2 missing missing
4 │ 2 missing 3 2.3 missing 3.0
5 │ 2 2 -2 10.0 missing -100.0
julia> modify(ds, r"int" => x -> x .- IMD.maximum(x))
5×6 Grouped Dataset with 2 groups
Grouped by: g
Row │ g x1_int x2_int x1_float x2_float x3_float
│ identity identity identity identity identity identity
│ Int64? Int64? Int64? Float64? Float64? Float64?
─────┼───────────────────────────────────────────────────────────────
1 │ 1 -1 0 missing missing missing
2 │ 1 0 -1 -1.0 3.0 -1.4
3 │ 2 -2 0 1.2 missing missing
4 │ 2 missing 0 2.3 missing 3.0
5 │ 2 0 -5 10.0 missing -100.0
julia> sale = Dataset(date = [Date(2012, 11), Date(2013, 5), Date(2012, 4),
Date(2013, 1), Date(2014, 8), Date(2013, 2)],
sale = [100, 200, 140, 200, 132, 150])
6×2 Dataset
Row │ date sale
│ identity identity
│ Date? Int64?
─────┼──────────────────────
1 │ 2012-11-01 100
2 │ 2013-05-01 200
3 │ 2012-04-01 140
4 │ 2013-01-01 200
5 │ 2014-08-01 132
6 │ 2013-02-01 150
julia> setformat!(sale, :date=>year)
6×2 Dataset
Row │ date sale
│ year identity
│ Date? Int64?
─────┼─────────────────
1 │ 2012 100
2 │ 2013 200
3 │ 2012 140
4 │ 2013 200
5 │ 2014 132
6 │ 2013 150
julia> spct(x) = x ./ IMD.sum(x) .* 100
spct (generic function with 1 method)
julia> modify(groupby(sale, :date), :sale => spct => :sale_pct)
6×3 Dataset
Row │ date sale sale_pct
│ year identity identity
│ Date? Int64? Float64?
─────┼───────────────────────────
1 │ 2012 100 41.6667
2 │ 2013 200 36.3636
3 │ 2012 140 58.3333
4 │ 2013 200 36.3636
5 │ 2014 132 100.0
6 │ 2013 150 27.2727
ungroup!
The ungroup!
function is a utility function that removes the grouped
mark from a grouped data set produced by groupby!
. The function doesn't change the permutation of the data set, thus, even the data set is not any more grouped, it is still sorted, and it is very efficient to re-group it. However, note that the last modified time of the data set is updated when ungroup!
is called on a data set.
The ungroup!
function can be used in scenarios that one needs to modify a data set but it is not desired to apply a specific modification within each group, instead the modification is needed to be applied to the whole column. In these kind of situations, first ungroup!
is used to remove the grouping mark and then the modify!
function can be used on the data set. The groupby!
function can be used afterward to mark the data set as grouped data set.
gatherby
The gatherby
function uses the hashing approach to group observations based on a set of columns. InMemoryDatasets uses a hybrid algorithm to gather observations which sometimes does this without using the hash
function. The gatherby
function doesn't sort the data set, instead, it uses the hybrid algorithm to group observations. gatherby
can be particularly useful when sorting is computationally expensive. Another benefit of gatherby
is that, by default, it keeps the order of observations in each group the same as their appearance in the original data set.
The gatherby
function uses the formatted values for gathering the observations into groups, however, using mapformats = false
changes this behaviour.
The syntax for using the gatherby
function is gatherby(ds, cols)
where ds
is the data set and cols
is any column selector which indicates the columns which are going to be used in gathering.
Examples
julia> ds = Dataset(grp = [1, 2, 3, 3, 1, 3, 2, 1],
x = [true, false, true, true, true, true, false, false])
8×2 Dataset
Row │ grp x
│ identity identity
│ Int64? Bool?
─────┼────────────────────
1 │ 1 true
2 │ 2 false
3 │ 3 true
4 │ 3 true
5 │ 1 true
6 │ 3 true
7 │ 2 false
8 │ 1 false
julia> gatherby(ds, :x)
8×2 View of GatherBy Dataset, Gathered by: x
grp x
identity identity
Int64? Bool?
────────────────────
1 true
3 true
3 true
1 true
3 true
2 false
2 false
1 false
julia> setformat!(ds, 1=>isodd)
8×2 Dataset
Row │ grp x
│ isodd identity
│ Int64? Bool?
─────┼──────────────────
1 │ true true
2 │ false false
3 │ true true
4 │ true true
5 │ true true
6 │ true true
7 │ false false
8 │ true false
julia> gatherby(ds, 1)
8×2 View of GatherBy Dataset, Gathered by: grp
grp x
isodd identity
Int64? Bool?
──────────────────
true true
true true
true true
true true
true true
true false
false false
false false
Similar to groupby!/groupby
functions, gatherby
can be passed to functions which operate on grouped data sets.
As mentioned before, the result of gatherby
is stable, i.e. the observations order within each group will be the order of their appearance in the original data set. However, when this stability is not needed and there are many groups in the data set, passing stable = false
improves the performance by sacrificing the stability.
The gatherby
function has two extra keyword arguments, isgathered
and eachrow
, which by default are set to false
. When the isgathered
argument is set to true
, InMemoryDatasets assumes that the observations are currently gathered by some rules and it only finds the starts and ends of each group and marks the data set as gathered. So users can manually group observations by setting this keyword argument. When the eachrow
argument is set to true
, InMemoryDatasets does the gathering and then mark each row of the input data set as an individual group. This option is handy for transposing data sets.
Iterate eachgroup
User can use eachgroup
to iterate each group of a grouped data set. Each element of eachgroup
is a SubDataset
.
Examples
julia> ds = Dataset(rand(1:10, 10, 3), :auto)
10×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 7 8 10
2 │ 4 1 5
3 │ 7 2 5
4 │ 4 7 4
5 │ 5 9 6
6 │ 9 5 3
7 │ 9 8 2
8 │ 7 9 6
9 │ 2 3 8
10 │ 1 6 2
julia> i_gds = eachgroup(groupby(ds, 1));
julia> map(nrow, i_gds)
6-element Vector{Int64}:
1
1
2
1
3
2
julia> i_gds[1]
1×3 SubDataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 6 2
julia> i_gds[end]
2×3 SubDataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 9 5 3
2 │ 9 8 2