Sorting Datasets
Introduction
Sorting is one of the key tasks for Datasets. Actually, when we group a data set by given set of columns, InMemoryDatasets does sorting behind the scene and groups the observations based on their sorted values. The joining algorithms also uses the sorting functions for finding the matched observations. One may sort a data set based on a set of columns by either their formatted values, or their actual values. In this section we go through the main functions for sorting Datasets.
Note that, by default, InMemoryDatasets uses parallel algorithms for sorting observations. User may stop parallel sorting by passing
threads = false
.
sort!/sort
The sort!
function accepts a Dataset and a set of columns and sorts the given Dataset based on provided columns. By default the sort!
function does the sorting based on the formatted values, however, using mapformats = false
forces the sorting be done based on the actual values. sort!
doesn't create a new dataset, it only replaces the original one with the sorted one. If the original data set needed to be untouched the sort
function must be used. By default, both sort!
and sort
functions do a stable sort using a hybrid Heap
sort algorithm. If the stability of the sort is not needed, using the keyword option stable = false
can improve the performance. User can also change the default sorting algorithm to hybrid QuickSort
by using the alg = QuickSort
option. For most problems QuickSort
algorithm is faster than the default algorithm, however, its worst case scenario is slower than the default algorithm.
By default the ascending sorting is used for the sorting task, and using rev = true
changes it to descending ordering, and for multiple columns a vector of true
, false
can be supplied for this option, i.e. each column can be sorted in ascending or descending order independently. Note that:
Datasets uses
isless
for checking the order of values.Datasets prints extra information when it shows a sorted data set.
Like Julia Base, the missing values are treated larger than any other values.
sort
creates a copy of data and permutes each column of it and attaches some attributes to the new data set. To sort a data set without creating a new data set or modifying the original data set, someone may use thegroupby
function. Thegroupby
function sorts and then creates a meta information about the sorted data set. Thegroupby
function will be discussed in another section in detail.
Examples
julia> ds = Dataset(x = [5,4,3,2,1], y = [42,52,4,1,55])
5×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 5 42
2 │ 4 52
3 │ 3 4
4 │ 2 1
5 │ 1 55
julia> sort!(ds, :x);
julia> ds
5×2 Sorted Dataset
Sorted by: x
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 55
2 │ 2 1
3 │ 3 4
4 │ 4 52
5 │ 5 42
julia> sort(ds, :y, rev = true)
5×2 Sorted Dataset
Sorted by: y
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 55
2 │ 4 52
3 │ 5 42
4 │ 3 4
5 │ 2 1
julia> ds = Dataset(x = [5, 4, missing, 4],
y = [3, missing, missing , 1])
4×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 5 3
2 │ 4 missing
3 │ missing missing
4 │ 4 1
julia> sort(ds, 1:2)
4×2 Sorted Dataset
Sorted by: x, y
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 4 1
2 │ 4 missing
3 │ 5 3
4 │ missing missing
julia> sort(ds, 1:2, rev = [false, true])
4×2 Sorted Dataset
Sorted by: x, y
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 4 missing
2 │ 4 1
3 │ 5 3
4 │ missing missing
The following examples show how the sorting functions work with formats.
julia> ds = Dataset(state = ["CA", "TX", "IL", "IL", "IL", "CA", "TX", "TX"],
date = [Date("2020-01-01"), Date("2020-03-01"), Date("2020-01-01"),
Date("2020-03-01"), Date("2020-02-01"), Date("2021-03-01"),
Date("2021-02-01"), Date("2020-02-01")],
qt = [123, 143, 144, 199, 153, 144, 134, 188])
8×3 Dataset
Row │ state date qt
│ identity identity identity
│ String? Date? Int64?
─────┼────────────────────────────────
1 │ CA 2020-01-01 123
2 │ TX 2020-03-01 143
3 │ IL 2020-01-01 144
4 │ IL 2020-03-01 199
5 │ IL 2020-02-01 153
6 │ CA 2021-03-01 144
7 │ TX 2021-02-01 134
8 │ TX 2020-02-01 188
julia> setformat!(ds, :date=>month)
8×3 Dataset
Row │ state date qt
│ identity month identity
│ String? Date? Int64?
─────┼───────────────────────────
1 │ CA 1 123
2 │ TX 3 143
3 │ IL 1 144
4 │ IL 3 199
5 │ IL 2 153
6 │ CA 3 144
7 │ TX 2 134
8 │ TX 2 188
julia> sort(ds, [2,1])
8×3 Sorted Dataset
Sorted by: date, state
Row │ state date qt
│ identity month identity
│ String? Date? Int64?
─────┼───────────────────────────
1 │ CA 1 123
2 │ IL 1 144
3 │ IL 2 153
4 │ TX 2 134
5 │ TX 2 188
6 │ CA 3 144
7 │ IL 3 199
8 │ TX 3 143
julia> sort(ds, [2,1], mapformats = false)
8×3 Sorted Dataset
Sorted by: date, state
Row │ state date qt
│ identity month identity
│ String? Date? Int64?
─────┼───────────────────────────
1 │ CA 1 123
2 │ IL 1 144
3 │ IL 2 153
4 │ TX 2 188
5 │ IL 3 199
6 │ TX 3 143
7 │ TX 2 134
8 │ CA 3 144
julia> sort(ds, [1,2], mapformats = false)
8×3 Sorted Dataset
Sorted by: state, date
Row │ state date qt
│ identity month identity
│ String? Date? Int64?
─────┼───────────────────────────
1 │ CA 1 123
2 │ CA 3 144
3 │ IL 1 144
4 │ IL 2 153
5 │ IL 3 199
6 │ TX 2 188
7 │ TX 3 143
8 │ TX 2 134
sortperm
The sortperm(ds, cols)
function returns a permutation vector perm
that puts ds[perm, :]
in sorted order based on cols
. Similar to sort!
/sort
, this function accepts rev
, alg
, mapformats
and stable
options.
Examples
julia> ds = Dataset(x = [1,4,3,2,1], y = [420,52,4,1,55])
5×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 420
2 │ 4 52
3 │ 3 4
4 │ 2 1
5 │ 1 55
julia> p = sortperm(ds, [1,2])
5-element Vector{Int32}:
5
1
4
3
2
julia> ds[p, :]
5×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 55
2 │ 1 420
3 │ 2 1
4 │ 3 4
5 │ 4 52
In the following example we show the performance difference between the default algorithm and QuickSort
algorithm for sorting Integers.
julia> ds = Dataset(x = rand(Int, 10^8));
julia> @time sortperm(ds,1, stable = false);
21.663503 seconds (604 allocations: 2.887 GiB, 0.11% gc time)
julia> @time sortperm(ds,1, stable = false, alg = QuickSort);
4.818334 seconds (591 allocations: 2.887 GiB, 7.05% gc time)
unsort!
The unsort!
function undo the last sort operation that has been done on a data set, i.e. when a sort!
function has been applied to a data set, directly or indirectly (e.g. the groupby!
function is one of the functions which uses sort!
behind the scene), the unsort!
function can undo it.
julia> ds = Dataset(x = [1,3,1,2,6,1,4,4],
y = [100,150,90,110,100,80,50,30])
8×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
────┼────────────────────
1 │ 1 100
2 │ 3 150
3 │ 1 90
4 │ 2 110
5 │ 6 100
6 │ 1 80
7 │ 4 50
8 │ 4 30
julia> sort!(ds, 1:2);
julia> ds
8×2 Sorted Dataset
Sorted by: x, y
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 80
2 │ 1 90
3 │ 1 100
4 │ 2 110
5 │ 3 150
6 │ 4 30
7 │ 4 50
8 │ 6 100
julia> unsort!(ds)
8×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 100
2 │ 3 150
3 │ 1 90
4 │ 2 110
5 │ 6 100
6 │ 1 80
7 │ 4 50
8 │ 4 30
issorted
/issorted!
The issorted
function checks if a data set is sorted by given column(s). The syntax for the function is issorted(ds, cols)
, and by default the mapformats
keyword argument is set to true
and the rev
keyword argument is set to false
. The issorted!
function does the same job, however, if it returns true
it marks the input data set as a sorted data set, i.e. it attaches some meta information to the data set.
Examples
julia> ds = Dataset(x1 = [1, 4, 7], x2 = [3.0, 1.1, -10.0], x3 = ["one", "two", "three"])
3×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Float64? String?
─────┼──────────────────────────────
1 │ 1 3.0 one
2 │ 4 1.1 two
3 │ 7 -10.0 three
julia> issorted(ds, 1)
true
julia> issorted(ds, 2)
false
julia> issorted(ds, 2, rev = true)
true
julia> fmt(x) = x == "one" ? 1 : x=="two" ? 2 : 3
fmt (generic function with 1 method)
julia> setformat!(ds, :x3=>fmt)
3×3 Dataset
Row │ x1 x2 x3
│ identity identity fmt
│ Int64? Float64? String?
─────┼─────────────────────────────
1 │ 1 3.0 1
2 │ 4 1.1 2
3 │ 7 -10.0 3
julia> issorted(ds, 3)
true
julia> issorted!(ds, 1:3, rev = [false, true, false])
true
julia> ds
3×3 Sorted Dataset
Sorted by: x1, x2, x3
Row │ x1 x2 x3
│ identity identity fmt
│ Int64? Float64? String?
─────┼─────────────────────────────
1 │ 1 3.0 1
2 │ 4 1.1 2
3 │ 7 -10.0 3
Performance improvement using formats
In some scenarios the performance of sort can be improved by using formats. For example, when we know for a specific column there is only a few numbers after the decimal point, using a format can improve the performance of the sort. In the following example we are using the @btime
macro from the BenchmarkTools
package to demonstrate this;
julia> ;# column :x1 has at most 2 digits after the decimal point
julia> ds = Dataset(x1 = round.(rand(10^6),digits = 2),
x2 = repeat(1:100, 10^4));
julia> @btime sortperm(ds, 1);
37.169 ms (751 allocations: 31.53 MiB)
julia> custom_fmt(x) = round(x * 100);
julia> setformat!(ds, 1=>custom_fmt);
julia> @btime sortperm(ds, 1);
5.678 ms (317 allocations: 17.19 MiB)
The 6 times improvement in the performance is due to the fact that the formatted values in the data set are basically integer rather than float (the actual values) and the algorithms for sorting integers are usually faster than those for sorting double precision numbers.
This can be generalised to situations where the numbers after the decimal point is not known, e.g. suppose we have the following data set
julia> ds = Dataset(x = rand(10^6))
in this case we cannot use the round trick directly, however, we can create an alias of :x
and partially apply round
trick on alias column and sort the data set based on both columns,
julia> fmt(x) = round(Int, x*100) # split data up to 100 parts
julia> modify!(ds, :x => identity => :_tmp) # alias of :x - This is an instance operation
julia> setformat!(ds, :_tmp=>fmt)
julia> @btime sortperm(ds, [:x], alg = QuickSort); # without using formats
36.460 ms (508 allocations: 29.60 MiB)
julia> @btime sortperm(ds, [:_tmp, :x], alg = QuickSort);
17.550 ms (505 allocations: 25.79 MiB)
This works because the second method exploits integer and float sorting algorithms and also utilises multiple cpus more efficiently.
Another trick can be used for situations when a data set contains a column of string values where the values can be treated as numbers, e.g. in the following code :x1
is basically integer values with "id"
been attached to each value, here we use a customised format that extracts the numeric values from :x1
;
julia> ds = Dataset(x1 = "id" .* string.(rand(1:100000, 10^6)));
julia> @btime sortperm(ds, 1);
257.070 ms (580 allocations: 27.70 MiB)
julia> custom_fmt(x) = parse(Int, @views x[3:end])
custom_fmt (generic function with 1 method)
julia> setformat!(ds, 1=>custom_fmt);
julia> @btime sortperm(ds, 1);
15.456 ms (278 allocations: 17.57 MiB)