Filter observations
Introduction
In this section, the InMemoryDatasets' APIs for filtering observations are discussed. We provides information about four main ways to filter observations based on some conditions, 1) using the byrow
function, 2) using the mask
function, 3) using the contains
and related functions, 4) and using Julia broadcasting.
byrow
byrow
has been discussed previously in details. However, in this section we are going to use it for filtering observations. To use byrow(ds, fun, cols, ...)
for filtering observations, we set fun
argument to all
or any
, and supply the conditions by using the by
keyword option. The supplied by
will be checked for each observation in all selected columns. The function returns a boolean vector where its j
th elements will be equivalent to the result of all(by, [col1[j], col2[j], ...])
or any(by, [col1[j], col2[j], ...])
when all
or any
is set as the fun
argument, respectively.
The main feature of byrow(ds, fun, cols, by = ...)
when fun
is all/any
is that the by
keyword argument can be a vector of functions. Thus, when a multiple columns are supplied as cols
each column can have its own by
. To filter based on formatted value the mapformats
keyword argument must be set to true
.
Naturally, other fun
s supported by byrow
which return a Vector{Bool}
or BitVector
can be used to filter observations, too.
filter
and filter!
The filter
function exploits the byrow
function to filter a data set. Basically, it calls byrow(ds, type, cols; ...)
to return a boolean vector which its true elements indicate the filtered rows, and subsequently, it calls findall
and getindex
to extract those filtered rows. Thus, user must pass a value to type
which byrow(ds, type, cols; ...)
returns a boolean vector (or a BitVector
).
If view = false
a freshly allocated Dataset
is returned, otherwise, a SubDataset
view into ds
is returned.
The missings
keyword argument controls how the missing values should be interpreted by filter
. By default, the missing values are left as missing
, however, user can set it as false
or true
to force filter
to interpret the missing values as false
or true
, respectively.
Beside type
and missings
, any passed keyword arguments to filter
will be passed to the corresponding byrow
function. For a list of keyword arguments supported by a given type
, see the help of byrow
for that specific type, e.g. in Julia REPL type ?byrow(type)
for a given type
to see the documentation of the selected type
.
The following provides more details about type = all
and type = any
.
type = all
and type = any
When type = all
(type = any
), filter
filters each row that all (any) of its values are true
(testing by isequal
). User can pass the keyword argument by
to replace isequal
with any other predicators. The by
keyword argument can be a single function or a vector of functions. When a single function is passed to the by
keyword argument, all columns use it as predicator, however, by passing a vector of functions, user can pass a separate predicator to each column.
By default, when type
is all
or any
, the filter
function uses the actual values of each row, however, by passing mapformats = true
, the formatted values will be used instead.
Note that, multithreading is on by default for types all
and any
, thus, filter
exploits all the cores available to Julia
for performing the computations. User can pass threads = false
to disable this feature.
delete
and delete!
The delete
and delete!
functions are similar to the filter
and filter!
functions, respectivley, however, instead of filtering rows, these functions remove the filtered rows.
Examples
The first expression creates a data set, and in the second one we use byrow
to filter all
rows which the values of all columns are equal to 1.
julia> ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
10×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 1 1
2 │ 1 2 2
3 │ 1 3 1
4 │ 1 4 2
5 │ 1 5 1
6 │ 1 6 2
7 │ 1 7 1
8 │ 1 8 2
9 │ 1 9 1
10 │ 1 10 2
julia> byrow(ds, all, :, by = isequal(1))
10-element Vector{Bool}:
1
0
0
0
0
0
0
0
0
0
Note that only the first row is meeting the condition. As another example, let's see the code which filter all rows which the numbers in all columns are odd.
julia> filter(ds, :, by = isodd)
5×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 1 1
2 │ 1 3 1
3 │ 1 5 1
4 │ 1 7 1
5 │ 1 9 1
In the next example we are going to filter all rows which the value of any of column is greater than 5.
julia> byrow(ds, any, :, by = >(5))
10-element Vector{Bool}:
0
0
0
0
0
1
1
1
1
1
The next example shows how a vector of functions can be supplied:
julia> byrow(ds, all, 2:3, by = [>(5), isodd])
10-element Vector{Bool}:
0
0
0
0
0
0
1
0
1
0
In the next example we pass the missings
keyword argument:
julia> ds = Dataset(x = [2, 4, 6, missing], y = [1, 2, 3, 4])
4×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 2 1
2 │ 4 2
3 │ 6 3
4 │ missing 4
julia> filter(ds, [:x, :y], by = iseven, missings = false)
1×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 4 2
julia> filter(ds, [:x, :y], by = iseven, missings = true)
2×2 Dataset
Row │ x y
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 4 2
2 │ missing 4
We can use the combination of modify!/modify
and byrow
to filter observations based on all values in a column, e.g. in the following example we filter all rows which :x2
and :x3
are larger than their means:
julia> modify!(ds, 2:3 .=> (x -> x .> mean(x)) .=> [:_tmp1, :_tmp2])
10×5 Dataset
Row │ x1 x2 x3 _tmp1 _tmp2
│ identity identity identity identity identity
│ Int64? Int64? Int64? Bool? Bool?
─────┼──────────────────────────────────────────────────
1 │ 1 1 1 false false
2 │ 1 2 2 false true
3 │ 1 3 1 false false
4 │ 1 4 2 false true
5 │ 1 5 1 false false
6 │ 1 6 2 true true
7 │ 1 7 1 true false
8 │ 1 8 2 true true
9 │ 1 9 1 true false
10 │ 1 10 2 true true
julia> filter(ds, r"_tm") # translate to ds[byrow(ds, all, r"_tm"), :]
3×5 Dataset
Row │ x1 x2 x3 _tmp1 _tmp2
│ identity identity identity identity identity
│ Int64? Int64? Int64? Bool? Bool?
────┼──────────────────────────────────────────────────
1 │ 1 6 2 true true
2 │ 1 8 2 true true
3 │ 1 10 2 true true
Note that to drop the temporary columns we can use the
select!
function.
In the following example we use different function for type
. By passing type = isequal
we filter observations which are equal for all columns in each row.
julia> ds = Dataset(x1 = [1,2,3,1,2,3], x2 = [1,2,1,2,1,2])
6×2 Dataset
Row │ x1 x2
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 1 2
5 │ 2 1
6 │ 3 2
julia> filter(ds, :, type = isequal)
2×2 Dataset
Row │ x1 x2
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 1
2 │ 2 2
mask
mask
is a function which calls a function (or a vector of functions) on all observations of a set of selected columns. The syntax for mask
is very similar to map
function:
mask(ds, funs, cols, [mapformats = true, missings = false, threads = true])
however, unlike map
, the function doesn't return the whole modified dataset, it returns a boolean data set with the same number of rows as ds
and the same number of columns as the length of cols
, while fun
has been called on each observation. The return value of fun
must be true
, false
, or missing
. The combination of mask
and byrow
can be used to filter observations.
Compared to filter/!
(delete/!
), the mask
function has the following default behaviour:
mask
returns a boolean data set which shows exactly which observation will be selected whenfun
is called on it.- By default, the
mask
function filters observations based on their formatted values. And to change this we should passmapformats = false
. - By default, the
mask
function will treat the missing values asfalse
, however, this behaviour can be modified by using the keyword optionmissings
. This option can be set astrue
,false
(default value), ormissing
.
Examples
julia> ds = Dataset(x1 = repeat(1:2, 5), x2 = 1:10, x3 = repeat([missing, 2], 5))
10×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 1 missing
2 │ 2 2 2
3 │ 1 3 missing
4 │ 2 4 2
5 │ 1 5 missing
6 │ 2 6 2
7 │ 1 7 missing
8 │ 2 8 2
9 │ 1 9 missing
10 │ 2 10 2
julia> setformat!(ds, 2 => isodd)
10×3 Dataset
Row │ x1 x2 x3
│ identity isodd identity
│ Int64? Int64? Int64?
────┼────────────────────────────
1 │ 1 true missing
2 │ 2 false 2
3 │ 1 true missing
4 │ 2 false 2
5 │ 1 true missing
6 │ 2 false 2
7 │ 1 true missing
8 │ 2 false 2
9 │ 1 true missing
10 │ 2 false 2
julia> mask(ds, isequal(1), :) # simple use case
10×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Bool? Bool? Bool?
─────┼──────────────────────────────
1 │ true true false
2 │ false false false
3 │ true true false
4 │ false false false
5 │ true true false
6 │ false false false
7 │ true true false
8 │ false false false
9 │ true true false
10 │ false false false
julia> _tmp = mask(ds, isequal(1), :, mapformats = false) # use the actual values instead of formatted values
10×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Bool? Bool? Bool?
────┼──────────────────────────────
1 │ true true false
2 │ false false false
3 │ true false false
4 │ false false false
5 │ true false false
6 │ false false false
7 │ true false false
8 │ false false false
9 │ true false false
10 │ false false false
julia> filter(_tmp, :, type = any) # OR ds[byrow(_tmp, any, :), :]. This uses the result of previous run
5×3 Dataset
Row │ x1 x2 x3
│ identity isodd identity
│ Int64? Int64? Int64?
─────┼────────────────────────────
1 │ 1 true missing
2 │ 1 true missing
3 │ 1 true missing
4 │ 1 true missing
5 │ 1 true missing
julia> mask(ds, [isodd, ==(2)], 2:3, missings = missing) # using a vector of functions and setting missings option
10×2 Dataset
Row │ x2 x3
│ identity identity
│ Bool? Bool?
─────┼────────────────────
1 │ true missing
2 │ false true
3 │ true missing
4 │ false true
5 │ true missing
6 │ false true
7 │ true missing
8 │ false true
9 │ true missing
10 │ false true
Using contains
and related functions
Filtering a data set based on another data set should be done via contains
, semijoin
, semijoin!
, antijoin
, and antijoin!
functions. These functions are discussed in the section about joining data sets, and here we just provide some examples about how to use them for filtering a data set.
Additionally, these functions can be used in situations when a data set needed to be filter based on a set of values. In these cases, a temporary data set can be formed by given values and then one of the aforementioned functions can be used.
Examples
julia> ds1 = Dataset(x = [1,7,4,5], y = [.1,.2,.3,.4])
4×2 Dataset
Row │ x y
│ identity identity
│ Int64? Float64?
─────┼────────────────────
1 │ 1 0.1
2 │ 7 0.2
3 │ 4 0.3
4 │ 5 0.4
julia> ds2 = Dataset(x = [1,3,5,7,11])
5×1 Dataset
Row │ x
│ identity
│ Int64?
─────┼──────────
1 │ 1
2 │ 3
3 │ 5
4 │ 7
5 │ 11
julia> contains(ds1,ds2, on = :x)
4-element Vector{Bool}:
1
1
0
1
julia> semijoin(ds1,ds2, on = :x)
3×2 Dataset
Row │ x y
│ identity identity
│ Int64? Float64?
─────┼────────────────────
1 │ 1 0.1
2 │ 7 0.2
3 │ 5 0.4
julia> vals = [.05,.01,.1,.4];
julia> _tmp = Dataset(vals = vals)
4×1 Dataset
Row │ vals
│ identity
│ Float64?
─────┼──────────
1 │ 0.05
2 │ 0.01
3 │ 0.1
4 │ 0.4
julia> antijoin!(ds1, _tmp, on = :y=>:vals)
2×2 Dataset
Row │ x y
│ identity identity
│ Int64? Float64?
─────┼────────────────────
1 │ 7 0.2
2 │ 4 0.3
julia> ; Since vals is a vector we can directly use filter
julia> ds1 = Dataset(x = [1,7,4,5], y = [.1,.2,.3,.4]);
julia> filter!(ds1, :y, by = !in(Set(vals)))
2×2 Dataset
Row │ x y
│ identity identity
│ Int64? Float64?
─────┼────────────────────
1 │ 7 0.2
2 │ 4 0.3
Julia broadcasting
Note that, in general,
byrow
,filter
, orfilter!
are preferred methods to filter data sets compared tobroadcasting
For simple use case (e.g. when working on a single column) we can use broadcasting directly. For example if we are interested in rows which the first column is greater than 5 we can directly use (assume the data set is called ds
):
ds[ds[!, 1] .> 1, :]
or use the column names.
Examples
In the following examples we use .
for broadcasting, and its important to include it in your code when you are going to use this option for filtering observations.
julia> ds = Dataset(x1 = repeat(1:2, 5), x2 = 1:10, x3 = repeat([missing, 2], 5))
10×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 1 missing
2 │ 2 2 2
3 │ 1 3 missing
4 │ 2 4 2
5 │ 1 5 missing
6 │ 2 6 2
7 │ 1 7 missing
8 │ 2 8 2
9 │ 1 9 missing
10 │ 2 10 2
julia> ds[ds.x1 .== 2, :]
5×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
────┼──────────────────────────────
1 │ 2 2 2
2 │ 2 4 2
3 │ 2 6 2
4 │ 2 8 2
5 │ 2 10 2
julia> ds[(ds.x1 .== 1) .& (ds.x2 .> 5), :]
2×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
────┼──────────────────────────────
1 │ 1 7 missing
2 │ 1 9 missing
julia> using BenchmarkTools
julia> ds = Dataset(rand(1:1000, 10^6, 10), :auto);
julia> @btime ds[ds.x1 .== 100, :];
1.579 ms (480 allocations: 251.73 KiB)
julia> @btime filter(ds, :x1, by = ==(100));
508.430 μs (568 allocations: 1.21 MiB)
Related functions
There are few other functions in InMemoryDatasets which can be used to filter observations. Those are
completecases
dropmissing
dropmissing!
duplicates
unique
unique!
The completecases
, dropmissing
, dropmissing!
functions use byrow
to find or filter missing observations. The duplicates
, unique
, and unique!
function can be used to filter duplicates rows in a data set.