How does InMemoryDatasets treat missing values?
Comparing data sets
Comparing two data sets or two DatasetColumns with == falls back to isequal.
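To see why falling back to isequal matters, the following sketch uses only Base Julia (plain vectors stand in for columns; no InMemoryDatasets call is involved) to contrast the two comparison functions when missing is present.

```julia
# Base Julia only; plain vectors are used here as stand-ins for columns.
a = [1, missing, 3]
b = [1, missing, 3]

# `==` follows three-valued logic: comparing `missing` with anything is unknown,
# so the comparison of the two vectors is itself `missing`.
eq_result = (a == b)

# `isequal` treats `missing` as equal to `missing` and always returns a Bool.
iseq_result = isequal(a, b)

println(eq_result)    # missing
println(iseq_result)  # true
```

With isequal semantics, a comparison always yields true or false, which is usually what is wanted when checking whether two data sets hold the same content.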
Every column supports missing
The Dataset() constructor automatically converts each column to allow missing when it constructs a data set. All algorithms in InMemoryDatasets are optimised to minimise the overhead of supporting the missing type.
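As a rough illustration of what "allowing missing" means for a single column, the sketch below widens the element type of a plain vector in Base Julia. It is not how Dataset() is implemented internally; it only shows the effect on the element type.

```julia
# Base Julia only; this mimics the effect on one column, not the package internals.
col = [1, 2, 3]                                      # element type is Int
col_m = convert(Vector{Union{Missing, Int}}, col)    # same values, widened element type

# Assigning `missing` is now valid; on `col` it would throw a MethodError.
col_m[2] = missing

println(eltype(col))    # Int64 on a 64-bit machine
println(eltype(col_m))  # Union{Missing, Int64} on a 64-bit machine
```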
Functions which skip missing values
InMemoryDatasets has a set of functions which remove missing values. The following list summarises how InMemoryDatasets removes/skips/ignores missing values (for the rest of this section, INTEGERS refers to {U/Int8, U/Int16, U/Int32, U/Int64} and FLOATS refers to {Float16, Float32, Float64}):
- IMD.argmax: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
- IMD.argmin: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
- IMD.cummax: For INTEGERS, FLOATS, and TimeType, it ignores missing values; however, passing missings = :skip makes it jump over missing values. When all values are missing, it returns the input.
- IMD.cummax!: For INTEGERS, FLOATS, and TimeType, it ignores missing values; however, passing missings = :skip makes it jump over missing values. When all values are missing, it returns the input.
- IMD.cummin: For INTEGERS, FLOATS, and TimeType, it ignores missing values; however, passing missings = :skip makes it jump over missing values. When all values are missing, it returns the input.
- IMD.cummin!: For INTEGERS, FLOATS, and TimeType, it ignores missing values; however, passing missings = :skip makes it jump over missing values. When all values are missing, it returns the input.
- IMD.cumprod: For INTEGERS and FLOATS, it ignores missing values; however, passing missings = :skip makes it jump over missing values. When all values are missing, it returns the input.
- IMD.cumprod!: For INTEGERS and FLOATS, it ignores missing values; however, passing missings = :skip makes it jump over missing values. When all values are missing, it returns the input.
- IMD.cumsum: For INTEGERS and FLOATS, it ignores missing values; however, passing missings = :skip makes it jump over missing values. When all values are missing, it returns the input.
- IMD.cumsum!: For INTEGERS and FLOATS, it ignores missing values; however, passing missings = :skip makes it jump over missing values. When all values are missing, it returns the input.
- IMD.extrema: For INTEGERS, FLOATS, and TimeType, it skips missing values. When all values are missing, it returns (missing, missing).
- IMD.findmax: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns (missing, missing).
- IMD.findmin: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns (missing, missing).
- IMD.maximum: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
- IMD.mean: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
- IMD.median: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
- IMD.median!: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
- IMD.minimum: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
- IMD.std: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
- IMD.sum: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
- IMD.var: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
julia> x = [1,1,missing]
3-element Vector{Union{Missing, Int64}}:
1
1
missing
julia> IMD.sum(x)
2
julia> IMD.mean(x)
1.0
julia> IMD.maximum(x)
1
julia> IMD.minimum(x)
1
julia> IMD.findmax(x)
(1, 1)
julia> IMD.findmin(x)
(1, 1)
julia> IMD.cumsum(x)
3-element Vector{Union{Missing, Int64}}:
1
2
2
julia> IMD.cumsum(x, missings = :skip)
3-element Vector{Union{Missing, Int64}}:
1
2
missing
julia> IMD.cumprod(x, missings = :skip)
3-element Vector{Union{Missing, Int64}}:
1
1
missing
julia> IMD.median(x)
1.0
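The difference between the default handling and missings = :skip in the cumulative functions can be sketched in plain Julia. The helper cumsum_sketch below is hypothetical and is not the package's implementation; it is written only to reproduce the two IMD.cumsum outputs shown above. Its treatment of leading missing values (they stay missing until the first non-missing value) is an assumption made for this sketch.

```julia
# Hypothetical helper, not part of InMemoryDatasets: a cumulative sum with two
# missing-handling modes modelled on the outputs shown above.
function cumsum_sketch(x::AbstractVector; missings::Symbol = :ignore)
    missings in (:ignore, :skip) ||
        throw(ArgumentError("missings must be :ignore or :skip"))
    T = nonmissingtype(eltype(x))
    out = Vector{Union{Missing, T}}(missing, length(x))
    acc = missing   # running total; assumed to stay missing until the first value
    for (i, v) in enumerate(x)
        if ismissing(v)
            # :ignore carries the running total forward; :skip leaves the slot missing
            out[i] = missings === :ignore ? acc : missing
        else
            acc = ismissing(acc) ? v : acc + v
            out[i] = acc
        end
    end
    return out
end

x = [1, 1, missing]
r_ignore = cumsum_sketch(x)                    # [1, 2, 2]
r_skip = cumsum_sketch(x, missings = :skip)    # [1, 2, missing]
```

In both modes the missing value contributes nothing to the running total; the modes differ only in what is written at the position of the missing value.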
Some remarks
var and std will return missing when dof = true and an AbstractVector of length one is passed as their argument. This is different from the behaviour of these functions defined in the Statistics package.
julia> IMD.var([1])
missing
julia> IMD.std([1])
missing
julia> Statistics.var([1])
NaN
julia> Statistics.std([1])
NaN
Multithreaded functions
The IMD.sum, IMD.minimum, and IMD.maximum functions also support the threads keyword argument. When it is set to true, they exploit all cores for calculation.
Other functions
The following functions are also exported by InMemoryDatasets:
- bfill: backward filling
- bfill!: backward filling in-place
- ffill: forward filling
- ffill!: forward filling in-place
- lag: Create a lag-k of the provided vector
- lag!: Replace its input with its lag-k values
- lead: Create a lead-k of the provided vector
- lead!: Replace its input with its lead-k values
- topk: Return the top (bottom) k values of a vector. It ignores missing values, unless all values are missing, in which case it returns [missing].
- topkperm: Return the indices of the top (bottom) k values of a vector. It ignores missing values, unless all values are missing, in which case it returns [missing].
The following functions are not exported but are available via dot notation:
- InMemoryDatasets.n or IMD.n: Return the number of non-missing elements
- InMemoryDatasets.nmissing or IMD.nmissing: Return the number of missing elements
julia> x = [13, 1, missing, 10]
4-element Vector{Union{Missing, Int64}}:
13
1
missing
10
julia> topk(x, 2)
2-element Vector{Int64}:
13
10
julia> topk(x, 2, rev = true)
2-element Vector{Int64}:
1
10
julia> IMD.n(x)
3
julia> IMD.nmissing(x)
1
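For readers unfamiliar with the terms "forward filling" and "lag-k", the plain-Julia sketch below shows their conventional meaning on the same vector. The helpers ffill_sketch and lag_sketch are hypothetical names introduced here for illustration; they are not the package's ffill and lag, which may differ in keyword options and edge cases.

```julia
# Hypothetical helpers written in Base Julia; they illustrate the conventional
# meaning of forward filling and lagging, not the package's implementation.

# Replace each missing value with the most recent non-missing value before it.
function ffill_sketch(x::AbstractVector)
    out = Vector{Union{Missing, nonmissingtype(eltype(x))}}(missing, length(x))
    last_seen = missing
    for (i, v) in enumerate(x)
        if !ismissing(v)
            last_seen = v
        end
        out[i] = last_seen
    end
    return out
end

# Shift values k positions towards the end; the first k slots become missing.
function lag_sketch(x::AbstractVector, k::Integer = 1)
    k >= 0 || throw(ArgumentError("k must be non-negative"))
    out = Vector{Union{Missing, nonmissingtype(eltype(x))}}(missing, length(x))
    for i in (k + 1):length(x)
        out[i] = x[i - k]
    end
    return out
end

x = [13, 1, missing, 10]
filled = ffill_sketch(x)    # [13, 1, 1, 10]
lagged = lag_sketch(x, 1)   # [missing, 13, 1, missing]
```

Backward filling and lead-k are the mirror images: the former takes the next non-missing value instead of the previous one, and the latter shifts values towards the start of the vector.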