How InMemoryDatasets treats missing values
Every column supports missing
When constructing a data set, the Dataset() constructor automatically converts each column to allow missing values. All algorithms in InMemoryDatasets are optimised to minimise the overhead of supporting the Missing type.
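The conversion described above can be checked with a small sketch. It assumes that ds[:, :x] returns a copy of the column as a plain vector (as in other Julia tabular packages); the column name x is chosen only for illustration.

```julia
using InMemoryDatasets

# Build a data set from a plain Vector{Int}; the input contains no missing values.
ds = Dataset(x = [1, 2, 3])

# Assumption: ds[:, :x] copies the column out as an ordinary vector.
col = ds[:, :x]

# The element type is expected to allow Missing even though none is present.
println(eltype(col))
println(Missing <: eltype(col))
```

If the assumption about indexing holds, the second line printed should be true, because the column was widened when the data set was built.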
Functions which skip missing values
When InMemoryDatasets is loaded into a Julia session, the behaviour of the following functions changes so that they remove missing values if an AbstractVector{Union{T, Missing}} is passed as their argument. It is the user's responsibility to handle situations where this is not desired.
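For comparison, the following sketch shows the stock Julia behaviour that this change replaces. It is meant to be run in a fresh session in which InMemoryDatasets is not loaded, and it uses only functions from Base.

```julia
# Run in a fresh Julia session WITHOUT InMemoryDatasets loaded.
x = [1, 1, missing]

# In plain Julia, sum propagates missing through the reduction.
println(sum(x))                    # missing

# skipmissing drops the missing entries explicitly before reducing.
println(sum(skipmissing(x)))       # 2
println(maximum(skipmissing(x)))   # 1
```

With the package loaded, the first call would be expected to return 2 instead, which is why code that relies on missing propagation needs attention.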
The following list summarises how InMemoryDatasets removes, skips, or ignores missing values (for the rest of this section, INTEGERS refers to {U/Int8, U/Int16, U/Int32, U/Int64} and FLOATS refers to {Float16, Float32, Float64}):
argmax: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
argmin: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
cummax: For INTEGERS, FLOATS, and TimeType, it ignores missing values by default; passing missings = :skip makes it jump over missing values instead. When all values are missing, it returns the input.
cummax!: For INTEGERS, FLOATS, and TimeType, it ignores missing values by default; passing missings = :skip makes it jump over missing values instead. When all values are missing, it returns the input.
cummin: For INTEGERS, FLOATS, and TimeType, it ignores missing values by default; passing missings = :skip makes it jump over missing values instead. When all values are missing, it returns the input.
cummin!: For INTEGERS, FLOATS, and TimeType, it ignores missing values by default; passing missings = :skip makes it jump over missing values instead. When all values are missing, it returns the input.
cumprod: For INTEGERS and FLOATS, it ignores missing values by default; passing missings = :skip makes it jump over missing values instead. When all values are missing, it returns the input.
cumprod!: For INTEGERS and FLOATS, it ignores missing values by default; passing missings = :skip makes it jump over missing values instead. When all values are missing, it returns the input.
cumsum: For INTEGERS and FLOATS, it ignores missing values by default; passing missings = :skip makes it jump over missing values instead. When all values are missing, it returns the input.
cumsum!: For INTEGERS and FLOATS, it ignores missing values by default; passing missings = :skip makes it jump over missing values instead. When all values are missing, it returns the input.
extrema: For INTEGERS, FLOATS, and TimeType, it skips missing values. When all values are missing, it returns (missing, missing).
findmax: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns (missing, missing).
findmin: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns (missing, missing).
maximum: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
mean: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
median: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
median!: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
minimum: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
std: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
sum: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
var: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
julia> x = [1,1,missing]
3-element Vector{Union{Missing, Int64}}:
1
1
missing
julia> sum(x)
2
julia> mean(x)
1.0
julia> maximum(x)
1
julia> minimum(x)
1
julia> findmax(x)
(1, 1)
julia> findmin(x)
(1, 1)
julia> cumsum(x)
3-element Vector{Union{Missing, Int64}}:
1
2
2
julia> cumsum(x, missings = :skip)
3-element Vector{Union{Missing, Int64}}:
1
2
missing
julia> cumprod(x, missings = :skip)
3-element Vector{Union{Missing, Int64}}:
1
1
missing
julia> median(x)
1.0
Some remarks
var and std will return missing when dof = true and an AbstractVector{Union{T, Missing}} of length one is passed as their argument. This is different from the behaviour of these functions defined in the Statistics package.
julia> var(Union{Missing, Int}[1])
missing
julia> std(Union{Missing, Int}[1])
missing
julia> var([1]) # fallback to Statistics.var
NaN
julia> std([1]) # fallback to Statistics.std
NaN
Multithreaded functions
The sum, minimum, and maximum functions also support the threads keyword argument. When it is set to true, they exploit all cores for the calculation.
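A minimal sketch of the threads keyword follows. The vector size and contents are arbitrary choices for illustration, and any speed-up presumably depends on Julia having been started with more than one thread (for example with julia -t auto).

```julia
using InMemoryDatasets

# A large vector whose element type allows missing, with one missing entry.
x = Vector{Union{Missing, Int}}(1:1_000_000)
x[1] = missing

# Default call, then the same reduction with the threads keyword described above.
s_single = sum(x)
s_multi  = sum(x, threads = true)

# Both calls skip the missing entry, so the results should agree.
println(s_single == s_multi)
```

The same pattern should apply to minimum and maximum, since the text names all three functions as supporting the keyword.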
topk, IMD.n, and IMD.nmissing
The following function is also exported by InMemoryDatasets:
topk: Returns the top (or bottom) k values of a vector. It ignores missing values, unless all values are missing, in which case it returns [missing].
The following functions are not exported but are available via dot notation:
InMemoryDatasets.n or IMD.n: Returns the number of non-missing elements.
InMemoryDatasets.nmissing or IMD.nmissing: Returns the number of missing elements.
julia> x = [13, 1, missing, 10]
4-element Vector{Union{Missing, Int64}}:
13
1
missing
10
julia> topk(x, 2)
2-element Vector{Int64}:
13
10
julia> topk(x, 2, rev = true)
2-element Vector{Int64}:
1
10
julia> IMD.n(x)
3
julia> IMD.nmissing(x)
1