How InMemoryDatasets treats missing values
Every column supports missing
The Dataset() constructor automatically converts each column of a data set to allow missing when it constructs the data set. All algorithms in InMemoryDatasets are optimised to minimise the overhead of supporting the missing type.
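As a rough illustration of what "allowing missing" means for a column, the following plain-Julia sketch widens a vector's element type to Union{Missing, T}. It uses only Base and is not InMemoryDatasets' own code; the helper name allow_missing_sketch is hypothetical.

```julia
# Hypothetical helper (not part of InMemoryDatasets): widen the element type
# so that the vector can hold `missing` alongside its original values.
allow_missing_sketch(v::AbstractVector{T}) where {T} =
    convert(Vector{Union{Missing, T}}, v)

col = allow_missing_sketch([1, 2, 3])
col_type = eltype(col)     # Union{Missing, Int}
push!(col, missing)        # legal now; on a plain Vector{Int} this would throw
```

After the conversion a missing value can be stored without reallocating the column under a different type, which is presumably why the constructor does this up front.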
Functions which skip missing values
When InMemoryDatasets is loaded into a Julia session, the behaviour of the following functions changes: they remove missing values if an AbstractVector{Union{T, Missing}} is passed as their argument. It is the user's responsibility to handle situations where this is not desired.
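For comparison, this is how such a reduction behaves in plain Julia, in a session where InMemoryDatasets is not loaded: missing propagates unless it is skipped explicitly. The sketch uses only Base. Checking with any(ismissing, x) beforehand is one possible way (an assumption, not a package recommendation) to guard code that must not silently drop missing values.

```julia
# Run in a session WITHOUT InMemoryDatasets loaded (Base behaviour only).
x = [1, 1, missing]

base_total    = sum(x)                 # `missing` propagates in Base
skipped_total = sum(skipmissing(x))    # explicit skipping gives 2

# One possible guard when silently dropping missing values is not acceptable:
has_missing = any(ismissing, x)        # true for this vector
```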
The following list summarises the details of how InMemoryDatasets removes/skips/ignores missing values (for the rest of this section INTEGERS refers to {U/Int8, U/Int16, U/Int32, U/Int64} and FLOATS refers to {Float16, Float32, Float64}):
argmax: For INTEGERS, FLOATS, TimeType, and AbstractString skip missing values. When all values are missing, it returns missing.
argmin: For INTEGERS, FLOATS, TimeType, and AbstractString skip missing values. When all values are missing, it returns missing.
cummax: For INTEGERS, FLOATS, and TimeType ignore missing values, however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
cummax!: For INTEGERS, FLOATS, and TimeType ignore missing values, however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
cummin: For INTEGERS, FLOATS, and TimeType ignore missing values, however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
cummin!: For INTEGERS, FLOATS, and TimeType ignore missing values, however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
cumprod: For INTEGERS and FLOATS ignore missing values, however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
cumprod!: For INTEGERS and FLOATS ignore missing values, however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
cumsum: For INTEGERS and FLOATS ignore missing values, however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
cumsum!: For INTEGERS and FLOATS ignore missing values, however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
extrema: For INTEGERS, FLOATS, and TimeType skip missing values. When all values are missing, it returns (missing, missing).
findmax: For INTEGERS, FLOATS, TimeType, and AbstractString skip missing values. When all values are missing, it returns (missing, missing).
findmin: For INTEGERS, FLOATS, TimeType, and AbstractString skip missing values. When all values are missing, it returns (missing, missing).
maximum: For INTEGERS, FLOATS, TimeType, and AbstractString skip missing values. When all values are missing, it returns missing.
mean: For INTEGERS and FLOATS skip missing values. When all values are missing, it returns missing.
median: For INTEGERS and FLOATS skip missing values. When all values are missing, it returns missing.
median!: For INTEGERS and FLOATS skip missing values. When all values are missing, it returns missing.
minimum: For INTEGERS, FLOATS, TimeType, and AbstractString skip missing values. When all values are missing, it returns missing.
std: For INTEGERS and FLOATS skip missing values. When all values are missing, it returns missing.
sum: For INTEGERS and FLOATS skip missing values. When all values are missing, it returns missing.
var: For INTEGERS and FLOATS skip missing values. When all values are missing, it returns missing.
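The difference between ignoring and skipping in the cumulative functions listed above can be sketched in plain Julia. The helper cumsum_sketch is hypothetical and is not the package's implementation. It assumes a numeric element type, and it assumes that missing values before the first non-missing value stay missing in both modes.

```julia
# Hypothetical helper mimicking the two modes described above (Base only).
function cumsum_sketch(x::AbstractVector; missings::Symbol = :ignore)
    T = nonmissingtype(eltype(x))
    out = Vector{Union{Missing, T}}(missing, length(x))
    total = zero(T)
    seen = false                      # has a non-missing value been met yet?
    for (i, v) in enumerate(x)
        if !ismissing(v)
            total += v
            seen = true
            out[i] = total
        elseif missings == :ignore && seen
            out[i] = total            # carry the running total across the gap
        end                           # :skip (or a leading gap): stays `missing`
    end
    return out
end

x = [1, 1, missing]
ignored = cumsum_sketch(x)                     # [1, 2, 2]
skipped = cumsum_sketch(x; missings = :skip)   # [1, 2, missing]
```

In both modes the running total is unaffected by the gap; the modes differ only in what is written at the position of the missing value.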
julia> x = [1,1,missing]
3-element Vector{Union{Missing, Int64}}:
1
1
missing
julia> sum(x)
2
julia> mean(x)
1.0
julia> maximum(x)
1
julia> minimum(x)
1
julia> findmax(x)
(1, 1)
julia> findmin(x)
(1, 1)
julia> cumsum(x)
3-element Vector{Union{Missing, Int64}}:
1
2
2
julia> cumsum(x, missings = :skip)
3-element Vector{Union{Missing, Int64}}:
1
2
missing
julia> cumprod(x, missings = :skip)
3-element Vector{Union{Missing, Int64}}:
1
1
missing
julia> median(x)
1.0
Some remarks
var and std will return missing when dof = true and an AbstractVector{Union{T, Missing}} of length one is passed as their argument. This is different from the behaviour of these functions defined in the Statistics package.
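The one-element case is special because the corrected sample variance divides by n - 1, which is zero when n = 1. The sketch below uses only the Statistics standard library to show where NaN comes from. The wrapper var_or_missing is hypothetical: it mirrors the missing-returning behaviour described above and is not InMemoryDatasets' implementation.

```julia
using Statistics

# Statistics.var uses the corrected estimator: sum of squared deviations / (n - 1).
# With n == 1 this becomes 0.0 / 0, which is NaN.
single = [1]
manual = sum(abs2, single .- mean(single)) / (length(single) - 1)

# Hypothetical wrapper: return `missing` when fewer than two non-missing
# values are available, otherwise defer to Statistics.var.
function var_or_missing(x)
    vals = collect(skipmissing(x))
    return length(vals) < 2 ? missing : var(vals)
end

one_value  = var_or_missing(Union{Missing, Int}[1])   # missing
two_values = var_or_missing([1, 3, missing])          # 2.0
```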
julia> var(Union{Missing, Int}[1])
missing
julia> std(Union{Missing, Int}[1])
missing
julia> var([1]) # fallback to Statistics.var
NaN
julia> std([1]) # fallback to Statistics.std
NaN
Multithreaded functions
The sum, minimum, and maximum functions also support the threads keyword argument. When it is set to true, they exploit all cores for calculation.
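To give an idea of what a multithreaded reduction that skips missing values can look like, here is a plain-Julia sketch that splits the vector into one chunk per thread. The names chunk_sum and threaded_sum_sketch are hypothetical, and this is not how InMemoryDatasets implements the threads keyword. It only illustrates the general technique.

```julia
# Sum the non-missing values of x at the indices in idx (single task).
function chunk_sum(x, idx, ::Type{T}) where {T}
    s = zero(T)
    for i in idx
        v = x[i]
        ismissing(v) || (s += v)
    end
    return s
end

# Hypothetical sketch of a chunked, multithreaded sum that skips `missing`.
function threaded_sum_sketch(x::AbstractVector)
    all(ismissing, x) && return missing      # mirrors the all-missing rule above
    T = nonmissingtype(eltype(x))
    chunk_len = cld(length(x), Threads.nthreads())
    chunks = Iterators.partition(eachindex(x), chunk_len)
    # One task per chunk; each task reduces its own slice independently.
    tasks = map(chunks) do idx
        Threads.@spawn chunk_sum(x, idx, T)
    end
    return sum(fetch, tasks)
end

total = threaded_sum_sketch([1, 1, missing])   # 2
```

Starting Julia with more than one thread (for example julia -t auto) is needed for the chunks to actually run in parallel; with a single thread the sketch still returns the same result.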
topk, IMD.n, and IMD.nmissing
The following function is also exported by InMemoryDatasets:
topk: Return top (bottom) k values of a vector. It ignores missing values, unless all values are missing, in which case it returns [missing].
and the following functions are not exported but are available via dot notation:
InMemoryDatasets.n or IMD.n: Return number of non-missing elements
InMemoryDatasets.nmissing or IMD.nmissing: Return number of missing elements
julia> x = [13, 1, missing, 10]
4-element Vector{Union{Missing, Int64}}:
13
1
missing
10
julia> topk(x, 2)
2-element Vector{Int64}:
13
10
julia> topk(x, 2, rev = true)
2-element Vector{Int64}:
1
10
julia> IMD.n(x)
3
julia> IMD.nmissing(x)
1
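For readers who want to see what these three functions compute, the results above can be reproduced with Base alone. The helper topk_sketch is hypothetical and simply sorts the non-missing values; the package's own topk is presumably more efficient than a full sort.

```julia
x = [13, 1, missing, 10]

n_nonmissing = count(!ismissing, x)    # 3, matching IMD.n(x) above
n_missing    = count(ismissing, x)     # 1, matching IMD.nmissing(x) above

# Hypothetical helper: sort the non-missing values and keep the first k.
# rev = false returns the largest values first, rev = true the smallest first.
function topk_sketch(x, k; rev = false)
    vals = collect(skipmissing(x))
    isempty(vals) && return [missing]          # all-missing rule described above
    sort!(vals; rev = !rev)
    return vals[1:min(k, length(vals))]
end

top2    = topk_sketch(x, 2)                # [13, 10]
bottom2 = topk_sketch(x, 2; rev = true)    # [1, 10]
```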