How InMemoryDatasets treats missing values
Comparing data sets
Comparing two data sets or two DatasetColumns with == falls back to isequal, so missing values in matching positions compare as equal.
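As a reminder of what falling back to isequal implies, the following plain-Julia sketch contrasts the two comparisons on ordinary vectors. The vectors a and b are illustrative only, and no InMemoryDatasets call is made here.

```julia
# Two ordinary vectors that contain a missing value in the same position
a = [1, missing, 3]
b = [1, missing, 3]

# Base `==` uses three-valued logic, so comparing through a missing yields `missing`
eq_result = (a == b)

# `isequal` treats `missing` as equal to `missing` and always returns a Bool
isequal_result = isequal(a, b)

println(ismissing(eq_result))
println(isequal_result)
```

Going by the statement above, == on two data sets holding these columns would be expected to behave like the isequal line rather than the Base == line.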
Every column supports missing
The Dataset() constructor automatically converts each column to allow missing when it constructs a data set. All algorithms in InMemoryDatasets are optimised to minimise the overhead of supporting the missing type.
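A minimal way to observe this conversion is sketched below. It assumes that ds[:, :x] returns a copy of the column as a plain vector; the column name x is illustrative.

```julia
using InMemoryDatasets

# The input vector has element type Int64 and cannot hold `missing` on its own
ds = Dataset(x = [1, 2, 3])

# Assumption: `ds[:, :x]` returns a copy of the column as a plain vector
col = ds[:, :x]

# After construction the column's element type is expected to accept `missing`
println(Missing <: eltype(col))
```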
Functions which skip missing values
InMemoryDatasets has a set of functions which remove missing values. The following list summarises how InMemoryDatasets removes/skips/ignores missing values (for the rest of this section, INTEGERS refers to {Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64} and FLOATS refers to {Float16, Float32, Float64}):
IMD.argmax: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
IMD.argmin: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
IMD.cummax: For INTEGERS, FLOATS, and TimeType, it ignores missing values; however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
IMD.cummax!: For INTEGERS, FLOATS, and TimeType, it ignores missing values; however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
IMD.cummin: For INTEGERS, FLOATS, and TimeType, it ignores missing values; however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
IMD.cummin!: For INTEGERS, FLOATS, and TimeType, it ignores missing values; however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
IMD.cumprod: For INTEGERS and FLOATS, it ignores missing values; however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
IMD.cumprod!: For INTEGERS and FLOATS, it ignores missing values; however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
IMD.cumsum: For INTEGERS and FLOATS, it ignores missing values; however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
IMD.cumsum!: For INTEGERS and FLOATS, it ignores missing values; however, by passing missings = :skip it jumps over missing values. When all values are missing, it returns the input.
IMD.extrema: For INTEGERS, FLOATS, and TimeType, it skips missing values. When all values are missing, it returns (missing, missing).
IMD.findmax: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns (missing, missing).
IMD.findmin: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns (missing, missing).
IMD.maximum: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
IMD.mean: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
IMD.median: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
IMD.median!: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
IMD.minimum: For INTEGERS, FLOATS, TimeType, and AbstractString, it skips missing values. When all values are missing, it returns missing.
IMD.std: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
IMD.sum: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
IMD.var: For INTEGERS and FLOATS, it skips missing values. When all values are missing, it returns missing.
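The difference between ignoring missing values (the default) and jumping over them (missings = :skip) in the cumulative functions can be sketched in plain Julia. The two helpers below are hypothetical and written only to illustrate the semantics; they are not the library's implementation, and the treatment of leading missing values in the first helper is an assumption.

```julia
# Hypothetical helper: running sum in which a missing value contributes nothing,
# so the previous running total is carried into that position ("ignore").
# Assumption: positions before the first non-missing value stay missing.
function cumsum_ignore_sketch(x)
    out = Vector{Union{Missing, Int}}(undef, length(x))
    total = 0
    seen = false                      # true once a non-missing value was found
    for (i, v) in enumerate(x)
        if !ismissing(v)
            total += v
            seen = true
        end
        out[i] = seen ? total : missing
    end
    return out
end

# Hypothetical helper: running sum that leaves missing positions as missing
# and continues accumulating afterwards ("skip").
function cumsum_skip_sketch(x)
    out = Vector{Union{Missing, Int}}(undef, length(x))
    total = 0
    for (i, v) in enumerate(x)
        if ismissing(v)
            out[i] = missing
        else
            total += v
            out[i] = total
        end
    end
    return out
end

x = [1, 1, missing]
println(cumsum_ignore_sketch(x))
println(cumsum_skip_sketch(x))
```

For x = [1, 1, missing] these helpers give the same values as the IMD.cumsum(x) and IMD.cumsum(x, missings = :skip) calls in the example that follows.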
julia> x = [1,1,missing]
3-element Vector{Union{Missing, Int64}}:
1
1
missing
julia> IMD.sum(x)
2
julia> IMD.mean(x)
1.0
julia> IMD.maximum(x)
1
julia> IMD.minimum(x)
1
julia> IMD.findmax(x)
(1, 1)
julia> IMD.findmin(x)
(1, 1)
julia> IMD.cumsum(x)
3-element Vector{Union{Missing, Int64}}:
1
2
2
julia> IMD.cumsum(x, missings = :skip)
3-element Vector{Union{Missing, Int64}}:
1
2
missing
julia> IMD.cumprod(x, missings = :skip)
3-element Vector{Union{Missing, Int64}}:
1
1
missing
julia> IMD.median(x)
1.0
Some remarks
var and std will return missing when dof = true and an AbstractVector of length one is passed as their argument. This is different from the behaviour of these functions defined in the Statistics package.
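One plausible reading of this difference, offered as an assumption rather than a description of the library's code, is that the degrees-of-freedom correction divides by n - 1, which is zero for a single observation. The hypothetical helper below sketches that reading in plain Julia.

```julia
using Statistics

# Hypothetical helper: sample variance with an n - 1 denominator that
# returns `missing` instead of NaN when fewer than two values are given
function var_or_missing_sketch(x)
    n = length(x)
    n <= 1 && return missing          # denominator n - 1 would be zero
    m = sum(x) / n
    return sum(abs2, x .- m) / (n - 1)
end

println(var_or_missing_sketch([1]))       # single observation
println(var_or_missing_sketch([1, 2, 3])) # regular case
println(Statistics.var([1]))              # Statistics computes 0/0 here
```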
julia> IMD.var([1])
missing
julia> IMD.std([1])
missing
julia> Statistics.var([1])
NaN
julia> Statistics.std([1])
NaN
Multithreaded functions
The IMD.sum, IMD.minimum, and IMD.maximum functions also support the threads keyword argument. When it is set to true, they exploit all cores for calculation.
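A short usage sketch of the threads keyword follows. The vector size is arbitrary, and the session needs to be started with more than one thread (for example with julia --threads=auto) for the keyword to have any effect.

```julia
using InMemoryDatasets

# A large vector that allows missing, with one missing entry
x = Union{Missing, Int}[i for i in 1:1_000_000]
x[1] = missing

# The single-threaded and multithreaded calls are expected to agree
s_single = IMD.sum(x)
s_threads = IMD.sum(x, threads = true)

println(s_single == s_threads)
println(Threads.nthreads())   # number of threads available to this session
```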
Other functions
The following functions are also exported by InMemoryDatasets:
bfill: backward filling
bfill!: backward filling in-place
ffill: forward filling
ffill!: forward filling in-place
lag: Create a lag-k of the provided vector
lag!: Replace its input with its lag-k values
lead: Create a lead-k of the provided vector
lead!: Replace its input with its lead-k values
topk: Return the top (bottom) k values of a vector. It ignores missing values, unless all values are missing, in which case it returns [missing].
topkperm: Return the indices of the top (bottom) k values of a vector. It ignores missing values, unless all values are missing, in which case it returns [missing].
and the following functions are not exported but are available via dot notation:
InMemoryDatasets.n or IMD.n: Return the number of non-missing elements
InMemoryDatasets.nmissing or IMD.nmissing: Return the number of missing elements
julia> x = [13, 1, missing, 10]
4-element Vector{Union{Missing, Int64}}:
13
1
missing
10
julia> topk(x, 2)
2-element Vector{Int64}:
13
10
julia> topk(x, 2, rev = true)
2-element Vector{Int64}:
1
10
julia> IMD.n(x)
3
julia> IMD.nmissing(x)
1
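The filling and shifting functions listed above can be tried in the same way. The sketch below assumes the simplest call forms (a vector, plus a shift of 1 for lag and lead); the variable names are illustrative and no output is shown.

```julia
using InMemoryDatasets

x = [1, missing, 3]

filled_forward = ffill(x)    # fill a missing with the previous non-missing value
filled_backward = bfill(x)   # fill a missing with the next non-missing value
lagged = lag(x, 1)           # shift values one position later
led = lead(x, 1)             # shift values one position earlier

println(filled_forward)
println(filled_backward)
println(lagged)
println(led)
```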