Formats
Introduction
A format is a named function that is assigned to a column (variable). The format of a column is called on the individual values of the column before some operations (like show, sort, ...) are performed on the data set. Each column (variable) in a data set has a format property. The initial format of any column is identity; however, setformat! and removeformat! can be used to modify the formats of columns in a data set. By default, the format of a column is shown in the header when a data set is displayed.
The format of a column doesn't change the actual values of the column; the values stay untouched when formats are added or removed.
The processing of format is lazy, i.e. InMemoryDatasets doesn't process a format unless an operation needs to access the formatted values. This also means that modifying the format of a column is instant. However, be aware that modifying a column's format changes the modified meta information (i.e. the last time that the data set has been modified) of the data set.
Note that the processing of formats is usually done in parallel; thus, it is not safe to use a function that is not parallel safe as a format.
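As a rough illustration of what "parallel safe" means here, the sketch below contrasts two hypothetical format functions. The names SEEN, unsafe_fmt, and safe_fmt are invented for this example and are not part of InMemoryDatasets. The first function mutates shared state on every call, which can race when calls run on several threads; the second depends only on its input.

```julia
# Hypothetical format functions, for illustration only.

# Not parallel safe: every call pushes to a shared global vector without a lock,
# so concurrent calls from several threads can race.
const SEEN = Int[]
unsafe_fmt(x) = (push!(SEEN, x); iseven(x))

# Parallel safe: the result depends only on the input value.
safe_fmt(x) = iseven(x)

# Applying the safe version element by element, as a format would be applied.
safe_fmt.(1:5)
```

Functions of the second kind, with no side effects, are the ones that are safe to assign as formats.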
In this section, we discuss the overall aspects of format and we postpone the practical use case of format to later sections when we introduce operations which access the formatted values.
Examples
In this example, we create a simple data set and assign the iseven function as the format for :x1 by calling setformat!(ds, 1 => iseven). Column names can also be used to assign a format, i.e. the same call can be written as setformat!(ds, :x1 => iseven). After calling setformat!, the format of the column is set, and from this point on any operation that supports format will use the formatted values. One such operation is show: in the following example, the printed data set shows the formatted values.
julia> ds = Dataset(x1 = 1:5, x2 = [1,2,1,2,1])
5×2 Dataset
Row │ x1 x2
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 5 1
julia> setformat!(ds, 1 => iseven)
5×2 Dataset
Row │ x1 x2
│ iseven identity
│ Int64? Int64?
─────┼──────────────────
1 │ false 1
2 │ true 2
3 │ false 1
4 │ true 2
5 │ false 1
julia> ds[1,1] # note that the actual value is not changed
1
Manipulating formats
Two functions are handy for manipulating columns' formats: setformat! sets the format of columns and removeformat! removes it. The syntax of these functions is
setformat!(ds, arg...)
removeformat!(ds, cols...)
For setformat!, each arg must be of the form cols => fmt, where fmt is the named function and cols is either column name(s), column index(es), or a regular expression. Thus, expressions like setformat!(ds, 1:10=>sqrt) and setformat!(ds, r"x"=>iseven, :y=>sqrt) are valid in InMemoryDatasets. When cols refers to more than one column, fmt is assigned to all of those columns.
For removeformat!, each cols argument can be any column selector, such as column name(s), column index(es), or a regular expression.
Besides these two functions, there is the getformat function for querying the format of a column. The syntax of getformat is
getformat(ds, col)
where col is a single column identifier, i.e. column index or column's name.
Examples
In the following example, we assign user-defined functions as the formats for the first and the last columns and use the month function (predefined in Julia's Dates module) as the format for the column :date. Note that the actual values of the columns haven't been modified; they are only shown with the formatted values. As you may observe, the formatted values make it easy to scan the sales of each store in different months.
julia> sale = Dataset(store = ["store1", "store1", "store2",
"store2", "store3", "store3", "store3"],
date = [Date("2020-05-01"), Date("2020-06-01"),
Date("2020-05-01"), Date("2020-06-01"),
Date("2020-05-01"), Date("2020-06-01"), Date("2020-07-01")],
sale = [10000, 10100, 20020, 21000, 20300, 20400, 5000])
7×3 Dataset
Row │ store date sale
│ identity identity identity
│ String? Date? Int64?
─────┼──────────────────────────────────
1 │ store1 2020-05-01 10000
2 │ store1 2020-06-01 10100
3 │ store2 2020-05-01 20020
4 │ store2 2020-06-01 21000
5 │ store3 2020-05-01 20300
6 │ store3 2020-06-01 20400
7 │ store3 2020-07-01 5000
julia> storeid(x) = parse(Int, replace(x, "store"=>""))
storeid (generic function with 1 method)
julia> function SALE(x)
if x < 10000
"low"
elseif x < 20000
"average"
elseif x < 21000
"high"
elseif x >= 21000
"excellent"
else
missing
end
end
SALE (generic function with 1 method)
julia> setformat!(sale, 1 => storeid, :date => month, :sale => SALE)
7×3 Dataset
Row │ store date sale
│ storeid month SALE
│ String? Date? Int64?
─────┼──────────────────────────────
1 │ 1 5 average
2 │ 1 6 average
3 │ 2 5 high
4 │ 2 6 excellent
5 │ 3 5 high
6 │ 3 6 high
7 │ 3 7 low
julia> getformat(sale, "date")
month (generic function with 3 methods)
When the formatted values are not needed for some columns, a call to removeformat! can remove them:
julia> removeformat!(sale, [1,2])
7×3 Dataset
Row │ store date sale
│ identity identity SALE
│ String? Date? Int64?
─────┼───────────────────────────────────
1 │ store1 2020-05-01 average
2 │ store1 2020-06-01 average
3 │ store2 2020-05-01 high
4 │ store2 2020-06-01 excellent
5 │ store3 2020-05-01 high
6 │ store3 2020-06-01 high
7 │ store3 2020-07-01 low
Modifying a data set
The following rules govern how a column's format is automatically changed when a data set is modified:
- As a general rule, the format of a column is preserved during different operations. For example, adding a column to or removing a column from a data set doesn't change the format of the original/remaining columns.
- The format of a column doesn't change if only a few observations are updated, modified, added, or deleted; however, if a column goes through a significant change (e.g. all values change, or the column is replaced), its format is automatically removed.
- The format of a column is preserved during some operations where a new data set is created. For example, the combine function preserves the format of grouping variables. This feature will be discussed in more detail when those operations are introduced in later sections.
Using Dictionary
One way to recode the values of a data set is to use a format that picks the formatted values from a dictionary. Since a format cannot be given any positional argument other than the actual value of an observation, the dictionary that defines the recoded values must be supplied either as an optional positional argument or as a keyword argument, in both cases with a default value that refers to the dictionary defined for this purpose. This argument should be type annotated to avoid unnecessary allocations.
Example
julia> ds = Dataset(rand(1:2, 10, 3), :auto)
10×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 2 1
2 │ 1 2 2
3 │ 1 2 2
4 │ 2 1 1
5 │ 1 1 2
6 │ 1 2 2
7 │ 2 2 1
8 │ 2 1 2
9 │ 2 1 1
10 │ 1 1 1
julia> dict = Dict(1=>"yes", 2=>"no")
Dict{Int64, String} with 2 entries:
2 => "no"
1 => "yes"
julia> fmt1(x, dict::Dict{Int, String} = dict) = get(dict, x, missing)
fmt1 (generic function with 2 methods)
julia> setformat!(ds, 1:3 => fmt1)
10×3 Dataset
Row │ x1 x2 x3
│ fmt1 fmt1 fmt1
│ Int64? Int64? Int64?
─────┼────────────────────────
1 │ yes no yes
2 │ yes no no
3 │ yes no no
4 │ no yes yes
5 │ yes yes no
6 │ yes no no
7 │ no no yes
8 │ no yes no
9 │ no yes yes
10 │ yes yes yes
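The example above passes the dictionary as an optional positional argument. The keyword-argument variant mentioned in the text might look like the following sketch. The names RECODE and fmt2 are hypothetical, and declaring the dictionary as const is a choice made for this sketch rather than something the package requires.

```julia
# Recoding table; declared `const` here so that its type is fixed.
const RECODE = Dict(1 => "yes", 2 => "no")

# Hypothetical format: the dictionary enters through a type-annotated keyword
# argument with a default, so the function can still be called with a single value.
fmt2(x; dict::Dict{Int, String} = RECODE) = get(dict, x, missing)

fmt2(1)   # "yes"
fmt2(3)   # missing, because 3 is not a key of RECODE
```

Because fmt2 can be called with a single value, it could be assigned in the same way as fmt1, e.g. setformat!(ds, 1:3 => fmt2).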
format validation
InMemoryDatasets doesn't validate the supplied format until it needs the formatted values for an operation; at that point, if the supplied format is not valid, InMemoryDatasets will throw errors. It is also important to note that InMemoryDatasets is not aware of users changing the definition of a format, so changing the definition of a function that is used as a format during a workflow may have side effects. For example, if a data set has been grouped by groupby! with the mapformats = true option, changing the definition of the formats destroys the sorting order of the data set, but InMemoryDatasets is unaware of this, so it is the user's responsibility to remove the invalid formats in these situations.
In the following examples we demonstrate some scenarios which end up with an invalid format, and provide some remedies to fix the issues. Nevertheless, note that supplying an invalid format will not damage a data set and a simple call to removeformat! can be helpful to recover the original data set.
Examples
First we create a data set and define a format.
julia> ds = Dataset(x1 = [-1, 0, 1], x2 = [1.1, missing, 2.2], x3 = [1,2,3])
3×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Float64? Int64?
─────┼───────────────────────────────
1 │ -1 1.1 1
2 │ 0 missing 2
3 │ 1 2.2 3
julia> custom_format(x) = x[2]
custom_format (generic function with 1 method)
- The function supplied as format is not defined for some values: In this example, we use sqrt as :x1's format, but :x1 contains negative values and sqrt is not defined for negative real numbers. Running the following expression throws errors because, after setting the format, InMemoryDatasets tries to display the data set but cannot.
julia> setformat!(ds, 1 => sqrt)
Error showing value of type Dataset:
ERROR: DomainError with -1.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
.
.
.
This issue can be solved manually by defining a user-defined format:
julia> sqrt_fmt(x) = isless(x, 0) ? "invalid" : sqrt(x)
sqrt_fmt (generic function with 1 method)
julia> setformat!(ds, 1 => sqrt_fmt)
3×3 Dataset
Row │ x1 x2 x3
│ sqrt_fmt identity identity
│ Int64? Float64? Int64?
─────┼────────────────────────────────
1 │ invalid 1.1 1
2 │ 0.0 missing 2
3 │ 1.0 2.2 3
- Ignoring missing values: In this example, we use ROUND(x) = round(Int, x) as :x2's format; however, round(Int, x) doesn't know how to deal with missing values, so, as in the example above, InMemoryDatasets will throw errors.
julia> ROUND(x) = round(Int, x)
ROUND (generic function with 1 method)
julia> setformat!(ds, 2 => ROUND)
Error showing value of type Dataset:
ERROR: MissingException: cannot convert a missing value to type Int64: use Union{Int64, Missing} instead
Stacktrace:
.
.
.
To solve this issue, we can redefine ROUND as
ROUND(x) = ismissing(x) ? missing : round(Int, x)
or
ROUND(x) = round(Union{Int, Missing}, x)
and everything should be fine. Note that after updating the definition of ROUND, the formatted values of :x2 are fixed automatically:
julia> ROUND(x) = ismissing(x) ? missing : round(Int, x)
ROUND (generic function with 1 method)
julia> ds
3×3 Dataset
Row │ x1 x2 x3
│ sqrt_fmt ROUND identity
│ Int64? Float64? Int64?
─────┼───────────────────────────────
1 │ invalid 1 1
2 │ 0.0 missing 2
3 │ 1.0 2 3
- The function defined as format assumes the input argument is a vector: In this example, custom_format (defined earlier) is used for the third column. custom_format is defined in such a way that it assumes the input argument is a vector, but InMemoryDatasets applies format to each value.
julia> setformat!(ds, 3=>custom_format)
Error showing value of type Dataset:
ERROR: BoundsError
Stacktrace:
.
.
.
To fix the issue, we should redefine custom_format or simply remove the column's format:
julia> removeformat!(ds, 3)
3×3 Dataset
Row │ x1 x2 x3
│ sqrt_fmt ROUND identity
│ Int64? Float64? Int64?
─────┼───────────────────────────────
1 │ invalid 1 1
2 │ 0.0 missing 2
3 │ 1.0 2 3
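Since a format is applied to each value separately, one way to catch the three problems above before calling setformat! is to try the function on the column's values one at a time. The helper below is only a sketch in plain Julia. first_failure is a hypothetical name and not part of InMemoryDatasets.

```julia
# Hypothetical helper: apply `fmt` to each value separately and return the index
# of the first value on which it throws, or `nothing` if every call succeeds.
function first_failure(fmt, values)
    for (i, v) in pairs(values)
        try
            fmt(v)
        catch
            return i
        end
    end
    return nothing
end

first_failure(sqrt, [-1, 0, 1])                         # 1: sqrt(-1) throws a DomainError
first_failure(x -> round(Int, x), [1.1, missing, 2.2])  # 2: round(Int, missing) throws
first_failure(x -> x[2], [1, 2, 3])                     # 1: indexing a scalar at 2 throws
first_failure(iseven, [1, 2, 3])                        # nothing: valid for every value
```

A result of nothing only shows that the function runs on the values that were tried. It says nothing about whether the function is parallel safe.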