First steps with Datasets

Setting up the environment

To install the InMemoryDatasets package, run the following commands inside a Julia session:

julia> using Pkg
julia> Pkg.add("InMemoryDatasets")

Throughout the rest of the tutorial we assume that you have installed the InMemoryDatasets package and have already run using InMemoryDatasets, which loads the package:

julia> using InMemoryDatasets

Creating a data set

To create a data set, use Dataset(). For example:

julia> ds = Dataset(var1 = [1, 2, 3],
                var2 = [1.2, 0.5, 3.3],
                var3 = ["C1", "C2", "C3"])
3×3 Dataset
 Row │ var1      var2      var3
     │ identity  identity  identity
     │ Int64?    Float64?  String?
─────┼────────────────────────────────
   1 │        1       1.2  C1
   2 │        2       0.5  C2
   3 │        3       3.3  C3

The first line of the output provides general information about the data set. A data set is shown as a table in Julia, where each column represents a variable in the data set. The header section of the table shows three pieces of information for each column (variable): the column's name, the column's format, and the column's data type. The format of a column controls how the values of the column are shown or interpreted when working with the data set.

The following example shows how to create a data set by providing a range of values.

julia> Dataset(A = 1:3, B = 5:7, fixed = 1)
3×3 Dataset
 Row │ A         B         fixed
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        1         5         1
   2 │        2         6         1
   3 │        3         7         1

Observe that a scalar passed for a column, like 1 for the column :fixed, is automatically broadcast to fill all rows of the created Dataset.

Missing values in Julia are represented by missing, and such a value can also be an observation in a column, e.g.

julia> Dataset(a = [1.1, -10.0, missing], b = 1:3)
3×2 Dataset
 Row │ a          b
     │ identity   identity
     │ Float64?   Int64?
─────┼─────────────────────
   1 │       1.1         1
   2 │     -10.0         2
   3 │ missing           3

Sometimes one needs to create a data set whose column names are not valid Julia identifiers. In such a case, the following form, where column names are passed as strings and = is replaced by =>, is handy:

julia> Dataset("customer age" => [15, 20, 25],
                 "first name" => ["Ben", "Steve", "Jule"])
3×2 Dataset
 Row │ customer age  first name
     │ identity      identity
     │ Int64?        String?
─────┼───────────────────────────
   1 │           15  Ben
   2 │           20  Steve
   3 │           25  Jule

It is also possible to construct a data set from the values of a matrix or a vector of vectors, e.g.

julia> Dataset([1 0; 2 0], :auto)
2×2 Dataset
 Row │ x1        x2
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

julia> Dataset([[1, 2], [0, 0]], :auto)
2×2 Dataset
 Row │ x1        x2
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

Note that the column names are generated automatically when :auto is set as the second argument.

Alternatively, you can pass a vector of column names as the second argument to Dataset:

julia> Dataset([1 0; 2 0], [:col1, :col2])
2×2 Dataset
 Row │ col1      col2
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

Basic utility functions

Getting meta information about a data set

To get meta information about a data set, use the content function.

julia> ds = Dataset(g = [1, 1, 1, 2, 2],
                   x1_int = [0, 0, 1, missing, 2],
                   x2_int = [3, 2, 1, 3, -2],
                   x1_float = [1.2, missing, -1.0, 2.3, 10],
                   x2_float = [missing, missing, 3.0, missing, missing],
                   x3_float = [missing, missing, -1.4, 3.0, -100.0])
5×6 Dataset
 Row │ g         x1_int    x2_int    x1_float   x2_float   x3_float
     │ identity  identity  identity  identity   identity   identity
     │ Int64?    Int64?    Int64?    Float64?   Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────
   1 │        1         0         3        1.2  missing    missing
   2 │        1         0         2  missing    missing    missing
   3 │        1         1         1       -1.0        3.0       -1.4
   4 │        2   missing         3        2.3  missing          3.0
   5 │        2         2        -2       10.0  missing       -100.0

julia> content(ds)
5×6 Dataset
   Created: 2021-08-04T13:11:53.743
  Modified: 2021-08-04T13:11:53.743
      Info:
-----------------------------------
Columns information
┌─────┬──────────┬──────────┬─────────┐
│ Row │ col      │ format   │ eltype  │
├─────┼──────────┼──────────┼─────────┤
│   1 │ g        │ identity │ Int64   │
│   2 │ x1_int   │ identity │ Int64   │
│   3 │ x2_int   │ identity │ Int64   │
│   4 │ x1_float │ identity │ Float64 │
│   5 │ x2_float │ identity │ Float64 │
│   6 │ x3_float │ identity │ Float64 │
└─────┴──────────┴──────────┴─────────┘

content shows that the data set has 5 rows and 6 columns. It also shows when the data set was created and when it was last modified. The content function also reports the data type and format of each variable.

The Info field is a string field which can contain any information related to the data set. To set an Info for a data set, use setinfo!, e.g.

julia> setinfo!(ds, "An example from the manual")
"An example from the manual"

This information will be attached to the data set ds. Use getinfo to retrieve this information.
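
As a minimal sketch of the round trip, assuming getinfo takes the data set as its only argument and returns the stored string:

```julia
using InMemoryDatasets

ds = Dataset(x = 1:3)

# Attach a free-text description to the data set.
setinfo!(ds, "An example from the manual")

# Read the description back (assumed to return the stored string).
info = getinfo(ds)
```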

Setting and removing formats

To set a specific format for a column of a data set, use the setformat! function, e.g.

julia> ds = Dataset(x = 1:10,
                    y = repeat(1:5, inner = 2),
                    z = repeat(1:2, 5))
10×3 Dataset
 Row │ x         y         z
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        1         1         1
   2 │        2         1         2
   3 │        3         2         1
   4 │        4         2         2
   5 │        5         3         1
   6 │        6         3         2
   7 │        7         4         1
   8 │        8         4         2
   9 │        9         5         1
  10 │       10         5         2

julia> setformat!(ds, :y => sqrt)
10×3 Dataset
 Row │ x         y        z
     │ identity  sqrt     identity
     │ Int64?    Int64?   Int64?
─────┼─────────────────────────────
   1 │        1  1.0             1
   2 │        2  1.0             2
   3 │        3  1.41421         1
   4 │        4  1.41421         2
   5 │        5  1.73205         1
   6 │        6  1.73205         2
   7 │        7  2.0             1
   8 │        8  2.0             2
   9 │        9  2.23607         1
  10 │       10  2.23607         2

The first argument of setformat! is the data set to be modified, and the second argument is a pair of the form column name => named function. In the above example, we assign the sqrt function as the format for the column :y.

Note that setformat! doesn't check the validity of a format, so if an invalid format is assigned to a column, for instance assigning sqrt to a column which contains negative values, some functionality of the data set (like showing the data set) will stop working. In these cases, simply remove the invalid format by using removeformat!.

Let's define a function as a new format for column :z in the above example:

julia> function gender(x)
          x == 1 ? "Male" : x == 2 ? "Female" : missing
       end

The format gender accepts one value and if the value is equal to 1, gender maps it to "Male", if the value is equal to 2, it maps it to "Female", and for any other values it maps them to missing.
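
Because a format is an ordinary Julia function, it can be tried on individual values before it is attached to a column. The following sketch repeats the definition above and applies it to three sample inputs chosen for illustration:

```julia
# Same definition as above: map 1 and 2 to labels, anything else to missing.
function gender(x)
    x == 1 ? "Male" : x == 2 ? "Female" : missing
end

# Apply the format to a few sample values, including one outside {1, 2}.
labels = [gender(v) for v in (1, 2, 3)]
```

Note that, as written, gender(missing) throws a TypeError, because missing == 1 evaluates to missing rather than to a Bool. Comparing with isequal instead of == is one way to avoid this if the column may contain missing values.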

julia> setformat!(ds, :z => gender)
10×3 Dataset
 Row │ x         y        z
     │ identity  sqrt     gender
     │ Int64?    Int64?   Int64?
─────┼───────────────────────────
   1 │        1  1.0        Male
   2 │        2  1.0      Female
   3 │        3  1.41421    Male
   4 │        4  1.41421  Female
   5 │        5  1.73205    Male
   6 │        6  1.73205  Female
   7 │        7  2.0        Male
   8 │        8  2.0      Female
   9 │        9  2.23607    Male
  10 │       10  2.23607  Female

Use the removeformat! function to remove a column's format.

julia> removeformat!(ds, :y)
10×3 Dataset
 Row │ x         y         z
     │ identity  identity  gender
     │ Int64?    Int64?    Int64?
─────┼────────────────────────────
   1 │        1         1    Male
   2 │        2         1  Female
   3 │        3         2    Male
   4 │        4         2  Female
   5 │        5         3    Male
   6 │        6         3  Female
   7 │        7         4    Male
   8 │        8         4  Female
   9 │        9         5    Male
  10 │       10         5  Female

Similar to setformat!, the first argument is the data set and the second argument is the name of the column (or columns) whose format should be removed. Note that assigning or removing a format doesn't change the actual values of the column.
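
A small sketch of this point, using a hypothetical one-column data set: the values extracted from the column are the same before and after the format is removed.

```julia
using InMemoryDatasets

ds = Dataset(y = [1, 4, 9])
setformat!(ds, :y => sqrt)

# The stored values are still the original integers; only the display is formatted.
stored = ds[:, :y]

removeformat!(ds, :y)
after_removal = ds[:, :y]
```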

By default, formatted values of a column will be used when operations like displaying, sorting, grouping, or joining are called.
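
As an illustration, the sketch below sorts a hypothetical column by a format rather than by its stored values. It assumes that sort(ds, col) uses formatted values by default, as the sentence above states; the data and the choice of iseven as a format are made up for the example.

```julia
using InMemoryDatasets

ds = Dataset(x = [1, 2, 3, 4])
setformat!(ds, :x => iseven)   # formatted values are false, true, false, true

# sort is assumed to compare formatted values, so rows with odd x
# (formatted as false) are expected to come before rows with even x.
sorted = sort(ds, :x)
stored = sorted[:, :x]
```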

Accessing individual columns or observations

Users should avoid using getindex and setindex! to modify data sets; we discuss them only briefly here so that users understand the effect of these operations on a data set. InMemoryDatasets provides efficient APIs for modifying observations, e.g. see modify!, modify, map!, map, ...

ds[:, col], ds[i, col] can be used to access a specific column or specific observation of a specific column of ds, respectively. For example,

julia> ds = Dataset(x = [4,6,3], y = [1,2,43]);
julia> ds[:, :x]
3-element Vector{Union{Missing, Int64}}:
 4
 6
 3

julia> ds[3, :y]
43

Note that ds[:, col] extracts (copies) a column of a data set as a vector. Thus, this vector can be used as a normal vector in Julia.
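
The following sketch shows the consequence of the copy: changing the extracted vector leaves the data set untouched.

```julia
using InMemoryDatasets

ds = Dataset(x = [4, 6, 3], y = [1, 2, 43])

v = ds[:, :x]   # a copy of column :x
v[1] = 100      # changes only the copy

unchanged = ds[1, :x]   # the data set still holds the original value
```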

Also note that assigning a new value to ds[3, :y] modifies the data set, i.e.

julia> ds[3, :y] = 3
3

julia> ds
3×2 Dataset
 Row │ x         y
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        4         1
   2 │        6         2
   3 │        3         3

julia> content(ds)
3×2 Dataset
   Created: 2021-08-04T13:18:51.185
  Modified: 2021-08-04T13:24:33.086
      Info:
-----------------------------------
Columns information
┌─────┬─────┬──────────┬────────┐
│ Row │ col │ format   │ eltype │
├─────┼─────┼──────────┼────────┤
│   1 │ x   │ identity │ Int64  │
│   2 │ y   │ identity │ Int64  │
└─────┴─────┴──────────┴────────┘

The content function shows that the data set was created at 2021-08-04T13:18:51.185 and was last modified at 2021-08-04T13:24:33.086.

Adding and removing columns

To add a new column (variable) to a data set, use the insertcols! function. The select function and its in-place counterpart select! can be used to drop columns from a data set. The select (select!) function is normally used to select or rearrange columns; however, passing Not(cols) selects all columns except those wrapped in Not.

julia> ds = Dataset(var1 = [1, 2, 3])
3×1 Dataset
 Row │ var1
     │ identity
     │ Int64?
─────┼──────────
   1 │        1
   2 │        2
   3 │        3

julia> insertcols!(ds, :var2 => ["val1", "val2", "val3"])
3×2 Dataset
 Row │ var1      var2
     │ identity  identity
     │ Int64?    String?
─────┼──────────────────────
   1 │        1  val1
   2 │        2  val2
   3 │        3  val3

julia> insertcols!(ds, :var3 => [3.5, 4.6, 32.0])
3×3 Dataset
 Row │ var1      var2        var3
     │ identity  identity    identity
     │ Int64?    String?     Float64?
─────┼────────────────────────────────
   1 │        1  val1             3.5
   2 │        2  val2             4.6
   3 │        3  val3            32.0

julia> select!(ds, Not(:var2))
3×2 Dataset
 Row │ var1      var3     
     │ identity  identity 
     │ Int64?    Float64? 
─────┼────────────────────
   1 │        1       3.5
   2 │        2       4.6
   3 │        3      32.0
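
For comparison, here is a sketch of the non-mutating select on a hypothetical two-column data set. It returns a new data set and leaves the original as it was:

```julia
using InMemoryDatasets

ds = Dataset(var1 = [1, 2, 3], var2 = ["val1", "val2", "val3"])

# select returns a new data set; ds itself keeps both columns.
ds2 = select(ds, Not(:var2))

names(ds2), names(ds)
```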

Converting the columns' type

To convert the values of a column to another type, use the following syntax:

modify!(ds, col => byrow(T))

where ds is the input data set, col is the column whose values are to be converted, and T is the new type (the byrow function is discussed in Row-wise operations, and the modify! function is discussed in Transforming datasets). Use this form when each value is converted individually. When the conversion needs information about all values in a column, drop the byrow function, e.g. modify!(ds, col => PooledArray). Additionally, the user may let Julia find the most suitable type for a column by calling modify!(ds, col => byrow(identity)). In the following example we use modify! to correct the types of the columns in ds.

Note that in the following example, calling byrow(identity) on :y converts the type Any to Integer. However, Integer is an abstract type, and it will slow down operations on ds. To improve the performance of calculations, the user may use modify!(ds, :y => byrow(Int)) instead.

julia> using PooledArrays

julia> ds = Dataset(x = [missing,2,3,4], y = Any[1,missing,-1,true], z = ["a", "bc", "a", missing])
4×3 Dataset
 Row │ x         y         z        
     │ identity  identity  identity 
     │ Int64?    Any       String?  
─────┼──────────────────────────────
   1 │  missing  1         a
   2 │        2  missing   bc
   3 │        3  -1        a
   4 │        4  true      missing  

julia> modify!(ds, :x => byrow(Float64), :y => byrow(identity), :z => PooledArray)
4×3 Dataset
 Row │ x          y         z        
     │ identity   identity  identity 
     │ Float64?   Integer?  String?  
─────┼───────────────────────────────
   1 │ missing           1  a
   2 │       2.0   missing  bc
   3 │       3.0        -1  a
   4 │       4.0      true  missing  

julia> ds[:, :x]
4-element Vector{Union{Missing, Float64}}:
  missing
 2.0
 3.0
 4.0

julia> ds[:, :y]
4-element Vector{Union{Missing, Integer}}:
    1
     missing
   -1
 true

julia> ds[:, :z]
4-element PooledVector{Union{Missing, String}, UInt32, Vector{UInt32}}:
 "a"
 "bc"
 "a"
 missing
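
The byrow(Int) alternative mentioned above can be sketched as follows. The one-column data set is made up for the example, and the sketch assumes that missing values pass through byrow(Int) unchanged, as they do for byrow(Float64) in the example above.

```julia
using InMemoryDatasets

ds = Dataset(y = Any[1, missing, -1, true])

# Convert each value to Int; missing values are assumed to stay missing.
modify!(ds, :y => byrow(Int))

y = ds[:, :y]
```

Here true is converted to 1, so the column no longer mixes Bool and Int values.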

To convert the type of multiple columns at once, the user may use the broadcasting technique:

julia> using PooledArrays

julia> ds = Dataset(x = [missing,2,3,4], y = Any[1,missing,-1,true], z = ["a", "bc", "a", missing])
4×3 Dataset
 Row │ x         y         z        
     │ identity  identity  identity 
     │ Int64?    Any       String?  
─────┼──────────────────────────────
   1 │  missing  1         a
   2 │        2  missing   bc
   3 │        3  -1        a
   4 │        4  true      missing  

julia> modify!(ds, [:x, :y] .=> byrow(Float64)) # note "." in ".=>"
4×3 Dataset
 Row │ x          y          z        
     │ identity   identity   identity 
     │ Float64?   Float64?   String?  
─────┼────────────────────────────────
   1 │ missing          1.0  a
   2 │       2.0  missing    bc
   3 │       3.0       -1.0  a
   4 │       4.0        1.0  missing  
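
The ".=>" syntax is plain Julia broadcasting and is not specific to modify!. The sketch below uses sqrt as a stand-in for byrow(Float64) so that it runs without loading any package:

```julia
# Broadcasting => over a vector of names pairs each name with the same function.
pairs_vec = [:x, :y] .=> sqrt

# The equivalent explicit form.
explicit = [:x => sqrt, :y => sqrt]
```

Both forms produce the same vector of pairs, so the broadcast form is simply a shorthand for listing each column => function pair.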

Some useful functions

The following functions are very handy when working with a data set; for more information, see the package documentation. Note that functions whose names end with ! modify the original data set.

  • names(ds) returns the column names as a vector of strings
  • size(ds) returns the dimensions of the data set, i.e. the number of rows and the number of columns
  • nrow(ds) returns the number of rows
  • ncol(ds) returns the number of columns
  • filter/filter! filter rows based on a byrow operation
  • first(ds, n) shows the first n rows of a data set
  • last(ds, n) shows the last n rows of a data set
  • rename/rename! can be used to rename columns
  • select/select! can be used to drop, select, or rearrange columns
  • deleteat! deletes rows from a data set
  • append!(ds, tds) appends tds at the end of ds
  • repeat/repeat! repeats rows of a data set
  • unique/unique! keep only the unique rows
  • duplicates finds duplicate rows

julia> test_data = Dataset(rand(1:10, 4, 3), :auto)
4×3 Dataset
 Row │ x1        x2        x3
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        5        10         5
   2 │        3         8        10
   3 │        1         7         7
   4 │        2         6         5

julia> names(test_data)
3-element Vector{String}:
 "x1"
 "x2"
 "x3"

julia> size(test_data)
(4, 3)

julia> nrow(test_data)
4

julia> ncol(test_data)
3

julia> first(test_data, 3)
3×3 Dataset
 Row │ x1        x2        x3
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        5        10         5
   2 │        3         8        10
   3 │        1         7         7

julia> last(test_data, 2)
2×3 Dataset
 Row │ x1        x2        x3
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        1         7         7
   2 │        2         6         5

julia> rename!(test_data, :x1 => :var1)
4×3 Dataset
 Row │ var1      x2        x3
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        5        10         5
   2 │        3         8        10
   3 │        1         7         7
   4 │        2         6         5

julia> select!(test_data, :x2, :var1)
4×2 Dataset
 Row │ x2        var1
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │       10         5
   2 │        8         3
   3 │        7         1
   4 │        6         2

julia> test_data
4×2 Dataset
 Row │ x2        var1
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │       10         5
   2 │        8         3
   3 │        7         1
   4 │        6         2

julia> deleteat!(test_data, 2)
3×2 Dataset
 Row │ x2        var1
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │       10         5
   2 │        7         1
   3 │        6         2

julia> second_data = Dataset(var1 = [1, 3, 5, 6, 6],
                             x2 = [3, 4,5,6, 3])
5×2 Dataset
 Row │ var1      x2
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         3
   2 │        3         4
   3 │        5         5
   4 │        6         6
   5 │        6         3

julia> append!(test_data, second_data)
8×2 Dataset
 Row │ x2        var1
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │       10         5
   2 │        7         1
   3 │        6         2
   4 │        3         1
   5 │        4         3
   6 │        5         5
   7 │        6         6
   8 │        3         6
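
Of the functions listed above, unique is not shown in the examples. A minimal sketch on a made-up data set, assuming unique(ds) with no further arguments compares whole rows:

```julia
using InMemoryDatasets

ds = Dataset(a = [1, 1, 2, 2], b = ["x", "x", "y", "z"])

# Keep one copy of each distinct row; ds itself is not modified.
u = unique(ds)

nrow(u), nrow(ds)
```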