Reshape data sets
Introduction
In InMemoryDatasets the reshaping of a data set is done by the transpose function. In the simplest case, you can think about a data set as a matrix, and transpose simply flips it over its diagonal; that is, transpose switches the row and column indices of it. The key feature that makes transpose versatile and powerful is its ability to do the simple transposing within each group of observations created by groupby!, groupby or gatherby. Basically, the two popular functions for reshaping data, stack and unstack, are special cases of transpose; The stack function transposes each row of a data set, and the unstack function transposes one column of a data set for each group of observations.
By default, the
transposefunction uses parallel algorithms to perform the transposing, however, this can be switched to single threaded process by setting thethreadskeyword argument tofalse.
In this section we give the details of reshaping a data set using the transpose function.
Simple case
Transpose means switching the row and column indices of a matrix, and in InMemoryDatasets when we select a set of columns, we practically have a matrix shape array of data, thus, its transposition means switching the row and column indices of it. Thus, in the simplest form, the syntax of the transpose function is transpose(ds, cols), where ds is the input data set and cols is any kind of column selector which specifies the selected columns for transposing. Since each column of a data set has also a name, transpose creates a new column in the output data set which shows those names from the input data set. By default, the column name for this new column is set to _variables_, and it can be change to any name by using the keyword argument variable_name. The transpose function has a keyword argument renamerowid, which can accept a user defined function to apply on the values of this new added column. Additionally, the keyword argument can be set to nothing if the new column is not needed in the output data set.
The rows of a data set haven't got names, thus, InMemoryDatasets uses automatic name generation to produce the names for the transposed columns of the output data set. User can supply a custom function for generating the names of the transposed columns by using the keyword argument renamecolid, which by default is set as i -> "_c" * string(i) where i is the sequence of columns. User can also set the id keyword argument to a column in the input data set when the values of the column can be used as the names for the transposed columns in the output data set. By default, the id keyword argument uses the formatted values as the names of the transposed columns, however, this can be turned off by using the mapformats = false keyword argument.
Examples
julia> ds = Dataset(x1 = [1,2,3,4], x2 = [1,4,9,16])
4×2 Dataset
Row │ x1 x2
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 1
2 │ 2 4
3 │ 3 9
4 │ 4 16
julia> transpose(ds, 1:2)
2×5 Dataset
Row │ _variables_ _c1 _c2 _c3 _c4
│ identity identity identity identity identity
│ String? Int64? Int64? Int64? Int64?
─────┼─────────────────────────────────────────────────────
1 │ x1 1 2 3 4
2 │ x2 1 4 9 16
julia> transpose(Matrix(ds)) # transpose of a data set, in the simplest case, is similar to matrix transposition
2×4 transpose(::Matrix{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
1 2 3 4
1 4 9 16
julia> insertcols!(ds, 1, :id => ["r1", "r2", "r3" , "r4"])
4×3 Dataset
Row │ id x1 x2
│ identity identity identity
│ String? Int64? Int64?
─────┼──────────────────────────────
1 │ r1 1 1
2 │ r2 2 4
3 │ r3 3 9
4 │ r4 4 16
julia> transpose(ds, [:x1, :x2], id = :id)
2×5 Dataset
Row │ _variables_ r1 r2 r3 r4
│ identity identity identity identity identity
│ String? Int64? Int64? Int64? Int64?
─────┼─────────────────────────────────────────────────────
1 │ x1 1 2 3 4
2 │ x2 1 4 9 16
julia> transpose(ds, [:x1, :x2], renamecolid = x -> "_COLUMN_" * string(x))
2×5 Dataset
Row │ _variables_ _COLUMN_1 _COLUMN_2 _COLUMN_3 _COLUMN_4
│ identity identity identity identity identity
│ String? Int64? Int64? Int64? Int64?
─────┼─────────────────────────────────────────────────────────
1 │ x1 1 2 3 4
2 │ x2 1 4 9 16
julia> transpose(ds, [:x1, :x2], renamerowid = x -> replace(x, "x"=>""))
2×5 Dataset
Row │ _variables_ _c1 _c2 _c3 _c4
│ identity identity identity identity identity
│ String? Int64? Int64? Int64? Int64?
─────┼─────────────────────────────────────────────────────
1 │ 1 1 2 3 4
2 │ 2 1 4 9 16
julia> transpose(ds, [:x1, :x2], id = :id, variable_name = nothing)
2×4 Dataset
Row │ r1 r2 r3 r4
│ identity identity identity identity
│ Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────────
1 │ 1 2 3 4
2 │ 1 4 9 16
julia> ds2 = Dataset(a=["x", "y"], b=[1, "two"], c=[3, 4], d=[true, false])
2×4 Dataset
Row │ a b c d
│ identity identity identity identity
│ String? Any Int64? Bool?
─────┼────────────────────────────────────────
1 │ x 1 3 true
2 │ y two 4 false
julia> transpose(ds2, Between(:b, :d), id = :a) # promoting the values
3×3 Dataset
Row │ _variables_ x y
│ identity identity identity
│ String? Any Any
─────┼─────────────────────────────────
1 │ b 1 two
2 │ c 3 4
3 │ d true falsetranspose of grouped data sets
When the first argument of the transpose function is a grouped data set - created by groupby!, groupby, or gatherby - transpose does the simple transposing within each group of observations. Thus, the transposition of a grouped data set can be viewed as transposing the matrix shape data values which are created for each group of observations. Since the size of transposed columns within each group can be different, transpose pads them with missing values to overcome this problem. The missing padding can be replaced by any other values which passed to default keyword argument of the function.
julia> ds = Dataset(group = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
6×5 Dataset
Row │ group b c d e
│ identity identity identity identity identity
│ Int64? Int64? Int64? Int64? String?
─────┼──────────────────────────────────────────────────
1 │ 1 1 1 1 a
2 │ 1 1 1 2 b
3 │ 2 1 1 3 c
4 │ 2 2 1 4 d
5 │ 3 2 1 5 e
6 │ 3 2 1 6 f
julia> transpose(groupby(ds, :group), 2:4)
9×4 Dataset
Row │ group _variables_ _c1 _c2
│ identity identity identity identity
│ Int64? String? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
julia> transpose(groupby(ds, :group), 2:4, id = :e)
9×8 Dataset
Row │ group _variables_ a b c d e f
│ identity identity identity identity identity identity identity identity
│ Int64? String? Int64? Int64? Int64? Int64? Int64? Int64?
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ 1 b 1 1 missing missing missing missing
2 │ 1 c 1 1 missing missing missing missing
3 │ 1 d 1 2 missing missing missing missing
4 │ 2 b missing missing 1 2 missing missing
5 │ 2 c missing missing 1 1 missing missing
6 │ 2 d missing missing 3 4 missing missing
7 │ 3 b missing missing missing missing 2 2
8 │ 3 c missing missing missing missing 1 1
9 │ 3 d missing missing missing missing 5 6
julia> transpose(groupby(ds, :group), 2:4, id = :e, default = 99999)
9×8 Dataset
Row │ group _variables_ a b c d e f
│ identity identity identity identity identity identity identity identity
│ Int64? String? Int64? Int64? Int64? Int64? Int64? Int64?
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ 1 b 1 1 99999 99999 99999 99999
2 │ 1 c 1 1 99999 99999 99999 99999
3 │ 1 d 1 2 99999 99999 99999 99999
4 │ 2 b 99999 99999 1 2 99999 99999
5 │ 2 c 99999 99999 1 1 99999 99999
6 │ 2 d 99999 99999 3 4 99999 99999
7 │ 3 b 99999 99999 99999 99999 2 2
8 │ 3 c 99999 99999 99999 99999 1 1
9 │ 3 d 99999 99999 99999 99999 5 6
julia> pop = Dataset(country = ["c1","c1","c2","c2","c3","c3"],
sex = [1, 2, 1, 2, 1, 2],
pop_2000 = [100, 120, 150, 155, 170, 190],
pop_2010 = [110, 120, 155, 160, 178, 200],
pop_2020 = [115, 130, 161, 165, 180, 203])
6×5 Dataset
Row │ country sex pop_2000 pop_2010 pop_2020
│ identity identity identity identity identity
│ String? Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────
1 │ c1 1 100 110 115
2 │ c1 2 120 120 130
3 │ c2 1 150 155 161
4 │ c2 2 155 160 165
5 │ c3 1 170 178 180
6 │ c3 2 190 200 203
julia> gender(x) = x == 1 ? "Male" : "Female"
gender (generic function with 1 method)
julia> setformat!(pop, 2 => gender)
6×5 Dataset
Row │ country sex pop_2000 pop_2010 pop_2020
│ identity gender identity identity identity
│ String? Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────────────────
1 │ c1 Male 100 110 115
2 │ c1 Female 120 120 130
3 │ c2 Male 150 155 161
4 │ c2 Female 155 160 165
5 │ c3 Male 170 178 180
6 │ c3 Female 190 200 203
julia> transpose(gatherby(pop, 1, isgathered = true), r"pop", id = :sex)
9×4 Dataset
Row │ country _variables_ Male Female
│ identity identity identity identity
│ String? String? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ c1 pop_2000 100 120
2 │ c1 pop_2010 110 120
3 │ c1 pop_2020 115 130
4 │ c2 pop_2000 150 155
5 │ c2 pop_2010 155 160
6 │ c2 pop_2020 161 165
7 │ c3 pop_2000 170 190
8 │ c3 pop_2010 178 200
9 │ c3 pop_2020 180 203
julia> ds = Dataset(region = repeat(["North","North","South","South"],2),
fuel_type = repeat(["gas","coal"],4),
load = rand(8),
time = [1,1,1,1,2,2,2,2],
)
8×4 Dataset
Row │ region fuel_type load time
│ identity identity identity identity
│ String? String? Float64? Int64?
─────┼──────────────────────────────────────────
1 │ North gas 0.914918 1
2 │ North coal 0.158792 1
3 │ South gas 0.415604 1
4 │ South coal 0.0702206 1
5 │ North gas 0.419423 2
6 │ North coal 0.765637 2
7 │ South gas 0.222119 2
8 │ South coal 0.723559 2
julia> transpose(groupby(ds, :time), :load, id = 1:2)
2×6 Dataset
Row │ time _variables_ ("North", "gas") ("North", "coal") ("South", "gas") ("South", "coal")
│ identity identity identity identity identity identity
│ Int64? String? Float64? Float64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 load 0.778866 0.0256356 0.729273 0.786919
2 │ 2 load 0.676968 0.366241 0.577498 0.294181
julia> ds = Dataset([[1, 2, 3], [1.1, 2.0, 3.3],[1.1, 2.1, 3.0],[1.1, 2.0, 3.2]]
,[:person, Symbol("11/2020"), Symbol("12/2020"), Symbol("1/2021")])
3×4 Dataset
Row │ person 11/2020 12/2020 1/2021
│ identity identity identity identity
│ Float64? Float64? Float64? Float64?
─────┼────────────────────────────────────────
1 │ 1.0 1.1 1.1 1.1
2 │ 2.0 2.0 2.1 2.0
3 │ 3.0 3.3 3.0 3.2
julia> transpose(gatherby(ds, :person), Not(:person),
variable_name = "Date",
renamerowid = x -> Date(x, dateformat"m/y"),
renamecolid = x -> "measurement")
9×3 Dataset
Row │ person Date measurement
│ identity identity identity
│ Float64? Date? Float64?
─────┼───────────────────────────────────
1 │ 1.0 2020-11-01 1.1
2 │ 1.0 2020-12-01 1.1
3 │ 1.0 2021-01-01 1.1
4 │ 2.0 2020-11-01 2.0
5 │ 2.0 2020-12-01 2.1
6 │ 2.0 2021-01-01 2.0
7 │ 3.0 2020-11-01 3.3
8 │ 3.0 2020-12-01 3.0
9 │ 3.0 2021-01-01 3.2
julia> ds = Dataset(A_2018=1:4, A_2019=5:8, B_2017=9:12,
B_2018=9:12, B_2019 = [missing,13,14,15],
ID = [1,2,3,4])
4×6 Dataset
Row │ A_2018 A_2019 B_2017 B_2018 B_2019 ID
│ identity identity identity identity identity identity
│ Int64? Int64? Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────────────────────────────
1 │ 1 5 9 9 missing 1
2 │ 2 6 10 10 13 2
3 │ 3 7 11 11 14 3
4 │ 4 8 12 12 15 4
julia> f(x) = replace(x, r"[A_B]"=>"")
f (generic function with 1 method)
julia> ; # later we provide a simpler solution for this example
julia> dsA = transpose(groupby(ds, :ID), r"A", renamerowid = f, variable_name = "Year", renamecolid = x->"A");
julia> dsB = transpose(groupby(ds, :ID), r"B", renamerowid = f, variable_name = "Year", renamecolid = x->"B");
julia> outerjoin(dsA, dsB, on = [:ID, :Year])
12×4 Dataset
Row │ ID Year A B
│ identity identity identity identity
│ Int64? String? Int64? Int64?
─────┼────────────────────────────────────────
1 │ 1 2018 1 9
2 │ 1 2019 5 missing
3 │ 2 2018 2 10
4 │ 2 2019 6 13
5 │ 3 2018 3 11
6 │ 3 2019 7 14
7 │ 4 2018 4 12
8 │ 4 2019 8 15
9 │ 1 2017 missing 9
10 │ 2 2017 missing 10
11 │ 3 2017 missing 11
12 │ 4 2017 missing 12Advanced options
reanemcolid with two arguments
The renamecolid function can also get access to the variable names from the input data set as the second argument. This can be used to generate even more customised column names for the output data set.
Passing Tuple of column selectors
The column selector of the transpose function can be also a Tuple of column selectors. In this case, InMemoryDatasets does the transposition for each element of the tuple and then horizontally concatenates the output data sets to create a single data set. This provides extra flexibility to the user for reshaping a data set. By default, the variable_name is set to nothing, when Tuple of column selectors is passed as the argument, however, we can supply different names for each element of the Tuple.
Since the column names for the output data set can be the same for all elements of the tuple, transpose automatically modifies them to make them unique. Nevertheless, by passing renamecolid, we can customise the column names.
Examples
julia> ds = Dataset([[1, 1, 1, 2, 2, 2],
["foo", "bar", "monty", "foo", "bar", "monty"],
["a", "b", "c", "d", "e", "f"],
[1, 2, 3, 4, 5, 6]], [:g, :key, :foo, :bar])
6×4 Dataset
Row │ g key foo bar
│ identity identity identity identity
│ Int64? String? String? Int64?
─────┼────────────────────────────────────────
1 │ 1 foo a 1
2 │ 1 bar b 2
3 │ 1 monty c 3
4 │ 2 foo d 4
5 │ 2 bar e 5
6 │ 2 monty f 6
julia> transpose(groupby(ds, :g), (:foo, :bar), id = :key)
2×7 Dataset
Row │ g foo bar monty foo_1 bar_1 monty_1
│ identity identity identity identity identity identity identity
│ Int64? String? String? String? Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────────────────────────
1 │ 1 a b c 1 2 3
2 │ 2 d e f 4 5 6
julia> transpose(groupby(ds, :g), (:foo, :bar), id = :key,
renamecolid = (x,y) -> string(x,"_",y[1]))
2×7 Dataset
Row │ g foo_foo bar_foo monty_foo foo_bar bar_bar monty_bar
│ identity identity identity identity identity identity identity
│ Int64? String? String? String? Int64? Int64? Int64?
─────┼────────────────────────────────────────────────────────────────────────
1 │ 1 a b c 1 2 3
2 │ 2 d e f 4 5 6
julia> ds = Dataset(paddockId= [0, 0, 1, 1, 2, 2],
color= ["red", "blue", "red", "blue", "red", "blue"],
count= [3, 4, 3, 4, 3, 4],
weight= [0.2, 0.3, 0.2, 0.3, 0.2, 0.2])
6×4 Dataset
Row │ paddockId color count weight
│ identity identity identity identity
│ Int64? String? Int64? Float64?
─────┼─────────────────────────────────────────
1 │ 0 red 3 0.2
2 │ 0 blue 4 0.3
3 │ 1 red 3 0.2
4 │ 1 blue 4 0.3
5 │ 2 red 3 0.2
6 │ 2 blue 4 0.2
julia> transpose(groupby(ds, 1), (:count, :weight),
id = :color,
renamecolid = (x,y)->string(x,"/",y[1]),
)
3×5 Dataset
Row │ paddockId red/count blue/count red/weight blue/weight
│ identity identity identity identity identity
│ Int64? Int64? Int64? Float64? Float64?
─────┼───────────────────────────────────────────────────────────
1 │ 0 3 4 0.2 0.3
2 │ 1 3 4 0.2 0.3
3 │ 2 3 4 0.2 0.2
julia> ds = Dataset(A_2018=1:4, A_2019=5:8, B_2017=9:12,
B_2018=9:12, B_2019 = [missing,13,14,15],
ID = [1,2,3,4])
4×6 Dataset
Row │ A_2018 A_2019 B_2017 B_2018 B_2019 ID
│ identity identity identity identity identity identity
│ Int64? Int64? Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────────────────────────────
1 │ 1 5 9 9 missing 1
2 │ 2 6 10 10 13 2
3 │ 3 7 11 11 14 3
4 │ 4 8 12 12 15 4
julia> f(x) = replace(x, r"[A_B]"=>"")
f (generic function with 1 method)
julia> transpose(gatherby(ds, :ID), ([4,5,3], [1,2]),
variable_name = [:year, nothing],
renamerowid = f,
renamecolid = (x,y)->y[1][1:1])
12×4 Dataset
Row │ ID year B A
│ identity identity identity identity
│ Int64? String? Int64? Int64?
─────┼────────────────────────────────────────
1 │ 1 2018 9 1
2 │ 1 2019 missing 5
3 │ 1 2017 9 missing
4 │ 2 2018 10 2
5 │ 2 2019 13 6
6 │ 2 2017 10 missing
7 │ 3 2018 11 3
8 │ 3 2019 14 7
9 │ 3 2017 11 missing
10 │ 4 2018 12 4
11 │ 4 2019 15 8
12 │ 4 2017 12 missing
julia> ds = Dataset(rand(1:10, 2, 6), :auto)
2×6 Dataset
Row │ x1 x2 x3 x4 x5 x6
│ identity identity identity identity identity identity
│ Int64? Int64? Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────────────────────────────
1 │ 10 6 8 10 10 3
2 │ 9 7 9 4 2 10
julia> transpose(ds, ntuple(i->[i, i+3], 3), renamecolid = (x,y)->string(y[x]))
2×6 Dataset
Row │ x1 x4 x2 x5 x3 x6
│ identity identity identity identity identity identity
│ Int64? Int64? Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────────────────────────────
1 │ 10 9 6 7 8 9
2 │ 10 4 10 2 3 10Spreadsheet-style pivot table
To create a spreadsheet-style pivot tables in InMemoryDatasets, one can use the combination of the combine and transpose functions. To demonstrate this, we re-produce the documentation's examples of the pandas's pivot_table function (ver: 1.3.4).
julia> ds = Dataset(A = ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
B = ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
C = ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
D = [1, 2, 2, 3, 3, 4, 5, 6, 7],
E = [2, 4, 5, 5, 6, 6, 8, 9, 9])
9×5 Dataset
Row │ A B C D E
│ identity identity identity identity identity
│ String? String? String? Int64? Int64?
─────┼──────────────────────────────────────────────────
1 │ foo one small 1 2
2 │ foo one large 2 4
3 │ foo one large 2 5
4 │ foo two small 3 5
5 │ foo two small 3 6
6 │ bar one large 4 6
7 │ bar one small 5 8
8 │ bar two small 6 9
9 │ bar two large 7 9
julia> ; # This first example aggregates values by taking the sum.
julia> _tmp = combine(groupby(ds, 1:3), 4=>IMD.sum);
julia> transpose(gatherby(_tmp, 1:2, isgathered = true), :sum_D, id = :C, variable_name = nothing)
4×4 Dataset
Row │ A B large small
│ identity identity identity identity
│ String? String? Int64? Int64?
─────┼────────────────────────────────────────
1 │ bar one 4 5
2 │ bar two 7 6
3 │ foo one 4 1
4 │ foo two missing 6
julia> transpose(gatherby(_tmp, 1:2, isgathered = true), :sum_D, id = :C, variable_name = nothing, default = 0)
4×4 Dataset
Row │ A B large small
│ identity identity identity identity
│ String? String? Int64? Int64?
─────┼────────────────────────────────────────
1 │ bar one 4 5
2 │ bar two 7 6
3 │ foo one 4 1
4 │ foo two 0 6
julia> ; # The next example aggregates by taking the mean across multiple columns. Here we don't need transposing
julia> combine(groupby(ds, [:A, :C]), [:D, :E] => mean)
4×4 Dataset
Row │ A C mean_D mean_E
│ identity identity identity identity
│ String? String? Float64? Float64?
─────┼────────────────────────────────────────
1 │ bar large 5.5 7.5
2 │ bar small 5.5 8.5
3 │ foo large 2.0 4.5
4 │ foo small 2.33333 4.33333
julia> combine(groupby(ds, [:A, :C]), :D => mean, :E => [IMD.minimum, IMD.maximum, mean])
4×6 Dataset
Row │ A C mean_D minimum_E maximum_E mean_E
│ identity identity identity identity identity identity
│ String? String? Float64? Int64? Int64? Float64?
─────┼──────────────────────────────────────────────────────────────
1 │ bar large 5.5 6 9 7.5
2 │ bar small 5.5 8 9 8.5
3 │ foo large 2.0 4 5 4.5
4 │ foo small 2.33333 2 6 4.33333