Joins and related topics
Introduction
In this section we give details of the functions that can be used to combined data sets in InMemoryDatasets. These functions are multi-threaded and thus are very performant for general use. By default, InMemoryDatasets uses the formatted values for joining observations, however, this can be controlled by the mapformats
keyword argument.
Database like joins
The main functions for combining two data sets are leftjoin
, innerjoin
, outerjoin
, semijoin
, and antijoin
. The basic syntax for all of these functions is *join(dsl, dsr, on = [...])
, where dsl
is the left data set, dsr
is the right data set, and the on
keyword argument specifies the key(s) which will be used for matching observations between the left and the right data set. In general, the on
key can be a single column name (either as Symbol
or String
) or a vector of column names. When the column names are not identical in both tables, a vector of pair of column names can be used instead.
leftjoin
: joins two data sets and its output contains rows for values of the key(s) that exist in the left data set, whether or not that value exists in the right data set.innerjoin
: joins two data sets and its output contains rows for values of the key(s) that exist in both left and right data sets.outerjoin
: joins two data sets and its output contains rows for values of the key(s) that exist in any of the left or the right data set.semijoin
: Like an inner join, but its output is restricted to columns from the left table.antijoin
: The output contains rows for values of the key(s) that exist in the left data set but not the right data set. As withsemijoin
, output is restricted to columns from the left data set.
See the Wikipedia page on SQL joins for more information.
In general (for some special cases InMemoryDatasets may use "hash-join" techniques), to match observations, InMemoryDatasets sorts the right data set and uses a binary search algorithm for finding the matches of each observation in the left data set in the right data set based on the passed key column(s), thus, it has better performance when the left data set is larger than the right data set. The matching is done based on the formatted values of the key column(s), however, using the mapformats
keyword argument, one may set it to false
for one or both data sets.
For leftjoin
and innerjoin
the order of observations of the output data set is the same as their order in the left data set. However, the order of observations from the right table depends on the stability of the sort algorithm. User can set the stable
keyword argument to true
to guarantee a stable sort. For outerjoin
the order of observations from the left data set in the output data set is also the same as their order in the original data set, however, for those observations which are from the right table, there is no specific order.
By default, the join functions use a hybrid Heap Sort
algorithm to sort the observations in the right data set, however, setting alg = QuickSort
change the default algorithm to a hybrid Quick Sort one.
For very large data sets, if the sorting of the first key is expensive, setting the accelerate
keyword argument to true
may improve the overall performance. By setting accelerate = true
, InMemoryDatasets first divides all observations in the right data set into multiple parts (up to 1024 parts) based on the first passed key, and then for each observations in the left data set finds the corresponding part in the right data set and searches for the matching observations only within that part.
The leftjoin
, semijoin
, and antijoin
functions have in-place version which are leftjoin!
, semijoin!
, and antijoin!
, respectively. Instead of creating a new output dataset, these in-place versions of the functions replace the passed left table. Note that, for the leftjoin!
there must not be more than one match for each observation from the right table, otherwise, the function raises an error.
Examples
julia> name = Dataset(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 Dataset
Row │ ID Name
│ identity identity
│ Int64? String?
─────┼─────────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = Dataset(ID = [1, 2, 2, 4],
Job = ["Lawyer", "Doctor", "Florist", "Farmer"])
4×2 Dataset
Row │ ID Job
│ identity identity
│ Int64? String?
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 2 Florist
4 │ 4 Farmer
julia> leftjoin(name, job, on = :ID)
4×3 Dataset
Row │ ID Name Job
│ identity identity identity
│ Int64? String? String?
─────┼───────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 2 Jane Doe Florist
4 │ 3 Joe Blogs missing
julia> dsl = Dataset(year = [Date("2020-3-1"), Date("2021-10-21"), Date("2020-1-4"), Date("2012-12-11")],
leap_year = [true, false, true, true])
4×2 Dataset
Row │ year leap_year
│ identity identity
│ Date? Bool?
─────┼───────────────────────
1 │ 2020-03-01 true
2 │ 2021-10-21 false
3 │ 2020-01-04 true
4 │ 2012-12-11 true
julia> dsr = Dataset(year = [2020, 2021], event = ['A', 'B'])
2×2 Dataset
Row │ year event
│ identity identity
│ Int64? Char?
─────┼────────────────────
1 │ 2020 A
2 │ 2021 B
julia> setformat!(dsl, 1 => year);
julia> leftjoin(dsl, dsr, on = :year)
4×3 Dataset
Row │ year leap_year event
│ year identity identity
│ Date? Bool? Char?
─────┼────────────────────────────
1 │ 2020 true A
2 │ 2021 false B
3 │ 2020 true A
4 │ 2012 true missing
julia> innerjoin(name, job, on = :ID)
3×3 Dataset
Row │ ID Name Job
│ identity identity identity
│ Int64? String? String?
─────┼──────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 2 Jane Doe Florist
julia> outerjoin(name, job, on = :ID)
5×3 Dataset
Row │ ID Name Job
│ identity identity identity
│ Int64? String? String?
─────┼───────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 2 Jane Doe Florist
4 │ 3 Joe Blogs missing
5 │ 4 missing Farmer
To demonstrate the use of the accelerate
keyword, we generate two data sets and use the @btime
macro from the BenchmarkTools
package to benchmark the performance of the innerjoin
function with and without acceleration.
julia> using BenchmarkTools
julia> using Random
julia> dsl = Dataset(x1 = [randstring('a':'z', 6) for _ in 1:10^6],
x2 = rand(1:100, 10^6), x3 = rand(10^6));
julia> dsr = Dataset(y1 = [randstring('a':'z', 6) for _ in 1:10^6],
y2 = rand(1:100, 10^6), y3 = rand(10^6));
julia> @btime innerjoin(dsl, dsr, on = [:x1=>:y1, :x2=>:y2]);
382.759 ms (1254 allocations: 55.40 MiB)
julia> @btime innerjoin(dsl, dsr, on = [:x1=>:y1, :x2=>:y2], accelerate = true);
155.306 ms (2160 allocations: 45.92 MiB)
As it can be observed, using accelerate = true
significantly reduces the joining time. The reason for this reduction is because currently sorting String
type columns in InMemoryDatasets is relatively expensive, and using accelerate = true
helps to reduce this by splitting the observations into multiple parts.
contains
The contains
function is special function that can be used to enquiry observations of a data set which are contained in another data set. It returns a boolean vector where is true when the key for the corresponding row in the main data set is found in the transaction data set. The syntax of the function is the same as the leftjoin
function. When a single column is used for matching observations, the function uses hashing techniques to find the matched observations, however, for multiple key columns, it uses the sorting algorithm to search for the matched observations.
Both semijoin
and antijoin
use the contains
function behind the scene for filtering the left data set.
Close match join
The closejoin
function joins two data sets based on exact match on the key variable or the closest match (here, closest match depends on the direction
keyword argument) when the exact match doesn't exist.
The closejoin!
function does a close join in-place.
Examples
julia> classA = Dataset(id = ["id1", "id2", "id3", "id4", "id5"],
mark = [50, 69.5, 45.5, 88.0, 98.5])
5×2 Dataset
Row │ id mark
│ identity identity
│ String? Float64?
─────┼──────────────────────
1 │ id1 50.0
2 │ id2 69.5
3 │ id3 45.5
4 │ id4 88.0
5 │ id5 98.5
julia> grades = Dataset(mark = [0, 49.5, 59.5, 69.5, 79.5, 89.5, 95.5],
grade = ["F", "P", "C", "B", "A-", "A", "A+"])
7×2 Dataset
Row │ mark grade
│ identity identity
│ Float64? String?
─────┼──────────────────────
1 │ 0.0 F
2 │ 49.5 P
3 │ 59.5 C
4 │ 69.5 B
5 │ 79.5 A-
6 │ 89.5 A
7 │ 95.5 A+
julia> closejoin(classA, grades, on = :mark)
5×3 Dataset
Row │ id mark grade
│ identity identity identity
│ String? Float64? String?
─────┼──────────────────────────────────
1 │ id1 50.0 P
2 │ id2 69.5 B
3 │ id3 45.5 F
4 │ id4 88.0 A-
5 │ id5 98.5 A+
Examples of using closejoin
for financial data.
julia> trades = Dataset(
[["20160525 13:30:00.023",
"20160525 13:30:00.038",
"20160525 13:30:00.048",
"20160525 13:30:00.048",
"20160525 13:30:00.048"],
["MSFT", "MSFT",
"GOOG", "GOOG", "AAPL"],
[51.95, 51.95,
720.77, 720.92, 98.00],
[75, 155,
100, 100, 100]],
["time", "ticker", "price", "quantity"]);
julia> modify!(trades, 1 => byrow(x -> DateTime(x, dateformat"yyyymmdd HH:MM:SS.s")))
5×4 Dataset
Row │ time ticker price quantity
│ identity identity identity identity
│ DateTime? String? Float64? Int64?
─────┼─────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 MSFT 51.95 75
2 │ 2016-05-25T13:30:00.038 MSFT 51.95 155
3 │ 2016-05-25T13:30:00.048 GOOG 720.77 100
4 │ 2016-05-25T13:30:00.048 GOOG 720.92 100
5 │ 2016-05-25T13:30:00.048 AAPL 98.0 100
julia> quotes = Dataset(
[["20160525 13:30:00.023",
"20160525 13:30:00.023",
"20160525 13:30:00.030",
"20160525 13:30:00.041",
"20160525 13:30:00.048",
"20160525 13:30:00.049",
"20160525 13:30:00.072",
"20160525 13:30:00.075"],
["GOOG", "MSFT", "MSFT", "MSFT",
"GOOG", "AAPL", "GOOG", "MSFT"],
[720.50, 51.95, 51.97, 51.99,
720.50, 97.99, 720.50, 52.01],
[720.93, 51.96, 51.98, 52.00,
720.93, 98.01, 720.88, 52.03]],
["time", "ticker", "bid", "ask"]);
julia> modify!(quotes, 1 => byrow(x -> DateTime(x, dateformat"yyyymmdd HH:MM:SS.s")))
8×4 Dataset
Row │ time ticker bid ask
│ identity identity identity identity
│ DateTime? String? Float64? Float64?
─────┼─────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 GOOG 720.5 720.93
2 │ 2016-05-25T13:30:00.023 MSFT 51.95 51.96
3 │ 2016-05-25T13:30:00.030 MSFT 51.97 51.98
4 │ 2016-05-25T13:30:00.041 MSFT 51.99 52.0
5 │ 2016-05-25T13:30:00.048 GOOG 720.5 720.93
6 │ 2016-05-25T13:30:00.049 AAPL 97.99 98.01
7 │ 2016-05-25T13:30:00.072 GOOG 720.5 720.88
8 │ 2016-05-25T13:30:00.075 MSFT 52.01 52.03
julia> closejoin(trades, quotes, on = :time, makeunique = true)
5×7 Dataset
Row │ time ticker price quantity ticker_1 bid ask
│ identity identity identity identity identity identity identity
│ DateTime? String? Float64? Int64? String? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 MSFT 51.95 75 MSFT 51.95 51.96
2 │ 2016-05-25T13:30:00.038 MSFT 51.95 155 MSFT 51.97 51.98
3 │ 2016-05-25T13:30:00.048 GOOG 720.77 100 GOOG 720.5 720.93
4 │ 2016-05-25T13:30:00.048 GOOG 720.92 100 GOOG 720.5 720.93
5 │ 2016-05-25T13:30:00.048 AAPL 98.0 100 GOOG 720.5 720.93
In the above example, the closejoin
for each ticker
can be done by passing ticker
as the first variable for the on
keyword, i.e. when more than one key is used for on
the last one will be used for "close match" and the rest are used for exact match.
When border
is set to :missing
(default value) for the :backward
direction the value below the smallest value will be set to missing
, and for the :forward
direction the value above the largest value will be set to missing
. And when border = :nearest
the closest non-missing value will be fetched.
julia> closejoin(trades, quotes, on = [:ticker, :time], border = :missing)
5×6 Dataset
Row │ time ticker price quantity bid ask
│ identity identity identity identity identity identity
│ DateTime? String? Float64? Int64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 MSFT 51.95 75 51.95 51.96
2 │ 2016-05-25T13:30:00.038 MSFT 51.95 155 51.97 51.98
3 │ 2016-05-25T13:30:00.048 GOOG 720.77 100 720.5 720.93
4 │ 2016-05-25T13:30:00.048 GOOG 720.92 100 720.5 720.93
5 │ 2016-05-25T13:30:00.048 AAPL 98.0 100 missing missing
julia> closejoin(trades, quotes, on = [:ticker, :time], border = :nearest)
5×6 Dataset
Row │ time ticker price quantity bid ask
│ identity identity identity identity identity identity
│ DateTime? String? Float64? Int64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 MSFT 51.95 75 51.95 51.96
2 │ 2016-05-25T13:30:00.038 MSFT 51.95 155 51.97 51.98
3 │ 2016-05-25T13:30:00.048 GOOG 720.77 100 720.5 720.93
4 │ 2016-05-25T13:30:00.048 GOOG 720.92 100 720.5 720.93
5 │ 2016-05-25T13:30:00.048 AAPL 98.0 100 97.99 98.01
Inequality-kind inner join
The innerjoin
function can also use inequality comparisons to match observations from the left data set with the observations in the right data set. It can find all observations in the right data set that are <=
(<
) or >=
(>
) than a selected observation in the left data set. Additionally, if the user specifies two columns in the right table for a single key in the left table, it matches the observations in the left data set when they fall into the range specifies by the selected two key columns in the right data set. This conditional joining can be done within groups of observations if the user provide more than one key column for the left and the right data sets, i.e. the last key will be used for "inequality-kind" join and the rest will be used for the exact match.
For this kind of inner join, the key columns for both data sets which are defined for grouping observation must be passed as pair of column names (similar to normal use of innerjoin
), however, the key column from the left data set which is going to be used for conditional joining must be also passed as a column name, and the key column(s) for conditional joining from the right data set must be passed as a Tuple of column names. For example, if the key column for the left data set is :l_key
, and there are two columns in the right table called, :r_start
and :r_end
the following demonstrates how a user can perform different kinds of conditional joining:
:l_key => (:r_start, nothing)
, a match happens if the selected observation from the left data set is>= :r_start
.:l_key => (nothing, :r_end)
, a match happens if the selected observation from the left data set is<= :r_end
.:l_key => (:r_start, :r_end)
, a match happens if the selected observation from the left data set is>= :r_start
and<= :r_end
.
To change inequalities to strict inequality the strict_inequality
keyword argument must be set to true
for one or both sides, e.g. strict_inequality = true
(both side), strict_inequality = [false, true]
(only one side).
Examples
julia> store = Dataset([[Date("2019-10-01"), Date("2019-10-02"), Date("2019-10-05"), Date("2019-10-04"), Date("2019-10-03"), Date("2019-10-03")],
["A", "A", "B", "A", "B", "A"]], [:date, :store])
6×2 Dataset
Row │ date store
│ identity identity
│ Date? String?
─────┼──────────────────────
1 │ 2019-10-01 A
2 │ 2019-10-02 A
3 │ 2019-10-05 B
4 │ 2019-10-04 A
5 │ 2019-10-03 B
6 │ 2019-10-03 A
julia> roster = Dataset([["A", "A", "B", "A"],
[4, 1, 8, 2 ],
[Date("2019-10-04"), Date("2019-09-30"), Date("2019-10-04"), Date("2019-10-02")],
[Date("2019-10-06"), Date("2019-10-04"), Date("2019-10-06"), Date("2019-10-04")]],
["store", "employee_ID", "start_date", "end_date"])
4×4 Dataset
Row │ store employee_ID start_date end_date
│ identity identity identity identity
│ String? Int64? Date? Date?
─────┼───────────────────────────────────────────────
1 │ A 4 2019-10-04 2019-10-06
2 │ A 1 2019-09-30 2019-10-04
3 │ B 8 2019-10-04 2019-10-06
4 │ A 2 2019-10-02 2019-10-04
julia> innerjoin(store, roster, on = [:store => :store, :date => (:start_date, nothing)])
9×4 Dataset
Row │ date store employee_ID end_date
│ identity identity identity identity
│ Date? String? Int64? Date?
─────┼───────────────────────────────────────────────
1 │ 2019-10-01 A 1 2019-10-04
2 │ 2019-10-02 A 1 2019-10-04
3 │ 2019-10-02 A 2 2019-10-04
4 │ 2019-10-05 B 8 2019-10-06
5 │ 2019-10-04 A 1 2019-10-04
6 │ 2019-10-04 A 2 2019-10-04
7 │ 2019-10-04 A 4 2019-10-06
8 │ 2019-10-03 A 1 2019-10-04
9 │ 2019-10-03 A 2 2019-10-04
julia> innerjoin(store, roster, on = [:store => :store, :date => (nothing, :end_date)])
14×4 Dataset
Row │ date store employee_ID start_date
│ identity identity identity identity
│ Date? String? Int64? Date?
─────┼───────────────────────────────────────────────
1 │ 2019-10-01 A 1 2019-09-30
2 │ 2019-10-01 A 2 2019-10-02
3 │ 2019-10-01 A 4 2019-10-04
4 │ 2019-10-02 A 1 2019-09-30
5 │ 2019-10-02 A 2 2019-10-02
6 │ 2019-10-02 A 4 2019-10-04
7 │ 2019-10-05 B 8 2019-10-04
8 │ 2019-10-04 A 1 2019-09-30
9 │ 2019-10-04 A 2 2019-10-02
10 │ 2019-10-04 A 4 2019-10-04
11 │ 2019-10-03 B 8 2019-10-04
12 │ 2019-10-03 A 1 2019-09-30
13 │ 2019-10-03 A 2 2019-10-02
14 │ 2019-10-03 A 4 2019-10-04
julia> innerjoin(store, roster, on = [:store => :store, :date => (:start_date, :end_date)])
9×3 Dataset
Row │ date store employee_ID
│ identity identity identity
│ Date? String? Int64?
─────┼───────────────────────────────────
1 │ 2019-10-01 A 1
2 │ 2019-10-02 A 1
3 │ 2019-10-02 A 2
4 │ 2019-10-05 B 8
5 │ 2019-10-04 A 1
6 │ 2019-10-04 A 2
7 │ 2019-10-04 A 4
8 │ 2019-10-03 A 1
9 │ 2019-10-03 A 2
Update a data set by values from another data set
update!
updates a data set values by using values from a transaction data set. The function uses the given keys (on = ...
) to select rows for updating. By default, the missing values in transaction data set wouldn't replace the values in the main data set, however, using allowmissing = true
changes this behaviour. If there are multiple rows in the main data set which match the key(s), using mode = :all
causes all of them to be updated, and mode = :missing
causes only the ones which are missing in the main data set to be updated. If there are multiple rows in the transaction data set which match the key, only the last one will be used to update the main data set.
The update!
functions replace the main data set with the updated version, however, if a copy of the updated data set is required, the update
function can be used instead.
Examples
julia> main = Dataset(group = ["G1", "G1", "G1", "G1", "G2", "G2", "G2"],
id = [ 1 , 1 , 2 , 2 , 1 , 1 , 2 ],
x1 = [1.2, 2.3,missing, 2.3, 1.3, 2.1 , 0.0 ],
x2 = [ 5 , 4 , 4 , 2 , 1 ,missing, 2 ])
7×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼───────────────────────────────────────────
1 │ G1 1 1.2 5
2 │ G1 1 2.3 4
3 │ G1 2 missing 4
4 │ G1 2 2.3 2
5 │ G2 1 1.3 1
6 │ G2 1 2.1 missing
7 │ G2 2 0.0 2
julia> transaction = Dataset(group = ["G1", "G2"], id = [2, 1],
x1 = [2.5, missing], x2 = [missing, 3])
2×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼───────────────────────────────────────────
1 │ G1 2 2.5 missing
2 │ G2 1 missing 3
julia> update(main, transaction, on = [:group, :id],
allowmissing = false, mode = :missing)
7×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼──────────────────────────────────────────
1 │ G1 1 1.2 5
2 │ G1 1 2.3 4
3 │ G1 2 2.5 4
4 │ G1 2 2.3 2
5 │ G2 1 1.3 1
6 │ G2 1 2.1 3
7 │ G2 2 0.0 2
julia> update(main, transaction, on = [:group, :id],
allowmissing = false, mode = :all)
7×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼──────────────────────────────────────────
1 │ G1 1 1.2 5
2 │ G1 1 2.3 4
3 │ G1 2 2.5 4
4 │ G1 2 2.5 2
5 │ G2 1 1.3 3
6 │ G2 1 2.1 3
7 │ G2 2 0.0 2