Joins and related topics
Introduction
In this section we give details of the functions that can be used to combined data sets in InMemoryDatasets. These functions are multi-threaded (unless threads = false
i s passed to the functions) and thus are very performant for general use. By default, InMemoryDatasets uses the formatted values for joining observations, however, this can be controlled by the mapformats
keyword argument.
Database like joins
The main functions for combining two data sets are leftjoin
, innerjoin
, outerjoin
, semijoin
, and antijoin
. The basic syntax for all of these functions is *join(dsl, dsr, on = [...])
, where dsl
is the left data set, dsr
is the right data set, and the on
keyword argument specifies the key(s) which will be used for matching observations between the left and the right data set. In general, the on
key can be a single column name (either as Symbol
or String
) or a vector of column names. When the column names are not identical in both tables, a vector of pair of column names can be used instead.
leftjoin
: joins two data sets and its output contains rows for values of the key(s) that exist in the left data set, whether or not that value exists in the right data set.innerjoin
: joins two data sets and its output contains rows for values of the key(s) that exist in both left and right data sets.outerjoin
: joins two data sets and its output contains rows for values of the key(s) that exist in any of the left or the right data set.semijoin
: Like an inner join, but its output is restricted to columns from the left table.antijoin
: The output contains rows for values of the key(s) that exist in the left data set but not the right data set. As withsemijoin
, output is restricted to columns from the left data set.
See the Wikipedia page on SQL joins for more information.
By default (except for semijoin
and antijoin
), to match observations, InMemoryDatasets sorts the right data set and uses a binary search algorithm for finding the matches of each observation in the left data set in the right data set based on the passed key column(s), thus, it has better performance when the left data set is larger than the right data set. However, passing method = :hash
changes the default. The matching is done based on the formatted values of the key column(s), however, using the mapformats
keyword argument can change this behaviour, one may set it to false
for one or both data sets.
For method = :sort
and leftjoin
and innerjoin
the order of observations of the output data set is the same as their order in the left data set. However, the order of observations from the right table depends on the stability of the sort algorithm. User can set the stable
keyword argument to true
to guarantee a stable sort. For outerjoin
the order of observations from the left data set in the output data set is also the same as their order in the original data set, however, for those observations which are from the right table when method = :sort
there is no specific order.
By default, when method = :sort
, the join functions use a hybrid Heap Sort
algorithm to sort the observations in the right data set, however, setting alg = QuickSort
change the default algorithm to a hybrid Quick Sort one.
For very large data sets, if the sorting of the first key is expensive, setting the accelerate
keyword argument to true
may improve the overall performance. By setting accelerate = true
, InMemoryDatasets first divides all observations in the right data set into multiple parts (up to 1024 parts) based on the first passed key, and then for each observations in the left data set finds the corresponding part in the right data set and searches for the matching observations only within that part.
The leftjoin
, semijoin
, and antijoin
functions have in-place version which are leftjoin!
, semijoin!
, and antijoin!
, respectively. Instead of creating a new output dataset, these in-place versions of the functions replace the passed left table. Note that, for the leftjoin!
there must be no more than one match for each observation from the right table, otherwise, the function raises an error.
Examples
julia> name = Dataset(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 Dataset
Row │ ID Name
│ identity identity
│ Int64? String?
─────┼─────────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = Dataset(ID = [1, 2, 2, 4],
Job = ["Lawyer", "Doctor", "Florist", "Farmer"])
4×2 Dataset
Row │ ID Job
│ identity identity
│ Int64? String?
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 2 Florist
4 │ 4 Farmer
julia> leftjoin(name, job, on = :ID)
4×3 Dataset
Row │ ID Name Job
│ identity identity identity
│ Int64? String? String?
─────┼───────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 2 Jane Doe Florist
4 │ 3 Joe Blogs missing
julia> dsl = Dataset(year = [Date("2020-3-1"), Date("2021-10-21"), Date("2020-1-4"), Date("2012-12-11")],
leap_year = [true, false, true, true])
4×2 Dataset
Row │ year leap_year
│ identity identity
│ Date? Bool?
─────┼───────────────────────
1 │ 2020-03-01 true
2 │ 2021-10-21 false
3 │ 2020-01-04 true
4 │ 2012-12-11 true
julia> dsr = Dataset(year = [2020, 2021], event = ['A', 'B'])
2×2 Dataset
Row │ year event
│ identity identity
│ Int64? Char?
─────┼────────────────────
1 │ 2020 A
2 │ 2021 B
julia> setformat!(dsl, 1 => year);
julia> leftjoin(dsl, dsr, on = :year)
4×3 Dataset
Row │ year leap_year event
│ year identity identity
│ Date? Bool? Char?
─────┼────────────────────────────
1 │ 2020 true A
2 │ 2021 false B
3 │ 2020 true A
4 │ 2012 true missing
julia> innerjoin(name, job, on = :ID)
3×3 Dataset
Row │ ID Name Job
│ identity identity identity
│ Int64? String? String?
─────┼──────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 2 Jane Doe Florist
julia> outerjoin(name, job, on = :ID)
5×3 Dataset
Row │ ID Name Job
│ identity identity identity
│ Int64? String? String?
─────┼───────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 2 Jane Doe Florist
4 │ 3 Joe Blogs missing
5 │ 4 missing Farmer
To demonstrate the use of the accelerate
keyword, we generate two data sets and use the @btime
macro from the BenchmarkTools
package to benchmark the performance of the innerjoin
function with and without acceleration.
julia> using BenchmarkTools
julia> using Random
julia> dsl = Dataset(x1 = [randstring('a':'z', 6) for _ in 1:10^6],
x2 = rand(1:100, 10^6), x3 = rand(10^6));
julia> dsr = Dataset(y1 = [randstring('a':'z', 6) for _ in 1:10^6],
y2 = rand(1:100, 10^6), y3 = rand(10^6));
julia> @btime innerjoin(dsl, dsr, on = [:x1=>:y1, :x2=>:y2]);
382.759 ms (1254 allocations: 55.40 MiB)
julia> @btime innerjoin(dsl, dsr, on = [:x1=>:y1, :x2=>:y2], accelerate = true);
155.306 ms (2160 allocations: 45.92 MiB)
As it can be observed, using accelerate = true
significantly reduces the joining time. The reason for this reduction is because currently sorting String
type columns in InMemoryDatasets is relatively expensive, and using accelerate = true
helps to reduce this by splitting the observations into multiple parts.
And of course for this example we can simply use the hash techniques for matching observations:
julia> @btime innerjoin(dsl, dsr, on = [:x1=>:y1, :x2=>:y2], method = :hash);
86.323 ms (1095 allocations: 96.95 MiB)
contains
The contains
function is special function that can be used to enquiry observations of a data set which are contained in another data set. It returns a boolean vector where is true when the key for the corresponding row in the main data set is found in the transaction data set. The syntax of the function is the same as the leftjoin
function. When a single column is used for matching observations, the function uses hashing techniques to find the matched observations, however, for multiple key columns, it uses the sorting algorithm to search for the matched observations.
Both semijoin
and antijoin
use the contains
function behind the scene for filtering the left data set.
Close match join
The closejoin
function joins two data sets based on exact match on the key variable or the closest match (here, closest match depends on the direction
keyword argument) when the exact match doesn't exist.
The closejoin!
function does a close join in-place.
A tolerance for finding close matches can be passed via the tol
keyword argument, and for the situations where the exact match is not allowed, user can pass allow_exact_match = false
.
closejoin/!
support method = :hash
however, for the last key column it uses the sorting method to find the closest match.
Examples
julia> classA = Dataset(id = ["id1", "id2", "id3", "id4", "id5"],
mark = [50, 69.5, 45.5, 88.0, 98.5])
5×2 Dataset
Row │ id mark
│ identity identity
│ String? Float64?
─────┼──────────────────────
1 │ id1 50.0
2 │ id2 69.5
3 │ id3 45.5
4 │ id4 88.0
5 │ id5 98.5
julia> grades = Dataset(mark = [0, 49.5, 59.5, 69.5, 79.5, 89.5, 95.5],
grade = ["F", "P", "C", "B", "A-", "A", "A+"])
7×2 Dataset
Row │ mark grade
│ identity identity
│ Float64? String?
─────┼──────────────────────
1 │ 0.0 F
2 │ 49.5 P
3 │ 59.5 C
4 │ 69.5 B
5 │ 79.5 A-
6 │ 89.5 A
7 │ 95.5 A+
julia> closejoin(classA, grades, on = :mark)
5×3 Dataset
Row │ id mark grade
│ identity identity identity
│ String? Float64? String?
─────┼──────────────────────────────────
1 │ id1 50.0 P
2 │ id2 69.5 B
3 │ id3 45.5 F
4 │ id4 88.0 A-
5 │ id5 98.5 A+
Examples of using closejoin
for financial data.
julia> trades = Dataset(
[["20160525 13:30:00.023",
"20160525 13:30:00.038",
"20160525 13:30:00.048",
"20160525 13:30:00.048",
"20160525 13:30:00.048"],
["MSFT", "MSFT",
"GOOG", "GOOG", "AAPL"],
[51.95, 51.95,
720.77, 720.92, 98.00],
[75, 155,
100, 100, 100]],
["time", "ticker", "price", "quantity"]);
julia> modify!(trades, 1 => byrow(x -> DateTime(x, dateformat"yyyymmdd HH:MM:SS.s")))
5×4 Dataset
Row │ time ticker price quantity
│ identity identity identity identity
│ DateTime? String? Float64? Int64?
─────┼─────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 MSFT 51.95 75
2 │ 2016-05-25T13:30:00.038 MSFT 51.95 155
3 │ 2016-05-25T13:30:00.048 GOOG 720.77 100
4 │ 2016-05-25T13:30:00.048 GOOG 720.92 100
5 │ 2016-05-25T13:30:00.048 AAPL 98.0 100
julia> quotes = Dataset(
[["20160525 13:30:00.023",
"20160525 13:30:00.023",
"20160525 13:30:00.030",
"20160525 13:30:00.041",
"20160525 13:30:00.048",
"20160525 13:30:00.049",
"20160525 13:30:00.072",
"20160525 13:30:00.075"],
["GOOG", "MSFT", "MSFT", "MSFT",
"GOOG", "AAPL", "GOOG", "MSFT"],
[720.50, 51.95, 51.97, 51.99,
720.50, 97.99, 720.50, 52.01],
[720.93, 51.96, 51.98, 52.00,
720.93, 98.01, 720.88, 52.03]],
["time", "ticker", "bid", "ask"]);
julia> modify!(quotes, 1 => byrow(x -> DateTime(x, dateformat"yyyymmdd HH:MM:SS.s")))
8×4 Dataset
Row │ time ticker bid ask
│ identity identity identity identity
│ DateTime? String? Float64? Float64?
─────┼─────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 GOOG 720.5 720.93
2 │ 2016-05-25T13:30:00.023 MSFT 51.95 51.96
3 │ 2016-05-25T13:30:00.030 MSFT 51.97 51.98
4 │ 2016-05-25T13:30:00.041 MSFT 51.99 52.0
5 │ 2016-05-25T13:30:00.048 GOOG 720.5 720.93
6 │ 2016-05-25T13:30:00.049 AAPL 97.99 98.01
7 │ 2016-05-25T13:30:00.072 GOOG 720.5 720.88
8 │ 2016-05-25T13:30:00.075 MSFT 52.01 52.03
julia> closejoin(trades, quotes, on = :time, makeunique = true)
5×7 Dataset
Row │ time ticker price quantity ticker_1 bid ask
│ identity identity identity identity identity identity identity
│ DateTime? String? Float64? Int64? String? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 MSFT 51.95 75 MSFT 51.95 51.96
2 │ 2016-05-25T13:30:00.038 MSFT 51.95 155 MSFT 51.97 51.98
3 │ 2016-05-25T13:30:00.048 GOOG 720.77 100 GOOG 720.5 720.93
4 │ 2016-05-25T13:30:00.048 GOOG 720.92 100 GOOG 720.5 720.93
5 │ 2016-05-25T13:30:00.048 AAPL 98.0 100 GOOG 720.5 720.93
In the above example, the closejoin
for each ticker
can be done by passing ticker
as the first variable for the on
keyword, i.e. when more than one key is used for on
the last one will be used for "close match" and the rest are used for exact match.
When border
is set to :missing
(default value) for the :backward
direction the value below the smallest value will be set to missing
, and for the :forward
direction the value above the largest value will be set to missing
. And when border = :nearest
the closest non-missing value will be fetched.
Passing border = :none
, sets missing for values in left data set which are out of the right data set's range.
julia> closejoin(trades, quotes, on = [:ticker, :time], border = :missing)
5×6 Dataset
Row │ time ticker price quantity bid ask
│ identity identity identity identity identity identity
│ DateTime? String? Float64? Int64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 MSFT 51.95 75 51.95 51.96
2 │ 2016-05-25T13:30:00.038 MSFT 51.95 155 51.97 51.98
3 │ 2016-05-25T13:30:00.048 GOOG 720.77 100 720.5 720.93
4 │ 2016-05-25T13:30:00.048 GOOG 720.92 100 720.5 720.93
5 │ 2016-05-25T13:30:00.048 AAPL 98.0 100 missing missing
julia> closejoin(trades, quotes, on = [:ticker, :time], border = :nearest)
5×6 Dataset
Row │ time ticker price quantity bid ask
│ identity identity identity identity identity identity
│ DateTime? String? Float64? Int64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 2016-05-25T13:30:00.023 MSFT 51.95 75 51.95 51.96
2 │ 2016-05-25T13:30:00.038 MSFT 51.95 155 51.97 51.98
3 │ 2016-05-25T13:30:00.048 GOOG 720.77 100 720.5 720.93
4 │ 2016-05-25T13:30:00.048 GOOG 720.92 100 720.5 720.93
5 │ 2016-05-25T13:30:00.048 AAPL 98.0 100 97.99 98.01
Inequality-kind joins
The innerjoin
, contains
, semijoin
, semijoin!
, antijoin
, antijoin!
functions can also use inequality comparisons to match observations from the left data set with the observations in the right data set. They can find all observations in the right data set that are <=
(<
) or >=
(>
) than a selected observation in the left data set. Additionally, if the user specifies two columns in the right table for a single key in the left table, they matche the observations in the left data set when they fall into the range specifies by the selected two key columns in the right data set. These conditional joins can be done within groups of observations if the user provide more than one key column for the left and the right data sets, i.e. the last key will be used for "inequality-kind" join and the rest will be used for the exact match.
For these kind of joins, the key columns for both data sets which are defined for grouping observation must be passed as pair of column names (similar to normal use of other joins), however, the key column from the left data set which is going to be used for conditional joining must be also passed as a column name, and the key column(s) for conditional joining from the right data set must be passed as a Tuple of column names. For example, if the key column for the left data set is :l_key
, and there are two columns in the right table called, :r_start
and :r_end
the following demonstrates how a user can perform different kinds of conditional joining:
:l_key => (:r_start, nothing)
, a match happens if the selected observation from the left data set is>= :r_start
.:l_key => (nothing, :r_end)
, a match happens if the selected observation from the left data set is<= :r_end
.:l_key => (:r_start, :r_end)
, a match happens if the selected observation from the left data set is>= :r_start
and<= :r_end
.
To change inequalities to strict inequalities the strict_inequality
keyword argument must be set to true
for one or both sides, e.g. strict_inequality = true
(both side), strict_inequality = [false, true]
(only one side).
These joins also support the method
keyword argument for all key columns which are not used for inequality like join. contains
and its related functions use method = :hash
by default.
Examples
julia> store = Dataset([[Date("2019-10-01"), Date("2019-10-02"), Date("2019-10-05"), Date("2019-10-04"), Date("2019-10-03"), Date("2019-10-03")],
["A", "A", "B", "A", "B", "A"]], [:date, :store])
6×2 Dataset
Row │ date store
│ identity identity
│ Date? String?
─────┼──────────────────────
1 │ 2019-10-01 A
2 │ 2019-10-02 A
3 │ 2019-10-05 B
4 │ 2019-10-04 A
5 │ 2019-10-03 B
6 │ 2019-10-03 A
julia> roster = Dataset([["A", "A", "B", "A"],
[4, 1, 8, 2 ],
[Date("2019-10-04"), Date("2019-09-30"), Date("2019-10-04"), Date("2019-10-02")],
[Date("2019-10-06"), Date("2019-10-04"), Date("2019-10-06"), Date("2019-10-04")]],
["store", "employee_ID", "start_date", "end_date"])
4×4 Dataset
Row │ store employee_ID start_date end_date
│ identity identity identity identity
│ String? Int64? Date? Date?
─────┼───────────────────────────────────────────────
1 │ A 4 2019-10-04 2019-10-06
2 │ A 1 2019-09-30 2019-10-04
3 │ B 8 2019-10-04 2019-10-06
4 │ A 2 2019-10-02 2019-10-04
julia> innerjoin(store, roster, on = [:store => :store, :date => (:start_date, nothing)])
9×4 Dataset
Row │ date store employee_ID end_date
│ identity identity identity identity
│ Date? String? Int64? Date?
─────┼───────────────────────────────────────────────
1 │ 2019-10-01 A 1 2019-10-04
2 │ 2019-10-02 A 1 2019-10-04
3 │ 2019-10-02 A 2 2019-10-04
4 │ 2019-10-05 B 8 2019-10-06
5 │ 2019-10-04 A 1 2019-10-04
6 │ 2019-10-04 A 2 2019-10-04
7 │ 2019-10-04 A 4 2019-10-06
8 │ 2019-10-03 A 1 2019-10-04
9 │ 2019-10-03 A 2 2019-10-04
julia> innerjoin(store, roster, on = [:store => :store, :date => (nothing, :end_date)])
14×4 Dataset
Row │ date store employee_ID start_date
│ identity identity identity identity
│ Date? String? Int64? Date?
─────┼───────────────────────────────────────────────
1 │ 2019-10-01 A 1 2019-09-30
2 │ 2019-10-01 A 2 2019-10-02
3 │ 2019-10-01 A 4 2019-10-04
4 │ 2019-10-02 A 1 2019-09-30
5 │ 2019-10-02 A 2 2019-10-02
6 │ 2019-10-02 A 4 2019-10-04
7 │ 2019-10-05 B 8 2019-10-04
8 │ 2019-10-04 A 1 2019-09-30
9 │ 2019-10-04 A 2 2019-10-02
10 │ 2019-10-04 A 4 2019-10-04
11 │ 2019-10-03 B 8 2019-10-04
12 │ 2019-10-03 A 1 2019-09-30
13 │ 2019-10-03 A 2 2019-10-02
14 │ 2019-10-03 A 4 2019-10-04
julia> innerjoin(store, roster, on = [:store => :store, :date => (:start_date, :end_date)])
9×3 Dataset
Row │ date store employee_ID
│ identity identity identity
│ Date? String? Int64?
─────┼───────────────────────────────────
1 │ 2019-10-01 A 1
2 │ 2019-10-02 A 1
3 │ 2019-10-02 A 2
4 │ 2019-10-05 B 8
5 │ 2019-10-04 A 1
6 │ 2019-10-04 A 2
7 │ 2019-10-04 A 4
8 │ 2019-10-03 A 1
9 │ 2019-10-03 A 2
julia> dsl = Dataset(x1 = [1,2,1,3], y = [-1.2,-3,2.1,-3.5])
4×2 Dataset
Row │ x1 y
│ identity identity
│ Int64? Float64?
─────┼────────────────────
1 │ 1 -1.2
2 │ 2 -3.0
3 │ 1 2.1
4 │ 3 -3.5
julia> dsr = Dataset(x1 = [1,2,3], lower = [0, -3,1], upper = [3,0,2])
3×3 Dataset
Row │ x1 lower upper
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 0 3
2 │ 2 -3 0
3 │ 3 1 2
julia> contains(dsl, dsr, on = [1=>1, 2=>(2,3)], strict_inequality = true)
4-element Vector{Bool}:
0
0
1
0
Update a data set by values from another data set
update!
updates a data set values by using values from a transaction data set. The function uses the given keys (on = ...
) to select rows for updating. By default, the missing values in transaction data set wouldn't replace the values in the main data set, however, using allowmissing = true
changes this behaviour. If there are multiple rows in the main data set which match the key(s), using mode = :all
causes all of them to be updated, mode = :missings
causes only the ones which are missing in the main data set to be updated, and mode = fun
updates the values which calling fun
on them returns true
. If there are multiple rows in the transaction data set which match the key, only the last one (given stable = true
is passed) will be used to update the main data set.
By default, update!
updates the old values by the new values from the transaction data set, however, user may pass any function via the op
keyword argument to update the values in the main data set by the result of calling op
on values on both data sets. In this case, update!
updates values in the main data set by op(old, new)
, where old
is the value from the main data set and new
is the value from the transaction data set.
The update!
functions replace the main data set with the updated version, however, if a copy of the updated data set is required, the update
function can be used instead.
Like other join functions, one may pass method = :hash
for using hash techniques to match observations.
Examples
julia> main = Dataset(group = ["G1", "G1", "G1", "G1", "G2", "G2", "G2"],
id = [ 1 , 1 , 2 , 2 , 1 , 1 , 2 ],
x1 = [1.2, 2.3,missing, 2.3, 1.3, 2.1 , 0.0 ],
x2 = [ 5 , 4 , 4 , 2 , 1 ,missing, 2 ])
7×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼───────────────────────────────────────────
1 │ G1 1 1.2 5
2 │ G1 1 2.3 4
3 │ G1 2 missing 4
4 │ G1 2 2.3 2
5 │ G2 1 1.3 1
6 │ G2 1 2.1 missing
7 │ G2 2 0.0 2
julia> transaction = Dataset(group = ["G1", "G2"], id = [2, 1],
x1 = [2.5, missing], x2 = [missing, 3])
2×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼───────────────────────────────────────────
1 │ G1 2 2.5 missing
2 │ G2 1 missing 3
julia> update(main, transaction, on = [:group, :id],
allowmissing = false, mode = :missings)
7×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼──────────────────────────────────────────
1 │ G1 1 1.2 5
2 │ G1 1 2.3 4
3 │ G1 2 2.5 4
4 │ G1 2 2.3 2
5 │ G2 1 1.3 1
6 │ G2 1 2.1 3
7 │ G2 2 0.0 2
julia> update(main, transaction, on = [:group, :id],
allowmissing = false, mode = :all)
7×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼──────────────────────────────────────────
1 │ G1 1 1.2 5
2 │ G1 1 2.3 4
3 │ G1 2 2.5 4
4 │ G1 2 2.5 2
5 │ G2 1 1.3 3
6 │ G2 1 2.1 3
7 │ G2 2 0.0 2
julia> update(main, transaction, on = [:group, :id],
mode = isequal(2.3))
7×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼─────────────────────────────────────────
1 │ G1 1 1.2 5
2 │ G1 1 2.3 4
3 │ G1 2 missing 4
4 │ G1 2 2.5 2
5 │ G2 1 1.3 1
6 │ G2 1 2.1 missing
7 │ G2 2 0.0 2
julia> update(main, transaction, on = [:group, :id], op = +) # add values of transaction to main, when op is set mode = :all is default
7×4 Dataset
Row │ group id x1 x2
│ identity identity identity identity
│ String? Int64? Float64? Int64?
─────┼─────────────────────────────────────────
1 │ G1 1 1.2 5
2 │ G1 1 2.3 4
3 │ G1 2 missing 4
4 │ G1 2 4.8 2
5 │ G2 1 1.3 4
6 │ G2 1 2.1 missing
7 │ G2 2 0.0 2
compare
The compare
function compares two data sets. When the columns which needed to be compared are specified via the cols
keyword argument, compare
compares the corresponding values in each row by calling eq
on the actual or formatted values. By default, compare
compares two values via the isequal
function, however, users may pass any function via the eq
keyword arguments. When the number of rows of two data sets are not matched, compare
fills the output data set with missing
. Users can pass key columns to perform comparing matched pairs of observations. The key columns can be passed via the on
keyword argument. The compare
function uses outerjoin
to find the corresponding matches, this also means, the compare
function can accept the arguments of outerjoin
.
To pass the
mapformats
keyword argument toouterjoin
incompare
, use theon_mapformats
keyword argument, since themapformats
keyword argument incompare
refers to how observations should be compared; based on actual values or formatted values.
By default, the output data set contains observations id when users pass the on
keyword argument. When an observation exists in only one of the passed data sets, the observation id will be missing for the other one.
Examples
julia> old = Dataset(Insurance_Id=[1,2,3,5],Business_Id=[10,20,30,50],
Amount=[100,200,300,missing],
Account_Id=["x1","x10","x5","x5"])
4×4 Dataset
Row │ Insurance_Id Business_Id Amount Account_Id
│ identity identity identity identity
│ Int64? Int64? Int64? String?
─────┼─────────────────────────────────────────────────
1 │ 1 10 100 x1
2 │ 2 20 200 x10
3 │ 3 30 300 x5
4 │ 5 50 missing x5
julia> new = Dataset(Ins_Id=[1,3,2,4,3,2],
B_Id=[10,40,30,40,30,20],
AMT=[100,200,missing,-500,350,700],
Ac_Id=["x1","x1","x10","x10","x7","x5"])
6×4 Dataset
Row │ Ins_Id B_Id AMT Ac_Id
│ identity identity identity identity
│ Int64? Int64? Int64? String?
─────┼────────────────────────────────────────
1 │ 1 10 100 x1
2 │ 3 40 200 x1
3 │ 2 30 missing x10
4 │ 4 40 -500 x10
5 │ 3 30 350 x7
6 │ 2 20 700 x5
julia> eq_fun(x::Number, y::Number) = abs(x - y) <= 50
eq_fun (generic function with 3 methods)
julia> eq_fun(x::AbstractString, y::AbstractString) = isequal(x,y)
eq_fun (generic function with 2 methods)
julia> eq_fun(x,y) = missing
eq_fun (generic function with 3 methods)
julia> compare(old, new,
on = [1=>1,2=>2],
cols = [:Amount=>:AMT, :Account_Id=>:Ac_Id],
eq = eq_fun)
7×6 Dataset
Row │ Insurance_Id Business_Id obs_id_left obs_id_right Amount=>AMT Account_Id=>Ac_Id
│ identity identity identity identity identity identity
│ Int64? Int64? Int32? Int32? Bool? Bool?
─────┼──────────────────────────────────────────────────────────────────────────────────────
1 │ 1 10 1 1 true true
2 │ 2 20 2 6 false false
3 │ 3 30 3 5 true false
4 │ 5 50 4 missing missing missing
5 │ 2 30 missing 3 missing missing
6 │ 3 40 missing 2 missing missing
7 │ 4 40 missing 4 missing missing