Tutorial - Advanced
This section provides an advanced introduction for using the DLMReader
package to read a delimited file into Julia
. To start using the package enter the following expression in a Julia
session,
julia> using DLMReader
If you have not installed the package yet, you will be prompt to do it so.
Reading the yellow taxi file
The yellow_subset.csv
file is a small subset of taxi movements in New York city in 2010.
julia> taxi_file = joinpath(dirname(pathof(DLMReader)),
"..", "docs", "src", "assets", "yellow_subset.csv"
);
In this tutorial we use the filereader
function to read this file into Julia
. At the first attempt we directly use the filereader
function to read this data into Julia
, however, since the file can be huge we only limit our parsing to few observations (examining the structure of the input file).
julia> taxi = filereader(taxi_file, limit = 4)
┌ Info: There might be less observations in the input file at line 2 (observation 1) than the number of columns in the output dataset.
└
4×18 Dataset
Row │ vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude ⋯
│ identity identity identity identity identity identity identity identity identity identity ⋯
│ String? String? String? Int64? Float64? Float64? Float64? Int64? Int64? Float64? ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ \r missing missing missing missing missing missing missing missing missing ⋯
2 │ CMT 2010-03-15 15:29:46 2010-03-15 15:34:09 1 0.8 -74.0047 40.7343 1 0 -73.9936
3 │ CMT 2010-03-15 11:44:31 2010-03-15 11:48:48 1 0.3 -73.9584 40.7729 1 0 -73.9636
4 │ CMT 2010-03-15 11:07:30 2010-03-15 11:21:39 2 2.2 -73.9574 40.78 1 0 -73.9838
8 columns omitted
Note the first row of the output data set is strange, and the output data set is truncated, thus, we cannot see the detected type of columns. We first fix the problem with the first observation. Since the first cell is filled with "\r
" we should suspect that the end-of-line character for this file might be detected incorrectly. The filereader
package allows user to pass the end-of-line character via the linebreak
keyword argument. In general, "\n
" is the end-of-line character for most delimited file, however, in some operating systems "\r
", "\r\n
", etc might be used for this purpose. Using trial and error (there are other ways rather than trial and error) we noticed that the end-of-line for this file is "\r\n
", and there is an extra "\n
" in the first line of the file which has puzzled the automatic detection. Thus, we rerun the code and pass the linebreak
keyword argument to fix the first issue of this file,
julia> taxi = filereader(taxi_file, limit = 4, linebreak = ['\r','\n'])
4×18 Dataset
Row │ vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude ⋯
│ identity identity identity identity identity identity identity identity identity identity ⋯
│ String? String? String? Int64? Float64? Float64? Float64? Int64? Int64? Float64? ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ CMT 2010-03-15 15:29:46 2010-03-15 15:34:09 1 0.8 -74.0047 40.7343 1 0 -73.9936 ⋯
2 │ CMT 2010-03-15 11:44:31 2010-03-15 11:48:48 1 0.3 -73.9584 40.7729 1 0 -73.9636
3 │ CMT 2010-03-15 11:07:30 2010-03-15 11:21:39 2 2.2 -73.9574 40.78 1 0 -73.9838
4 │ CMT 2010-03-15 14:57:33 2010-03-15 15:10:26 1 4.1 -74.016 40.7151 1 0 -73.9904
8 columns omitted
Passing data types
In general, providing the data type of columns improve the efficiency of reading delimited file. To provide such information, we look at the detected types in taxi
and pass the right data types. To examine the detected types of a data set, we should use the content
and describe
functions from the InMemoryDatasets
package.
julia> using InMemoryDatasets
julia> content(taxi)
4×18 Dataset
Created: 2022-05-13T15:00:55.868
Modified: 2022-05-13T15:00:55.868
Info:
-----------------------------------
Columns information
┌─────┬────────────────────┬──────────┬─────────┐
│ Row │ col │ format │ eltype │
├─────┼────────────────────┼──────────┼─────────┤
│ 1 │ vendor_id │ identity │ String │
│ 2 │ pickup_datetime │ identity │ String │
│ 3 │ dropoff_datetime │ identity │ String │
│ 4 │ passenger_count │ identity │ Int64 │
│ 5 │ trip_distance │ identity │ Float64 │
│ 6 │ pickup_longitude │ identity │ Float64 │
│ 7 │ pickup_latitude │ identity │ Float64 │
│ 8 │ rate_code │ identity │ Int64 │
│ 9 │ store_and_fwd_flag │ identity │ Int64 │
│ 10 │ dropoff_longitude │ identity │ Float64 │
│ 11 │ dropoff_latitude │ identity │ Float64 │
│ 12 │ payment_type │ identity │ String │
│ 13 │ fare_amount │ identity │ Float64 │
│ 14 │ surcharge │ identity │ Int64 │
│ 15 │ mta_tax │ identity │ Float64 │
│ 16 │ tip_amount │ identity │ Int64 │
│ 17 │ tolls_amount │ identity │ Int64 │
│ 18 │ total_amount │ identity │ Float64 │
└─────┴────────────────────┴──────────┴─────────┘
Note that the type detection for some columns is incorrect, e.g. pickup_datetime
, dropoff_datetime
, surcharge
, etc. In the next step we pass the incorrectly detected columns via the types
and dtformat
keyword arguments,
julia> taxi = filereader(taxi_file, limit = 4, linebreak = ['\r','\n'],
types = Dict([14,16,17] .=> Float64),
dtformat = Dict(2:3 .=> dateformat"y-m-d H:M:S"))
4×18 Dataset
Row │ vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude ⋯
│ identity identity identity identity identity identity identity identity identity identity ⋯
│ String? DateTime? DateTime? Int64? Float64? Float64? Float64? Int64? Int64? Float64? ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ CMT 2010-03-15T15:29:46 2010-03-15T15:34:09 1 0.8 -74.0047 40.7343 1 0 -73.9936 ⋯
2 │ CMT 2010-03-15T11:44:31 2010-03-15T11:48:48 1 0.3 -73.9584 40.7729 1 0 -73.9636
3 │ CMT 2010-03-15T11:07:30 2010-03-15T11:21:39 2 2.2 -73.9574 40.78 1 0 -73.9838
4 │ CMT 2010-03-15T14:57:33 2010-03-15T15:10:26 1 4.1 -74.016 40.7151 1 0 -73.9904
8 columns omitted
Passing the date format of the DataTime columns is sufficient for parsing their values. Thus, in the previous code, we only passed the dtformat
keyword argument for the second and third column.
In the next step, we store the data types of the columns in an array and pass it to the filereader
function to process the whole file. Note that for date values in non-standard form we need to pass the dtformat
keyword argument.
julia> _tmp = content(taxi, output = true)[2];
julia> alltypes = disallowmissing(_tmp[:, :eltype]);
julia> taxi = filereader(taxi_file, linebreak = ['\r','\n'],
types = alltypes,
dtformat = Dict(2:3 .=> dateformat"y-m-d H:M:S"))
┌ Info: There might be more observations in the input file at line 11 (observation 10) than the number of columns in the output dataset.
└ CMT,2010-03-15 15:35:04,2010-03-15 16:00:07,1,5.2000000000000002,-73.993639000000002,40.720208999999997,1,,,-73.946438000000001,40.778773999999999,Cas,16.100000000000001,0,0.5,0,0,16.600000000000001.
┌ Warning: There are problems with parsing the input file at line 11 (observation 10) :
│ Column 13 : fare_amount::Float64 : Read from buffer ("Cas")
│ the values are set as missing.
│ MORE DETAILS:
│ vendor_id::String = CMT, pickup_datetime::DateTime = 2010-03-15T15:35:04, dropoff_datetime::DateTime = 2010-03-15T16:00:07, passenger_count::Int64 = 1, trip_distance::Float64 = 5.2, pickup_longitude::Float64 = -73.993639, pickup_latitude::Float64 = 40.720209, rate_code::Int64 = 1, store_and_fwd_flag::Int64 = missing, dropoff_longitude::Float64 = missing, dropoff_latitude::Float64 = -73.946438, payment_type::String = 40.778773999999999, fare_amount::Float64 = missing, surcharge::Float64 = 16.1, mta_tax::Float64 = 0.0, tip_amount::Float64 = 0.5, tolls_amount::Float64 = 0.0, total_amount::Float64 = 0.0
│ CMT,2010-03-15 15:35:04,2010-03-15 16:00:07,1,5.2000000000000002,-73.993639000000002,40.720208999999997,1,,,-73.946438000000001,40.778773999999999,Cas,16.100000000000001,0,0.5,0,0,16.600000000000001
└ @ DLMReader ...
20×18 Dataset
Row │ vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude ⋯
│ identity identity identity identity identity identity identity identity identity identity ⋯
│ String? DateTime? DateTime? Int64? Float64? Float64? Float64? Int64? Int64? Float64? ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ CMT 2010-03-15T15:29:46 2010-03-15T15:34:09 1 0.8 -74.0047 40.7343 1 0 -73.9936 ⋯
2 │ CMT 2010-03-15T11:44:31 2010-03-15T11:48:48 1 0.3 -73.9584 40.7729 1 0 -73.9636
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
19 │ CMT 2010-03-15T14:31:04 2010-03-15T14:40:03 1 0.9 -73.9977 40.7412 1 0 -73.9834
20 │ CMT 2010-03-15T12:20:25 2010-03-15T12:26:38 1 0.9 -73.9983 40.7454 1 0 -73.9954
8 columns and 16 rows omitted
Examining the warnings
The last code produces some info
and warning
messages which alert the existence of some issues in reading process. When the filereader
reads all the columns but fails to reach the end of the line, it produce some info
to alert user. The Info
message contains the information about location of issue and the raw text of the input line.
Additionally, when the filereader
cannot parse a particular values, it provides some warning
messages to help users to investigate the problem. The message contains the details and the location of the problem.
In the above example, we can see that the 11th line of the input file has an issue. An investigation reveals that the problem is due to extra ",
" in place of missing values, i.e. near "1,,,-73.946438000000001
". This causes a shift in values, thus, the filereader
function fails to parse the value of the 13th column. This problem is a data entry problem and we cannot fix it unless there is a systematic pattern for such problems. Fortunately, for this specific file the patter is fixed for all lines so we can exploit some features of the DLMReader
package to fix the issue.
To fix the aforementioned problem, we define a new informat which reads a line from the input file and modify its contents in-place and pass this to the filereader
function via line_informat
.
The logic that we are going to follow is "replacing the second ,
in ,,,
with space". Note that this may not make sense for any other files, thus, user must search for a particular pattern in each case.
The line_informat
keyword argument accepts a registered informat which is a function with one positional argument, a special type of mutable string.
Note that the
line_informat
informat is called on each line of the input file, thus, use low level programming to avoid any allocation.
julia> function LINFMT!(x)
replace!(x, ",,," => ", ,")
end
julia> register_informat(LINFMT!)
[ Info: Informat LINFMT! has been registered
julia> taxi = filereader(taxi_file, linebreak = ['\r','\n'],
types = alltypes,
dtformat = Dict(2:3 .=> dateformat"y-m-d H:M:S"),
line_informat = LINFMT!)
20×18 Dataset
Row │ vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude ⋯
│ identity identity identity identity identity identity identity identity identity identity ⋯
│ String? DateTime? DateTime? Int64? Float64? Float64? Float64? Int64? Int64? Float64? ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ CMT 2010-03-15T15:29:46 2010-03-15T15:34:09 1 0.8 -74.0047 40.7343 1 0 -73.9936 ⋯
2 │ CMT 2010-03-15T11:44:31 2010-03-15T11:48:48 1 0.3 -73.9584 40.7729 1 0 -73.9636
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
20 │ CMT 2010-03-15T12:20:25 2010-03-15T12:26:38 1 0.9 -73.9983 40.7454 1 0 -73.9954
8 columns and 17 rows omitted
Dealing with String columns
Columns with String
type cause performance issue for large data sets. If those columns contains only a few unique values then we must convert the corresponding columns to PooledArray
. However, if this is not possible we must read those columns into Julia
as fixed-length string types.
User can read the columns as fixed-length string and convert them into
PooledArray
later.
To convert the columns with String
type to PooledArray
, user should use the modify!
function from the InMemoryDatasets
package,
julia> using PooledArrays
julia> modify!(taxi, names(taxi, AbstractString) => PooledArray);
julia> describe(taxi)
18×7 Dataset
Row │ column n nmissing mean std minimum maximum
│ identity identity identity identity identity identity identity
│ String? Any Any Any Any Any Any
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ vendor_id 20 0 nothing nothing CMT CMT
2 │ pickup_datetime 20 0 nothing nothing 2010-03-15T11:07:30 2010-03-15T15:35:04
3 │ dropoff_datetime 20 0 nothing nothing 2010-03-15T11:21:39 2010-03-15T16:00:07
4 │ passenger_count 20 0 1.1 0.307794 1 2
5 │ trip_distance 20 0 2.11 2.15917 0.1 9.2
6 │ pickup_longitude 20 0 -73.9863 0.0186699 -74.016 -73.9574
7 │ pickup_latitude 20 0 40.7486 0.022824 40.7151 40.7972
8 │ rate_code 20 0 1.0 0.0 1 1
9 │ store_and_fwd_flag 19 1 0.0 0.0 0 0
10 │ dropoff_longitude 20 0 -73.9782 0.0154007 -74.0016 -73.9464
11 │ dropoff_latitude 20 0 40.7603 0.0178604 40.7196 40.7838
12 │ payment_type 20 0 nothing nothing Cas Cre
13 │ fare_amount 20 0 8.44 5.62741 2.5 27.7
14 │ surcharge 20 0 0.0 0.0 0.0 0.0
15 │ mta_tax 20 0 0.5 0.0 0.5 0.5
16 │ tip_amount 20 0 0.368 0.830127 0.0 3.0
17 │ tolls_amount 20 0 0.0 0.0 0.0 0.0
18 │ total_amount 20 0 9.308 5.74365 3.0 28.2