Tutorial - Basics
This section provides a tutorial for using the DLMReader package to read a delimited file into Julia. To start using the package, enter the following expression in a Julia session:
julia> using DLMReader
If you have not installed the package yet, you will be prompted to install it.
Reading a csv file
The tutorial_1.csv file is available as part of the DLMReader package, and the following expression creates a reference to its location:
julia> fname = joinpath(dirname(pathof(DLMReader)), "..", "docs", "src", "assets", "tutorial_1.csv");
The filereader function is an exported function for reading delimited files into Julia. By default, filereader uses a comma to separate values in a file; however, the delimiter of tutorial_1.csv is ";", so we pass this to filereader via the delimiter keyword argument.
julia> tutorial = filereader(fname, delimiter = ';')
3×4 Dataset
 Row │ x1        x2        x3|value  date
     │ identity  identity  identity  identity
     │ Int64?    String?   String?   String?
─────┼──────────────────────────────────────────
   1 │        1  2,30      ab|2      2022-01-02
   2 │       10  -2,10     cd|3      2022-19-20
   3 │        4  1,34      dd|       2022-02-13
Note that you should pass the file delimiter as a Char to the delimiter keyword argument.
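The Char requirement can be checked in plain Julia, without any DLMReader calls: single quotes create a Char, while double quotes create a String, so ';' and ";" are values of different types. A minimal sketch (the variable names are only for illustration):

```julia
# Single quotes create a Char; double quotes create a String.
char_delim = ';'
string_delim = ";"

println(typeof(char_delim))    # Char
println(typeof(string_delim))  # String

# A one-character String can be turned into a Char with `only` (Julia 1.4 or later).
converted = only(string_delim)
println(converted == char_delim)  # true
```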
By default, filereader assumes that the first line of the file contains the column names, and it uses them to create the column names of the output data set, as shown in this example.
The output data set is a Dataset, which is a special type for working with tabular data in Julia; see the InMemoryDatasets.jl package for more information about working with tabular data.
From the output data set, we observe that the third column is actually a mix of two columns separated by |. This means that our delimited file uses alternative delimiters for separating values, so we should provide this information to filereader to help it parse the input file correctly. To use alternative delimiters, we must pass a vector of delimiters to the delimiter keyword argument:
julia> tutorial = filereader(fname, delimiter = [';','|'])
3×5 Dataset
 Row │ x1        x2        x3        value     date
     │ identity  identity  identity  identity  identity
     │ Int64?    String?   String?   Int64?    String?
─────┼────────────────────────────────────────────────────
   1 │        1  2,30      ab               2  2022-01-02
   2 │       10  -2,10     cd               3  2022-19-20
   3 │        4  1,34      dd         missing  2022-02-13
Note that by passing [';', '|'] as the delimiter, filereader correctly reads the :x3 and :value columns into Julia.
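The effect of alternative delimiters can be illustrated with Julia's built-in split function. This is only a sketch of the idea and says nothing about how DLMReader is implemented; the line below is reconstructed from row 2 of the output above.

```julia
# Plain-Julia illustration; this is not DLMReader's parsing code.
line = "10;-2,10;cd|3;2022-19-20"   # row 2 of the tutorial file, reconstructed from the output

# With a single delimiter the third field keeps the embedded '|'.
fields_single = split(line, ';')

# With a collection of delimiters the line is split at either character.
fields_multi = split(line, [';', '|'])

println(fields_single)
println(fields_multi)
println(length(fields_single), " vs ", length(fields_multi))  # 4 vs 5
```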
Dealing with Date and Time
The last column of the tutorial_1.csv file contains dates; however, it is read as String. This is the default behaviour of filereader. Users may force the type of a specific column by passing a vector or a Dict as the types keyword argument. To specify the type of every column, pass a vector of types whose length equals the number of columns; e.g. in the above example, [Int, String, String, Int, Date] would be fine. However, since we only want to correct the type of the fifth column, we can pass it in a Dict to the types keyword argument:
julia> tutorial = filereader(fname, delimiter = [';','|'], types = Dict(5 => Date))
┌ Warning: There are problems with parsing file at line 3 (observation 2) :
│ Column 5 : date : Read from buffer ("2022-19-20")
│ the values are set as missing.
│ MORE DETAILS:
│ x1::Int64 = 10, x2::String = -2,10, x3::String = cd, value::Int64 = 3, date::Date = missing
│ 10;-2,10;cd|3;2022-19-20
└ @ DLMReader ...
3×5 Dataset
 Row │ x1        x2        x3        value     date
     │ identity  identity  identity  identity  identity
     │ Int64?    String?   String?   Int64?    Date?
─────┼────────────────────────────────────────────────────
   1 │        1  2,30      ab               2  2022-01-02
   2 │       10  -2,10     cd               3  missing
   3 │        4  1,34      dd         missing  2022-02-13
By default, filereader assumes that columns of type Date are in the standard format, i.e. yyyy-mm-dd, and it tries to parse each value using this format. If a value cannot be parsed, filereader sets it to missing and shows a warning message that indicates which value failed to parse and provides some information about it. If the value should be treated as missing, users may ignore the warning message. For instance, in the above example, the issue is that 19 is not a valid month, so it must be a data entry error.
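The failing value can be reproduced with the Dates standard library alone. The sketch below uses tryparse, which returns nothing when a string is not a valid date; it only illustrates why "2022-19-20" is rejected, and DLMReader's own parser may work differently.

```julia
using Dates

# Plain-Julia illustration with the Dates standard library.
fmt = dateformat"yyyy-mm-dd"

good = tryparse(Date, "2022-01-02", fmt)  # a valid date
bad  = tryparse(Date, "2022-19-20", fmt)  # month 19 does not exist

println(good)             # 2022-01-02
println(bad === nothing)  # true
```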
If date(time) values are not in the yyyy-mm-dd format, users can use the dtformat keyword argument to provide the right date format for a specific column. The value must be a dictionary that maps a column to its date format:
julia> tutorial = filereader(fname, delimiter = [';','|'],
types = Dict(5 => Date),
dtformat = Dict(5 => dateformat"y-m-d"));
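To see what a non-default date format looks like, the sketch below parses a day/month/year string with the Dates standard library. The value "13/02/2022" is hypothetical (tutorial_1.csv does not store dates this way), and the dictionary at the end only shows the shape of a value that could be passed as dtformat for a file whose fifth column used this layout.

```julia
using Dates

# Hypothetical value: a day/month/year date, which is NOT how tutorial_1.csv stores dates.
raw = "13/02/2022"
fmt = dateformat"d/m/y"

parsed = Date(raw, fmt)
println(parsed)  # 2022-02-13

# Shape of a dtformat value for a file whose fifth column used this layout (assumption).
column_formats = Dict(5 => fmt)
println(keys(column_formats))
```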
Using "informats"
The DLMReader package provides a special functionality, called informat, that allows modification of the raw text before the parsing phase; i.e. an informat is a special function that is called on the raw text of a value before the text is sent to the parser. The DLMReader package ships with some predefined informats; however, power users can define their own informats for special purposes.
We are going to use one of the predefined informats to parse the second column of the tutorial_1.csv file. This column uses "," as the decimal point in numbers, which is common practice in some European countries. To parse this column correctly, we can apply the COMMAX! informat before parsing its values. The COMMAX! informat converts "," to a decimal point and removes "." (thousands separator) and "€" (U+20AC) from the numbers.
julia> tutorial = filereader(fname, delimiter = [';','|'],
types = Dict(5 => Date),
informat = Dict(2 => COMMAX!))
┌ Warning: There are problems with parsing file at line 3 (observation 2) :
│ Column 5 : date : Read from buffer ("2022-19-20")
│ the values are set as missing.
│ MORE DETAILS:
│ x1::Int64 = 10, x2::Float64 = -2.1, x3::String = cd, value::Int64 = 3, date::Date = missing
│ 10;-2.10;cd|3;2022-19-20
└ @ DLMReader ...
3×5 Dataset
 Row │ x1        x2        x3        value     date
     │ identity  identity  identity  identity  identity
     │ Int64?    Float64?  String?   Int64?    Date?
─────┼────────────────────────────────────────────────────
   1 │        1       2.3  ab               2  2022-01-02
   2 │       10      -2.1  cd               3  missing
   3 │        4      1.34  dd         missing  2022-02-13
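The text transformation described for COMMAX! can be mimicked in plain Julia. In the sketch below, european_to_float is a hypothetical helper written only for illustration; it is not part of DLMReader and does not reflect how the informat is implemented.

```julia
# Plain-Julia sketch of the described effect; `european_to_float` is a hypothetical
# helper for illustration and is not part of DLMReader.
function european_to_float(text::AbstractString)
    cleaned = replace(text, "€" => "")       # drop the euro sign
    cleaned = replace(cleaned, "." => "")    # drop thousands separators
    cleaned = replace(cleaned, "," => ".")   # decimal comma -> decimal point
    return parse(Float64, strip(cleaned))
end

println(european_to_float("2,30"))        # 2.3
println(european_to_float("-2,10"))       # -2.1
println(european_to_float("1.234,50€"))   # 1234.5
```

The order of the replacements matters: the thousands separators have to be removed before the decimal comma is turned into a point.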
Writing a data set to disk
The DLMReader package provides the filewriter function for writing a data set to disk as a flat file. The function uses a comma as the default delimiter; however, the user can pass any other delimiter via the delimiter keyword argument.
julia> filewriter("t_file.csv", tutorial)
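A non-default delimiter might be used as in the following sketch. It relies only on the calls shown in this tutorial; passing the delimiter as a Char mirrors the filereader usage above and is an assumption here, and the output path in a temporary directory is chosen purely for illustration.

```julia
using DLMReader

# Same tutorial file as earlier in this section.
fname = joinpath(dirname(pathof(DLMReader)), "..", "docs", "src", "assets", "tutorial_1.csv")
tutorial = filereader(fname, delimiter = [';', '|'])

# Write to a temporary directory with ';' as the delimiter instead of the default comma.
# Assumption: as for filereader, the delimiter is passed as a Char.
out = joinpath(mktempdir(), "t_file_semicolon.csv")
filewriter(out, tutorial, delimiter = ';')

# Inspect the raw text that was written.
foreach(println, readlines(out))
```

In this data set the x2 column still holds strings such as 2,30, so a delimiter other than a comma keeps those values unambiguous in the written file.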