Introduction

The DLMReader package provides the filereader and filewriter functions for reading and writing delimited files, respectively. Each function accepts a number of keyword arguments, which are explained in this section.

`filereader`

The user must pass the file name as the first argument of filereader to read a delimited file into Julia, i.e. filereader(path; ...). The filereader function assumes that the values are separated by commas and that the first line of the input file contains the column names; additionally, it assumes that strings are not quoted. It scans the first 20 lines of the input file to detect Int64 and Float64 columns, and uses String as the default type when detection fails. Thus, for a well-formatted csv file, the user does not need any keyword argument. However, the filereader function provides some keyword arguments to give the user extra flexibility for reading complex delimited files.

filereader treats empty strings and "." as missing.
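
As a minimal sketch of the defaults described above, the snippet below writes a small comma-separated file and reads it back without any keyword argument. The temporary path and the sample values are illustrative assumptions, not part of the package; the empty field and the "." are there to show the missing-value rule.

```julia
using InMemoryDatasets, DLMReader

# Hypothetical temporary file used only for this illustration.
path = tempname() * ".csv"
write(path, "x1,x2\n1,.\n,2\n3,4\n")

# No keyword arguments: comma delimiter, header on the first line, unquoted text.
ds = filereader(path)

# Inspect which values of the second column were read as missing.
ismissing.(ds[:, :x2])
```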

Keyword arguments

types
delimiter
dlmstr
ignorerepeated
header
linebreak
guessingrows
fixed
quotechar
escapechar
dtformat
int_base
informat
skipto
limit
multiple_obs
line_informat
buffsize
lsize
string_trim
makeunique
emptycolname
warn
eolwarn
threads
threshold

`types`

The user can specify the type of each column of the input file by using the types keyword argument. The user may pass either a vector of types with one entry per column, or a dictionary of types for a few selected columns.

Default: auto detection

julia> ds = filereader(IOBuffer("""x1,x2
       12,13
       1,2
       """), types = [Int, Float64])
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Float64? 
─────┼────────────────────
   1 │       12      13.0
   2 │        1       2.0
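
The dictionary form mentioned above can be sketched as follows. The assumption here is that the dictionary keys are column positions, as in the `types = Dict(2 => Date)` call used elsewhere on this page; columns not listed are left to auto detection.

```julia
using InMemoryDatasets, DLMReader

# Only the second column is forced to Float64; the first one is still auto-detected.
ds = filereader(IOBuffer("""x1,x2
12,13
1,2
"""), types = Dict(2 => Float64))

ds[:, :x2]
```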

`delimiter`

To change the default delimiter, the user must pass the delimiter keyword argument. The delimiter keyword argument only accepts a Char as the delimiter. Additionally, the user can pass a vector of Char, in which case filereader treats each of them as an alternative delimiter.

Default: comma

julia> ds = filereader(IOBuffer("""x1;x2
       12;13
       1;2
       """), delimiter = ';')
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │       12        13
   2 │        1         2
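
For the vector form, a small sketch with two alternative delimiters may help. The input below is made up for illustration and mixes ';' and '|' between values.

```julia
using InMemoryDatasets, DLMReader

# Either ';' or '|' is accepted as the separator between values.
ds = filereader(IOBuffer("""x1;x2
12|13
1;2
"""), delimiter = [';', '|'])

ds[:, :x2]
```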

`dlmstr`

This keyword argument is used to pass a string as the delimiter for values.

Default: nothing

julia> ds = filereader(IOBuffer("""x1|:|x2
       12|:|13
       1|:|2
       """), dlmstr = "|:|" )
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │       12        13
   2 │        1         2

`ignorerepeated`

If it is set as true, repeated delimiters will be ignored.

Default: false

julia> ds = filereader(IOBuffer("""x1,,x2
       12,13
       1,,,,2
       """), ignorerepeated = true)
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │       12        13
   2 │        1         2

`header`

The user must set this to false if the first line of the input file is not the column header. Additionally, the user can pass a vector of column names, which will be used as the column header.

Default: true

julia> ds = filereader(IOBuffer("""
       12,13
       1,2
       """), header = [:Col1, :Col2])
2×2 Dataset
 Row │ Col1      Col2     
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │       12        13
   2 │        1         2

`linebreak`

The filereader function uses the value of this option as the line separator. It accepts a Char or a vector of Char whose length is at most two. In some rare cases the user may need to pass this option to assist filereader in reading the input file.

Default: auto detection

julia> ds = filereader(IOBuffer("""
       x1,x2;12,13;1,2;"""), linebreak = ';')
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │       12        13
   2 │        1         2

`guessingrows`

This sets the number of lines used for type detection. The filereader function detects the column types more accurately when the user increases this value; however, it costs more computation time.

Default: 20
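
The following sketch illustrates when raising guessingrows might matter: a column that looks like integers for the first 30 data lines and only then contains a decimal value. The data are generated for illustration, and the exact detected type is not asserted here; inspect the result to see what filereader chose.

```julia
using InMemoryDatasets, DLMReader

# A header, 30 integer-looking rows, then one decimal value beyond the default 20-line scan.
lines = vcat("x1", string.(1:30), "1.5")
io = IOBuffer(join(lines, "\n") * "\n")

# Scan 40 lines so that the decimal value on line 32 is included in type detection.
ds = filereader(io, guessingrows = 40)

eltype(ds[:, :x1])
```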

`fixed`

This option is used for reading fixed-width files. The user must pass a dictionary that maps each column to its location (as a range of character positions) in the line.

Default: nothing

julia> ds = filereader(IOBuffer("""
       12
       34
       """), fixed = Dict(1=>1:1, 2=>2:2), header = false)
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        1         2
   2 │        3         4

`quotechar`

If text values are quoted in the input file, the user must pass the quote character via this keyword argument.

Default: nothing (the filereader assumes the texts are not quoted)

julia> ds = filereader(IOBuffer("""x1,x2
       "12",13
       "1",2
       """), quotechar = '"')
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │       12        13
   2 │        1         2

`escapechar`

Declares the escape character for quoted text.

Default: nothing (filereader assumes the text is not quoted)

`dtformat`

The user must pass the date format of DateTime columns if it differs from the standard format. The dtformat keyword argument accepts a dictionary that maps columns to date formats.

Default: nothing

julia> ds = filereader(IOBuffer("""date1,date2
       2020-1-1,2020/1/1
       2020-2-2,2020/2/2
       """), dtformat = Dict(1 => dateformat"y-m-d", 2 => dateformat"y/m/d"))
2×2 Dataset
 Row │ date1       date2      
     │ identity    identity   
     │ Date?       Date?      
─────┼────────────────────────
   1 │ 2020-01-01  2020-01-01
   2 │ 2020-02-02  2020-02-02

`int_base`

The filereader function can read integers written in a given base. The user can pass this information for specific columns.

Default: nothing

julia> ds = filereader(IOBuffer("""x1,x2
       100,100
       101,101
       """), int_base = Dict(1 => 2))
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        4       100
   2 │        5       101

`informat`

The user can pass a dictionary that specifies the informat to apply to selected columns.

Default: nothing

julia> ds = filereader(IOBuffer("""x1,x2
       NA,12
       1,NA
       """), informat = Dict(1:2 .=> NA!))
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │  missing        12
   2 │        1   missing

`skipto`

It can be used to start reading a file from a specific line.

Default: 1

julia> ds = filereader(IOBuffer("""COL1, COL2
       1,2
       2,3
       3,4
       """), skipto = 3, header = false)
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        2         3
   2 │        3         4

`limit`

It can be used to limit the number of observations read from the input file.

Default: Inf

julia> ds = filereader(IOBuffer("""COL1, COL2
       1,2
       2,3
       3,4
       """), limit = 1)
1×2 Dataset
 Row │ COL1      COL2     
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        1         2

`multiple_obs`

If it is set as true, the filereader function assumes there may be more than one observation in each line of the input file.

Default: false

julia> ds = filereader(IOBuffer("""1,2,3,4,5
       6,7
       """), multiple_obs = true, header = [:x1, :x2], types = [Int, Int])
4×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        1         2
   2 │        3         4
   3 │        5         6
   4 │        7   missing

`line_informat`

User can provide line informat via this keyword argument.

Default: nothing

`buffsize`

User can provide any positive number for the buffer size. Each thread allocates the amount of buffsize and reads the values from the input file into it.

Default: 2^16

`lsize`

It sets the line buffer size for reading the input file. For very wide tables the user may need to adjust this option manually. Its value must be less than buffsize.

Default: 2^15
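
A short sketch of setting both buffer options together is shown below. The sizes are arbitrary illustrative values, chosen only so that lsize stays below buffsize as required above; the input is a tiny in-memory table rather than a genuinely wide file.

```julia
using InMemoryDatasets, DLMReader

# Larger buffers for wide inputs; lsize must remain smaller than buffsize.
ds = filereader(IOBuffer("""x1,x2
12,13
1,2
"""), buffsize = 2^20, lsize = 2^18, threads = false)

ds[:, :x1]
```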

`string_trim`

Setting this to true trims the trailing blanks of strings before storing them in the output data set.

DLMReader ships with the STRIP! informat, which can be used to strip (remove leading and trailing blanks from) any raw text before parsing.

Default: false

julia> ds = filereader(IOBuffer("""x1,x2
       "    fdh  ",df
       "dkhfd    ",dfadf
       """), quotechar = '"', string_trim = true)
2×2 Dataset
 Row │ x1        x2       
     │ identity  identity 
     │ String?   String?  
─────┼────────────────────
   1 │     fdh   df
   2 │ dkhfd     dfadf

julia> ds[:, :x1]
2-element Vector{Union{Missing, String}}:
 "    fdh"
 "dkhfd"

julia> ds = filereader(IOBuffer("""x1,x2,x3
       1,   2020-2-2   , " ff  "
       2,2020-1-1,"343"
       """), types = Dict(2 => Date), quotechar = '"', informat = Dict(2:3 .=> STRIP!))
2×3 Dataset
 Row │ x1        x2          x3       
     │ identity  identity    identity 
     │ Int64?    Date?       String?  
─────┼────────────────────────────────
   1 │        1  2020-02-02  ff
   2 │        2  2020-01-01  343

julia> ds[:, :x3]
2-element Vector{Union{Missing, String}}:
 "ff"
 "343"

`makeunique`

If the column names are not unique, setting this to true resolves the duplicates by adding a suffix to the names.

Default: false

julia> ds = filereader(IOBuffer("""x,x
       1,2
       """), makeunique = true)
1×2 Dataset
 Row │ x         x_1      
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        1         2

`emptycolname`

If it is set to true, it generates a column name for columns with empty name.

Default: false

julia> ds = filereader(IOBuffer("""x,
       1,2
       """), emptycolname = true)
1×2 Dataset
 Row │ x         NONAME1  
     │ identity  identity 
     │ Int64?    Int64?   
─────┼────────────────────
   1 │        1         2

`warn`

Controls the maximum number of warnings and information messages. Setting it to 0 suppresses all warnings and information messages while reading the input file.

Default: 20

`eolwarn`

Controls whether the end-of-line character warning is shown.

Default: true

`threads`

For large files, the filereader function exploits all threads. However, this can be switched off by setting this argument to false.

Default: true

`threshold`

The file size threshold (in bytes) which specifies the minimum file size for switching to the high performance algorithm.

Default: 2^26

`filewriter`

The filewriter function writes a data set to disk. Behind the scenes, it uses the byrow function from InMemoryDatasets.jl to efficiently convert each row of the input data set into UInt8. The first argument of filewriter must be a file name and the second argument must be the data set to write.

Keyword arguments

delimiter
quotechar
mapformats
append
header
buffsize
lsize
threads
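
As a rough sketch of the argument order described above, the snippet below writes a small data set with the default settings and reads it back with filereader. The data set, its column names, and the temporary path are illustrative assumptions; none of the listed keyword arguments is used.

```julia
using InMemoryDatasets, DLMReader

# A small data set built only for this illustration.
ds = Dataset(x1 = [1, 2], x2 = [3.5, 4.5])

# Hypothetical output location.
path = tempname() * ".csv"

# File name first, data set second.
filewriter(path, ds)

# Read the file back to check the round trip.
ds2 = filereader(path)
ds2[:, :x1]
```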
