Type: | Package |
Title: | Read Hierarchical Fixed Width Files |
Version: | 0.2.5 |
Contact: | ipums@umn.edu |
Description: | Read hierarchical fixed width files like those commonly used by many census data providers. Also allows for reading of data in chunks, and reading 'gzipped' files without storing the full file in memory. |
License: | GPL-2 | GPL-3 | file LICENSE [expanded from: GPL (≥ 2) | file LICENSE] |
Encoding: | UTF-8 |
Depends: | R (≥ 3.0.2) |
LinkingTo: | Rcpp (≥ 1.0.12), BH |
Imports: | Rcpp (≥ 1.0.12), R6, rlang, tibble |
Suggests: | dplyr, readr, testthat |
RoxygenNote: | 7.3.2 |
URL: | https://github.com/ipums/hipread |
BugReports: | https://github.com/ipums/hipread/issues |
NeedsCompilation: | yes |
Packaged: | 2025-06-19 11:35:07 UTC; Derek |
Author: | Greg Freedman Ellis [aut], Derek Burk [aut, cre], Joe Grover [ctb], Mark Padgham [ctb], Hadley Wickham [ctb] (Code adapted from readr), Jim Hester [ctb] (Code adapted from readr), Romain Francois [ctb] (Code adapted from readr), R Core Team [ctb] (Code adapted from readr), RStudio [cph, fnd] (Code adapted from readr), Jukka Jylänki [ctb, cph] (Code adapted from readr), Mikkel Jørgensen [ctb, cph] (Code adapted from readr), University of Minnesota [cph] |
Maintainer: | Derek Burk <ipums+cran@umn.edu> |
Repository: | CRAN |
Date/Publication: | 2025-06-19 12:20:02 UTC |
hipread: Read Hierarchical Fixed Width Files
Description
Read hierarchical fixed width files like those commonly used by many census data providers. Also allows for reading of data in chunks, and reading 'gzipped' files without storing the full file in memory.
Author(s)
Maintainer: Derek Burk <ipums+cran@umn.edu>
Authors:
Greg Freedman Ellis
Other contributors:
Joe Grover [contributor]
Mark Padgham [contributor]
Hadley Wickham <hadley@rstudio.com> (Code adapted from readr) [contributor]
Jim Hester <james.hester@rstudio.com> (Code adapted from readr) [contributor]
Romain Francois (Code adapted from readr) [contributor]
R Core Team (Code adapted from readr) [contributor]
RStudio (Code adapted from readr) [copyright holder, funder]
Jukka Jylänki (Code adapted from readr) [contributor, copyright holder]
Mikkel Jørgensen (Code adapted from readr) [contributor, copyright holder]
University of Minnesota [copyright holder]
See Also
Useful links:
https://github.com/ipums/hipread
Report bugs at https://github.com/ipums/hipread/issues
Callback classes
Description
These classes are used to define callback behaviors, and are based on readr's readr::callback functions.
Details
The callbacks HipChunkCallback, HipListCallback and HipSideEffectChunkCallback should be identical to their readr counterparts, but have been copied into hipread to ensure that they work even if readr changes.
The callback HipDataFrameCallback is similar to readr::DataFrameCallback() except that it uses dplyr::bind_rows() instead of rbind() so that it is faster.
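As a rough sketch of how one of these callbacks might be used (this example is not part of the original documentation), HipListCallback can collect one small summary per chunk. It assumes that the callback function receives the chunk and its position as two arguments, as in the package's own chunked example, and that the chunked reader returns the callback's collected results.

```r
library(hipread)

# Column specifications for the bundled example file, copied from the
# package's own examples
var_info <- list(
  H = hip_fwf_positions(
    c(1, 2, 5, 8),
    c(1, 4, 7, 10),
    c("rt", "hhnum", "hh_char", "hh_dbl"),
    c("c", "i", "c", "d")
  ),
  P = hip_fwf_widths(
    c(1, 3, 1, 3, 1),
    c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
    c("c", "i", "i", "d", "c")
  )
)

# Keep only the row count of each chunk instead of the chunk itself
chunk_rows <- hipread_long_chunked(
  hipread_example("test-basic.dat"),
  HipListCallback$new(function(x, pos) nrow(x)),
  4,
  var_info,
  hip_rt(1, 1),
  progress = FALSE
)

# chunk_rows is a list with one element per chunk
total_rows <- sum(unlist(chunk_rows))
```

Returning a small summary per chunk keeps memory use low, because no chunk has to be retained after its summary is computed.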
Methods
Public methods
Method new()
Usage
ChunkCallback$new(callback)
Method receive()
Usage
ChunkCallback$receive(data, index)
Method continue()
Usage
ChunkCallback$continue()
Method result()
Usage
ChunkCallback$result()
Method finally()
Usage
ChunkCallback$finally()
Method clone()
The objects of this class are cloneable with this method.
Usage
ChunkCallback$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Super class
hipread::ChunkCallback -> HipChunkCallback
Methods
Public methods
Inherited methods
Method clone()
The objects of this class are cloneable with this method.
Usage
HipChunkCallback$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Super classes
hipread::ChunkCallback -> hipread::HipChunkCallback -> HipSideEffectChunkCallback
Methods
Public methods
Inherited methods
Method new()
Usage
HipSideEffectChunkCallback$new(callback)
Method receive()
Usage
HipSideEffectChunkCallback$receive(data, index)
Method continue()
Usage
HipSideEffectChunkCallback$continue()
Method clone()
The objects of this class are cloneable with this method.
Usage
HipSideEffectChunkCallback$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Super classes
hipread::ChunkCallback -> hipread::HipChunkCallback -> HipListCallback
Methods
Public methods
Inherited methods
Method new()
Usage
HipListCallback$new(callback)
Method receive()
Usage
HipListCallback$receive(data, index)
Method result()
Usage
HipListCallback$result()
Method finally()
Usage
HipListCallback$finally()
Method clone()
The objects of this class are cloneable with this method.
Usage
HipListCallback$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Super classes
hipread::ChunkCallback -> hipread::HipChunkCallback -> HipDataFrameCallback
Methods
Public methods
Inherited methods
Method new()
Usage
HipDataFrameCallback$new(callback)
Method receive()
Usage
HipDataFrameCallback$receive(data, index)
Method result()
Usage
HipDataFrameCallback$result()
Method finally()
Usage
HipDataFrameCallback$finally()
Method clone()
The objects of this class are cloneable with this method.
Usage
HipDataFrameCallback$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Specify column-specific options for hipread
Description
Specify column specifications analogous to readr::fwf_positions(). However, unlike in readr, the column type information is specified alongside the column positions and there are two extra options that can be specified (trim_ws gives control over trimming whitespace in character columns, and imp_dec allows for implicit decimals in double columns).
Usage
hip_fwf_positions(
  start,
  end,
  col_names,
  col_types,
  trim_ws = TRUE,
  imp_dec = 0
)

hip_fwf_widths(widths, col_names, col_types, trim_ws = TRUE, imp_dec = 0)
Arguments
start, end |
A vector of integers describing the start and end positions of each field |
col_names |
A character vector of variable names |
col_types |
A vector of column types (specified as either "c" or "character" for character, "d" or "double" for double and "i" or "integer" for integer). |
trim_ws |
A logical vector, indicating whether to trim whitespace on both sides of character columns (Defaults to TRUE). |
imp_dec |
An integer vector, indicating the number of implicit decimals on a double variable (Defaults to 0, ignored on non-double columns). |
widths |
A vector of integer widths for each field (assumes that columns are consecutive - that there is no overlap or gap between fields) |
Value
A data.frame containing the column specifications
Examples
# 3 Columns, specified by position
hip_fwf_positions(
  c(1, 3, 7),
  c(2, 6, 10),
  c("Var1", "Var2", "Var3"),
  c("c", "i", "d")
)

# The same 3 columns, specified by width
hip_fwf_widths(
  c(2, 4, 4),
  c("Var1", "Var2", "Var3"),
  c("c", "i", "d")
)
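The imp_dec option is easier to see with data. The following is a minimal sketch, not taken from the original documentation: it writes a hypothetical two-line file to a temporary path and reads it with two implicit decimals on the double column. It assumes that imp_dec accepts one value per column (the argument is described as an integer vector) and that implicit decimals divide the stored digits by a power of ten, so that the digits 1234 would be read as 12.34.

```r
library(hipread)

# Hypothetical fixed width file: 2 characters, 4 digits, 4 digits per line
tmp <- tempfile(fileext = ".dat")
writeLines(c("ab01231234", "cd00450678"), tmp)

spec <- hip_fwf_widths(
  c(2, 4, 4),
  c("Var1", "Var2", "Var3"),
  c("c", "i", "d"),
  imp_dec = c(0, 0, 2)  # only Var3 is a double, so only its value matters
)

# No rt_info is given, so the file is treated as rectangular
out <- hipread_long(tmp, spec, progress = FALSE)

# Var3 holds the digits "1234" and "0678" read with two implicit decimals
out$Var3

unlink(tmp)
```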
Create a record type information object
Description
Create a record type information object for hipread to use when reading hierarchical files. A width of 0 indicates that the file is rectangular (e.g., a standard fixed width file).
Usage
hip_rt(start, width, warn_on_missing = TRUE)
Arguments
start |
Start position of the record type variable |
width |
The width of the record type variable |
warn_on_missing |
Whether to warn when encountering a record type that is not specified |
Value
A list, really only intended to be used internally by hipread
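A short sketch of the two cases described above, using only the arguments shown in Usage. The object names rt_hier and rt_rect are illustrative.

```r
library(hipread)

# Hierarchical file: the record type is stored in the first character
# of each line
rt_hier <- hip_rt(start = 1, width = 1)

# Width 0: no record type variable, so the file is treated as rectangular.
# hip_rt(1, 0) is also the default rt_info of the reading functions.
rt_rect <- hip_rt(start = 1, width = 0)
```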
Get path to hipread's example datasets
Description
Get access to example extracts.
Usage
hipread_example(path = NULL)
Arguments
path |
Name of file. If NULL, all available example files are listed. |
Value
The filepath to an example file, or if path is empty, a vector of all available files.
Examples
hipread_example() # Lists all available examples
hipread_example("test-basic.dat") # Gives filepath for a basic example
Calculate frequencies from fixed width file without loading into memory
Description
Calculate the frequency of values in all variables in a fixed width file. Does so without holding the whole data in memory or creating a full R data.frame and calling R code on interim pieces. (Probably only useful inside IPUMS HQ).
Usage
hipread_freqs(
  file,
  var_info,
  rt_info = hip_rt(1, 0),
  compression = NULL,
  progress = show_progress()
)
Arguments
file |
A filename |
var_info |
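Variable information, specified by either hip_fwf_positions() or hip_fwf_widths(). For hierarchical files, a named list with one specification per record type value. |
rt_info |
A record type information object, created by hip_rt() |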
compression |
If |
progress |
A logical indicating whether progress should be displayed on the screen, defaults to showing progress unless the current context is non-interactive or in a knitr document or if the user has turned off readr's progress by default using the option |
Value
A list of frequencies
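This page has no example, so the following is a tentative sketch of a call on the bundled example file. It reuses the column specifications from the package's other examples and assumes that hipread_freqs() accepts the same var_info and rt_info objects as the reading functions. The structure of the returned list is not documented here, so the sketch only inspects it with str().

```r
library(hipread)

freqs <- hipread_freqs(
  hipread_example("test-basic.dat"),
  list(
    H = hip_fwf_positions(
      c(1, 2, 5, 8),
      c(1, 4, 7, 10),
      c("rt", "hhnum", "hh_char", "hh_dbl"),
      c("c", "i", "c", "d")
    ),
    P = hip_fwf_widths(
      c(1, 3, 1, 3, 1),
      c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
      c("c", "i", "i", "d", "c")
    )
  ),
  hip_rt(1, 1),
  progress = FALSE
)

# Inspect the result without assuming a particular layout
str(freqs)
```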
Read a hierarchical fixed width data file
Description
Analogous to readr::read_fwf() but allowing for hierarchical fixed width data files (where the data file has rows of different record types, each with their own variables and column specifications). hipread_long() reads hierarchical data into "long" format, meaning that there is one row per observation, and variables that don't apply to the current observation receive missing values. Alternatively, hipread_list() reads hierarchical data into "list" format, which returns a list that has one data.frame per record type.
Usage
hipread_long(
  file,
  var_info,
  rt_info = hip_rt(1, 0),
  compression = NULL,
  skip = 0,
  n_max = -1,
  encoding = "UTF-8",
  progress = show_progress()
)

hipread_list(
  file,
  var_info,
  rt_info = hip_rt(1, 0),
  compression = NULL,
  skip = 0,
  n_max = -1,
  encoding = "UTF-8",
  progress = show_progress()
)
Arguments
file |
A filename |
var_info |
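Variable information, specified by either hip_fwf_positions() or hip_fwf_widths(). For hierarchical files, a named list with one specification per record type value. |
rt_info |
A record type information object, created by hip_rt() |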
compression |
If |
skip |
Number of lines to skip at the start of the data (defaults to 0). |
n_max |
Maximum number of lines to read. A negative number (the default) reads all lines. |
encoding |
(Defaults to UTF-8) A string indicating what encoding to use when reading the data, but like readr, the data will always be converted to UTF-8 once it is imported. Note that UTF-16 and UTF-32 are not supported for non-character columns. |
progress |
A logical indicating whether progress should be displayed on the screen, defaults to showing progress unless the current context is non-interactive or in a knitr document or if the user has turned off readr's progress by default using the option |
Value
A tbl_df data frame for hipread_long(), or a list of tbl_df data frames (one per record type) for hipread_list()
Examples
# Read an example hierarchical data.frame into long format
data <- hipread_long(
  hipread_example("test-basic.dat"),
  list(
    H = hip_fwf_positions(
      c(1, 2, 5, 8),
      c(1, 4, 7, 10),
      c("rt", "hhnum", "hh_char", "hh_dbl"),
      c("c", "i", "c", "d")
    ),
    P = hip_fwf_widths(
      c(1, 3, 1, 3, 1),
      c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
      c("c", "i", "i", "d", "c")
    )
  ),
  hip_rt(1, 1)
)

# Read an example hierarchical data.frame into list format
data <- hipread_list(
  hipread_example("test-basic.dat"),
  list(
    H = hip_fwf_positions(
      c(1, 2, 5, 8),
      c(1, 4, 7, 10),
      c("rt", "hhnum", "hh_char", "hh_dbl"),
      c("c", "i", "c", "d")
    ),
    P = hip_fwf_widths(
      c(1, 3, 1, 3, 1),
      c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
      c("c", "i", "i", "d", "c")
    )
  ),
  hip_rt(1, 1)
)

# Read a rectangular data.frame
data_rect <- hipread_long(
  hipread_example("test-basic.dat"),
  hip_fwf_positions(
    c(1, 2),
    c(1, 4),
    c("rt", "hhnum"),
    c("c", "i")
  )
)
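To make the difference between the "long" and "list" formats concrete, the following sketch (an illustration, not part of the original examples) reads the same example file both ways and compares the results. The object names long and lst are illustrative.

```r
library(hipread)

path <- hipread_example("test-basic.dat")
var_info <- list(
  H = hip_fwf_positions(
    c(1, 2, 5, 8),
    c(1, 4, 7, 10),
    c("rt", "hhnum", "hh_char", "hh_dbl"),
    c("c", "i", "c", "d")
  ),
  P = hip_fwf_widths(
    c(1, 3, 1, 3, 1),
    c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
    c("c", "i", "i", "d", "c")
  )
)

long <- hipread_long(path, var_info, hip_rt(1, 1), progress = FALSE)
lst <- hipread_list(path, var_info, hip_rt(1, 1), progress = FALSE)

# List format: one data frame per record type
names(lst)

# Long format: every line of the file is one row, so the row counts add up
nrow(long) == nrow(lst$H) + nrow(lst$P)

# Long format: person-only variables are missing on household rows
all(is.na(long$pernum[long$rt == "H"]))
```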
Read a hierarchical fixed width data file, in chunks
Description
Analogous to readr::read_fwf(), but with chunks, and allowing for hierarchical fixed width data files (where the data file has rows of different record types, each with their own variables and column specifications). hipread_long_chunked() reads hierarchical data into "long" format, meaning that there is one row per observation, and variables that don't apply to the current observation receive missing values. Alternatively, hipread_list_chunked() reads hierarchical data into "list" format, which returns a list that has one data.frame per record type.
Usage
hipread_long_chunked(
  file,
  callback,
  chunk_size,
  var_info,
  rt_info = hip_rt(1, 0),
  compression = NULL,
  skip = 0,
  encoding = "UTF-8",
  progress = show_progress()
)

hipread_list_chunked(
  file,
  callback,
  chunk_size,
  var_info,
  rt_info = hip_rt(1, 0),
  compression = NULL,
  skip = 0,
  encoding = "UTF-8",
  progress = show_progress()
)
Arguments
file |
A filename |
callback |
A |
chunk_size |
The size of the chunks that will be read as a single unit. The documented default is 10000, but the Usage above lists chunk_size without a default, so pass it explicitly. |
var_info |
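Variable information, specified by either hip_fwf_positions() or hip_fwf_widths(). For hierarchical files, a named list with one specification per record type value. |
rt_info |
A record type information object, created by hip_rt() |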
compression |
If |
skip |
Number of lines to skip at the start of the data (defaults to 0). |
encoding |
(Defaults to UTF-8) A string indicating what encoding to use when reading the data, but like readr, the data will always be converted to UTF-8 once it is imported. Note that UTF-16 and UTF-32 are not supported for non-character columns. |
progress |
A logical indicating whether progress should be displayed on the screen, defaults to showing progress unless the current context is non-interactive or in a knitr document or if the user has turned off readr's progress by default using the option |
Value
Depends on the type of callback function you use
Examples
# Read in data, filtering out rows where hhnum is 2 ("002" in the raw file)
data <- hipread_long_chunked(
  hipread_example("test-basic.dat"),
  HipDataFrameCallback$new(function(x, pos) x[x$hhnum != 2, ]),
  4,
  list(
    H = hip_fwf_positions(
      c(1, 2, 5, 8),
      c(1, 4, 7, 10),
      c("rt", "hhnum", "hh_char", "hh_dbl"),
      c("c", "i", "c", "d")
    ),
    P = hip_fwf_widths(
      c(1, 3, 1, 3, 1),
      c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
      c("c", "i", "i", "d", "c")
    )
  ),
  hip_rt(1, 1)
)
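When only a running total is needed, a side-effect callback avoids keeping any chunk in memory. The sketch below is illustrative rather than taken from the package: it tallies rows per record type with HipSideEffectChunkCallback, assuming the callback receives each chunk and its position as two arguments, as in the example above. The counts vector is a hypothetical accumulator updated with `<<-`.

```r
library(hipread)

var_info <- list(
  H = hip_fwf_positions(
    c(1, 2, 5, 8),
    c(1, 4, 7, 10),
    c("rt", "hhnum", "hh_char", "hh_dbl"),
    c("c", "i", "c", "d")
  ),
  P = hip_fwf_widths(
    c(1, 3, 1, 3, 1),
    c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
    c("c", "i", "i", "d", "c")
  )
)

# Running tally of rows per record type, updated as each chunk arrives
counts <- c(H = 0, P = 0)

hipread_long_chunked(
  hipread_example("test-basic.dat"),
  HipSideEffectChunkCallback$new(function(x, pos) {
    # Count this chunk's rows by record type, in the same order as counts
    chunk_counts <- table(factor(x$rt, levels = names(counts)))
    counts <<- counts + as.vector(chunk_counts)
  }),
  4,
  var_info,
  hip_rt(1, 1),
  progress = FALSE
)

counts
```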
Read a hierarchical fixed width data file, in yields
Description
Enhances hipread_long() or hipread_list() to allow you to read hierarchical data in pieces (called 'yields') and allow your code to have full control between reading pieces, allowing for more freedom than the 'callback' method introduced in the chunk functions (like hipread_long_chunked()).
Usage
hipread_long_yield(
  file,
  var_info,
  rt_info = hip_rt(1, 0),
  compression = NULL,
  skip = 0,
  encoding = "UTF-8"
)

hipread_list_yield(
  file,
  var_info,
  rt_info = hip_rt(1, 0),
  compression = NULL,
  skip = 0,
  encoding = "UTF-8"
)
Arguments
file |
A filename |
var_info |
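Variable information, specified by either hip_fwf_positions() or hip_fwf_widths(). For hierarchical files, a named list with one specification per record type value. |
rt_info |
A record type information object, created by hip_rt() |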
compression |
If |
skip |
Number of lines to skip at the start of the data (defaults to 0). |
encoding |
(Defaults to UTF-8) A string indicating what encoding to use when reading the data, but like readr, the data will always be converted to UTF-8 once it is imported. Note that UTF-16 and UTF-32 are not supported for non-character columns. |
Details
These functions return a HipYield R6 object, which has the following methods and property:
- yield(n = 10000): Reads the next 'yield' from the data. Returns a tbl_df (or a list of tbl_df for hipread_list_yield()) with up to n rows. If fewer than n rows remain, it returns all of them; if no rows are left, it returns NULL.
- reset(): Resets the reader so that the next yield reads data from the start.
- is_done(): Returns whether the file has been completely read.
- cur_pos: A property that contains the next row number that will be read (1-indexed).
Value
A HipYield R6 object (See 'Details' for more information)
Methods
Public methods
Method new()
Usage
HipYield$new( file, var_info, rt_info = hip_rt(0, 1), compression = NULL, skip = 0, encoding = NULL )
Method yield()
Usage
HipYield$yield(n = 10000)
Method reset()
Usage
HipYield$reset()
Method is_done()
Usage
HipYield$is_done()
Super class
hipread::HipYield -> HipLongYield
Methods
Public methods
Inherited methods
Method new()
Usage
HipLongYield$new( file, var_info, rt_info = hip_rt(0, 1), compression = NULL, skip = 0, encoding = NULL )
Method yield()
Usage
HipLongYield$yield(n = 10000)
Super class
hipread::HipYield -> HipListYield
Methods
Public methods
Inherited methods
Method new()
Usage
HipListYield$new( file, var_info, rt_info = hip_rt(0, 1), compression = NULL, skip = 0, encoding = NULL )
Method yield()
Usage
HipListYield$yield(n = 10000)
Examples
library(hipread)
data <- hipread_long_yield(
  hipread_example("test-basic.dat"),
  list(
    H = hip_fwf_positions(
      c(1, 2, 5, 8),
      c(1, 4, 7, 10),
      c("rt", "hhnum", "hh_char", "hh_dbl"),
      c("c", "i", "c", "d")
    ),
    P = hip_fwf_widths(
      c(1, 3, 1, 3, 1),
      c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
      c("c", "i", "i", "d", "c")
    )
  ),
  hip_rt(1, 1)
)

# Read the first 4 rows
data$yield(4)

# Read the next 2 rows
data$yield(2)

# Reset and then read the first 4 rows again
data$reset()
data$yield(4)
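A common pattern with a yield reader is to loop until the file is exhausted. The following is a minimal sketch of that loop using only the methods described in Details. The names reader, piece and n_rows, and the piece size of 3, are illustrative choices.

```r
library(hipread)

path <- hipread_example("test-basic.dat")
var_info <- list(
  H = hip_fwf_positions(
    c(1, 2, 5, 8),
    c(1, 4, 7, 10),
    c("rt", "hhnum", "hh_char", "hh_dbl"),
    c("c", "i", "c", "d")
  ),
  P = hip_fwf_widths(
    c(1, 3, 1, 3, 1),
    c("rt", "hhnum", "pernum", "per_dbl", "per_mix"),
    c("c", "i", "i", "d", "c")
  )
)

reader <- hipread_long_yield(path, var_info, hip_rt(1, 1))

# Process the file three rows at a time until nothing is left
n_rows <- 0
while (!reader$is_done()) {
  piece <- reader$yield(3)
  if (is.null(piece)) break  # defensive: yield() returns NULL when no rows are left
  n_rows <- n_rows + nrow(piece)
}

n_rows
```

Unlike the callback approach, arbitrary code can run between calls to yield(), for example to stop early or to call reset() and start over.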