Type: | Package |
Title: | Handle Data with Messy Header Rows and Broken Values |
Version: | 0.4.0 |
Depends: | R (≥ 3.1.0) |
Description: | Verb-like functions to work with messy data, often derived from spreadsheets or parsed PDF tables. Includes functions for unwrapping values broken up across rows, relocating embedded grouping values, and to annotate meaningful formatting in spreadsheet files. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
Imports: | dplyr (≥ 0.8.4), rlang (≥ 0.2.1), forcats, stringr, tidyr, magrittr, tidyxl, readxl, tibble |
RoxygenNote: | 7.3.1 |
Suggests: | knitr, rmarkdown, testthat (≥ 2.1.0), covr |
VignetteBuilder: | knitr |
URL: | https://github.com/luisDVA/unheadr, https://unheadr.liomys.mx/ |
BugReports: | https://github.com/luisDVA/unheadr/issues |
NeedsCompilation: | no |
Packaged: | 2024-05-13 18:55:37 UTC; luisd |
Author: | Luis D. Verde Arregoitia
|
Maintainer: | Luis D. Verde Arregoitia <luis@liomys.mx> |
Repository: | CRAN |
Date/Publication: | 2024-05-13 19:10:02 UTC |
unheadr: Handle Data with Messy Header Rows and Broken Values
Description
Verb-like functions to work with messy data, often derived from spreadsheets or parsed PDF tables. Includes functions for unwrapping values broken up across rows, relocating embedded grouping values, and to annotate meaningful formatting in spreadsheet files.
Author(s)
Maintainer: Luis D. Verde Arregoitia luis@liomys.mx (ORCID)
See Also
Useful links:
Report bugs at https://github.com/luisDVA/unheadr/issues
re-export magrittr pipe operator
Description
re-export magrittr pipe operator
Statistics for game units in Age of Empires II: Definitive Edition
Description
A dataset with the numerical values that determine the behavior and performance of selected military units available in AoE2:DE (July 2020 Game Update).
Usage
AOEunits
Format
A data frame with 128 observations of 19 variables:
- unit
Unit name
- building
Building in which each unit is trained
- type
Unit class
- age
Age at which the unit becomes trainable
- cost_wood
Unit cost in Wood
- cost_food
Unit cost in Food
- cost_gold
Unit cost in Gold
- build_time
Training time in seconds
- rate_of_fire
Attack speed
- attack_delay
Retasking time
- movement_speed
Travel speed on land
- line_of_sight
Vision over the surrounding area
- hit_points
Unit health
- min_range
Minimum attacking range for ranged units
- range
Maximum attacking range for ranged units
- damage
Damage inflicted per attack
- accuracy
Chance that an attack will be on target
- melee_armor
Armor against melee attacks
- pierce_armor
Armor against projectiles
Source
Age of Empires II. Copyright Microsoft Corporation. This dataset was created under Microsoft's Usage rules for Digital Goods using assets from Age of Empires II, and it is not endorsed by or affiliated with Microsoft. All information shown is an interpretation of data collected in-game with no guarantee on the accuracy of any of the data presented.
Statistics for game units in Age of Empires II: Definitive Edition in a messy presentation
Description
A messy version of the AOEunits
dataset, meant for demonstrating data cleaning functions.
Usage
AOEunits_raw
Format
A data frame with 139 observations of 15 variables. See AOEunits
for variable descriptions.
Source
Age of Empires II. Copyright Microsoft Corporation. This dataset was created under Microsoft's Usage rules for Digital Goods using assets from Age of Empires II, and it is not endorsed by or affiliated with Microsoft. All information shown is an interpretation of data collected in-game with no guarantee on the accuracy of any of the data presented.
Annotate meaningful formatting
Description
Turns cell formatting into annotations for values in the target variable.
Usage
annotate_mf(xlfilepath, orig, new)
Arguments
xlfilepath |
Path to a single-sheet spreadsheet file (xls or xlsx). |
orig |
Target variable to annotate formatting in. |
new |
Name of new variable with cell formatting pasted as a string. |
Details
Seven popular approaches for meaningful formatting (bold, colored
text, italic, strikethrough, underline, double underline, and cell
highlighting) are hardcoded in the function. sheets
, skip
, and range
arguments for spreadsheet input are not supported. The hex8 code of the
fill color used for text color and cell highlighting is also appended in
the output. Ensure the data in the spreadsheet are rectangular before
running; this includes blank but formatted cells beyond the data rectangle.
Value
A tibble with a new column in which the meaningful formatting is embedded as text.
Examples
example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr")
annotate_mf(example_spreadsheet, orig = Task, new = Task_annotated)
Annotate meaningful formatting for all cells
Description
Turns cell formatting into annotations for all values across all variables.
Usage
annotate_mf_all(xlfilepath)
Arguments
xlfilepath |
Path to a single-sheet spreadsheet file (xls or xlsx). |
Details
Seven popular approaches for meaningful formatting (bold, colored
text, italic, strikethrough, underline, double underline, and cell
highlighting) are supported in this function. sheets
, skip
, and range
arguments for spreadsheet input are not supported. The hex8 code of the
fill color used for text color and cell highlighting is also appended in
the output. Ensure the data in the spreadsheet are rectangular before
running; this includes blank but formatted cells beyond the data rectangle.
Value
A tibble with meaningful formatting embedded as text for all rows and columns.
Examples
example_spreadsheet <- system.file("extdata/boutiques.xlsx", package = "unheadr")
annotate_mf_all(example_spreadsheet)
boutiques.xlsx spreadsheet
Description
Open XML Format Spreadsheet with 1 sheet, 6 columns, and 8 rows. Toy dataset with Q1 profits for different store locations. Additional information is encoded as meaningful formatting. Bold indicates losses (negative values), colors indicate continent, and italic indicates a second location in the same city.
Details
This data is used in the example for annotate_mf_all()
.
dog_test.xlsx spreadsheet
Description
Open XML Format Spreadsheet with 1 sheet, 2 columns, and 12 rows. Items describe various tasks or behaviors that dogs can be evaluated on, assigned into three categories which appear along with their average scores as embedded subheaders with meaningful formatting.
Details
This data is used in the example for annotate_mf()
.
Source
Items are modified from the checklist written by Junior Watson.
References
http://www.dogtrainingbasics.com/checklist-well-behaved-dog/
Make many header rows into column names
Description
Make many header rows into column names
Usage
mash_colnames(
df,
n_name_rows,
keep_names = TRUE,
sliding_headers = FALSE,
sep = "_"
)
Arguments
df |
A |
n_name_rows |
Number of rows at the top of the data to be used to create the new variable (column) names. Must be >= 1. |
keep_names |
If TRUE, existing names will be included when building the new variable names. Defaults to TRUE. |
sliding_headers |
If TRUE, empty values in the first (topmost) header header row be filled column-wise. Defaults to FALSE. See details. |
sep |
Character string to separate the unified values (default is underscore). |
Details
Tables are often shared with the column names broken up across the
first few rows. This function takes the number of rows at the top of a
table that hold the broken up names and whether or not to include the
names, and mashes the values column-wise into a single string for each
column. The keep_names
argument can be helpful for tables we
imported using a skip
argument. If keep_names
is set to FALSE
,
adjust the value of n_name_rows
accordingly.
This function will throw a warning when possible NA
values end up in the
variable names. sliding_headers
can be used for tables with ragged
names in which not every column has a value in the very first row. In these
cases attribution by adjacency is assumed, and when sliding_headers
is set to TRUE
the names in the topmost row are filled row-wise. This can
be useful for tables reporting survey data or experimental designs in an
untidy manner.
Value
The original data frame, but with new column names and without the top n rows that held the broken up names.
Author(s)
This function was originally contributed by Jarrett Byrnes through a GitHub issue.
Examples
babies <-
data.frame(
stringsAsFactors = FALSE,
Baby = c(NA, NA, "Angie", "Yean", "Pierre"),
Age = c("in", "months", "11", "9", "7"),
Weight = c("kg", NA, "2", "3", "4"),
Ward = c(NA, NA, "A", "B", "C")
)
# Including the object names
mash_colnames(babies, n_name_rows = 2, keep_names = TRUE)
babies_skip <-
data.frame(
stringsAsFactors = FALSE,
X1 = c("Baby", NA, NA, "Jennie", "Yean", "Pierre"),
X2 = c("Age", "in", "months", "11", "9", "7"),
X3 = c("Hospital", NA, NA, "A", "B", "A")
)
#' # Discarding the automatically-generated names (X1, X2, etc...)
mash_colnames(babies_skip, n_name_rows = 3, keep_names = FALSE)
fish_experiment <-
data.frame(
stringsAsFactors = FALSE,
X1 = c("Sample", NA, "Pacific", "Atlantic", "Freshwater"),
X2 = c("Larvae", "Control", "12", "11", "10"),
X3 = c(NA, "Low Dose", "11", "12", "8"),
X4 = c(NA, "High Dose", "8", "7", "9"),
X5 = c("Adult", "Control", "13", "13", "8"),
X6 = c(NA, "Low Dose", "13", "12", "7"),
X7 = c(NA, "High Dose", "10", "10", "9")
)
# Ragged names
mash_colnames(fish_experiment,
n_name_rows = 2,
keep_names = FALSE, sliding_headers = TRUE
)
Comparative data for 54 species of primates
Description
A dataset with embedded subheaders.
Usage
primates2017
Format
A data frame with 69 rows and 4 variables:
- scientific_name
scientific names, with geographic region and taxonomic family embedded as subheaders.
- common_name
vernacular name
- red_list_status
IUCN Red List Status in January 2017
- mass_kg
mean body mass in kilograms
Source
Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946
Comparative data for 16 species of primates with some broken values
Description
A dataset with embedded subheaders and some values (T. obscurus, T. leucocephalus and N. bengalensis) in the scientific_names variable broken up across two rows (typically done to fit the content in a table).
Usage
primates2017_broken
Format
A data frame with 19 rows and 4 variables:
- scientific_name
scientific names, with embedded subheaders for geographic region and taxonomic family and broken values
- common_name
vernacular name
- red_list_status
IUCN Red List Status in January 2017
- mass_kg
mean body mass in kilograms
Source
Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946
Comparative data for two species of primates
Description
A dataset in which the elements for some of the values are in separate rows'
Usage
primates2017_wrapped
Format
A data frame with 9 rows and 6 variables:
- scientific_name
scientific names, see reference
- common_name
vernacular name
- habitat
habitat types listed in the IUCN Red List assessments
- red_list_status
IUCN Red List Status in January 2017
- mass_kg
mean body mass in kilograms
- country
Countries where the species is present, from IUCN Red List assessments
Source
Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946
Vertical character string alignment through regular expressions
Description
Aligning strings with regex.
Usage
regex_valign(stringvec, regex_ai, sep_str = "")
Arguments
stringvec |
A character vector with one element for each line. |
regex_ai |
A regular expression matching the position for alignment. |
sep_str |
Optional character vector that will be inserted at the positions matched by the regular expression. |
Details
Written mainly for reading fixed width files, text, or tables parsed from PDFs.
Value
A character vector with one element for each line, with padding inserted at the matched positions so that elements are vertically aligned across lines.
See Also
This function is based loosely on
textutils::valign()
.
Examples
guests <-
unlist(strsplit(c("6 COAHUILA 20/03/2020
7 COAHUILA 20/03/2020
18 BAJA CALIFORNIA 16/03/2020
109 CDMX 12/03/2020
1230 QUERETARO 21/03/2020"), "\n"))
# align at first uppercase word boundary , inserting a separator
regex_valign(guests, "\\b(?=[A-Z])", " - ")
# align dates at end of string
regex_valign(guests, "\\b(?=[0-9]{2}[\\/]{1}[0-9]{2}[\\/]{1}[0-9]{4}$)")
Deduplicate and remove trailing line breaks
Description
Deduplicate and remove trailing line breaks
Usage
squish_newlines(sepstring)
Arguments
sepstring |
A character vector with new line control characters. |
Details
Useful for tables with merged cells, often imported from Word or PDF files. Can be applied across multiple columns before separating into rows.
Value
A vector without trailing or multiple consecutive new line sequences.
Examples
vecWithNewlines <- c("dog\n\ncat\n\n\npig\n")
squish_newlines(vecWithNewlines)
Merge rows up
Description
Merge rows up
Usage
unbreak_rows(df, regex, ogcol, sep = " ")
Arguments
df |
A data frame with at least two contiguous rows to be merged. |
regex |
A regular expression to identify sets of rows to be merged, meant for the leading of the two contiguous rows. |
ogcol |
Variable with the text strings to match. |
sep |
Character string to separate the unified values (default is space). |
Details
This function recodes empty strings ("") to NA
for smoother pattern matching.
Value
A tibble or data frame with merged rows. Values of the lagging rows are pasted onto the values in the leading row, whitespace is squished, and the lagging row is dropped.
Examples
bball <-
data.frame(
stringsAsFactors = FALSE,
v1 = c(
"Player", NA, "Sleve McDichael", "Dean Wesrey",
"Karl Dandleton"
),
v2 = c("Most points", "in a game", "55", "43", "41"),
v3 = c("Season", "(year ending)", "2001", "2000", "2010")
)
unbreak_rows(bball, "Most", v2)
Unbreak values using regex to match the lagging half of the broken value
Description
Unbreak values using regex to match the lagging half of the broken value
Usage
unbreak_vals(df, regex, ogcol, newcol, sep = " ", slice_groups)
Arguments
df |
A data frame with one or more values within a variable broken up across two rows. |
regex |
Regular expression for matching the trailing (lagging) half of the broken values. |
ogcol |
Variable to unbreak. |
newcol |
Name of the new variable with the unified values. |
sep |
Character string to separate the unified values (default is space). |
slice_groups |
Deprecated. See details and Package News. |
Details
This function is limited to quite specific cases, but useful when
dealing with tables that contain, for example, scientific names broken across two rows.
For unwrapping values, see unwrap_cols
.
Value
A tibble with 'unbroken' values. The variable that originally
contained the broken values gets dropped, and the new variable with the
unified values is placed as the first column. The slice_groups
argument is now deprecated; the extra rows and the variable with broken
values will be dropped.
Examples
data(primates2017_broken)
# regex matches strings starting in lowercase (broken species epithets)
unbreak_vals(primates2017_broken, "^[a-z]", scientific_name, sciname_new)
Rectangling embedded subheaders
Description
Rectangling embedded subheaders
Usage
untangle2(df, regex, orig, new)
Arguments
df |
A data frame with embedded subheaders. |
regex |
Regular expression to match the subheaders. |
orig |
Variable containing the extraneous subheaders. |
new |
Name of variable that will contain the group values. |
Details
Special thanks to Jenny Bryan for fixing the initial tidyeval code and overall function structure.
Value
A tibble without the matched subheaders and a new variable containing the grouping data.
Examples
data(primates2017)
# put taxonomic family in its own variable (matches the suffix "DAE")
untangle2(primates2017, "DAE$", scientific_name, family)
# put geographic regions in their own variable (matching them all by name)
untangle2(
primates2017, "Asia|Madagascar|Mainland Africa|Neotropics",
scientific_name, family
)
# with magrittr pipes (re-exported in this package)
primates2017 %>%
untangle2("DAE$", scientific_name, family) %>%
untangle2(
"Asia|Madagascar|Mainland Africa|Neotropics",
scientific_name, region
)
Unwrap values and clean up NAs used as padding
Description
Unwrap values and clean up NAs used as padding
Usage
unwrap_cols(df, groupingVar, separator)
Arguments
df |
A data frame with wrapped values and an inconsistent number of NA values used to as within-group padding. |
groupingVar |
Name of the variable describing the observational units. |
separator |
Character string defining the separator that will delimit the elements of the unwrapped value. |
Details
This is roughly the opposite of tidyr::separate_rows()
.
Value
A summarized tibble. Order is preserved in the grouping variable by making it a factor.
Examples
data(primates2017_wrapped)
# using commas to separate elements
unwrap_cols(primates2017_wrapped, scientific_name, ", ")
# separating with semicolons
df <- data.frame(
ounits = c("A", NA, "B", "C", "D", NA),
vals = c(1, 2, 2, 3, 1, 3)
)
unwrap_cols(df, ounits, ";")