Help for package wru

Title:

Who are You? Bayesian Prediction of Racial Category Using Surname, First Name, Middle Name, and Geolocation

Version:

3.0.3

Date:

2024-05-24

Description:

Predicts individual race/ethnicity using surname, first name, middle name, geolocation, and other attributes, such as gender and age. The method utilizes Bayes' Rule (with optional measurement error correction) to compute the posterior probability of each racial category for any given individual. The package implements methods described in Imai and Khanna (2016) "Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records" Political Analysis <doi:10.1093/pan/mpw001> and Imai, Olivella, and Rosenman (2022) "Addressing census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements" <doi:10.1126/sciadv.adc9824>. The package also incorporates the data described in Rosenman, Olivella, and Imai (2023) "Race and ethnicity data for first, middle, and surnames" <doi:10.1038/s41597-023-02202-2>.

License:

GPL (≥ 3)

URL:

https://github.com/kosukeimai/wru

BugReports:

https://github.com/kosukeimai/wru/issues

Depends:

R (≥ 4.1.0), utils

Imports:

cli, dplyr, tidyr, furrr, future, piggyback (≥ 0.1.4), PL94171, purrr, Rcpp, rlang

Suggests:

covr, testthat (≥ 3.0.0), tidycensus

LinkingTo:

Rcpp, RcppArmadillo

Config/testthat/edition:

Encoding:

UTF-8

LazyData:

yes

LazyDataCompression:

LazyLoad:

yes

RoxygenNote:

7.3.1

NeedsCompilation:

yes

Packaged:

2024-05-24 16:06:47 UTC; beb

Author:

Kabir Khanna [aut], Brandon Bertelsen [aut, cre], Santiago Olivella [aut], Evan Rosenman [aut], Alexander Rossell Hayes [aut], Kosuke Imai [aut]

Maintainer:

Brandon Bertelsen <brandon@bertelsen.ca>

Repository:

CRAN

Date/Publication:

2024-05-24 18:00:02 UTC

Pre-process vector of names to match census style. Internal function

Description

Pre-process vector of names to match census style. Internal function

Usage

.name_preproc(voter_names, target_names)

Arguments

voter_names

Character vector to be pre-processed.

target_names

Character vector of census names to be matched.

Value

A character vector of pre-processed names.

Convert between state names, postal abbreviations, and FIPS codes

Description

Convert between state names, postal abbreviations, and FIPS codes

Usage

as_fips_code(x)

as_state_abbreviation(x)

Arguments

x

A numeric or character vector of state names, postal abbreviations, or FIPS codes. Matches for state names and abbreviations are not case sensitive. FIPS codes may be matched from numeric or character vectors, with or without leading zeroes.

Value

as_fips_code(): A character vector of two-digit FIPS codes. One-digit FIPS codes are prefixed with a leading zero, e.g., "06" for California.
as_state_abbreviation(): A character vector of two-letter postal abbreviations, e.g., "CA" for California.

Examples

as_fips_code("california")
as_state_abbreviation("california")

# Character vector matches ignore case
as_fips_code(c("DC", "Md", "va"))
as_state_abbreviation(c("district of columbia", "Maryland", "VIRGINIA"))

# Note that `3` and `7` are standardized to `NA`,
# because no state is assigned those FIPS codes
as_fips_code(1:10)
as_state_abbreviation(1:10)

# You can even mix methods in the same vector
as_fips_code(c("utah", "NM", 8, "04"))
as_state_abbreviation(c("utah", "NM", 8, "04"))

Preflight census data

Description

Preflight census data

Usage

census_data_preflight(census.data, census.geo, year)

Arguments

census.data

A list indexed by two-letter state abbreviations, which contains pre-saved Census geographic data. Can be generated using get_census_data function.

census.geo

An optional character vector specifying what level of geography to use to merge in U.S. Census geographic data. Currently "county", "tract", "block_group", "block", "place", and "zcta" are supported. Note: sufficient information must be in the user-defined voter.file object. If census.geo = "county", then voter.file must have a column named county. If census.geo = "tract", then voter.file must have columns named county and tract. If census.geo = "block", then voter.file must have columns named county, tract, and block. If census.geo = "place", then voter.file must have a column named place. If census.geo = "zcta", then voter.file must have a column named zcta. Specifying census.geo will call the census_helper function to merge Census geographic data at the specified level of geography.

year

An optional character vector specifying the year of U.S. Census geographic data to be downloaded. Use "2010" or "2020". Default is "2020".
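
The column requirements described for census.geo can be checked before any merge is attempted. The following is a minimal sketch, not the package's own preflight logic: the helper name check_geo_columns and the lookup table are hypothetical, and "block_group" is left out because its required columns are not spelled out above.

```r
# Hypothetical lookup: columns a voter file needs for each census.geo level,
# mirroring the requirements stated in the census.geo description.
required_geo_columns <- list(
  county = "county",
  tract  = c("county", "tract"),
  block  = c("county", "tract", "block"),
  place  = "place",
  zcta   = "zcta"
)

# Hypothetical helper: return the required columns that are missing.
check_geo_columns <- function(voter.file, census.geo) {
  needed <- required_geo_columns[[census.geo]]
  if (is.null(needed)) {
    stop("No column requirements recorded for census.geo = '", census.geo, "'")
  }
  setdiff(needed, names(voter.file))
}

# Toy data frame (illustrative values only)
toy_voters <- data.frame(
  surname = c("A", "B"),
  county  = c("021", "021"),
  tract   = c("004000", "004501"),
  stringsAsFactors = FALSE
)

missing_for_tract <- check_geo_columns(toy_voters, "tract")
missing_for_block <- check_geo_columns(toy_voters, "block")
```

Returning the missing column names, rather than a bare TRUE/FALSE, makes it easier to produce a helpful error message.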

Census Data download function.

Description

census_geo_api retrieves U.S. Census geographic data for a given state.

Usage

census_geo_api(
  key = Sys.getenv("CENSUS_API_KEY"),
  state,
  geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  age = FALSE,
  sex = FALSE,
  year = c("2020", "2010"),
  retry = 3,
  save_temp = NULL,
  counties = NULL
)

Arguments

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

state

A required character object specifying which state to extract Census data for, e.g., "NJ".

geo

A character object specifying what aggregation level to use. Use "block", "block_group", "county", "place", "tract", or "zcta". Default is "tract". Warning: extracting block-level data takes very long.

age

A TRUE/FALSE object indicating whether to condition on age or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Age | Race). If sex is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

sex

A TRUE/FALSE object indicating whether to condition on sex or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Sex | Race). If age is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

year

A character object specifying the year of U.S. Census data to be downloaded. Use "2010" or "2020". Default is "2020". Warning: 2020 U.S. Census data is downloaded only when age and sex are both FALSE.

retry

The number of retries at the census website if network interruption occurs.

save_temp

File indicating where to save the temporary outputs. Defaults to NULL. If specified, the function will look for an .RData file with the same format as the expected output.

counties

A vector of counties contained in your data. If NULL, all counties are pulled. Useful for smaller predictions where only a few counties are considered. Must be zero padded.

Details

This function allows users to download U.S. Census geographic data (2010 or 2020) at the county, tract, block, or place level for a particular state.

Value

Output will be an object of class list, indexed by state names. It will consist of the original user-input data with additional columns of Census geographic data.
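
To make the quantity Pr(Geolocation | Race) concrete, the sketch below computes it from a small table of counts: each race column is divided by its total, so every column sums to 1 across geographies. The counts and column names are made up for illustration, and this is not the function's internal code.

```r
# Toy counts by tract and race (made-up numbers, illustrative only)
counts <- data.frame(
  tract = c("000100", "000200", "000300"),
  whi   = c(600, 300, 100),
  bla   = c(50, 150, 300),
  stringsAsFactors = FALSE
)

race_cols <- c("whi", "bla")

# Pr(Geolocation | Race): share of each racial group located in each tract
geo_given_race <- counts
geo_given_race[race_cols] <- lapply(counts[race_cols], function(n) n / sum(n))

# Each race column now sums to 1 across tracts
column_totals <- colSums(geo_given_race[race_cols])
```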

References

Relies on get_census_api(), get_census_api_2(), and vec_to_chunk() functions authored by Nicholas Nagle.

Examples


## Not run: 
census_geo_api(state = "NJ", geo = "block")
census_geo_api(state = "FL", geo = "tract", age = TRUE, sex = TRUE)
census_geo_api(state = "MA", geo = "place", age = FALSE, sex = FALSE, year = "2020")

## End(Not run)

Census geo API helper functions

Description

Census geo API helper functions

Usage

census_geo_api_names(
  year = c("2020", "2010", "2000"),
  age = FALSE,
  sex = FALSE
)

census_geo_api_url(year = c("2020", "2010", "2000"))

Arguments

year

age

sex

Value

census_geo_api_names(): A named list of character vectors whose values correspond to columns of a Census API table and whose names represent the new columns they are used to calculate in census_geo_api().
census_geo_api_url(): A character string containing the base of the URL to a Census API table.

Census helper function.

Description

census_helper links user-input dataset with Census geographic data.

Usage

census_helper(
  key = Sys.getenv("CENSUS_API_KEY"),
  voter.file,
  states = "all",
  geo = "tract",
  age = FALSE,
  sex = FALSE,
  year = "2020",
  census.data = NULL,
  retry = 3,
  use.counties = FALSE
)

Arguments

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

voter.file

An object of class data.frame. Must contain field(s) named county, tract, block, and/or place specifying geolocation. These should be character variables that match up with U.S. Census categories. County should be three characters (e.g., "031" not "31"), tract should be six characters, and block should be four characters. Place should be five characters if it is included.

states

A character vector specifying which states to extract Census data for, e.g. c("NJ", "NY"). Default is "all", which extracts Census data for all states contained in user-input data.

geo

A character object specifying what aggregation level to use. Use "county", "tract", "block" or "place". Default is "tract". Warning: extracting block-level data takes very long.

age

sex

year

census.data

An optional census object of class list containing pre-saved Census geographic data. Can be created using get_census_data function. If census.data is provided, the age element must have the same value as the age option specified in this function (i.e., TRUE in both or FALSE in both). Similarly, the sex element in the object provided in census.data must have the same value as the sex option here. Moreover, the year element in the object provided in census.data must have the same value as the year option in the function (i.e., "2010" in both or "2020" in both). If census.data is missing, Census geographic data will be obtained via Census API.

retry

The number of retries at the census website if network interruption occurs.

use.counties

A logical, defaulting to FALSE. Should census data be filtered by counties available in census.data?

Details

This function allows users to link their geocoded dataset (e.g., voter file) with U.S. Census data (2010 or 2020). The function extracts Census Summary File data at the county, tract, block, or place level. Census data calculated are Pr(Geolocation | Race) where geolocation is county, tract, block, or place.
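
For intuition about how Pr(Geolocation | Race) is used downstream, the sketch below applies Bayes' rule in the surname-plus-geolocation form, Pr(Race | Surname, Geolocation) ∝ Pr(Race | Surname) × Pr(Geolocation | Race), which assumes surname and geolocation are independent given race. All probabilities are invented for illustration, and the code is not taken from the package.

```r
# Made-up inputs for one individual (illustrative only)
race_given_surname <- c(whi = 0.70, bla = 0.20, his = 0.05, asi = 0.03, oth = 0.02)
geo_given_race     <- c(whi = 0.001, bla = 0.004, his = 0.002, asi = 0.001, oth = 0.001)

# Bayes' rule: multiply prior by likelihood, then normalize so the
# posterior probabilities sum to 1
unnormalized <- race_given_surname * geo_given_race
posterior    <- unnormalized / sum(unnormalized)

round(posterior, 3)
```

In this toy case the geolocation evidence is strong enough to shift the most likely category away from the one favored by the surname alone.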

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns of Census data.
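
Geographic identifiers often arrive as integers, which drops the leading zeros that the voter.file requirements depend on. A minimal sketch of restoring the fixed widths (three characters for county, six for tract, four for block) with base R follows; the data frame and the helper name pad_geo are hypothetical.

```r
# Hypothetical helper: left-pad identifiers with zeros to a fixed width.
# Values already at or beyond the width are returned unchanged.
pad_geo <- function(x, width) {
  x <- as.character(x)
  paste0(strrep("0", pmax(0, width - nchar(x))), x)
}

# Toy identifiers stored as integers (illustrative values only)
toy_voters <- data.frame(
  county = c(31L, 5L),
  tract  = c(4000L, 123456L),
  block  = c(12L, 3004L)
)

toy_voters$county <- pad_geo(toy_voters$county, 3)
toy_voters$tract  <- pad_geo(toy_voters$tract, 6)
toy_voters$block  <- pad_geo(toy_voters$block, 4)
```

Working on strings rather than formatting numbers avoids surprises with identifiers that R would otherwise print in scientific notation.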

Examples


## Not run: 
census_helper(voter.file = voters, states = "nj", geo = "block")

## End(Not run)
## Not run: 
census_helper(
  voter.file = voters, states = "all", geo = "tract",
  age = TRUE, sex = TRUE
)

## End(Not run)
## Not run: 
census_helper(
  voter.file = voters, states = "all", geo = "county",
  age = FALSE, sex = FALSE, year = "2020"
)

## End(Not run)

Census helper function.

Description

census_helper_new links user-input dataset with Census geographic data.

Usage

census_helper_new(
  key = Sys.getenv("CENSUS_API_KEY"),
  voter.file,
  states = "all",
  geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  age = FALSE,
  sex = FALSE,
  year = "2020",
  census.data = NULL,
  retry = 3,
  use.counties = FALSE,
  skip_bad_geos = FALSE
)

Arguments

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

voter.file

states

A character vector specifying which states to extract Census data for, e.g. c("NJ", "NY"). Default is "all", which extracts Census data for all states contained in user-input data.

geo

A character object specifying what aggregation level to use. Use "county", "tract", "block", or "place". Default is "tract". Warning: extracting block-level data takes very long.

age

sex

year

A character object specifying the year of U.S. Census data to be downloaded. Use "2010" or "2020". Default is "2020".

census.data

An optional census object of class list containing pre-saved Census geographic data. Can be created using get_census_data function. If census.data is provided, the year element must have the same value as the year option specified in this function (i.e., "2010" in both or "2020" in both). If census.data is provided, the age and the sex elements must be FALSE. This corresponds to the defaults of census_geo_api. If census.data is missing, Census geographic data will be obtained via Census API.

retry

The number of retries at the census website if network interruption occurs.

use.counties

A logical, defaulting to FALSE. Should census data be filtered by counties available in census.data?

skip_bad_geos

Logical. Option to have the function skip any geolocations that are not present in the census data, returning a partial data set. Default is FALSE, in which case the function stops and reports an error message listing the offending geolocations.

Details

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns of Census data.
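
The behavior controlled by skip_bad_geos can be pictured as a set comparison between the geolocations in the voter file and those in the Census data. The sketch below is an assumption about the general idea, not the package's implementation; the data frames, the geo_key helper, and the key format are hypothetical.

```r
# Toy inputs (illustrative values only)
toy_voters <- data.frame(
  surname = c("A", "B", "C"),
  county  = c("021", "021", "999"),
  tract   = c("004000", "004501", "000100"),
  stringsAsFactors = FALSE
)
toy_census <- data.frame(
  county = c("021", "021"),
  tract  = c("004000", "004501"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one string key per county/tract combination
geo_key <- function(df) paste(df$county, df$tract, sep = "-")

# Rows whose geolocation does not appear in the census data
bad      <- !(geo_key(toy_voters) %in% geo_key(toy_census))
bad_geos <- unique(geo_key(toy_voters)[bad])

skip_bad_geos <- TRUE
if (any(bad) && !skip_bad_geos) {
  stop("Geolocations not found in census data: ",
       paste(bad_geos, collapse = ", "))
}

# With skip_bad_geos = TRUE, continue with the matched rows only
kept <- toy_voters[!bad, ]
```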

Examples


## Not run: 
census_helper_new(voter.file = voters, states = "nj", geo = "block")
census_helper_new(voter.file = voters, states = "all", geo = "tract")
census_helper_new(voter.file = voters, states = "all", geo = "place", year = "2020")

## End(Not run)

Legacy data formatting function.

Description

format_legacy_data formats legacy data from the U.S. census to allow for Bayesian name geocoding.

Usage

format_legacy_data(legacyFilePath, state, outFile = NULL)

Arguments

legacyFilePath

A character vector giving the location of a legacy census data folder, sourced from https://www2.census.gov/programs-surveys/decennial/2020/data/01-Redistricting_File--PL_94-171/. These file names should end in ".pl".

state

The two letter state postal code.

outFile

Optional character vector determining whether the formatted RData object should be saved. The filepath should end in ".RData".

Details

This function allows users to construct datasets for analysis using the census legacy data format. These data are available for the 2020 census at https://www2.census.gov/programs-surveys/decennial/2020/data/01-Redistricting_File--PL_94-171/. This function returns data structured analogously to data from the Census API, which is not yet available for the 2020 Census as of September 2021.

Examples

## Not run: 
gaCensusData <- format_legacy_data(PL94171::pl_url('GA', 2020), state = 'GA')
predict_race_new(ga.voter.file, namesToUse = 'last, first, mid', census.geo = 'block',
     census.data = gaCensusData)

## End(Not run)

Census API function.

Description

get_census_api obtains U.S. Census data via the public API.

Usage

get_census_api(
  data_url,
  key = Sys.getenv("CENSUS_API_KEY"),
  var.names,
  region,
  retry = 0
)

Arguments

data_url

URL root of the API, e.g., "https://api.census.gov/data/2020/dec/pl".

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

var.names

A character vector of variables to get, e.g., c("P2_005N", "P2_006N", "P2_007N", "P2_008N"). If there are more than 50 variables, then function will automatically split variables into separate queries.

region

Character object specifying which region to obtain data for. Must contain "for" and possibly "in", e.g., "for=block:1213&in=state:47+county:015+tract:*".

retry

The number of retries at the census website if network interruption occurs.

Details

This function obtains U.S. Census data via the public API. User can specify the variables and region(s) for which to obtain data.

Value

If successful, output will be an object of class data.frame. If unsuccessful, function prints the URL query that caused the error.
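
The automatic splitting mentioned for var.names can be sketched with base R. The helper below, chunk_vars, is hypothetical and only illustrates grouping a vector into pieces of at most 50 elements; the package's own vec_to_chunk() may work differently.

```r
# Hypothetical helper: split a vector into consecutive chunks of at most
# `size` elements, preserving the original order.
chunk_vars <- function(x, size = 50) {
  split(x, ceiling(seq_along(x) / size))
}

# Placeholder names, not real Census variables
var_names <- paste0("VAR_", seq_len(120))

chunks      <- chunk_vars(var_names)
chunk_sizes <- lengths(chunks)
```

Each chunk could then be sent as its own query and the resulting data frames combined column-wise.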

References

Based on code authored by Nicholas Nagle.

Examples

## Not run: 
get_census_api(
  data_url = "https://api.census.gov/data/2020/dec/pl",
  var.names = c("P2_005N", "P2_006N", "P2_007N", "P2_008N"), region = "for=county:*&in=state:34"
)

## End(Not run)

Census API URL assembler.

Description

get_census_api_2 assembles URL components for get_census_api.

Usage

get_census_api_2(
  data_url,
  key = Sys.getenv("CENSUS_API_KEY"),
  get,
  region,
  retry = 3
)

Arguments

data_url

URL root of the API, e.g., "https://api.census.gov/data/2020/dec/pl".

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

get

region

Character object specifying which region to obtain data for. Must contain "for" and possibly "in", e.g., "for=block:1213&in=state:47+county:015+tract:*".

retry

The number of retries at the census website if network interruption occurs.

Details

This function assembles the URL components and sends the request to the Census server. It is used by the get_census_api function. The user should not need to call this function directly.

Value

If successful, output will be an object of class data.frame. If unsuccessful, function prints the URL query that was constructed.
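
As an illustration of what assembling the URL components involves, the sketch below joins a data URL, a key, a variable list, and a region string into a single query string in the general shape the Census API accepts. The helper name build_census_url is hypothetical, the key is a placeholder, and the exact parameter order used by get_census_api_2 is not specified in this documentation.

```r
# Hypothetical helper: paste the pieces of a Census API request together.
# Variables are comma-separated; the region string is appended as-is.
build_census_url <- function(data_url, key, get, region) {
  paste0(
    data_url,
    "?key=", key,
    "&get=", paste(get, collapse = ","),
    "&", region
  )
}

url <- build_census_url(
  data_url = "https://api.census.gov/data/2020/dec/pl",
  key      = "YOUR_KEY_HERE",  # placeholder, not a real key
  get      = c("P2_005N", "P2_006N"),
  region   = "for=county:*&in=state:34"
)
url
```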

References

Based on code authored by Nicholas Nagle.

Examples

## Not run: 
try(get_census_api_2(
  data_url = "https://api.census.gov/data/2020/dec/pl",
  get = c("P2_005N", "P2_006N", "P2_007N", "P2_008N"),
  region = "for=county:*&in=state:34"
))

## End(Not run)

Multilevel Census data download function.

Description

get_census_data returns county-, tract-, and block-level Census data for specified state(s). Using this function to download Census data in advance can save considerable time when running predict_race and census_helper.

Usage

get_census_data(
  key = Sys.getenv("CENSUS_API_KEY"),
  states,
  age = FALSE,
  sex = FALSE,
  year = "2020",
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  retry = 3,
  county.list = NULL
)

Arguments

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

states

A character vector specifying which states to extract Census data for, e.g., c("NJ", "NY").

age

sex

year

census.geo

An optional character vector specifying what level of geography to use to merge in U.S. Census 2010 geographic data. Currently "county", "tract", "block", and "place" are supported.

retry

The number of retries at the census website if network interruption occurs.

county.list

A named list of character vectors of counties present in your voter.file, per state.

Value

Output will be an object of class list indexed by state. Output will contain a subset of the following elements: state, age, sex, county, tract, block_group, block, and place.
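
A county.list of the shape described in the arguments (a named list of character vectors, one per state) can be derived from a voter file with base R. This sketch assumes the voter file has state and county columns; the toy data is illustrative only.

```r
# Toy voter file (illustrative values only)
toy_voters <- data.frame(
  state  = c("NJ", "NJ", "NY", "NJ"),
  county = c("021", "013", "061", "021"),
  stringsAsFactors = FALSE
)

# Group county codes by state, then drop duplicates within each state
county_list <- lapply(split(toy_voters$county, toy_voters$state), unique)
county_list
```

The result could then be supplied as the county.list argument so that only the counties present in the data are requested.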

Examples

## Not run: 
get_census_data(states = c("NJ", "NY"), age = TRUE, sex = FALSE)
get_census_data(states = "MN", age = FALSE, sex = FALSE, year = "2020")

## End(Not run)

Surname probability merging function.

Description

merge_names merges names in a user-input dataset with corresponding race/ethnicity probabilities derived from both the U.S. Census Surname List and Spanish Surname List and voter files from states in the Southern U.S.

Usage

merge_names(
  voter.file,
  namesToUse,
  census.surname,
  table.surnames = NULL,
  table.first = NULL,
  table.middle = NULL,
  clean.names = TRUE,
  impute.missing = FALSE,
  model = "BISG"
)

Arguments

voter.file

An object of class data.frame. Must contain a row for each individual being predicted, as well as a field named last containing each individual's surname. If first name is also being used for prediction, the file must also contain a field named first. If middle name is also being used for prediction, the file must also contain a field named middle.

namesToUse

A character vector identifying which names to use for the prediction. The default value is "last", indicating that only the last name will be used. Other options are "last, first", indicating that both last and first names will be used, and "last, first, middle", indicating that last, first, and middle names will all be used.

census.surname

A TRUE/FALSE object. If TRUE, function will call merge_surnames to merge in Pr(Race | Surname) from U.S. Census Surname List (2000, 2010, or 2020) and Spanish Surname List. If FALSE, user must provide a name.dictionary (see below). Default is TRUE.

table.surnames

An object of class data.frame provided by the users as an alternative surname dictionary. It will consist of a list of U.S. surnames, along with the associated probabilities P(name | ethnicity) for ethnicities: white, Black, Hispanic, Asian, and other. Default is NULL. (last_name for U.S. surnames, p_whi_last for White, p_bla_last for Black, p_his_last for Hispanic, p_asi_last for Asian, p_oth_last for other).

table.first

See table.surnames.

table.middle

See table.surnames.

clean.names

A TRUE/FALSE object. If TRUE, any surnames in voter.file that cannot initially be matched to the database will be cleaned, according to U.S. Census specifications, in order to increase the chance of finding a match. Default is TRUE.

impute.missing

See predict_race.

model

See predict_race.

Details

This function allows users to match names in their dataset with database entries estimating P(name | ethnicity) for each of the five major racial groups for each name. The database probabilities are derived from both the U.S. Census Surname List and Spanish Surname List and voter files from states in the Southern U.S.

By default, the function matches names as follows:

1. Search raw surnames in the database;
2. Remove any punctuation and search again;
3. Remove any spaces and search again;
4. Remove suffixes (e.g., "Jr") and search again (last names only);
5. Split double-barreled names into two parts and search first part of name;
6. Split double-barreled names into two parts and search second part of name.

Each step only applies to names not matched in a previous step. Steps 2 through 6 are not applied if clean.names is FALSE.

Note: Any name appearing only on the Spanish Surname List is assigned a probability of 1 for Hispanics/Latinos and 0 for all other racial groups.
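
The matching cascade above can be sketched as a sequence of progressively more aggressive clean-ups, where the first candidate found in the dictionary wins. This is a simplified illustration with a toy dictionary and a hypothetical helper name; the regular expressions, the suffix list, the upper-casing, and the choice to keep hyphens until the double-barreled split are all assumptions, and the package's own cleaning rules may differ.

```r
# Toy dictionary of known surnames (illustrative only)
dictionary <- c("SMITH", "OBRIEN", "DELACRUZ", "GARCIA", "LOPEZ")

# Hypothetical helper: try each cleaning step in order and return the
# first candidate that appears in the dictionary, or NA if none does.
match_surname <- function(name, dictionary) {
  raw      <- toupper(name)
  no_punct <- gsub("[^[:alnum:][:space:]-]", "", raw)   # keep hyphens for now
  no_space <- gsub("[[:space:]]", "", no_punct)
  no_sfx   <- sub("(JR|SR|III|II|IV)$", "", no_space)   # assumed suffix list
  first    <- sub("-.*$", "", no_sfx)                   # part before hyphen
  second   <- sub("^.*-", "", no_sfx)                   # part after hyphen
  candidates <- c(raw, no_punct, no_space, no_sfx, first, second)
  hit <- candidates[candidates %in% dictionary]
  if (length(hit) > 0) hit[1] else NA_character_
}

test_names <- c("Smith", "O'Brien", "De La Cruz", "Smith Jr",
                "Garcia-Lopez", "Zzyzx")
matched <- vapply(test_names, match_surname, character(1),
                  dictionary = dictionary, USE.NAMES = FALSE)
```

Because candidates are ordered from least to most modified, a name that already matches in raw form is never altered by the later steps.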

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns that specify the part of the name matched with Census data (surname.match), and the probabilities Pr(Race | Surname) for each racial group (p_whi for White, p_bla for Black, p_his for Hispanic/Latino, p_asi for Asian and Pacific Islander, and p_oth for Other/Mixed).

Examples

data(voters)
## Not run: 
try(merge_names(voters, namesToUse = "surname", census.surname = TRUE))

## End(Not run)

Surname probability merging function.

Description

merge_surnames merges surnames in user-input dataset with corresponding race/ethnicity probabilities from U.S. Census Surname List and Spanish Surname List.

Usage

merge_surnames(
  voter.file,
  surname.year = 2020,
  name.data,
  clean.surname = TRUE,
  impute.missing = TRUE
)

Arguments

voter.file

An object of class data.frame. Must contain a field named 'surname' containing list of surnames to be merged with Census lists.

surname.year

An object of class numeric indicating which year the Census Surname List is from. Accepted values are 2000, 2010, and 2020. Default is 2020.

name.data

An object of class data.frame. Must contain a leading column of surnames, and 5 subsequent columns, with Pr(Race | Surname) for each of the five major racial categories.

clean.surname

A TRUE/FALSE object. If TRUE, any surnames in voter.file that cannot initially be matched to surname lists will be cleaned, according to U.S. Census specifications, in order to increase the chance of finding a match. Default is TRUE.

impute.missing

A TRUE/FALSE object. If TRUE, race/ethnicity probabilities will be imputed for unmatched names using race/ethnicity distribution for all other names (i.e., not on Census List). Default is TRUE.

Details

This function allows users to match surnames in their dataset with the U.S. Census Surname List (from 2000, 2010, or 2020) and Spanish Surname List to obtain Pr(Race | Surname) for each of the five major racial groups.

By default, the function matches surnames to the Census list as follows:

1. Search raw surnames in Census surname list;
2. Remove any punctuation and search again;
3. Remove any spaces and search again;
4. Remove suffixes (e.g., Jr) and search again;
5. Split double-barreled surnames into two parts and search first part of name;
6. Split double-barreled surnames into two parts and search second part of name;
7. For any remaining names, impute probabilities using distribution for all names not appearing on Census list.

Each step only applies to surnames not matched in a previous step. Steps 2 through 7 are not applied if clean.surname is FALSE.

Note: Any name appearing only on the Spanish Surname List is assigned a probability of 1 for Hispanics/Latinos and 0 for all other racial groups.
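
The final imputation step can be sketched as filling rows that are still unmatched with a single fallback distribution. The fallback values below are placeholders chosen only so the example runs; they are not the distribution the package uses. The column names follow the p_whi, p_bla, p_his, p_asi, p_oth pattern mentioned elsewhere in this documentation, and the data frame is a toy.

```r
prob_cols <- c("p_whi", "p_bla", "p_his", "p_asi", "p_oth")

# Toy merge result: the second surname found no match, so its
# probabilities are all NA (values are made up for illustration)
merged <- data.frame(
  surname = c("MATCHED", "UNMATCHED"),
  p_whi = c(0.05, NA), p_bla = c(0.01, NA), p_his = c(0.02, NA),
  p_asi = c(0.87, NA), p_oth = c(0.05, NA),
  stringsAsFactors = FALSE
)

# Placeholder fallback distribution (NOT the package's actual values)
fallback <- c(p_whi = 0.2, p_bla = 0.2, p_his = 0.2, p_asi = 0.2, p_oth = 0.2)

# Fill every probability column for rows with any missing probability
unmatched <- !stats::complete.cases(merged[prob_cols])
for (col in prob_cols) {
  merged[unmatched, col] <- fallback[[col]]
}

row_totals <- rowSums(merged[prob_cols])
```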

Value

Examples

data(voters)
## Not run: 
try(merge_surnames(voters))

## End(Not run)

Internal model fitting functions

Description

These functions are intended for internal use only. Users should use the predict_race() interface rather than calling any of these functions directly.