Title: Who are You? Bayesian Prediction of Racial Category Using Surname, First Name, Middle Name, and Geolocation
Version: 3.0.3
Date: 2024-05-24
Description: Predicts individual race/ethnicity using surname, first name, middle name, geolocation, and other attributes, such as gender and age. The method utilizes Bayes' Rule (with optional measurement error correction) to compute the posterior probability of each racial category for any given individual. The package implements methods described in Imai and Khanna (2016) "Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records" Political Analysis <doi:10.1093/pan/mpw001> and Imai, Olivella, and Rosenman (2022) "Addressing census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements" <doi:10.1126/sciadv.adc9824>. The package also incorporates the data described in Rosenman, Olivella, and Imai (2023) "Race and ethnicity data for first, middle, and surnames" <doi:10.1038/s41597-023-02202-2>.
License: GPL (≥ 3)
URL: https://github.com/kosukeimai/wru
BugReports: https://github.com/kosukeimai/wru/issues
Depends: R (≥ 4.1.0), utils
Imports: cli, dplyr, tidyr, furrr, future, piggyback (≥ 0.1.4), PL94171, purrr, Rcpp, rlang
Suggests: covr, testthat (≥ 3.0.0), tidycensus
LinkingTo: Rcpp, RcppArmadillo
Config/testthat/edition: 3
Encoding: UTF-8
LazyData: yes
LazyDataCompression: xz
LazyLoad: yes
RoxygenNote: 7.3.1
NeedsCompilation: yes
Packaged: 2024-05-24 16:06:47 UTC; beb
Author: Kabir Khanna [aut], Brandon Bertelsen [aut, cre], Santiago Olivella [aut], Evan Rosenman [aut], Alexander Rossell Hayes [aut], Kosuke Imai [aut]
Maintainer: Brandon Bertelsen <brandon@bertelsen.ca>
Repository: CRAN
Date/Publication: 2024-05-24 18:00:02 UTC

Pre-process vector of names to match census style. Internal function

Description

Pre-process vector of names to match census style. Internal function

Usage

.name_preproc(voter_names, target_names)

Arguments

voter_names

Character vector to be pre-processed.

target_names

Character vector of census names to be matched.

Value

A character vector of pre-processed names.


Convert between state names, postal abbreviations, and FIPS codes

Description

Convert between state names, postal abbreviations, and FIPS codes

Usage

as_fips_code(x)

as_state_abbreviation(x)

Arguments

x

A numeric or character vector of state names, postal abbreviations, or FIPS codes. Matches for state names and abbreviations are not case sensitive. FIPS codes may be matched from numeric or character vectors, with or without leading zeroes.

Value

as_fips_code()

A character vector of two-digit FIPS codes. One-digit FIPS codes are prefixed with a leading zero, e.g., "06" for California.

as_state_abbreviation()

A character vector of two-letter postal abbreviations, e.g., "CA" for California.

Examples

as_fips_code("california")
as_state_abbreviation("california")

# Character vector matches ignore case
as_fips_code(c("DC", "Md", "va"))
as_state_abbreviation(c("district of columbia", "Maryland", "VIRGINIA"))

# Note that `3` and `7` are standardized to `NA`,
# because no state is assigned those FIPS codes
as_fips_code(1:10)
as_state_abbreviation(1:10)

# You can even mix methods in the same vector
as_fips_code(c("utah", "NM", 8, "04"))
as_state_abbreviation(c("utah", "NM", 8, "04"))


Preflight census data

Description

Preflight census data

Usage

census_data_preflight(census.data, census.geo, year)

Arguments

census.data

A list indexed by two-letter state abbreviations, which contains pre-saved Census geographic data. Can be generated using get_census_data function.

census.geo

An optional character vector specifying what level of geography to use to merge in U.S. Census geographic data. Currently "county", "tract", "block_group", "block", "place", and "zcta" are supported. Note: sufficient information must be in the user-defined voter.file object. If census.geo = "county", then voter.file must have a column named county. If census.geo = "tract", then voter.file must have columns named county and tract. If census.geo = "block", then voter.file must have columns named county, tract, and block. If census.geo = "place", then voter.file must have a column named place. If census.geo = "zcta", then voter.file must have a column named zcta. Specifying census.geo will call the census_helper function to merge Census geographic data at the specified level of geography.

year

An optional character vector specifying the year of U.S. Census geographic data to be downloaded. Use "2010" or "2020". Default is "2020".
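
As an illustration of the kind of consistency check a preflight step might perform, here is a minimal sketch in base R. It assumes census.data is a list keyed by state abbreviation whose elements carry a year entry and one data frame per geography. The helper name check_census_data and the toy data are hypothetical and are not the package's implementation.

```r
# Hypothetical helper: for each state, confirm that the requested
# geography is present and that the data were built for the requested year.
check_census_data <- function(census.data, census.geo, year = "2020") {
  vapply(census.data, function(state_data) {
    has_geo <- census.geo %in% names(state_data)
    same_year <- identical(as.character(state_data$year), as.character(year))
    has_geo && same_year
  }, logical(1))
}

# Toy input (made-up structure and values).
toy <- list(
  NJ = list(year = "2020", tract = data.frame(tract = "000100")),
  NY = list(year = "2010", tract = data.frame(tract = "000200"))
)
check_census_data(toy, census.geo = "tract", year = "2020")
```

States that fail the check would need to be downloaded again for the requested year or geography.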


Census Data download function.

Description

census_geo_api retrieves U.S. Census geographic data for a given state.

Usage

census_geo_api(
  key = Sys.getenv("CENSUS_API_KEY"),
  state,
  geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  age = FALSE,
  sex = FALSE,
  year = c("2020", "2010"),
  retry = 3,
  save_temp = NULL,
  counties = NULL
)

Arguments

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

state

A required character object specifying which state to extract Census data for, e.g., "NJ".

geo

A character object specifying what aggregation level to use. Use "block", "block_group", "county", "place", "tract", or "zcta". Default is "tract". Warning: extracting block-level data takes very long.

age

A TRUE/FALSE object indicating whether to condition on age or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Age | Race). If sex is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

sex

A TRUE/FALSE object indicating whether to condition on sex or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Sex | Race). If age is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

year

A character object specifying the year of U.S. Census data to be downloaded. Use "2010" or "2020". Default is "2020". Warning: 2020 U.S. Census data is downloaded only when age and sex are both FALSE.

retry

The number of retries at the census website if network interruption occurs.

save_temp

File indicating where to save the temporary outputs. Defaults to NULL. If specified, the function will look for an .RData file with the same format as the expected output.

counties

A vector of counties contained in your data. If NULL, all counties are pulled. Useful for smaller predictions where only a few counties are considered. Must be zero padded.

Details

This function allows users to download U.S. Census geographic data (2010 or 2020) for a particular state at the county, tract, block group, block, place, or ZCTA level.
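
To make the quantity Pr(Geolocation | Race) concrete, the following base R sketch computes it from toy counts. It assumes the probability is each geography's share of a group's statewide total. The counts and object names are made up for illustration, and this is not the package's code.

```r
# Toy tract-level population counts for one state (made-up numbers).
counts <- data.frame(
  tract = c("000100", "000200", "000300"),
  whi = c(600, 300, 100),
  bla = c(50, 150, 300)
)

# Pr(Geolocation | Race): the share of each group's statewide total
# that lives in each tract. Each probability column sums to 1.
geo_given_race <- counts
geo_given_race$whi <- counts$whi / sum(counts$whi)
geo_given_race$bla <- counts$bla / sum(counts$bla)
geo_given_race
```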

Value

Output will be an object of class list, indexed by state names. It will consist of the original user-input data with additional columns of Census geographic data.

References

Relies on get_census_api(), get_census_api_2(), and vec_to_chunk() functions authored by Nicholas Nagle, available here.

Examples


## Not run: 
census_geo_api(state = "NJ", geo = "block")
census_geo_api(state = "FL", geo = "tract", age = TRUE, sex = TRUE)
census_geo_api(
  state = "MA", geo = "place", age = FALSE, sex = FALSE,
  year = "2020"
)

## End(Not run)


Census geo API helper functions

Description

Census geo API helper functions

Usage

census_geo_api_names(
  year = c("2020", "2010", "2000"),
  age = FALSE,
  sex = FALSE
)

census_geo_api_url(year = c("2020", "2010", "2000"))

Arguments

year

A character object specifying the year of U.S. Census data to be downloaded. Use "2010" or "2020". Default is "2020". Warning: 2020 U.S. Census data is downloaded only when age and sex are both FALSE.

age

A TRUE/FALSE object indicating whether to condition on age or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Age | Race). If sex is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

sex

A TRUE/FALSE object indicating whether to condition on sex or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Sex | Race). If age is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

Value

census_geo_api_names()

A named list of character vectors whose values correspond to columns of a Census API table and whose names represent the new columns they are used to calculate in census_geo_api().

census_geo_api_url()

A character string containing the base of the URL to a Census API table.


Census helper function.

Description

census_helper links user-input dataset with Census geographic data.

Usage

census_helper(
  key = Sys.getenv("CENSUS_API_KEY"),
  voter.file,
  states = "all",
  geo = "tract",
  age = FALSE,
  sex = FALSE,
  year = "2020",
  census.data = NULL,
  retry = 3,
  use.counties = FALSE
)

Arguments

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

voter.file

An object of class data.frame. Must contain field(s) named county, tract, block, and/or place specifying geolocation. These should be character variables that match up with U.S. Census categories. County should be three characters (e.g., "031" not "31"), tract should be six characters, and block should be four characters. Place should be five characters if it is included.

states

A character vector specifying which states to extract Census data for, e.g. c("NJ", "NY"). Default is "all", which extracts Census data for all states contained in user-input data.

geo

A character object specifying what aggregation level to use. Use "county", "tract", "block" or "place". Default is "tract". Warning: extracting block-level data takes very long.

age

A TRUE/FALSE object indicating whether to condition on age or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Age | Race). If sex is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

sex

A TRUE/FALSE object indicating whether to condition on sex or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Sex | Race). If age is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

year

A character object specifying the year of U.S. Census data to be downloaded. Use "2010" or "2020". Default is "2020". Warning: 2020 U.S. Census data is downloaded only when age and sex are both FALSE.

census.data

An optional census object of class list containing pre-saved Census geographic data. Can be created using get_census_data function. If census.data is provided, the age element must have the same value as the age option specified in this function (i.e., TRUE in both or FALSE in both). Similarly, the sex element in the object provided in census.data must have the same value as the sex option here. Moreover, the year element in the object provided in census.data must have the same value as the year option in the function (i.e., "2010" in both or "2020" in both). If census.data is missing, Census geographic data will be obtained via Census API.

retry

The number of retries at the census website if network interruption occurs.

use.counties

A logical, defaulting to FALSE. Should census data be filtered by counties available in census.data?

Details

This function allows users to link their geocoded dataset (e.g., voter file) with U.S. Census data (2010 or 2020). The function extracts Census Summary File data at the county, tract, block, or place level. Census data calculated are Pr(Geolocation | Race) where geolocation is county, tract, block, or place.
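
Geographic identifiers often lose their leading zeros when a file is read in as numbers, which breaks the match against Census codes. The sketch below shows one way to restore the widths described for voter.file (three characters for county, six for tract, four for block) using base R. The toy data and the helper name pad are illustrative assumptions.

```r
# Toy voter file whose geographic codes were read in as numbers.
voters_raw <- data.frame(
  surname = c("Khanna", "Imai"),
  county = c(31, 5),
  tract = c(100, 980000),
  block = c(1000, 2001)
)

# Hypothetical helper: left-pad integer codes with zeros to a fixed width.
pad <- function(x, width) {
  formatC(as.integer(x), width = width, flag = "0", format = "d")
}

voters_raw$county <- pad(voters_raw$county, 3)
voters_raw$tract <- pad(voters_raw$tract, 6)
voters_raw$block <- pad(voters_raw$block, 4)
voters_raw
```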

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns of Census data.

Examples


## Not run: 
census_helper(voter.file = voters, states = "nj", geo = "block")

## End(Not run)
## Not run: 
census_helper(
  voter.file = voters, states = "all", geo = "tract",
  age = TRUE, sex = TRUE
)

## End(Not run)
## Not run: 
census_helper(
  voter.file = voters, states = "all", geo = "county",
  age = FALSE, sex = FALSE, year = "2020"
)

## End(Not run)


Census helper function.

Description

census_helper_new links user-input dataset with Census geographic data.

Usage

census_helper_new(
  key = Sys.getenv("CENSUS_API_KEY"),
  voter.file,
  states = "all",
  geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  age = FALSE,
  sex = FALSE,
  year = "2020",
  census.data = NULL,
  retry = 3,
  use.counties = FALSE,
  skip_bad_geos = FALSE
)

Arguments

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

voter.file

An object of class data.frame. Must contain field(s) named county, tract, block, and/or place specifying geolocation. These should be character variables that match up with U.S. Census categories. County should be three characters (e.g., "031" not "31"), tract should be six characters, and block should be four characters. Place should be five characters if it is included.

states

A character vector specifying which states to extract Census data for, e.g. c("NJ", "NY"). Default is "all", which extracts Census data for all states contained in user-input data.

geo

A character object specifying what aggregation level to use. Use "county", "tract", "block", or "place". Default is "tract". Warning: extracting block-level data takes very long.

age

A TRUE/FALSE object indicating whether to condition on age or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Age | Race). If sex is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

sex

A TRUE/FALSE object indicating whether to condition on sex or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Sex | Race). If age is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

year

A character object specifying the year of U.S. Census data to be downloaded. Use "2010" or "2020". Default is "2020".

census.data

An optional census object of class list containing pre-saved Census geographic data. Can be created using get_census_data function. If census.data is provided, the year element must have the same value as the year option specified in this function (i.e., "2010" in both or "2020" in both). If census.data is provided, the age and the sex elements must be FALSE. This corresponds to the defaults of census_geo_api. If census.data is missing, Census geographic data will be obtained via Census API.

retry

The number of retries at the census website if network interruption occurs.

use.counties

A logical, defaulting to FALSE. Should census data be filtered by counties available in census.data?

skip_bad_geos

Logical. Option to have the function skip any geolocations that are not present in the census data, returning a partial data set. Default is FALSE, in which case the function will stop and provide an error message with a list of offending geolocations.

Details

This function allows users to link their geocoded dataset (e.g., voter file) with U.S. Census data (2010 or 2020). The function extracts Census Summary File data at the county, tract, block, or place level. Census data calculated are Pr(Geolocation | Race) where geolocation is county, tract, block, or place.
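
The behavior controlled by skip_bad_geos can be pictured with a small base R sketch. It assumes geolocations are compared through a combined county and tract key. The data, the key helper, and the error message wording are all hypothetical and do not reproduce the package's internals.

```r
# Toy inputs: the third voter sits in a tract absent from the census table.
voters_geo <- data.frame(
  surname = c("Smith", "Garcia", "Nguyen"),
  county = c("001", "001", "003"),
  tract = c("000100", "000200", "999999")
)
census_tracts <- data.frame(
  county = c("001", "001", "003"),
  tract = c("000100", "000200", "000100")
)

# Build a combined key and flag rows with no census counterpart.
key <- function(d) paste(d$county, d$tract, sep = "_")
bad <- !(key(voters_geo) %in% key(census_tracts))

# Either stop and list the offending geolocations, or keep matched rows only.
skip_bad_geos <- TRUE
if (any(bad) && !skip_bad_geos) {
  stop(
    "Geolocations not found in census data: ",
    paste(unique(key(voters_geo)[bad]), collapse = ", ")
  )
}
matched <- voters_geo[!bad, ]
matched
```

Running a check like this before prediction shows how many records would be dropped when skipping is enabled.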

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns of Census data.

Examples


## Not run: census_helper_new(voter.file = voters, states = "nj", geo = "block")
## Not run: census_helper_new(voter.file = voters, states = "all", geo = "tract")
## Not run: census_helper_new(voter.file = voters, states = "all", geo = "place",
 year = "2020")
## End(Not run)


Legacy data formatting function.

Description

format_legacy_data formats legacy data from the U.S. census to allow for Bayesian name geocoding.

Usage

format_legacy_data(legacyFilePath, state, outFile = NULL)

Arguments

legacyFilePath

A character vector giving the location of a legacy census data folder, sourced from https://www2.census.gov/programs-surveys/decennial/2020/data/01-Redistricting_File--PL_94-171/. These file names should end in ".pl".

state

The two letter state postal code.

outFile

Optional character vector determining whether the formatted RData object should be saved. The filepath should end in ".RData".

Details

This function allows users to construct datasets for analysis using the census legacy data format. These data are available for the 2020 census at https://www2.census.gov/programs-surveys/decennial/2020/data/01-Redistricting_File--PL_94-171/. This function returns data structured analogously to data from the Census API, which is not yet available for the 2020 Census as of September 2021.

Examples

## Not run: 
gaCensusData <- format_legacy_data(PL94171::pl_url('GA', 2020), state = 'GA')
predict_race_new(ga.voter.file, names.to.use = 'last, first, mid', census.geo = 'block',
     census.data = gaCensusData)

## End(Not run)


Census API function.

Description

get_census_api obtains U.S. Census data via the public API.

Usage

get_census_api(
  data_url,
  key = Sys.getenv("CENSUS_API_KEY"),
  var.names,
  region,
  retry = 0
)

Arguments

data_url

URL root of the API, e.g., "https://api.census.gov/data/2020/dec/pl".

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

var.names

A character vector of variables to get, e.g., c("P2_005N", "P2_006N", "P2_007N", "P2_008N"). If there are more than 50 variables, then function will automatically split variables into separate queries.

region

Character object specifying which region to obtain data for. Must contain "for" and possibly "in", e.g., "for=block:1213&in=state:47+county:015+tract:*".

retry

The number of retries at the census website if network interruption occurs.

Details

This function obtains U.S. Census data via the public API. User can specify the variables and region(s) for which to obtain data.
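
The splitting of long variable lists into separate queries can be sketched as follows. The helper chunk_vars is a hypothetical stand-in written for illustration, and the variable names are made up; it is not the package's own chunking code.

```r
# Hypothetical helper: split a vector of variable names into chunks
# of at most `size` elements, preserving the original order.
chunk_vars <- function(vars, size = 50) {
  split(vars, ceiling(seq_along(vars) / size))
}

vars <- sprintf("P2_%03dN", 1:120) # made-up variable names
chunks <- chunk_vars(vars)
lengths(chunks)
```

Each chunk could then be requested separately and the resulting data frames combined by their shared geographic identifiers.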

Value

If successful, output will be an object of class data.frame. If unsuccessful, function prints the URL query that caused the error.

References

Based on code authored by Nicholas Nagle, which is available here.

Examples

## Not run: 
get_census_api(
  data_url = "https://api.census.gov/data/2020/dec/pl",
  var.names = c("P2_005N", "P2_006N", "P2_007N", "P2_008N"), region = "for=county:*&in=state:34"
)

## End(Not run)


Census API URL assembler.

Description

get_census_api_2 assembles URL components for get_census_api.

Usage

get_census_api_2(
  data_url,
  key = Sys.getenv("CENSUS_API_KEY"),
  get,
  region,
  retry = 3
)

Arguments

data_url

URL root of the API, e.g., "https://api.census.gov/data/2020/dec/pl".

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

get

A character vector of variables to get, e.g., c("P2_005N", "P2_006N", "P2_007N", "P2_008N"). If there are more than 50 variables, then function will automatically split variables into separate queries.

region

Character object specifying which region to obtain data for. Must contain "for" and possibly "in", e.g., "for=block:1213&in=state:47+county:015+tract:*".

retry

The number of retries at the census website if network interruption occurs.

Details

This function assembles the URL components and sends the request to the Census server. It is used by the get_census_api function. The user should not need to call this function directly.
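
As a rough picture of what assembling the URL components involves, the sketch below pastes a base URL, key, variable list, and region into a single query string. The helper build_census_url and the placeholder key are assumptions for illustration; the exact parameter order used by the package may differ.

```r
# Hypothetical helper: combine the pieces of a Census API request.
# "YOUR_KEY" is a placeholder, not a working key.
build_census_url <- function(data_url, key, get, region) {
  paste0(
    data_url, "?key=", key,
    "&get=", paste(get, collapse = ","),
    "&", region
  )
}

url <- build_census_url(
  data_url = "https://api.census.gov/data/2020/dec/pl",
  key = "YOUR_KEY",
  get = c("P2_005N", "P2_006N"),
  region = "for=county:*&in=state:34"
)
url
```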

Value

If successful, output will be an object of class data.frame. If unsuccessful, function prints the URL query that was constructed.

References

Based on code authored by Nicholas Nagle, which is available here.

Examples

## Not run: try(get_census_api_2(data_url = "https://api.census.gov/data/2020/dec/pl",
get = c("P2_005N", "P2_006N", "P2_007N", "P2_008N"), region = "for=county:*&in=state:34"))
## End(Not run)


Multilevel Census data download function.

Description

get_census_data returns county-, tract-, and block-level Census data for specified state(s). Using this function to download Census data in advance can save considerable time when running predict_race and census_helper.

Usage

get_census_data(
  key = Sys.getenv("CENSUS_API_KEY"),
  states,
  age = FALSE,
  sex = FALSE,
  year = "2020",
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  retry = 3,
  county.list = NULL
)

Arguments

key

A character string containing a valid Census API key, which can be requested from the U.S. Census API key signup page.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

states

A character vector specifying which states to extract Census data for, e.g., c("NJ", "NY").

age

A TRUE/FALSE object indicating whether to condition on age or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Age | Race). If sex is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

sex

A TRUE/FALSE object indicating whether to condition on sex or not. If FALSE (default), function will return Pr(Geolocation | Race). If TRUE, function will return Pr(Geolocation, Sex | Race). If age is also TRUE, function will return Pr(Geolocation, Age, Sex | Race).

year

A character object specifying the year of U.S. Census data to be downloaded. Use "2010" or "2020". Default is "2020". Warning: 2020 U.S. Census data is downloaded only when age and sex are both FALSE.

census.geo

An optional character vector specifying what level of geography to use to merge in U.S. Census 2010 geographic data. Currently "county", "tract", "block", and "place" are supported.

retry

The number of retries at the census website if network interruption occurs.

county.list

A named list of character vectors of counties present in your voter.file, per state.

Value

Output will be an object of class list indexed by state. Output will contain a subset of the following elements: state, age, sex, county, tract, block_group, block, and place.
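
One way to build the county.list argument from a voter file is sketched below in base R. It assumes the voter file has state and county columns with zero-padded county codes; the toy data are made up.

```r
# Toy voter file with state abbreviations and zero-padded county codes.
voters_cty <- data.frame(
  state = c("NJ", "NJ", "NY", "NJ"),
  county = c("001", "021", "061", "001")
)

# One character vector of unique county codes per state.
county.list <- lapply(split(voters_cty$county, voters_cty$state), unique)
county.list
```

The resulting named list could then be supplied as county.list, alongside states = names(county.list), to limit the download to the counties that appear in the data.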

Examples

## Not run: get_census_data(states = c("NJ", "NY"), age = TRUE, sex = FALSE)
## Not run: get_census_data(states = "MN", age = FALSE, sex = FALSE, year = "2020")

Surname probability merging function.

Description

merge_names merges names in a user-input dataset with corresponding race/ethnicity probabilities derived from both the U.S. Census Surname List and Spanish Surname List and voter files from states in the Southern U.S.

Usage

merge_names(
  voter.file,
  namesToUse,
  census.surname,
  table.surnames = NULL,
  table.first = NULL,
  table.middle = NULL,
  clean.names = TRUE,
  impute.missing = FALSE,
  model = "BISG"
)

Arguments

voter.file

An object of class data.frame. Must contain a row for each individual being predicted, as well as a field named last containing each individual's surname. If first name is also being used for prediction, the file must also contain a field named first. If middle name is also being used for prediction, the file must also contain a field named middle.

namesToUse

A character vector identifying which names to use for the prediction. The default value is "last", indicating that only the last name will be used. Other options are "last, first", indicating that both last and first names will be used, and "last, first, middle", indicating that last, first, and middle names will all be used.

census.surname

A TRUE/FALSE object. If TRUE, function will call merge_surnames to merge in Pr(Race | Surname) from U.S. Census Surname List (2000, 2010, or 2020) and Spanish Surname List. If FALSE, user must provide a name.dictionary (see below). Default is TRUE.

table.surnames

An object of class data.frame provided by the users as an alternative surname dictionary. It will consist of a list of U.S. surnames, along with the associated probabilities P(name | ethnicity) for ethnicities: white, Black, Hispanic, Asian, and other. Default is NULL. (last_name for U.S. surnames, p_whi_last for White, p_bla_last for Black, p_his_last for Hispanic, p_asi_last for Asian, p_oth_last for other).

table.first

See table.surnames.

table.middle

See table.surnames.

clean.names

A TRUE/FALSE object. If TRUE, any surnames in voter.file that cannot initially be matched to the database will be cleaned, according to U.S. Census specifications, in order to increase the chance of finding a match. Default is TRUE.

impute.missing

See predict_race.

model

See predict_race.

Details

This function allows users to match names in their dataset with database entries estimating P(name | ethnicity) for each of the five major racial groups for each name. The database probabilities are derived from both the U.S. Census Surname List and Spanish Surname List and voter files from states in the Southern U.S.

By default, the function matches names as follows:

  1. Search raw surnames in the database;

  2. Remove any punctuation and search again;

  3. Remove any spaces and search again;

  4. Remove suffixes (e.g., "Jr") and search again (last names only);

  5. Split double-barreled names into two parts and search first part of name;

  6. Split double-barreled names into two parts and search second part of name.

Each step only applies to names not matched in a previous step. Steps 2 through 6 are not applied if clean.names is FALSE.

Note: Any name appearing only on the Spanish Surname List is assigned a probability of 1 for Hispanics/Latinos and 0 for all other racial groups.
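
The matching cascade described above can be sketched in base R as a list of cleaning functions tried in order. The toy dictionary, the suffix list, and the regular expressions are assumptions chosen for illustration and are not the package's actual cleaning rules.

```r
# Toy dictionary of known surnames (upper case).
dictionary <- c("SMITH", "OBRIEN", "DELACRUZ", "GARCIA", "LOPEZ")

# Candidate transformations, tried in order until one yields a match.
cleaners <- list(
  raw = function(x) x,
  no_punct = function(x) gsub("[[:punct:]]", "", x),
  no_space = function(x) gsub("[[:punct:][:space:]]", "", x),
  no_suffix = function(x) {
    sub("[[:space:]]+(JR|SR|II|III|IV)$", "", gsub("[[:punct:]]", "", x))
  },
  first_part = function(x) sub("[- ].*$", "", x),
  second_part = function(x) sub("^.*[- ]", "", x)
)

# Return the matched dictionary entry and the step that produced it.
match_name <- function(name) {
  name <- toupper(trimws(name))
  for (step in names(cleaners)) {
    candidate <- cleaners[[step]](name)
    if (candidate %in% dictionary) {
      return(c(match = candidate, step = step))
    }
  }
  c(match = NA_character_, step = NA_character_)
}

examples <- c("Smith", "O'Brien", "De La Cruz", "Smith Jr.", "Garcia-Lopez")
t(sapply(examples, match_name))
```

Because the loop returns at the first successful step, later and more aggressive transformations only touch names that earlier steps could not match.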

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns that specify the part of the name matched with Census data (surname.match), and the probabilities Pr(Race | Surname) for each racial group (p_whi for White, p_bla for Black, p_his for Hispanic/Latino, p_asi for Asian and Pacific Islander, and p_oth for Other/Mixed).

Examples

data(voters)
## Not run: try(merge_names(voters, namesToUse = "surname", census.surname = TRUE))

Surname probability merging function.

Description

merge_surnames merges surnames in user-input dataset with corresponding race/ethnicity probabilities from U.S. Census Surname List and Spanish Surname List.

Usage

merge_surnames(
  voter.file,
  surname.year = 2020,
  name.data,
  clean.surname = TRUE,
  impute.missing = TRUE
)

Arguments

voter.file

An object of class data.frame. Must contain a field named 'surname' containing list of surnames to be merged with Census lists.

surname.year

An object of class numeric indicating which year Census Surname List is from. Accepted values are 2000, 2010, and 2020. Default is 2020.

name.data

An object of class data.frame. Must contain a leading column of surnames, and 5 subsequent columns, with Pr(Race | Surname) for each of the five major racial categories.

clean.surname

A TRUE/FALSE object. If TRUE, any surnames in voter.file that cannot initially be matched to surname lists will be cleaned, according to U.S. Census specifications, in order to increase the chance of finding a match. Default is TRUE.

impute.missing

A TRUE/FALSE object. If TRUE, race/ethnicity probabilities will be imputed for unmatched names using race/ethnicity distribution for all other names (i.e., not on Census List). Default is TRUE.

Details

This function allows users to match surnames in their dataset with the U.S. Census Surname List (from 2000 or 2010) and Spanish Surname List to obtain Pr(Race | Surname) for each of the five major racial groups.

By default, the function matches surnames to the Census list as follows:

  1. Search raw surnames in Census surname list;

  2. Remove any punctuation and search again;

  3. Remove any spaces and search again;

  4. Remove suffixes (e.g., Jr) and search again;

  5. Split double-barreled surnames into two parts and search first part of name;

  6. Split double-barreled surnames into two parts and search second part of name;

  7. For any remaining names, impute probabilities using distribution for all names not appearing on Census list.

Each step only applies to surnames not matched in a previous step. Steps 2 through 7 are not applied if clean.surname is FALSE.

Note: Any name appearing only on the Spanish Surname List is assigned a probability of 1 for Hispanics/Latinos and 0 for all other racial groups.
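
The final merge and imputation step can be pictured with the base R sketch below. It looks up each surname in a table of Pr(Race | Surname) and fills unmatched names with a fallback distribution. All probabilities, the fallback vector, and the object names are made up for illustration.

```r
# Toy surname table with Pr(Race | Surname); numbers are made up.
name.data <- data.frame(
  surname = c("SMITH", "GARCIA"),
  p_whi = c(0.70, 0.05), p_bla = c(0.23, 0.01), p_his = c(0.02, 0.92),
  p_asi = c(0.01, 0.01), p_oth = c(0.04, 0.01)
)

# Made-up distribution standing in for names absent from the list.
fallback <- c(p_whi = 0.60, p_bla = 0.10, p_his = 0.15, p_asi = 0.10, p_oth = 0.05)

voters_toy <- data.frame(surname = c("GARCIA", "XYZZY", "SMITH"))

# Look up each voter's surname; unmatched names get NA probabilities.
idx <- match(voters_toy$surname, name.data$surname)
probs <- name.data[idx, names(fallback)]

# Impute the fallback distribution for unmatched names.
unmatched <- is.na(idx)
for (col in names(fallback)) {
  probs[[col]][unmatched] <- fallback[[col]]
}

merged <- cbind(voters_toy, probs)
rownames(merged) <- NULL
merged
```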

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns that specify the part of the name matched with Census data (surname.match), and the probabilities Pr(Race | Surname) for each racial group (p_whi for White, p_bla for Black, p_his for Hispanic/Latino, p_asi for Asian and Pacific Islander, and p_oth for Other/Mixed).

Examples

data(voters)
## Not run: try(merge_surnames(voters))


Internal model fitting functions

Description

These functions are intended for internal use only. Users should use the predict_race() interface rather than any of these functions directly.

Usage

.predict_race_old(
  voter.file,
  census.surname = TRUE,
  surname.only = FALSE,
  surname.year = 2020,
  name.dictionaries = NULL,
  census.geo,
  census.key = Sys.getenv("CENSUS_API_KEY"),
  census.data = NULL,
  age = FALSE,
  sex = FALSE,
  year = "2020",
  party,
  retry = 3,
  impute.missing = TRUE,
  use.counties = FALSE
)

predict_race_new(
  voter.file,
  names.to.use,
  year = "2020",
  age = FALSE,
  sex = FALSE,
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  census.key = Sys.getenv("CENSUS_API_KEY"),
  name.dictionaries,
  surname.only = FALSE,
  census.data = NULL,
  retry = 0,
  impute.missing = TRUE,
  skip_bad_geos = FALSE,
  census.surname = FALSE,
  use.counties = FALSE
)

predict_race_me(
  voter.file,
  names.to.use,
  year = "2020",
  age = FALSE,
  sex = FALSE,
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  census.key = Sys.getenv("CENSUS_API_KEY"),
  name.dictionaries,
  surname.only = FALSE,
  census.data = NULL,
  retry = 0,
  impute.missing = TRUE,
  census.surname = FALSE,
  use.counties = FALSE,
  race.init,
  ctrl
)

Arguments

voter.file

See documentation in predict_race.

census.surname

See documentation in predict_race.

surname.only

See documentation in predict_race.

surname.year

See documentation in predict_race.

name.dictionaries

See documentation in predict_race.

census.geo

See documentation in predict_race.

census.key

A character object specifying user's Census API key. Required if census.geo is specified, because a valid Census API key is required to download Census geographic data.

By default, attempts to find a census key stored in an environment variable named CENSUS_API_KEY.

census.data

See documentation in predict_race.

age

See documentation in predict_race.

sex

See documentation in predict_race.

year

See documentation in predict_race.

party

See documentation in predict_race.

retry

See documentation in predict_race.

impute.missing

See documentation in predict_race.

use.counties

A logical, defaulting to FALSE. Should census data be filtered by counties available in census.data?

names.to.use

See documentation in race_predict.

skip_bad_geos

See documentation in race_predict.

race.init

See documentation in race_predict.

ctrl

See control in documentation for predict_race().

Details

These functions fit different versions of WRU. .predict_race_old fits the original WRU model, also known as BISG with a census-based surname dictionary. .predict_race_new fits a newer version of BISG that uses an augmented surname dictionary and can also use first and middle name information. Finally, .predict_race_me fits a fully Bayesian Improved Surname Geocoding model (fBISG), which corrects for measurement error from erroneous zeros in census tables and likewise accommodates the augmented surname dictionary and the first and middle name dictionaries when making predictions.
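To see why erroneous zeros in census tables matter, consider the minimal sketch below. All numbers are hypothetical, and the additive smoothing is only a simple stand-in for a prior over the true counts; it is not the sampler that fBISG actually uses.

```r
# Hypothetical inputs for one voter living in one census block.
p_race_given_surname <- c(whi = 0.10, bla = 0.80, his = 0.05, asi = 0.03, oth = 0.02)
block_counts <- c(whi = 12, bla = 0, his = 3, asi = 0, oth = 1)           # counts by race in the block
race_totals  <- c(whi = 1000, bla = 400, his = 300, asi = 200, oth = 100) # counts by race over all blocks
n_blocks <- 50                                                            # number of blocks (assumed)

# Combine the surname prior with a geographic likelihood and normalise.
posterior <- function(geo_lik) {
  unnorm <- p_race_given_surname * geo_lik
  unnorm / sum(unnorm)
}

# Plug-in estimate of Pr(block | race): a zero count forces a zero posterior,
# no matter how informative the surname is.
naive <- posterior(block_counts / race_totals)

# Additive smoothing (illustrative stand-in for measurement-error correction).
alpha <- 1
smoothed <- posterior((block_counts + alpha) / (race_totals + alpha * n_blocks))

round(rbind(naive, smoothed), 3)
```

In the naive row the probability for bla is exactly zero even though the surname prior puts 0.80 on it; the smoothed row keeps that category alive.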

Value

Output will be an object of class data.frame. It will consist of the original user-input voter.file with additional columns with predicted probabilities for each of the five major racial categories: pred.whi for White, pred.bla for Black, pred.his for Hispanic/Latino, pred.asi for Asian/Pacific Islander, and pred.oth for Other/Mixed.

.predict_race_old

Original WRU race prediction function, implementing classical BISG with census-based surname dictionary.

.predict_race_new

New race prediction function, implementing classical BISG with augmented surname dictionary, as well as first and middle name information.

.predict_race_me

New race prediction function, implementing fBISG (i.e. measurement error correction, fully Bayesian model) with augmented surname dictionary, as well as first and middle name information.


Race prediction function.

Description

predict_race makes probabilistic estimates of individual-level race/ethnicity.

Usage

predict_race(
  voter.file,
  census.surname = TRUE,
  surname.only = FALSE,
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  census.key = Sys.getenv("CENSUS_API_KEY"),
  census.data = NULL,
  age = FALSE,
  sex = FALSE,
  year = "2020",
  party = NULL,
  retry = 3,
  impute.missing = TRUE,
  skip_bad_geos = FALSE,
  use.counties = FALSE,
  model = "BISG",
  race.init = NULL,
  name.dictionaries = NULL,
  names.to.use = "surname",
  control = NULL
)

Arguments

voter.file

An object of class data.frame. Must contain a row for each individual being predicted, as well as a field named surname containing each individual's surname. If using geolocation in predictions, voter.file must contain a field named state, which contains the two-character abbreviation for each individual's state of residence (e.g., "nj" for New Jersey). If using Census geographic data in race/ethnicity predictions, voter.file must also contain at least one of the following fields: county, tract, block_group, block, and/or place. These fields should contain character strings matching U.S. Census categories. County is three characters (e.g., "031" not "31"), tract is six characters, block group is usually a single character and block is four characters. Place is five characters. See below for other optional fields.

census.surname

A TRUE/FALSE object. If TRUE, function will call merge_surnames to merge in Pr(Race | Surname) from U.S. Census Surname List (2000, 2010, or 2020) and Spanish Surname List. If FALSE, user must provide name.dictionaries (see below). Default is TRUE.

surname.only

A TRUE/FALSE object. If TRUE, race predictions will only use surname data and calculate Pr(Race | Surname). Default is FALSE.

census.geo

An optional character vector specifying what level of geography to use to merge in U.S. Census geographic data. Currently "county", "tract", "block_group", "block", "place", and "zcta" are supported. Note: sufficient information must be in the user-defined voter.file object. If census.geo = "county", then voter.file must have a column named county. If census.geo = "tract", then voter.file must have columns named county and tract. If census.geo = "block", then voter.file must have columns named county, tract, and block. If census.geo = "place", then voter.file must have a column named place. If census.geo = "zcta", then voter.file must have a column named zcta. Specifying census.geo will call the census_helper function to merge Census geographic data at the specified level of geography.

census.key

A character object specifying user's Census API key. Required if census.geo is specified, because a valid Census API key is required to download Census geographic data.

If NULL or left at the default, the function looks for a Census key stored in an environment variable named CENSUS_API_KEY.

census.data

A list indexed by two-letter state abbreviations, which contains pre-saved Census geographic data. Can be generated using get_census_data function.

age

An optional TRUE/FALSE object specifying whether to condition race predictions on age (in addition to surname and geolocation). Default is FALSE. Must be same as age in census.data object. May only be set to TRUE if census.geo option is specified. If TRUE, voter.file should include a numerical variable age.

sex

An optional TRUE/FALSE object specifying whether to condition race predictions on sex (in addition to surname and geolocation). Default is FALSE. Must be same as sex in census.data object. May only be set to TRUE if census.geo option is specified. If TRUE, voter.file should include a numerical variable sex, where sex is coded as 0 for males and 1 for females.

year

An optional character vector specifying the year of U.S. Census geographic data to be downloaded. Use "2010" or "2020". Default is "2020".

party

An optional character object specifying party registration field in voter.file, e.g., party = "PartyReg". If specified, race/ethnicity predictions will be conditioned on individual's party registration (in addition to geolocation). Whatever the name of the party registration field in voter.file, it should be coded as 1 for Democrat, 2 for Republican, and 0 for Other.

retry

The number of retries at the census website if network interruption occurs.

impute.missing

Logical, defaults to TRUE. Should missing values be imputed?

skip_bad_geos

Logical. Option to have the function skip any geolocations that are not present in the census data, returning a partial data set. Default is set to FALSE, in which case it will break and provide error message with a list of offending geolocations.

use.counties

A logical, defaulting to FALSE. Should census data be filtered by counties available in census.data?

model

Character string, either "BISG" (default) or "fBISG" (for error-correction, fully-Bayesian model).

race.init

Vector of initial race for each observation in voter.file. Must be an integer vector, with 1=white, 2=black, 3=hispanic, 4=asian, and 5=other. Defaults to values obtained using model="BISG_surname".

name.dictionaries

Optional named list of data.frames containing counts of names by race. Any of the following named elements are allowed: "surname", "first", "middle". When present, the objects must follow the same structure as last_c, first_c, mid_c, respectively.

names.to.use

One of 'surname', 'surname, first', or 'surname, first, middle'. Defaults to 'surname'.

control

List of control arguments only used when model="fBISG", including

iter

Number of MCMC iterations. Defaults to 1000.

burnin

Number of iterations discarded as burnin. Defaults to half of iter.

verbose

Print progress information. Defaults to TRUE.

me.correct

Boolean. Should the model correct measurement error in the race | geography counts? Defaults to TRUE.

seed

RNG seed. If NULL, a seed is generated and returned as an attribute for reproducibility.
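The fixed-width geographic codes that voter.file requires are easy to lose when FIPS codes have been read in as numbers. The sketch below shows one way to restore the leading zeros in base R; the records and code values are hypothetical.

```r
# Hypothetical raw records in which FIPS codes were parsed as numbers,
# so leading zeros have been dropped.
raw <- data.frame(
  surname = c("Khanna", "Imai"),
  state   = c("NJ", "NY"),
  county  = c(21, 61),
  tract   = c(4501, 10200),
  block   = c(1025, 3001)
)

# Pad each code back to the width described for voter.file:
# county = 3 characters, tract = 6, block = 4.
voter.file <- transform(
  raw,
  county = sprintf("%03d", county),
  tract  = sprintf("%06d", tract),
  block  = sprintf("%04d", block)
)

voter.file
```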

Details

This function implements the Bayesian race prediction methods outlined in Imai and Khanna (2016). The function produces probabilistic estimates of individual-level race/ethnicity, based on surname, geolocation, and party.
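The core Bayes' Rule update can be sketched in a few lines of base R. This is an illustration only, assuming surname and geolocation are conditionally independent given race, so that Pr(Race | Surname, Geo) is proportional to Pr(Race | Surname) * Pr(Geo | Race). The helper bisg_posterior and all input values are hypothetical and are not part of the package.

```r
# Hypothetical helper: rows are voters, columns are racial categories.
bisg_posterior <- function(p_race_given_surname, p_geo_given_race) {
  stopifnot(identical(dim(p_race_given_surname), dim(p_geo_given_race)))
  unnorm <- p_race_given_surname * p_geo_given_race   # numerator of Bayes' Rule
  post <- unnorm / rowSums(unnorm)                    # normalise within each voter
  colnames(post) <- paste0("pred.", colnames(p_race_given_surname))
  post
}

races <- c("whi", "bla", "his", "asi", "oth")

# Made-up Pr(Race | Surname) for two voters.
p_rs <- matrix(c(0.70, 0.10, 0.10, 0.05, 0.05,
                 0.05, 0.05, 0.85, 0.03, 0.02),
               nrow = 2, byrow = TRUE, dimnames = list(NULL, races))

# Made-up Pr(Geo | Race) for each voter's geography.
p_gr <- matrix(c(0.001, 0.004, 0.001, 0.001, 0.001,
                 0.002, 0.001, 0.003, 0.001, 0.001),
               nrow = 2, byrow = TRUE, dimnames = list(NULL, races))

post <- bisg_posterior(p_rs, p_gr)
round(post, 3)
```

Each row of the result sums to one and plays the role of the pred.* columns that predict_race returns.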

Value

Output will be an object of class data.frame. It will consist of the original user-input voter.file with additional columns with predicted probabilities for each of the five major racial categories: pred.whi for White, pred.bla for Black, pred.his for Hispanic/Latino, pred.asi for Asian/Pacific Islander, and pred.oth for Other/Mixed.
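A common follow-up is to extract the most likely category from the pred.* columns. The sketch below uses a hand-made data frame that mimics the documented output columns; the values and the race_hat column name are hypothetical. Note that a hard classification discards the uncertainty carried by the probabilities.

```r
# Hypothetical output shaped like the data.frame described above.
preds <- data.frame(
  surname  = c("A", "B"),
  pred.whi = c(0.05, 0.90),
  pred.bla = c(0.01, 0.04),
  pred.his = c(0.02, 0.03),
  pred.asi = c(0.85, 0.01),
  pred.oth = c(0.07, 0.02)
)

# Locate the probability columns and pick the largest one in each row.
prob_cols <- grep("^pred\\.", names(preds), value = TRUE)
best <- max.col(as.matrix(preds[prob_cols]), ties.method = "first")

# Store the winning category label without the "pred." prefix.
preds$race_hat <- sub("^pred\\.", "", prob_cols[best])
preds[c("surname", "race_hat")]
```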

Examples


data(voters)
try(predict_race(voter.file = voters, surname.only = TRUE))
## Not run: 
try(predict_race(voter.file = voters, census.geo = "tract"))

## End(Not run)
## Not run: 
try(predict_race(
  voter.file = voters, census.geo = "place", year = "2020"))

## End(Not run)
## Not run: 
CensusObj <- try(get_census_data(state = c("NY", "DC", "NJ")))
try(predict_race(
  voter.file = voters, census.geo = "tract", census.data = CensusObj, party = "PID"))

## End(Not run)
## Not run: 
CensusObj2 <- try(get_census_data(state = c("NY", "DC", "NJ"), age = TRUE, sex = TRUE))
try(predict_race(
  voter.file = voters, census.geo = "tract", census.data = CensusObj2, age = TRUE, sex = TRUE))

## End(Not run)
## Not run: 
CensusObj3 <- try(get_census_data(state = c("NY", "DC", "NJ"), census.geo = "place"))
try(predict_race(voter.file = voters, census.geo = "place", census.data = CensusObj3))

## End(Not run)


Collapsed Gibbs sampler for hWRU. Internal function

Description

Collapsed Gibbs sampler for hWRU. Internal function

Usage

sample_me(
  last_name,
  first_name,
  mid_name,
  geo,
  N_rg,
  pi_s,
  pi_f,
  pi_m,
  pi_nr,
  which_names,
  samples,
  burnin,
  race_init,
  verbose
)

Arguments

last_name

Integer vector of last name identifiers for each record (zero indexed; as all that follow). Must match column numbers in M_rs.

first_name

See last_name

mid_name

See last_name

geo

Integer vector of geographic units for each record. Must match column number in N_rg.

N_rg

Integer matrix of race | geography counts in census (geographies in columns).

pi_s

Numeric matrix of race | surname prior probabilities.

pi_f

Same as pi_s, but for first names.

pi_m

Same as pi_s, but for middle names.

pi_nr

Matrix of marginal probability distribution over missing names; non-keyword names default to this distribution.

which_names

Integer; 0 = surname only, 1 = surname and first name, 2 = surname, first, and middle names.

samples

Integer number of samples to take (in total).

burnin

Integer number of samples to discard as burn-in of Markov chain

race_init

Integer vector of initial race assignments

verbose

Boolean; should informative messages be printed?
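Because the name and geography identifiers are zero indexed while R vectors are one indexed, the conversion is worth spelling out. The sketch below maps surnames onto zero-indexed positions in a hypothetical dictionary; how the package itself encodes names missing from the dictionary is not shown here, so they are simply left as NA.

```r
# Hypothetical dictionary of keyword surnames and some voter records.
dictionary <- c("GARCIA", "SMITH", "NGUYEN")
records    <- c("SMITH", "GARCIA", "OKAFOR", "SMITH")

# match() returns one-indexed positions; subtract 1 for zero indexing.
# Names absent from the dictionary come back as NA.
last_name_id <- match(records, dictionary) - 1L
last_name_id
```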


Dataset with FIPS codes for US states

Description

Dataset including FIPS codes and postal abbreviations for each U.S. state, district, and territory.

Usage

state_fips

Format

A tibble with 57 rows and 3 columns:

state

Two-letter postal abbreviation

state_code

Two-digit FIPS code

state_name

English name

Source

Derived from the tidycensus::fips_codes dataset.
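A table with these three columns is typically used as a lookup from postal abbreviation to FIPS code. The sketch below uses a three-row stand-in with the same column names rather than the packaged dataset.

```r
# Three-row stand-in with the same columns as state_fips.
fips_toy <- data.frame(
  state      = c("DC", "NJ", "NY"),
  state_code = c("11", "34", "36"),
  state_name = c("District of Columbia", "New Jersey", "New York")
)

# Translate a vector of postal abbreviations into two-digit FIPS codes.
abbr  <- c("NY", "NJ", "NY")
codes <- fips_toy$state_code[match(abbr, fips_toy$state)]
codes
```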


Census Surname List (2000).

Description

Census Surname List from 2000 with race/ethnicity probabilities by surname.

Usage

surnames2000

Format

A data frame with 157,728 rows and 6 variables:

surname

Surname

p_whi

Pr(White | Surname)

p_bla

Pr(Black | Surname)

p_his

Pr(Hispanic/Latino | Surname)

p_asi

Pr(Asian/Pacific Islander | Surname)

p_oth

Pr(Other | Surname)

Examples

data(surnames2000)

Census Surname List (2010).

Description

Census Surname List from 2010 with race/ethnicity probabilities by surname.

Usage

surnames2010

Format

A data frame with 167,613 rows and 6 variables:

surname

Surname

p_whi

Pr(White | Surname)

p_bla

Pr(Black | Surname)

p_his

Pr(Hispanic/Latino | Surname)

p_asi

Pr(Asian/Pacific Islander | Surname)

p_oth

Pr(Other | Surname)

Examples

data(surnames2010)

Variable vector into chunks.

Description

vec_to_chunk takes a list of variables and collects them into 50-variable chunks.

Usage

vec_to_chunk(x)

Arguments

x

Character vector of variable names.

Details

This function takes a list of variable names and collects them into chunks of no more than 50 variables each. This works around the API limit of 50 variables per query. The user should not need to call this function directly.
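The chunking described above can be sketched in base R as follows. The helper chunk_vars and the variable names are hypothetical; this is one possible approach, not necessarily how vec_to_chunk is implemented.

```r
# Hypothetical helper: split a character vector into pieces of at most `size`.
chunk_vars <- function(x, size = 50L) {
  # ceiling(i / size) assigns elements 1..50 to group 1, 51..100 to group 2, ...
  unname(split(x, ceiling(seq_along(x) / size)))
}

vars   <- sprintf("V%03d", 1:120)   # 120 made-up variable names
chunks <- chunk_vars(vars)

lengths(chunks)   # sizes of the resulting chunks
```

With 120 names this yields two full chunks of 50 and a final chunk of 20, and concatenating the chunks recovers the original order.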

Value

Object of class list.

References

Based on code authored by Nicholas Nagle.

Examples

## Not run: 
vec_to_chunk(x = c(paste("P012F0", seq(10, 49), sep = ""),
              paste("P012I0", seq(10, 49), sep = "")))

## End(Not run)


Example voter file.

Description

An example dataset containing voter file information.

Usage

voters

Format

A data frame with 10 rows and 12 variables:

VoterID

Voter identifier (numeric)

surname

Surname

state

State of residence

CD

Congressional district

county

Census county (three-digit code)

first

First name

last

Last name or surname

tract

Census tract (six-digit code)

block

Census block (four-digit code)

precinct

Voting precinct

place

Voting place

age

Age in years

sex

0=male, 1=female

party

Party registration (character)

PID

Party registration (numeric)

Examples

data(voters)

Preflight for name data

Description

Checks whether the name data is available in the current working directory; if not, downloads it from GitHub using piggyback. By default, wru downloads the data to a temporary directory that lasts only as long as your session. However, you may wish to set the wru_data_wd option to save the downloaded data to your current working directory for more permanence.

Usage

wru_data_preflight()
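Setting the wru_data_wd option mentioned above might look like the following. The description names the option but not its expected value, so the logical TRUE used here is an assumption.

```r
# Assumption: the option is a logical flag; TRUE asks wru to keep the
# downloaded name data in the current working directory.
options(wru_data_wd = TRUE)

# Confirm the option is set for this session.
getOption("wru_data_wd")
```

Placing the options() call in a project-level .Rprofile would apply it to every session started in that project.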