Help for package arf

Title:

Adversarial Random Forests

Version:

0.2.4

Date:

2025-02-24

Maintainer:

Marvin N. Wright <cran@wrig.de>

Description:

Adversarial random forests (ARFs) recursively partition data into fully factorized leaves, where features are jointly independent. The procedure is iterative, with alternating rounds of generation and discrimination. Data becomes increasingly realistic at each round, until original and synthetic samples can no longer be reliably distinguished. This is useful for several unsupervised learning tasks, such as density estimation and data synthesis. Methods for both are implemented in this package. ARFs naturally handle unstructured data with mixed continuous and categorical covariates. They inherit many of the benefits of random forests, including speed, flexibility, and solid performance with default parameters. For details, see Watson et al. (2023) https://proceedings.mlr.press/v206/watson23a.html.

License:

GPL (≥ 3)

URL:

https://github.com/bips-hb/arf, https://bips-hb.github.io/arf/

BugReports:

https://github.com/bips-hb/arf/issues

Imports:

data.table, ranger, foreach, stringr, truncnorm

Encoding:

UTF-8

RoxygenNote:

7.3.2

Suggests:

ggplot2, doParallel, doFuture, mlbench, knitr, rmarkdown, tibble, palmerpenguins, testthat (≥ 3.0.0)

Config/testthat/edition:

VignetteBuilder:

knitr

NeedsCompilation:

Packaged:

2025-02-24 21:28:44 UTC; wright

Author:

Marvin N. Wright

[aut, cre], David S. Watson

[aut], Kristin Blesch

[aut], Jan Kapar

[aut]

Repository:

CRAN

Date/Publication:

2025-02-24 21:50:02 UTC

arf: Adversarial Random Forests

Description

Author(s)

Maintainer: Marvin N. Wright cran@wrig.de (ORCID)

Authors:

David S. Watson david.s.watson11@gmail.com (ORCID)
Kristin Blesch (ORCID)
Jan Kapar (ORCID)

Examples

# Train ARF and estimate leaf parameters
arf <- adversarial_rf(iris)
psi <- forde(arf, iris)

# Generate 100 synthetic samples from the iris dataset
x_synth <- forge(psi, n_synth = 100)

# Condition on Species = "setosa" and Sepal.Length > 6
evi <- data.frame(Species = "setosa",
                  Sepal.Length = "(6, Inf)")
x_synth <- forge(psi, n_synth = 100, evidence = evi)

# Estimate average log-likelihood
ll <- lik(psi, iris, arf = arf, log = TRUE)
mean(ll)

# Expectation of Sepal.Length for class setosa
evi <- data.frame(Species = "setosa")
expct(psi, query = "Sepal.Length", evidence = evi)

## Not run: 
# Parallelization with doParallel
doParallel::registerDoParallel(cores = 4)

# ... or with doFuture
doFuture::registerDoFuture()
future::plan("multisession", workers = 4)

## End(Not run)

Adversarial Random Forests

Description

Implements an adversarial random forest to learn independence-inducing splits.

Usage

adversarial_rf(
  x,
  num_trees = 10L,
  min_node_size = 2L,
  delta = 0,
  max_iters = 10L,
  early_stop = TRUE,
  prune = TRUE,
  verbose = TRUE,
  parallel = TRUE,
  ...
)

Arguments

x

Input data. Integer variables are recoded as ordered factors with a warning. See Details.

num_trees

Number of trees to grow in each forest. The default works well for most generative modeling tasks, but should be increased for likelihood estimation. See Details.

min_node_size

Minimal number of real data samples in leaf nodes.

delta

Tolerance parameter. Algorithm converges when OOB accuracy is < 0.5 + delta.

max_iters

Maximum iterations for the adversarial loop.

early_stop

Terminate loop if performance fails to improve from one round to the next?

prune

Impose min_node_size by pruning?

verbose

Print discriminator accuracy after each round? Will also show additional warnings.

parallel

Compute in parallel? Must register backend beforehand, e.g. via doParallel or doFuture; see examples.

...

Extra parameters to be passed to ranger.

Details

The adversarial random forest (ARF) algorithm partitions data into fully factorized leaves where features are jointly independent. ARFs are trained iteratively, with alternating rounds of generation and discrimination. In the first instance, synthetic data is generated via independent bootstraps of each feature, and a RF classifier is trained to distinguish between real and fake samples. In subsequent rounds, synthetic data is generated separately in each leaf, using splits from the previous forest. This creates increasingly realistic data that satisfies local independence by construction. The algorithm converges when a RF cannot reliably distinguish between the two classes, i.e. when OOB accuracy falls below 0.5 + delta.

ARFs are useful for several unsupervised learning tasks, such as density estimation (see forde) and data synthesis (see forge). For the former, we recommend increasing the number of trees for improved performance (typically on the order of 100-1000 depending on sample size).

Integer variables are recoded with a warning (set verbose = FALSE to silence these). Default behavior is to convert integer variables with six or more unique values to numeric, while those with up to five unique values are treated as ordered factors. To override this behavior, explicitly recode integer variables to the target type prior to training.

Note: convergence is not guaranteed in finite samples. The max_iters argument sets an upper bound on the number of training rounds. Similar results may be attained by increasing delta. Even a single round can often give good performance, but data with strong or complex dependencies may require more iterations. With the default early_stop = TRUE, the adversarial loop terminates if performance does not improve from one round to the next, in which case further training may be pointless.

Value

A random forest object of class ranger.

References

Watson, D., Blesch, K., Kapar, J., & Wright, M. (2023). Adversarial random forests for density estimation and generative modeling. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, pp. 5357-5375.

Examples

# Train ARF and estimate leaf parameters
arf <- adversarial_rf(iris)
psi <- forde(arf, iris)

# Generate 100 synthetic samples from the iris dataset
x_synth <- forge(psi, n_synth = 100)

# Condition on Species = "setosa" and Sepal.Length > 6
evi <- data.frame(Species = "setosa",
                  Sepal.Length = "(6, Inf)")
x_synth <- forge(psi, n_synth = 100, evidence = evi)

# Estimate average log-likelihood
ll <- lik(psi, iris, arf = arf, log = TRUE)
mean(ll)

# Expectation of Sepal.Length for class setosa
evi <- data.frame(Species = "setosa")
expct(psi, query = "Sepal.Length", evidence = evi)

## Not run: 
# Parallelization with doParallel
doParallel::registerDoParallel(cores = 4)

# ... or with doFuture
doFuture::registerDoFuture()
future::plan("multisession", workers = 4)

## End(Not run)

Compute conditional circuit parameters

Description

Compute conditional circuit parameters

Usage

cforde(
  params,
  evidence,
  row_mode = c("separate", "or"),
  nomatch = c("force", "na"),
  verbose = TRUE,
  stepsize = 0,
  parallel = TRUE
)

Arguments

params

Circuit parameters learned via forde.

evidence

Data frame of conditioning event(s).

row_mode

Interpretation of rows in multi-row conditions.

nomatch

What to do if no leaf matches a condition in evidence? Options are to force sampling from a random leaf ("force") or return NA ("na"). The default is "force".

verbose

Show warnings, e.g. when no leaf matches a condition?

stepsize

Stepsize defining number of condition rows handled in one for each step.

parallel

Compute in parallel? Must register backend beforehand, e.g. via doParallel or doFuture; see examples.

Value

List with conditions (evidence_input), prepared conditions (evidence_prepped) and leaves that match the conditions in evidence with continuous data (cnt) and categorical data (cat) as well as leaf info (forest).

Adaptive column renaming

Description

This function renames columns in case the input colnames includes any colnames required by internal functions (e.g., "y").

Usage

col_rename(cn, old_name)

Arguments

cn

Column names.

old_name

Name of column to be renamed.

Rename all problematic columns with col_rename().

Description

Rename all problematic columns with col_rename().

Usage

col_rename_all(cn)

Arguments

cn

Old column names.

Value

New columns names.

Shortcut likelihood function

Description

Calls adversarial_rf, forde and lik. For repeated application, it is faster to save outputs of adversarial_rf and forde and pass them via ... or directly use lik.

Usage

darf(x, query = NULL, ...)

Arguments

x

Input data. Integer variables are recoded as ordered factors with a warning. See Details.

query

Data frame of samples, optionally comprising just a subset of training features. See Details of lik. Is set to x if zero.

...

Extra parameters to be passed to adversarial_rf, forde and lik.

Value

A vector of likelihoods, optionally on the log scale. A dataset of n_synth synthetic samples or of nrow(x) synthetic samples if n_synth is undefined.

References

Examples

# Estimate log-likelihoods
ll <- darf(iris)

# Partial evidence query
ll <- darf(iris, query = iris[1, 1:3])

# Condition on Species = "setosa"
ll <- darf(iris, query = iris[1, 1:3], evidence = data.frame(Species = "setosa"))

Shortcut expectation function

Description

Calls adversarial_rf, forde and expct. For repeated application, it is faster to save outputs of adversarial_rf and forde and pass them via ... or directly use expct.

Usage

earf(x, ...)

Arguments

x

Input data. Integer variables are recoded as ordered factors with a warning. See Details.

...

Extra parameters to be passed to adversarial_rf, forde and expct.

Value

A one row data frame with values for all query variables.

References

Examples

# What is the expected values of each feature?
earf(iris)

#' # What is the expected values of Sepal.Length?
earf(iris, query = "Sepal.Length")

# What if we condition on Species = "setosa"?
earf(iris, query = "Sepal.Length", evidence = data.frame(Species = "setosa"))

Expected Value

Description

Compute the expectation of some query variable(s), optionally conditioned on some event(s).

Usage

expct(
  params,
  query = NULL,
  evidence = NULL,
  evidence_row_mode = c("separate", "or"),
  round = FALSE,
  nomatch = c("force", "na"),
  verbose = TRUE,
  stepsize = 0,
  parallel = TRUE
)

Arguments

params

Circuit parameters learned via forde.

query

Optional character vector of variable names. Estimates will be computed for each. If NULL, all variables other than those in evidence will be estimated. If evidence contains NAs, those values will be imputed and a full dataset is returned.

evidence

Optional set of conditioning events. This can take one of three forms: (1) a partial sample, i.e. a single row of data with some but not all columns; (2) a data frame of conditioning events, which allows for inequalities and intervals; or (3) a posterior distribution over leaves. See Details and Examples.

evidence_row_mode

Interpretation of rows in multi-row evidence. If "separate", each row in evidence is a unique conditioning event for which n_synth synthetic samples are generated. If "or", the rows are combined with a logical OR. See Examples.

round

Round continuous variables to their respective maximum precision in the real data set?

nomatch

What to do if no leaf matches a condition in evidence? Options are to force sampling from a random leaf ("force") or return NA ("na"). The default is "force".

verbose

Show warnings, e.g. when no leaf matches a condition?

stepsize

How many rows of evidence should be handled at each step? Defaults to nrow(evidence) / num_registered_workers for parallel == TRUE.

parallel

Compute in parallel? Must register backend beforehand, e.g. via doParallel or doFuture; see Examples.

Details

This function computes expected values for any subset of features, optionally conditioned on some event(s).

There are three methods for (optionally) encoding conditioning events via the evidence argument. The first is to provide a partial sample, where some columns from the training data are missing or set to NA. The second is to provide a data frame with condition events. This supports inequalities and intervals. Alternatively, users may directly input a pre-calculated posterior distribution over leaves, with columns f_idx and wt. This may be preferable for complex constraints. See Examples.

Please note that results for continuous features which are both included in query and in evidence with an interval condition are currently inconsistent.

Value

A one row data frame with values for all query variables.

References

Examples

# Train ARF and estimate leaf parameters
arf <- adversarial_rf(iris)
psi <- forde(arf, iris)

# What is the expected value of Sepal.Length?
expct(psi, query = "Sepal.Length")

# What if we condition on Species = "setosa"?
evi <- data.frame(Species = "setosa")
expct(psi, query = "Sepal.Length", evidence = evi)

# Compute expectations for all features other than Species
expct(psi, evidence = evi)

# Condition on Species = "setosa" and Petal.Width > 0.3
evi <- data.frame(Species = "setosa", 
                  Petal.Width = ">0.3")
expct(psi, evidence = evi)

# Condition on first two rows with some missing values
evi <- iris[1:2,]
evi[1, 1] <- NA_real_
evi[1, 5] <- NA_character_
evi[2, 2] <- NA_real_
x_synth <- expct(psi, evidence = evi)

## Not run: 
# Parallelization with doParallel
doParallel::registerDoParallel(cores = 4)

# ... or with doFuture
doFuture::registerDoFuture()
future::plan("multisession", workers = 4)

## End(Not run)

Forests for Density Estimation

Description

Uses a pre-trained ARF model to estimate leaf and distribution parameters.

Usage

forde(
  arf,
  x,
  oob = FALSE,
  family = "truncnorm",
  finite_bounds = c("no", "local", "global"),
  alpha = 0,
  epsilon = 0,
  parallel = TRUE
)

Arguments

arf

Pre-trained adversarial_rf. Alternatively, any object of class ranger.

x

Training data for estimating parameters.

oob

Only use out-of-bag samples for parameter estimation? If TRUE, x must be the same dataset used to train arf. Set to "inbag" to only use in-bag samples. Default is FALSE, i.e. use all observations.

family

Distribution to use for density estimation of continuous features. Current options include truncated normal (the default family = "truncnorm") and uniform (family = "unif"). See Details.

finite_bounds

Impose finite bounds on all continuous variables? If "local", infinite bounds are set to empirical extrema within leaves. If "global", infinite bounds are set to global empirical extrema. if "no" (the default), infinite bounds are left unchanged.

alpha

Optional pseudocount for Laplace smoothing of categorical features. This avoids zero-mass points when test data fall outside the support of training data. Effectively parameterizes a flat Dirichlet prior on multinomial likelihoods.

epsilon

Optional slack parameter on empirical bounds when finite_bounds != "no". This avoids zero-density points when test data fall outside the support of training data. The gap between lower and upper bounds is expanded by a factor of 1 + epsilon.

parallel

Compute in parallel? Must register backend beforehand, e.g. via doParallel or doFuture; see examples.

Details

forde extracts leaf parameters from a pretrained forest and learns distribution parameters for data within each leaf. The former includes coverage (proportion of data falling into the leaf) and split criteria. The latter includes proportions for categorical features and mean/variance for continuous features. The result is a probabilistic circuit, stored as a data.table, which can be used for various downstream inference tasks.

Currently, forde only provides support for a limited number of distributional families: truncated normal or uniform for continuous data, and multinomial for discrete data.

Though forde was designed to take an adversarial random forest as input, the function's first argument can in principle be any object of class ranger. This allows users to test performance with alternative pipelines (e.g., with supervised forest input). There is also no requirement that x be the data used to fit arf, unless oob = TRUE. In fact, using another dataset here may protect against overfitting. This connects with Wager & Athey's (2018) notion of "honest trees".

Value

A list with 5 elements: (1) parameters for continuous data; (2) parameters for discrete data; (3) leaf indices and coverage; (4) metadata on variables; and (5) the data input class. This list is used for estimating likelihoods with lik and generating data with forge.

References

Wager, S. & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc., 113(523): 1228-1242.

Examples

# Train ARF and estimate leaf parameters
arf <- adversarial_rf(iris)
psi <- forde(arf, iris)

# Generate 100 synthetic samples from the iris dataset
x_synth <- forge(psi, n_synth = 100)

# Condition on Species = "setosa" and Sepal.Length > 6
evi <- data.frame(Species = "setosa",
                  Sepal.Length = "(6, Inf)")
x_synth <- forge(psi, n_synth = 100, evidence = evi)

# Estimate average log-likelihood
ll <- lik(psi, iris, arf = arf, log = TRUE)
mean(ll)

# Expectation of Sepal.Length for class setosa
evi <- data.frame(Species = "setosa")
expct(psi, query = "Sepal.Length", evidence = evi)

## Not run: 
# Parallelization with doParallel
doParallel::registerDoParallel(cores = 4)

# ... or with doFuture
doFuture::registerDoFuture()
future::plan("multisession", workers = 4)

## End(Not run)

Forests for Generative Modeling

Description

Uses pre-trained FORDE model to simulate synthetic data.

Usage

forge(
  params,
  n_synth,
  evidence = NULL,
  evidence_row_mode = c("separate", "or"),
  round = TRUE,
  sample_NAs = FALSE,
  nomatch = c("force", "na"),
  verbose = TRUE,
  stepsize = 0,
  parallel = TRUE
)

Arguments

params

Circuit parameters learned via forde.

n_synth

Number of synthetic samples to generate.

evidence

Optional set of conditioning events. This can take one of three forms: (1) a partial sample, i.e. a single row of data with some but not all columns; (2) a data frame of conditioning events, which allows for inequalities; or (3) a posterior distribution over leaves. See Details.

evidence_row_mode

round

Round continuous variables to their respective maximum precision in the real data set?

sample_NAs

Sample NAs respecting the probability for missing values in the original data?

nomatch

What to do if no leaf matches a condition in evidence? Options are to force sampling from a random leaf ("force") or return NA ("na"). The default is "force".

verbose

Show warnings, e.g. when no leaf matches a condition?

stepsize

How many rows of evidence should be handled at each step? Defaults to nrow(evidence) / num_registered_workers for parallel == TRUE.

parallel

Compute in parallel? Must register backend beforehand, e.g. via doParallel or doFuture; see examples.

Details

forge simulates a synthetic dataset of n_synth samples. First, leaves are sampled in proportion to either their coverage (if evidence = NULL) or their posterior probability. Then, each feature is sampled independently within each leaf according to the probability mass or density function learned by forde. This will create realistic data so long as the adversarial RF used in the previous step satisfies the local independence criterion. See Watson et al. (2023).

Value

A dataset of n_synth synthetic samples.

References

Examples

# Train ARF and estimate leaf parameters
arf <- adversarial_rf(iris)
psi <- forde(arf, iris)

# Generate 100 synthetic samples from the iris dataset
x_synth <- forge(psi, n_synth = 100)

# Condition on Species = "setosa"
evi <- data.frame(Species = "setosa")
x_synth <- forge(psi, n_synth = 100, evidence = evi)

# Condition on Species = "setosa" and Sepal.Length > 6
evi <- data.frame(Species = "setosa",
                  Sepal.Length = "(6, Inf)")
x_synth <- forge(psi, n_synth = 100, evidence = evi)

# Alternative syntax for </> conditions
evi <- data.frame(Sepal.Length = ">6")
x_synth <- forge(psi, n_synth = 100, evidence = evi)

# Negation condition, i.e. all classes except "setosa"
evi <- data.frame(Species = "!setosa")
x_synth <- forge(psi, n_synth = 100, evidence = evi)

# Condition on first two data rows with some missing values
evi <- iris[1:2,]
evi[1, 1] <- NA_real_
evi[1, 5] <- NA_character_
evi[2, 2] <- NA_real_
x_synth <- forge(psi, n_synth = 1, evidence = evi)

# Or just input some distribution on leaves
# (Weights that do not sum to unity are automatically scaled)
n_leaves <- nrow(psi$forest)
evi <- data.frame(f_idx = psi$forest$f_idx, wt = rexp(n_leaves))
x_synth <- forge(psi, n_synth = 100, evidence = evi)

## Not run: 
# Parallelization with doParallel
doParallel::registerDoParallel(cores = 4)

# ... or with doFuture
doFuture::registerDoFuture()
future::plan("multisession", workers = 4)

## End(Not run)

Missing value imputation with ARF

Description

Perform single or multiple imputation with ARFs. Calls adversarial_rf, forde and expct/forge.

Usage

impute(
  x,
  m = 1,
  expectation = ifelse(m == 1, TRUE, FALSE),
  num_trees = 100L,
  min_node_size = 10L,
  round = TRUE,
  finite_bounds = "local",
  epsilon = 1e-14,
  verbose = FALSE,
  ...
)

Arguments

x

Input data.

m

Number of imputed datasets to generate. The default is single imputation (m = 1).

expectation

Return expected value instead of multiple imputations. By default, for single imputation (m = 1), the expected value is returned.

num_trees

Number of trees to grow in the ARF.

min_node_size

Minimal number of real data samples in leaf nodes.

round

Round continuous variables to their respective maximum precision in the real data set?

finite_bounds

Impose finite bounds on all continuous variables? See forde.

epsilon

Slack parameter on empirical bounds; see forde.

verbose

Print progress for adversarial_rf?

...

Extra parameters to be passed to adversarial_rf, forde and expct/forge.

Value

Imputed data. A single dataset is returned for m = 1, a list of datasets for m > 1.

Examples

# Generate some missings
iris_na <- iris
for (j in 1:ncol(iris)) {
  iris_na[sample(1:nrow(iris), 5), j] <- NA
}

# Single imputation
iris_imputed <- arf::impute(iris_na, num_trees = 10, m = 1)

# Multiple imputation
iris_imputed <- arf::impute(iris_na, num_trees = 10, m = 10)

## Not run: 
# Parallelization with doParallel
doParallel::registerDoParallel(cores = 4)

# ... or with doFuture
doFuture::registerDoFuture()
future::plan("multisession", workers = 4)

## End(Not run)

Likelihood Estimation

Description

Compute the likelihood of input data, optionally conditioned on some event(s).

Usage

lik(
  params,
  query,
  evidence = NULL,
  arf = NULL,
  oob = FALSE,
  log = TRUE,
  batch = NULL,
  parallel = TRUE
)

Arguments

params

Circuit parameters learned via forde.

query

Data frame of samples, optionally comprising just a subset of training features. Likelihoods will be computed for each sample. Missing features will be marginalized out. See Details.

evidence

Optional set of conditioning events. This can take one of three forms: (1) a partial sample, i.e. a single row of data with some but not all columns; (2) a data frame of conditioning events, which allows for inequalities; or (3) a posterior distribution over leaves. See Details.

arf

Pre-trained adversarial_rf or other object of class ranger. This is not required but speeds up computation considerably for total evidence queries. (Ignored for partial evidence queries.)

oob

Only use out-of-bag leaves for likelihood estimation? If TRUE, x must be the same dataset used to train arf. Only applicable for total evidence queries.

log

Return likelihoods on log scale? Recommended to prevent underflow.

batch

Batch size. The default is to compute densities for all of queries in one round, which is always the fastest option if memory allows. However, with large samples or many trees, it can be more memory efficient to split the data into batches. This has no impact on results.

parallel

Compute in parallel? Must register backend beforehand, e.g. via doParallel or doFuture; see examples.

Details

This function computes the likelihood of input data, optionally conditioned on some event(s). Queries may be partial, i.e. covering some but not all features, in which case excluded variables will be marginalized out.

There are three methods for (optionally) encoding conditioning events via the evidence argument. The first is to provide a partial sample, where some but not all columns from the training data are present. The second is to provide a data frame with three columns: variable, relation, and value. This supports inequalities via relation. Alternatively, users may directly input a pre-calculated posterior distribution over leaves, with columns f_idx and wt. This may be preferable for complex constraints. See Examples.

Value

A vector of likelihoods, optionally on the log scale.

References

Examples

# Train ARF and estimate leaf parameters
arf <- adversarial_rf(iris)
psi <- forde(arf, iris)

# Estimate average log-likelihood
ll <- lik(psi, iris, arf = arf, log = TRUE)
mean(ll)

# Identical but slower
ll <- lik(psi, iris, log = TRUE)
mean(ll)

# Partial evidence query
lik(psi, query = iris[1, 1:3])

# Condition on Species = "setosa"
evi <- data.frame(Species = "setosa")
lik(psi, query = iris[1, 1:3], evidence = evi)

# Condition on Species = "setosa" and Petal.Width > 0.3
evi <- data.frame(Species = "setosa", 
                  Petal.Width = ">0.3")
lik(psi, query = iris[1, 1:3], evidence = evi)

## Not run: 
# Parallelization with doParallel
doParallel::registerDoParallel(cores = 4)

# ... or with doFuture
doFuture::registerDoFuture()
future::plan("multisession", workers = 4)

## End(Not run)

Post-process data

Description

This function prepares output data for forge.

Usage

post_x(x, params, round = TRUE)

Arguments

x

Input data.frame.

params

Circuit parameters learned via forde.

round

Round continuous variables to their respective maximum precision in the real data set?

Preprocess conditions

Description

This function prepares conditions for computing conditional circuit paramaters via cforde

Usage

prep_cond(evidence, params, row_mode)

Arguments

evidence

Optional set of conditioning events.

params

Circuit parameters learned via forde.

row_mode

Interpretation of rows in multi-row conditions.

Preprocess input data

Description

This function prepares input data for ARFs.

Usage

prep_x(x, verbose = TRUE)

Arguments

x

Input data.frame.

verbose

Show warning if recoding integers?

Shortcut sampling function

Description

Calls adversarial_rf, forde and forge. For repeated application, it is faster to save outputs of adversarial_rf and forde and pass them via ... or directly use forge.

Usage

rarf(x, n_synth = NULL, ...)

Arguments

x

Input data. Integer variables are recoded as ordered factors with a warning. See Details.

n_synth

Number of synthetic samples to generate for unconditional generation with no evidence given. Number of synthetic samples to generate per evidence row if evidence is provided. If NULL, defaults to nrow(x) if no evidence is provided and to 1 otherwise.

...

Extra parameters to be passed to adversarial_rf, forde and forge.

Value

A dataset of n_synth synthetic samples or of nrow(x) synthetic samples if n_synth is undefined.

References

Examples

# Generate 150 (size of original iris dataset) synthetic samples from the iris dataset
x_synth <- rarf(iris)

# Generate 100 synthetic samples from the iris dataset
x_synth <- rarf(iris, n_synth = 100)

# Condition on Species = "setosa"
x_synth <- rarf(iris, evidence = data.frame(Species = "setosa"))

Safer version of sample()

Description

Safer version of sample()

Usage

resample(x, ...)

Arguments

x

A vector of one or more elements from which to choose.

...

Further arguments for sample().

Value

A vector of length size with elements drawn from x.

which.max() with random at ties

Description

which.max() with random at ties

Usage

which.max.random(x)

Arguments

x

A numeric vector.

Value

Index of maximum value in x, with random tie-breaking.

arf: Adversarial Random Forests

Description

Author(s)

See Also

Examples

Adversarial Random Forests

Description

Usage

Arguments

Details

Value

References

See Also

Examples

Compute conditional circuit parameters

Description

Usage

Arguments

Value

Adaptive column renaming

Description

Usage

Arguments

Rename all problematic columns with col_rename().

Description

Usage

Arguments

Value

Shortcut likelihood function

Description

Usage

Arguments

Value

References

See Also

Examples

Shortcut expectation function

Description

Usage

Arguments

Value

References

See Also

Examples

Expected Value

Description

Usage

Arguments

Details

Value

References

See Also

Examples

Forests for Density Estimation

Description

Usage

Arguments

Details

Value

References

See Also

Examples

Forests for Generative Modeling

Description

Usage

Arguments

Details

Value

References

See Also

Examples

Missing value imputation with ARF

Description

Usage

Arguments

Value

See Also

Examples

Likelihood Estimation

Description