Title: | Automatically Runs 23 Individual and 17 Ensembles of Models |
Version: | 0.8.0 |
Depends: | Cubist, Metrics, arm, brnn, broom, car, caret, corrplot, doParallel, dplyr, e1071, earth, gam, gbm, ggplot2, glmnet, graphics, grDevices, gridExtra, ipred, leaps, nnet, parallel, pls, purrr, randomForest, reactable, reactablefmtr, readr, rpart, stats, tidyr, tree, utils, xgboost, R (≥ 4.1.0) |
Description: | Automatically runs 23 individual models and 17 ensembles on numeric data. The package automatically returns complete results on all 40 models, 25 charts, multiple tables. The user simply provides the data, and answers a few questions (for example, how many times would you like to resample the data). From there the package randomly splits the data into train, test and validation sets, builds models on the training data, makes predictions on the test and validation sets, measures root mean squared error (RMSE), removes features above a user-set level of Variance Inflation Factor, and has several optional features including scaling all numeric data, four different ways to handle strings in the data. Perhaps the most significant feature is the package's ability to make predictions using the 40 pre trained models on totally new (untrained) data if the user selects that feature. This feature alone represents a very effective solution to the issue of reproducibility of models in data science. The package can also randomly resample the data as many times as the user sets, thus giving more accurate results than a single run. The graphs provide many results that are not typically found. For example, the package automatically calculates the Kolmogorov-Smirnov test for each of the 40 models and plots a bar chart of the results, a bias bar chart of each of the 40 models, as well as several plots for exploratory data analysis (automatic histograms of the numeric data, automatic histograms of the numeric data). The package also automatically creates a summary report that can be both sorted and searched for each of the 40 models, including RMSE, bias, train RMSE, test RMSE, validation RMSE, overfitting and duration. The best results on the holdout data typically beat the best results in data science competitions and published results for the same data set. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
LazyData: | true |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
URL: | http://www.NumericEnsembles.com, https://github.com/InfiniteCuriosity/NumericEnsembles |
BugReports: | https://github.com/InfiniteCuriosity/NumericEnsembles/issues |
NeedsCompilation: | no |
Packaged: | 2025-06-01 15:20:32 UTC; russellconte |
Author: | Russ Conte [aut, cre, cph] |
Maintainer: | Russ Conte <russconte@mac.com> |
Repository: | CRAN |
Date/Publication: | 2025-06-01 15:40:02 UTC |
Boston_housing data
Description
This is a modified version of the famous Boston housing data set. This data set includes rows 4:209 and 212:506. The data here is complete except for the data use to make New_Boston. The data first appeared in a paper by David Harrison, Jr. and Daniel L. Rubenfeld, Hedonic housing Prices and the demand for clean air. This was published in March, 1978. Journal of Environmental Economics and Management 5(1):81-102. The descriptions below are quoted from the original paper:
- crim
Crime rate by town. Original data in 1970 FBI data
- zn
Proportion of a town's residential land zoned for lots greater than 25,000 square feet
- indus
Proportional non-retail business per town
- chas
Captures the amenities of a riverside location and thus should be positive
- nox
Nitrogen oxygen concentrations in part per hundred million
- rm
Average number of rooms in owner units
- age
Proportion of owner units built prior to 1940
- dis
Weighted distances to five employment centers in the Boston region
- rad
Index of accessibility to radial highways
- tax
Full property value tax rate ($/$10,000)
- ptratio
Pupil-teacher ratio by town school district
- black
Black proportion of population
- lstat
Proportion of population that is lower status (proportion of adults without some high school education and proportion of male workers classified as laborers)
- medv
Median value of owner occupied homes, from the 1970 United States census
Usage
Boston_housing
Format
An object of class data.frame
with 501 rows and 14 columns.
Source
https://www.law.berkeley.edu/files/Hedonic.PDF
Concrete - This is the strength of concrete daa set originally posted on UCI
Description
Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.
Usage
Concrete
Format
Concrete A data frame with 1030 rows and 9 columns:
- Cement
quantitative – kg in a m3 mixture – Input Variable
- Blast_Furnace_Slag
quantitative – kg in a m3 mixture – Input Variable
- Fly_Ash
quantitative – kg in a m3 mixture – Input Variable
- Water
quantitative – kg in a m3 mixture – Input Variable
- Superplasticizer
quantitative – kg in a m3 mixture – Input Variable
- Coarse_Aggregate
quantitative – kg in a m3 mixture – Input Variable
- Fine_Aggregate
quantitative – kg in a m3 mixture – Input Variable
- Age
Day (1~365) – Input Variable
- Strength
quantitative – MPa – Output Variable
Source
https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength
Insurance - The data is from UCI
Description
This dataset contains detailed information about insurance customers, including their age, sex, body mass index (BMI), number of children, smoking status and region. Having access to such valuable insights allows analysts to get a better view into customer behaviour and the factors that contribute to their insurance charges.
Usage
Insurance
Format
Insurance A data frame with 1338 rows and 7 columns Credit to Bob Wakefield
- Age
The age of the customer. (Integer)
- Children
The number of children the customer has. (Integer)
- Smoker
Whether or not the customer is a smoker. (Boolean)
- Region
The region the customer lives in. (String)
- Charges
The insurance charges for the customer. (Float)
Source
https://www.kaggle.com/datasets/thedevastator/prediction-of-insurance-charges-using-age-gender
NewBoston—These are only the five rows c(1:3, 210:211) from Boston Housing data set. This can be used as new data, and the Boston_housing data set as the original. The numeric function will return predictions on the new data.
Description
This is the first five rows of the Boston housing data set, which have been removed from the Boston data set included here. It is otherwise identical to the Boston data set.
- crim
Crime rate by town. Original data in 1970 FBI data
- zn
Proportion of a town's residential land zoned for lots greater than 25,000 square feet
- indus
Proportional non-retail business per town
- chas
Captures the amenities of a riverside location and thus should be positive
- nox
Nitrogen oxygen concentrations in part per hundred million
- rm
Average number of rooms in owner units
- age
Proportion of owner units built prior to 1940
- dis
Weighted distances to five employment centers in the Boston region
- rad
Index of accessibility to radial highways
- tax
Full property value tax rate ($/$10,000)
- ptratio
Pupil-teacher ratio by town school district
- black
Black proportion of population
- lstat
Proportion of population that is lower status (proportion of adults without some high school education and proportion of male workers classified as laborers)
- medv
Median value of owner occupied homes, from the 1970 United States census
Usage
New_Boston
Format
An object of class data.frame
with 5 rows and 14 columns.
Source
https://www.law.berkeley.edu/files/Hedonic.PDF
Numeric—function to automatically build 23 individual models and 17 ensembles then return the results to the user
Description
Numeric—function to automatically build 23 individual models and 17 ensembles then return the results to the user
Usage
Numeric(
data,
colnum,
numresamples,
remove_VIF_above = 5,
remove_ensemble_correlations_greater_than = 0.98,
scale_all_predictors_in_data = c("Y", "N"),
data_reduction_method = c(0("none"), 1("BIC exhaustive"), 2("BIC forward"),
3("BIC backward"), 4("BIC seqrep"), 5("Mallows_cp exhaustive"),
6("Mallows_cp forward"), 7("Mallows_cp backward"), 8("Mallows_cp, seqrep")),
ensemble_reduction_method = c(0("none"), 1("BIC exhaustive"), 2("BIC forward"),
3("BIC backward"), 4("BIC seqrep"), 5("Mallows_cp exhaustive"),
6("Mallows_cp forward"), 7("Mallows_cp backward"), 8("Mallows_cp, seqrep")),
how_to_handle_strings = c(0("none"), 1("factor levels"), 2("One-hot encoding"),
3("One-hot encoding with jitter")),
predict_on_new_data = c("Y", "N"),
save_all_trained_models = c("Y", "N"),
save_all_plots = c("Y", "N"),
use_parallel = c("Y", "N"),
train_amount,
test_amount,
validation_amount
)
Arguments
data |
data can be a CSV file or within an R package, such as MASS::Boston |
colnum |
a column number in your data |
numresamples |
the number of resamples |
remove_VIF_above |
remove columns with Variable Inflation Factor above value chosen by the user |
remove_ensemble_correlations_greater_than |
maximum value for correlations of the ensemble |
scale_all_predictors_in_data |
"Y" or "N" to scale numeric data |
data_reduction_method |
0(none), BIC (1, 2, 3, 4) or Mallow's_cp (5, 6, 7, 8) for Forward, Backward, Exhaustive and SeqRep |
ensemble_reduction_method |
0(none), BIC (1, 2, 3, 4) or Mallow's_cp (5, 6, 7, 8) for Forward, Backward, Exhaustive and SeqRep |
how_to_handle_strings |
0: No strings, 1: Factor values, 2: One-hot encoding, 3: One-hot encoding AND jitter |
predict_on_new_data |
"Y" or "N". If "Y", then you will be asked for the new data |
save_all_trained_models |
"Y" or "N". If "Y", then places all the trained models in the Environment |
save_all_plots |
Saves all plots to the working directory |
use_parallel |
"Y" or "N" for parallel processing |
train_amount |
set the amount for the training data |
test_amount |
set the amount for the testing data |
validation_amount |
Set the amount for the validation data |
Value
a real number