This package covers common aspects of predictive modeling.
The main purpose of this package is to teach predictive modeling through a practical toolbox of functions and concepts, aimed at people who are starting out in data science, with either small or big data. It places special focus on understanding results and analysis.
Overview: The quantity of zeros, NA, and unique values, as well as the data type, may lead to a good or bad model. Here is an approach to cover the very first step in data modeling.
## Loading needed libraries
library(funModeling)
data(heart_disease)
my_data_status=df_status(heart_disease)
##                  variable q_zeros p_zeros q_na p_na    type unique
## 1                     age       0    0.00    0 0.00 integer     41
## 2                  gender       0    0.00    0 0.00  factor      2
## 3              chest_pain       0    0.00    0 0.00  factor      4
## 4  resting_blood_pressure       0    0.00    0 0.00 integer     50
## 5       serum_cholestoral       0    0.00    0 0.00 integer    152
## 6     fasting_blood_sugar     258   85.15    0 0.00  factor      2
## 7         resting_electro     151   49.83    0 0.00  factor      3
## 8          max_heart_rate       0    0.00    0 0.00 integer     91
## 9             exer_angina     204   67.33    0 0.00 integer      2
## 10                oldpeak      99   32.67    0 0.00 numeric     40
## 11                  slope       0    0.00    0 0.00 integer      3
## 12      num_vessels_flour     176   58.09    4 1.32 integer      4
## 13                   thal       0    0.00    2 0.66  factor      3
## 14 heart_disease_severity     164   54.13    0 0.00 integer      5
## 15           exter_angina     204   67.33    0 0.00  factor      2
## 16      has_heart_disease       0    0.00    0 0.00  factor      2
q_zeros: quantity of zeros (p_zeros: as a percentage)
q_na: quantity of NA (p_na: as a percentage)
type: data type (for example factor, integer, or numeric)
unique: quantity of unique values
The function df_status takes a data frame and returns a status table, which helps to quickly remove unwanted cases.
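To make the columns of the status table concrete, here is a minimal base-R sketch that computes similar figures on a toy data frame. The helper name `status_sketch` and the toy data are invented for illustration; this is an approximation of the idea, not the df_status implementation, and it may differ from the package in edge cases.

```r
# Hypothetical helper (not part of funModeling): a base-R approximation
# of the columns shown in the status table.
status_sketch <- function(df) {
  n <- nrow(df)
  q_zeros <- sapply(df, function(x) sum(x == 0, na.rm = TRUE))
  q_na    <- sapply(df, function(x) sum(is.na(x)))
  data.frame(
    variable = names(df),
    q_zeros  = q_zeros,
    p_zeros  = round(100 * q_zeros / n, 2),
    q_na     = q_na,
    p_na     = round(100 * q_na / n, 2),
    type     = sapply(df, function(x) class(x)[1]),
    # NA is not counted as a distinct value here
    unique   = sapply(df, function(x) length(unique(x[!is.na(x)]))),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data invented for illustration.
toy <- data.frame(a = c(0, 0, 3, NA), b = c("x", "y", "y", "x"),
                  stringsAsFactors = FALSE)
toy_status <- status_sketch(toy)
print(toy_status)
```

In the toy data, column `a` has two zeros out of four rows (50%) and one NA (25%), which is the kind of signal used to decide whether a variable is worth keeping.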
Removing variables with a high number of NA/zeros
# Removing variables with more than 60% of zero values
vars_to_remove=subset(my_data_status, my_data_status$p_zeros > 60)
vars_to_remove["variable"]
##               variable
## 6  fasting_blood_sugar
## 9          exer_angina
## 15        exter_angina
## Keeping all except vars_to_remove 
heart_disease_2=heart_disease[, !(names(heart_disease) %in% vars_to_remove[,"variable"])]
Ordering data by percentage of zeros
my_data_status[order(-my_data_status$p_zeros),]
##                  variable q_zeros p_zeros q_na p_na    type unique
## 6     fasting_blood_sugar     258   85.15    0 0.00  factor      2
## 9             exer_angina     204   67.33    0 0.00 integer      2
## 15           exter_angina     204   67.33    0 0.00  factor      2
## 12      num_vessels_flour     176   58.09    4 1.32 integer      4
## 14 heart_disease_severity     164   54.13    0 0.00 integer      5
## 7         resting_electro     151   49.83    0 0.00  factor      3
## 10                oldpeak      99   32.67    0 0.00 numeric     40
## 1                     age       0    0.00    0 0.00 integer     41
## 2                  gender       0    0.00    0 0.00  factor      2
## 3              chest_pain       0    0.00    0 0.00  factor      4
## 4  resting_blood_pressure       0    0.00    0 0.00 integer     50
## 5       serum_cholestoral       0    0.00    0 0.00 integer    152
## 8          max_heart_rate       0    0.00    0 0.00 integer     91
## 11                  slope       0    0.00    0 0.00 integer      3
## 13                   thal       0    0.00    2 0.66  factor      3
## 16      has_heart_disease       0    0.00    0 0.00  factor      2
Constraint: Target variable must have only 2 values. If it has NA values, they will be removed.
Note: There are many ways to select the best variables to build a model; the one presented here is based on visual analysis.
cross_gender=cross_plot(heart_disease, str_input="gender", str_target="has_heart_disease")
 
The last two plots share the same data source, showing the distribution of has_heart_disease by gender. The one on the left shows percentages, while the one on the right shows absolute values.
The gender variable seems to be a good predictor, since the likelihood of having heart disease differs between the female and male groups. It gives an order to the data.
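There are a total of 97 females: 25 have heart disease (25/97 = 25.8%) and 72 do not.
There are a total of 206 males: 114 have heart disease (114/206 = 55.3%) and 92 do not.
Total cases: summing the values of the four bars gives 25+72+114+92=303.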
Note: What would have happened if, instead of rates of 25.8% vs. 55.3% (female vs. male), they had been more similar, like 30.2% vs. 30.6%? In that case the gender variable would have been much less relevant, since it would not separate the has_heart_disease event.
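The rates quoted above follow directly from the four bar heights. Here is a small check in R, assuming that 25 and 114 are the counts with heart disease for females and males respectively (the split that is consistent with the stated totals and percentages):

```r
# Bar heights quoted in the text; the yes/no split per gender is inferred
# from the stated totals (97 and 206) and rates (25.8% and 55.3%).
counts <- matrix(
  c(25, 72,
    114, 92),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  has_heart_disease = c("yes", "no"))
)

totals <- rowSums(counts)                            # cases per gender
rates  <- round(100 * counts[, "yes"] / totals, 1)   # % with heart disease

print(totals)
print(rates)
print(sum(counts))
```

The gap between the two rates is what makes gender look like a useful predictor; if the rates were nearly equal, the variable would separate the target poorly.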
Numerical variables should be binned before plotting them with a histogram; otherwise the plot conveys little information.
The package includes a function, equal_freq (inherited from the Hmisc package), which returns bins/buckets based on the equal-frequency criterion. That is, it tries to put the same number of rows in each bin.
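The equal-frequency idea can be sketched in base R with quantile and cut. The helper name `equal_freq_sketch` is hypothetical and this is not the package's implementation; in particular, the interval labels it produces differ from those printed by equal_freq.

```r
# Hypothetical helper: equal-frequency binning with base R only.
equal_freq_sketch <- function(x, n_bins) {
  # Cut points at evenly spaced quantiles, so each bin gets
  # roughly the same number of rows.
  probs  <- seq(0, 1, length.out = n_bins + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  # unique() drops duplicated cut points, which occur when many
  # rows share one value; fewer bins than requested may result.
  cut(x, breaks = breaks, include.lowest = TRUE)
}

x <- 1:100
bins <- equal_freq_sketch(x, n_bins = 4)
print(table(bins))
```

With heavily tied data (for example a variable that is zero for a third of the rows), exact equality of bin sizes is not achievable, which is why the text says the function only tries to balance the bins.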
For numerical variables, cross_plot sets auto_binning=T by default, which automatically calls the equal_freq function with n_bins=10 (or the closest number).
cross_plot(heart_disease, str_input="max_heart_rate", str_target="has_heart_disease")
 
If you don't want the automatic binning, set auto_binning=F in the cross_plot function.
For example, create oldpeak_2 based on equal frequency, with 3 buckets:
heart_disease$oldpeak_2=equal_freq(var=heart_disease$oldpeak, n_bins = 3)
summary(heart_disease$oldpeak_2)
## [0.0,0.2) [0.2,1.5) [1.5,6.2] 
##       106       107        90
Plotting the binned variable (auto_binning = F):
cross_oldpeak_2=cross_plot(heart_disease, str_input="oldpeak_2", str_target="has_heart_disease", auto_binning = F)
 
This new plot based on oldpeak_2 clearly shows that the likelihood of having heart disease increases as oldpeak_2 increases. Again, it gives an order to the data.
Converting the variable max_heart_rate into one with 10 bins:
heart_disease$max_heart_rate_2=equal_freq(var=heart_disease$max_heart_rate, n_bins = 10)
cross_plot(heart_disease, str_input="max_heart_rate_2", str_target="has_heart_disease")
 
At first glance, max_heart_rate_2 shows a negative and linear relationship; however, some buckets add noise to the relationship. For example, the bucket (141, 146] has a higher heart disease rate than the previous bucket, when a lower one was expected. This could be noise in the data.
Key note: One way to reduce the noise (at the cost of losing some information) is to split into fewer bins:
heart_disease$max_heart_rate_3=equal_freq(var=heart_disease$max_heart_rate, n_bins = 5)
cross_plot(heart_disease, str_input="max_heart_rate_3", str_target="has_heart_disease")
 
Conclusion: As can be seen, the relationship is now much cleaner and clearer. Bucket 'N' has a higher rate than bucket 'N+1', which implies a negative correlation.
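The "bucket N has a higher rate than bucket N+1" reading can also be checked numerically. The sketch below uses toy data invented for illustration (five buckets of ten rows each, not the heart_disease values) and base R only:

```r
# Toy data: 5 ordered buckets, 10 rows each, with 8, 6, 5, 3 and 1
# positive cases respectively (invented numbers).
positives <- c(8, 6, 5, 3, 1)
bucket <- factor(rep(paste0("b", 1:5), each = 10))
event  <- unlist(lapply(positives, function(k) rep(c(1, 0), times = c(k, 10 - k))))

# Event rate per bucket, in bucket order.
rates <- tapply(event, bucket, mean)
print(rates)

# TRUE when every bucket has a lower rate than the one before it.
strictly_decreasing <- all(diff(rates) < 0)
print(strictly_decreasing)
```

A noisy binning, such as the 10-bucket version discussed earlier, would fail this strict check even though the overall trend is still negative.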
How about saving the cross_plot result into a folder?
Just set the parameter path_out to the folder you want (it creates a new one if it doesn't exist).
cross_plot(heart_disease, str_input="max_heart_rate_3", str_target="has_heart_disease", path_out="my_plots")
It creates the folder my_plots in the working directory.
cross_plot on multiple variables
Imagine you want to run cross_plot for several variables at the same time. To achieve this, just define a vector containing the variable names.
If you want to analyze these 3 variables:
vars_to_analyze=c("age", "oldpeak", "max_heart_rate")
cross_plot(data=heart_disease, str_target="has_heart_disease", str_input=vars_to_analyze)
cross_plot is good for visualizing linear relationships, and it can give a hint about non-linear relationships.
Overview: Once the predictive model is developed with training data, it should be compared with test data (which the model has not seen before). Presented here is a wrapper for the ROC curve, the AUC (area under the ROC curve), and the KS (Kolmogorov-Smirnov) statistic.
## Training and test data. Percentage of training cases default value=80%.
index_sample=get_sample(data=heart_disease, percentage_tr_rows=0.8)
## Generating the samples
data_tr=heart_disease[index_sample,] 
data_ts=heart_disease[-index_sample,]
## Creating the model only with training data
fit_glm=glm(has_heart_disease ~ age + oldpeak, data=data_tr, family = binomial)
## Performance metrics for Training Data
model_performance(fit=fit_glm, data = data_tr, target_var = "has_heart_disease")
 
## 
## -----------
##  AUC   KS  
## ----- -----
## 0.759 0.406
## -----------
## Performance metrics for Test Data
model_performance(fit=fit_glm, data = data_ts, target_var = "has_heart_disease")
 
## 
## -----------
##  AUC   KS  
## ----- -----
## 0.748 0.456
## -----------
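For intuition about the two metrics, here is a minimal base-R sketch that computes AUC and KS from scores and binary labels. The helper names and the toy scores are invented for illustration, and this is not necessarily how model_performance computes its values. AUC is obtained through the rank-sum (Mann-Whitney) identity, and KS as the largest gap between the empirical score distributions of the two classes.

```r
# Hypothetical helpers, base R only.
auc_sketch <- function(score, label) {
  r  <- rank(score)            # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  # Fraction of (positive, negative) pairs ranked correctly.
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

ks_sketch <- function(score, label) {
  pos <- ecdf(score[label == 1])
  neg <- ecdf(score[label == 0])
  # Largest distance between the two cumulative distributions,
  # evaluated at every observed score.
  max(abs(pos(score) - neg(score)))
}

# Toy scores and labels invented for illustration.
score <- c(0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)
label <- c(0, 0, 1, 0, 0, 1, 1, 1)

auc_value <- auc_sketch(score, label)
ks_value  <- ks_sketch(score, label)
print(c(AUC = auc_value, KS = ks_value))
```

In practice the scores would be predicted probabilities, for example from predict(fit_glm, newdata = data_ts, type = "response"), with the label derived from has_heart_disease.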
Key notes
Final comments