Type: | Package |
Title: | Introduction to Probability, Statistics and R for Data-Based Sciences |
Version: | 1.0.0 |
Date: | 2024-04-10 |
Description: | Contains data sets, programmes and illustrations discussed in the book, "Introduction to Probability, Statistics and R: Foundations for Data-Based Sciences." Sahu (2024, isbn:9783031378645) describes the methods in detail. |
License: | GPL-3 |
Maintainer: | Sujit K. Sahu <S.K.Sahu@soton.ac.uk> |
URL: | https://www.sujitsahu.com |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
Imports: | methods, Rdpack, utils, ggplot2, extraDistr |
Depends: | R (≥ 4.1.0) |
Suggests: | xtable, GGally, magick, MASS, knitr, rmarkdown, testthat, tidyr, huxtable, RColorBrewer, markdown, BiocStyle |
VignetteBuilder: | knitr |
RdMacros: | Rdpack |
BugReports: | https://github.com/sujit-sahu/ipsRdbs/issues |
NeedsCompilation: | no |
Packaged: | 2024-04-10 16:55:36 UTC; sks |
Author: | Sujit K. Sahu |
Repository: | CRAN |
Date/Publication: | 2024-04-10 17:20:03 UTC |
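The package is distributed on CRAN (see the Repository field above). A minimal usage sketch, assuming a recent R installation with internet access:
# install.packages("ipsRdbs")   # install once from CRAN
library(ipsRdbs)                 # load the package
ls("package:ipsRdbs")            # list the data sets and functions documented below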
Age and value of 50 beanie baby toys
Description
Age and value of 50 beanie baby toys
Usage
beanie
Format
A data frame with 50 rows and 3 columns:
- name
Name of the toy
- age
Age of the toy in months
- value
Market value of the toy in US dollars
Source
Beanie world magazine
Examples
head(beanie)
summary(beanie)
plot(beanie$age, beanie$value, xlab="Age", ylab="Value", pch="*", col="red")
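# A possible follow-up (a sketch, not taken from the book's code): fit a simple
# linear regression of value on age and add the fitted line to the plot above.
beanie.lm <- lm(value ~ age, data=beanie)
abline(beanie.lm, col="blue")
summary(beanie.lm)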
Wealth, age and region of 225 billionaires in 1992, as reported in Fortune magazine
Description
Wealth, age and region of 225 billionaires in 1992, as reported in Fortune magazine
Usage
bill
Format
A data frame with 225 rows and three columns:
- wealth
Wealth in billions of US dollars
- age
Age of the billionaire
- region
One of five regions: Asia, Europe, Middle East, United States, and Other
Source
Fortune magazine 1992.
Examples
head(bill)
summary(bill)
table(bill$region)
levels(bill$region)
levels(bill$region) <- c("Asia", "Europe", "Mid-East", "Other", "USA")
bill.wealth.ge5 <- bill[bill$wealth>5, ]
bill.wealth.ge5
bill.region.A <- bill[ bill$region == "A", ]
bill.region.A
a <- seq(1, 10, by =2)
oddrows <- bill[a, ]
barplot(table(bill$region), col=2:6)
hist(bill$wealth) # produces a dull looking plot
hist(bill$wealth, nclass=20) # produces a more detailed plot.
hist(bill$wealth, nclass=20, xlab="Wealth",
main="Histogram of wealth of billionaires")
# produces a more informative plot.
plot(bill$age, bill$wealth) # A very dull plot.
plot(bill$age, bill$wealth, xlab="Age", ylab="Wealth", pch="*") # better
plot(bill$age, bill$wealth, xlab="Age", ylab="Wealth", type="n")
# Lays the plot area but does not plot.
text(bill$age, bill$wealth, labels=bill$region, cex=0.7, col=2:6)
# Adds the points to the empty plot.
# Provides a better looking plot with more information.
boxplot(data=bill, wealth ~ region, col=2:6)
tapply(X=bill$wealth, INDEX=bill$region, FUN=mean)
tapply(X=bill$wealth, INDEX=bill$region, FUN=sd)
round(tapply(X=bill$wealth, INDEX=bill$region, FUN=mean), 2)
library(ggplot2)
gg <- ggplot2::ggplot(data=bill, aes(x=age, y=wealth)) +
geom_point(aes(col=region, size=wealth)) +
geom_smooth(method="loess", se=FALSE) +
xlim(c(7, 102)) +
ylim(c(1, 37)) +
labs(subtitle="Wealth vs Age of Billionaires",
y="Wealth (Billion US $)", x="Age",
title="Scatterplot", caption = "Source: Fortune Magazine, 1992.")
plot(gg)
Body fat percentage data for 102 elite male athletes training at the Australian Institute of Sport.
Description
Body fat percentage data for 102 elite male athletes training at the Australian Institute of Sport.
Usage
bodyfat
Format
A data frame with 102 rows and two columns:
- Skinfold
Skin-fold thicknesses measured using calipers
- Bodyfat
Percentage of fat content in the body
Source
Data collected by Dr R. Telford who was working for the Australian Institute of Sport (AIS)
Examples
summary(bodyfat)
plot(bodyfat$Skinfold, bodyfat$Bodyfat, xlab="Skin", ylab="Fat")
plot(bodyfat$Skinfold, log(bodyfat$Bodyfat), xlab="Skin", ylab="log Fat")
plot(log(bodyfat$Skinfold), log(bodyfat$Bodyfat), xlab="log Skin", ylab="log Fat")
old.par <- par(no.readonly = TRUE)
# par(mfrow=c(2,2)) # draws four plots in a graph
plot(bodyfat$Skinfold, bodyfat$Bodyfat, xlab="Skin", ylab="Fat")
plot(bodyfat$Skinfold, log(bodyfat$Bodyfat), xlab="Skin", ylab="log Fat")
plot(log(bodyfat$Skinfold), log(bodyfat$Bodyfat), xlab="log Skin", ylab="log Fat")
plot(1:5, 1:5, axes=FALSE, xlab="", ylab="", type="n")
text(2, 2, "Log both X and Y")
text(2, 1, "To have the best plot")
# Keep the transformed variables in the data set
bodyfat$logskin <- log(bodyfat$Skinfold)
bodyfat$logbfat <- log(bodyfat$Bodyfat)
par(old.par)
# Create a grouped variable
bodyfat$cutskin <- cut(log(bodyfat$Skinfold), breaks=6)
boxplot(data=bodyfat, Bodyfat~cutskin, col=2:7)
head(bodyfat)
require(ggplot2)
p2 <- ggplot(data=bodyfat, aes(x=cutskin, y=logbfat)) +
geom_boxplot(col=2:7) +
stat_summary(fun=mean, geom="line", aes(group=1), col="blue", linewidth=1) +
labs(x="Skinfold", y="Percentage of log bodyfat",
title="Boxplot of log-bodyfat percentage vs grouped log-skinfold")
plot(p2)
n <- nrow(bodyfat)
x <- bodyfat$logskin
y <- bodyfat$logbfat
xbar <- mean(x)
ybar <- mean(y)
sx2 <- var(x)
sy2 <- var(y)
sxy <- cov(x, y)
r <- cor(x, y)
print(list(n=n, xbar=xbar, ybar=ybar, sx2=sx2, sy2=sy2, sxy=sxy, r=r))
hatbeta1 <- r * sqrt(sy2/sx2) # calculates estimate of the slope
hatbeta0 <- ybar - hatbeta1 * xbar # calculates estimate of the intercept
rs <- y - hatbeta0 - hatbeta1 * x # calculates residuals
s2 <- sum(rs^2)/(n-2) # calculates estimate of sigma2
s2
bfat.lm <- lm(logbfat ~ logskin, data=bodyfat)
### Check the diagnostics
plot(bfat.lm$fit, bfat.lm$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
### Should be a random scatter
qqnorm(bfat.lm$res, col=2)
qqline(bfat.lm$res, col="blue")
# All Points should be on the straight line
summary(bfat.lm)
anova(bfat.lm)
plot(bodyfat$logskin, bodyfat$logbfat, xlab="log Skin", ylab="log Fat")
abline(bfat.lm, col=7)
title("Scatter plot with the fitted Linear Regression line")
# 95% CI for beta(1)
# 0.88225 + c(-1, 1) * qt(0.975, df=100) * 0.02479
# round(0.88225 + c(-1, 1) * qt(0.975, df=100) * 0.02479, 2)
# To test H0: beta1 = 1.
tstat <- (0.88225 -1)/0.02479
pval <- 2 * (1- pt(abs(tstat), df=100))
x <- seq(from=-5, to=5, length=500)
y <- dt(x, df=100)
plot(x, y, xlab="", ylab="", type="l")
title("T-density with df=100")
abline(v=abs(tstat))
abline(h=0)
x1 <- seq(from=abs(tstat), to=10, length=100)
y1 <- rep(0, length=100)
x2 <- x1
y2 <- dt(x1, df=100)
segments(x1, y1, x2, y2)
abline(h=0)
# Predict at a new value of Skinfold=70
# Create a new data set called new
newx <- data.frame(logskin=log(70))
a <- predict(bfat.lm, newdata=newx, se.fit=TRUE)
# Confidence interval for the mean of log bodyfat at skinfold=70
a <- predict(bfat.lm, newdata=newx, interval="confidence")
# a
# fit lwr upr
# [1,] 2.498339 2.474198 2.52248
# Prediction interval for a future log bodyfat at skinfold=70
a <- predict(bfat.lm, newdata=newx, interval="prediction")
a
# fit lwr upr
# [1,] 2.498339 2.333783 2.662895
# confidence intervals for the mean
pred.bfat.clim <- predict(bfat.lm, newdata=bodyfat, interval="confidence")
# prediction intervals for a future observation
pred.bfat.plim <- suppressWarnings(predict(bfat.lm, newdata=bodyfat, interval="prediction"))
plot(bodyfat$logskin, bodyfat$logbfat, xlab="log Skin", ylab="log Fat")
abline(bfat.lm, col=5)
lines(log(bodyfat$Skinfold), pred.bfat.clim[,2], lty=2, col=2)
lines(log(bodyfat$Skinfold), pred.bfat.clim[,3], lty=2, col=2)
lines(log(bodyfat$Skinfold), pred.bfat.plim[,2], lty=4, col=3)
lines(log(bodyfat$Skinfold), pred.bfat.plim[,3], lty=4, col=3)
title("Scatter plot with the fitted line and prediction intervals")
symb <- c("Fitted line", "95% CI for mean", "95% CI for observation")
## legend(locator(1), legend = symb, lty = c(1, 2, 4), col = c(5, 2, 3))
# Shows where we predicted earlier
abline(v=log(70))
summary(bfat.lm)
anova(bfat.lm)
Number of bomb hits in London during World War II
Description
Number of bomb hits in London during World War II
Usage
bombhits
Format
A data frame with two columns and six rows:
- numberhit
The number of bomb hits during World War II in each of the 576 areas in London.
- freq
Frequency of the number of hits
Source
Shaw and Shaw (2019).
Examples
summary(bombhits)
# Create a vector of data
x <- c(rep(0, 229), rep(1, 211), rep(2, 93), rep(3, 35), rep(4, 7), 5)
y <- c(229, 211, 93, 35, 7, 1) # Frequencies
rel_freq <- y/576
xbar <- mean(x)
pois_prob <- dpois(x=0:5, lambda=xbar)
fit_freq <- pois_prob * 576
#Check
cbind(x=0:5, obs_freq=y, rel_freq =round(rel_freq, 4),
Poisson_prob=round(pois_prob, 4), fit_freq=round(fit_freq, 1))
obs_freq <- y
xuniques <- 0:5
a <- data.frame(xuniques=0:5, obs_freq =y, fit_freq=fit_freq)
barplot(rbind(obs_freq, fit_freq),
args.legend = list(x = "topright"),
xlab="No of bomb hits",
names.arg = xuniques, beside=TRUE,
col=c("darkblue","red"),
legend =c("Observed", "Fitted"),
main="Observed and Poisson distribution fitted frequencies
for the number of bomb hits in London")
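# A further check (a sketch, not part of the packaged example): compare the
# observed and fitted frequencies with a chi-squared goodness-of-fit statistic,
# pooling the last two cells so that no expected count is too small.
obs <- c(229, 211, 93, 35, 8)          # observed counts for 0, 1, 2, 3 and 4+ hits
p <- dpois(0:3, lambda=xbar)           # fitted Poisson probabilities for 0 to 3 hits
p <- c(p, 1 - sum(p))                  # probability of 4 or more hits
expected <- 576 * p
chisq <- sum((obs - expected)^2 / expected)
1 - pchisq(chisq, df=length(obs) - 2)  # df = 5 cells - 1 - 1 estimated parameter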
Draws a butterfly as on the front cover of the book
Description
Draws a butterfly as on the front cover of the book
Usage
butterfly(color = 2, a = 2, b = 4)
Arguments
color |
This is the color to use in the plot. It can take any value that R can use for color, e.g. 1, 2, "blue" etc. |
a |
Parameter controlling the shape of the butterfly |
b |
Second parameter controlling the shape of the butterfly |
Value
No return value, called for side effects. It generates a plot whose colour and shape are determined by the supplied parameters.
Examples
butterfly(color = 6)
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2, 2))
butterfly(color = 6)
butterfly(a=5, b=5, color=2)
butterfly(a=10, b=1.5, color = "seagreen")
butterfly(a=20, b=4, color = "blue")
par(old.par) # par(mfrow=c(1, 1))
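# How might such a figure be produced? A minimal sketch in base R, assuming the
# classical Fay butterfly curve r = exp(sin(t)) - a*cos(b*t) + sin((2*t-pi)/24)^5;
# the packaged butterfly() may use a different parametrisation.
my_butterfly <- function(color=2, a=2, b=4) {
  t <- seq(0, 24*pi, length.out=5000)
  r <- exp(sin(t)) - a*cos(b*t) + sin((2*t - pi)/24)^5   # polar radius
  plot(r*cos(t), r*sin(t), type="l", col=color, axes=FALSE, xlab="", ylab="")
}
my_butterfly(color="blue")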
Breaking strength of cement data
Description
Breaking strength of cement data
Usage
cement
Format
A data frame with 36 rows and 3 columns:
- strength
Breaking strength in pounds per square inch
- gauger
Three different gauger machines which mix cement with water
- breaker
Three different breakers breaking the cement cubes
Examples
summary(cement)
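# A sketch of how the two factors described above might be explored, assuming
# gauger and breaker are coded as (or converted to) factors.
cement$gauger <- factor(cement$gauger)
cement$breaker <- factor(cement$breaker)
boxplot(strength ~ gauger, data=cement, col=2:4)
cement.lm <- lm(strength ~ gauger + breaker, data=cement)
anova(cement.lm)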
Weekly number of failures of a university computer system over a period of two years. This is a data vector containing 104 values.
Description
Weekly number of failures of a university computer system over a period of two years. This is a data vector containing 104 values.
Usage
cfail
Format
An object of class numeric
of length 104.
Examples
summary(cfail)
# 95% Confidence interval
c(3.75-1.96 * 3.381/sqrt(104), 3.75+1.96*3.381/sqrt(104)) # =(3.10,4.40).
x <- cfail
n <- length(x)
h <- qnorm(0.975)
# 95% Confidence interval Using quadratic inversion
mean(x) + (h*h)/(2*n) + c(-1, 1) * h/sqrt(n) * sqrt(h*h/(4*n) + mean(x))
# Modelling
# Observed frequencies
obs_freq <- as.vector(table(x))
# Obtain unique x values
xuniques <- sort(unique(x))
lam_hat <- mean(x)
fit_freq <- n * dpois(xuniques, lambda=lam_hat)
fit_freq <- round(fit_freq, 1)
# Create a data frame for plotting
a <- data.frame(xuniques=xuniques, obs_freq = obs_freq, fit_freq=fit_freq)
barplot(rbind(obs_freq, fit_freq), args.legend = list(x = "topright"),
xlab="No of weekly computer failures",
names.arg = xuniques, beside=TRUE, col=c("darkblue","red"),
legend =c("Observed", "Fitted"),
main="Observed and Poisson distribution fitted frequencies
for the computer failure data: cfail")
Taste testing of cheese data set
Description
Taste testing of cheese data set
Usage
cheese
Format
A data frame with 30 rows and 5 columns
- Taste
A measure of taste quality of cheese
- AceticAcid
Concentration of Acetic acid
- H2S
Concentration of hydrogen sulphide
- LacticAcid
Concentration of lactic acid
- logH2S
Logarithm of H2S
Examples
data(cheese)
summary(cheese)
pairs(cheese)
cheese.lm <- lm(Taste ~ AceticAcid + LacticAcid + logH2S, data=cheese, subset=2:30)
# Check the diagnostics
plot(cheese.lm$fit, cheese.lm$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
# Should be a random scatter
qqnorm(cheese.lm$res, col=2)
qqline(cheese.lm$res, col="blue")
summary(cheese.lm)
cheese.lm2 <- lm(Taste ~ LacticAcid + logH2S, data=cheese)
# Check the diagnostics
plot(cheese.lm2$fit, cheese.lm2$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
qqnorm(cheese.lm2$res, col=2)
qqline(cheese.lm2$res, col="blue")
summary(cheese.lm2)
# How can we predict?
newcheese <- data.frame(AceticAcid = 300, LacticAcid = 1.5, logH2S=4)
cheese.pred <- predict(cheese.lm2, newdata=newcheese, se.fit=TRUE)
cheese.pred
# Obtain confidence interval
cheese.pred$fit + c(-1, 1) * qt(0.975, df=27) * cheese.pred$se.fit
# Using R to predict
cheese.pred.conf.limits <- predict(cheese.lm2, newdata=newcheese, interval="confidence")
cheese.pred.conf.limits
# How to find prediction interval
cheese.pred.pred.limits <- predict(cheese.lm2, newdata=newcheese, interval="prediction")
cheese.pred.pred.limits
Nitrous oxide emission data
Description
Nitrous oxide emission data
Usage
emissions
Format
A data frame with 54 rows and 13 columns:
- Make
Make of the car
- Odometer
Odometer reading of the car
- Capacity
Engine capacity of the car
- CS505
A measurement taken while running the engine from a cold start for 505 seconds
- T867
A measurement taken while running the engine in transition from cold to hot for 867 seconds
- H505
A measurement taken while running the hot engine for 505 seconds
- ADR27
A previously used measurement standard
- ADR37
Result of the Australian standard ADR37 test
- logCS505
Logarithm of CS505
- logT867
Logarithm of T867
- logH505
Logarithm of H505
- logADR27
Logarithm of ADR27
- logADR37
Logarithm of ADR37
Source
Australian Traffic Accident Research Bureau.
Examples
summary(emissions)
rawdata <- emissions[, c(8, 4:7)]
pairs(rawdata)
# Fit the model on the raw scale
raw.lm <- lm(ADR37 ~ ADR27 + CS505 + T867 + H505, data=rawdata)
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2,1))
plot(raw.lm$fit, raw.lm$res,xlab="Fitted values",ylab="Residuals", main="Anscombe plot")
abline(h=0)
qqnorm(raw.lm$res,main="Normal probability plot", col=2)
qqline(raw.lm$res, col="blue")
# summary(raw.lm)
logdata <- log(rawdata)
# This only logs the values but not the column names!
# We can use the following command to change the column names or you can use
# fix(logdata) to do it.
dimnames(logdata)[[2]] <- c("logADR37", "logCS505", "logT867", "logH505", "logADR27")
pairs(logdata)
log.lm <- lm(logADR37 ~ logADR27 + logCS505 + logT867 + logH505, data=logdata)
plot(log.lm$fit, log.lm$res,xlab="Fitted values",ylab="Residuals", main="Anscombe plot")
abline(h=0)
qqnorm(log.lm$res,main="Normal probability plot", col=2)
qqline(log.lm$res, col="blue")
summary(log.lm)
log.lm2 <- lm(logADR37 ~ logADR27 + logT867 + logH505, data=logdata)
summary(log.lm2)
plot(log.lm2$fit, log.lm2$res,xlab="Fitted values",ylab="Residuals", main="Anscombe plot")
abline(h=0)
qqnorm(log.lm2$res,main="Normal probability plot", col=2)
qqline(log.lm2$res, col="blue")
par(old.par)
#####################################
# Multicollinearity Analysis
######################################
mod.adr27 <- lm(logADR27 ~ logT867 + logCS505 + logH505, data=logdata)
summary(mod.adr27) # Multiple R^2 = 0.9936,
mod.t867 <- lm(logT867 ~ logADR27 + logH505 + logCS505, data=logdata)
summary(mod.t867) # Multiple R^2 = 0.977,
mod.cs505 <- lm(logCS505 ~ logADR27 + logH505 + logT867, data=logdata)
summary(mod.cs505) # Multiple R^2 = 0.9837,
mod.h505 <- lm(logH505 ~ logADR27 + logCS505 + logT867, data=logdata)
summary(mod.h505) # Multiple R^2 = 0.5784,
# Variance inflation factors
vifs <- c(0.9936, 0.977, 0.9837, 0.5784)
vifs <- 1/(1-vifs)
#Condition numbers
X <- logdata
# X is a copy of logdata
X[,1] <- 1
# the first column of X is 1
# this is for the intercept
X <- as.matrix(X)
# Coerces X to be a matrix
xtx <- t(X) %*% X # Gives X^T X
eigenvalues <- eigen(xtx)$values
kappa <- max(eigenvalues)/min(eigenvalues)
kappa <- sqrt(kappa)
# kappa = 244 is much LARGER than 30!
### Validation statistic
# Fit the log.lm2 model with the first 45 observations
# use the fitted model to predict the remaining 9 observations
# Calculate the mean square error validation statistic
log.lmsub <- lm(logADR37 ~ logADR27 + logT867 + logH505, data=logdata, subset=1:45)
# Now predict all 54 observations using the fitted model
mod.pred <- predict(log.lmsub, logdata, se.fit=TRUE)
mod.pred$fit # provides all the 54 predicted values
logdata$pred <- mod.pred$fit
# Get only last 9
a <- logdata[46:54, ]
validation.residuals <- a$logADR37 - a$pred
validation.stat <- mean(validation.residuals^2)
validation.stat
Errors in guessing ages of Southampton mathematicians
Description
Errors in guessing ages of Southampton mathematicians
Usage
err_age
Format
A data frame with 550 rows and 10 columns
- group
Group number of the students guessing the ages
- size
Number of students in the group
- females
How many female guessers were in the group
- photo
Photograph number guessed; can take values 1 to 10.
- sex
Gender of the photographed person.
- race
Race of the photographed person.
- est_age
Estimated age of the photographed person.
- tru_age
True age of the photographed person.
- error
The value of error, estimated age minus true age
- abs_error
Absolute value of the error
Examples
summary(err_age)
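# A few exploratory summaries based on the columns listed above
# (a sketch, not taken from the book's code).
table(err_age$photo)                                       # number of guesses per photograph
round(tapply(err_age$abs_error, err_age$photo, mean), 2)   # mean absolute error by photograph
boxplot(error ~ photo, data=err_age, col=2:11,
        xlab="Photograph", ylab="Error in guessed age")
abline(h=0, lty=2)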
Service (waiting) times (in seconds) of customers at a fast-food restaurant.
Description
Service (waiting) times (in seconds) of customers at a fast-food restaurant.
Usage
ffood
Format
A data frame with 10 rows and 2 columns:
- AM
Waiting times for customers served during 9-10AM
- PM
Waiting times for customers served during 2-3PM
Examples
summary(ffood)
# 95% Confidence interval for the mean waiting time using t-distribution
a <- c(ffood$AM, ffood$PM)
mean(a) + c(-1, 1) * qt(0.975, df=19) * sqrt(var(a))/sqrt(20)
# Two sample t-test for the difference between morning and afternoon times
t.test(ffood$AM, ffood$PM)
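# The two-sample test above uses the Welch unequal-variance form by default.
# An extra check of the equal-variance assumption, followed by the
# pooled-variance version of the test (a sketch).
var.test(ffood$AM, ffood$PM)                  # F test for equality of the two variances
t.test(ffood$AM, ffood$PM, var.equal=TRUE)    # pooled two-sample t-test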
Gas mileage of four models of car
Description
Gas mileage of four models of car
Usage
gasmileage
Format
A data frame with two columns and eleven rows:
- mileage
Mileage obtained
- model
Four types of car models
Examples
summary(gasmileage)
y <- c(22, 26, 28, 24, 29, 29, 32, 28, 23, 24)
xx <- c(1,1,2,2,2,3,3,3,4,4)
# Plot the observations
plot(xx, y, col="red", pch="*", xlab="Model", ylab="Mileage")
# Method1: Hand calculation
ni <- c(2, 3, 3, 2)
means <- tapply(y, xx, mean)
vars <- tapply(y, xx, var)
round(rbind(means, vars), 2)
sum(y^2) # gives 7115
totalSS <- sum(y^2) - 10 * (mean(y))^2 # gives 92.5
RSSf <- sum(vars*(ni-1)) # gives 31.17
groupSS <- totalSS - RSSf # gives 61.33
meangroupSS <- groupSS/3 # gives 20.44
meanErrorSS <- RSSf/6 # gives 5.194
Fvalue <- meangroupSS/meanErrorSS # gives 3.936
pvalue <- 1-pf(Fvalue, df1=3, df2=6)
#### Method 2: Illustrate using dummy variables
#################################
#Create the design matrix X for the full regression model
g <- 4
n1 <- 2
n2 <- 3
n3 <- 3
n4 <- 2
n <- n1+n2+n3+n4
X <- matrix(0, ncol=g, nrow=n) #Set X as a zero matrix initially
X[1:n1,1] <- 1 #Determine the first column of X
X[(n1+1):(n1+n2),2] <- 1 #the 2nd column
X[(n1+n2+1):(n1+n2+n3),3] <- 1 #the 3rd
X[(n1+n2+n3+1):(n1+n2+n3+n4),4] <- 1 #the 4th
#################################
####Fitting the full model####
#################################
#Estimation
XtXinv <- solve(t(X)%*%X)
betahat <- XtXinv %*%t(X)%*%y #Estimation of the coefficients
Yhat <- X%*%betahat #Fitted Y values
Resids <- y - Yhat #Residuals
SSE <- sum(Resids^2) #Error sum of squares
S2hat <- SSE/(n-g) #Estimation of sigma-square; mean square for error
Sigmahat <- sqrt(S2hat)
##############################################################
####Fitting the reduced model -- the 4 means are equal #####
##############################################################
Xr <- matrix(1, ncol=1, nrow=n)
kr <- dim(Xr)[2]
#Estimation
Varr <- solve(t(Xr)%*%Xr)
hbetar <- solve(t(Xr)%*%Xr)%*%t(Xr)%*% y #Estimation of the coefficients
hYr = Xr%*%hbetar #Fitted Y values
Resir <- y - hYr #Residuals
SSEr <- sum(Resir^2) #Total sum of squares
###################################################################
####F-test for comparing the reduced model with the full model ####
###################################################################
FStat <- ((SSEr-SSE)/(g-kr))/(SSE/(n-g)) #The test statistic of the F-test
alpha <- 0.05
Critical_value_F <- qf(1-alpha, g-kr,n-g) #The critical constant of F-test
pvalue_F <- 1-pf(FStat,g-kr, n-g) #p-value of F-test
modelA <- c(22, 26)
modelB <- c(28, 24, 29)
modelC <- c(29, 32, 28)
modelD <- c(23, 24)
SSerror <- sum( (modelA-mean(modelA))^2 ) + sum( (modelB-mean(modelB))^2 ) +
  sum( (modelC-mean(modelC))^2 ) + sum( (modelD-mean(modelD))^2 )
SStotal <- sum( (y-mean(y))^2 )
SSgroup <- SStotal-SSerror
####
#### Method 3: Use the built-in function lm directly
#####################################
aa <- "modelA"
bb <- "modelB"
cc <- "modelC"
dd <- "modelD"
Expl <- c(aa,aa,bb,bb,bb,cc,cc,cc,dd,dd)
is.factor(Expl)
Expl <- factor(Expl)
model1 <- lm(y~Expl)
summary(model1)
anova(model1)
###Alternatively ###
xxf <- factor(xx)
is.factor(xxf)
model2 <- lm(y~xxf)
summary(model2)
anova(model2)
Simulation of the Monty Hall Problem. This demonstrates that switching is always better than staying with the initial guess. Programme written by Corey Chivers, 2012
Description
Simulation of the Monty Hall Problem. This demonstrates that switching is always better than staying with the initial guess. Programme written by Corey Chivers, 2012
Usage
monty(strat = "stay", N = 1000, print_games = TRUE)
Arguments
strat |
Strategy to use; possibilities are "stay", "switch" and "random". |
N |
How many games to play, defaults to 1000. |
print_games |
Logical; whether to print the results of each game. |
Value
No return value, called for side effects. If the supplied parameter
print_games
is TRUE, then it prints out the result
(Win or Loss) of each of the N simulated games. Finally it reports the
overall percentage of winning.
Source
Corey Chivers (2012).
Examples
# example code
monty("stay")
monty("switch")
monty("random")
Body weight and length of possums (tree-living, mostly nocturnal marsupials) caught in 7 different regions of Australia.
Description
Body weight and length of possums (tree-living, mostly nocturnal marsupials) caught in 7 different regions of Australia.
Usage
possum
Format
A data frame with 101 rows and 3 columns:
- Body_Weight
Body weight in kilogram
- Length
Body length of the possum
- Location
7 different regions of Australia: (1) Western Australia (W.A), (2) South Australia (S.A), (3) Northern Territory (N.T), (4) Queensland (QuL), (5) New South Wales (NSW), (6) Victoria (Vic) and (7) Tasmania (Tas).
Source
Lindenmayer and Donnelly (1995).
References
Lindenmayer DB, Viggers KL, Cunningham RB, Donnelly CF (1995). “Morphological variation among columns of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia).” Australian Journal of Zoology, 43, 449-458.
Examples
head(possum)
dim(possum)
summary(possum)
## Code lines used in the book
## Create a new data set
poss <- possum
poss$region <- factor(poss$Location)
levels(poss$region) <- c("W.A", "S.A", "N.T", "QuL", "NSW", "Vic", "Tas")
colnames(poss)<-c("y","z","Location", "x")
head(poss)
# Draw side by side boxplots
boxplot(y~x, data=poss, col=2:8, xlab="region", ylab="Weight")
# Obtain scatter plot
# Start with a skeleton plot
plot(poss$z, poss$y, type="n", xlab="Length", ylab="Weight")
# Add points for the seven regions
for (i in 1:7) {
points(poss$z[poss$Location==i],poss$y[poss$Location==i],type="p", pch=as.character(i), col=i)
}
## Add legends
legend(x=76, y=4.2, legend=paste(as.character(1:7), levels(poss$x)), lty=1:7, col=1:7)
# Start modelling
#Fit the model with interaction.
poss.lm1<-lm(y~z+x+z:x,data=poss)
summary(poss.lm1)
plot(poss$z, poss$y,type="n", xlab="Length", ylab="Weight")
for (i in 1:7) {
lines(poss$z[poss$Location==i],poss.lm1$fit[poss$Location==i],type="l",
lty=i, col=i, lwd=1.8)
points(poss$z[poss$Location==i],poss$y[poss$Location==i],type="p",
pch=as.character(i), col=i)
}
poss.lm0 <- lm(y~z,data=poss)
abline(poss.lm0, lwd=3, col=9)
# Has drawn the seven parallel regression lines
legend(x=76, y=4.2, legend=paste(as.character(1:7), levels(poss$x)),
lty=1:7, col=1:7)
n <- length(possum$Body_Weight)
# Wrong model since Location is not a numeric covariate
wrong.lm <- lm(Body_Weight~Location, data=possum)
summary(wrong.lm)
nis <- table(possum$Location)
meanwts <- tapply(possum$Body_Weight, possum$Location, mean)
varwts <- tapply(possum$Body_Weight, possum$Location, var)
datasums <- data.frame(nis=nis, mean=meanwts, var=varwts)
modelss <- sum(datasums[,2] * (meanwts - mean(meanwts))^2)
residss <- sum( (datasums[,2] - 1) * varwts)
fvalue <- (modelss/6) / (residss/94)
fcritical <- qf(0.95, df1= 6, df2=94)
x <- seq(from=0, to=12, length=200)
y <- df(x, df1=6, df2=94)
plot(x, y, type="l", xlab="", ylab="Density of F(6, 94)", col=4)
abline(v=fcritical, lty=3, col=3)
abline(v=fvalue, lty=2, col=2)
pvalue <- 1-pf(fvalue, df1=6, df2=94)
### Doing the above in R
# Convert the Location column to a factor
possum$Location <- as.factor(possum$Location)
summary(possum) # Now Location is a factor
# Put the identifiability constraint:
options(contrasts=c("contr.treatment", "contr.poly"))
colnames(possum) <- c("y", "z", "x")
# Fit model M1
possum.lm1 <- lm(y~x, data=possum)
summary(possum.lm1)
anova(possum.lm1)
possum.lm2 <- lm(y~z, data=poss)
summary(possum.lm2)
anova(possum.lm2)
# Include both location and length but no interaction
possum.lm3 <- lm(y~x+z, data=poss)
summary(possum.lm3)
anova(possum.lm3)
# Include interaction effect
possum.lm4 <- lm(y~x+z+x:z, data=poss)
summary(possum.lm4)
anova(possum.lm4)
anova(possum.lm2, possum.lm3)
#Check the diagnostics for M3
plot(possum.lm3$fit, possum.lm3$res,xlab="Fitted values",ylab="Residuals",
main="Anscombe plot")
abline(h=0)
qqnorm(possum.lm3$res,main="Normal probability plot", col=2)
qqline(possum.lm3$res, col="blue")
Puffin nesting data set. It contains data regarding the nesting habits of the common puffin.
Description
Puffin nesting data set. It contains data regarding the nesting habits of the common puffin.
Usage
puffin
Format
A data frame with 38 rows and 5 columns:
- Nesting_Frequency
Number of nests
- Grass_Cover
Percentage of area covered by grass
- Mean_Soil_Depth
Mean soil depth in centimetres
- Slope_Angle
Slope angle in degrees
- Distance_from_Edge
Distance of the plot from the cliff edge in metres
Source
Nettleship (1972).
References
Nettleship DN (1972). “Breeding Success of the Common Puffin (Fratercula arctica L.) on Different Habitats at Great Island, Newfoundland.” Ecological Monographs, 42, 239-268.
Examples
head(puffin)
dim(puffin)
summary(puffin)
pairs(puffin)
puffin$sqrtfreq <- sqrt(puffin$Nesting_Frequency)
puff.sqlm <- lm(sqrtfreq~ Grass_Cover + Mean_Soil_Depth + Slope_Angle
+Distance_from_Edge, data=puffin)
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2,1))
qqnorm(puff.sqlm$res,main="Normal probability plot", col=2)
qqline(puff.sqlm$res, col="blue")
plot(puff.sqlm$fit, puff.sqlm$res,xlab="Fitted values",ylab="Residuals",
main="Anscombe plot")
abline(h=0)
summary(puff.sqlm)
par(old.par)
#####################################
# F test for two betas at the same time:
######################################
puff.sqlm2 <- lm(sqrtfreq~ Mean_Soil_Depth + Distance_from_Edge, data=puffin)
anova(puff.sqlm)
anova(puff.sqlm2)
fval <- 1/2*(14.245-12.756)/0.387 # 1.924
qf(0.95, 2, 33) # 3.28
1-pf(fval, 2, 33) # 0.162
anova(puff.sqlm2, puff.sqlm)
Rice yield data
Description
Rice yield data
Usage
rice
Format
A data frame with three columns and 68 rows:
- Yield
Yield of rice in kilograms
- Days
Number of days after flowering before harvesting
Source
Bal and Ojha (1975).
Examples
summary(rice)
plot(rice$Days, rice$Yield, pch="*", xlab="Days", ylab="Yield")
rice$daymin31 <- rice$Days-31
rice.lm <- lm(Yield ~ daymin31, data=rice)
summary(rice.lm)
# Check the diagnostics
plot(rice.lm$fit, rice.lm$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
# Should be a random scatter
# Needs a quadratic term
qqnorm(rice.lm$res, col=2)
qqline(rice.lm$res, col="blue")
rice.lm2 <- lm(Yield ~ daymin31 + I(daymin31^2) , data=rice)
old.par <- par(no.readonly = TRUE)
par(mfrow=c(1, 2))
plot(rice.lm2$fit, rice.lm2$res, xlab="Fitted values", ylab = "Residuals")
abline(h=0)
# Should be a random scatter
# Much better plot!
qqnorm(rice.lm2$res, col=2)
qqline(rice.lm2$res, col="blue")
summary(rice.lm2)
par(old.par) # par(mfrow=c(1,1))
plot(rice$Days, rice$Yield, xlab="Days", ylab="Yield")
lines(rice$Days, rice.lm2$fit, lty=1, col=3)
rice.lm3 <- lm(Yield ~ daymin31 + I(daymin31^2)+I(daymin31^3) , data=rice)
#check the diagnostics
summary(rice.lm3) # Will print the summary of the fitted model
#### Predict at a new value of Days=32.1465
# Create a new data set called new
new <- data.frame(daymin31=32.1465-31)
a <- predict(rice.lm2, newdata=new, se.fit=TRUE)
# Confidence interval for the mean of rice yield at day=32.1465
a <- predict(rice.lm2, newdata=new, interval="confidence")
a
# fit lwr upr
# [1,] 3676.766 3511.904 3841.628
# Prediction interval for a future yield at day=32.1465
b <- predict(rice.lm2, newdata=new, interval="prediction")
b
# fit lwr upr
#[1,] 3676.766 3206.461 4147.071
Illustration of the CLT for samples from the Bernoulli distribution
Description
Illustration of the CLT for samples from the Bernoulli distribution
Usage
see_the_clt_for_Bernoulli(nsize = 10, nrep = 10000, prob = 0.8)
Arguments
nsize |
Sample size, n. Its default value is 10. |
nrep |
Number of replications: how many samples of size nsize should be drawn; defaults to 10000. |
prob |
True probability of success for the Bernoulli trials; defaults to 0.8. |
Value
A vector of means of the replicated samples. It also has the side effect of drawing a histogram of the standardized sample means and a superimposed density function of the standard normal distribution. The better the CLT approximation, the closer are the superimposed density and the histogram.
Examples
a <- see_the_clt_for_Bernoulli()
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2, 3))
a30 <- see_the_clt_for_Bernoulli(nsize=30)
a50 <- see_the_clt_for_Bernoulli(nsize=50)
a100 <- see_the_clt_for_Bernoulli(nsize=100)
a500 <- see_the_clt_for_Bernoulli(nsize=500)
a1000 <- see_the_clt_for_Bernoulli(nsize=1000)
a5000 <- see_the_clt_for_Bernoulli(nsize=5000)
par(old.par)
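# Because the function returns the replicated sample means, the vector a100
# computed above can be used to check the approximation numerically (a sketch,
# using the default prob=0.8, so the mean is 0.8 and the standard deviation of
# the sample mean is sqrt(0.8*0.2/100)).
z <- (a100 - 0.8) / sqrt(0.8 * 0.2 / 100)   # standardise the replicated means
mean(z <= 1.96)                             # should be close to pnorm(1.96) = 0.975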
Illustration of the central limit theorem for sampling from the uniform distribution
Description
Illustration of the central limit theorem for sampling from the uniform distribution
Usage
see_the_clt_for_uniform(nsize = 10, nrep = 10000)
Arguments
nsize |
Sample size, n. Its default value is 10. |
nrep |
Number of replications: how many samples of size nsize should be drawn; defaults to 10000. |
Value
A vector of means of the replicated samples. The function also
has the side effect of drawing a histogram of the sample means and
two superimposed density functions: one estimated from the data using
the density
function and the other is the density of the CLT
approximated normal distribution. The better the CLT approximation, the
closer are the two superimposed densities.
Examples
a <- see_the_clt_for_uniform()
old.par <- par(no.readonly = TRUE)
par(mfrow=c(2, 3))
a1 <- see_the_clt_for_uniform(nsize=1)
a2 <- see_the_clt_for_uniform(nsize=2)
a3 <- see_the_clt_for_uniform(nsize=5)
a4 <- see_the_clt_for_uniform(nsize=10)
a5 <- see_the_clt_for_uniform(nsize=20)
a6 <- see_the_clt_for_uniform(nsize=50)
par(old.par)
ybars <- see_the_clt_for_uniform(nsize=12)
zbars <- (ybars - mean(ybars))/sd(ybars)
k <- 100
u <- seq(from=min(zbars), to= max(zbars), length=k)
ecdf <- rep(NA, k)
for(i in 1:k) ecdf[i] <- length(zbars[zbars<u[i]])/length(zbars)
tcdf <- pnorm(u)
plot(u, tcdf, type="l", col="red", lwd=4, xlab="", ylab="cdf")
lines(u, ecdf, lty=2, col="darkgreen", lwd=4)
symb <- c("cdf of sample means", "cdf of N(0, 1)")
legend(x=-3.5, y=0.4, legend = symb, lty = c(2, 1),
col = c("darkgreen","red"), bty="n")
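# The returned means can also be used to check the variance part of the CLT
# (a sketch, assuming sampling from the standard uniform distribution, whose
# variance is 1/12, so the sample mean of n=12 values has variance 1/144).
c(theoretical=1/144, observed=var(ybars))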
Illustration of the weak law of large numbers for sampling from the uniform distribution
Description
Illustration of the weak law of large numbers for sampling from the uniform distribution
Usage
see_the_wlln_for_uniform(nsize = 10, nrep = 1000)
Arguments
nsize |
Sample size, n. Its default value is 10. |
nrep |
Number of replications: how many samples of size nsize should be drawn; defaults to 1000. |
Value
A list giving the x values and the density estimates y, from the generated random samples. The function also draws the empirical density on the current graphics device.
Examples
a1 <- see_the_wlln_for_uniform(nsize=1, nrep=50000)
a2 <- see_the_wlln_for_uniform(nsize=10, nrep=50000)
a3 <- see_the_wlln_for_uniform(nsize=50, nrep=50000)
a4 <- see_the_wlln_for_uniform(nsize=100, nrep=50000)
plot(a4, type="l", lwd=2, ylim=range(c(a1$y, a2$y, a3$y, a4$y)), col=1,
lty=1, xlab="mean", ylab="density estimates")
lines(a3, type="l", lwd=2, col=2, lty=2)
lines(a2, type="l", lwd=2, col=3, lty=3)
lines(a1, type="l", lwd=2, col=4, lty=4)
symb <- c("n=1", "n=10", "n=50", "n=100")
legend(x=0.37, y=11.5, legend = symb, lty =4:1, col = 4:1)
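# Each returned object is a list with components x and y holding the estimated
# density, so the concentration implied by the weak law can be checked directly
# on a4 from above (a sketch, assuming sampling from the standard uniform
# distribution, whose mean is 0.5).
a4$x[which.max(a4$y)]   # location of the density peak; should be close to 0.5
max(a4$y)               # the peak height grows with the sample size n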
Weight gain data from 68 first year students during their first 12 weeks in college
Description
Weight gain data from 68 first year students during their first 12 weeks in college
Usage
wgain
Format
A data frame with three columns and 68 rows:
- student
Student number, 1 to 68.
- initial
Initial weight in kilogram
- final
Final weight in kilogram
Examples
summary(wgain)
plot(wgain$initial, wgain$final)
abline(0, 1, col="red")
plot(wgain$initial, wgain$final, xlab="Wt in Week 1",
ylab="Wt in Week 12", pch="*", las=1)
abline(0, 1, col="red")
title("A scatter plot of the weights in Week 12 against
the weights in Week 1")
# 95% Confidence interval for mean weight gain
x <- wgain$final - wgain$initial
mean(x) + c(-1, 1) * qt(0.975, df=67) * sqrt(var(x)/68)
# t-test to test the mean difference equals 0
t.test(x)