Type: | Package |
Title: | Companion to "Statistics Using R: An Integrative Approach" |
Version: | 1.0.4 |
Depends: | R (≥ 3.5.0) |
Description: | Access to the datasets and many of the functions used in "Statistics Using R: An Integrative Approach". These datasets include a subset of the National Education Longitudinal Study, the Framingham Heart Study, as well as several simulated datasets used in the examples throughout the textbook. The functions included in the package reproduce some of the functionality of 'Stata' that is not directly available in 'R'. The package also contains a tutorial on basic data frame management, including how to handle missing data. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.0.0 |
Imports: | learnr |
NeedsCompilation: | no |
Packaged: | 2020-08-24 22:05:29 UTC; daphnaharel |
Author: | Daphna Harel [cre, aut], Sharon Weinberg [ctb], Sarah Abramowitz [ctb] |
Maintainer: | Daphna Harel <daphna.harel@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2020-08-25 22:30:02 UTC |
Anscombe's Four Datasets
Description
This dataset is used to illustrate the importance of statistical display as an adjunct to summary statistics. Anscombe (1973) fabricated four different bivariate datasets such that, for all datasets, the respective X and Y means, X and Y standard deviations, and correlations, slopes, intercepts, and standard errors of estimate are equal. Accordingly, without a visual representation of these four panels, one might assume that the data values for all four datasets are the same. Scatterplots illustrate, however, the extent to which these datasets are different from one another.
Usage
Anscombe
Format
A data frame with 11 rows and 8 variables:
- x1
values of X for the first dataset
- y1
values of Y for the first dataset
- x2
values of X for the second dataset
- y2
values of Y for the second dataset
- x3
values of X for the third dataset
- y3
values of Y for the third dataset
- x4
values of X for the fourth dataset
- y4
values of Y for the fourth dataset
Heights and Weights of U.S. Basketball Players
Description
The dataset consists of the heights and weights of the 24 scoring leaders, 12 each from the U.S. Women’s and Men’s National Basketball Association, for the 2014 – 2015 season. These data are taken from the ESPN website at espn.com.
Usage
Basketball
Format
A data frame with 20 rows and 6 variables:
- player
name of player
- gender
gender of player
- heightin
height of player in inches
- weightlb
weight of player in pounds
- games
number of games played
- points
average total points scored per game
Source
Blood Pressure Data of African-American Adult Males
Description
The data were collected to determine whether an increase in calcium intake reduces blood pressure among African-American adult males. The data are based on a sample of 21 African-American adult males selected randomly from the population of African-American adult males. Ten of the 21 men were randomly assigned to a treatment condition that required them to take a calcium supplement for 12 weeks. The remaining 11 men received a placebo for the 12 weeks. At both the beginning and the end of this time period, systolic blood pressure readings of all men were recorded. These data are adapted from the Data and Story Library (DASL) website.
Usage
Blood
Format
A data frame with 21 rows and 4 variables:
- id
case number
- treatmen
treatment condition
- systolc1
initial blood pressure
- systolc2
final blood pressure
Brain Size and IQ Data
Description
The data are based on a study by Willerman et al. (1991) of the relationships between brain size, gender, and intelligence. The research participants consisted of 40 right-handed introductory psychology students with no history of alcoholism, unconsciousness, brain damage, epilepsy, or heart disease who were selected from a larger pool of introductory psychology students with total Scholastic Aptitude Test Scores higher than 1350 or lower than 940. The students in the study took four subtests (Vocabulary, Similarities, Block Design, and Picture Completion) of the Wechsler (1981) Adult Intelligence Scale-Revised. Among the students with Wechsler full-scale IQ’s less than 103, 10 males and 10 females were randomly selected. Similarly, among the students with Wechsler full-scale IQ’s greater than 130, 10 males and 10 females were randomly selected, yielding a randomized blocks design. MRI scans were performed at the same facility for all 40 research participants to measure brain size. The scans consisted of 18 horizontal MRI images. The computer counted all pixels with non-zero gray scale in each of the 18 images, and the total count served as an index for brain size. The dataset and description are adapted from the Data and Story Library (DASL) website.
Usage
Brainsz
Format
A data frame with 40 rows and 7 variables:
- ID
case number
- GENDER
gender of student
- FSIQ
full-scale IQ score based on WAIS-R
- VIQ
verbal IQ score based on WAIS-R
- PIQ
performance IQ score based on WAIS-R
- MRI
pixel count from 18 MRI scans
- IQDI
group membership based on FSIQ score
Exercise 14.1 Figures
Description
This dataset contains simulated data for the figures accompanying Exercise 14.1 of Chapter 14. The data represent the results of a fictional study to determine whether there is a relationship between gender, teaching method, and achievement in reading. Each set of scores reflects a scenario with a different relationship among the variables.
Usage
Chapter14_Figures
Format
A data frame with 12 rows and 7 variables:
- sex
individual's sex
- score1
reading achievement score for first scenario
- method
teaching method
- score2
reading achievement score for second scenario
- score3
reading achievement score for third scenario
- score4
reading achievement score for fourth scenario
- score5
reading achievement score for fifth scenario
Value and Circulation of Currency
Description
This dataset contains, for the smaller bill denominations, the value of the bill and the total value in circulation. The source for these data is The World Almanac and Book of Facts 2014.
Usage
Currency
Format
A data frame with 5 rows and 3 variables:
- BillValue
denomination
- TotalCirculation
total currency in circulation in U.S. dollars
- NumberCirculation
total number of bills in circulation
Exercise, Food Intake, and Weight Loss
Description
A fabricated dataset constructed by Darlington (1990) to demonstrate the importance of including all relevant variables in an analysis. This dataset contains information about exercise, food intake, and weight loss for a fictional set of dieters.
Usage
Exercise
Format
A data frame with 10 rows and 4 variables:
- ID
case number
- Exercise
average daily number of hours exercised in that week
- FoodIntake
average daily number of calories consumed in one particular week that is more than a baseline of 1,000 calories, as measured in increments of 100 calories
- WeightLoss
number of pounds lost in that week
References
"Regression and linear models." Darlington, R. B. (1990, ISBN:978-0070153721)
Exercise 14.5 Data
Description
This dataset contains simulated data for the figures accompanying Exercise 14.1 of Chapter 14. The data represent the results of a fictional study in which a college professor examines the effect of the grade level of the students and the time of the course on how well undergraduate students at her college do in her course.
Usage
Exercise14_5
Format
A data frame with 40 rows and 3 variables:
- Time
time of day student takes the course
- Year
year of college in which the student is enrolled
- Score
final exam score
Figure 15.1 Data
Description
This dataset contains simulated data for Figure 15.1 of Chapter 15.
Usage
Figure15_1
Format
A list with 3 elements:
- x
an integer-scaled independent variable
- y
an integer-scaled outcome variable
- f
frequency of value pair
Figure 15.12 Data
Description
This dataset contains simulated data for Figures 15.12 - 15.13 of Chapter 15.
Usage
Figure15_12
Format
A data frame with 9 rows and 4 variables:
- x
a numeric independent variable for Figure 15.12
- y
a numeric outcome variable for Figure 15.12
- xpr
a numeric independent variable for Figure 15.13
- ypr
a numeric outcome variable for Figure 15.13
Figure 15.9 Data
Description
This dataset contains simulated data for Figures 15.9 - 15.11 of Chapter 15.
Usage
Figure15_9
Format
A data frame with 24 rows and 4 variables:
- x
a numeric independent variable for Figure 15.9
- y
a numeric outcome variable for Figure 15.9
- res
residual value for regression of
y
onx
- log_y
log of the outcome variable
y
Figure 2.4. Annual Number of Deaths in New York City: Tobacco vs. Other
Description
This dataset contains data on causes of death in New York City that were used for Figure 2.4 of Chapter 2.
Usage
Figure2_4
Format
A data frame with 591,200 rows and 1 variable:
- causes
cause of death
Figure 3.2 Data
Description
This dataset contains simulated test scores of Spanish fluency used to generate Figure 3.2 of Chapter 3.
Usage
Figure3_2
Format
A data frame with 100 rows and 1 variable:
- fluency
score on test of Spanish fluency
Figure 3.3 Data
Description
This dataset contains simulated scores used to generate Figure 3.3 of Chapter 3.
Usage
Figure3_3
Format
A data frame with 45 rows and 1 variable:
- score
numeric score from rectangular distribution
Figure 3.5(A) Data
Description
This dataset contains simulated scores used to generate Figure 3.5(A) of Chapter 3.
Usage
Figure3_5a
Format
A data frame with 121 rows and 1 variable:
- DistnA
numeric score from a symmetric distribution
Figure 3.5(B) Data
Description
This dataset contains simulated scores used to generate Figure 3.5(B) of Chapter 3.
Usage
Figure3_5b
Format
A data frame with 75 rows and 1 variable:
- DistnB
numeric score from a symmetric distribution
Figures 3.6 and 3.7 Data
Description
This dataset contains simulated scores used to generate Figures 3.6 ad 3.7 of Chapter 3.
Usage
Figure3_6and7
Format
A data frame with 69 rows and 2 variables:
- NegSkew
numeric score from a distribution with severe negative skew
- PosSkew
numeric score from a distribution with severe positive skew
Figure 5.5 Data
Description
This dataset contains simulated scores used to generate Figures 5.5(A) - 5.5(I) of Chapter 5.
Usage
Figure5_5
Format
A data frame with 10 rows and 18 variables:
- ax
days elapsed in a given year
- ay
days remaining in that same year
- bx
age of elementary school student
- by
number of seconds to run a 100-yard dash
- cx
introversion score of adolescent boy
- cy
aggression score of adolescent boy
- dx
moodiness score of college freshman
- dy
English ability score of college freshman
- ex
weight of male college student
- ey
achievement score in statistics of male college student
- fx
expected grade in course of college student
- fy
course evaluation score given by college student
- gx
IQ score of child in grades K – 3
- gy
reading achievement score of child in grades K – 3
- hx
arithmetic reasoning score of elementary school student
- hy
arithmetic fundamentals score of elementary school student
- ix
diameter of tree
- iy
circumference of tree
Framingham Heart Study
Description
The Framingham Heart Study is a long term prospective study of the etiology of cardiovascular disease among a population of non-institutionalized people in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects. The study began in 1956 and 5,209 subjects were initially enrolled in the study. In our dataset, we included variables from the first examination in 1956 and the third examination, in 1968. Clinic examination data has included cardiovascular disease risk factors and markers of disease such as blood pressure, blood chemistry, lung function, smoking history, health behaviors, ECG tracings, echocardiography, and medication use. Through regular surveillance of area hospitals, participant contact, and death certificates, the Framingham Heart Study reviews and adjudicates events for the occurrence of any of the following types of coronary heart disease(CHD): angina pectoris, myocardial infarction, heart failure, and cerebrovascular disease.
Usage
Framingham
Format
A data frame with 400 rows and 33 variables:
- ID
case number
- SEX
sex
- TOTCHOL1
serum cholesterol (mg/dL) at initial examination
- AGE1
age (years) at initial examination
- SYSBP1
systolic blood pressure (mmHg) at initial examination
- DIABP1
diastolic blood pressure (mmHg) at initial examination
- CURSMOKE1
indicator that participant currently is a cigarette smoker at initial examination
- CIGPDAY1
cigarettes smoked per day at initial examination
- BMI1
Body Mass Index (kg/(M*M)) at initial examination
- DIABETES1
indicator that participant is diabetic at initial examination
- BPMEDS1
use of anti-hypertensive medication at initial examination
- HEARTRTE1
ventricular rate (beats/min) at initial examination
- GLUCOSE1
casual glucose (mg/dL) at initial examination
- PREVCHD1
prevalent CHD (angina pectoris, myocardial infarction, or coronary insufficiency) at initial examination
- TIME1
days since initial examination
- TIMECHD1
days from initial examination to any CHD event
- TOTCHOL3
serum cholesterol (mg/dL) at third examination
- AGE3
age (years) at third examination
- SYSBP3
systolic blood pressure (mmHg) at third examination
- DIABP3
diastolic blood pressure (mmHg) at third examination
- CURSMOKE3
indicator that participant currently is a cigarette smoker at third examination
- CIGPDAY3
cigarettes smoked per day at third examination
- BMI3
Body Mass Index (kg/(M*M) at third examination
- DIABETES3
indicator that participant is diabetic at third examination
- BPMEDS3
use of anti-hypertensive medication at third examination
- HEARTRTE3
ventricular rate (beats/min) at third examination
- GLUCOSE3
casual glucose (mg/dL) at third examination
- PREVCHD3
prevalent CHD (angina pectoris, myocardial infarction, or coronary insufficiency) at third examination
- TIME3
days since initial examination at third examination
- HDLC3
HDL cholesterol (mg/dL) at third examination
- LDLC3
LDL cholesterol (mg/dL) at third examination
- TIMECHD3
days from initial examination to any CHD event at third examination
- ANYCHD4
indicator of event of hospitalized myocardial infarction, angina pectoris, coronary insufficiency, or fatal CHD by the end of the study
Details
The associated dataset is a subset of the data collected as part of the Framingham study and includes laboratory, clinic, questionnaire, and adjudicated event data on 400 participants. These participants for the dataset have been chosen so that among all male participants, 100 smokers and 100 non-smokers were selected at random. A similar procedure resulted in 100 female smokers and 100 female non-smokers. This procedure resulted in an over-sampling of smokers. The data for each participant is on one row. People who had any type of CHD in the initial examination period are not included in the dataset.
McDonald's Hamburger Nutrition Information
Description
This dataset contains the fat grams and calories associated with the different types of hamburger sold by McDonald’s. The data are from McDonald’s Nutrition Information Center.
Usage
Hamburger
Format
A data frame with 5 rows and 4 variables:
- name
type of burger
- fat
grams of fat
- calories
total calories
- cheese
cheese added
Ice Cream Sales Data
Description
This dataset contains fabricated data for the temperature, relative humidity, and ice cream sales for 30 days randomly selected between May 15th and September 6th.
Usage
IceCream
Format
A data frame with 30 rows and 4 variables:
- id
case number
- temp
temperature in degrees Fahrenheit
- barsold
number of ice cream bars sold
- relhumid
relative humidity
Clinton Impeachment Votes
Description
On February 12, 1999, for only the second time in the nation’s history, the U.S. Senate voted on whether to remove a president, based on impeachment articles passed by the U.S. House. Professor Alan Reifman of Texas Tech University created the dataset consisting of descriptions of each senator that can be used to understand some of the reasons that the senators voted the way they did. The data are taken from the Journal of Statistics Education [online].
Usage
Impeach
Format
A data frame with 100 rows and 11 variables:
- name
senator’s name
- state
state the senator represents
- region
geographic region of the U.S.
- vote1
vote on perjury
- vote2
vote on obstruction of justice
- guilty
total number of guilty votes
- party
political party of senator
- conserva
conservatism score, defined as the senator’s degree of ideological conservatism, based on 1997 voting records as judged by the American Conservative Union, where the scores ranged from 0 to 100 and 100 is most conservative
- supportc
state voter support for Clinton, defined as the percent of the vote Clinton received in the 1996 presidential election in the senator’s state
- reelect
year the senator’s seat is up for reelection
- newbie
indicator for whether the senator is in their first-term
Learning Disabilities in Elementary Students
Description
This dataset is a subset of data from a study by Susan Tomasi and Sharon L. Weinberg (1999), which profiled learning disabled students in an urban setting. According to Public Law 94.142, enacted in 1976, a team may determine that a child has a learning disability (LD) if a severe discrepancy exists between a child’s actual achievement in, for example, math or reading, and his or her intellectual ability. The dataset consists of six variables, described below, on 105 elementary school children from an urban area who were classified as LD and who, as a result, had been receiving special education services for at least three years. Of the 105 children, 42 are female and 63 are male. There are two main types of placements for these students: part-time resource room placements, in which the students get additional instruction to supplement regular classroom instruction, and self-contained classroom placements, in which students are segregated full time. In this dataset, 66 students are in resource room placements while 39 are in self-contained classroom placements. For inferential purposes, we consider the children in the dataset to be a random sample of all children attending public elementary school in a certain city who have been diagnosed with learning disabilities. Many students in the dataset have missing values for either math or reading comprehension, or both. Such omissions can lead to problems when generalizing results. There are statistical remedies for missing data that are beyond the scope of this text. In this case, we will assume that there is no pattern to the missing values, so that our sample is representative of the population.
Usage
Learndis
Format
A data frame with 105 rows and 6 variables:
- grade
student’s grade level
- gender
student’s gender
- placemen
type of placement: “RR” for part time in resource room or “MIS” for full time in self-contained classroom
- readcomp
reading comprehension score, with possible range of 0 to 200
- mathcomp
math comprehension score, with possible range of 0 to 200
- iq
student’a intellectual ability, as measured by IQ score with possible range of 0 to 200
References
"Classifying children as learning disabled: An analysis of current practice in an urban setting." Tomasi, S., & Weinberg, S. L. (1999) <doi:10.2307/1511150>
Likert-Scale Assertiveness Measure
Description
This dataset contains fabricated data for a single survey item measured on a Likert scale. It is given that a survey was administered to 30 individuals and included an item measuring assertiveness by having the individual indicate agreement with the statement: “I have the ability to stand up for my own rights without denying the rights of others.” The response options were: 1 = "strongly agree"; 2 = "agree"; 3 = "neutral"; 4 = "disagree"; 5 = "strongly disagree." Notice that on this scale, high scores are associated with low levels of assertiveness.
Usage
Likert
Format
A data frame with 30 rows and 1 variable:
- Assertiveness
five-point Likert-scale score of assertiveness, with high scores associated with low levels of assertiveness
Manual Dexterity
Description
This fictional dataset contains the treatment group number and the manual dexterity scores for 30 individuals selected by the director of a drug rehabilitation center. There are three treatments, and the individuals are randomly assigned ten to a treatment. After five weeks of treatment, a manual dexterity test is administered for which a higher score indicates greater manual dexterity.
Usage
ManDext
Format
A data frame with 30 rows and 3 variables:
- ManualDex
manual dexterity score
- Sex
individual’s sex
- Treatment
treatment group assignment
Manual Dexterity (Dataset #2)
Description
This is a second fictional dataset that expands on ManDext
, adding predicted outcome variables from regression analyses under alternative scenarios.
Usage
ManDext2
Format
A data frame with 30 rows and 9 variables:
- ManualDex
manual dexterity score
- Sex
individual’s sex
- Treatment
treatment group assignment
- yhat
predicted outcome for disordinal interaction scenario
- yhat2
predicted outcome for ordinal interaction scenario
- yhat3
predicted outcome for first no-interaction scenario
- yhat4
predicted outcome for second no-interaction scenario
- yhat5
predicted outcome for third no-interaction scenario
- yhat6
predicted outcome for fourth no-interaction scenario
Marijuana Use of Twelfth Graders
Description
The dataset contains the year and percentage of twelfth graders who have ever used marijuana for several recent years. The source for these data is The World Almanac and Book of Facts 2014.
Usage
Marijuana
Format
A data frame with 23 rows and 2 variables:
- Year
year for which data was collected
- MarijuanaUse
percentage of twelfth graders who reported that they have ever used marijuana
National Education Longitudinal Study (NELS) of 1988
Description
In response to pressure from federal and state agencies to monitor school effectiveness in the United States, the National Center of Education Statistics (NCES) of the U.S. Department of Education conducted a survey in the spring of 1988, the National Education Longitudinal Study (NELS). The participants consisted of a nationally representative sample of approximately 25,000 eighth graders to measure achievement outcomes in four core subject areas (English, history, mathematics, and science), in addition to personal, familial, social, institutional, and cultural factors that might relate to these outcomes. Details on the design and initial analysis of this survey may be referenced in Horn, Hafner, and Owings (1992). A follow-up of these students was conducted during tenth grade in the spring of 1990; a second follow-up was conducted during the twelfth grade in the spring of 1992; and, finally, a third follow-up was conducted in the spring of 1994.
Usage
NELS
Format
A data frame with 500 rows and 48 variables:
- id
case number
- advmath8
indicator for whether advanced math taken in eighth grade
- urban
urbanicity, a measure of the type of environment in which the student lives
- region
geographic region of school
- gender
student's gender
- famsize
student’s family size
- parmarl8
parents' marital status in eighth grade
- homelang
home language background
- slfcnc08
self-concept in eighth grade
- slfcnc10
self-concept in tenth grade
- slfcnc12
self-concept in twelfth grade
- schtyp8
school type in eighth grade
- tcherint
likert-scale variable classifying student agreement with the statement, “My teachers are interested in students”
- late12
number of times late for school in twelfth grade
- cuts12
number of times skipped/cut classes in twelfth grade
- absent12
number of times student missed school in twelfth grade
- approg
indicator for whether advanced placement program taken
- hwkin12
time spent on homework weekly in school per week in twelfth grade
- hwkout12
time spent on homework out of school per week in twelfth grade
- excurr12
time spent weekly on extracurricular activities in twelfth grade, in hours
- computer
indicator for whether computer owned by family in eighth grade
- hsprog
type of high school program
- unitengl
units in English (NAEP), or number of years of English taken in high school
- unitmath
units in mathematics (NAEP), or number of years of math taken in high school
- unitcalc
units in calculus (NAEP), or number of years of calculus taken in high school
- schattrt
school average daily attendance rate
- apoffer
number of advanced placement courses offered by school
- nursery
indicator for whether nursery school attended
- algebra8
indicator for whether algebra taken in eighth grade
- numinst
number of post-secondary institutions attended
- edexpect
highest level of education expected
- expinc30
expected income at age 30, in dollars
- achrdg08
reading achievement in eighth grade
- achmat08
math achievement in eighth grade
- achsci08
science achievement in eighth grade
- achsls08
social studies achievement in eighth grade
- achrdg10
reading achievement in tenth grade
- achmat10
math achievement in tenth grade
- achsci10
science achievement in tenth grade
- achsls10
social studies achievement in tenth grade
- achrdg12
reading achievement in twelfth grade
- achmat12
math achievement in twelfth grade
- achsci12
science achievement in twelfth grade
- achsls12
social studies achievement in twelfth grade
- cigarett
indicator for whether smoked cigarettes ever
- alcbinge
indicator for whether ever binged on alcohol
- marijuan
indicator for whether smoked marijuana ever
- ses
socioeconomic status score, ranging from 0 to 35, and given as a composite of father’s education level, mother’s education level, father’s occupation, mother’s education, and family income
Details
For this dataset, we have selected a sub-sample of 500 cases and 48 variables. The cases were sampled randomly from the approximately 5,000 students who responded to all four administrations of the survey, who were always at grade level (neither repeated nor skipped a grade), and who pursued some form of post-secondary education. The particular variables were selected to explore the relationships between student and home background variables, self-concept, educational and income aspirations, academic motivation, risk-taking behavior, and academic achievement.
References
"A profile of American eighth-grade mathematics and science instruction." Horn, L., Hafner, & Owings (1992) <https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=92486>
Gender and Political Party Affiliation
Description
This dataset contains data on a fabricated random sample of 200 individuals, 100 females and 100 males, drawn from a population of interest. There are only two variables, both of which are categorical: gender and political party affiliation.
Usage
Politics
Format
A data frame with 200 rows and 2 variables:
- Gender
individual's gender
- Party
individual's political party affiliation
Educational Measures of the 50 States and Washington, D.C.
Description
This dataset includes different educational measures of the 50 states and Washington, D.C. These data are from The 2014 World Almanac and Book of Facts.
Usage
States
Format
A data frame with 51 rows and 10 variables:
- state
name of state
- region
region of the country in which the state is located
- enrollmt
total public school enrollment 2011 - 2012
- stuteach
average number of pupils per teacher 2011 - 2012
- teachpay
average annual salary for public school teachers 2011 - 2012
- educexpe
average expenditure per pupil 2011 - 2012
- satcr
average SAT Critical Reading score 2013
- satm
average SAT Math score 2013
- satw
average SAT Writing score 2013
- pertak
percentage of eligible students taking the SAT 2012
Significant Statisticians
Description
This dataset includes data on 12 statisticians who each have contributed significantly to the field of modern statistics.
Usage
Statisticians
Format
A data frame with 12 rows and 5 variables:
- Statistician
name of statistician
- Gender
gender of statistician, where 1 = “Female” and 2 = “Male”
- Birth
year of birth
- Death
year of death
- AmStat
number of references in The American Statistician, 1995-2005
Stepping and Heart Rate
Description
Students at Ohio State University conducted an experiment in the fall of 1993 to explore the nature of the relationship between a person's heart rate and the frequency at which that person stepped up and down on steps of various heights. The response variable, heart rate, was measured in beats per minute. For each person, the resting heart rate was measured before a trial (HRInit) and after stepping (HRFinal). There were two different step heights (Height): 5.75 inches (coded as 1 = Low), and 11.5 inches (coded as 2 = High). There were three rates of stepping (Freq): 14 steps/min. (coded as 1 = Slow), 21 steps/min. (coded as 2 = Medium), and 28 steps/min. (coded as 3 = Fast). This resulted in six possible height/frequency combinations. Each subject performed the activity for three minutes. Subjects were kept on pace by the beat of an electric metronome. One experimenter counted the subject's heart rate, in beats per minute, for 20 seconds before and after each trial. The subject always rested between trials until her or his heart rate returned to close to the beginning rate. Another experimenter kept track of the time spent stepping. Each subject was always measured and timed by the same pair of experimenters to reduce variability in the experiment. The dataset and description are adapted from the Data and Story Library (DASL) website.
Usage
Stepping
Format
A data frame with 30 rows and 6 variables:
- Order
overall performance order of the trial
- Block
subject and experimenters' block number
- Height
step height
- Freq
rate of stepping
- HRInit
resting heart rate of the subject before a trial, in beats per minute
- HRFinal
final heart rate of the subject after a trial, in beats per minute
Average Monthy Temperatures for Two Cities
Description
This dataset gives the average monthly temperatures (in degrees Fahrenheit) for Springfield, MO and San Francisco, CA. These data are from Burrill and Hopensperger (1993).
Usage
Temp
Format
A data frame with 24 rows and 2 variables:
- City
city where temperature was measured
- Temperature
average monthly temperature, in degrees Fahrenheit
References
"Exploring Statistics with the T1-81" Burrill, G., & Hopensperger, P. (1993, ISBN:9780201524321)
Upper Body Strength
Description
This simulated dataset consists of the number of hours eight individuals spend at the gym on a weekly basis along with measures of their upper body strength.
Usage
UpperBodyStrength
Format
A data frame with 8 rows and 3 variables:
- gym
number of hours spent at the gym weekly
- strength
upper body strength score
- gender
individual's gender
Wage and Education Data from the 1985 Current Population Survey
Description
This is a subsample of 100 males and 100 females randomly selected from the 534 cases that comprised the 1985 Current Population Survey in a way that controls for highest education level attained. The sample of 200 contains 20 males and 20 females with less than a high school diploma, 20 males and 20 females with a high school diploma, 20 males and 20 females with some college training, 20 males and 20 females with a college diploma, and 20 males and 20 females with some graduate school training. The data include information about gender, highest education level attained, and hourly wage.
Usage
Wages
Format
A data frame with 400 rows and 9 variables:
- id
case number
- educ
number of years of education
- south
indicator for whether individual lives in the South
- sex
individual’s sex
- exper
number of years of work experience
- wage
wage (dollars per hour)
- occup
occupation category
- marr
marital status
- ed
highest education level
Bootstrapped Mean
Description
Function to obtain a sampling distribution of means by bootstrapping.
Usage
boot.mean(x, B, n = length(x))
Arguments
x |
original sample, given as a numeric or logical object, to be used to generate bootstrapped samples. |
B |
number of bootstrapped samples to be generated by randomly sampling with replacement. |
n |
size of each bootstrapped sample. Default setting is the size of the original sample. |
Value
A list with components:
Replications |
number of bootstrapped means computed. |
mean |
mean of bootstrapped means. |
se |
standard error, estimated as the standard deviation of bootstrapped means. |
bootstrap.samples |
means of bootstrapped samples. |
Examples
# using simple vector
a = 1:10
set.seed(1234)
boot.mean(a, B = 500)
# using variable from data frame
set.seed(1234)
boot.mean(Framingham$AGE3, B = 1000)
Cumulative Percentage Table
Description
Returns as a named vector the cumulative percentage frequency distribution of a variable x
at each unique value.
Usage
cumulative.table(x)
Arguments
x |
object containing data for a single variable. |
Details
If x
contains NA
values (missing data), the cumulative percentage table will not reach 100. The table will end with the cumulative percentage of non-missing data within the object; the value remaining after subtracting this value from 100 represents the percentage of NA
values within the object.
Value
A named numeric vector containing cumulative percentage frequencies, named by unique values of x
and ordered numerically or alphabetically by name.
See Also
Examples
# using variable without NA values
cumulative.table(NELS$famsize)
# using variable with NA values
cumulative.table(NELS$parmarl8)
Levene's Test for Homogeneity of Variance
Description
Function to test the homogeneity of variance for two populations, an assumption of the independent samples t-test. The null hypothesis tested is that the two population variances are equal; the alternative is that the two population variances are not equal.
Usage
levenes.test(y, group)
Arguments
y |
outcome variable of interest, given as a numeric object. |
group |
a factor or character object with two levels indicating group membership. |
Value
An anova table containing test results: two values for degrees of freedom, the F-value, and the p-value.
See Also
Examples
# using simple data frame
value = c(7,2,4,4,8,3,61,2,80,4)
grp = rep(c("A","B"), each = 5)
ex_data = data.frame(value = value, grp = grp)
levenes.test(ex_data$value, group = ex_data$grp)
# using variable without NA values
levenes.test(NELS$famsize, group = NELS$gender)
# using variable with NA values
levenes.test(NELS$achrdg12, group = NELS$gender)
Leverage
Description
Returns the leverage values for a linear regression model.
Usage
leverage(x)
Arguments
x |
linear regression model given as an |
Value
A numeric vector of leverage values.
See Also
lm
, rstudent()
, cooks.distance()
Examples
mod = lm(Framingham$SYSBP1 ~ Framingham$TOTCHOL1 + Framingham$AGE1)
leverage(mod)
Line Graph
Description
Function to plot the estimated density values of a variable as a line.
Usage
line.graph(x, ...)
Arguments
x |
numeric object to be plotted. |
... |
additional arguments to be passed to the |
Value
A line graph of the estimated density distribution of a variable.
See Also
Examples
line.graph(Temp$Temperature[Temp$City == "SanFrancisco"])
line.graph(IceCream$barsold)
Percentage Table
Description
For one variable, returns a frequency distribution table given in percentages. For two variables, returns a contingency table given in percentages.
Usage
percent.table(x, y = NULL)
Arguments
x |
object containing data for a single variable. |
y |
optional second object to create a contingency table given in percentages. Default setting ignores second object by setting |
Value
A table of frequency percentages (for one variable) or a contingency table of percentages (for two variables).
See Also
Examples
# frequency table for one variable
percent.table(NELS$region)
# cross-tabulation for two variables
percent.table(Wages$south,Wages$occup)
Standard Error of Skewness
Description
Function to obtain the standard error of the skewness of a distribution of values.
Usage
se.skew(x)
Arguments
x |
numeric object containing the values for a variable. |
Details
Standard error of skewness is computed on non-missing values using the following equation.
\sqrt( 6*N*(N-1) / ((N-2)*(N+1)*(N+3)) )
Value
Standard error of skewness for x
.
See Also
Examples
se.skew(Temp$Temperature[Temp$City == "Springfield"])
se.skew(Temp$Temperature[Temp$City == "SanFrancisco"])
Skewness of a Distribution
Description
Function to obtain the skewness value of a distribution of values.
Usage
skew(x)
Arguments
x |
numeric object containing the values for a variable. |
Details
Skewness value computed on non-missing values using the ratio of \Sigma((x - m)^3) / N
to \sqrt(\Sigma((x - m)^2) / N) ^3
.
Value
Skewness value of x
.
See Also
Examples
skew(IceCream$relhumid)
skew(IceCream$temp)
Skewness Ratio
Description
Returns the ratio of a distribution's skewness value to its standard error of skewness.
Usage
skew.ratio(x)
Arguments
x |
numeric object containing the values for a variable. |
Details
skew.ratio
relies on the functions skew
and se.skew
to compute the skewness value and standard error of skewness, respectively.
Value
Skewness ratio of x
.
See Also
Examples
# skew ratio computed two ways
skew.ratio(NELS$achmat12)
skew(NELS$achmat12) / se.skew(NELS$achmat12)
Mode
Description
Function to obtain the mode(s) of a distribution.
Usage
the.mode(x)
Arguments
x |
object containing data for a single variable. |
Value
A numeric vector of the value(s) of the distribution that have the highest frequency of occurrence.
See Also
Examples
# single mode for factor variable
the.mode(NELS$urban)
# bimodal numeric variable
a = c(14,24,62,12,12,12,36,17,11,99,99,99)
the.mode(a)