The wlsd (wrangling longitudinal survival data) package
supports processing of repeated observation data for model building in
survival analysis. A longitudinal study or panel study is a repeated
observation method by which subjects are monitored over time (Hsiao 2003). This type of study sees
application in analyzing disease progression such as heart deterioration
after transplant (Jackson 2024). When
preparing raw data into a format suitable for survival analysis models,
knowledge of the appropriate format for the model is needed as well as
the time and ability to transform the data. The functions available in
this package automate some of the preparation steps, simplifying the
process and reducing the time spent formatting.
Two common longitudinal data formats are the long format and the counting process format. The first table below shows a longitudinal data set in long format.
| id | time | event | var1 | var2 |
|---|---|---|---|---|
| 1 | 0 | 0 | 10.4 | 10 |
| 1 | 31 | 0 | 11.3 | 10 |
| 1 | 64 | 0 | 12.7 | 10 |
| 1 | 96 | 1 | 17.5 | 10 |
| 2 | 0 | 0 | 1.2 | 25 |
| 2 | 33 | 0 | 5.9 | 25 |
| 2 | 59 | 1 | 4.4 | 25 |
| 3 | 0 | 0 | 10.6 | 16 |
| 3 | 28 | 1 | 8.0 | 16 |
Information in the table above is for three subjects with identifiers
denoted by id. Each row corresponds to a particular
observation for a subject with time recording the
occurrence of the observation. Here, each subject starts at time 0 and
every subsequent value represents the number of time units since time 0.
There are 3 additional variables; event and
var1 vary with time while var2 is constant
over time. This format is easily legible and can be quickly filtered.
Using the first row as an example for interpretation, at
time 0 subject 1 had an event value of 0 and
var1 value of 10.4 with a var2 value of
10.
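For readers who want to follow along, the long format table above can be entered directly as a data.frame. The name long_data matches the object used in the code examples later in this document, though only the values here are taken from the table:

```r
# long format example data, entered directly from the table above
long_data <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  time  = c(0, 31, 64, 96, 0, 33, 59, 0, 28),
  event = c(0, 0, 0, 1, 0, 0, 1, 0, 1),
  var1  = c(10.4, 11.3, 12.7, 17.5, 1.2, 5.9, 4.4, 10.6, 8.0),
  var2  = c(10, 10, 10, 10, 25, 25, 25, 16, 16)
)
```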
The next table shows the same data in counting process format.
| id | time1 | time2 | event | var1 | var2 |
|---|---|---|---|---|---|
| 1 | 0 | 31 | 0 | 10.4 | 10 |
| 1 | 31 | 64 | 0 | 11.3 | 10 |
| 1 | 64 | 96 | 1 | 12.7 | 10 |
| 2 | 0 | 33 | 0 | 1.2 | 25 |
| 2 | 33 | 59 | 1 | 5.9 | 25 |
| 3 | 0 | 28 | 1 | 10.6 | 16 |
The biggest difference in this format is how time is defined: across
two columns instead of one. The two columns record the start and stop
times of an interval, namely time1 and time2.
Mathematically, the interval is open on the left and closed on the
right. That is, \((time1, time2] = \{x \mid time1 < x \leq
time2\}\). Consequently, the time intervals for a particular
subject do not overlap. Any time dependent covariates are interpreted
over the interval while any status indicator variables are interpreted
at the end of the interval. Looking at the first row in the above table,
event and var1 are the two columns for the
time dependent variables with event being a status
indicator and var1 being a covariate. The first value of
var1 is 10.4 so subject 1 has this value over the interval
from time 0 to time 31. Since the value of event is 0, this
subject does not have an event at the end of this interval. Looking at
the second interval for subject 1, the value of var1 is
11.3 over the interval from time 31 to time 64 and the subject again
does not have an event at the end of the interval. Subject 1 has an
event at time 96 which is the end of the third interval.
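The left-open, right-closed convention can be checked directly in R. The small sketch below (using only the intervals for subject 1 from the table above) shows that a time point equal to an interval's stop time belongs to that interval rather than the next one:

```r
# intervals for subject 1 taken from the counting process table above
ivals <- data.frame(time1 = c(0, 31, 64), time2 = c(31, 64, 96))
t <- 64
# (time1, time2] membership: time1 < t and t <= time2
ivals[ivals$time1 < t & t <= ivals$time2, ]
# t = 64 falls in (31, 64], not (64, 96]
```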
Note that there is a loss of some information transitioning from the
long format to the counting process format as the time interval
formulation means that the first status indicator value and last
covariate value will be omitted. For subject 1, this is the first value
of 0 for event at time 0 and the last value of
var1, 17.5, at time 96. A detailed description of the
counting process format is given by Collett
(2015) in its use for survival analysis. Therneau, Crowson, and Atkinson (2020) also
gives a description of the counting process format with use case
examples on how it is leveraged in survival analysis.
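As with the long format table, the counting process table can be entered directly. The name cp_data matches the object passed to cp2long() later in this document, though only the values here are taken from the table:

```r
# counting process example data, entered directly from the table above
cp_data <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3),
  time1 = c(0, 31, 64, 0, 33, 0),
  time2 = c(31, 64, 96, 33, 59, 28),
  event = c(0, 0, 1, 0, 1, 1),
  var1  = c(10.4, 11.3, 12.7, 1.2, 5.9, 10.6),
  var2  = c(10, 10, 10, 25, 25, 16)
)
```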
The figure below depicts the data formats and the possible transitions in the package.
Figure: Format transitions for wlsd.
The solid arrows represent explicitly defined transitions that have their own function while the single dashed arrow represents a transition that is possible but does not have its own function for use. This transition is discussed a bit more in the “Aggregating Longitudinal Data to Count Format” section below.
The long2cp() function is used to transition data from
long format to counting process format. It requires arguments for
data, id, time and
status. The data argument is used for the
data.frame object for transition while id is
the column name for the subject identifier. The time
argument is for the column name of the single time column that is to be
transitioned to an interval across columns while the status
argument is used for any status indicator variables that need to be
transitioned accordingly. The value of any status indicator variable is
assumed to be associated with the end point of the interval, so any
column names supplied to the status argument are
transitioned under this assumption; as a result, the first value is
dropped for each subject. All other columns are processed such that the
value is associated with the start point of the interval, which means
the last value for each subject is dropped. For constant
columns, this is irrelevant but for any other time dependent covariates
that are not named in the status argument, this is
important to note.
The code below shows an example using the long format data from earlier.
long2cp(data = long_data, id = "id", time = "time", status = "event")
#> id time1 time2 event var1 var2
#> 1 1 0 31 0 10.4 10
#> 2 1 31 64 0 11.3 10
#> 3 1 64 96 1 12.7 10
#> 4 2 0 33 0 1.2 25
#> 5 2 33 59 1 5.9 25
#> 6 3 0 28 1 10.6 16
Looking at the output above for the subject with id of
1, since the event column was supplied to the
status argument then the values associated with the end
points of that column are kept. All other time dependent covariates keep
values associated with the starting points. In this case, that is the
var1 column. Using row 3 as an example, individual 1
experienced the event at time 96 which is kept as the value for
event. However, their last recorded value for
var1 was 12.7 at time 64. This is then used for the 3rd row
as 12.7 is used over the interval from time 64 to time 96.
By default, the function will keep all subjects through the
drop = FALSE argument. However, it may be desirable to drop
individuals that only have 1 row of data in long format as they do not
have enough information for time intervals. Consider the
dropped_long data set below which is missing the 2nd
observation for id 3. This subject has only one time point
so a start and stop time interval is not really feasible. However, to
maintain the data set that is input to long2cp(), the
subject is kept in the output data set with a time interval that starts
and stops at the one time point given. All covariates are then
associated with that one point.
dropped_long <- long_data[1:8,]
long2cp(dropped_long, id = "id", time = "time", status = "event", drop = FALSE)
#> id time1 time2 event var1 var2
#> 1 1 0 31 0 10.4 10
#> 2 1 31 64 0 11.3 10
#> 3 1 64 96 1 12.7 10
#> 4 2 0 33 0 1.2 25
#> 5 2 33 59 1 5.9 25
#> 6 3 0 0 0 10.6 16
If the subject is intended to be dropped for analysis purposes, then
the drop = TRUE argument can be used to exclude these rows
from the output data set.
long2cp(dropped_long, id = "id", time = "time", status = "event", drop = TRUE)
#> id time1 time2 event var1 var2
#> 1 1 0 31 0 10.4 10
#> 2 1 31 64 0 11.3 10
#> 3 1 64 96 1 12.7 10
#> 4 2 0 33 0 1.2 25
#> 5 2 33 59 1 5.9 25
The single row for id value 3 is missing in the output
above.
The transition from counting process format to long format is done
through the cp2long() function. Arguments are similar to
the previous function but take two time points as input instead of one.
This is to consolidate the two end points of the interval into a single
column.
cp2long(data = cp_data, id = "id", time1 = "time1", time2 = "time2", status = "event")
#> id time event var1 var2
#> 1 1 0 NA 10.4 10
#> 2 1 31 0 11.3 10
#> 3 1 64 0 12.7 10
#> 4 1 96 1 NA NA
#> 5 2 0 NA 1.2 25
#> 6 2 33 0 5.9 25
#> 7 2 59 1 NA NA
#> 8 3 0 NA 10.6 16
#> 9 3 28 1 NA NA
Note that in this example there are a few NA values
which differ from the example counting process table. This is due to the
missing information about the first time covariates mentioned earlier.
By default, the function will not make any assumptions about this data
and will simply leave NA values for any unknown values.
Therefore, any column names supplied to the status argument will be
missing the first value for each id as these are assumed to
be associated with the endpoint of the interval so there is no value for
the starting point of the first interval. Similarly, any other time
dependent columns will be missing the last value of each id
as they are associated with the start time point over the whole interval
so the last value of the last interval will be missing. If desired, the
fill = TRUE option may be used in order to insert values
for any constant columns that might result in an NA due to
this reasoning.
cp2long(cp_data, id = "id", time1 = "time1", time2 = "time2", status = "event", fill = TRUE)
#> id time event var1 var2
#> 1 1 0 NA 10.4 10
#> 2 1 31 0 11.3 10
#> 3 1 64 0 12.7 10
#> 4 1 96 1 NA 10
#> 5 2 0 NA 1.2 25
#> 6 2 33 0 5.9 25
#> 7 2 59 1 NA 25
#> 8 3 0 NA 10.6 16
#> 9 3 28 1 NA 16
Here, the var2 column is constant, so its NA values were filled in with the constant value.
It may be useful to aggregate longitudinal data into count format
with tallies of repeated observations. Transitions to the count format
are done through the long2count() function. While
originally conceived for the transition from long format to count
format, the same logic applies to the transition from counting process
format to count format, so this function can be used for both
transitions. Since the major difference between the long and counting
process formats is how time is organized, covariates are aggregated
row-wise in the same way in either case, but the aggregated values will
differ across formats due to the one-value loss of information from
long format to counting process format.
The long2count() function requires a data set and
identifier column to perform aggregation operations. It also requires a
value for either the event or state
argument. This will determine how to treat events of interest under the
observation scheme of the data set. The event argument will
treat the given column or columns as numeric counts of observations
while the state argument will treat the column largely as a
factor with different levels.
Consider the first example below with the long format data and the
single event column called event.
long2count(long_data, id = "id", event = "event")
#> id time var1 var2 event.counts count.weight
#> 1 1 47.75000 12.975000 10 1 4
#> 2 2 30.66667 3.833333 25 1 3
#> 3 3 14.00000 9.300000 16 1 2
Since the argument supplied to event was a count of
occurrences, the output column is a sum of rows. Any columns that are
constant within the id argument grouping are carried over
to the output as is while any that are time varying (or change row-wise)
are then aggregated into the final count format output.
The default aggregation function for any time varying covariate is
mean() but this can be changed through the FUN
argument. For example, setting FUN = max will take the
maximum value over time and use that as the value in the aggregated
count data set.
long2count(long_data, id = "id", event = "event", FUN = max)
#> id time var1 var2 event.counts count.weight
#> 1 1 96 17.5 10 1 4
#> 2 2 59 5.9 25 1 3
#> 3 3 28 10.6 16 1 2
For multiple events, a vector of event column names can be supplied to the event argument.
long_data2 <- long_data
long_data2$event2 <- c(2,2,3,1,2,3,2,1,3)
long2count(long_data2, id = "id", event = c("event","event2"))
#> id time var1 var2 event.counts event2.counts count.weight
#> 1 1 47.75000 12.975000 10 1 8 4
#> 2 2 30.66667 3.833333 25 1 7 3
#> 3 3 14.00000 9.300000 16 1 4 2
If the observed variable is to be treated as a state instead of an
event, then the state argument will handle the aggregation accordingly.
Consider the event column as a binary state where "0"
is coded as healthy and "1" as unhealthy. The
state argument will instead count the number of occurrences
of each state.
long2count(long_data, id = "id", state = "event")
#> id time event var1 var2 event.counts count.weight
#> 1 1 47.75000 0 12.975000 10 3 4
#> 2 1 47.75000 1 12.975000 10 1 4
#> 3 2 30.66667 0 3.833333 25 2 3
#> 4 2 30.66667 1 3.833333 25 1 3
#> 5 3 14.00000 0 9.300000 16 1 2
#> 6 3 14.00000 1 9.300000 16 1 2
For the example above, subject 1 now has two rows, one for each
level of "event". From the event.counts column,
this subject has 3 occurrences of state 0 being healthy and 1 occurrence
of state 1 being unhealthy. This can be expanded to larger state spaces
with more detailed descriptions.
A value must be supplied to either the
event or state argument. The
event argument will handle multiple columns as events but
the state argument will only accept one column to be
treated as a state. Combinations of event and
state are allowed such as 1 state and 1 event or 1 state
and multiple events.
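No combined example is shown above, but a sketch of such a call might look like the following, reusing the long_data2 object created earlier. This exact combination of arguments is an assumption based on the description above rather than a tested example:

```r
# hypothetical combination: treat event as a state and event2 as an event count
long2count(long_data2, id = "id", state = "event", event = "event2")
```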
A few other functions within the wlsd
package help with niche cases that may arise in practice. All of them
were formalized from preparation steps applied to the
LBP data set (discussed later). The three notable functions
are basedate(), events2state(), and
takefirst(). The first, basedate(), creates a
new row for each individual which can serve as their baseline or first
visit and then be populated with new information. The
events2state() function is meant to consolidate multiple
columns of event observations to a single state column. That is, it
aggregates multiple columns into a single factor-like column based on
the combinations of values from the input columns. The
takefirst() function keeps the rows for each subject up to
and including the first row that meets a given criterion. These functions are
discussed in greater detail in the “Applications” section below as the
example context gives greater understanding to their use.
In order to demonstrate the use cases of this package, consider this example using a low-back pain (LBP) data set that needs to be prepared for fitting several models in a large survival analysis. The data set originally comes from Garg et al. (2013) and was used in Ingulli (2020) for a comprehensive survival analysis, from which some of this example originates. See the appendix of Ingulli (2020) for a larger discussion on preparing this data set using some early versions of this package.
The data is available within the package via LBP. It
consists of 466 individuals with associated low back pain statuses over
a 6-year observation period. The first 5 rows and the first 10 columns
of data are shown below.
LBP[1:5,1:10]
#> sid Baseline.date Date time_to_row case.lbp case.med case.sc case.lt
#> 1 1 2004-07-01 2004-08-23 53 0 0 0 0
#> 2 1 2004-07-01 2004-09-30 91 0 0 0 0
#> 3 1 2004-07-01 2004-12-01 153 0 0 0 0
#> 4 1 2004-07-01 2005-01-12 195 0 0 0 0
#> 5 1 2004-07-01 2005-02-15 229 0 0 0 0
#> gender age
#> 1 F 28.1
#> 2 F 28.1
#> 3 F 28.1
#> 4 F 28.1
#> 5 F 28.1
The data set is organized in long format with subject column
sid, primary time column time_to_row, and
several time varying LBP variables including case.lbp,
case.med, case.sc, and case.lt.
These last 4 variables represent the binary occurrence of LBP, LBP
requiring medication, LBP requiring medical care, and LBP resulting in
lost time from work, respectively, at the time of observation. See the
help file via help(LBP) for more information.
At the time of construction for this particular data set, the
baseline data was not available. That is, the first observations were
missing. For analysis purposes, the assumption was made that all
individuals would be free from all LBP at this time point and the
analysis would continue. That assumed information can be added to the
data set through the basedate() function. In the code
below, a new row is added for each individual that can then be populated
with the new LBP values.
LBPwBaseline <- basedate(LBP, "sid")
LBPwBaseline[1:5,1:10]
#> sid Baseline.date Date time_to_row case.lbp case.med case.sc case.lt gender
#> 1 1 2004-07-01 NA NA NA NA NA NA F
#> 2 1 2004-07-01 12653 53 0 0 0 0 F
#> 3 1 2004-07-01 12691 91 0 0 0 0 F
#> 4 1 2004-07-01 12753 153 0 0 0 0 F
#> 5 1 2004-07-01 12795 195 0 0 0 0 F
#> age
#> 1 28.1
#> 2 28.1
#> 3 28.1
#> 4 28.1
#> 5 28.1
Since baseline information is unavailable, a few data points must be
filled in. The baseline date for all individuals is stored as a
constant in another column, Baseline.date, so this value can
be copied over to the Date column for each individual's new
baseline row. The remaining missing data points can then be filled with
0s, under the assumption that each individual starts the study without
any instances of the case definitions (indicated by 0) and that the
baseline visit, being the first observation, occurs at
time_to_row equal to 0.
LBPwBaseline$Date <- ifelse(is.na(LBPwBaseline$Date), LBPwBaseline$Baseline.date, LBPwBaseline$Date)
LBPwBaseline[is.na(LBPwBaseline)] <- 0
Any constant columns within subject groups are carried over to the
new row while anything that changes over time, or row-wise, is left as
NA.
Through exploration, a few cases can be found where
case.lbp is 0 while another “case.” column is 1.
LBPwBaseline[404:406,1:8]
#> sid Baseline.date Date time_to_row case.lbp case.med case.sc case.lt
#> 404 27 2004-05-25 13147 584 1 1 1 0
#> 405 27 2004-05-25 13179 616 0 1 1 0
#> 406 27 2004-05-25 13236 673 1 1 1 1
Intuitively, this does not make sense as these other cases are
defined to be based on the subject having LBP and therefore imply
case.lbp is 1. This was most likely due to data entry errors
in the case.lbp column, so these values are recoded to 1 if
another case value is also 1.
LBPwBaseline$case.lbp <- ifelse(LBPwBaseline$case.lbp == 0 &
(LBPwBaseline$case.med == 1 |
LBPwBaseline$case.sc == 1 |
LBPwBaseline$case.lt == 1),
1,
LBPwBaseline$case.lbp)
LBPwBaseline[404:406,1:8]
#> sid Baseline.date Date time_to_row case.lbp case.med case.sc case.lt
#> 404 27 2004-05-25 13147 584 1 1 1 0
#> 405 27 2004-05-25 13179 616 1 1 1 0
#> 406 27 2004-05-25 13236 673 1 1 1 1
As is, the data set is in a format that can be used for a multi-state
model analysis such as that provided by the msm package.
That package takes a state argument as a single column, so the
multiple LBP columns cannot be passed in their current form. The
events2state() function can be used to consolidate those
multiple columns into a single column based on the different
combinations of values.
LBP_msm <- events2state(LBPwBaseline, events = c("case.lbp",
"case.med",
"case.sc",
"case.lt"))
#> Combination Levels: 0.0.0.0 1.0.0.0 1.1.0.0 1.0.1.0 1.1.1.0 1.0.0.1 1.1.0.1 1.0.1.1 1.1.1.1
#> Numbered Levels: 1 2 3 4 5 6 7 8 9
In the code above, the values of the 4 input columns combine into 9
total combinations. The combination levels are printed to the console
when the function runs. By default, these combinations are converted to
numbers through the number = TRUE argument. The
interaction() function handles the combinations of columns
and, by default, unused combinations are dropped. This behavior can be
toggled by setting drop = FALSE.
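The consolidation idea can be sketched in base R with interaction() alone. This is only an illustration of the underlying mechanism on toy values, not the actual internals of events2state():

```r
# toy sketch of consolidating two event columns into one state column
ev1 <- c(0, 0, 1, 1)
ev2 <- c(0, 1, 0, 0)
state <- interaction(ev1, ev2, sep = ".", drop = TRUE)  # drops unused combinations
levels(state)        # the observed combination levels
as.numeric(state)    # the numbered levels
```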
The LBP_msm data set is suitable input for the
msm package. This is largely the form used for the analysis
in Ingulli (2020) which has a section on
using the msm package for multi-state model analysis of
this data set.
In order to transition the data set to counting process format, the
long2cp() function is used below.
LBP_cp <- long2cp(LBPwBaseline, id = "sid", time = "time_to_row", status = c("case.lbp", "case.med", "case.sc", "case.lt"))
LBP_cp[1:5,1:10]
#> sid Baseline.date Date time1 time2 case.lbp case.med case.sc case.lt gender
#> 1 1 2004-07-01 12600 0 53 0 0 0 0 F
#> 2 1 2004-07-01 12653 53 91 0 0 0 0 F
#> 3 1 2004-07-01 12691 91 153 0 0 0 0 F
#> 4 1 2004-07-01 12753 153 195 0 0 0 0 F
#> 5 1 2004-07-01 12795 195 229 0 0 0 0 F
Any subjects with 1 row are dropped since they do not have enough
time information for the desired analysis. The survival
package contains analysis models that take data in counting process
format as input. The LBP_cp data set has multiple
occurrences of LBP so a repeated event or recurrent model might be more
suitable.
If a simple Cox proportional-hazards model for the time to first
instance of LBP is desired, all observations after the first occurrence
of LBP can be truncated using the takefirst() function.
LBP_cp2 <- takefirst(LBP_cp, id = "sid",
criteria.column = "case.lbp",
criteria = 1)
LBP_cp2[11:14,1:8]
#> sid Baseline.date Date time1 time2 case.lbp case.med case.sc
#> 11 1 2004-07-01 13014 414 455 0 0 0
#> 12 1 2004-07-01 13055 455 517 1 1 0
#> 13 2 2004-07-01 12600 0 11 0 0 0
#> 14 2 2004-07-01 12611 11 53 1 1 1
In this form, LBP_cp2 has all rows up to and including
the first time case.lbp is 1 for each subject. From this
point, the data can be supplied to the coxph() function in
survival.
The transitions to count format are done through
long2count(). Only the LBP related values and time itself
vary row-wise. As events, the LBP columns can be counted through the
event argument as demonstrated below.
LBP_count <- long2count(LBPwBaseline, "sid", event = c("case.lbp",
"case.med",
"case.sc",
"case.lt"))
LBP_count[1:5,c(1,20:24)]
#> sid case.lbp.counts case.med.counts case.sc.counts case.lt.counts
#> 1 1 2 1 0 0
#> 112 2 6 2 1 1
#> 223 3 15 2 1 0
#> 334 4 5 0 0 0
#> 412 5 9 0 0 0
#> count.weight
#> 1 18
#> 112 16
#> 223 17
#> 334 18
#> 412 16
When working with the LBP_msm data set and its state
column, the state argument can be used, and the LBP
variables can be summed to count their occurrences over time.
LBP_state <- long2count(LBP_msm, "sid", state = "state", FUN = sum)
LBP_state[1:9,c(1,5:8,24:26)]
#> sid case.lbp case.med case.sc case.lt state state.counts count.weight
#> 1 1 2 1 0 0 1 16 18
#> 2 1 2 1 0 0 2 1 18
#> 3 1 2 1 0 0 3 1 18
#> 4 1 2 1 0 0 4 0 18
#> 5 1 2 1 0 0 5 0 18
#> 6 1 2 1 0 0 6 0 18
#> 7 1 2 1 0 0 7 0 18
#> 8 1 2 1 0 0 8 0 18
#> 9 1 2 1 0 0 9 0 18
When using the state argument, all 9 states
are counted for each subject. The other LBP columns are summed, giving
the total number of occurrences for each column. Either count format
data set should be suitable for some sort of count regression model. The
former might offer a more straightforward approach relating a single
count to covariates of interest. The latter data set might be suitable
with a random effect and a reduced state space. See Ingulli (2020) for an example analysis using
count regression on only the counts of case.lbp.