Type: | Package |
Title: | Joint Segmentation of Multivariate (Copy Number) Signals |
Version: | 1.0.3 |
Description: | Methods for fast segmentation of multivariate signals into piecewise constant profiles and for generating realistic copy-number profiles. A typical application is the joint segmentation of total DNA copy numbers and allelic ratios obtained from Single Nucleotide Polymorphism (SNP) microarrays in cancer studies. The methods are described in Pierre-Jean, Rigaill and Neuvial (2015) <doi:10.1093/bib/bbu026>. |
License: | LGPL-2.1 | LGPL-3 [expanded from: LGPL (≥ 2.1)] |
Depends: | R (≥ 3.1.0) |
Imports: | acnr (≥ 0.3.1), matrixStats (≥ 0.6.0), DNAcopy |
Suggests: | PSCBS, R.cache, digest, changepoint (≥ 1.0.2), knitr, rmarkdown, testthat |
VignetteBuilder: | knitr |
URL: | https://github.com/mpierrejean/jointseg |
RoxygenNote: | 7.1.1 |
BugReports: | https://github.com/mpierrejean/jointseg/issues |
NeedsCompilation: | yes |
Packaged: | 2025-05-14 06:22:03 UTC; mpierre-jean |
Author: | Morgane Pierre-Jean [aut, cre], Pierre Neuvial [aut], Guillem Rigaill [aut] |
Maintainer: | Morgane Pierre-Jean <mpierrejean.pro@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-05-14 12:00:02 UTC |
Pruned dynamic programming algorithm
Description
Low-level API for the pruned dynamic programming algorithm (pDPA)
Usage
Fpsn(x, Kmax, mini = min(x), maxi = max(x))
Arguments
x |
A vector of double : the signal to be segmented |
Kmax |
Max number of segments |
mini |
Min value for the mean parameter of the segment |
maxi |
Max value for the mean parameter of the segment |
Details
This implementation uses functional pruning and segment neighborhood, and the L2-loss function
Value
A list with a vector containing the position of the change-points
Author(s)
Guillem Rigaill
References
Rigaill, G. (2015). A pruned dynamic programming algorithm to recover the best segmentations with 1 to K_max change-points. Journal de la Societe Francaise de Statistique, 156(4), 180-205.
See Also
doDynamicProgramming
for a higher-level function
Examples
## load known real copy number regions
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=1)
## generate a synthetic CN profile
K <- 10
len <- 1e4
sim <- getCopyNumberDataByResampling(len, K, minLength=100, regData=affyDat)
datS <- sim$profile
## run pruned DPA segmentation
res <- Fpsn(datS[["c"]], Kmax=2*K+1)
## plot segmentation results for the true number of breakpoints
bkp <- res$t.est[K+1, 1:K]
plotSeg(datS, breakpoints=bkp)
Parent-Specific copy number segmentation
Description
This function splits (bivariate) copy number signals into parent-specific (PS) segments using recursive binary segmentation
Usage
PSSeg(
data,
method,
stat = NULL,
dropOutliers = TRUE,
rankTransform = FALSE,
...,
profile = FALSE,
verbose = FALSE
)
Arguments
data |
Data frame containing the following columns:
These data are assumed to be ordered by genome position. |
method |
|
stat |
A vector containing the names or indices of the columns of
|
dropOutliers |
If TRUE, outliers are droped by using DNAcopy package |
rankTransform |
If TRUE, data are replaced by their ranks before segmentation |
... |
Further arguments to be passed to |
profile |
Trace time and memory usage ? |
verbose |
A |
Details
Before segmentation, the decrease in heterozygosity d=2|b-1/2|
defined
in Bengtsson et al, 2010 is calculated from the input data. d
is only
defined for heterozygous SNPs, that is, SNPs for which
data$genotype==1/2
. d
may be seen as a "mirrored" version of
allelic ratios (b
): it converts them to a piecewise-constant signals
by taking advantage of the bimodality of b
for heterozygous SNPs. The
rationale for this transformation is that allelic ratios (b
) are only
informative for heterozygous SNPs (see e.g. Staaf et al, 2008).
Before segmentation, the outliers in the copy number signal are droped according the method explained by Venkatraman, E. S. and Olshen, A. B., 2007.
The resulting data are then segmented using the jointSeg
function, which combines an initial segmentation according to argument
method
and pruning of candidate change points by dynamic programming
(skipped when the initial segmentation *is* dynamic programming).
If argument stat
is not provided, then dynamic programming is run on
the two dimensional statistic "(c,d)"
.
If argument stat
is provided, then dynamic programming is run on
stat
; in this case we implicitly assume that stat
is a
piecewise-constant signal.
Value
A list with elements
- bestBkp
Best set of breakpoints after dynamic programming
- initBkp
Results of the initial segmentation, using 'doNnn', where 'Nnn' corresponds to argument
method
- dpBkpList
Results of dynamic programming, a list of vectors of breakpoint positions for the best model with k breakpoints for k=1, 2, ... K where
K=length(initBkp)
- prof
a
matrix
providing time usage (in seconds) and memory usage (in Mb) for the main steps of the program. Only defined if argumentprofile
is set toTRUE
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bengtsson, H., Neuvial, P., & Speed, T. P. (2010). TumorBoost: Normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays. BMC bioinformatics, 11(1), 245.
Staaf, J., Lindgren, D., Vallon-Christersson, et al. (2008). Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol, 9(9), R136.
Pierre-Jean, M, Rigaill, G. J. and Neuvial, P. (2015). "Performance Evaluation of DNA Copy Number Segmentation Methods." *Briefings in Bioinformatics*, no. 4: 600-615.
See Also
Examples
## load known real copy number regions
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=0.5)
## generate a synthetic CN profile
K <- 10
len <- 1e4
sim <- getCopyNumberDataByResampling(len, K, regData=affyDat)
datS <- sim$profile
## run binary segmentation (+ dynamic programming)
resRBS <- PSSeg(data=datS, method="RBS", stat=c("c", "d"), K=2*K, profile=TRUE)
resRBS$prof
getTpFp(resRBS$bestBkp, sim$bkp, tol=5)
plotSeg(datS, breakpoints=list(sim$bkp, resRBS$bestBkp))
Get best candidate change point
Description
Get best candidate change point according to binary segmentation
Usage
anotherBkp(Y, weightFUN = defaultWeights, verbose = FALSE)
Arguments
Y |
A |
weightFUN |
A |
verbose |
A |
Details
Contrary to oneBkp
, anotherBkp
handles missing values (NA:s).
Value
A numerical value, a candidate breakpoint position
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
Examples
p <- 2
n <- 100
sim <- randomProfile(n, 1, 1, p)
Y <- sim$profile
bkp <- anotherBkp(Y)
print(bkp)
print(oneBkp(Y))
## stopifnot(identical(oneBkp(Y), bkp))
plotSeg(Y, list(sim$bkp, bkp))
## robustness to NA:s
h <- 2
idxs <- seq(from=max(sim$bkp[1]-h, 1), min(sim$bkp[1]+h, n))
Y[idxs, p] <- NA
oneBkp(Y) ## does not work
bkp <- anotherBkp(Y) ## works
bkp-sim$bkp
binMissingValues
Description
Perform binning in order to remove missing values
Usage
binMissingValues(Y, verbose = FALSE)
Arguments
Y |
A numeric matrix |
verbose |
A |
Details
Some segmentation methods (in particular, GFLars) do not natively handle the
situation when some observations have missing values in one or more
dimensions. In order to avoid dropping the corresponding observations
entirely, binMissingValues
bins the signal values of the last complete
observation before a (range of) observations with missing entries using the
binMeans
function.
In the specific case when the first row has NA values, the first non-missing entry is replicated in order to make smoothing possible. This choice is arbitrary but some arbitrary choice is needed in that case.
Note
Currently this function is only used by doGFLars
in order
to make it possible to run GFLars segmentation on SNP array data where most
markers (on the order of 2/3 to 5/6) have missing values, because of
uninformative or missing allelic ratio signals.
The binMissingValues
function may be used for other segmentation
methods suffering from the same limitation. However, we emphasize that
handling missing values natively in the segmentation method would be a
better solution.
Currently this function is only used by doGFLars
in order
to make it possible to run GFLars segmentation on SNP array data where most
markers (on the order of 2/3 to 5/6) have missing values, because of
uninformative or missing allelic ratio signals. The binMissingValues
function may be used for other segmentation methods suffering from the same
limitation. However, we emphasize that handling missing values natively in
the segmentation method would be a better solution.
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bleakley, K., & Vert, J. P. (2011). The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199.
Vert, J. P., & Bleakley, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems, 23, 2343-2351.
Examples
sim <- randomProfile(10, 1, 0.1, 3)
Y <- sim$profile
Y[c(4, 8), 2] <- NA
Y[c(7, 8), 3] <- NA
res <- binMissingValues(Y)
Y <- sim$profile
Y[1:5, 2] <- NA
Yb <- binMissingValues(Y)
Y <- sim$profile
Y[3:5, 2] <- NA
Yb <- binMissingValues(Y)
Compute default weights for the weighted group fused Lasso
Description
Compute default weights for the weighted group fused Lasso
Usage
defaultWeights(n)
Arguments
n |
Number of observations |
Value
Vector of default weights in the reference article.
Note
This implementation is derived from the MATLAB code by Vert and Bleakley: http://cbio.ensmp.fr/GFLseg.
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bleakley, K., & Vert, J. P. (2011). The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199.
Examples
defaultWeights(10)
Run CBS segmentation
Description
This function is a wrapper for convenient use of the CBS
segmentation
method by PSSeg
. It applies the
segment
function and reshapes the results
Usage
doCBS(y, ..., verbose = FALSE)
Arguments
y |
A numeric vector, the signal to be segmented |
... |
Arguments to be passed to |
verbose |
A |
Value
A list with a single element:
- bkp
breakpoint positions
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
See Also
Examples
## load known real copy number regions
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=1)
## generate a synthetic CN profile
K <- 10
len <- 1e4
sim <- getCopyNumberDataByResampling(len, K, minLength=100, regData=affyDat)
datS <- sim$profile
## run CBS segmentation
res <- doCBS(datS[["c"]])
getTpFp(res$bkp, sim$bkp, tol=5, relax = -1) ## true and false positives
plotSeg(datS, breakpoints=list(sim$bkp, res$bkp))
Run segmentation by dynamic programming
Description
High-level function for univariate or multivariate segmentation by dynamic programming
Usage
doDynamicProgramming(Y, K, stat = NULL, verbose = FALSE)
Arguments
Y |
A numeric vector or a matrix, the signal to be segmented |
K |
The number of change points to find |
stat |
A vector containing the names or indices of the columns of
|
verbose |
A |
Details
If the signal is uni-dimensional, this function simply uses the segmentation
method provided in the cghseg
package reshapes the results.
If the signal is multi-dimensional, this function applies the
pruneByDP
function and reshapes the results.
Value
bkp |
A vector of |
dpseg |
A list of two elements
|
Note
This is essentially a wrapper for convenient segmentation by dynamic
programming using the PSSeg
function.
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Rigaill, G. (2015). A pruned dynamic programming algorithm to recover the best segmentations with 1 to K_max change-points. Journal de la Societe Francaise de Statistique, 156(4), 180-205.
Examples
## load known real copy number regions
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=1)
## generate a synthetic CN profile
K <- 10
len <- 1e4
sim <- getCopyNumberDataByResampling(len, K, minLength=100, regData=affyDat)
datS <- sim$profile
## run pruned DPA segmentation
resDP <- doDynamicProgramming(datS[["c"]], K=K)
getTpFp(resDP$bkp, sim$bkp, tol=5, relax = -1) ## true and false positives
plotSeg(datS, breakpoints=list(sim$bkp, resDP$bkp))
## run 2d dynamic programming segmentation
K <- 2
len <- 1e3
sim <- getCopyNumberDataByResampling(len, K, minLength=100, regData=affyDat)
datS <- sim$profile
datS$d <- 2*abs(datS$b-1/2)
datS[which(datS$genotype!=0.5),"d"] <- NA
Y = cbind(datS$c,datS$d)
resDP2d <- doDynamicProgramming(Y, K = K)
Group fused Lars segmentation
Description
High-level function for multivariate fused Lars (GFLars) segmentation
Usage
doGFLars(Y, K, stat = NULL, ..., verbose = FALSE)
Arguments
Y |
A |
K |
The number of change points to find |
stat |
A vector containing the names or indices of the columns of
|
... |
Further arguments to be passed to 'segmentByGFLars' |
verbose |
A |
Details
This function is a wrapper around the lower-level segmentation function
segmentByGFLars
. It can be run on p-dimensional,
piecewise-constant data in order to defined a set of candidate change
points. It is recommended to prune this list of candidates using dynamic
programming (pruneByDP
), combined with a selection of the best
number of change points. The jointSeg
function provides a
convenient wrapper for performing segmentation, pruning and model selection.
For the specific case of DNA copy number data segmentation, see the
dedicated wrapper PSSeg
.
The default weights \sqrt{n/(i*(n-i))}
are calibrated as suggested by
Bleakley and Vert (2011). Using this calibration, the first breakpoint
maximizes the likelihood ratio test (LRT) statistic.
Value
An object of the same structure as the output of
segmentByGFLars
Note
This implementation is derived from the MATLAB code by Vert and Bleakley:
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bleakley, K., & Vert, J. P. (2011). The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199.
Vert, J. P., & Bleakley, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems, 23, 2343-2351.
Examples
p <- 2
trueK <- 10
sim <- randomProfile(1e4, trueK, 1, p)
Y <- sim$profile
K <- 2*trueK
res <- doGFLars(Y, K)
print(res$bkp)
print(sim$bkp)
plotSeg(Y, res$bkp)
## a toy example with missing values
sim <- randomProfile(1e2, 1, 0.1, 2)
Y <- sim$profile
Y[3:50, 2] <- NA
res <- doGFLars(Y, 10, 2, verbose=TRUE)
print(res$bkp)
print(sim$bkp)
plotSeg(Y, res$bkp)
Run Paired PSCBS segmentation
Description
This function is a wrapper for convenient use of the PSCBS
segmentation method by PSSeg
. It applies the
segmentByPairedPSCBS
function and reshapes the results
Usage
doPSCBS(Y, ..., verbose = FALSE)
Arguments
Y |
A matrix of signals to be segmented, containing the following columns
|
... |
Arguments to be passed to
|
verbose |
A |
Value
A list with a single element:
- bkp
breakpoint positions
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
See Also
Examples
## Not run:
## load known real copy number regions
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=1)
## generate a synthetic CN profile
K <- 10
len <- 1e4
sim <- getCopyNumberDataByResampling(len, K, minLength=100, regData=affyDat)
datS <- sim$profile
## run PSCBS segmentation
Y <- as.matrix(subset(datS, select=c(c, b, genotype)))
res <- doPSCBS(Y)
getTpFp(res$bkp, sim$bkp, tol=5, relax = -1) ## true and false positives
plotSeg(datS, breakpoints=list(sim$bkp, res$bkp))
## End(Not run)
Run PSCN segmentation (defunct)
Description
The 'PSCN' package is not maintained anymore and it is not available for R >= 3.0.0. The original 'doPSCN' function has been moved to the directory 'zzz.defunct'. The skeleton of that function is kept for backward compatibility.
Usage
doPSCN(
Y,
alpha = 0.01,
platform = c("Illumina", "Affymetrix"),
verbose = FALSE,
...
)
Arguments
Y |
The signal to be segmented, a matrix containing the following columns:
|
alpha |
sensitivity level in [0,1] to be passed to
|
platform |
Specifies form which array platform 'Y' was generated: Illumina or Affymetrix |
verbose |
A |
... |
Further arguments to be passed to |
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Chen, H., Xing, H., & Zhang, N. R. (2011). Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS computational biology, 7(1), e1001060.
See Also
Examples
print("The 'PSCN' package is not available for R >= 3.0.0.")
print("See http://cran.r-project.org/web/packages/PSCN/index.html")
Run RBS segmentation
Description
High-level function for multivariate recursive binary (RBS) segmentation
Usage
doRBS(Y, K, stat = NULL, ..., verbose = FALSE)
Arguments
Y |
A |
K |
The number of change points to find |
stat |
A vector containing the names or indices of the columns of
|
... |
Further arguments to be passed to 'segmentByRBS' |
verbose |
A |
Details
This function is a wrapper around the lower-level segmentation function
segmentByRBS
. It can be run on p-dimensional,
piecewise-constant data in order to defined a set of candidate change
points. It is recommended to prune this list of candidates using dynamic
programming (pruneByDP
), combined with a selection of the best
number of change points. The jointSeg
function provides a
convenient wrapper for performing segmentation, pruning and model selection.
Value
An object of the same structure as the output of
segmentByRBS
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Gey, S., & Lebarbier, E. (2008). Using CART to Detect Multiple Change Points in the Mean for Large Sample. http://hal.archives-ouvertes.fr/hal-00327146/
See Also
Examples
p <- 2
trueK <- 10
len <- 5e4
sim <- randomProfile(len, trueK, 1, p)
Y <- sim$profile
K <- 2*trueK
res <- doRBS(Y, K)
getTpFp(res$bkp, sim$bkp, tol=10, relax = -1) ## true and false positives
cols <- rep(2, K)
cols[1:trueK] <- 3
par(mfrow=c(p,1))
for (ii in 1:p) {
plot(Y[, ii], pch=19, cex=0.2)
abline(v=res$bkp[1:trueK], col= cols)
abline(v=sim$bkp, col=8, lty=2)
}
## NA:s in one dimension at a true breakpoint
jj <- sim$bkp[1]
Y[jj-seq(-10, 10), p] <- NA
res2 <- doRBS(Y, K)
getTpFp(res2$bkp, sim$bkp, tol=10, relax = -1) ## true and false positives
## NA:s in both dimensions at a true breakpoint
Y[jj-seq(-10, 10), ] <- NA
res3 <- doRBS(Y, K)
getTpFp(res3$bkp, sim$bkp, tol=10, relax = -1) ## true and false positives
Robust standard deviation estimator
Description
Estimate standard deviation of an unimodal signal with possible changes in mean
Usage
estimateSd(y, method = c("Hall", "von Neumann"))
Arguments
y |
A numeric vector |
method |
Method used to estimate standard deviation |
Details
von Neumann's estimator is proportional to the mean absolute deviation
(mad
) of the first-order differences of the original signals:
mad(diff(y)
. By construction this estimator is robust to 1) changes
in the mean of the signal (through the use of differences) and 2) outliers
(through the use of mad
instead of mean
).
The proportionality constant 1/\sqrt 2 \times 1/\Phi^{-1}(3/4)
ensures
that the resulting estimator is consistent for the estimation of the
standard deviation in the case of Gaussian signals.
Hall's estimator is a weigthed sum of squared elements of y. Let m=3.
sigma^2 =
(\sum_{k=1}^{n-m}\sum_{j=1}^{m+1}(\code{wei[i]}\code{y}[i+k])^2)/(n-m)
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Von Neumann, J., Kent, R. H., Bellinson, H. R., & Hart, B. T. (1941). The mean square successive difference. The Annals of Mathematical Statistics, 153-162.
Peter Hall, J. W. Kay and D. M. Titterington (1990). Asymptotically Optimal Difference-Based Estimation of Variance in Nonparametric Regression Biometrika,521-528
Examples
n <- 1e4
y <- rnorm(n) ## a signal with no change in mean
estimateSd(y)
estimateSd(y, method="von Neumann")
sd(y)
mad(y)
z <- y + rep(c(0,2), each=n/2) ## a signal with *a single* change in mean
estimateSd(z)
estimateSd(z, method="von Neumann")
sd(z)
mad(z)
z <- y + rep(c(0,2), each=100) ## a signal with many changes in mean
estimateSd(z)
estimateSd(z, method="von Neumann")
sd(z)
mad(z)
Generate a copy number profile by resampling
Description
Generate a copy number profile by resampling input data
Usage
getCopyNumberDataByResampling(
length,
nBkp = NA,
bkp = NULL,
regData = NULL,
regions = NULL,
regAnnot = NULL,
minLength = 0,
regionSize = 0,
connex = TRUE
)
Arguments
length |
length of the profile |
nBkp |
number of breakpoints. If |
bkp |
a numeric vector of breakpoint positions that may be used to
bypass the breakpoint generation step. Defaults to |
regData |
a data.frame containing copy number data for different types of copy number regions. Columns:
|
regions |
a character vector of region labels that may be used to
bypass the region label generation step. Defaults to |
regAnnot |
a data.frame containing annotation data for each copy number region. Columns:
If |
minLength |
minimum length of region between breakpoints. Defaults to 0. |
regionSize |
If |
connex |
If |
Details
This function generates a random copy number profile of length 'length',
with 'nBkp' breakpoints randomly chosen. Between two breakpoints, the
profile is constant and taken among the different types of regions in
regData
.
Elements of regData[["region"]]
must be of the form "(C1,C2)"
,
where C1
denotes the minor copy number and C2
denotes the
major copy number. For example,
- (1,1)
Normal
- (0,1)
Hemizygous deletion
- (0,0)
Homozygous deletion
- (1,2)
Single copy gain
- (0,2)
Copy-neutral LOH
- (2,2)
Balanced two-copy gain
- (1,3)
Unbalanced two-copy gain
- (0,3)
Single-copy gain with LOH
If 'connex' is set to TRUE (the default), transitions between copy number regions are constrained in such a way that for any breakpoint, one of the minor and the major copy number does not change. Equivalently, this means that all breakpoints can be seen in both total copy numbers and allelic ratios.
Value
A list with elements
- profile
the profile (a
length
by2
data.frame containing the same fields as the input dataregData
.- bkp
a vector of bkp positions (the last row index before a breakpoint)
regions |
a character vector of region labels |
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Pierre-Jean, M, Rigaill, G. J. and Neuvial, P. (2015). "Performance Evaluation of DNA Copy Number Segmentation Methods." *Briefings in Bioinformatics*, no. 4: 600-615.
Examples
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=1)
sim <- getCopyNumberDataByResampling(len=1e4, nBkp=5, minLength=100, regData=affyDat)
plotSeg(sim$profile, sim$bkp)
## another run with identical parameters
bkp <- sim$bkp
regions <- sim$regions
sim2 <- getCopyNumberDataByResampling(len=1e4, bkp=bkp, regData=affyDat, regions=regions)
plotSeg(sim2$profile, bkp)
## change tumor fraction but keep same "truth"
affyDatC <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=0.5)
simC <- getCopyNumberDataByResampling(len=1e4, bkp=bkp, regData=affyDatC, regions=regions)
plotSeg(simC$profile, bkp)
## restrict to only normal, single copy gain, and copy-neutral LOH
## with the same bkp
affyDatR <- subset(affyDat, region %in% c("(1,1)", "(0,2)", "(1,2)"))
simR <- getCopyNumberDataByResampling(len=1e4, bkp=bkp, regData=affyDatR)
plotSeg(simR$profile, bkp)
## Same 'truth', on another dataSet
regions <- simR$regions
illuDat <- acnr::loadCnRegionData(dataSet="GSE11976", tumorFraction=1)
sim <- getCopyNumberDataByResampling(len=1e4, bkp=bkp, regData=illuDat, regions=regions)
plotSeg(sim$profile, sim$bkp)
Calculate the number of true positives and false positives
Description
Calculate the number of true positives and false positives among candidate breakpoints
Usage
getTpFp(candidates, trueBkp, tol, relax = -1)
Arguments
candidates |
Breakpoints found by the methods |
trueBkp |
True breakpoints |
tol |
Tolerance on the position of candidate breakpoints called true |
relax |
Controls the way multiple breapoints within tolerance area are recorded.
|
Value
A list with elements:
- TP
The number of true positives
- FP
The number of false positives
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
Examples
## load known real copy number regions
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=0.7)
## generate a synthetic CN profile
K <- 10
len <- 2e4
sim <- getCopyNumberDataByResampling(len, K, minLength=100, regData=affyDat)
datS <- sim$profile
## (group-)fused Lasso segmentation
res <- PSSeg(data=datS, K=2*K, method="GFLars", stat="c", profile=TRUE)
## results of the initial (group-)fused lasso segmentation
getTpFp(res$initBkp, sim$bkp, tol=10, relax=-1)
getTpFp(res$initBkp, sim$bkp, tol=10, relax=0)
getTpFp(res$initBkp, sim$bkp, tol=10, relax=1)
plotSeg(datS, breakpoints=list(sim$bkp, res$initBkp))
## results after pruning (group-)fused Lasso candidates by dynamic programming)
getTpFp(res$bestBkp, sim$bkp, tol=10, relax=-1)
getTpFp(res$bestBkp, sim$bkp, tol=10, relax=0)
getTpFp(res$bestBkp, sim$bkp, tol=10, relax=1)
plotSeg(datS, breakpoints=list(sim$bkp, res$bestBkp))
Get the contribution of one dimension to the RSE.
Description
Get the contribution of one dimension to the Residual Squared Error (RSE)
Usage
getUnivJ(y, candCP)
Arguments
y |
Input vector of signals for the considered dimension |
candCP |
A vector of candidate change points |
Details
This function is used internally by pruneByDP
.
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
See Also
Get the binary test statistic for one dimension
Description
Get the binary test statistic for one dimension
Usage
getUnivStat(y, weightFUN = defaultWeights)
Arguments
y |
Input vector of signals for the considered dimension |
weightFUN |
A |
Details
This function is used internally by anotherBkp
.
Value
A vector of length length(y)-1
, the binary test statistic for one dimension
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
See Also
Joint segmentation of multivariate signals
Description
Joint segmentation of multivariate signals in two steps:
first-pass segmentation. By default, a fast, greedy approach is used (see
method
).pruning of the candidate change points obtained by dynamic programming
Usage
jointSeg(
Y,
method = "RBS",
stat = NULL,
dpStat = stat,
segFUN = NULL,
jitter = NULL,
modelSelectionMethod = ifelse(match(method, c("DynamicProgramming", "RBS", "GFLars"),
nomatch = 0) > 0, "Lebarbier", "none"),
modelSelectionOnDP = (match(method, c("DynamicProgramming", "RBS", "GFLars"), nomatch
= 0) > 0),
...,
profile = FALSE,
verbose = FALSE
)
Arguments
Y |
The signal to be segmented (a matrix or a numeric vector) |
method |
A
|
stat |
A vector containing the names or indices of the columns of
|
dpStat |
A vector containing the names or indices of the columns of
|
segFUN |
The segmentation function to be used when |
jitter |
Uncertainty on breakpoint position after initial segmentation.
Defaults to |
modelSelectionMethod |
A character value, the name of the method used to perform model selection. |
modelSelectionOnDP |
A logical value. If |
... |
Further arguments to be passed to the lower-level segmentation
method determined by argument |
profile |
A |
verbose |
A |
Details
If modelSelectionOnDP
is set to FALSE
, then model selection is
run on the sets of the form bkp[1:k]
for 1 \leq k \leq
length(bkp)
, where bkp
is the set of breakpoints identified by the
initial segmentation. In particular, this implies that the candidate
breakpoints in bkp
are sorted by order of appearance and not by
position.
If jitter
is not NULL
, it should be a vector of integer
indices. The set of candidate breakpoints passed on to dynamic programming is
augmented by all indices distant from an element of jitter
from one of
the candidates. For example, if jitter==c(-1, 0, 1)
and the initial
set of breakpoints is c(1,5)
then dynamic programming is run on
c(1,2,4,5,6)
.
If the return value of the initial segmentation has an element named
dpseg
, then initial segmentation results are not pruned by dynamic
programming.
References
Bleakley, K., & Vert, J. P. (2011). The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199.
Vert, J. P., & Bleakley, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems, 23, 2343-2351.
Gey, S., & Lebarbier, E. (2008). Using CART to Detect Multiple Change Points in the Mean for Large Sample. http://hal.archives-ouvertes.fr/hal-00327146/
Rigaill, G. (2010). Pruned dynamic programming for optimal multiple change-point detection. arXiv preprint arXiv:1004.0887.
See Also
Examples
## A two-dimensional signal
p <- 2
trueK <- 10
len <- 1e4
sim <- randomProfile(len, trueK, 1, p)
Y <- sim$profile
K <- 2*trueK
res <- jointSeg(Y, method="RBS", K=K)
bkp <- res$bestBkp
getTpFp(bkp, sim$bkp, tol=5, relax = -1) ## true and false positives
plotSeg(Y, list(sim$bkp, res$bestBkp), col=1)
## Now we add some NA:s in one dimension
jj <- sim$bkp[1]
Y[jj-seq(-10,10), p] <- NA
res2 <- jointSeg(Y, method="RBS", K=K, verbose=TRUE)
bkp <- res2$bestBkp
getTpFp(res2$bestBkp, sim$bkp, tol=5, relax = -1) ## true and false positives
leftMultiplyByInvXAtXA
Description
Compute r = inv(X(:,ind)'*X(:,ind))*b , where X is the n*(n-1) design matrix for the weighted group fused lasso.
Usage
leftMultiplyByInvXAtXA(n, ind, val, w = defaultWeights(n), verbose = FALSE)
Arguments
n |
The size of X is n*(n-1) |
ind |
a*1 vector of indices between 1 and n-1, sorted in increasing order |
val |
a*p matrix |
w |
(n-1)*1 vector of weights |
verbose |
A |
Details
This implementation is derived from the MATLAB code of Vert and Bleakley:
Value
- r
the (n-1)*p matrix equal to X'*Y
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bleakley, K., & Vert, J. P. (2011). The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199.
Vert, J. P., & Bleakley, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems, 23, 2343-2351.
Examples
val <- matrix(c(1.56, 1.35, 1.26, 1.15), ncol=2)
ind <- c(5,6)
n <- 10
res <- leftMultiplyByInvXAtXA(n=n, ind=ind, val=val)
res
## [,1] [,2]
## [1,] 1.373189 0.9630868
## [2,] 0.228796 0.3636429
leftMultiplyByXt
Description
Compute X'*Y where X is the n*(n-1) design matrix for the weighted group fused Lasso, with weights defined by the vector w, and Y is any n*p matrix. The computation is done in O(np).
Usage
leftMultiplyByXt(Y, w = defaultWeights(nrow(Y)), verbose = FALSE)
Arguments
Y |
A n*p matrix |
w |
(n-1)*1 vector of weights |
verbose |
A |
Details
This implementation is derived from the MATLAB code of Vert and Bleakley:
Contrary to getUnivStat
it does not handle missing values.
Value
The (n-1)*p matrix equal to X'*Y
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bleakley, K., & Vert, J. P. (2011). The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199.\ Vert, J. P., & Bleakley, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems, 23, 2343-2351.
Examples
Y <- matrix(rnorm(20), ncol=2)
C <- leftMultiplyByXt(Y)
mapPositionsBack
Description
Map breakpoint positions back to original space
Usage
mapPositionsBack(pos)
Arguments
pos |
A sorted list of position (or indices) |
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
Model selection
Description
Select the best number of breakpoints
Usage
modelSelection(
rse,
n,
c = 2.5,
lambdas = NULL,
method = c("Birge", "Lebarbier")
)
Arguments
rse |
RSE as output by |
n |
Length of the profile |
c |
Parameter for the model selection |
lambdas |
A list of candidate values for the calibration of the penalty |
method |
Method to calibrate the constant in the penalty for model selection |
Details
This function is not intended to be called directly, but implicitly through
jointSeg
or PSSeg
.
Value
A list with elements
- kbest
the best number of breakpoints
- lambda
A numerical value, the result of an internal model selection function
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Lebarbier, E. (2005). Detecting multiple change-points in the mean of Gaussian process by model selection. Signal processing, 85(4), 717-736
Birg\'e, L. (2001). Gaussian model selection. J.Eur Math. Soc, 3(3):203-268
See Also
Examples
## load known real copy number regions
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=1)
sim <- getCopyNumberDataByResampling(1e4, 5, minLength=100, regData=affyDat)
Y <- as.matrix(sim$profile[, "c"])
## Find candidate breakpoints
K <- 50
resRBS <- segmentByRBS(Y, K=K)
## Prune candidate breakpoints
resDP <- pruneByDP(Y, candCP=resRBS$bkp)
selectedModel <- modelSelection(rse=resDP$rse, n=nrow(Y), method="Lebarbier")
str(selectedModel)
## breakpoints of the best model
bestBkp <- resDP$bkp[[selectedModel$kbest]]
print(bestBkp)
## truth
print(sim$bkp)
## Note that all of the above can be done directly using 'PSSeg'
res <- PSSeg(sim$profile, method="RBS", stat="c", K=K)
## stopifnot(identical(res$bestBkp, bestBkp))
multiplyXtXBySparse
Description
Compute C = t(X)*X*val
, where val
is a row-sparse
(n-1)*p
matrix and X
is the n*(n-1)
design matrix for
the weighted group fused lasso.
Usage
multiplyXtXBySparse(n, ind, val, w = defaultWeights(n), verbose = FALSE)
Arguments
n |
Size of the problem |
ind |
a*1 vector of indices of the non-zero rows of b (each in [1,n-1]) |
val |
a*p matrix whose rows are the non-zero rows of b (same order as ind) |
w |
(n-1)*1 vector of weights |
verbose |
A |
Details
This implementation is derived from the MATLAB code of Vert and Bleakley
Value
The (n-1)*p
matrix equal to t(X)*X*val
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bleakley, K., & Vert, J. P. (2011). The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199. https://arxiv.org/abs/1106.4199
Vert, J. P., & Bleakley, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems, 23, 2343-2351.
Examples
val <- matrix(c(1.56, 1.35, 1.26, 1.15), ncol=2)
ind <- c(5,6)
n <- 10
res <- multiplyXtXBySparse(n=n, ind=ind, val=val)
res
## [,1] [,2]
## [1,] 0.8874235 0.7329904
## [2,] 1.3311352 1.0994855
## [3,] 1.7428651 1.4395645
## [4,] 2.1737347 1.7954524
## [5,] 2.6622704 2.1989711
## [6,] 2.6237347 2.1787857
## [7,] 2.1036678 1.7469149
## [8,] 1.6067028 1.3342283
## [9,] 1.0711352 0.8894855
Get best candidate change point
Description
Get best candidate change point according to binary segmentation
Usage
oneBkp(Y, weights = NULL, verbose = FALSE)
Arguments
Y |
A |
weights |
a |
verbose |
A |
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
Examples
p <- 2
sim <- randomProfile(1e4, 1, 1, p)
Y <- sim$profile
bkp <- jointseg:::oneBkp(Y)
par(mfrow=c(p,1))
for (ii in 1:p) {
plot(Y[, ii], pch=19, cex=0.2)
abline(v=bkp, col=3)
abline(v=sim$bkp, col=8, lty=2)
}
Plot signal and breakpoints with segment-level signal estimates
Description
Plot signal and breakpoints with segment-level signal estimates
Usage
plotSeg(
dat,
breakpoints = NULL,
regNames = NULL,
exclNames = c("genotype", "region", "bT", "bN", "cellularity"),
ylabs = colnames(dat),
ylims = NULL,
binExclPattern = "^b[N|T]*$",
col = "#33333333",
pch = 19,
cex = 0.3
)
Arguments
dat |
A |
breakpoints |
A vector of breakpoints positions, or a |
regNames |
Region labels, a vector of length
|
exclNames |
A vector of column names corresponding to columns that should not be plotted. |
ylabs |
A vector of 'y' labels (column names or indices) that should be plotted. |
ylims |
An optional |
binExclPattern |
A vector of column names or indices in
|
col |
Color of plotting symbol, see |
pch |
Plotting symbol, see |
cex |
Magnification factor for plotting symbol, see |
Details
Argument 'binCols' is mainly used to avoid calculating mean levels for allelic ratios, which would not make sense as they are typically multimodal.
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
Examples
affyDat <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=1)
sim <- getCopyNumberDataByResampling(1e4, 5, minLength=100, regData=affyDat)
dat <- sim$profile
res <- PSSeg(dat, method="RBS", stat=c("c", "d"), K=50)
bkpList <- list(true=sim$bkp, est=res$bestSeg)
plotSeg(dat, breakpoints=bkpList)
profile time and memory usage of a given R expression
Description
profile time and memory usage of a given R expression
Usage
prof(expr, doit = TRUE)
Arguments
expr |
An |
doit |
A boolean variable specifying whether profiling should be performed or not (intended for internal use). |
Details
Profiling is performed using summaryRprof(memory="both")$by.self
.
Note
Our memory profiling is not satistfactory yet!
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
See Also
Exact segmentation of a multivariate signal using dynamic programming.
Description
Exact segmentation of a multivariate signal using dynamic programming.
Usage
pruneByDP(
Y,
candCP = 1:(nrow(Y) - 1),
K = length(candCP),
allowNA = TRUE,
verbose = FALSE
)
Arguments
Y |
A |
candCP |
A vector of candidate change point positions (defaults to 1:(n-1)) |
K |
The maximum number of change points to find |
allowNA |
A boolean value specifying whether missing values should be allowed or not. |
verbose |
A |
Details
This function retrieves the maximum likelihood solution of the gaussian
homoscedastic change model into segments, for K \in {1 \dots
length(candCP)}
. The dynamic programming algorithm used is quadratic in
time. For signals containing more than 1000 points, we recommend using a
first pass segmentation (see segmentByRBS
) to find a smaller
number of candidates, and to run pruneByDP
on these candidates only,
as initially suggested by Gey and Lebarbier (2008). These two steps can be
performed using jointSeg
for generic multivariate signals, and
using PSSeg
for copy number signals from SNP array data.
if allowNA
, the calulation of the cost of removing candidate
breakpoints between i and j for i<j tolerates missing values. Method
!allowNA
is maintained in order to check consistency with the
original dynamic programming in the absence of NA:s.
Value
A list with elements:
bkpList |
A list of vectors of change point positions for the best model with k change points, for k=1, 2, ... K |
rse |
A vector of K+1 residual squared errors |
V |
V[i,j] is the best RSE for segmenting intervals 1 to j |
Note
This implementation is derived from the MATLAB code by Vert and Bleakley:
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bellman, R. (1961). On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6), 284.
Gey, S., & Lebarbier, E. (2008). Using CART to Detect Multiple Change Points in the Mean for Large Sample. http://hal.archives-ouvertes.fr/hal-00327146/
See Also
Examples
p <- 2
trueK <- 10
sim <- randomProfile(1e4, trueK, 1, p)
Y <- sim$profile
K <- 2*trueK
res <- segmentByRBS(Y, K)
resP <- pruneByDP(Y, res$bkp)
## Note that all of the above can be dmethod=="other"one directly using 'jointSeg'
resJ <- jointSeg(sim$profile, method="RBS", K=K)
stopifnot(identical(resP$bkpList, resJ$dpBkp))
## check consistency when no NA
resP2 <- pruneByDP(Y, res$bkp, allowNA=FALSE)
max(abs(resP$rse-resP2$rse))
plotSeg(Y, list(resP$bkp[[trueK]], sim$bkp), col=1)
Generate a random multi-dimensional profile with breakpoints and noise
Description
Generate a random multi-dimensional profile with breakpoints and noise
Usage
randomProfile(length, nBkp, noiseLevel, dim, minLength = 0)
Arguments
length |
length of the profile |
nBkp |
number of breakpoints |
noiseLevel |
variance of the signal between two breakpoints |
dim |
dimension of the profile |
minLength |
minimum length of region between breakpoints by default minLength = 0 |
Details
Generate a random profile (vector) of length length
, with nBkp
breakpoints randomly chosen. Between two breakpoints, the profile is
constant, uniformly chosen between 0 and 1, and a Gaussian noice of variance
noiseLevel
is added.
Value
a list
with elements
- profile
the profile (a
length
bydim
matrix)- bkp
the list of breakpoints positions (the last position at the left of a breakpoint)
Note
This implementation is derived from the MATLAB code by Vert and Bleakley:.
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
Examples
len <- 1e4
nBkp <- 10
noiseLevel <- 1
dim <- 2
sim <- randomProfile(len, nBkp, noiseLevel, dim)
res <- doGFLars(sim$profile, K=5*nBkp)
str(res)
Extract endpoint matrix from DP result
Description
Extract endpoint matrix from DP result
Usage
retour_sn(path)
Arguments
path |
the path vector of the "colibri_sn_R_c C" function |
Group fused Lars segmentation (low-level)
Description
Low-level function for multivariate fused Lars segmentation (GFLars)
Usage
segmentByGFLars(
Y,
K,
weights = defaultWeights(nrow(Y)),
epsilon = 1e-09,
verbose = FALSE
)
Arguments
Y |
A |
K |
The number of change points to find |
weights |
A |
epsilon |
Values smaller than epsilon are considered null. Defaults to
|
verbose |
A |
Details
This function recrusively looks for the best candidate change point
according to group-fused LARS. This is a low-level function. It is generally
advised to use the wrapper doGFLars
which also works on data
frames, has a convenient argument stat
, and includes a basic
workaround for handling missing values.
See also jointSeg
for combining group fused LARS segmentation
with pruning by dynamic programming (pruneByDP
).
See PSSeg
for segmenting genomic signals from SNP arrays.
The default weights \sqrt{n/(i*(n-i))}
are calibrated as suggested by
Bleakley and Vert (2011). Using this calibration, the first breakpoint
maximizes the likelihood ratio test (LRT) statistic.
Value
A list with elements:
- bkp
A vector of
k
candidate change-point positions- lambda
The estimated lambda values for each change-point
- mean
A vector of length
p
, the mean signal per column- value
A
i x p
matrix of change-point values for the first i change-points- c
\hat{c}
, an-1 x K
matrix
Note
This implementation is derived from the MATLAB code by Vert and Bleakley.
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Bleakley, K., & Vert, J. P. (2011). The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199.
Vert, J. P., & Bleakley, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems, 23, 2343-2351.
See Also
PSSeg
, jointSeg
,
doGFLars
, pruneByDP
Examples
p <- 2
trueK <- 10
sim <- randomProfile(1e4, trueK, 1, p)
Y <- sim$profile
K <- 2*trueK
res <- segmentByGFLars(Y, K)
print(res$bkp)
print(sim$bkp)
plotSeg(Y, res$bkp)
Recursive Binary Segmentation (low-level)
Description
Low-level function for multivariate Recursive Binary Segmentation (RBS)
Usage
segmentByRBS(Y, K, minRegionSize = 2, verbose = FALSE)
Arguments
Y |
A |
K |
The number of change points to find |
minRegionSize |
Regions with less than |
verbose |
A |
Details
This function recrusively looks for the best candidate change point
according to binary segmentation. This is the low-level function. It is
generally advised to use the wrapper doRBS
which also works on
data frames and has a convenient argument stat
.
See jointSeg
for combining recursive binary segmentation with
pruning by dynamic programming (pruneByDP
).
See PSSeg
for segmenting genomic signals from SNP arrays.
Each dimension of the original signal is scaled before segmentation, using
estimateSd
.
Value
A list with elements:
bkp |
A |
rse |
the residual squared error (RSE) for the successive segmentations |
gain |
The gain provided by each breakpoints in terms of difference between RSE |
Author(s)
Morgane Pierre-Jean and Pierre Neuvial
References
Gey, S., & Lebarbier, E. (2008). Using CART to Detect Multiple Change Points in the Mean for Large Sample. http://hal.archives-ouvertes.fr/hal-00327146/
See Also
PSSeg
, jointSeg
, doRBS
,
pruneByDP
Examples
p <- 2
trueK <- 10
len <- 1e4
sim <- randomProfile(len, trueK, 1, p)
Y <- sim$profile
K <- 2*trueK
res <- segmentByRBS(Y, K)
getTpFp(res$bkp, sim$bkp, tol=10, relax = -1) ## true and false positives
cols <- rep(2, K)
cols[1:trueK] <- 3
par(mfrow=c(p,1))
for (ii in 1:p) {
plot(Y[, ii], pch=19, cex=0.2)
abline(v=res$bkp[1:trueK], col= cols)
abline(v=sim$bkp, col=8, lty=2)
}
## NA:s in one dimension at a true breakpoint
jj <- sim$bkp[1]
Y[jj-seq(-10, 10), p] <- NA
res2 <- segmentByRBS(Y, K)
getTpFp(res2$bkp, sim$bkp, tol=10, relax = -1) ## true and false positives
## NA:s in both dimensions at a true breakpoint
Y[jj-seq(-10, 10), ] <- NA
res3 <- segmentByRBS(Y, K)
getTpFp(res3$bkp, sim$bkp, tol=10, relax = -1) ## true and false positives