---
title: "Genome-wide Melting Temperature Profiling: An E. coli Case Study"
author: "Junhui Li, Lihua Julie Zhu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genome-wide Melting Temperature Profiling: An E. coli Case Study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo      = TRUE,
  message   = FALSE,
  warning   = FALSE,
  fig.align = "center",
  fig.width = 7,
  fig.height = 6
)
```

# Introduction

This vignette demonstrates how to compute melting temperature (Tm) across an
entire bacterial genome and visualise the results using **TmCalculator**. We
use *Escherichia coli* K-12 MG1655 (NCBI assembly GCF\_000005845.2 / ASM584v2)
as the reference organism, calculating Tm for every non-overlapping 200 bp
window along the single circular chromosome.

The workflow covers five steps:

1. **Build a BSgenome data package** from the NCBI assembly accession.
2. **Generate non-overlapping genomic windows** with `make_genomiccoord()`.
3. **Compute Tm and GC content** for all windows with `tm_calculate()`.
4. **Visualise** the genome-wide Tm profile with `plot_genome_track()` —
   circular plots, linear plots, zoom views, and multi-omics overlays.
5. **Statistical testing** — compare Tm inside MutL-AR peaks versus the rest
   of the genome with `compare_groups()`.

---

# Prerequisites

## R packages

```{r packages}
library(TmCalculator)
library(BSgenome)
library(GenomicRanges)
```

## Step 1 — Building the *E. coli* BSgenome data package {#build-bsgenome}

TmCalculator retrieves sequences from BSgenome data packages. Because an
*E. coli* BSgenome package is not on Bioconductor, we forge one locally from
the NCBI RefSeq assembly using **BSgenomeForge**. This step only needs to be
run once.

```{r forge, eval=FALSE}
library(BSgenomeForge)

forgeBSgenomeDataPkgFromNCBI(
  assembly_accession = "GCF_000005845.2",
  pkg_maintainer     = "Junhui Li <ljh.biostat@gmail.com>",
  destdir            = "."
)

install.packages(
  "./BSgenome.Ecoli.NCBI.ASM584v2",
  repos = NULL,
  type  = "source"
)
```

The GitHub repo ships package metadata only; sequence data must be forged once.
The chunk below installs the package (if needed), forges it from NCBI when
the `Ecoli` genome object is still missing, then loads it. **Knit from the top**
so this chunk runs before `load-genome`.

```{r setup-ecoli-bsgenome, echo=FALSE, message=FALSE, warning=FALSE}
ecoli_pkg   <- "BSgenome.Ecoli.NCBI.ASM584v2"
genome_obj  <- "Ecoli"   # BSgenomeObjname in DESCRIPTION; not the package name

.ecoli_genome_ready <- function() {
  if (!requireNamespace(ecoli_pkg, quietly = TRUE)) return(FALSE)
  exists(genome_obj, envir = asNamespace(ecoli_pkg), inherits = FALSE)
}

if (!requireNamespace(ecoli_pkg, quietly = TRUE)) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    utils::install.packages("remotes", repos = "https://cloud.r-project.org")
  }
  remotes::install_github(
    "JunhuiLi1017/BSgenome.Ecoli.NCBI.ASM584v2",
    upgrade = "never",
    quiet = TRUE
  )
}

if (!.ecoli_genome_ready()) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    utils::install.packages("BiocManager", repos = "https://cloud.r-project.org")
  }
  if (!requireNamespace("BSgenomeForge", quietly = TRUE)) {
    BiocManager::install("BSgenomeForge", ask = FALSE, update = FALSE)
  }
  pkgdir <- BSgenomeForge::forgeBSgenomeDataPkgFromNCBI(
    assembly_accession = "GCF_000005845.2",
    pkg_maintainer     = "Junhui Li <ljh.biostat@gmail.com>",
    destdir            = tempdir()
  )
  utils::install.packages(pkgdir, repos = NULL, type = "source", quiet = TRUE)
  if (ecoli_pkg %in% loadedNamespaces()) {
    unloadNamespace(ecoli_pkg)
  }
}

if (!.ecoli_genome_ready()) {
  stop(
    "Could not load genome object '", genome_obj, "' from package '", ecoli_pkg, "'.\n",
    "Run the manual 'forge' chunk above, or forgeBSgenomeDataPkgFromNCBI() locally.",
    call. = FALSE
  )
}
```

Load the genome (object `Ecoli`; package `BSgenome.Ecoli.NCBI.ASM584v2` for coordinates):

```{r load-genome}
ecoli_pkg  <- "BSgenome.Ecoli.NCBI.ASM584v2"
genome_obj <- "Ecoli"

suppressPackageStartupMessages(library(ecoli_pkg, character.only = TRUE))
genome      <- base::get(genome_obj, envir = asNamespace(ecoli_pkg))
genome_name <- ecoli_pkg
chr_name    <- "U00096.3"
chr_length  <- length(genome[[chr_name]])

cat("Chromosome:", chr_name, "\n")
cat("Length:    ", format(chr_length, big.mark = ","), "bp\n")
```

---

# Step 2 — Generate Genomic Windows

`make_genomiccoord()` tiles the chromosome into non-overlapping 200 bp bins.

```{r make-coord}
bins_gc <- make_genomiccoord(
  bsgenome    = genome_name,
  chromosomes = chr_name,
  window      = 200L,
  slide       = 200L,
  start       = 1,
  end         = chr_length,
  strand      = "+"
)

cat("Total windows:", length(bins_gc), "\n")
```

Resolve coordinates against the BSgenome package:

```{r coor-to-gr}
input_new <- list(pkg_name = genome_name, seq = bins_gc)
runtime1 <- system.time({
  gr_batch <- to_genomic_ranges_fast(input_new)
})

cat(sprintf(
  "Coordinate resolution: %.2f s (elapsed)\n",
  runtime1["elapsed"]
))
```

---

# Step 3 — Compute Genome-wide Tm

We use the nearest-neighbour method (SantaLucia & Hicks 2004) at 50 mM Na+.

```{r tm-calculate}
runtime2 <- system.time({
  tm_ASM584v2 <- tm_calculate(
    gr_batch,
    method   = "tm_nn",
    nn_table = "DNA_NN_SantaLucia_2004",
    Na       = 50            # mM; standard PCR-like conditions
  )
})

cat(sprintf(
  "Tm calculation: %.2f s (elapsed) for %s windows\n",
  runtime2["elapsed"],
  format(length(bins_gc), big.mark = ",")
))

Tm <- as.data.frame(tm_ASM584v2$gr[, c("Tm", "GC")])
summary(Tm[, c("Tm", "GC")])
```

---

# Step 4 — Visualise with `plot_genome_track()`

`plot_genome_track()` is the unified plotting function in TmCalculator. It
supports both **linear** (karyoploteR-based) and **circular** (base R graphics)
layouts from the same track list, with features including ideogram tracks,
per-track highlights, proportional track heights, multi-region zoom, and
customisable legends.

## Assemble the track list

We overlay the Tm/GC profile with four independently measured genomic datasets:

| Track | Source |
|-------|--------|
| MutL-AR | MutL protein ChIP-seq peaks (mismatch-repair activity) |
| Microsatellites | Tandem-repeat density per 200 bp bin |
| Cruciform | Cruciform-forming sequence density per 200 bp bin |
| ssDNA | Single-stranded DNA regions |
| GATC sites | GATC methylation-site density per 200 bp bin |

```{r track-list}
# Reference labels: replication origin (ori) and terminus (dif)
label <- data.frame(
  seqnames = genome_name,
  start    = c(3925804, 1590777),
  end      = c(3925804, 1590777),
  label    = c("ori", "dif")
)
data(ecoli_rep_hotspots)
tracks <- list(
  # Ideogram: MutL-AR peaks shown inside the chromosome bar
  list(type = "rect", data = ecoli_rep_hotspots$all_peaks_IP_mutH,
       col = "#2C3E50", bg.col = "grey", name = "MutL-AR",
       legend_font_col = "#2C3E50", ideogram = TRUE, height = 0.5),

  # Sequence thermodynamics
  list(type = "line", data = Tm, value_col = "GC",
       name = "GC content", col = "#4A90E2",
       legend_font_col = "#4A90E2"),
  list(type = "line", data = Tm, value_col = "Tm",
       name = "Melting temp", col = "#E06666",
       legend_font_col = "#E06666", height = 2),

  # Repeat / structural features
  list(type = "line", data = ecoli_rep_hotspots$bins_rep,
       value_col = "count", name = "Microsatellites", col = "#2ECC71",
       legend_font_col = "#2ECC71"),
  list(type = "line", data = ecoli_rep_hotspots$bins_cru,
       value_col = "count", name = "Cruciform", col = "#3B3E6B",
       legend_font_col = "#3B3E6B"),

  # ssDNA regions
  list(data = ecoli_rep_hotspots$ssdna, name = "ssDNA",
       col = "#8E44AD", legend_font_col = "#8E44AD"),

  # GATC methylation sites
  list(type = "line", data = ecoli_rep_hotspots$bins_gatc,
       value_col = "count", name = "GATC sites", col = "#D35400",
       legend_font_col = "#D35400"),

  # Global highlight: translucent bands at MutL-AR peaks across all tracks
  list(type = "highlight", data = ecoli_rep_hotspots$all_peaks_IP_mutH,
       col = "#F1C40F", alpha = 0.18)
)
```

## Circular genome map

```{r circular-full, fig.width=8, fig.height=8, fig.cap="Circular genome map of E. coli K-12 MG1655. Concentric rings from outside in: MutL-AR peaks (grey ideogram), GC content, melting temperature, microsatellite density, cruciform sequences, ssDNA regions, and GATC site density. Yellow highlight bands mark MutL-AR peak regions."}
plot_genome_track(
  genome_name = genome_name,
  genome_size = chr_length,
  track_list  = tracks,
  circular    = TRUE,
  label       = label
)
```

## Linear genome map

The same `tracks` list works for a linear karyoploteR layout — simply omit
`circular = TRUE`. The ideogram track is drawn inside the chromosome bar;
all other tracks are stacked as horizontal panels.

```{r linear-full, fig.width=10, fig.height=5, fig.cap="Linear genome view of E. coli K-12 MG1655. MutL-AR peaks are drawn inside the chromosome ideogram bar."}
plot_genome_track(
  genome_name = genome_name,
  genome_size = chr_length,
  track_list  = tracks
)
```

## Zoom — single region

The `zoom` parameter accepts a character string (e.g. `"chr:start-end"`) or a
GRanges object. In linear mode, the karyoploteR view is restricted to that
region; in circular mode, only data overlapping the region is drawn.

```{r zoom-single, fig.width=10, fig.height=5, fig.cap="Zoomed linear view of the 1.0-2.0 Mb region."}
plot_genome_track(
  genome_name = genome_name,
  genome_size = chr_length,
  track_list  = tracks,
  zoom        = "U00096.3:1000000-2000000"
)
```

## Zoom — multiple regions

Pass a character vector to `zoom` to view several disjoint regions at once.
In linear mode each region is drawn as a separate stacked panel; in circular
mode the regions are concatenated around the circle with small gaps between
them.

```{r zoom-multi-linear, fig.width=10, fig.height=8, fig.cap="Two zoomed regions displayed as stacked linear panels."}
plot_genome_track(
  genome_name = genome_name,
  genome_size = chr_length,
  track_list  = tracks,
  zoom        = c("U00096.3:1000000-1200000",
                   "U00096.3:3000000-3200000")
)
```

```{r zoom-multi-circular, fig.width=8, fig.height=8, fig.cap="Two zoomed regions concatenated on a circular plot. Dashed lines separate the regions."}
plot_genome_track(
  genome_name = genome_name,
  genome_size = chr_length,
  track_list  = tracks,
  circular    = TRUE,
  zoom        = c("U00096.3:1000000-1200000",
                   "U00096.3:3000000-3200000")
)
```

## Circular canvas panning

Use `canvas.xlim` and `canvas.ylim` to pan and magnify a portion of the
circular plot. The `circle.margin` parameter controls whitespace around the
plot.

```{r circular-pan, fig.width=7, fig.height=7, fig.cap="Panned circular view showing the upper-right quadrant of the E. coli chromosome."}
plot_genome_track(
  genome_name   = genome_name,
  genome_size   = chr_length,
  track_list    = tracks,
  circular      = TRUE,
  canvas.xlim   = c(0.5, 1),
  canvas.ylim   = c(0,   1),
  circle.margin = c(0.05, 0.05)
)
```

## Per-track highlights

Individual tracks can carry a `highlight` field to draw coloured bands within
that track only. This is useful for marking specific regions of interest on a
per-track basis:

```{r per-track-highlight, fig.width=8, fig.height=8, fig.cap="Circular plot with per-track highlight bands on the Microsatellites track."}
tracks_hl <- tracks
tracks_hl[[4]]$highlight <- list(
  data  = ecoli_rep_hotspots$bins_rep[1100:1200, ],
  col   = "black",
  alpha = 0.12
)

plot_genome_track(
  genome_name = genome_name,
  genome_size = chr_length,
  track_list  = tracks_hl,
  circular    = TRUE
)
```

---

# Step 5 — Statistical Testing

We test whether Tm values inside MutL-AR peaks differ significantly from the
rest of the genome.

### Build MutL-AR peak GRanges

```{r build-mutH-peaks}
mutH_peaks <- GRanges(
  seqnames = ecoli_rep_hotspots$all_peaks_IP_mutH$chr,
  ranges   = IRanges(start = ecoli_rep_hotspots$all_peaks_IP_mutH$start,
                     end   = ecoli_rep_hotspots$all_peaks_IP_mutH$end)
)
seqlevels(mutH_peaks) <- "U00096.3"
```

### Annotate Tm tiles with peak membership

```{r annotate-mutH-peaks}
mutH_peaks$peak_id <- paste0("mutH_", seq_along(mutH_peaks))

tm_annot <- integrate_granges(
  gr_tm          = tm_ASM584v2$gr,
  gr_features    = mutH_peaks,
  strategy       = "overlap",
  feature_cols   = "peak_id",
  keep_unmatched = TRUE
)

tm_annot$in_mutH <- ifelse(is.na(tm_annot$peak_id), "non_peak", "peak")
table(tm_annot$in_mutH)
```

### Wilcoxon test on Tm and GC

```{r wilcoxon}
res <- compare_groups(
  gr          = tm_annot,
  target      = c("Tm", "GC"),
  method      = "wilcoxon",
  group       = "in_mutH",
  alternative = "greater",
  posthoc     = FALSE
)
res$results
res$summary
```

---

# Interpreting the Results

MutL-AR-associated windows have lower Tm and higher GC content than the rest
of the genome (p < 0.001 for both). MutL-AR peaks cluster near the
replication terminus, where replication forks converge and mismatch density is
highest. This spatial coincidence with locally elevated microsatellite density
and cruciform-forming sequences is consistent with the replication stress
model of MMR recruitment.

GATC methylation sites are distributed genome-wide but show local density
fluctuations that partially anti-correlate with GC content, reflecting the
sequence context requirements for Dam methyltransferase (5'-GATC-3').

---

# Computational Performance

On a standard desktop computer (single core, R 4.3):

| Step | Function | Time (elapsed) |
|------|----------|---------------|
| Coordinate generation | `make_genomiccoord()` + `coor_to_genomic_ranges()` | ~`r sprintf("%.2f", runtime1["elapsed"])` s |
| Tm calculation (23,208 windows) | `tm_calculate()` | ~`r sprintf("%.2f", runtime2["elapsed"])` s |
| **Total** | | **~`r sprintf("%.2f", runtime1["elapsed"] + runtime2["elapsed"])` s** |

For human-scale analyses (non-overlapping 200 bp windows across the ~3.1 Gb
haploid genome, ~15.5 M windows), we recommend running `tm_calculate()` in
parallelly using `BiocParallel` and splitting input by chromosome. Memory
requirements scale approximately linearly with the number of windows.

---

# Session Information

```{r session-info}
sessionInfo()
```