whisper 0.4.0
- New
serve(): a single-process, OpenAI-compatible HTTP
STT server (POST /v1/audio/transcriptions and
/translations, GET /health) built on base R
sockets, with no new dependencies. It loads the model once and keeps it
resident, so it drops in for the OpenAI API or a Whisper container;
point stt.api at it with set_stt_base().
Returns text, json, or
verbose_json (segment timestamps, plus per-word timestamps
when the request includes timestamp_granularities[]=word).
An example systemd unit ships in
system.file("whisper.service", package = "whisper").
- JIT decoding on CUDA: each generated token’s decoder forward runs as
one
jit_compile’d TorchScript call instead of dozens of
dispatched R->torch calls, several times faster end-to-end and
token-for-token equivalent to the eager path. Covers both greedy and
word-timestamp decoding. On by default via the new jit
argument to transcribe()/whisper_pipeline();
pass jit = FALSE for the eager decoder. No effect on CPU or
beam search.
- Silence handling now matches the reference Whisper, fixing
transcripts that ran past the end of short audio. Three changes, ported
from
openai-whisper: decoding suppresses non-speech tokens
(brackets, music notes, speaker tags) and control tokens at every step,
so output no longer contains
[BLANK_AUDIO]/[MUSIC PLAYING]-style
annotations; the seek loop decodes only the real audio
(content_frames), not the fixed 30s of mel padding, so a 7s
clip no longer trails off into hallucinated text up to 30s; and a
no-speech-probability gate skips windows that read as silence. The
special-token table gains sot_lm and
sot_prev.
- Bound and mitigate degenerate repetition loops, matching the
reference. A long non-speech sound (e.g. a laugh) could make the decoder
emit one token (“ha”) hundreds of times - garbage output, and enough
accumulated cross-attention to exhaust memory on a small GPU. Decoding
is now capped at half the text context (the reference’s
sample_len) rather than the full context, and the default
temperatures enable the existing compression-ratio
fallback, which re-decodes too-repetitive output at a higher
temperature.
- Fix
tokenizer_encode() crashing for models whose
vocab.json omits the <|endoftext|> key
(large-v3): the end-of-text id now comes from the special-token table
(as in the Python reference, which keeps special tokens out of the BPE
vocab), and the lookup can no longer return a list. A regression test
covers a vocab without the key, and encode_special()
resolves the core special tokens from the table too.
whisper_dtype() now falls back to float32 on the GTX
16-series (TU116/TU117: GTX 1630/1650/1660 and Ti/Super variants), which
compute fp16 incorrectly and return NaN (seen as repeated “!” tokens).
Detection is by GPU name, CUDA-gated and tryCatch-guarded (dormant on
non-CUDA/CRAN machines); pass dtype = "float16" to
override.
- New
whisper_tune_gc(): opt-in helper that tunes torch’s
CUDA allocator GC rates for inference. No-op off CUDA, and only sets
options that are unset.
- Scaled dot-product attention now calls the exported
torch::torch_scaled_dot_product_attention() instead of
reaching into torch’s namespace; the torch dependency is floored at
0.17.0, where it is exported.
- README performance table refreshed for the JIT word-timestamp
path.
whisper 0.3.0
- Language auto-detection:
transcribe() now defaults to
language = NULL, which detects the spoken language from the
audio before decoding. New exported function
detect_language() for standalone language identification.
Breaking: previous default was
language = "en". Code relying on the default now
auto-detects instead of assuming English. Pass
language = "en" explicitly to restore old behavior.
- Segment-level and word-level timestamps via DTW alignment
- Beam search decoding with temperature sampling and fallback
- SDPA attention (FlashAttention on GPU)
whisper_pipeline() for cached model reuse across
multiple transcriptions
- Hardcoded special token table (eliminates
added_tokens.json download)
- Fixed invalid multibyte string crash in BPE decoder
- Fixed DTW boundary guards and seek loop in
transcribe_chunk()
whisper 0.1.0
- Initial CRAN submission
- Native R torch implementation of OpenAI Whisper
- Support for all model sizes: tiny, base, small, medium,
large-v3
- Automatic model download from HuggingFace
- Model-specific special token handling for large-v3
compatibility
- KV caching for efficient autoregressive decoding
- Long audio chunking for files longer than 30 seconds
- Optional timestamp and segment extraction