rsynthbio is an R package that provides a convenient interface to the Synthesize Bio API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities, including bulk RNA-seq and single-cell RNA-seq.
Alternatively, you can generate datasets with AI on our web platform.
You can install rsynthbio from CRAN with install.packages("rsynthbio").
To install the development version from GitHub instead, use the remotes package:
# Install remotes first if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")
Once installed, load the package with library(rsynthbio).
Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:
# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()
# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)
Loading your API token for a session:
# In future sessions, load the stored token
load_synthesize_token_from_keyring()
# Check if a token is already set
has_synthesize_token()
You can obtain an API token by registering at Synthesize Bio.
For security reasons, remember to clear your token when you’re done:
# Clear token from current session
clear_synthesize_token()
# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)
Never hard-code your token in scripts that will be shared or committed to version control.
The package supports multiple data modalities. Currently supported modalities:
bulk: Bulk RNA-seq data
single-cell: Single-cell RNA-seq data
The first step in generating gene expression data is to create a query. The package provides sample queries for each modality:
# Get a sample query for bulk RNA-seq
query <- get_valid_query(modality = "bulk")
# Get a sample query for single-cell RNA-seq
query_sc <- get_valid_query(modality = "single-cell")
# Inspect the query structure
str(query)
The query consists of:
modality: The type of gene expression data to generate (“bulk” or “single-cell”)
mode: The prediction mode (e.g., “mean estimation” or “sample generation”)
inputs: A list of biological conditions to generate data for
We train our models with diverse multi-omics datasets. There are two model modes available today: “mean estimation” and “sample generation”.
Once your query is ready, send it to the API with predict_query(query) to generate gene expression data. The result is a list of two data frames: metadata and expression.
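As a minimal sketch of that step, the snippet below sends a sample query and checks the shape of what comes back. The helper check_result() is hypothetical (it is not part of rsynthbio) and only encodes the description above: a list holding a metadata data frame and an expression data frame. The API call is guarded so that nothing is sent unless the package is installed and a token is set.

```r
# Hypothetical helper (not part of rsynthbio): verify that a result is a
# list containing `metadata` and `expression` data frames, and report sizes.
check_result <- function(result) {
  stopifnot(
    is.list(result),
    all(c("metadata", "expression") %in% names(result)),
    is.data.frame(result$metadata),
    is.data.frame(result$expression)
  )
  c(
    n_metadata_rows = nrow(result$metadata),
    n_expression_rows = nrow(result$expression)
  )
}

# Only contact the API when the package is installed and a token is set.
if (requireNamespace("rsynthbio", quietly = TRUE) &&
    rsynthbio::has_synthesize_token()) {
  query <- rsynthbio::get_valid_query(modality = "bulk")
  result <- rsynthbio::predict_query(query)
  print(check_result(result))
}
```

Keeping the shape check in a separate function makes it easy to reuse on results loaded from disk.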
Behind the scenes, the API uses an asynchronous model to handle queries efficiently. All of this happens automatically when you call predict_query().
You can customize the polling behavior if needed; see ?predict_query for the available arguments.
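To illustrate what polling with an interval and a timeout means in general, here is a small base R sketch. It is not the package's implementation: poll_until_ready() and make_fake_job() are hypothetical names, and the "job" is simulated locally rather than fetched from the API.

```r
# Hypothetical helper: call `check_fn` repeatedly until it returns a
# non-NULL value, waiting `interval_seconds` between attempts and giving
# up once `timeout_seconds` have elapsed.
poll_until_ready <- function(check_fn, interval_seconds = 1, timeout_seconds = 30) {
  deadline <- Sys.time() + timeout_seconds
  repeat {
    value <- check_fn()
    if (!is.null(value)) {
      return(value)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for the job to finish")
    }
    Sys.sleep(interval_seconds)
  }
}

# Simulated job: reports NULL (not ready) until the third status check.
make_fake_job <- function(ready_after = 3) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls >= ready_after) "ready" else NULL
  }
}

status <- poll_until_ready(
  make_fake_job(ready_after = 3),
  interval_seconds = 0.1,
  timeout_seconds = 5
)
print(status)
```

A shorter interval returns results sooner at the cost of more status checks; a longer timeout suits larger requests.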
You can customize the query to fit your specific research needs:
# Adjust number of samples
query$inputs[[1]]$num_samples <- 10
# Add a new condition
query$inputs[[3]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue"
  ),
  num_samples = 3
)
The input metadata is a list of lists.
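The nested structure can be sketched in base R without contacting the API. The helper make_input() below is hypothetical (not part of rsynthbio); the field names and values are taken from the examples in this document.

```r
# Hypothetical helper: build one input entry, i.e. a list holding a
# `metadata` list and the number of samples to generate for it.
make_input <- function(metadata, num_samples) {
  stopifnot(is.list(metadata), is.numeric(num_samples), num_samples >= 1)
  list(metadata = metadata, num_samples = num_samples)
}

# Two conditions, each described by its own metadata list.
inputs <- list(
  make_input(list(sex = "male", sample_type = "primary tissue"), num_samples = 3),
  make_input(list(sex = "female", sample_type = "cell line"), num_samples = 5)
)

# Total number of samples requested across all conditions.
total_samples <- sum(vapply(inputs, function(x) x$num_samples, numeric(1)))

str(inputs)
```

A list built this way could be assigned to query$inputs in place of editing entries one at a time.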
Here are the available metadata fields:
Biological:
age_years
cell_line_ontology_id
cell_type_ontology_id
developmental_stage
disease_ontology_id
ethnicity
genotype
race
sample_type (“cell line”, “organoid”, “other”, “primary cells”, “primary tissue”, “xenograft”)
sex (“male”, “female”)
tissue_ontology_id
Perturbational:
perturbation_dose
perturbation_ontology_id
perturbation_time
perturbation_type (“coculture”, “compound”, “control”, “crispr”, “genetic”, “infection”, “other”, “overexpression”, “peptide or biologic”, “shrna”, “sirna”)
Technical:
study (Bioproject ID)
library_selection (e.g., “cDNA”, “polyA”, “Oligo-dT”; see https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection)
library_layout (“PAIRED”, “SINGLE”)
platform (“illumina”)
The following are the valid values or expected formats for selected metadata keys:
Metadata Field | Requirement / Example |
---|---|
cell_line_ontology_id | Requires a Cellosaurus ID. |
cell_type_ontology_id | Requires a CL ID. |
disease_ontology_id | Requires a MONDO ID. |
perturbation_ontology_id | Must be a valid Ensembl gene ID (e.g., ENSG00000156127), ChEBI ID (e.g., CHEBI:16681), ChEMBL ID (e.g., CHEMBL1234567), or NCBI Taxonomy ID (e.g., 9606). |
tissue_ontology_id | Requires a UBERON ID. |
To look up ontology terms, we recommend using the EMBL-EBI Ontology Lookup Service.
Models have a limited acceptable range of metadata input values. If you provide a value that is not in the acceptable range, the API will return an error.
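One way to catch some of these errors before sending a query is a local pre-check against the enumerated values listed above. This is only a sketch: find_invalid_values() and allowed_values are hypothetical names, the check covers just the fields whose permitted values appear in this document, and the API remains the authority on what a given model accepts.

```r
# Permitted values copied from the field list in this document.
allowed_values <- list(
  sex = c("male", "female"),
  sample_type = c(
    "cell line", "organoid", "other",
    "primary cells", "primary tissue", "xenograft"
  ),
  library_layout = c("PAIRED", "SINGLE"),
  platform = "illumina"
)

# Hypothetical helper: return the names of metadata fields whose value is
# not among the permitted values. Fields without an enumerated list
# (e.g. ontology IDs) are skipped.
find_invalid_values <- function(metadata, allowed = allowed_values) {
  fields <- intersect(names(metadata), names(allowed))
  Filter(function(f) !(metadata[[f]] %in% allowed[[f]]), fields)
}

metadata <- list(sex = "male", sample_type = "tissue", age_years = "45")
invalid <- find_invalid_values(metadata)
print(invalid)
```

Here sample_type is flagged because “tissue” is not one of the listed values, while age_years is skipped because no enumerated list exists for it.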
You can also request log-transformed CPM instead of raw counts; see ?predict_query for the relevant argument.
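For intuition about how log-transformed CPM (counts per million) relates to raw counts, the following base R sketch computes it on a small made-up matrix. Two assumptions are made here and may not match the API: samples are stored in rows, and the transform is log1p (natural log of CPM + 1).

```r
# Toy raw counts: 2 samples (rows) by 3 genes (columns). Values are
# illustrative only and do not come from the API.
counts <- matrix(
  c(10, 90, 0,
    50, 25, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample_1", "sample_2"), c("gene_a", "gene_b", "gene_c"))
)

# Scale each sample so its counts sum to one million.
# Dividing a matrix by a vector of length nrow() recycles down the columns,
# so every entry is divided by its own row total.
cpm <- counts / rowSums(counts) * 1e6

# Assumed transform: log1p, which keeps zero counts at zero.
log_cpm <- log1p(cpm)

print(round(log_cpm, 3))
```

CPM removes differences in library size between samples, and the log transform compresses the dynamic range for downstream analysis.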