Getting Started

rsynthbio is an R package that provides a convenient interface to the Synthesize Bio API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities including bulk RNA-seq and single-cell RNA-seq.

Alternatively, you can generate datasets with AI on our web platform.

How to install

You can install rsynthbio from CRAN:

install.packages("rsynthbio")

To install the development version from GitHub, use the remotes package:

# Install remotes first if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")

Once installed, load the package:

library(rsynthbio)

Authentication

Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:

# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()

# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)

Loading your API token for a session.

# In future sessions, load the stored token
load_synthesize_token_from_keyring()

# Check if a token is already set
has_synthesize_token()

You can obtain an API token by registering at Synthesize Bio.

Security Best Practices

For security reasons, remember to clear your token when you’re done:

# Clear token from current session
clear_synthesize_token()

# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)

Never hard-code your token in scripts that will be shared or committed to version control.

Basic Usage

Available Modalities

The package supports multiple data modalities:

# Check available modalities
get_valid_modalities()

Currently supported modalities: "bulk" (bulk RNA-seq) and "single-cell" (single-cell RNA-seq).

Creating a Query

The first step in generating gene expression data is to create a query. The package provides a sample query for each modality:

# Get a sample query for bulk RNA-seq
query <- get_valid_query(modality = "bulk")

# Get a sample query for single-cell RNA-seq
query_sc <- get_valid_query(modality = "single-cell")

# Inspect the query structure
str(query)

The query consists of:

  1. modality: The type of gene expression data to generate (“bulk” or “single-cell”)
  2. mode: The prediction mode (e.g., “mean estimation” or “sample generation”)
  3. inputs: A list of biological conditions to generate data for
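
To make the three components concrete, here is a minimal sketch that builds a list with the same top-level shape by hand. The values below (the mode string, the metadata entries, and the sample counts) are illustrative assumptions rather than inputs confirmed to be accepted by the API; in practice, start from get_valid_query() and modify its output.

```r
# Hand-built list mirroring the three top-level query components.
# All values are illustrative placeholders, not validated API inputs.
example_query <- list(
  modality = "bulk",
  mode = "mean estimation",
  inputs = list(
    list(
      metadata = list(sex = "female", sample_type = "primary tissue"),
      num_samples = 5
    ),
    list(
      metadata = list(sex = "male", sample_type = "primary tissue"),
      num_samples = 5
    )
  )
)

# Top-level names and the number of conditions
names(example_query)
length(example_query$inputs)

# Total number of samples requested across all conditions
total_samples <- sum(
  vapply(example_query$inputs, function(x) x$num_samples, numeric(1))
)
total_samples
```

Each element of inputs pairs a metadata list with a num_samples value, so summing num_samples gives the number of rows to expect in the returned data.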

We train our models with diverse multi-omics datasets. There are two model modes available today: mean estimation and sample generation.

Making a Prediction

Once your query is ready, you can send it to the API to generate gene expression data:

result <- predict_query(query, as_counts = TRUE)

The result is a list of two data frames: metadata and expression.

Understanding the Async API

Behind the scenes, the API uses an asynchronous model to handle queries efficiently:

  1. Your query is submitted to the API, which returns a query ID
  2. The function automatically polls the status endpoint (default: every 2 seconds)
  3. When the query completes, results are downloaded from a signed URL
  4. Data is parsed and returned as R data frames

All of this happens automatically when you call predict_query().
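
The submit, poll, and download flow described above can be sketched in base R. This is not the package's internal code: poll_until_ready(), make_mock_status(), and the status strings "running", "ready", and "failed" are hypothetical names chosen for illustration, and the status endpoint is replaced by a mock function so the example runs offline.

```r
# Generic polling helper. `check_status` is a hypothetical function that
# returns a status string such as "running", "ready", or "failed".
poll_until_ready <- function(check_status,
                             poll_interval_seconds = 2,
                             poll_timeout_seconds = 900) {
  start <- Sys.time()
  polls <- 0
  repeat {
    polls <- polls + 1
    status <- check_status()
    if (identical(status, "ready")) {
      return(polls)
    }
    if (identical(status, "failed")) {
      stop("Query failed")
    }
    # Give up if the next wait would exceed the timeout
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed + poll_interval_seconds > poll_timeout_seconds) {
      stop("Timed out waiting for the query to complete")
    }
    Sys.sleep(poll_interval_seconds)
  }
}

# Mock status endpoint: reports "running" n_running times, then "ready".
make_mock_status <- function(n_running = 2) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls <= n_running) "running" else "ready"
  }
}

# Number of status checks needed before the mock query is ready
n_polls <- poll_until_ready(make_mock_status(2), poll_interval_seconds = 0.01)
n_polls
```

The helper returns the number of status checks it made. A real client would return the downloaded data at that point instead.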

Controlling Async Behavior

You can customize the polling behavior if needed:

# Increase timeout for large queries (default: 900 seconds = 15 minutes)
result <- predict_query(
  query,
  poll_timeout_seconds = 1800, # 30 minutes
  poll_interval_seconds = 5 # Check every 5 seconds instead of 2
)

Modifying a Query

You can customize the query to fit your specific research needs:

# Adjust number of samples
query$inputs[[1]]$num_samples <- 10

# Add a new condition
query$inputs[[3]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue"
  ),
  num_samples = 3
)

The input metadata is a list of lists.
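
Assigning to query$inputs[[3]] assumes the query already holds exactly two conditions. On a query with a single condition, the same assignment would leave a NULL entry at position 2. A sketch of an index-free alternative follows, using a hand-built stand-in for the query; add_condition() is a hypothetical helper, not part of the package, and the metadata values are illustrative.

```r
# Stand-in for a query with one existing condition (illustrative values).
query_demo <- list(
  inputs = list(
    list(
      metadata = list(sex = "female", sample_type = "primary tissue"),
      num_samples = 5
    )
  )
)

# Hypothetical helper: append a condition at the next free position.
add_condition <- function(query, metadata, num_samples) {
  new_input <- list(metadata = metadata, num_samples = num_samples)
  query$inputs <- c(query$inputs, list(new_input))
  query
}

query_demo <- add_condition(
  query_demo,
  metadata = list(sex = "male", sample_type = "primary tissue"),
  num_samples = 3
)

# The new condition is now the last element of inputs
length(query_demo$inputs)
query_demo$inputs[[2]]$metadata$sex
```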

Here are the available metadata fields:

Biological:

Perturbational:

Technical:

Acceptable Metadata Values

The following are the valid values or expected formats for selected metadata keys:

Metadata Field            Requirement / Example
cell_line_ontology_id     Requires a Cellosaurus ID.
cell_type_ontology_id     Requires a CL ID.
disease_ontology_id       Requires a MONDO ID.
perturbation_ontology_id  Must be a valid Ensembl gene ID (e.g., ENSG00000156127), ChEBI ID (e.g., CHEBI:16681), ChEMBL ID (e.g., CHEMBL1234567), or NCBI Taxonomy ID (e.g., 9606).
tissue_ontology_id        Requires a UBERON ID.

To look up ontology terms, we recommend using the EMBL-EBI Ontology Lookup Service.
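
A quick local format check can catch typos in perturbation_ontology_id values before a query is sent. The patterns below are inferred only from the example IDs in the table above, and classify_perturbation_id() is a hypothetical helper; the API's own validation may be stricter or differ in detail, and a well-formed ID is not necessarily one the model accepts.

```r
# Patterns inferred from the example IDs in the table above.
# These are assumptions, not the API's validation rules.
perturbation_id_patterns <- c(
  ensembl_gene  = "^ENSG[0-9]{11}$",
  chebi         = "^CHEBI:[0-9]+$",
  chembl        = "^CHEMBL[0-9]+$",
  ncbi_taxonomy = "^[0-9]+$"
)

# Hypothetical helper: name of the first matching ID scheme, or NA.
classify_perturbation_id <- function(id) {
  hits <- vapply(
    perturbation_id_patterns,
    function(pattern) grepl(pattern, id),
    logical(1)
  )
  if (any(hits)) names(hits)[which(hits)[1]] else NA_character_
}

classify_perturbation_id("ENSG00000156127")
classify_perturbation_id("CHEBI:16681")
classify_perturbation_id("CHEMBL1234567")
classify_perturbation_id("9606")
classify_perturbation_id("TP53") # gene symbol, matches no scheme
```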

Models have a limited acceptable range of metadata input values. If you provide a value that is not in the acceptable range, the API will return an error.
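
When running several queries in a loop, it can help to capture such errors rather than let one rejected value stop the whole script. The sketch below uses base R's tryCatch(); try_call() and failing_call() are hypothetical names, and the error message is a made-up stand-in for whatever the API actually returns.

```r
# Stand-in for a call that fails, e.g. because the API rejected a value.
# In practice this would be a function that calls predict_query().
failing_call <- function() {
  stop("mock API error: invalid metadata value")
}

# Hypothetical wrapper: run `fn` and return either its value or the
# error message, so one bad input does not abort a batch.
try_call <- function(fn) {
  tryCatch(
    list(ok = TRUE, value = fn(), error = NULL),
    error = function(e) {
      list(ok = FALSE, value = NULL, error = conditionMessage(e))
    }
  )
}

outcome <- try_call(failing_call)
outcome$ok
outcome$error
```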

Additional Prediction Options

You can also request log-transformed CPM instead of raw counts:

# Request log-transformed CPM instead of raw counts
result_log <- predict_query(query, as_counts = FALSE)
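
For readers unfamiliar with CPM (counts per million), the relationship between raw counts and log-transformed CPM can be worked through on a toy matrix. The exact transformation the API applies (log base and pseudocount) is not stated here, so log2(CPM + 1) below is an assumed, commonly used convention, as is the samples-as-rows orientation.

```r
# Toy count matrix: 2 samples (rows) x 3 genes (columns).
counts <- matrix(
  c(10, 30, 60,
     0, 50, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample_1", "sample_2"), c("gene_a", "gene_b", "gene_c"))
)

# Counts per million: scale each sample so its counts sum to 1e6.
# Dividing a matrix by a vector of length nrow() divides row-wise.
cpm <- counts / rowSums(counts) * 1e6

# Log transform with a pseudocount of 1 (assumed convention).
log_cpm <- log2(cpm + 1)

rowSums(cpm)
round(log_cpm, 2)
```

Because CPM normalizes by each sample's total, log CPM values are comparable across samples with different sequencing depths, whereas raw counts are not.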

Working with Results

# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Check dimensions
dim(expression)

# View metadata sample
head(metadata)

You may want to process the data in chunks or save it for later use:

# Save results to RDS file
saveRDS(result, "synthesize_results.rds")

# Load previously saved results
result <- readRDS("synthesize_results.rds")

# Export as CSV
write.csv(result$expression, "expression_matrix.csv")
write.csv(result$metadata, "sample_metadata.csv")
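
The save and export steps can be tried offline on a mock result. The sketch below uses a toy list with the same two components (values and column names are made up) and temporary files. It also shows one design choice: write.csv() writes row names as an extra unnamed column by default, which reads back as a column named X, so row.names = FALSE is used here.

```r
# Mock result with the same two components (toy values).
result_demo <- list(
  metadata = data.frame(sample_id = c("s1", "s2"), sex = c("female", "male")),
  expression = data.frame(gene_a = c(10.5, 0), gene_b = c(30, 50.25))
)

# Round trip through an RDS file in a temporary location.
rds_path <- tempfile(fileext = ".rds")
saveRDS(result_demo, rds_path)
restored <- readRDS(rds_path)
all.equal(restored, result_demo)

# CSV export without the automatic row-name column, then read back.
csv_path <- tempfile(fileext = ".csv")
write.csv(result_demo$expression, csv_path, row.names = FALSE)
expression_back <- read.csv(csv_path)
names(expression_back)
all.equal(expression_back, result_demo$expression)
```

RDS preserves the whole list in one file, including column types, while CSV is easier to share with non-R tools.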

Custom Validation

You can validate your queries before sending them to the API:

# Validate structure
validate_query(query)

# Validate modality
validate_modality(query)

Session info

sessionInfo()

Additional Resources