Adding new modules

All modules share the following basic file structure. All of the files are required, except that script.R may be replaced by an alternative script. To get something working, you only need to alter script.R (or an alternative script, in which case rule.smk must be updated accordingly) and the module's JSON snippets, as shown in the examples below.

module_name
├── rule.smk
├── script.R
├── schema.json
├── info.json
├── bibtex.bib
└── docs.rst
  • rule.smk contains a Snakemake rule with the required fields in the proper form. The template modules run by default in a container based on the Docker image “bpimages/sandbox:1.0”. To force local execution, change this field to None, in which case you need to make sure that the required dependencies are installed locally. On deployment (pushing to the Benchpress repository), however, this field should be a Docker Hub URI of the form docker://username/image:version (see e.g. this tutorial for building and pushing images to Docker Hub).

  • script.R contains an R script that is called by the rule. The variables available in the script are generated both from the Snakemake rule and from the JSON object for the module (via the wildcards dict). See the Snakemake documentation for further details on how to access variables in scripts. Note that the values are passed as strings and may have to be converted to suit your specific script.

  • schema.json is a JSON schema for the module. On deployment, you should alter this to restrict the allowed fields of the module's JSON snippet (see the sketch after this list).

  • info.json is a JSON file with information about the module that is parsed when updating the documentation (see Updating the documentation).

  • bibtex.bib is a file with references in BibTeX format that will be accessible in docs.rst (using sphinxcontrib-bibtex syntax) and shown in the documentation.

  • docs.rst is a documentation file in reStructuredText (RST) format that is included when updating the documentation (see Updating the documentation). This file should contain an overview of the module and, if needed, further explanation of the JSON fields.
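
For illustration, a schema.json restricting the fields of the new_graph example used below might look like the following sketch (the property names id, p and cutoff match that example; the actual template schemas may differ):

{
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": { "type": "string" },
            "p": { "type": "integer", "minimum": 1 },
            "cutoff": { "type": "number", "minimum": 0, "maximum": 1 }
        },
        "required": ["id", "p", "cutoff"]
    }
}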

The sections below show how to get started developing your own modules. For a quick start, you may run the following commands in the benchpress directory to copy each of the template modules at once. Then you can use the config file config/sandbox.json, which is configured to use each of them.

cp -r resources/module_templates/new_graph workflow/rules/graph/new_graph
cp -r resources/module_templates/new_params workflow/rules/parameters/new_params
cp -r resources/module_templates/new_data workflow/rules/data/new_data
cp -r resources/module_templates/new_alg workflow/rules/structure_learning_algorithms/new_alg
cp -r resources/module_templates/new_mcmcalg workflow/rules/structure_learning_algorithms/new_mcmcalg
cp config/sandbox.json config/mysandbox.json

To test, run

snakemake --cores all --use-singularity --configfile config/mysandbox.json

Graph

To create a new graph module, you may copy the template module new_graph as

cp -r resources/module_templates/new_graph workflow/rules/graph/new_graph

Listing 5 shows script.R, which generates a random undirected graph (a symmetric binary adjacency matrix) and saves it in the CSV format specified in File formats.

Listing 5 script.R from new_graph.
# Samples the adjacency matrix for a random undirected graph.

# These are the parameters that are passed to the script from the config file.
p <- as.integer(snakemake@wildcards[["p"]])
cutoff <- as.numeric(snakemake@wildcards[["cutoff"]])

set.seed(as.integer(snakemake@wildcards[["seed"]]))

adjmat <- matrix(runif(p * p), nrow = p, ncol = p) > cutoff
adjmat <- 1 * (adjmat | t(adjmat)) # Make it symmetric (undirected)
diag(adjmat) <- 0 # No self loops
colnames(adjmat) <- as.character(seq(p))

write.table(adjmat,
  file = snakemake@output[["adjmat"]], row.names = FALSE,
  quote = FALSE, col.names = TRUE, sep = ","
)

In order to use the module, you need to add the following piece of JSON to the graph subsection of the resources section in the config file.

"new_graph": [
    {
        "id": "testmat",
        "p": 5,
        "cutoff": 0.8
    }
]
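
For orientation, the snippet is nested in the config file roughly as follows (a sketch of the placement described above; other entries elided):

"resources": {
    "graph": {
        "new_graph": [ ... ],
        ...
    },
    ...
}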

Parameters

To create a new parameters module, you may copy the template module new_params as

cp -r resources/module_templates/new_params workflow/rules/parameters/new_params

Listing 6 shows script.R, which samples a covariance matrix for a multivariate Gaussian distribution from the G-inverse Wishart distribution and saves it. This template module uses the BDgraph package to sample the matrix, so BDgraph needs to be installed on your system for the module to be tested. The format of the saved file depends on the type of parameters used; in this case, since we sample a matrix, it can be stored as a CSV file.

Listing 6 script.R from new_params.
library(BDgraph)

set.seed(as.integer(snakemake@wildcards[["seed"]]))

# Read the adjacency matrix
df_adjmat <- read.csv(snakemake@input[["adjmat"]], header = TRUE, check.names = FALSE)
adjmat <- as.matrix(df_adjmat)
p <- dim(adjmat)[2]

precmat <- rgwish(n = 1,
                  adj = adjmat,
                  b = as.integer(snakemake@wildcards[["b"]]),
                  D = diag(p),
                  threshold = as.numeric(snakemake@wildcards[["thresh"]]))
covmat <- solve(precmat)

colnames(covmat) <- colnames(df_adjmat)
write.table(covmat,
            file = snakemake@output[["params"]],
            row.names = FALSE,
            quote = FALSE, col.names = TRUE, sep = ",")

To use the module, you need to add the following piece of JSON to the parameters subsection of the resources section in the config file.

"new_params": [
    {
        "id": "gwish",
        "thresh": 1e-8,
        "b": 3
    }
]

Data

The best way to get started is to copy the template module new_data as

cp -r resources/module_templates/new_data workflow/rules/data/new_data

Listing 7 shows script.R, which generates i.i.d. multivariate Gaussian data and saves it in the CSV format specified in File formats.

Listing 7 script.R from new_data.
library(mvtnorm)
# Samples multivariate normal data given a covariance matrix.

# Read the covariance matrix from the params file.
df_params <- read.csv(snakemake@input[["params"]], 
                      header = TRUE, 
                      check.names = FALSE)
covmat <- as.matrix(df_params)
n <- as.integer(snakemake@wildcards[["n"]])

set.seed(as.integer(snakemake@wildcards[["seed"]]))
data <- rmvnorm(n, mean = rep(0, nrow(covmat)), sigma = covmat)

# Set the proper header and write the data to file. 
colnames(data) <- colnames(df_params)
write.table(data,
            file = snakemake@output[["data"]],
            row.names = FALSE,
            quote = FALSE, col.names = TRUE, sep = ","
            )

In order to use the module, you need to add the following piece of JSON to the data subsection of the resources section in the config file.

"new_data": [
    {
        "id": "testdata",
        "n": 100,
        "standardized": false
    }
]

Algorithm

In order to create a new algorithm module, you may copy the template module new_alg as

cp -r resources/module_templates/new_alg workflow/rules/structure_learning_algorithms/new_alg

Below are two examples of how to implement an algorithm module, one in R and one in Python. Which one is used is set in rule.smk.
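
For example, the script directive at the end of the rule points to the script that is run, so switching between the two implementations is a matter of changing this path (a sketch; see Listing 11 for a full rule):

script:
    "script.R"  # or "script.py" for the Python implementation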

R

Listing 8 shows script.R, which generates a random binary symmetric matrix (an undirected graph). Note that the actual algorithm is wrapped in the function myalg, which is passed to the function add_timeout. This enables the timeout functionality, which writes an empty file if the algorithm has not finished within timeout seconds, as specified in the config file. However, add_timeout is not needed if your algorithm is itself able to produce results within a specified amount of time.

Listing 8 script.R from new_alg.
# If you have a Docker image with files to source you can source them here.
# Be careful to use the absolute path in the Docker image.
# R.utils is needed for the timeout so make sure this is installed.
source("workflow/scripts/utils/helpers.R")
# source("/path/in/dockerimage/filetosource.R")

myalg <- function() {
    # The algorithm should be in this function.

    cutoff <- as.numeric(snakemake@wildcards[["cutoff"]])
    data <- read.csv(snakemake@input[["data"]], check.names = FALSE)
    start <- proc.time()[1]

    # This is a fast and naive algorithm.
    p <- ncol(data)
    Sys.sleep(3) # Simulate some computation time.
    set.seed(as.integer(snakemake@wildcards[["seed"]]))
    adjmat <- matrix(runif(p * p), nrow = p, ncol = p) > cutoff
    adjmat <- 1 * (adjmat | t(adjmat))
    diag(adjmat) <- 0
    totaltime <- proc.time()[1] - start
    colnames(adjmat) <- names(data) # Get the labels from the data
    write.csv(adjmat, file = snakemake@output[["adjmat"]], row.names = FALSE, quote = FALSE)
    write(totaltime, file = snakemake@output[["time"]])
    # Write the true number of c.i. tests here if possible.
    cat("None", file = snakemake@output[["ntests"]], sep = "\n") 
}

add_timeout(myalg)

Python

Listing 9 shows script.py, which is the Python analogue of script.R shown in Listing 8. As in the R script, to enable the timeout functionality the actual algorithm is wrapped in the function myalg, which is then called at the bottom of the script, guarded by the timeout logic.

Listing 9 script.py from new_alg.
import sys
sys.path.append("workflow/scripts/utils")
# If you have a local folder for your program/library, you can add it to the path here.
# sys.path.append("/path/to/my/local/python_lib")

import time
import numpy as np
import pandas as pd
from add_timeout import *

def myalg():
    # Read in data (not used in this algorithm)
    df = pd.read_csv(snakemake.input["data"])

    # The algorithm goes here
    p = df.shape[1]
    np.random.seed(int(snakemake.wildcards["seed"]))

    m = np.random.rand(p * p).reshape((p, p))
    m = m > float(snakemake.wildcards["cutoff"])
    adjmat = (m | m.T) * 1
    np.fill_diagonal(adjmat, 0)
    adjmat = adjmat.astype(int)

    # Save time
    tottime = time.perf_counter() - start
    with open(snakemake.output["time"], "w") as text_file:
        text_file.write(str(tottime))

    # Save adjmat
    adjmat_df = pd.DataFrame(adjmat)
    adjmat_df.columns = df.columns
    adjmat_df.to_csv(snakemake.output["adjmat"], index=False)

    # ntests is not applicable for this algorithm
    with open(snakemake.output["ntests"], "w") as text_file:
        text_file.write("None")


# This part starts the timer
start = time.perf_counter()

if snakemake.wildcards["timeout"] == "None":
    myalg()
else:
    with timeoutf(int(snakemake.wildcards["timeout"]),
                  snakemake.output["adjmat"],
                  snakemake.output["time"],
                  snakemake.output["ntests"],
                  start):
        myalg()

In order to use the module, you need to add the following piece of JSON to the list of structure learning modules in the structure_learning_algorithms section of the config file, making the parameters cutoff and timeout accessible in the script.

"new_alg": [
    {
        "id": "testalg",
        "cutoff": 0.8,
        "timeout": null
    }
]

Algorithm (MCMC)

In order to create a new MCMC algorithm module, you may copy the template module new_mcmcalg as

cp -r resources/module_templates/new_mcmcalg workflow/rules/structure_learning_algorithms/new_mcmcalg

This template runs script.R (shown below), but you may replace the entire file or modify its contents.

Listing 10 script.R from new_mcmcalg.
# If you have a Docker image with files to source you can source them here.
# Be careful to use the absolute path in the Docker image.
# R.utils is needed for the timeout. Make sure this is installed.
source("workflow/scripts/utils/helpers.R")
# source("/path/in/dockerimage/filetosource.R")

filename <- file.path(snakemake@output[["seqgraph"]])
filename_data <- snakemake@input[["data"]]
seed <- as.integer(snakemake@wildcards[["mcmc_seed"]])

myalg <- function() {
    data <- read.csv(filename_data, check.names = FALSE)
    p <- ncol(data)
    start <- proc.time()[1]

    # Generate a trajectory of adjacency matrices and scores.
    set.seed(seed)
    adjmats <- list()
    scores <- list()

    cutoff <- as.numeric(snakemake@wildcards[["cutoff"]])
    n_iterations <- as.integer(snakemake@wildcards[["n_iterations"]])
    # sample n_iterations adjacency matrices and scores.
    for (i in 1:n_iterations) {
        adjmat <- matrix(runif(p * p), nrow = p, ncol = p) > cutoff
        adjmat <- 1 * (adjmat | t(adjmat))
        colnames(adjmat) <- names(data)
        diag(adjmat) <- 0
        adjmats[[i]] <- adjmat
        scores[[i]] <- rnorm(1, -1000, 10)
    }
    totaltime <- proc.time()[1] - start

    # Convert the adjacency matrices and scores to the Benchpress format.
    df <- bidagtraj_to_bptraj(adjmats, scores, colnames(data), directed = FALSE)

    write.csv(x = df, file = filename,
              row.names = FALSE, quote = FALSE)

    write(totaltime, file = snakemake@output[["time"]])
}

add_timeout_mcmc(myalg)

MCMC algorithms have, apart from the algorithm parameters, four additional required fields: mcmc_seed, mcmc_estimator, threshold, and burnin_frac. mcmc_seed is the random seed for the algorithm. mcmc_estimator specifies which estimator to use (threshold or map), and threshold specifies the threshold for the estimator when mcmc_estimator is set to threshold. burnin_frac specifies the fraction of the samples to be discarded as burn-in. In order to use the module, you can add the following piece of JSON to the list of structure learning modules in the structure_learning_algorithms section of the config file.

"new_mcmcalg": [
    {
        "id": "mcmctest",
        "cutoff": 0.8,
        "burnin_frac": 0.5,
        "mcmc_estimator": "threshold",
        "threshold": 0.5,
        "mcmc_seed": [1, 2, 3],
        "n_iterations": 10000,
        "timeout": null
    }
]
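
Conceptually, when mcmc_estimator is set to threshold, the estimate is obtained from the sampled trajectory roughly as in the following R sketch, which illustrates the meaning of burnin_frac and threshold rather than Benchpress's actual implementation:

# adjmats is the list of sampled adjacency matrices from the trajectory.
n_burnin <- floor(burnin_frac * length(adjmats))
kept <- adjmats[(n_burnin + 1):length(adjmats)] # discard the burn-in samples
edge_probs <- Reduce(`+`, kept) / length(kept)  # estimated edge probabilities
estimate <- 1 * (edge_probs > threshold)        # keep edges above the threshold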

Evaluation

There is not yet a general way of creating evaluation modules, as their functionality and output may differ. However, you may either extend or copy one of the existing ones.

Local development

The above examples show how to create new modules from scratch. The code runs by default in a container based on the image docker://bpimages/sandbox:1.0, where the required packages for the template modules are installed (see the Dockerfile for details). If you instead prefer to run a module using software installed on your local computer, change the container field in rule.smk to None, i.e.

container:
    None

Integrating existing project

You may have an existing project (possibly version controlled with e.g. git) that you want to integrate into Benchpress. Below we show how this is done in Python and R.

Python

If you, for example, have a Python project in the local folder /path/to/my/local/python_lib, you may simply add the folder to Python's module search path:

sys.path.append("/path/to/my/local/python_lib")
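
Modules in that folder can then be imported as usual; for example (the module name mylib and the function my_local_algorithm are hypothetical, cf. Listing 12):

import sys
sys.path.append("/path/to/my/local/python_lib")
from mylib import my_local_algorithm  # hypothetical module and function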

R

If you, for example, have an R script at /path/to/my/local/r/code.R, you may source the code as:

source("/path/to/my/local/r/code.R")
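
Any functions defined in code.R are then available in the script; for example (my_local_algorithm is a hypothetical function defined in code.R):

source("/path/to/my/local/r/code.R")
adjmat <- my_local_algorithm(data)  # hypothetical function from code.R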

Using Conda environments

Conda is an open-source, platform-independent package management system widely used in the data science community. If your local project's requirements are installed in a conda environment called e.g. myenv, you may specify that in rule.smk as

container:
    None
conda:
    "myenv"

To use conda, the extra --use-conda flag has to be passed to snakemake when running the workflow:

snakemake --cores all --use-singularity --configfile config/sandbox.json --use-conda

Example

The rule and Python script for a structure learning module using the local conda environment myenv may look like Listing 11 and Listing 12.

Listing 11 Rule using conda.
rule:
    name:
        module_name
    input:
        data = alg_input_data()
    output:
        adjmat = alg_output_adjmat_path(module_name),
        time = alg_output_time_path(module_name),
        ntests = alg_output_ntests_path(module_name)
    container:
        None
    conda:
        "myenv"
    script:
        "script.py"
Listing 12 Script importing local code.
import sys
sys.path.append("workflow/scripts/utils")
sys.path.append("/path/to/my/local/python_lib")

import time
import pandas as pd
import numpy as np
from add_timeout import *

def myalg():
    # Read in data
    df = pd.read_csv(snakemake.input["data"])

    # Set the seed
    np.random.seed(int(snakemake.wildcards["seed"]))

    # The algorithm goes here
    adjmat = my_local_algorithm(df)  # returns an adjacency matrix

    # Save time
    tottime = time.perf_counter() - start
    with open(snakemake.output["time"], "w") as text_file:
        text_file.write(str(tottime))

    # Save adjmat
    adjmat_df = pd.DataFrame(adjmat)
    adjmat_df.columns = df.columns
    adjmat_df.to_csv(snakemake.output["adjmat"], index=False)

    # ntests is not applicable for this algorithm
    with open(snakemake.output["ntests"], "w") as text_file:
        text_file.write("None")


# This part starts the timer
start = time.perf_counter()

if snakemake.wildcards["timeout"] == "None":
    myalg()
else:
    with timeoutf(int(snakemake.wildcards["timeout"]),
                  snakemake.output["adjmat"],
                  snakemake.output["time"],
                  snakemake.output["ntests"],
                  start):
        myalg()

Updating the documentation

When adding a new module, you may also update the documentation. First install the requirements:

pip install -r docs/source/requirements.txt

Then make render_docs.sh executable, and render and build the documentation:

cd docs
chmod +x render_docs.sh
./render_docs.sh
make html

Open build/html/index.html in a web browser to see the result.