Adding new modules
All modules have the following basic file structure, where all the files are necessary except for script.R that may be replaced by some alternative script. However, to get something working, you only have to consider altering script.R (or some alternative and in that case also rule.smk) and the modules JSON snippets, as shown in the examples below.
module_name
├── rule.smk
├── script.R
├── schema.json
├── info.json
├── bibtex.bib
└── docs.rst
rule.smk contains a Snakemake rule with the required fields on the proper form. The template modules run by default in a container based on the Docker image “bpimages/sandbox:1.0”. To force local execution this should be changed to None, in which case you need to make sure that the used dependencies are installed locally. On deployment (pushing to Benchpress repository) however, this field should be a Docker Hub URI on the form docker://username/image:version (see e.g. this tutorial for building and pushing images to Docker Hub).
script.R contains an R-script that is called by the rule. The variables available in the script are generated both from the Snakemake rule and the JSON object for the module file (from the wildcards dict). See the Snakemake documentation for further details on how to access variables in scripts. Note that the values are passed as strings and might have to be converted to suit your specific script.
schema.json is a JSON schema for the module. On deployment, you should alter this to restrict the fields for the JSON file.
bibtex.bib is a file with references in BibTeX format that will be accessible in docs.rst (using sphinxcontrib-bibtex syntax) and shown in the documentation.
docs.rst is a documentation file in reStructuredText (RST) format that will be included when Example. This file should contain an overview of the module and further explanation of the JSON fields if needed.
The sections below show how to get started developing your own modules. To get a quick start, you may run the following commands in the benchpress directory, to copy of each of the template modules at once. Then you can use the config file config/sandbox.json which is configured to use each of them.
cp -r resources/module_templates/new_graph workflow/rules/graph/new_graph
cp -r resources/module_templates/new_params workflow/rules/parameters/new_params
cp -r resources/module_templates/new_data workflow/rules/data/new_data
cp -r resources/module_templates/new_alg workflow/rules/structure_learning_algorithms/new_alg
cp -r resources/module_templates/new_mcmcalg workflow/rules/structure_learning_algorithms/new_mcmcalg
cp config/sandbox.json config/mysandbox.json
To test run
snakemake --cores all --use-singularity --configfile config/mysandbox.json
Graph
To create a new graph module, you may copy the template module new_graph as
cp -r resources/module_templates/new_graph workflow/rules/graph/new_graph
Listing 5 shows script.R, which generates a random undirected graph (symmetric binary matrix) and saves it properly in the CSV format specified in File formats.
# Samples the adjacency matrix for a random undirected graph.
# These are the parameters that are passed to the script from the config file.
p <- as.integer(snakemake@wildcards[["p"]])
cutoff <- as.numeric(snakemake@wildcards[["cutoff"]])
set.seed(as.integer(snakemake@wildcards[["seed"]]))
adjmat <- matrix(runif(p * p), nrow = p, ncol = p) > cutoff
adjmat <- 1 * (adjmat | t(adjmat)) # Make it symmetric (undirected)
diag(adjmat) <- 0 # No self loops
colnames(adjmat) <- as.character(seq(p))
write.table(adjmat,
file = snakemake@output[["adjmat"]], row.names = FALSE,
quote = FALSE, col.names = TRUE, sep = ","
)
In order to use the module, you need to add the following piece of JSON to the graph
subsection of the resources
section in the config file.
"new_graph": [
{
"id": "testmat",
"p": 5,
"cutoff": 0.8
}
]
Parameters
To create a new parameters module, you may copy the template module new_params as
cp -r resources/module_templates/new_params workflow/rules/parameters/new_params
Listing 6 shows script.R, which samples a covariance matrix for a multivariate Gaussian distribution from the G-Inverse Wishart distribution and saves it. This template module uses the BDgraph to sample the matrix, so this needs to be installed on your system in to be tested. The format of the saved file depends on the type of parameters used, in this case, since we sample a matrix it can be stored as a CSV file.
library(BDgraph)
set.seed(as.integer(snakemake@wildcards[["seed"]]))
# Read the adjacency matrix
df_adjmat <- read.csv(snakemake@input[["adjmat"]], header = TRUE, check.names = FALSE)
adjmat <- as.matrix(df_adjmat)
p <- dim(adjmat)[2]
precmat <- rgwish(n = 1,
adj = adjmat,
b = as.integer(snakemake@wildcards[["b"]]),
D = diag(p),
threshold = as.numeric(snakemake@wildcards[["thresh"]]))
covmat <- solve(precmat)
colnames(covmat) <- colnames(df)
write.table(covmat,
file = snakemake@output[["params"]],
row.names = FALSE,
quote = FALSE, col.names = TRUE, sep = ",")
To use the module, you need to add the following piece of JSON to the parameters
section of the JSON file.
"new_params": [
{
"id": "gwish",
"thresh": 1e-8,
"b": 3
}
]
Data
The best way to get started is to copy the template module new_data as
cp -r resources/module_templates/new_data workflow/rules/data/new_data
Listing 7 shows script.R, which generates i.i.d multivariate Gaussian data and saves it properly in the CSV format specified in File formats.
library(mvtnorm)
# Samples multivariate normal data given a covariance matrix.
# Read the covariance matrix from the params file.
df_params <- read.csv(snakemake@input[["params"]],
header = TRUE,
check.names = FALSE)
covmat <- as.matrix(df_params)
n <- as.integer(snakemake@wildcards[["n"]])
set.seed(as.integer(snakemake@wildcards[["seed"]]))
data <- rmvnorm(n, mean = rep(0, nrow(covmat)), sigma = covmat)
# Set the proper header and write the data to file.
colnames(data) <- colnames(df_params)
write.table(data,
file = snakemake@output[["data"]],
row.names = FALSE,
quote = FALSE, col.names = TRUE, sep = ","
)
In order to use the module, you need to add the following piece of JSON to the data
subsection of the resources
section in the config file.
"new_data": [
{
"id": "testdata",
"n": 100,
"standardized": false
}
]
Algorithm
In order to create a new algorithm module, you may copy the template module new_alg as
cp -r resources/module_templates/new_alg workflow/rules/structure_learning_algorithms/new_alg
Below are two examples of how to implement an algorithm module, one in R and one in Python. Which one to use is set in rule.smk
R
Listing 8 shows script.R , which generates a random binary symmetric matrix (undirected data).
Note that the actual algorithm is wrapped into the function myalg which is passed to the function add_timeout.
This is to enable the timeout functionality, which writes an empty file if the algorithm has finished before timeout
seconds, specified in the config file.
However, add_timeout is not needed if your algorithm is able to produce results after a specified amount of time.
# If you have a Docker image with files to source you can source them here.
# Be careful to use the absolute path in the Docker image.
# R.utils is needed for the timeout so make sure this is installed.
source("workflow/scripts/utils/helpers.R")
# source("/path/in/dockerimage/filetosource.R")
myalg <- function() {
# The algorithm should be in this function.
cutoff <- as.numeric(snakemake@wildcards[["cutoff"]])
data <- read.csv(snakemake@input[["data"]], check.names = FALSE)
start <- proc.time()[1]
# This is a fast and naiive algorithm.
p <- ncol(data)
Sys.sleep(3)
set.seed(as.integer(snakemake@wildcards[["seed"]]))
adjmat <- matrix(runif(p * p), nrow = p, ncol = p) > cutoff
adjmat <- 1 * (adjmat | t(adjmat))
diag(adjmat) <- 0
totaltime <- proc.time()[1] - start
colnames(adjmat) <- names(data) # Get the labels from the data
write.csv(adjmat, file = snakemake@output[["adjmat"]], row.names = FALSE, quote = FALSE)
write(totaltime, file = snakemake@output[["time"]])
# Write the true number of c.i. tests here if possible.
cat("None", file = snakemake@output[["ntests"]], sep = "\n")
}
add_timeout(myalg)
Python
Listing 9 shows script.py, which is the Python analogue of script.R shown in Listing 8. As in the R-script, to enable the timeout functionality, the actual algorithm is wrapped into the function myalg which is then used from Line 39 and below.
1sys.path.append("workflow/scripts/utils")
2# If you have a local folder for your program/library, you can add it to the path here.
3# sys.path.append("/path/to/my/local/python_lib")
4
5import time
6import numpy as np
7import pandas as pd
8from add_timeout import *
9
10def myalg():
11 # Read in data (not used in this algorithm)
12 df = pd.read_csv(snakemake.input["data"])
13
14 # The algorithm goes here
15 p = df.shape[1]
16 np.random.seed(int(snakemake.wildcards["seed"]))
17
18 m = np.random.rand(p*p).reshape((p, p))
19 m = m > float(snakemake.wildcards["cutoff"])
20 adjmat = (m | m.T) * 1
21 np.fill_diagonal(adjmat, 0)
22 adjmat.astype(int)
23
24 # Save time
25 tottime = time.perf_counter() - start
26 with open(snakemake.output["time"], "w") as text_file:
27 text_file.write(str(tottime))
28
29 # Save adjmat
30 adjmat_df = pd.DataFrame(adjmat)
31 adjmat_df.columns = df.columns
32 adjmat_df.to_csv(snakemake.output["adjmat"], index=False)
33
34 # ntests is not applicable for this algorithm
35 with open(snakemake.output["ntests"], "w") as text_file:
36 text_file.write("None")
37
38
39# This part starts the timer
40start = time.perf_counter()
41
42if snakemake.wildcards["timeout"] == "None":
43 myalg()
44else:
45 with timeoutf(int(snakemake.wildcards["timeout"]),
46 snakemake.output["adjmat"],
47 snakemake.output["time"],
48 snakemake.output["ntests"],
49 start):
50 myalg()
In order to use the module, you need to add the following piece of JSON to the list of structure learning modules in the structure_learning_algorithms
section of the JSON file, making the parameters cutoff
and timeout
accessible in the script.
"new_alg": [
{
"id": "testalg",
"cutoff": 0.8,
"timeout": null
}
]
Algorithm (MCMC)
In order to create a new MCMC algorithm module, you may copy the template module new_mcmcalg as
cp -r resources/module_templates/new_mcmcalg workflow/rules/structure_learning_algorithms/new_mcmcalg
This template runs script.R (shown below) but you may change either the entire file or the content of it.
# If you have a Docker image with files to source you can source them here.
# Be careful to use the absolute path in the Docker image.
# R.utils is needed for the timeout. Make sure this is installed.
source("workflow/scripts/utils/helpers.R")
# source("/path/in/dockerimage/filetosource.R")
filename <- file.path(snakemake@output[["seqgraph"]])
filename_data <- snakemake@input[["data"]]
seed <- as.integer(snakemake@wildcards[["mcmc_seed"]])
myalg <- function() {
data <- read.csv(filename_data, check.names = FALSE)
p <- ncol(data)
start <- proc.time()[1]
# Generate a trajectory of adjacency matrices and scores.
set.seed(seed)
adjmats <- list()
scores <- list()
cutoff <- as.numeric(snakemake@wildcards[["cutoff"]])
n_iterations <- as.integer(snakemake@wildcards[["n_iterations"]])
# sample n_iterations adjacency matrices and scores.
for (i in 1:n_iterations) {
adjmat <- matrix(runif(p * p), nrow = p, ncol = p) > cutoff
adjmat <- 1 * (adjmat | t(adjmat))
colnames(adjmat) <- names(data)
diag(adjmat) <- 0
adjmats[[i]] <- adjmat
scores[[i]] <- rnorm(1, -1000, 10)
}
totaltime <- proc.time()[1] - start
# Convert the adjacency matrices and scores to the Benchpress format.
df <- bidagtraj_to_bptraj(adjmats, scores, colnames(data), directed = FALSE)
write.csv(x = df, file = file.path(snakemake@output[["seqgraph"]]),
row.names = FALSE, quote = FALSE)
write(totaltime, file = snakemake@output[["time"]])
}
add_timeout_mcmc(myalg)
MCMC algorithms have, apart from the algorithm parameters, four additional required fields, mcmc_seed
, mcmc_estimator
, threshold
, and burnin_frac
.
mcmc_seed
is the random seed for the algorithm. mcmc_estimator
specifies which estimator to use (threshold or map), and threshold
specifies the threshold for the estimator if mcmc_estimator
is set to threshold.
burnin_frac
specifies the fraction of the samples to be discarded as burn-in.
In order to use the module, you can add the following piece of JSON to the list of structure learning modules in the structure_learning_algorithms
section of the JSON file.
"new_mcmcalg": [
{
"id": "mcmctest",
"cutoff": 0.8,
"burnin_frac": 0.5,
"mcmc_estimator": "threshold",
"threshold": 0.5,
"mcmc_seed": [1, 2, 3],
"n_iterations": 10000,
"timeout": null
}
]
Evaluation
There is not yet a general way of creating evaluation modules as their functionality and output may differ. However, you may either extend or copy one of the existing ones.
Local development
The above examples show how to create new modules from scratch, and the code runs by default in a container based on the image docker://bpimages/sandbox:1.0, where the required packages for the template modules are installed, see the Dockerfile for details.
If you instead prefer to run a module using software installed on your local computer, you should change the container
field in rule.smk to None, i.e.
container:
None
Integrating existing project
You may have an existing (possibly version controlled by e.g. git) project that you want to integrate into Benchpress. Below we show how this is done in Python and R.
Python
If you for example have a Python project in your local folder /path/to/my/local/python_lib, you may simply add the folder to the systems library path variable as:
sys.path.append("/path/to/my/local/python_lib")
R
If you for example have an R project in your local folder /path/to/my/local/r/code.R, you may source the code as:
source("/path/to/my/local/r/code.R")
Using Conda environments
Conda is an open-source platform-independent package management system widely used in the data science community. If your local projects requirements are installed on a conda environment, called e.g. myenv, you may specify that the in rule.smk as
container:
None
conda:
"myenv"
To use conda, the extra use-conda flag has to be passed to snakemake when running the workflow:
snakemake --cores all --use-singularity --configfile config/sandbox.json --use-conda
Example
The rule and Python script for a structure learning module, using the local conda environment myenv may look like Listing 11 and Listing 12.
1rule:
2 name:
3 module_name
4 input:
5 data = alg_input_data()
6 output:
7 adjmat = alg_output_adjmat_path(module_name),
8 time = alg_output_time_path(module_name),
9 ntests = alg_output_ntests_path(module_name)
10 container:
11 None
12 conda:
13 "myenv"
14 script:
15 "script.py"
1sys.path.append("workflow/scripts/utils")
2sys.path.append("/path/to/my/local/python_lib")
3
4import time
5import pandas as pd
6import numpy as np
7from add_timeout import *
8
9def myalg():
10 # Read in data
11 df = pd.read_csv(snakemake.input["data"])
12
13 # Set the seed
14 np.random.seed(int(snakemake.wildcards["seed"]))
15
16 # The algorithm goes here
17 adjmat = my_local_algorithm(df) # returns an adjacency matrix
18
19 # Save time
20 tottime = time.perf_counter() - start
21 with open(snakemake.output["time"], "w") as text_file:
22 text_file.write(str(tottime))
23
24 # Save adjmat
25 adjmat_df = pd.DataFrame(adjmat)
26 adjmat_df.columns = df.columns
27 adjmat_df.to_csv(snakemake.output["adjmat"], index=False)
28
29 # ntests is not applicable for this algorithm
30 with open(snakemake.output["ntests"], "w") as text_file:
31 text_file.write("None")
32
33
34# This part starts the timer
35start = time.perf_counter()
36
37if snakemake.wildcards["timeout"] == "None":
38 myalg()
39else:
40 with timeoutf(int(snakemake.wildcards["timeout"]),
41 snakemake.output["adjmat"],
42 snakemake.output["time"],
43 snakemake.output["ntests"],
44 start):
45 myalg()
Updating the documentation
When adding a new module you may also update the documentation. First install the requirements by
pip install -r docs/source/requirements.txt
Then make render_docs.sh executable then render and build the documentation
cd docs
chmod +x render_docs.sh
./render_docs.sh
make html
Open build/html/index.html in a web browser to see the result.