JSON config
The JSON configuration file, together with Snakemake's command-line tool, serves as the user interface. Below we describe the main structure of a config file. For reference, Listing 5 shows the content of config/paper_pc_vs_dualpc.json (with additional comments), which is a comparison study between PC (pcalg) and Dual PC (dualPC). The results of this study can be found in PC vs. dual PC.
At the highest level there are two main sections, benchmark_setup (Line 2) and resources (Line 40).
The modules used in benchmark_setup are specified in resources and referenced by their corresponding IDs.
resources
The resources section contains the subsections graph (Line 50), parameters (Line 64), data (Line 41), and structure_learning_algorithms (Line 73), which contain the modules used in the study.
Each module in turn has a list of JSON objects, where each of the objects defines a specific parameter setting.
The objects are identified by unique IDs (see Lines 44, 53, 67, 76, and 86).
The parametrisations for the modules can be either single values (see e.g. Line 78) or lists (see e.g. Line 77).
In the case of lists, the module runs for each of the values in the list.
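To make the list behaviour concrete, the following is a minimal Python sketch of how a module object with list-valued parameters can be expanded into individual settings. The function expand_settings is hypothetical and not part of Benchpress. Combining several lists as a Cartesian product is an assumption made for illustration, since the text above only states that the module runs for each value in a list.

```python
from itertools import product


def expand_settings(module_object):
    """Return one plain setting per combination of list-valued parameters.

    Hypothetical helper for illustration only; combining several lists
    as a Cartesian product is an assumption, not documented behaviour.
    """
    fixed = {k: v for k, v in module_object.items() if not isinstance(v, list)}
    listed = {k: v for k, v in module_object.items() if isinstance(v, list)}
    keys = list(listed)
    settings = []
    for combo in product(*(listed[k] for k in keys)):
        setting = dict(fixed)
        setting.update(zip(keys, combo))
        settings.append(setting)
    return settings


# The dualpc object from the listing (JSON null/false written as Python values).
dualpc = {
    "id": "dualpc",
    "alpha": [0.001, 0.05, 0.1],
    "skeleton": False,
    "pattern_graph": False,
    "max_ord": None,
    "timeout": None,
}

settings = expand_settings(dualpc)
for setting in settings:
    print(setting["id"], setting["alpha"])
```

Here the single object with ID dualpc yields three settings, one for each value of alpha.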
benchmark_setup
The benchmark_setup section contains a list of data setups with the data models (data, Line 5) and evaluation methods (evaluation, Line 13) a user wishes to consider for analysis.
In this example the list contains only one data setup, named pc_vs_dualpc (Line 4).
The data section should contain a list, where each item defines a certain data setup. For each seed number \(i\) in the range specified by seed_range (Line 10), a triple (\(G_i, \Theta_i, \mathbf Y_i\)) is generated, where \(G_i\) is obtained as specified by graph_id (Line 7). Conditional on \(G_i\), the model parameters \(\Theta_i\) are obtained according to parameters_id (Line 8). The data matrix \(\mathbf Y_i = (Y^j)_{j=1}^n\) is sampled conditional on \((G_i,\Theta_i)\) as specified by data_id (Line 9).
The evaluation section contains the evaluation methods used for the analysis. Descriptions of the available evaluation methods can be found in Evaluation.
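The seed-wise generation of the triples can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the Benchpress implementation: sample_triple and its generators are hypothetical stand-ins for the graph, parameters, and data modules, and we assume that both end points of seed_range are included.

```python
import random


def sample_triple(seed, p=5, n=3, edge_prob=0.4, w_min=0.25, w_max=1.0):
    """Toy stand-in for one (G_i, Theta_i, Y_i); not Benchpress code."""
    rng = random.Random(seed)
    # G_i: random DAG on nodes 0..p-1, edges only point from lower to higher index.
    edges = [(a, b) for a in range(p) for b in range(a + 1, p)
             if rng.random() < edge_prob]
    # Theta_i given G_i: one edge weight per edge.
    weights = {edge: rng.uniform(w_min, w_max) for edge in edges}
    # Y_i given (G_i, Theta_i): n samples from a linear Gaussian model.
    data = []
    for _ in range(n):
        row = [0.0] * p
        for b in range(p):  # parents always have a lower index, so they are set first
            parent_part = sum(w * row[a] for (a, child), w in weights.items()
                              if child == b)
            row[b] = parent_part + rng.gauss(0.0, 1.0)
        data.append(row)
    return edges, weights, data


seed_range = [1, 10]
# Assumption: the range includes both end points.
seeds = range(seed_range[0], seed_range[1] + 1)
triples = {i: sample_triple(i) for i in seeds}
print(len(triples), "triples generated")
```

Since each triple depends only on its seed, rerunning with the same seed_range reproduces the same graphs, parameters, and data in this sketch.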
Listing 5: config/paper_pc_vs_dualpc.json (with additional comments)
  1 {
  2   "benchmark_setup": [ // the benchmark_setup (only one in this study)
  3     {
  4       "title": "pc_vs_dualpc",
  5       "data": [ // the data setups
  6         {
  7           "graph_id": "avneigs4_p80", // see line 53
  8           "parameters_id": "SEM", // see line 67
  9           "data_id": "standardized", // see line 44
 10           "seed_range": [1, 10]
 11         }
 12       ],
 13       "evaluation": { // the evaluation modules
 14         "benchmarks": {
 15           "filename_prefix": "paper_pc_vs_dualpc/",
 16           "show_seed": true,
 17           "errorbar": true,
 18           "errorbarh": false,
 19           "scatter": true,
 20           "path": true,
 21           "text": false,
 22           "ids": [
 23             "pc-gaussCItest", // see line 86
 24             "dualpc" // see line 76
 25           ]
 26         },
 27         "graph_true_plots": true,
 28         "graph_true_stats": true,
 29         "ggally_ggpairs": false,
 30         "graph_plots": [
 31           "pc-gaussCItest",
 32           "dualpc"
 33         ],
 34         "mcmc_traj_plots": [],
 35         "mcmc_heatmaps": [],
 36         "mcmc_autocorr_plots": []
 37       }
 38     }
 39   ],
 40   "resources": {
 41     "data": { // the data modules
 42       "iid": [
 43         {
 44           "id": "standardized",
 45           "standardized": true,
 46           "n": 300
 47         }
 48       ]
 49     },
 50     "graph": { // the graph modules
 51       "pcalg_randdag": [
 52         {
 53           "id": "avneigs4_p80",
 54           "max_parents": 5,
 55           "n": 80,
 56           "d": 4,
 57           "par1": null,
 58           "par2": null,
 59           "method": "er",
 60           "DAG": true
 61         }
 62       ]
 63     },
 64     "parameters": { // the parameters modules
 65       "sem_params": [
 66         {
 67           "id": "SEM",
 68           "min": 0.25,
 69           "max": 1
 70         }
 71       ]
 72     },
 73     "structure_learning_algorithms": { // the structure learning modules
 74       "dualpc": [
 75         {
 76           "id": "dualpc",
 77           "alpha": [0.001, 0.05, 0.1],
 78           "skeleton": false,
 79           "pattern_graph": false,
 80           "max_ord": null,
 81           "timeout": null
 82         }
 83       ],
 84       "pcalg_pc": [
 85         {
 86           "id": "pc-gaussCItest",
 87           "alpha": [0.001, 0.05, 0.1],
 88           "NAdelete": true,
 89           "mmax": "Inf",
 90           "u2pd": "relaxed",
 91           "skelmethod": "stable",
 92           "conservative": false,
 93           "majrule": false,
 94           "solveconfl": false,
 95           "numCores": 1,
 96           "verbose": false,
 97           "edgeConstrains": null,
 98           "indepTest": "gaussCItest",
 99           "timeout": null
100         }
101       ]
102     }
103   }
104 }
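Because benchmark_setup refers to the modules in resources only through their IDs, a mistyped ID is easy to overlook. As a rough sketch, the following Python snippet checks that every ID used in benchmark_setup is defined in resources. The helpers defined_ids and unresolved_references are hypothetical and not part of Benchpress, and the config dictionary is a trimmed version of the listing that keeps only the fields needed for the check.

```python
def defined_ids(resources):
    """Map each resources subsection to the set of object IDs it defines."""
    return {
        section: {obj["id"] for objects in modules.values() for obj in objects}
        for section, modules in resources.items()
    }


def unresolved_references(config):
    """List (key, value) pairs in benchmark_setup with no matching ID in resources."""
    ids = defined_ids(config["resources"])
    key_to_section = (
        ("graph_id", "graph"),
        ("parameters_id", "parameters"),
        ("data_id", "data"),
    )
    missing = []
    for setup in config["benchmark_setup"]:
        for item in setup["data"]:
            for key, section in key_to_section:
                ref = item[key]
                if ref is not None and ref not in ids.get(section, set()):
                    missing.append((key, ref))
        for alg_id in setup["evaluation"]["benchmarks"]["ids"]:
            if alg_id not in ids.get("structure_learning_algorithms", set()):
                missing.append(("ids", alg_id))
    return missing


# Trimmed version of the listing: only IDs and references are kept.
config = {
    "benchmark_setup": [{
        "title": "pc_vs_dualpc",
        "data": [{"graph_id": "avneigs4_p80", "parameters_id": "SEM",
                  "data_id": "standardized", "seed_range": [1, 10]}],
        "evaluation": {"benchmarks": {"ids": ["pc-gaussCItest", "dualpc"]}},
    }],
    "resources": {
        "data": {"iid": [{"id": "standardized"}]},
        "graph": {"pcalg_randdag": [{"id": "avneigs4_p80"}]},
        "parameters": {"sem_params": [{"id": "SEM"}]},
        "structure_learning_algorithms": {
            "dualpc": [{"id": "dualpc"}],
            "pcalg_pc": [{"id": "pc-gaussCItest"}],
        },
    },
}

print(unresolved_references(config))
```

A reported reference is not necessarily an error, since graph_id, parameters_id, and data_id may also name a fixed file, as described under Data scenarios.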
Data scenarios
Apart from the modules used in Listing 5, Benchpress also provides the special modules fixed_graph, fixed_params, and fixed_data, which allow the user to provide their own files for the analysis. These modules are not part of the resources section of the JSON file and are not referenced by IDs; instead, the files are simply referenced by their names. The file formats are described in File formats.
The different sources of data, obtained by combining the fixed files and the ordinary modules, can be summarised in the five scenarios shown in the table below. I) Data analysis (fixed data) is the typical scenario for data analysts, where the user manually provides one or more datasets. II) Data analysis with validation is similar to I) Data analysis (fixed data), with the difference that the user also provides the true graph underlying the data. This situation arises e.g. when replicating a simulation study from the literature, where both the true graph and the dataset are given. Scenarios III) Fixed graph and parameters to V) Fully generated are pure benchmarking scenarios, where either the graphs, parameters, and data are all generated (V) Fully generated) or the graphs, and possibly the parameters, are specified by the user (III) Fixed graph and parameters, IV) Fixed graph).
    | Graph     | Parameters | Data      |
I   | -         | -          | Fixed     |
II  | Fixed     | -          | Fixed     |
III | Fixed     | Fixed      | Generated |
IV  | Fixed     | Generated  | Generated |
V   | Generated | Generated  | Generated |
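The five scenarios can also be read as a lookup on how each of the three fields is filled in. The Python sketch below classifies a data setup accordingly. The functions source_kind and scenario are hypothetical and not part of Benchpress, and we assume for illustration that any non-null value that is not a module ID defined in resources is a file name.

```python
def source_kind(value, module_ids):
    """Classify one field as 'none', 'generated' (module ID) or 'fixed' (file name).

    Assumption for this sketch: any non-null value that is not a known
    module ID is treated as the name of a user-provided file.
    """
    if value is None:
        return "none"
    return "generated" if value in module_ids else "fixed"


# (graph, parameters, data) -> scenario, following the table above.
SCENARIOS = {
    ("none", "none", "fixed"): "I",
    ("fixed", "none", "fixed"): "II",
    ("fixed", "fixed", "generated"): "III",
    ("fixed", "generated", "generated"): "IV",
    ("generated", "generated", "generated"): "V",
}


def scenario(item, graph_ids, parameters_ids, data_ids):
    """Return the scenario label of a data setup, or None if it matches no row."""
    key = (
        source_kind(item["graph_id"], graph_ids),
        source_kind(item["parameters_id"], parameters_ids),
        source_kind(item["data_id"], data_ids),
    )
    return SCENARIOS.get(key)


# Module IDs as they would be defined in the resources section (placeholders).
graph_ids = {"my_graph_id"}
parameters_ids = {"my_params_id"}
data_ids = {"my_data_id"}

fixed_graph_setup = {
    "graph_id": "my_graph_file.csv",
    "parameters_id": "my_params_id",
    "data_id": "my_data_id",
    "seed_range": [1, 3],
}
print(scenario(fixed_graph_setup, graph_ids, parameters_ids, data_ids))
```

Combinations that do not appear in the table, such as a generated graph together with fixed parameters, return None in this sketch.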
The following subsections show template data examples for the benchmark_setup section, corresponding to scenarios I-V.
I) Data analysis (fixed data)
In the example below, my_data_file.csv should be a file in resources/data/mydatasets.
{
"graph_id": null,
"parameters_id": null,
"data_id": "my_data_file.csv",
"seed_range": null
}
In the example below, my_data_folder should be a subfolder of resources/data/mydatasets containing data files.
{
"graph_id": null,
"parameters_id": null,
"data_id": "my_data_folder",
"seed_range": null
}
See I) Data analysis (fixed data) for an example of this scenario.
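The two variants above differ only in whether data_id names a single file or a folder of files. A minimal Python sketch of that distinction is given below. The helper dataset_files is hypothetical and only illustrates the directory layout described above, not how Benchpress itself resolves the files. A temporary directory stands in for resources/data/mydatasets so that the example is self-contained.

```python
import tempfile
from pathlib import Path


def dataset_files(data_id, base_dir="resources/data/mydatasets"):
    """Return the data files named by data_id (a single file or a folder of files).

    Hypothetical helper for illustration; not part of Benchpress.
    """
    path = Path(base_dir) / data_id
    if path.is_file():
        return [path]
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.is_file())
    raise FileNotFoundError(f"{path} is neither a file nor a folder")


# Build a throwaway directory that mimics resources/data/mydatasets.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "my_data_file.csv").write_text("a,b\n1,2\n")
    folder = base / "my_data_folder"
    folder.mkdir()
    (folder / "first.csv").write_text("a,b\n1,2\n")
    (folder / "second.csv").write_text("a,b\n3,4\n")

    single = [p.name for p in dataset_files("my_data_file.csv", base)]
    several = [p.name for p in dataset_files("my_data_folder", base)]

print(single)
print(several)
```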
II) Data analysis with validation
{
"graph_id": "my_graph_file.csv",
"parameters_id": null,
"data_id": "my_data_file.csv",
"seed_range": null
}
See II) Data analysis with validation for an example of this scenario.
III) Fixed graph and parameters
{
"graph_id": "my_graph_file.csv",
"parameters_id": "my_params_file.rds",
"data_id": "my_data_id",
"seed_range": [1, 10]
}
IV) Fixed graph
{
"graph_id": "my_graph_file.csv",
"parameters_id": "my_params_id",
"data_id": "my_data_id",
"seed_range": [1, 3]
}
See IV) Fixed graph for examples of this scenario.
V) Fully generated
{
"graph_id": "my_graph_id",
"parameters_id": "my_params_id",
"data_id": "my_data_id",
"seed_range": [1, 10]
}
See V) Fully generated for examples of this scenario.