JSON config

The JSON configuration file, together with Snakemake’s command line tool, serves as the user interface. Below we describe the main structure of a config file. For reference, Listing 5 shows (with additional comments) the content of config/paper_pc_vs_dualpc.json, a comparison study between PC (pcalg) and Dual PC (dualPC). The results of this study can be found in PC vs. dual PC.

At the highest level there are two main sections, benchmark_setup (Line 2) and resources (Line 40). The modules used in benchmark_setup are specified in resources and referenced by their corresponding IDs.
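As a rough illustration of this ID-based referencing, the sketch below collects every id defined under resources and reports the IDs used in benchmark_setup that have no definition. The helper names defined_ids and unresolved_ids are hypothetical and not part of Benchpress, the config is a trimmed-down version of Listing 5, and only the data items and the benchmarks ids are checked.

```python
def defined_ids(resources):
    """Collect every "id" defined in the resources section."""
    ids = set()
    for section in resources.values():        # e.g. "graph", "data", ...
        for objects in section.values():      # e.g. "pcalg_randdag": [...]
            ids.update(obj["id"] for obj in objects)
    return ids


def unresolved_ids(config):
    """Return IDs used in benchmark_setup that resources does not define."""
    known = defined_ids(config["resources"])
    used = []
    for setup in config["benchmark_setup"]:
        for item in setup["data"]:
            used += [item["graph_id"], item["parameters_id"], item["data_id"]]
        used += setup["evaluation"]["benchmarks"]["ids"]
    # None means "not used"; it is not a dangling reference.
    return [i for i in used if i is not None and i not in known]


# Trimmed-down version of Listing 5 (parameter values omitted).
config = {
    "benchmark_setup": [{
        "title": "pc_vs_dualpc",
        "data": [{"graph_id": "avneigs4_p80", "parameters_id": "SEM",
                  "data_id": "standardized", "seed_range": [1, 10]}],
        "evaluation": {"benchmarks": {"ids": ["pc-gaussCItest", "dualpc"]}},
    }],
    "resources": {
        "data": {"iid": [{"id": "standardized"}]},
        "graph": {"pcalg_randdag": [{"id": "avneigs4_p80"}]},
        "parameters": {"sem_params": [{"id": "SEM"}]},
        "structure_learning_algorithms": {
            "dualpc": [{"id": "dualpc"}],
            "pcalg_pc": [{"id": "pc-gaussCItest"}],
        },
    },
}

print(unresolved_ids(config))  # prints []
```

Note that a file name used with the fixed modules would be reported by this check, since such files are not defined in resources.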

resources

The resources section contains the subsections graph (Line 50), parameters (Line 64), data (Line 41), and structure_learning_algorithms (Line 73), which contain the modules used in the study. Each module in turn has a list of JSON objects, where each object defines a specific parameter setting. The objects are identified by unique IDs (see Lines 44, 53, 67, 76, and 86). The parametrisations for the modules can be either single values (see e.g. Line 78) or lists (see e.g. Line 77). In the case of lists, the module runs once for each of the values in the list.
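To make the list semantics concrete, here is a minimal sketch of how one parameter object could be expanded into single-valued settings. The function expand_settings is hypothetical and does not reflect how Benchpress implements this internally. In particular, taking the Cartesian product when several fields are lists is an assumption; the example only has one list-valued field, alpha.

```python
from itertools import product


def expand_settings(obj):
    """Expand list-valued fields of one parameter object into single settings."""
    keys = [k for k in obj if k != "id"]
    # Wrap single values so every field can be treated as a list of options.
    options = [obj[k] if isinstance(obj[k], list) else [obj[k]] for k in keys]
    # Assumption: several list-valued fields are combined by Cartesian product.
    return [dict(zip(keys, combo), id=obj["id"]) for combo in product(*options)]


# The dualpc object from Listing 5 (null/false written as None/False).
dualpc = {
    "id": "dualpc",
    "alpha": [0.001, 0.05, 0.1],
    "skeleton": False,
    "pattern_graph": False,
    "max_ord": None,
    "timeout": None,
}

for setting in expand_settings(dualpc):
    print(setting["id"], setting["alpha"])
```

With the object above, the loop visits three settings, one per value of alpha, while all other fields stay the same.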

benchmark_setup

The benchmark_setup section contains a list of data setups with the data models (data, Line 5) and evaluation methods (evaluation, Line 13) a user wishes to consider for analysis. In this example the list contains only one data setup, named pc_vs_dualpc (Line 4).

  • The data section should contain a list, where each item defines a certain data setup. For each seed number \(i\) in the range specified by seed_range (Line 10), a triple (\(G_i, \Theta_i, \mathbf Y_i\)) is generated, where \(G_i\) is obtained as specified by graph_id (Line 7). Conditional on \(G_i\), the model parameters \(\Theta_i\) are obtained according to parameters_id (Line 8). The data matrix \(\mathbf Y_i = (Y^j)_{j=1}^n\) is sampled conditional on \((G_i,\Theta_i)\) as specified by data_id (Line 9).

  • The evaluation section contains the evaluation methods used for the analysis. Descriptions of the available evaluation methods can be found in Evaluation.
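The hierarchical sampling described in the data item above can be sketched with a toy example. Everything here is an illustrative stand-in and not the actual pcalg_randdag, sem_params, or iid modules: the function sample_triple, its arguments, and the linear Gaussian model are assumptions, as is the reading of seed_range as including both endpoints.

```python
import random


def sample_triple(seed, p=4, n=5, edge_prob=0.5, w_min=0.25, w_max=1.0):
    """Toy version of (G_i, Theta_i, Y_i) for one seed i."""
    rng = random.Random(seed)
    # G: upper-triangular adjacency matrix, hence acyclic by construction.
    G = [[1 if j > i and rng.random() < edge_prob else 0 for j in range(p)]
         for i in range(p)]
    # Theta | G: one edge weight per edge of G, zero elsewhere.
    Theta = [[rng.uniform(w_min, w_max) * G[i][j] for j in range(p)]
             for i in range(p)]
    # Y | (G, Theta): n samples from a linear SEM with standard normal noise,
    # filled in topological order (parents always have smaller index).
    Y = []
    for _ in range(n):
        y = [0.0] * p
        for j in range(p):
            y[j] = sum(Theta[i][j] * y[i] for i in range(j)) + rng.gauss(0, 1)
        Y.append(y)
    return G, Theta, Y


seed_range = [1, 10]
lo, hi = seed_range
# Assumption: both endpoints of seed_range are included.
triples = {i: sample_triple(i) for i in range(lo, hi + 1)}
print(len(triples))  # prints 10
```

Because each triple is determined by its seed, rerunning the loop reproduces the same graphs, parameters, and data.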

Listing 5 Comparison between PC and dual PC.
  1{
  2    "benchmark_setup": [ // the benchmark_setup (only one in this study)
  3        {
  4            "title": "pc_vs_dualpc",
  5            "data": [ // the data setups
  6                {
  7                    "graph_id": "avneigs4_p80", // see line 53
  8                    "parameters_id": "SEM", // see line 67
  9                    "data_id": "standardized", // see line 44
 10                    "seed_range": [1, 10]
 11                }
 12            ],
 13            "evaluation": { // the evaluation modules
 14                "benchmarks": {
 15                    "filename_prefix": "paper_pc_vs_dualpc/",
 16                    "show_seed": true,
 17                    "errorbar": true,
 18                    "errorbarh": false,
 19                    "scatter": true,
 20                    "path": true,
 21                    "text": false,
 22                    "ids": [
 23                        "pc-gaussCItest", // see line 86
 24                        "dualpc" // see line 76
 25                    ]
 26                },
 27                "graph_true_plots": true,
 28                "graph_true_stats": true,
 29                "ggally_ggpairs": false,
 30                "graph_plots": [
 31                    "pc-gaussCItest",
 32                    "dualpc"
 33                ],
 34                "mcmc_traj_plots": [],
 35                "mcmc_heatmaps": [],
 36                "mcmc_autocorr_plots": []
 37            }
 38        }
 39    ],
 40    "resources": {
 41        "data": { // the data modules
 42            "iid": [
 43                {
 44                    "id": "standardized",
 45                    "standardized": true,
 46                    "n": 300
 47                }
 48            ]
 49        },
 50        "graph": { // the graph modules
 51            "pcalg_randdag": [
 52                {
 53                    "id": "avneigs4_p80",
 54                    "max_parents": 5,
 55                    "n": 80,
 56                    "d": 4,
 57                    "par1": null,
 58                    "par2": null,
 59                    "method": "er",
 60                    "DAG": true
 61                }
 62            ]
 63        },
 64        "parameters": { // the parameters modules
 65            "sem_params": [
 66                {
 67                    "id": "SEM",
 68                    "min": 0.25,
 69                    "max": 1
 70                }
 71            ]
 72        },
 73        "structure_learning_algorithms": { // the structure learning modules
 74            "dualpc": [
 75                {
 76                    "id": "dualpc",
 77                    "alpha": [0.001, 0.05, 0.1],
 78                    "skeleton": false,
 79                    "pattern_graph": false,
 80                    "max_ord": null,
 81                    "timeout": null
 82                }
 83            ],
 84            "pcalg_pc": [
 85                {
 86                    "id": "pc-gaussCItest",
 87                    "alpha": [0.001, 0.05, 0.1],
 88                    "NAdelete": true,
 89                    "mmax": "Inf",
 90                    "u2pd": "relaxed",
 91                    "skelmethod": "stable",
 92                    "conservative": false,
 93                    "majrule": false,
 94                    "solveconfl": false,
 95                    "numCores": 1,
 96                    "verbose": false,
 97                    "edgeConstrains": null,
 98                    "indepTest": "gaussCItest",
 99                    "timeout": null
100                }
101            ]
102        }
103    }
104}

Data scenarios

Apart from the modules used in Listing 5, Benchpress also provides the special modules fixed_graph, fixed_params, and fixed_data, which allow the user to provide their own files for the analysis. These modules are not part of the resources section of the JSON file and are not referenced by IDs; instead, the files are simply referenced by their names. The file formats are described in File formats.

The different sources of data, obtained by combining the fixed files and the ordinary modules, can be summarised in the five scenarios shown in the table below. I) Data analysis (fixed data) is the typical scenario for data analysts, where the user provides one or more datasets by hand. II) Data analysis with validation is similar to scenario I, with the difference that the user also provides the true graph underlying the data. This situation arises e.g. when replicating a simulation study from the literature, where both the true graph and the dataset are given. Scenarios III) Fixed graph and parameters, IV) Fixed graph, and V) Fully generated are pure benchmarking scenarios, where either the graphs, parameters, and data are all generated (scenario V) or the graphs and possibly the parameters are specified by the user (scenarios III and IV).

Scenario   Graph       Parameters   Data
I          -           -            Fixed
II         Fixed       -            Fixed
III        Fixed       Fixed        Generated
IV         Fixed       Generated    Generated
V          Generated   Generated    Generated
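The table above can also be read as a lookup from a data item to its scenario. The sketch below is only illustrative: the function classify is hypothetical, and it assumes that a value matching an ID defined in resources denotes a module (generated), while any other non-null value denotes a user-provided file or folder (fixed).

```python
SCENARIOS = {
    (None, None, "fixed"): "I",
    ("fixed", None, "fixed"): "II",
    ("fixed", "fixed", "generated"): "III",
    ("fixed", "generated", "generated"): "IV",
    ("generated", "generated", "generated"): "V",
}


def classify(item, resource_ids):
    """Map one data item of benchmark_setup to a scenario label, or None."""
    def kind(value):
        if value is None:
            return None
        # Assumption: known IDs are modules, anything else is a file name.
        return "generated" if value in resource_ids else "fixed"

    key = (kind(item["graph_id"]), kind(item["parameters_id"]),
           kind(item["data_id"]))
    return SCENARIOS.get(key)


resource_ids = {"my_graph_id", "my_params_id", "my_data_id"}

item = {
    "graph_id": "my_graph_file.csv",
    "parameters_id": "my_params_id",
    "data_id": "my_data_id",
    "seed_range": [1, 3],
}
print(classify(item, resource_ids))  # prints IV
```

Combinations that do not appear in the table, such as a generated graph with fixed data, return None.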

The following subsections show template examples of data items in the benchmark_setup section, one for each of the scenarios I-V.

I) Data analysis (fixed data)

In the example below, my_data_file.csv should be a file in resources/data/mydatasets.

{
    "graph_id": null,
    "parameters_id": null,
    "data_id": "my_data_file.csv",
    "seed_range": null
}

In the example below, my_data_folder should be a subfolder of resources/data/mydatasets containing data files.

{
    "graph_id": null,
    "parameters_id": null,
    "data_id": "my_data_folder",
    "seed_range": null
}

See I) Data analysis (fixed data) for an example of this scenario.

II) Data analysis with validation

{
    "graph_id": "my_graph_file.csv",
    "parameters_id": null,
    "data_id": "my_data_file.csv",
    "seed_range": null
}

See II) Data analysis with validation for an example of this scenario.

III) Fixed graph and parameters

{
    "graph_id": "my_graph_file.csv",
    "parameters_id": "my_params_file.rds",
    "data_id": "my_data_id",
    "seed_range": [1, 10]
}

IV) Fixed graph

{
    "graph_id": "my_graph_file.csv",
    "parameters_id": "my_params_id",
    "data_id": "my_data_id",
    "seed_range": [1, 3]
}

See IV) Fixed graph for examples of this scenario.

V) Fully generated

{
    "graph_id": "my_graph_id",
    "parameters_id": "my_params_id",
    "data_id": "my_data_id",
    "seed_range": [1, 10]
}

See V) Fully generated for examples of this scenario.