File formats

Graph

A graph \(G=(V, E)\) is represented by its adjacency matrix \(M\), where \(M_{ij}=1\) if \((i,j)\in E\) and \(M_{ij}=0\) if \((i,j)\notin E\).

Example (undirected graph)

Below is an example of an undirected graph, where the elements of \(E = \{(a,b), (a,c), (c,d)\}\) are interpreted as unordered pairs (undirected edges). The adjacency matrix of an undirected graph is symmetric.

a,b,c,d
0,1,1,0
1,0,0,0
1,0,0,1
0,0,1,0

Example (DAG)

If the graph is directed, the adjacency matrix is generally not symmetric, as in the example below, where \(M_{ij}=1\) denotes an edge from \(i\) to \(j\).

a,b,c,d
0,1,1,0
0,0,0,0
0,0,0,1
0,0,0,0
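
To illustrate the format, here is a minimal Python sketch that reads an adjacency-matrix CSV, lists its edges, and checks for symmetry. The function names (read_adjacency, edges, is_symmetric) are hypothetical helpers written for this example and are not part of any tool described here.

```python
import csv
import io

UNDIRECTED_CSV = """a,b,c,d
0,1,1,0
1,0,0,0
1,0,0,1
0,0,1,0
"""

DAG_CSV = """a,b,c,d
0,1,1,0
0,0,0,0
0,0,0,1
0,0,0,0
"""


def read_adjacency(text):
    """Parse a header row of labels followed by one 0/1 row per node."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    labels = rows[0]
    matrix = [[int(x) for x in row] for row in rows[1:]]
    if len(matrix) != len(labels) or any(len(r) != len(labels) for r in matrix):
        raise ValueError("adjacency matrix must be square and match the header")
    return labels, matrix


def edges(labels, matrix):
    """Return a (source, target) pair for every nonzero entry M[i][j]."""
    return [(labels[i], labels[j])
            for i, row in enumerate(matrix)
            for j, value in enumerate(row) if value]


def is_symmetric(matrix):
    """True if M[i][j] == M[j][i] for all i, j."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


dag_labels, dag_matrix = read_adjacency(DAG_CSV)
und_labels, und_matrix = read_adjacency(UNDIRECTED_CSV)
dag_edges = edges(dag_labels, dag_matrix)
```

For the DAG, each nonzero entry yields one directed edge. For the undirected graph, every edge appears twice in the matrix, once in each direction.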

MCMC trajectory

When the output of an algorithm is a Markov chain of graphs, the chain is stored in a compact form: only accepted moves are recorded, each with its time index, the edges added and removed, and the score of the resulting graph (not the score difference). The first two rows encode the labels of the variables, which should be read from the data matrix. Specifically, the first row (index -2) lists, in the added column, an edge from the first variable to each of the others, where a dash (-) denotes an undirected edge and a right arrow (->) a directed edge; its score is set to 0 and removed is set to []. The second row (index -1) lists the same edges in the removed column, with score set to 0 and added set to []. The third row (index 0) lists all the edges of the starting graph in the added column, along with the graph's score in the score column and [] in the removed column.

Below is an example of a trajectory of undirected graphs \(G_0, G_1, \dots, G_{89}\), where \(E_i = \{(b, c), (a, d)\}\) for \(i = 0, \dots, 33\), \(E_i = \{(a, d)\}\) for \(i = 34, \dots, 88\), and \(E_i = \{(c, d), (a, d)\}\) for \(i = 89\).

index,score,added,removed
-2,0.0,[a-b;a-c;a-d],[]
-1,0.0,[],[a-b;a-c;a-d]
0,-2325.52,[b-c;a-d],[]
34,-2311.94,[],[b-c]
89,-2310.81,[c-d],[]
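
The compact form can be expanded by replaying the recorded moves in order. The following Python sketch reconstructs the edge set and score of the graph at any time index. It assumes that rows are sorted by index and that, within a row, removals are applied before additions. The helper names (parse_edge_list, read_trajectory, graph_at) are hypothetical.

```python
import csv
import io

TRAJECTORY_CSV = """index,score,added,removed
-2,0.0,[a-b;a-c;a-d],[]
-1,0.0,[],[a-b;a-c;a-d]
0,-2325.52,[b-c;a-d],[]
34,-2311.94,[],[b-c]
89,-2310.81,[c-d],[]
"""


def parse_edge_list(field):
    """Turn '[a-b;c-d]' into {'a-b', 'c-d'}; '[]' becomes the empty set."""
    inner = field.strip()[1:-1]
    return set(inner.split(";")) if inner else set()


def read_trajectory(text):
    """Return a list of (index, score, added, removed) tuples."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text.strip())):
        rows.append((int(rec["index"]), float(rec["score"]),
                     parse_edge_list(rec["added"]),
                     parse_edge_list(rec["removed"])))
    return rows


def graph_at(rows, t):
    """Replay all rows with index <= t and return (edge set, score)."""
    current, score = set(), None
    for index, row_score, added, removed in rows:
        if index > t:
            break  # assumes rows are sorted by index
        current -= removed
        current |= added
        if index >= 0:  # rows -2 and -1 only carry the variable labels
            score = row_score
    return current, score


rows = read_trajectory(TRAJECTORY_CSV)
edges_50, score_50 = graph_at(rows, 50)
```

Because the label rows (-2 and -1) add and then remove the same edges, they cancel out and leave the starting graph at index 0 unaffected.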

Dataset

Observational data

Observations should be stored as row vectors in a matrix, where the columns are separated by commas. The first row should contain the labels of the variables and if the data is categorical, the second row should contain the cardinality (number of levels) of each variable.

Example (continuous)

Below is an example showing two samples from a continuous distribution.

a,b,c,d
0.2,2.3,5.3,0.5
3.2,1.5,2.5,1.2

Example (categorical)

Below is an example with two samples from a categorical distribution, where the cardinalities are 2, 3, 2, and 2.

a,b,c,d
2,3,2,2
1,2,0,1
0,1,1,1
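
As an illustration, the Python sketch below reads a categorical dataset and checks each value against the cardinality of its variable. It assumes that the levels of a variable with cardinality \(k\) are coded as the integers \(0, \dots, k-1\), which is consistent with the example above but not stated explicitly. The function name read_categorical is hypothetical.

```python
import csv
import io

CATEGORICAL_CSV = """a,b,c,d
2,3,2,2
1,2,0,1
0,1,1,1
"""


def read_categorical(text):
    """Return (labels, cardinalities, samples) from a categorical dataset."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    labels = rows[0]                                   # first row: variable labels
    cardinalities = [int(x) for x in rows[1]]          # second row: number of levels
    samples = [[int(x) for x in row] for row in rows[2:]]
    for n, sample in enumerate(samples):
        for label, card, value in zip(labels, cardinalities, sample):
            # Assumption: levels are coded 0, ..., card - 1.
            if not 0 <= value < card:
                raise ValueError(
                    f"sample {n}: value {value} of {label} outside 0..{card - 1}")
    return labels, cardinalities, samples


labels, cardinalities, samples = read_categorical(CATEGORICAL_CSV)
```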

Interventional data

Hard interventions are indicated by additional indicator columns, one for each intervened variable, appended to the right of the data matrix. In each row, a 1 in such a column indicates that the variable was intervened on in that sample, and a 0 that it was not. The example below mixes observational and interventional samples from a continuous distribution.

Example (continuous)

Suppose that, in addition to the two observations in the continuous example above, there are two samples where only \(a\) was intervened on and one where both \(a\) and \(d\) were intervened on. The data could then look as below.

a,b,c,d,a,d
0.2,2.3,5.3,0.5,0,0
3.2,1.5,2.5,1.2,0,0
2,0.1,1.5,3.2,1,0
2,1.2,2.2,4.2,1,0
1,1.5,1.4,2.2,1,1
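
The Python sketch below shows one way such a file could be split into observational and interventional samples. It assumes that variable labels are unique, so that the first repeated label in the header marks the start of the indicator columns. The helper names (split_header, read_interventional) are hypothetical. The embedded data takes the two observational rows from the continuous example above and the three interventional rows from this example.

```python
import csv
import io

INTERVENTIONAL_CSV = """a,b,c,d,a,d
0.2,2.3,5.3,0.5,0,0
3.2,1.5,2.5,1.2,0,0
2,0.1,1.5,3.2,1,0
2,1.2,2.2,4.2,1,0
1,1.5,1.4,2.2,1,1
"""


def split_header(header):
    """Split the header into variable labels and indicator labels.

    Assumption: variable labels are unique, so the first repeated label
    marks the start of the intervention-indicator columns.
    """
    seen = set()
    for pos, name in enumerate(header):
        if name in seen:
            return header[:pos], header[pos:]
        seen.add(name)
    return header, []


def read_interventional(text):
    """Return (variables, samples); each sample is (values, intervened variables)."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    variables, indicators = split_header(rows[0])
    n = len(variables)
    samples = []
    for row in rows[1:]:
        values = dict(zip(variables, (float(x) for x in row[:n])))
        targets = [name for name, flag in zip(indicators, row[n:])
                   if int(flag) == 1]
        samples.append((values, targets))
    return variables, samples


variables, samples = read_interventional(INTERVENTIONAL_CSV)
observational = [values for values, targets in samples if not targets]
interventional = [(values, targets) for values, targets in samples if targets]
```

The csv module is used instead of a data-frame reader because the header contains duplicate labels, which some readers rename automatically.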

Parameters

Bnlearn bn.fit objects should be stored in RDS format in the directory resources/myparams/bn.fit_networks.
Weight matrices for SEM models should be stored in CSV format in resources/myparams/sem_params.

Note

The column labels and their order in the dataset CSV files should directly correspond to the variable names in the graph and parameters CSV files (if such files exist).
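
A simple consistency check along these lines could compare the header rows of the two files. The sketch below is only an illustration with hypothetical helper names (header_of, labels_match). It applies to observational datasets, since interventional datasets carry extra indicator columns in the header.

```python
import csv
import io


def header_of(csv_text):
    """Return the first CSV row (the variable labels) as a list."""
    return next(csv.reader(io.StringIO(csv_text)))


def labels_match(dataset_csv, graph_csv):
    """True if both files list the same labels in the same order."""
    return header_of(dataset_csv) == header_of(graph_csv)


DATA_CSV = "a,b,c,d\n0.2,2.3,5.3,0.5\n3.2,1.5,2.5,1.2\n"
GRAPH_CSV = "a,b,c,d\n0,1,1,0\n0,0,0,0\n0,0,0,1\n0,0,0,0\n"
# Same labels in a different order: this should be rejected.
REORDERED_GRAPH_CSV = "b,a,c,d\n0,0,0,0\n1,0,1,0\n0,0,0,1\n0,0,0,0\n"

ok = labels_match(DATA_CSV, GRAPH_CSV)
reordered_ok = labels_match(DATA_CSV, REORDERED_GRAPH_CSV)
```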