Supplementary material for our paper
"Improvements in the reconstruction of time-varying
gene regulatory networks: dynamic programming and
regularization by information sharing among genes"
(submitted to Bioinformatics, 2010)
Authors: Marco Grzegorczyk and Dirk Husmeier
This ReadMe.txt file describes how our Matlab implementaion
of the regularized cpBGe model can be used.
The cpBGe model is a 'class 2' model, which was proposed in
Grzegorczyk and Husmeier (NIPS 22, 2009, 682-690).
Details are given in our supplementary paper.
The implementation here is the one proposed in our
Bioinformatics main paper with 2 methodological improvements
based on dynamic programming and regularization by
information sharing among genes.
This ReadMe.txt file consists of two parts:
Part 1 deals with the input and output arguments.
Part 2 gives a worked toy example.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% PART 1 - OUR IMPLEMENTATION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Note that our Matlab implementation of the regularized
cpBGe model requires the 'Statistics Toolbox'
for Matlab. There is only one function that has the
following synthax:
[Run] = PROC_cpBGe_regularized(data, steps, p, k);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Input arguments:
1) data is the data matrix where each row corresponds to
a network node (gene) and each column corresponds to a time point.
E.g.: data(i,t) is the realisation of node X_i at time point t
2) 'steps' is the number of Gibbs sampling steps to be performed
3) 'p' and 'k' are the hyperparameters of the negative binomial
distribution underlying the point process prior for changepoints
(01) step?
We have to look at:
LOG_SCORE = Run.log_score(i-1); % see 1)
DAG = Run.dag{i-1}; % see 2)
NODE_VEC = Run.Node_Vec{i-1}; % see 3)
MATRIX = Run.matrix{i-1}; % see 4)
1)
LOG_SCORE
is the log-score (i.e. log(likelihood*prior)) of the
state (DAG,NODE_VEC,MATRIX) where
2)
DAG is a N-by-N matrix where N is the number of nodes
DAG(k,j)=1 means that there was an edge from X_k to X_j
DAG(k,j)=0 means that there was no edge from X_k to X_J
in the i-th sampled graph
3)
NODE_VEC is a vector of length N where N is the number of nodes.
NODE_VEC(j) is the cluster to which the j-th node has been
clustered
MATRIX is the N-by-(m-1) allocation vector matrix where
N is the number of nodes and
m is the number of observations (time points).
Please note that in a dynamic Bayesian network (with a time
lag of 1) the first time point t=1 cannot be scored, since
there are no realisations of the potential parent nodes at
t=0. That is why MATRIX has (m-1) columns only. The first column
corresponds to the second time point t=2 and the last column
corresponds to the last time point t=m.
4)
The i-th row of MATRIX gives the allocation vector
of the i-th cluster.
E.g.:
MATRIX(i,t)=k means that the (t+1)-th time point has been allocated
to the k-th mixture component (segment) for the i-th cluster.
To find out which nodes belong to the k-th cluster it can be used:
nodes_in_cluster_k = find(NODE_VEC==k);
Please note that 'nodes_in_cluster_k' may be an empty set, if there is no
node in cluster k. This happens if the number of clusters c is lower than the number of nodes,
symbolically c X2
% X1 -> X3
% and that we have sampled (X1,X2,X3) at
% m=20 time points.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% To make the results re-producible:
rand('state',13234);
% First, we generate some 'class 1' data:
% We assume that there is one single changepoint at t=11
% that is common to all three nodes X1, X2, and X3
% ('class 1 data')
data_1 = zeros(3,20);
% Data for X1 without parents
data_1(1,:) = [randn(1,10)-2,randn(1,10)+2];
data_1(2,1) = randn(1);
data_1(3,1) = randn(1);
for t=2:10
data_1(2,t) = (-1)*data_1(1,t-1) + 0.3*randn(1);
data_1(3,t) = (-2)*data_1(1,t-1) + 0.3*randn(1);
end
% cp at t=10 (simply switch the signs of the coefficients)
for t=11:20
data_1(2,t) = (+1)*data_1(1,t-1) + 0.3*randn(1);
data_1(3,t) = (+2)*data_1(1,t-1) + 0.3*randn(1);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% We also generate some 'class 2 data':
% We assume that
% (i) there is a changepoint at t=11 that is common to X1 and X2, and
% (ii)there are two changepoints for X3 at t=6 and t=15
% ('class 2 data')
data_2 = zeros(3,20);
% Data for X1 without parents
data_2(1,:) = [randn(1,10)+1,randn(1,10)-1];
data_2(2,1) = randn(1);
data_2(3,1) = randn(1);
% Data for X2 without parents
for t=2:10
data_2(2,t) = (-1)*data_2(1,t-1) + 0.3*randn(1);
end
% For X2 there is a cp after t=10
for t=11:20
data_2(2,t) = (+1)*data_2(1,t-1) + 0.3*randn(1);
end
for t=2:10
data_2(2,t) = (-1)*data_2(1,t-1) + 0.3*randn(1);
end
% Data for X3
% there are two changepoints at t=6 and t=15
for t=2:6
data_2(3,t) = (+2)*data_2(1,t-1) + 0.3*randn(1);
end
for t=7:15
data_2(3,t) = (-2)*data_2(1,t-1) + 0.3*randn(1);
end
for t=16:20
data_2(3,t) = (+2)*data_2(1,t-1) + 0.3*randn(1);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Set the parameters of the negative binomial...
p = 0.0001;
k = 2;
% ...and let's perform 10 Gibbs sampling steps:
steps = 10;
% Run the Gibbs sampler for data_1 and data_2:
[Run_1] = PROC_cpBGe_regularized(data_1, steps, p, k);
[Run_2] = PROC_cpBGe_regularized(data_2, steps, p, k);
% During the MCMC simulation the step index i can be seen
% on the screen (i=1 to i=steps+1)
% The MCMC simulations for data_1 and data_2 should have
% finished in some minutes.
% We look at the marginal edge posterior probabilities:
[N,m] = size(data_1);
DAG_1 = zeros(N,N);
DAG_2 = zeros(N,N);
for i=2:(steps+1)
DAG_1 = DAG_1 + Run_1.dag{i};
DAG_2 = DAG_2 + Run_2.dag{i};
end
DAG_1 = DAG_1/steps;
DAG_2 = DAG_2/steps;
DAG_1
DAG_2
% It can be seen that the correct edges X1->X2 and X1->X3
% have obtained the greatest posterior probabilities.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% There are three plot types that may be helpful:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% First, we monitor the log scores
figure(1)
clf
subplot(2,1,1)
plot(Run_1.log_score)
subplot(2,1,2)
plot(Run_2.log_score)
% For both data sets a plateau is reached. There is no indication
% AGAINST convergence.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Second, we look at a heatmap of the connectivity structures
% between nodes. That is, we want to visualise the clusters of
% nodes:
C_1 = zeros(N,N);
C_2 = zeros(N,N);
for i=2:(steps+1)
VEC_1 = Run_1.Node_Vec{i};
VEC_2 = Run_2.Node_Vec{i};
for j=1:3
for k=1:3
c1_j = VEC_1(j);
c1_k = VEC_1(k);
c2_j = VEC_2(j);
c2_k = VEC_2(k);
if(c1_j==c1_k)
C_1(j,k) = C_1(j,k)+1;
end
if(c2_j==c2_k)
C_2(j,k) = C_2(j,k)+1;
end
end
end
end
C_1 = C_1/steps;
C_2 = C_2/steps;
figure(2)
clf
subplot(1,2,1)
imagesc(C_1,[0,1]);
colormap(gray)
subplot(1,2,2)
imagesc(C_2,[0,1]);
colormap(gray)
% In the second figure there are two panels with heatmaps
% for data_1 (left panel) and data_2 (right panel).
% A grey shading is used to indicate
% the connectivity probabilities between nodes
% (black=0 and C=1 white)
% From the left panel (for data_1) it can be seen that the three
% nodes have been clustered to the same cluster. The heatmap is
% white indicating that there are high connectivities between the
% three nodes (i.e the number of clusters is c=1).
% From the right panel (for data_2) it can be seen that the third
% node is seperated from the first two nodes X1 and X2. That is,
% the nodes X1 and X2 build the first cluster, and X3 is seperated
% from them in a second cluster. It follows that the number of
% clusters is c=2.
% We note that these findings are consistent with the data #
% generation mechanisms described above.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Finally, we want to find out which changepoints have been inferred for the nodes
% in the clusters
[N,m] = size(data_1);
m = m-1; % the first time point t=1 is not allocated
C1_1 = zeros(m,m);
C1_2 = zeros(m,m);
C1_3 = zeros(m,m);
C2_1 = zeros(m,m);
C2_2 = zeros(m,m);
C2_3 = zeros(m,m);
for i=2:(steps+1)
mat_1 = Run_1.matrix{i};
vec_1 = Run_1.Node_Vec{i};
mat_2 = Run_2.matrix{i};
vec_2 = Run_2.Node_Vec{i};
index_1_1 = vec_1(1);
index_1_2 = vec_1(2);
index_1_3 = vec_1(3);
index_2_1 = vec_2(1);
index_2_2 = vec_2(2);
index_2_3 = vec_2(3);
VEC_1_1 = mat_1(index_1_1,:);
VEC_1_2 = mat_1(index_1_2,:);
VEC_1_3 = mat_1(index_1_3,:);
VEC_2_1 = mat_2(index_2_1,:);
VEC_2_2 = mat_2(index_2_2,:);
VEC_2_3 = mat_2(index_2_3,:);
for x=1:m
for y=1:m
if(VEC_1_1(x)==VEC_1_1(y))
C1_1(x,y) = C1_1(x,y)+1;
end
if(VEC_1_2(x)==VEC_1_2(y))
C1_2(x,y) = C1_2(x,y)+1;
end
if(VEC_1_3(x)==VEC_1_3(y))
C1_3(x,y) = C1_3(x,y)+1;
end
if(VEC_2_1(x)==VEC_2_1(y))
C2_1(x,y) = C2_1(x,y)+1;
end
if(VEC_2_2(x)==VEC_2_2(y))
C2_2(x,y) = C2_2(x,y)+1;
end
if(VEC_2_3(x)==VEC_2_3(y))
C2_3(x,y) = C2_3(x,y)+1;
end
end
end
end
C1_1 = C1_1/steps;
C1_2 = C1_2/steps;
C1_3 = C1_3/steps;
C2_1 = C2_1/steps;
C2_2 = C2_2/steps;
C2_3 = C2_3/steps;
figure(3)
clf
subplot(3,2,1)
imagesc(C1_1,[0,1]);
colormap(gray)
subplot(3,2,2)
imagesc(C2_1,[0,1]);
colormap(gray)
subplot(3,2,3)
imagesc(C1_2,[0,1]);
colormap(gray)
subplot(3,2,4)
imagesc(C2_2,[0,1]);
colormap(gray)
subplot(3,2,5)
imagesc(C1_3,[0,1]);
colormap(gray)
subplot(3,2,6)
imagesc(C2_3,[0,1]);
colormap(gray)
% The plot is arranged as a 3-by-2 matrix.
% The first column corresponds to data_1, and
% the second column corresponds to data_2.
% The rows correspond to the three network nodes:
% first row -> X1
% second row -> X2
% third row -> X3
% In each of the six panels there is a heatmap,
% and a grey shading is used to indicate
% the connectivity probabilities between time points
% (black=0 and white=1).
% From the left column (for data_1) it can be seen that
% the three nodes X1, X2 and X3 share the same changepoint
% at t=10.
% From the right column (for data_2) it can be seen that
% the first two nodes X1 and X2 share the same changepoint
% at t=10. For node X3 two changepoint at t=6 and t=15 have
% been inferred.
% From the right panel (for data_2) it can be seen that the third
% node is seperated from the first two nodes X1 and X2. That is,
% the nodes X1 and X2 build the first cluster, and X3 is seperated
% from them in a second cluster. It follows that the number of
% clusters is c=2.
% We note that these findings are consistent with the data
% generation mechanisms described above.
% data_1 -> class 1 data (all nodes share the same changepoint at t=10)
% data_2 -> class 2 data
% (the nodes X1 and X2 share the same changepoint at t=10 and
% node X3 has two changepoints at t=6 and t=15).
% END OF WORKED EXAMPLE