Directed GMs: Bayesian Networks


This page’s Markdown is generated via pandoc from LaTeX. If you feel more comfortable with a LaTeX layout, please check here. The original TeX file is also available here.

1. Introduction

The goal of establishing GMs (Graphical Models) is to represent a joint distribution $P$ over some set of random variables $\mathcal{X} = \{X_1, \dots, X_n\}$. Consider the simplest case where each variable is binary-valued: a joint distribution then requires $2^n - 1$ numbers in total (the minus 1 comes from the sum-to-one constraint). This explicit representation of the joint distribution is unmanageable from every perspective.

  • Computationally, it’s very expensive to manipulate and too large to store in memory.

  • Cognitively, it is impossible to acquire so many numbers from a human expert, and the numbers are very small and do not correspond to events that people can reasonably contemplate.

  • Statistically, if we want to learn the distribution from data, we would need ridiculously large amounts of data to estimate this many parameters robustly.

However, Bayesian networks can provide compact representations of such joint distributions by exploiting their independence properties.
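To make the blowup concrete, here is the count spelled out as a worked equation (the $n = 30$ figure is just an added illustration):

$$\underbrace{2 \times 2 \times \cdots \times 2}_{n \text{ binary variables}} - 1 \;=\; 2^n - 1, \qquad \text{e.g. } 2^{30} - 1 \approx 1.07 \times 10^9 \text{ parameters for } n = 30.$$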

2. The Student Example

We’ll introduce perhaps the simplest example to see how independence assumptions produce a very compact representation of a high-dimensional distribution.

We now assume that a company would like to hire some graduates. The company’s goal is to hire intelligent employees, but there is no way to test intelligence directly. However, the company has access to the students’ SAT scores and course grades. Thus, our probability space is induced by three relevant random variables $I$, $S$, and $G$. We assume that $G$ (grade) takes on three values $g^1, g^2, g^3$, representing grades A, B, and C; $I$ (intelligence) takes on two values $i^0$ (low intelligence) and $i^1$ (high intelligence); and $S$ (SAT score) takes on two values $s^0$ (low score) and $s^1$ (high score).

We can get some intuitive independences in this example. The student’s intelligence is clearly correlated both with his SAT score and with his grade. The SAT score and grade are also not independent: if we condition on the fact that the student received a high score on his SAT, the chances that he gets a high grade in his class also increase. Thus, we assume that, for our distribution $P$,

$$P \not\models (S \perp G).$$

However, it’s quite plausible that our distribution satisfies a conditional independence property. If we know that the student has high intelligence, a high score on the SAT no longer gives us information about the student’s performance in the class. That is:

$$P(g \mid i^1, s^1) = P(g \mid i^1).$$

More generally, we may assume that

$$P \models (S \perp G \mid I).$$

Note that this independence holds only if we assume that the student’s intelligence is the only reason why his grade and SAT score might be correlated; in other words, it assumes that there are no correlations due to other factors. These assumptions are also not “true” in any formal sense of the word, and they are often only approximations of our true beliefs.

Figure 1: Simple Bayesian networks for the student

As in the case of marginal independence, conditional independence allows us to provide a compact specification of the joint distribution. The compact representation is based on a very natural alternative parameterization. By simple probabilistic reasoning, we have that

$$P(I, S, G) = P(S, G \mid I)\, P(I).$$

But now, the conditional independence assumption implies that

$$P(S, G \mid I) = P(S \mid I)\, P(G \mid I).$$

Hence, we have that

$$P(I, S, G) = P(I)\, P(S \mid I)\, P(G \mid I).$$

Thus, we have factorized the joint distribution as a product of three conditional probability distributions (CPDs). This factorization immediately leads us to the desired alternative parameterization: together with the CPDs $P(I)$, $P(S \mid I)$, and $P(G \mid I)$, we can specify the full joint distribution. For example, $P(i^1, s^1, g^2) = P(i^1)\, P(s^1 \mid i^1)\, P(g^2 \mid i^1)$.

We note that this probabilistic model would be represented using the Bayesian network shown in Figure 1.

In this case, the alternative parameterization is more compact than the joint. We now have three binomial distributions, $P(I)$, $P(S \mid i^0)$, and $P(S \mid i^1)$, and two three-valued multinomial distributions, $P(G \mid i^0)$ and $P(G \mid i^1)$. Each of the binomials requires one independent parameter, and each three-valued multinomial requires two independent parameters, for a total of seven. By contrast, our joint distribution has twelve entries, and hence eleven independent parameters.
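Spelling out the count with the quantities defined above:

$$\underbrace{1}_{P(I)} + \underbrace{1 + 1}_{P(S \mid i^0),\ P(S \mid i^1)} + \underbrace{2 + 2}_{P(G \mid i^0),\ P(G \mid i^1)} = 7 \qquad \text{vs.} \qquad 2 \cdot 2 \cdot 3 - 1 = 11.$$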

3. Bayesian Networks

Bayesian networks build on the same intuition as the naive Bayes model: they exploit conditional independence properties in order to allow a compact and natural representation. However, they are not restricted to the strong independence assumptions that the naive Bayes model makes.

The core of the Bayesian network representation is a directed acyclic graph (DAG), whose nodes are the random variables in our domain and whose edges correspond, intuitively, to direct influence of one node on another.

We can view the graph in two ways:

  • a data structure that provides the skeleton for representing a joint distribution compactly in a factorized way.

  • a compact representation for a set of conditional independence assumptions about a distribution.

Factorization Theorem

Given a DAG $G$, the most general form of the probability distribution that is consistent with the graph factors according to “node given its parents”:

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}_{X_i}),$$

where $\mathbf{Pa}_{X_i}$ is the set of parent nodes of $X_i$, and $n$ is the number of nodes. See Figure 2 for an example. This graph can be factorized and represented as follows:

$$P(X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8) = P(X_1)\, P(X_2)\, P(X_3 \mid X_1)\, P(X_4 \mid X_2)\, P(X_5 \mid X_2)\, P(X_6 \mid X_3, X_4)\, P(X_7 \mid X_6)\, P(X_8 \mid X_5, X_6).$$

Figure 2: Factorize example
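To connect the factorization to computation, here is a minimal Python sketch that evaluates the Figure 2 joint as a product of local CPDs. Only the parent sets are taken from the factorization above; the CPD values themselves are made-up placeholders.

```python
# Minimal sketch: evaluate P(X1, ..., X8) as the product of P(Xi | Pa(Xi)).
# The parent sets follow the Figure 2 factorization; every probability value
# below is an illustrative placeholder, not a number from these notes.

parents = {
    "X1": [], "X2": [],
    "X3": ["X1"], "X4": ["X2"], "X5": ["X2"],
    "X6": ["X3", "X4"], "X7": ["X6"], "X8": ["X5", "X6"],
}

def cpd(node, value, parent_values):
    """Toy CPD P(node = value | parents = parent_values) for binary variables."""
    p_one = 0.7 if sum(parent_values) % 2 == 0 else 0.4   # arbitrary placeholder
    return p_one if value == 1 else 1.0 - p_one

def joint(assignment):
    """P(assignment) = prod_i P(x_i | pa_i); assignment maps node -> {0, 1}."""
    prob = 1.0
    for node, pa in parents.items():
        prob *= cpd(node, assignment[node], [assignment[p] for p in pa])
    return prob

print(joint({f"X{i}": 1 for i in range(1, 9)}))   # product of eight local factors
```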

Local Structures and Independences

Graphical models have three fundamental local structures that compose into bigger structures.

  • Common parent ($A \leftarrow B \rightarrow C$): fixing $B$ decouples $A$ and $C$. When two variables $A$ and $C$ have a common parent $B$, the conditional independence $A \perp C \mid B$ holds.

  • Cascade ($A \rightarrow B \rightarrow C$): knowing $B$ decouples $A$ and $C$. When the middle node of a cascade of three random variables is known, the conditional independence $A \perp C \mid B$ holds.

  • V-structure ($A \rightarrow C \leftarrow B$): if $C$ is not observed, then $A$ and $B$ are independent. However, once $C$ is given, that independence is lost ($A$ and $B$ are not independent given $C$). In this case, $A$ and $B$ are marginally independent but not conditionally independent.

The unintuitive V-structure can be illustrated by a simple example. Suppose $A$ is whether the clock on the tower is running late, $B$ is whether there is a traffic jam on Eric’s way to campus, and $C$ is whether Eric is on time for class. If Eric is not on time and the clock turns out to be on time, then our belief that a traffic jam ($B$) occurred is higher, as the numerical sketch below shows.
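This is a tiny brute-force check of the explaining-away effect; all probability values are illustrative assumptions, not numbers from the notes.

```python
# Explaining away in the V-structure A -> C <- B.
# A: tower clock is running late, B: traffic jam, C: Eric is late (not on time).
# All numbers below are illustrative assumptions.
from itertools import product

P_A = {1: 0.1, 0: 0.9}                  # P(clock running late)
P_B = {1: 0.2, 0: 0.8}                  # P(traffic jam)
P_C1 = {(0, 0): 0.05, (0, 1): 0.7,      # P(C = 1 | A, B)
        (1, 0): 0.6, (1, 1): 0.9}

def joint(a, b, c):
    pc1 = P_C1[(a, b)]
    return P_A[a] * P_B[b] * (pc1 if c == 1 else 1 - pc1)

def prob_jam(**evidence):
    """P(B = 1 | evidence), by brute-force enumeration of the 8-entry joint."""
    num = den = 0.0
    for a, b, c in product([0, 1], repeat=3):
        if any({"a": a, "b": b, "c": c}[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c)
        den += p
        num += p if b == 1 else 0.0
    return num / den

print(prob_jam())            # prior P(jam) = 0.2 (A and B are marginally independent)
print(prob_jam(c=1))         # Eric is late: jam becomes more likely (~0.63)
print(prob_jam(c=1, a=0))    # ...and the clock was fine: jam even more likely (~0.78)
```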

4. I-maps

Definition 4.1 Let $P$ be a distribution over $\mathcal{X}$. We define $\mathcal{I}(P)$ to be the set of independence assertions of the form $(X \perp Y \mid Z)$ that hold in $P$.

Definition 4.2 Let $\mathcal{K}$ be any graph object associated with a set of independences $\mathcal{I}(\mathcal{K})$. Then $\mathcal{K}$ is an I-map for a set of independences $\mathcal{I}$ if $\mathcal{I}(\mathcal{K}) \subseteq \mathcal{I}$.

For example, if a graph $G$ is fully connected, then it asserts no independences between any pair of variables; more formally, $\mathcal{I}(G) = \emptyset$. A complete graph is “useless” as a model, since it does not give any knowledge about the independence structure of the distribution.

Facts about I-maps

For $G$ to be an I-map of $P$, it is necessary that $G$ does not mislead us regarding independences in $P$. In other words, any independence that $G$ asserts must also hold in $P$; conversely, $P$ may have additional independences that are not reflected in $G$.

Figure 3: I-map example

Example:

Consider a joint probability space over two independent random variables $X$ and $Y$. There are three possible graphs (as shown in Figure 3) over these two nodes: $G_{\emptyset}$, which is a disconnected pair $X$, $Y$; $G_{X \rightarrow Y}$, which has the edge $X \rightarrow Y$; and $G_{Y \rightarrow X}$, which contains $Y \rightarrow X$. The graph $G_{\emptyset}$ encodes the assumption $(X \perp Y)$. The latter two encode no independence assumptions.

Consider the following two distributions:

In the example on the left, $X$ and $Y$ are independent in $P$ (each joint entry equals the product of the corresponding marginals). Thus, $(X \perp Y) \in \mathcal{I}(P)$, and we have that $G_{\emptyset}$ is an I-map of $P$. In fact, all three graphs are I-maps of $P$: $\mathcal{I}(G_{X \rightarrow Y})$ is empty, so that $P$ trivially satisfies all the independences in it (and similarly for $G_{Y \rightarrow X}$). In the example on the right, $(X \perp Y) \notin \mathcal{I}(P)$, so that $G_{\emptyset}$ is not an I-map of $P$. Both other graphs are I-maps of $P$.

Local independences

Definition 4.3 A Bayesian network structure $G$ is a directed acyclic graph whose nodes represent random variables $X_1, \dots, X_n$. Let $\mathbf{Pa}_{X_i}$ denote the parents of $X_i$ in $G$, and $\mathrm{NonDescendants}_{X_i}$ denote the variables in the graph that are not descendants of $X_i$. Then $G$ encodes the following set of local conditional independence assumptions, denoted $\mathcal{I}_{\ell}(G)$:

$$\mathcal{I}_{\ell}(G) = \{ (X_i \perp \mathrm{NonDescendants}_{X_i} \mid \mathbf{Pa}_{X_i}) : i = 1, \dots, n \}.$$

In other words, each node is independent of its non-descendants given its parents.
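For instance, reading this off the Figure 2 graph for the node $X_4$ (its only parent is $X_2$, and its descendants are $X_6$, $X_7$, and $X_8$):

$$(X_4 \perp \{X_1, X_3, X_5\} \mid X_2) \in \mathcal{I}_{\ell}(G).$$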

5. D-separation

Direct connection The simplest case is that $X$ and $Y$ are directly connected via an edge, say $X \rightarrow Y$. For any network structure $G$ that contains the edge $X \rightarrow Y$, it is possible to construct a distribution where $X$ and $Y$ are correlated regardless of any evidence about any of the other variables in the network. In other words, if $X$ and $Y$ are directly connected, we can always find examples where they influence each other, regardless of $Z$.

Figure 4: The four possible two-edge trails from X to Y via Z

Indirect connection Now consider the more complicated case when $X$ and $Y$ are not directly connected, but there is a trail between them in the graph. We begin by considering the simplest such case: a three-node network, where $X$ and $Y$ are not directly connected, but where there is a trail between them via $Z$. It is clear that there are four cases where $X$ and $Y$ are connected via $Z$, as shown in Figure 4.

  • Causal trail $X \rightarrow Z \rightarrow Y$ and evidential trail $X \leftarrow Z \leftarrow Y$: active iff $Z$ is not observed. These two are shown in Figure 4(a),(b).

  • Common cause $X \leftarrow Z \rightarrow Y$: active iff $Z$ is not observed.

  • Common effect $X \rightarrow Z \leftarrow Y$: active iff $Z$ or one of its descendants is observed.

Definition 5.1 Let $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$ be three sets of nodes in $G$. We say that $\mathbf{X}$ and $\mathbf{Y}$ are d-separated given $\mathbf{Z}$, denoted $\mathrm{d\text{-}sep}_G(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z})$, if there is no active trail between any node $X \in \mathbf{X}$ and any node $Y \in \mathbf{Y}$ given $\mathbf{Z}$. We use $\mathcal{I}(G)$ to denote the set of independences that correspond to d-separation:

$$\mathcal{I}(G) = \{ (\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z}) : \mathrm{d\text{-}sep}_G(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z}) \}.$$

This set is also called the set of global Markov independences.
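The criterion is mechanical enough to check in code. Below is a from-scratch sketch (not from the original notes) that enumerates the trails between two nodes and tests each one against the activation rules above; it is meant only for small graphs.

```python
# Minimal d-separation check by enumerating trails (fine for small graphs).
# Graphs are given as {node: [children]} dicts.

def descendants(graph, node):
    """All nodes reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def trails(graph, x, y):
    """All simple undirected paths (trails) between x and y."""
    nbrs = {}
    for u, children in graph.items():
        for v in children:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    def walk(path):
        if path[-1] == y:
            yield path
            return
        for nxt in nbrs.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])
    yield from walk([x])

def is_active(graph, trail, Z):
    """A trail is active given Z iff every intermediate node passes the rules above."""
    for a, b, c in zip(trail, trail[1:], trail[2:]):
        collider = b in graph.get(a, []) and b in graph.get(c, [])
        if collider:
            if b not in Z and not (descendants(graph, b) & Z):
                return False    # common effect: blocked unless b or a descendant is observed
        elif b in Z:
            return False        # causal / evidential / common cause: blocked by observing b
    return True

def d_separated(graph, x, y, Z):
    return all(not is_active(graph, t, set(Z)) for t in trails(graph, x, y))

# The student network I -> G, I -> S from Section 2:
student = {"I": ["G", "S"]}
print(d_separated(student, "S", "G", set()))    # False: S <- I -> G is active
print(d_separated(student, "S", "G", {"I"}))    # True: observing I blocks the trail
```

For larger graphs one would use a linear-time reachability procedure rather than enumerating trails, but the enumeration above follows the definition directly.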

6. Soundness and completeness

Soundness If a distribution $P$ factorizes according to a graph $G$, then $\mathcal{I}(G) \subseteq \mathcal{I}(P)$.

Completeness d-separation detects all possible independences.

However, it is important to note that if $X$ and $Y$ are not d-separated given $Z$, it is not the case that $X$ and $Y$ are dependent given $Z$ in all distributions that factorize over $G$. For example, consider the graph $X \rightarrow Y$. Clearly, $X$ and $Y$ are not d-separated in this graph. Note that every distribution over $X$ and $Y$ factorizes according to this graph, since it is always true that $P(X, Y) = P(X)\, P(Y \mid X)$. But if we consider the specific distribution given in Table 1, then $X \perp Y$. However, we can assert that if $X$ and $Y$ are not d-separated given $Z$, then there is at least one distribution which factorizes according to the graph and in which $X$ is not independent of $Y$ given $Z$. Combining this with the above theorems gives us an important result.

Table 1: The distribution specified in this table factorizes according to the graph $X \rightarrow Y$, but $X$ is independent of $Y$.

7. Uniqueness of BN

Very different BN graphs can actually be equivalent, in that they encode precisely the same set of conditional independence assertions. For example, the three networks in Figure 4(a),(b),(c) encode precisely the same independence assumption: $X \perp Y \mid Z$. Note that the v-structure network in Figure 4(d) induces a very different set of d-separation assertions, and hence it does not fall into the same I-equivalence class as the first three.

Definition 7.1 Two graph structures $G_1$ and $G_2$ over $\mathcal{X}$ are I-equivalent if $\mathcal{I}(G_1) = \mathcal{I}(G_2)$. The set of all graphs over $\mathcal{X}$ is partitioned into a set of mutually exclusive and exhaustive I-equivalence classes, which are the equivalence classes induced by the I-equivalence relation.

Definition 7.2 The skeleton of a Bayesian network graph $G$ over $\mathcal{X}$ is an undirected graph over $\mathcal{X}$ that contains an edge $\{X, Y\}$ for every edge $(X, Y)$ in $G$.

Theorem 7.1 Let $G_1$ and $G_2$ be two graphs over $\mathcal{X}$. If $G_1$ and $G_2$ have the same skeleton and the same set of v-structures, then they are I-equivalent.
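Theorem 7.1 also gives a simple sufficient test that is easy to code: compare skeletons and sets of v-structures. A rough sketch, again with graphs as `{node: [children]}` dicts:

```python
# Sufficient check for I-equivalence (Theorem 7.1): same skeleton + same v-structures.
from itertools import combinations

def skeleton(graph):
    """Undirected edge set of the graph."""
    return {frozenset((u, v)) for u, vs in graph.items() for v in vs}

def v_structures(graph):
    """Triples (a, c, b) with a -> c <- b and no edge between a and b."""
    skel = skeleton(graph)
    parents = {}
    for u, vs in graph.items():
        for v in vs:
            parents.setdefault(v, set()).add(u)
    result = set()
    for c, pa in parents.items():
        for a, b in combinations(sorted(pa), 2):
            if frozenset((a, b)) not in skel:   # the two parents must be non-adjacent
                result.add((a, c, b))
    return result

def same_skeleton_and_v_structures(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# X -> Z -> Y and X <- Z -> Y are I-equivalent; the v-structure X -> Z <- Y is not.
cascade       = {"X": ["Z"], "Z": ["Y"]}
common_parent = {"Z": ["X", "Y"]}
v_structure   = {"X": ["Z"], "Y": ["Z"]}
print(same_skeleton_and_v_structures(cascade, common_parent))  # True
print(same_skeleton_and_v_structures(cascade, v_structure))    # False
```

By the theorem, a `True` answer is conclusive; a `False` answer is not conclusive on its own, although for the cascade versus the v-structure we already know from Section 7 that the two graphs are not I-equivalent.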

8. Minimum I-Map

A complete graph is a trivial I-map for any distribution over its variables, but it is uninformative, since it does not reveal any of the independence structure in the distribution.

Definition 8.1 A graph $\mathcal{K}$ is a minimal I-map for a set of independences $\mathcal{I}$ if it is an I-map for $\mathcal{I}$, and if the removal of even a single edge from $\mathcal{K}$ renders it not an I-map.

9. Perfect Maps

Definition 9.1 We say that a graph $\mathcal{K}$ is a perfect map (P-map) for a set of independences $\mathcal{I}$ if $\mathcal{I}(\mathcal{K}) = \mathcal{I}$. We say that $\mathcal{K}$ is a perfect map for a distribution $P$ if $\mathcal{I}(\mathcal{K}) = \mathcal{I}(P)$.

Note that not every distribution has a perfect map.

10. Summary

  • Definition 10.1 A Bayesian network is a pair $B = (G, P)$ where $P$ factorizes over $G$, and where $P$ is specified as a set of CPDs associated with the nodes of $G$. The distribution $P$ is often annotated $P_B$.

  • BN utilizes local and global independences to give a compact representation of the joint distribution.

  • Joint likelihood is computed by multiplying CPDs.

  • Local and global independences are identifiable via d-separation.
