Joint & Conditional

We define the joint probability \(\mathrm{P}[A, B] = \mathrm{P}[A \cap B]\): the probability of both \(A\) and \(B\) happening in the same observation. This is sometimes also written \(\mathrm{P}[A; B]\); commas and semicolons are sometimes mixed in the same statement, usually to separate different kinds of events in the probability statement.

The conditional probability \(\mathrm{P}[B|A]\), read “the probability of \(B\) given \(A\)”, is the probability of \(B\) conditioned on the knowledge that \(A\) has happened.
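
To make these two definitions concrete, here is a minimal Python sketch (not from the original text) that estimates a joint and a conditional probability by counting simulated observations; the die roll and the events \(A\) (the roll is even) and \(B\) (the roll is at least 4) are made-up examples.

```python
import random

random.seed(42)
N = 100_000

# Simulate N rolls of a fair six-sided die.
rolls = [random.randint(1, 6) for _ in range(N)]

# Made-up example events: A = "the roll is even", B = "the roll is at least 4".
a = [r % 2 == 0 for r in rolls]
b = [r >= 4 for r in rolls]

n_a = sum(a)                                   # observations where A happened
n_ab = sum(ai and bi for ai, bi in zip(a, b))  # observations where both happened

p_ab = n_ab / N           # P[A, B]: fraction of all observations with both A and B
p_b_given_a = n_ab / n_a  # P[B|A]: among A-observations, how often B also happened

print(f"P[A, B] ≈ {p_ab:.3f}   (exactly 1/3)")
print(f"P[B|A]  ≈ {p_b_given_a:.3f}   (exactly 2/3)")
```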

Conditional and joint probabilities decompose as follows:

\[\mathrm{P}[A, B] = \mathrm{P}[B|A] \mathrm{P}[A] = \mathrm{P}[A|B] \mathrm{P}[B]\]

From this we can derive Bayes’ theorem:

\[\mathrm{P}[B|A] = \frac{\mathrm{P}[A|B] \mathrm{P}[B]}{\mathrm{P}[A]}\]

Bayes’ theorem gives us a tool to invert a conditional probability: given \(\mathrm{P}[A|B]\) and the associated unconditional probabilities \(\mathrm{P}[A]\) and \(\mathrm{P}[B]\), we can obtain \(\mathrm{P}[B|A]\). Crucially, this inversion requires the base rates, as expressed by unconditional probabilities, of \(A\) and \(B\) — the conditional probability alone does not provide sufficient information.
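
As a rough illustration of why the base rates matter, here is a small Python sketch using the familiar diagnostic-test scenario; all of the numbers are invented for the example, with \(A\) standing for “the test is positive” and \(B\) for “the patient has the disease.”

```python
# Hypothetical numbers for a diagnostic test (all made up for illustration):
#   A = "the test is positive", B = "the patient has the disease"
p_pos_given_disease = 0.99   # P[A|B]
p_disease = 0.001            # P[B], the base rate of the disease
p_pos_given_healthy = 0.05   # P[A|not B], used to compute P[A] below

# P[A] = P[A|B] P[B] + P[A|not B] P[not B]
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P[B|A] = P[A|B] P[B] / P[A]
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P[A]   = {p_pos:.4f}")                # ≈ 0.0509
print(f"P[B|A] = {p_disease_given_pos:.4f}")  # ≈ 0.0194, far below P[A|B] = 0.99
```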

In many interesting settings, such as Bayesian inference, we are looking to compute and compare \(\mathrm{P}[B|A]\) for several different \(B\) and the same \(A\). For such computations, we do not actually need \(\mathrm{P}[A]\) unless we need the actual probabilities; if we simply wish to know which \(B\) is the most likely for a given \(A\), we treat \(\mathrm{P}[A]\) as fixed and look for the largest joint probability \(\mathrm{P}[A|B]\mathrm{P}[B]\) (sometimes called an unscaled probability, because we have not scaled it by \(\mathrm{P}[A]\) to obtain a proper conditional probability). For such computations, we say that \(\mathrm{P}[B|A] \propto \mathrm{P}[A|B]\mathrm{P}[B]\) (\(\mathrm{P}[B|A]\) is proportional to \(\mathrm{P}[A|B]\mathrm{P}[B]\)).
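
As a sketch of how this proportionality gets used, the following Python snippet (with invented priors and likelihoods, and assuming the candidate \(B_i\) are mutually exclusive and exhaustive) compares several \(B_i\) for the same \(A\) using only the unscaled products, then normalizes at the end to recover the proper conditional probabilities.

```python
# Hypothetical priors P[B_i] and likelihoods P[A|B_i] (numbers made up),
# assuming the B_i are mutually exclusive and cover all possibilities.
priors = {"B1": 0.5, "B2": 0.3, "B3": 0.2}
likelihoods = {"B1": 0.10, "B2": 0.40, "B3": 0.25}

# Unscaled scores: P[A|B] * P[B], proportional to P[B|A].
scores = {b: likelihoods[b] * priors[b] for b in priors}

# Picking the most likely B does not require P[A] ...
best = max(scores, key=scores.get)
print("most likely:", best)   # B2

# ... but dividing by P[A] = sum of the scores recovers proper probabilities.
p_a = sum(scores.values())
posteriors = {b: s / p_a for b, s in scores.items()}
print(posteriors)
```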

Finally, we can marginalize a joint distribution by summing. If \(\mathcal{B} = \{B_1, B_2, \dots, B_n\}\) is a collection of mutually exclusive events that span \(E\), then:

\[\mathrm{P}[A] = \sum_{B \in \mathcal{B}} \mathrm{P}[A, B]\]

We call \(\mathcal{B}\) a partition of \(E\). By “span \(E\)”, we mean that for any \(e \in E\), there is some \(B_i \in \mathcal{B}\) such that \(e \in B_i\).
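
Here is a minimal sketch of marginalization for a single fair die roll, where the partition is \(B_i\) = “the roll is \(i\)” and the example event \(A\) = “the roll is even” is made up for illustration.

```python
from fractions import Fraction

# Partition of the sample space E for one fair die roll: B_i = "the roll is i".
partition = range(1, 7)

# Made-up example event A = "the roll is even".
def in_a(i):
    return i % 2 == 0

# Joint probability P[A, B_i]: 1/6 if outcome i satisfies A, otherwise 0.
def p_joint(i):
    return Fraction(1, 6) if in_a(i) else Fraction(0)

# Marginalize: P[A] = sum over the partition of P[A, B_i].
p_a = sum(p_joint(i) for i in partition)
print(p_a)  # 1/2
```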

Independence

Two events are independent if knowing the outcome of one tells you nothing about the probability of the other. The following are true if and only if \(A\) and \(B\) are independent:

  • \(\mathrm{P}[A|B] = \mathrm{P}[A]\)
  • \(\mathrm{P}[B|A] = \mathrm{P}[B]\)
  • \(\mathrm{P}[A, B] = \mathrm{P}[A] \mathrm{P}[B]\)
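
As a quick check of the product rule, the following sketch enumerates the 36 equally likely outcomes of two fair dice; the events \(A\) (first die is even) and \(B\) (second die shows 6) are made-up examples that depend on different dice, so they come out independent.

```python
from itertools import product
from fractions import Fraction

# Sample space: the 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p_each = Fraction(1, len(outcomes))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return sum(p_each for o in outcomes if event(o))

# Made-up example events that depend on different dice.
A = lambda o: o[0] % 2 == 0   # first die is even
B = lambda o: o[1] == 6       # second die shows 6

p_a, p_b = prob(A), prob(B)
p_ab = prob(lambda o: A(o) and B(o))

print(p_ab == p_a * p_b)   # True: P[A, B] = P[A] P[B]
print(p_ab, p_a * p_b)     # 1/12 1/12
```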