Measuring Probability
Now that we have a sigma algebra, we can define the concept of probability. A probability distribution (or probability measure) \(\mathrm{P}\) over a sigma algebra \(\mathcal{F}\) is a function that assigns a number \(\mathrm{P}[A]\) to each event \(A \in \mathcal{F}\) and obeys the following (Kolmogorov’s axioms):
- \(\mathrm{P}[E] = 1\) — the probability of the entire sample space \(E\), i.e., of something happening, is 1.
- \(\mathrm{P}[A] \ge 0\) — non-negativity: probabilities are not negative.
- If \(A_1, A_2, \dots\) are countably many disjoint events in \(\mathcal{F}\), then \(\mathrm{P}[\bigcup_i A_i] = \sum_i \mathrm{P}[A_i]\) (countable additivity).
A collection of disjoint sets is also said to be mutually exclusive: for any \(A_i, A_j\) in the collection with \(i \neq j\), we have \(A_i \cap A_j = \emptyset\), meaning the two events cannot both happen.
We call a field of events equipped with a probability measure, written \((E, \mathcal{F}, \mathrm{P})\), a probability space.
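To make the definition concrete, here is a minimal Python sketch of a finite probability space for a fair six-sided die. The names `E`, `F`, and `P` mirror the triple above; the uniform mass of 1/6 per face is an assumption chosen for illustration.

```python
from itertools import chain, combinations
from fractions import Fraction

# A toy probability space for a fair six-sided die (illustrative names).
E = frozenset(range(1, 7))                  # sample space: the six faces

def powerset(s):
    """Every subset of s -- the largest sigma algebra on a finite sample space."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

F = powerset(E)                             # sigma algebra: all 2**6 = 64 events

def P(event):
    """Probability measure: each face carries mass 1/6, so an event's mass is |event| / 6."""
    return Fraction(len(event), len(E))

# Kolmogorov's axioms, checked directly on this finite space:
assert P(E) == 1                            # total probability is 1
assert all(P(A) >= 0 for A in F)            # non-negativity
A, B = frozenset({1, 2}), frozenset({5})    # two disjoint (mutually exclusive) events
assert A & B == frozenset()
assert P(A | B) == P(A) + P(B)              # additivity for disjoint events
```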
What this probability measure does is describe how “much” of the total probability is associated with each event. This is sometimes called the probability mass, because probability acts like a conserved quantity (like mass or energy in physics). There is a total probability of 1 (from the first axiom, \(\mathrm{P}[E] = 1\)); the probability measure tells us how likely events are relative to one another by quantifying how much of the probability mass is placed on each: if \(\mathrm{P}[A] = 0.5\), half the probability mass is on event \(A\). This has a variety of interpretations:
- Interpreted as a description of long-run frequencies: if we repeated the random process infinitely many times, half of the outcomes would be \(A\) (see the simulation sketch after this list).
- Interpreted as an expectation about a currently-unobserved outcome: \(A\) is just as likely to happen as not.
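As a rough illustration of the frequency reading (a sketch, not part of the formal development), the snippet below simulates an event with \(\mathrm{P}[A] = 0.5\), assuming a fair coin where \(A\) is “lands heads”; the observed relative frequency drifts toward 0.5 as the number of trials grows.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def relative_frequency(trials):
    """Fraction of independent fair-coin flips that land in A (heads); P[A] = 0.5 by construction."""
    hits = sum(1 for _ in range(trials) if random.random() < 0.5)
    return hits / trials

for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(n))  # approaches 0.5 as n grows
```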
The non-negativity axiom keeps us from assigning negative probabilities to events, which would not be meaningful, and the countable additivity axiom ensures that probabilities “make sense” in a way consistent with describing a distribution of mass across the various events. If we have two distinct, disjoint events, then the probability mass assigned to their union is the sum of their individual masses; the axiom generalizes this to countable collections of disjoint events.
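As a small sketch of this generalization (reusing the illustrative die space from above), the singleton events partition the sample space, and additivity forces their masses to sum to the total mass of 1.

```python
from fractions import Fraction

E = frozenset(range(1, 7))                  # the die's sample space again

def P(event):
    return Fraction(len(event), len(E))     # uniform mass of 1/6 per face

# The singletons {1}, ..., {6} are pairwise disjoint and together cover E,
# so additivity says their masses must add up to P[E] = 1.
singletons = [frozenset({k}) for k in E]
assert sum(P(s) for s in singletons) == P(E) == 1
```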
Some additional facts about probability can be derived from the above axioms (each is spot-checked numerically in the sketch after this list):
- \(\mathrm{P}[A] \le 1\) (combined with non-negativity, we have \(0 \le \mathrm{P}[A] \le 1\))
- \(\mathrm{P}[A \cup B] = \mathrm{P}[A] + \mathrm{P}[B] - \mathrm{P}[A \cap B]\)
- \(\mathrm{P}[A^c] = 1 - \mathrm{P}[A]\)
- \(\mathrm{P}[A \setminus B] = \mathrm{P}[A] - \mathrm{P}[A \cap B]\)
- If \(A \subseteq B\), then \(\mathrm{P}[A] \le \mathrm{P}[B]\)
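These identities can be spot-checked exhaustively on the small die space from the earlier sketch; the check below is illustrative and assumes the same uniform measure, not a proof of the general statements.

```python
from fractions import Fraction
from itertools import chain, combinations

E = frozenset(range(1, 7))                  # the die's sample space

def P(event):
    return Fraction(len(event), len(E))     # uniform mass of 1/6 per face

events = [frozenset(c) for c in chain.from_iterable(
    combinations(E, r) for r in range(len(E) + 1))]

for A in events:
    assert 0 <= P(A) <= 1                              # probabilities lie in [0, 1]
    assert P(E - A) == 1 - P(A)                        # complement rule
    for B in events:
        assert P(A | B) == P(A) + P(B) - P(A & B)      # inclusion-exclusion
        assert P(A - B) == P(A) - P(A & B)             # set difference
        if A <= B:
            assert P(A) <= P(B)                        # monotonicity
```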