Measuring Probability
Now that we have a sigma algebra, we can define the concept of probability. A probability distribution (or probability measure) \(\mathrm{P}\) over a sigma algebra \(\mathcal{F}\) of events in a sample space \(E\) is a function that obeys the following axioms (Kolmogorov’s axioms):
- \(\mathrm{P}[E] = 1\) — the probability of something happening is 1.
- \(\mathrm{P}[A] \ge 0\) — non-negativity: probabilities are not negative.
- If \(A_1, A_2, \dots\) are (countably many) disjoint events in \(\mathcal{F}\), then \(\mathrm{P}[\bigcup_i A_i] = \sum_i \mathrm{P}[A_i]\) (countable additivity).
A collection of disjoint sets is also called mutually exclusive. What it means is that for any two distinct \(A_i, A_j\) in the collection (\(i \neq j\)), \(A_i \cap A_j = \emptyset\) — the two events cannot both happen simultaneously.
We call a sample space and sigma algebra equipped with a probability measure, written \((E, \mathcal{F}, \mathrm{P})\), a probability space.
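The axioms are easy to check concretely. Below is a minimal sketch (not from the original text) of a discrete probability space for a fair six-sided die, using exact fractions so the checks are exact rather than floating-point:

```python
from fractions import Fraction

# A hypothetical discrete probability space for a fair six-sided die.
# Sample space E; every outcome is assigned mass 1/6. Here the sigma
# algebra is (implicitly) all subsets of E.
E = {1, 2, 3, 4, 5, 6}
mass = {outcome: Fraction(1, 6) for outcome in E}

def P(event):
    """Probability measure: sum the mass of the outcomes in the event."""
    return sum(mass[o] for o in event)

# Axiom 1: the probability of the whole sample space is 1.
assert P(E) == 1

# Axiom 2: non-negativity holds for every outcome (hence every event).
assert all(P({o}) >= 0 for o in E)

# Axiom 3 (finite case of countable additivity): disjoint events add.
evens, odds = {2, 4, 6}, {1, 3, 5}
assert evens & odds == set()                # mutually exclusive
assert P(evens | odds) == P(evens) + P(odds)
```

The same pattern works for any finite sample space: choose non-negative masses that sum to 1, and additivity over disjoint events falls out of summation.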
What this probability measure does is that it describes how “much” of the total probability is associated with event. This is sometimes called the probability mass, because probability acts like a conserved quantity (like mass or energy in physics). There is a total probability of 1 (from the first axiom \(\mathrm{P}[E] = 1\)); the probability measure over other events tells us how likely they are relative to other events by quantifying how much of the probability mass is placed on them: if \(\mathrm{P}[A] = 0.5\), that tells us that half the probability mass is on event \(A\). This then has a variety of interpretations:
- Interpreted as a description of long-run frequencies, if we repeated the random process infinitely many times, \(A\) should occur in half of the repetitions.
- Interpreted as an expectation of future observations of currently-unobserved outcomes, \(A\) is just as likely as it is not.
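The long-run-frequency interpretation can be illustrated with a quick simulation. This is a hypothetical example, not from the original text: flip a fair coin many times, with \(A\) = "heads" and \(\mathrm{P}[A] = 0.5\), and watch the observed frequency settle near 0.5:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Simulate many independent trials of an event A with P[A] = 0.5.
trials = 100_000
occurrences = sum(random.random() < 0.5 for _ in range(trials))
frequency = occurrences / trials

# With this many trials, the empirical frequency sits very close to
# the probability mass assigned to A.
assert abs(frequency - 0.5) < 0.01
```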
The non-negativity axiom keeps us from assigning negative probabilities to events, because they would not be meaningful, and the countable additivity axiom ensures that probabilities “make sense” in a way consistent with describing a distribution of mass across the various events. If we have two distinct, disjoint events, then the probability mass assigned to their union is the sum of their individual masses; the axiom generalizes this to countable collections of disjoint events.
Some additional facts about probability that can be derived from the above axioms:
- \(\mathrm{P}[A] \le 1\) (combined with non-negativity, we have \(0 \le \mathrm{P}[A] \le 1\))
- \(\mathrm{P}[A \cup B] = \mathrm{P}[A] + \mathrm{P}[B] - \mathrm{P}[A \cap B]\)
- \(\mathrm{P}[A^c] = 1 - \mathrm{P}[A]\)
- \(\mathrm{P}[A \setminus B] = \mathrm{P}[A] - \mathrm{P}[A \cap B]\)
- If \(A \subseteq B\), then \(\mathrm{P}[A] \le \mathrm{P}[B]\)
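Each of these derived facts can be verified numerically. Here is a minimal sketch (my example, not the text's) checking them on the fair-die space, with \(A\) = "at most 3" and \(B\) = "even":

```python
from fractions import Fraction

# A hypothetical fair-die probability space: events are subsets of E,
# and each outcome carries equal mass.
E = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event), len(E))

A = {1, 2, 3}   # "roll at most 3"
B = {2, 4, 6}   # "roll an even number"

# 0 <= P[A] <= 1
assert 0 <= P(A) <= 1

# Inclusion-exclusion: P[A ∪ B] = P[A] + P[B] - P[A ∩ B]
assert P(A | B) == P(A) + P(B) - P(A & B)

# Complement rule: P[A^c] = 1 - P[A]
assert P(E - A) == 1 - P(A)

# Set difference: P[A \ B] = P[A] - P[A ∩ B]
assert P(A - B) == P(A) - P(A & B)

# Monotonicity: A ⊆ B implies P[A] <= P[B]
C = {2}
assert C <= B and P(C) <= P(B)
```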
We have to be careful with \(\mathrm{P}[A \cup B]\) — a common mistake is to attempt to compute it as \(\mathrm{P}[A] + \mathrm{P}[B]\). However, this erroneously counts elements in the intersection \(A \cap B\) twice. With the examples from before, where \(A\) is 2s and \(B\) is red cards, the red 2s are included in both \(\mathrm{P}[A]\) (since they are 2s) and \(\mathrm{P}[B]\) (since they are red). Subtracting the joint probability \(\mathrm{P}[A \cap B]\) corrects for the double-counting and produces the correct result. If \(A\) and \(B\) are disjoint (they can never happen at the same time), then \(\mathrm{P}[A \cap B] = 0\), and the formula reduces to \(\mathrm{P}[A \cup B] = \mathrm{P}[A] + \mathrm{P}[B]\), consistent with countable additivity.
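The card example can be modeled directly. In this sketch (the deck representation is my own assumption), \(A\) is the four 2s and \(B\) is the 26 red cards in a standard 52-card deck; the naive sum overcounts the two red 2s, and subtracting \(\mathrm{P}[A \cap B]\) fixes it:

```python
from fractions import Fraction
from itertools import product

# A hypothetical model of a standard 52-card deck as (rank, suit) pairs.
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = set(product(ranks, suits))

def P(event):
    return Fraction(len(event), len(deck))

A = {card for card in deck if card[0] == '2'}                     # the 2s
B = {card for card in deck if card[1] in ('hearts', 'diamonds')}  # red cards

# Naively adding P[A] + P[B] counts the two red 2s twice:
assert P(A) + P(B) == Fraction(30, 52)

# Inclusion-exclusion gives the correct 28/52, since |A ∪ B| = 4 + 26 - 2:
assert P(A | B) == P(A) + P(B) - P(A & B)
assert P(A | B) == Fraction(28, 52)
```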