3. Probability and Information Theory


Preface

  • Probability theory is a mathematical framework for representing uncertain statements.
  • Usage in AI
    • the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory
    • we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.
  • While prob theory allows us to make uncertain statements and reason in the presence of uncertainty, information theory allows us to quantify the amount of uncertainty in a prob distribution.

3.1 Why Probability

  • ML must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities.
  • Beyond mathematical statements that are true by definition, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur.
  • Three possible sources of uncertainty

    • Inherent stochasticity in the system being modeled
    • Incomplete observability - even a deterministic system can appear stochastic when we cannot observe all of the variables that drive its behavior (e.g., the Monty Hall problem)
    • Incomplete modeling - When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model’s prediction.
  • In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule.

    • "Most birds fly" vs. "Birds fly, except for very young birds that have not yet learned to fly; sick or injured birds that have lost the ability to fly; and flightless species of birds, including the cassowary, ostrich, and kiwi."
    • The second rule is expensive to develop, maintain, and communicate, and after all of this effort it is still brittle and prone to failure.
  • When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment infinitely many times, then proportion p of the repetitions would result in that outcome.
  • Probability can also represent a degree of belief that an event will occur, rather than a rate of occurrence (e.g., a doctor saying a patient has a 40% chance of having the flu).
  • The former kind of probability, related directly to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of certainty, is known as Bayesian probability.
  • probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.

3.2 Random Variables

  • A random variable is a variable that can take on different values randomly.
  • For example, $x_1$ and $x_2$ are both possible values that the random variable $x$ can take on.

  • For vector-valued variables, we write the random variable in a plain bold typeface, $\mathbf{x}$, and one of its values as $\boldsymbol{x}$.

  • Random variables may be discrete or continuous.
  • A discrete random variable is one that has a finite or countably infinite number of states; the states need not be integers.
  • A continuous random variable is associated with a real value.

3.3 Probability Distributions

  • A probability distribution is a description of how likely a rv or set of rvs is to take on each of its possible states.

3.3.1 Discrete Variables and Probability Mass Functions

  • Probability Mass Function (PMF)
    • denoted by $P$
    • The domain of P must be the set of all possible states of x.
    • $\forall x \in \mathrm{x},\ 0 \leq P(x) \leq 1$.
    • $\sum_{x \in \mathrm{x}} P(x) = 1$; we refer to this property as being normalized.
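The two PMF properties can be checked directly in code. A minimal sketch, using a fair six-sided die as a hypothetical example (not from the text):

```python
# A PMF for a fair six-sided die, stored as a dict from state to probability.
die_pmf = {face: 1 / 6 for face in range(1, 7)}

# Property 1: every probability lies in [0, 1].
assert all(0 <= p <= 1 for p in die_pmf.values())

# Property 2: the probabilities sum to 1 (the PMF is normalized).
total = sum(die_pmf.values())
assert abs(total - 1.0) < 1e-12
```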

3.3.2 Continuous Variables and Probability Density Functions

  • Probability Density Function (PDF)

    • The domain of p must be the set of all possible states of x.
    • $\forall x \in \mathrm{x},\ p(x) \geq 0$. Note that we do not require $p(x) \leq 1$.
    • $\int p(x)\,dx = 1$.
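To illustrate why $p(x) \leq 1$ is not required: a uniform density on a narrow interval can exceed 1 pointwise while still integrating to 1. A sketch with a hypothetical uniform distribution on $[0, 0.1]$:

```python
# Density of a uniform distribution on [0, 0.1]: p(x) = 10 on that interval.
# The density exceeds 1 pointwise, yet total probability still integrates to 1.
def p(x):
    return 10.0 if 0.0 <= x <= 0.1 else 0.0

# Riemann-sum approximation of the integral of p over the real line.
dx = 1e-4
total = sum(p(i * dx) * dx for i in range(-1000, 2001))

assert p(0.05) > 1.0          # density greater than 1 is allowed
assert abs(total - 1.0) < 1e-2  # but it still integrates to 1
```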

3.4 Marginal Probability

  • The probability distribution over the subset is known as the marginal probability distribution.

    • For example, suppose that we have discrete random variables x and y, and we know $P(x, y)$. We can find $P(x)$ with the sum rule: $P(\mathrm{x}=x) = \sum_{y} P(\mathrm{x}=x, \mathrm{y}=y)$.

    • The name comes from the process of computing marginal probabilities on paper: when $P(x, y)$ is written in a grid, the row and column sums are written in the margins.
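The sum rule above can be sketched on a small joint table. The joint values here are hypothetical, chosen only for illustration:

```python
# Hypothetical joint distribution P(x, y) over two binary variables,
# stored as a dict keyed by (x, y).
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x(joint, x):
    # Sum rule: P(x) = sum over y of P(x, y).
    return sum(p for (xv, _), p in joint.items() if xv == x)

assert abs(marginal_x(joint, 0) - 0.4) < 1e-12
assert abs(marginal_x(joint, 1) - 0.6) < 1e-12
```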

3.5 Conditional Probability

  • Conditional probability is the probability of some event, given that some other event has happened.

  • We denote the conditional probability that $y=y$ given $x=x$ as $P(y=y \mid x=x)$: $P(y=y \mid x=x) = \frac{P(y=y, x=x)}{P(x=x)}$.

  • The conditional probability is only defined when $P(x=x) > 0$.

  • It is important not to confuse conditional probability with computing what would happen if some action were undertaken; such intervention queries are the domain of causal modeling.
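The definition $P(y \mid x) = P(x, y) / P(x)$ can be sketched with the same kind of hypothetical joint table as before:

```python
# Hypothetical joint distribution over two binary variables.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def conditional_y_given_x(joint, y, x):
    # P(y | x) = P(x, y) / P(x); defined only when P(x) > 0.
    p_x = sum(p for (xv, _), p in joint.items() if xv == x)
    if p_x == 0:
        raise ValueError("P(x) must be positive")
    return joint[(x, y)] / p_x

# P(x=0) = 0.4 and P(x=0, y=1) = 0.3, so P(y=1 | x=0) = 0.75.
assert abs(conditional_y_given_x(joint, 1, 0) - 0.75) < 1e-12
```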

3.6 The Chain Rule of Conditional Probabilities

  • Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable each: $P(x^{(1)}, \ldots, x^{(n)}) = P(x^{(1)}) \prod_{i=2}^{n} P(x^{(i)} \mid x^{(1)}, \ldots, x^{(i-1)})$.

  • This observation is known as the chain rule or product rule of probability.
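The chain rule can be verified numerically. A sketch on a hypothetical joint distribution over three binary variables (values chosen arbitrarily, summing to 1):

```python
# Check P(a, b, c) = P(a) * P(b | a) * P(c | a, b) for every outcome.
joint = {
    (0, 0, 0): 0.05, (0, 0, 1): 0.10, (0, 1, 0): 0.15, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.25,
}

def prob(pred):
    # Total probability of all outcomes satisfying a predicate.
    return sum(v for k, v in joint.items() if pred(k))

max_err = 0.0
for (a, b, c), p_abc in joint.items():
    p_a = prob(lambda k: k[0] == a)
    p_ab = prob(lambda k: k[:2] == (a, b))
    p_b_given_a = p_ab / p_a
    p_c_given_ab = p_abc / p_ab
    max_err = max(max_err, abs(p_abc - p_a * p_b_given_a * p_c_given_ab))

assert max_err < 1e-12  # the factorization reproduces the joint exactly
```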

3.7 Independence and Conditional Independence

  • Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y: $\forall x, y,\ p(x, y) = p(x)\, p(y)$.

  • Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z: $p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$.

    Notation: $x \perp y$ for independence, $x \perp y \mid z$ for conditional independence.
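The factorization test for independence can be sketched directly. Both joint tables below are hypothetical examples:

```python
def is_independent(joint, tol=1e-9):
    # x and y are independent iff P(x, y) = P(x) * P(y) for all (x, y).
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    py = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    return all(abs(joint[(x, y)] - px[x] * py[y]) <= tol
               for x in xs for y in ys)

# Product of Bernoulli(0.5) and Bernoulli(0.2): factorizes, so independent.
indep = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.1}
assert is_independent(indep)

# A joint that does not factorize: x and y are dependent.
dep = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
assert not is_independent(dep)
```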

3.8 Expectation, Variance, and Covariance

  • The expectation or expected value of some function $f(x)$ with respect to a probability distribution $P(x)$ is the average or mean value that $f$ takes on when $x$ is drawn from $P$. For discrete variables this can be computed with a summation: $\mathbb{E}_{x \sim P}[f(x)] = \sum_{x} P(x) f(x)$; for continuous variables, with an integral: $\int p(x) f(x)\, dx$.

  • The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of $x$ from its probability distribution: $\mathrm{Var}(f(x)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right]$.

  • The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables: $\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])\right]$.

  • High absolute values of the covariance mean that the values change a lot and are both far from their respective means at the same time.

  • $\mathrm{Cov} = 0$ means there is no linear relationship, but not independence: for example, $y = x^2$ (with x symmetric around 0) has zero covariance with x, yet the two are not independent.
  • Correlation normalizes the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.
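A worked version of the $y = x^2$ example above, with hypothetical values: x uniform on $\{-1, 0, 1\}$, so that the covariance vanishes even though y is fully determined by x:

```python
# (x, y) pairs with y = x^2, each point having probability 1/3.
support = [(-1, 1), (0, 0), (1, 1)]
P = 1 / 3

E_x = sum(P * x for x, _ in support)                         # E[x] = 0
E_y = sum(P * y for _, y in support)                         # E[y] = 2/3
var_x = sum(P * (x - E_x) ** 2 for x, _ in support)          # Var(x) = 2/3
cov_xy = sum(P * (x - E_x) * (y - E_y) for x, y in support)  # Cov(x, y)

assert abs(cov_xy) < 1e-12  # zero covariance: no linear relationship ...
# ... yet y is a deterministic function of x, so they are not independent.
```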

3.9 Common Probability Distributions

3.9.1 Bernoulli Distribution

  • It is a distribution over a single binary random variable, controlled by a parameter $\phi \in [0, 1]$: $P(\mathrm{x}=1) = \phi$, $P(\mathrm{x}=0) = 1 - \phi$, i.e., $P(\mathrm{x}=x) = \phi^x (1 - \phi)^{1-x}$.
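A minimal sketch of the Bernoulli PMF in the $\phi^x (1-\phi)^{1-x}$ form:

```python
def bernoulli_pmf(x, phi):
    # P(x) = phi^x * (1 - phi)^(1 - x) for x in {0, 1}.
    return phi ** x * (1 - phi) ** (1 - x)

assert abs(bernoulli_pmf(1, 0.3) - 0.3) < 1e-12
assert abs(bernoulli_pmf(0, 0.3) - 0.7) < 1e-12
# The two states exhaust the probability mass.
assert abs(bernoulli_pmf(0, 0.3) + bernoulli_pmf(1, 0.3) - 1.0) < 1e-12
```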

3.9.2 Multinoulli Distribution

  • Later

3.9.3 Gaussian Distribution

  • The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution: $\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$.
  • In the absence of prior knowledge about what form a distribution over real numbers should take, the normal distribution is a good default choice for two major reasons.

    • First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem (CLT) shows that the sum of many independent random variables is approximately normally distributed.
    • Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers.
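The CLT claim can be sketched empirically. A classic illustration (not from the text): the sum of 12 independent Uniform(0, 1) variables, shifted by its mean 6, is approximately N(0, 1), since each uniform has variance 1/12:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
samples = [sum(random.random() for _ in range(12)) - 6 for _ in range(20000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

assert abs(mean) < 0.1        # sample mean close to 0
assert abs(var - 1.0) < 0.1   # sample variance close to 1
```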

3.10 Useful Properties of Common Functions

  • The logistic sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$.
  • The logistic sigmoid is commonly used to produce the $\phi$ parameter of a Bernoulli distribution because its range is $(0, 1)$.
  • The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.
  • Another commonly encountered function is the softplus function: $\zeta(x) = \log(1 + e^{x})$.
  • The softplus function can be useful for producing the $\beta$ or $\sigma$ parameter of a normal distribution because its range is $(0, \infty)$.
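A sketch of both functions in numerically stable form (naive `exp(-x)` overflows for very negative arguments), also showing the saturation behavior:

```python
import math

def sigmoid(x):
    # Stable logistic sigmoid: sigma(x) = 1 / (1 + exp(-x)).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow when x is very negative
    return z / (1.0 + z)

def softplus(x):
    # zeta(x) = log(1 + exp(x)), in the stable form log1p(exp(-|x|)) + max(x, 0).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

assert abs(sigmoid(0.0) - 0.5) < 1e-12
assert abs(sigmoid(1000.0) - 1.0) < 1e-12   # saturates near 1
assert abs(sigmoid(-1000.0)) < 1e-12        # saturates near 0
assert abs(softplus(0.0) - math.log(2)) < 1e-12
```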

3.11 Bayes’ Rule

  • We often find ourselves in a situation where we know $P(y \mid x)$ and need to know $P(x \mid y)$. Fortunately, if we also know $P(x)$, we can compute the desired quantity using Bayes’ rule: $P(x \mid y) = \frac{P(x)\, P(y \mid x)}{P(y)}$, where $P(y) = \sum_x P(y \mid x) P(x)$.
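A sketch of Bayes' rule for a discrete variable, applied to a hypothetical diagnostic-test example (the prevalence and error rates below are invented for illustration):

```python
def bayes(prior_x, likelihood_y_given_x):
    # prior_x: dict x -> P(x); likelihood_y_given_x: dict x -> P(y | x).
    # P(y) is obtained by summing P(y | x) P(x) over x, so it need not be known.
    p_y = sum(likelihood_y_given_x[x] * prior_x[x] for x in prior_x)
    return {x: likelihood_y_given_x[x] * prior_x[x] / p_y for x in prior_x}

# Hypothetical: a disease with 1% prevalence; a test with a 90% true positive
# rate and a 5% false positive rate. y is the event "test is positive".
posterior = bayes({"sick": 0.01, "healthy": 0.99},
                  {"sick": 0.90, "healthy": 0.05})

# P(sick | positive) = 0.009 / (0.009 + 0.0495), roughly 0.15.
assert abs(posterior["sick"] - 0.009 / (0.009 + 0.0495)) < 1e-12
assert abs(sum(posterior.values()) - 1.0) < 1e-12
```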