Table of Contents
Statistics
See also Probability, where we try to predict future events based on statistics.
Parent Statistics and Probability.
Graphs tell a story, good or bad: stretching or squeezing the axes changes the impression.
sports stats
nba.com
Data
- collecting
- analyzing
- presenting
Variability
- 0,0,0,0,0
- 0,235,17,5,318
Statistical Questions: to answer, you need to collect data with variability
Population and Sample
- subset
- Simple random sample - random number generator, for example
- Stratified sample - by age or gender
- Clustered sample - randomly choose clusters (classrooms in school), then take the whole cluster
- Voluntary - likely to introduce bias, non-random
- Convenience - non-random
- Block design - variation of stratified sample
Bias
- voluntary response sampling (non random)
- response bias (persons do not want to answer truthfully)
- undercoverage (missing certain groups)
- convenience sampling
- nonresponse (not everyone sampled actually responds)
- wording of a survey (influences the answers)
Correlation vs causality
types of statistical studies
- Sample Study - select sample from population
- Observational Study - looking for correlation, not causality
- Experiment - look for causality, use control group
“margin of error”
Experiment Design
- explanatory variable
- response variable
- blind
- double-blind
- triple-blind, people analyzing the data don't know which is which
- matched pairs experiment, switch the control and treatment groups after a period of time
- Correlation Coefficient (r)
- Calculated as
\begin{align} r = \frac{1}{n-1} \sum \left ( \frac{x_i - \overline{x} }{s_x} \right ) \left ( \frac{y_i - \overline{y} }{s_y} \right ) \end{align}
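A quick sketch of this formula in Python, standard library only; the data set is made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of z-score products, per the formula above."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations (divide by n-1)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# perfectly linear data gives r = 1; reversing y gives r = -1
print(correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # → 1.0
```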
conventions
\begin{align} y &= actual y \\ \hat{y} &= predicted y \\ \overline{y} &= mean of all y's \\ \end{align}
Coefficient of Determination
Represented by r-squared.
What percentage of the total variation in y is described by the variation in x?
What percentage of the variation is described by the line?
How good is the line as a predictor?
Total variation is the squared error of y from the mean of y.
\begin{align} r^2 &= 1 - \frac{SE_{LINE}}{SE_{\bar{y}}} \\ SE_{LINE} &= \text{squared error of y from the line (variation not explained by the line)} \\ SE_{\bar{y}} &= \text{total variation in y from the mean} \\ \end{align}
Squared error for the line, the lower the value, the better the fit.
The coefficient of determination normalizes the squared error by expressing it as a proportion of the total variation.
The smaller the squared error, the higher the coefficient and the better the line fits the data.
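The r-squared formula can be checked numerically with a least-squares fit; a minimal sketch with made-up data:

```python
from statistics import mean

def r_squared(xs, ys):
    xbar, ybar = mean(xs), mean(ys)
    # least-squares slope and intercept
    m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    b = ybar - m * xbar
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))  # squared error from the line
    se_ybar = sum((y - ybar) ** 2 for y in ys)                     # total variation from the mean
    return 1 - se_line / se_ybar

print(r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))   # → 0.6 (up to rounding)
```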
- Root-mean-square error (RMSE), aka root-mean-square deviation (RMSD)
- Standard deviation of residuals
- Covariance
- expected value can be
- arithmetic mean
- probability weighted sum, or probability weighted integral (in a continuous distribution)
- Coefficient
- can mean multiplier, factor, scalar
Confidence Intervals
Khan Unit: Confidence Intervals
95% confidence based on 2 standard deviations (1.96, more precisely)
the margin of error is 2 standard deviations
the confidence interval is the $\hat{p}$ plus or minus the margin of error
we want to know the proportion of the population that favors a candidate
sample the population
- N = 100,000
- n = 100
- take multiple samples
- calc the proportion for all of the samples
- assume the mean of the sample proportions = the proportion of the population
- assume the sample proportions are normally distributed, the sampling distribution
- the standard deviation can be calculated by a formula
\begin{align} \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \\ \end{align}
calculate $\hat{p}$ as the proportion of the sample that favors your candidate
from that $\hat{p}$, n and 95%, calculate the margin of error
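The whole procedure as a sketch; the 56-of-100 poll result is hypothetical, and 2 is used as the approximate 95% z-value (1.96 exactly):

```python
import math

n = 100
p_hat = 56 / n   # hypothetical: 56 of 100 sampled voters favor the candidate

# standard deviation of the sampling distribution, estimated with p_hat
# because the true population proportion p is unknown
sigma_p_hat = math.sqrt(p_hat * (1 - p_hat) / n)

margin_of_error = 2 * sigma_p_hat   # 95% confidence ≈ 2 standard deviations
low, high = p_hat - margin_of_error, p_hat + margin_of_error
print(f"{p_hat:.2f} ± {margin_of_error:.3f} → ({low:.3f}, {high:.3f})")
```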
ratio, proportion, percentage, percentage proportion
ratio - comparison of two quantities
part-to-part ratio
proportion - equality of two ratios, can be used for interpolation
percentage - a fraction with 100 in the denominator
percentage proportion - an equality of two ratios where the second ratio has a denominator of 100
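A percentage proportion solved by cross-multiplication (numbers made up):

```python
# part/whole = x/100  →  x = 100 * part / whole
part, whole = 12, 48
x = 100 * part / whole
print(x)   # → 25.0, i.e. 12 is 25% of 48
```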
discrete, Binomial, two-point, bernoulli distributions
binomial - the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome:
- success (with probability p) or
- failure (with probability q = 1 − p).
A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment,
and a sequence of outcomes is called a Bernoulli process;
for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution.
The binomial distribution is the basis for the popular binomial test of statistical significance.
two-point - a random variable which can take 1 of 2 possible values
bernoulli - a two-point distribution where the two possible outcomes are 0 and 1
- a random variable which takes the value 0 or 1
- success, yes, true, one
- failure, no, false, zero
Random Variables
variable x = a single value
random variable X = takes one of many possible values (the results of experiments); used to calculate P(X)
Distribution
median, range, IQR
box-and-whisker plot
box plot vs dot plot
Frequency Distribution
dot plot
mean absolute deviation (MAD)
central tendency
- mean, aka arithmetic mean, $\mu$ = average
- median = middle number of rank ordered set
- mode = most frequent number
spread
- range = max - min
- interquartile range (IQR) = diff between top half median and bottom half median
- variance, sigma squared, $\sigma^2$ = $\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}$
- standard deviation, sigma, $\sigma$ = $\sqrt{\sigma^2}$
with a tight distribution, mean and standard deviation work best
with a skewed distribution, median and interquartile range work best
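The measures above, computed with the standard library on a small made-up data set; note how the outlier drags the mean away from the median, which is why median and IQR suit skewed data:

```python
import statistics as st

data = [1, 2, 2, 3, 12]   # right-skewed: one large outlier

print(st.mean(data))           # 4.0 — pulled toward the outlier
print(st.median(data))         # 2 — resistant to the outlier
print(st.mode(data))           # 2 — most frequent value
print(max(data) - min(data))   # range = 11
print(st.pstdev(data))         # population standard deviation σ
```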
do algebra on the variance formula
\begin{align} \mu &= \frac{\sum_{i=1}^{N}x_i}{N} &&\text{mean formula}\\ \sigma^2 &= \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N} &&\text{variance formula 1}\\ \sigma^2 &= \frac{\sum_{i=1}^{N}x_i^2}{N} - \mu^2 &&\text{variance formula 2}\\ \sigma^2 &= \frac{\sum_{i=1}^{N}x_i^2}{N} - \frac{\left [\sum_{i=1}^{N}x_i\right ]^2}{N^2} &&\text{variance formula 3}\\ \end{align}
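A numerical check that the three variance formulas agree (data made up):

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(data)
mu = sum(data) / N

v1 = sum((x - mu) ** 2 for x in data) / N                     # formula 1
v2 = sum(x ** 2 for x in data) / N - mu ** 2                  # formula 2
v3 = sum(x ** 2 for x in data) / N - sum(data) ** 2 / N ** 2  # formula 3
print(v1, v2, v3)   # → 4.0 4.0 4.0
```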
Population vs Sample
We can work with an entire population.
Or we can work with a sample from the population.
The math is the same except for two things.
- The symbols are different.
- For a population, we use
- $\mu$ for the mean,
- $\sigma$ for standard deviation, and
- $N$ for the number of points.
- For a sample, we use
- $\bar{x}$ for the mean,
- $s_x$ for standard deviation, and
- $n$ for the number of points.
- The divisor in the standard deviation formula is different.
- For a population, we divide by N.
- For a sample, we divide by n-1.
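The standard library implements both divisors; a quick illustration (data made up):

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(st.pstdev(data))   # population σ: divide by N   → 2.0
print(st.stdev(data))    # sample s_x: divide by n-1  → ≈ 2.138, always a bit larger
```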
Relative Frequency
y axis is percent instead of raw frequency count
Density Curve
Make histogram bars more and more narrow until the top becomes a line.
The data points can take on any value in a continuum, as opposed to being lumped into coarse buckets.
Area under the curve is 100%.
The curve will never go negative.
Measure the area under the density curve between two values, to get the percentage of data points falling between those two values. This can sometimes be estimated by calculating the area of the rectangle.
You cannot calculate the percent of a single value, because there is no rectangle. The line has no width.
Symmetric Density Curve
mean and median are equal, both cut the area under the curve in half
in the bell curve, the mode is also equal to the mean and median
picture a bimodal symmetric curve
the area under the left half is equal to the area under the right half
asymmetric density curve
aka skewed
mean will be towards the long tail from the median
long tail to the right ⇒ right-skewed distribution
and vice versa
Probability Distribution
Distribution
Frequency Distribution
Binomial Distribution
Percentile
Density Curve
Probability Distribution
Cumulative Probability Distribution
frequency - table of actual results
relative frequency - division to create percentage
probability = theoretical probability = relative frequency of the entire population
$$ probability = \frac{\text{number of successful outcomes}}{\text{number of possible outcomes}}$$
$$ relative frequency = \frac{\text{number of successful outcomes}}{\text{number of trials}}$$
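A simulation of the two formulas above: relative frequency approaches theoretical probability as the number of trials grows (die-rolling example, seed chosen for reproducibility):

```python
import random

random.seed(42)   # reproducible runs

probability = 1 / 6   # theoretical probability of rolling a six

trials = 100_000
successes = sum(1 for _ in range(trials) if random.randint(1, 6) == 6)
relative_frequency = successes / trials
print(relative_frequency)   # close to 1/6 ≈ 0.1667
```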
Binomial Random Variable
each trial has a Boolean result, i.e., success or failure
trial results are independent
fixed number of trials
same probability in each trial
Geometric Random Variable
number of trials until success
Binomial Distribution
uses binomial coefficients (factorials) in the formula
is a discrete function
Geometric Distribution
right-skewed
Random Variable: Binomial vs Geometric
| | Binomial | Geometric |
| --- | --- | --- |
| each trial has boolean result | x | x |
| trial results are independent | x | x |
| same probability in each trial | x | x |
| fixed number of trials | x | no |
Examples
Binomial: How many sixes in 12 rolls of the die?
Geometric: How many rolls of a die until one six?
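Both dice examples can be computed directly; a sketch of the two pmf formulas using only the standard library:

```python
from math import comb

p = 1 / 6   # probability of a six on one roll

def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(first success on trial k): k-1 failures, then a success."""
    return (1 - p) ** (k - 1) * p

print(binomial_pmf(2, 12, p))   # exactly two sixes in 12 rolls ≈ 0.296
print(geometric_pmf(3, p))      # first six on roll 3 ≈ 0.116
```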
Distribution
Look at past events and organize them into patterns which tell a story and allow us to understand how, when, and why things happen.
The graph of a distribution is a curve.
There are several kinds of distributions.
discrete vs continuous
distributions
- linear growth, errors, offsets
- normal, Gaussian
- exponential growth, prices, incomes, populations
- log-normal, a single quantity whose log is normally distributed
- Pareto, a single quantity whose log is exponentially distributed
- uniformly distributed
- discrete uniform distribution, for a finite set of values, coin, die
- continuous uniform distribution, for continuously distributed values
- Bernoulli trials
- Bernoulli distribution (success/failure, yes/no)
- Binomial distribution, number of successes in a fixed number of trials
- negative binomial distribution, number of failures before the r-th success
- geometric distribution, number of failures before the first success
- categorical outcomes (events with K possible outcomes)
- Poisson process (events that occur independently with a given rate)
- absolute values of vectors with normally distributed components
- sums of squares of normally distributed quantities (e.g., the chi-squared distribution)
- as conjugate prior distribution in Bayesian inference
- some specialized applications
Naive Bayes Classifier in AI is based on Bayes' Theorem in probability theory.
Normal Distribution
aka Bell Curve or Gauss Distribution.
Bell Curve
The _normal distribution_ curve is proportional to $$e^{\frac{-x^2}{2}}$$
When we actually have input data, we will use this equation. $$y = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Characteristics:
- The curve is symmetric about the y axis.
- The center portion is concave down, with one maximum point at the mean.
- To either side lies an inflection point (at $\mu \pm \sigma$) where the curve becomes concave up.
- The line stretches to the left and right, approaching the limit of zero.
- The area under the curve totals 1.0, the total probability of any prediction.
Median
Standard Deviation
Salman Khan uses problems from the ck12.org open-source FlexBook: AP Statistics
Empirical Rule: 68 - 95 - 99.7
- Values within plus-or-minus 1 standard deviation account for 68% of the universe.
- Values within plus-or-minus 2 standard deviations account for 95% of the universe.
- Values within plus-or-minus 3 standard deviations account for 99.7% of the universe.
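The 68-95-99.7 figures can be verified from the standard-normal CDF, written here via `math.erf`:

```python
from math import erf, sqrt

def phi(z):
    """Standard-normal cumulative probability Φ(z)."""
    return (1 + erf(z / sqrt(2))) / 2

# probability mass within ±1, ±2, ±3 standard deviations of the mean
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))   # → 0.6827, 0.9545, 0.9973
```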
Standard Normal Distribution
a distribution where
$$\mu = 0 \text{ and } \sigma = 1$$
z-score = number of standard deviations away from the mean
$$ \text{z-score} = \frac{x - \mu}{\sigma}$$
allows comparison of values on different scales and distributions, like LSAT and MCAT
Standard Normal Table, aka z-table - a table based on the Standard Normal Distribution
gives cumulative probability for any z-score
where cumulative probability is the area under the curve to the left of the z-score
How to find the cumulative probability of a value in the distribution.
- Calculate the z-score
- Look up the z-score in the standard normal table
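The two steps in code, with `math.erf` standing in for the z-table; the test-score numbers (μ = 500, σ = 100, score 650) are hypothetical:

```python
from math import erf, sqrt

def phi(z):
    """Standard-normal CDF — the value a z-table gives for z."""
    return (1 + erf(z / sqrt(2))) / 2

mu, sigma, x = 500, 100, 650   # hypothetical test-score distribution
z = (x - mu) / sigma           # step 1: calculate the z-score
print(z, round(phi(z), 4))     # step 2: → 1.5 0.9332
```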
Normal vs Pareto
Walter Scheidel, The Great Leveler (https://press.princeton.edu/books/paperback/9780691183251/the-great-leveler): the only things that can level wealth inequality are war, revolution, state collapse, and plague.
normal: IQ, conscientiousness, openness
Pareto: productivity, creative output
Price's Law - Derek de Solla Price, 1960s study: a vanishingly small proportion of scientists operating in a given domain produce half the output; a tiny number of people produce almost all of everything.
aka the square root law: the square root of the number of people in a domain produce half the output.
- if you have 10 employees, 3 of them produce half the output
- if you have 100 employees, 10 of them produce half the output
- if you have 10,000 employees, 100 of them produce half the output
The Matthew Principle: to those who have everything, more will be given; from those who have nothing, all will be taken. (An economic principle, borrowing language from the New Testament.)
IQ, conscientiousness, and openness are normally distributed and are good predictors of long-term performance, but creative output is NOT normally distributed.
2017 Personality 21: Biology & Traits: Performance Prediction https://www.youtube.com/watch?v=Q7GKmznaqsQ&t=0s
2017 Personality 18: Biology & Traits: Openness, Intelligence, Creativity https://www.youtube.com/watch?v=D7Kn5p7TP_Y&t=5393s
YouTube: Jordan Peterson - IQ and the Job Market
YouTube: Jordan Peterson - Controversial Facts about IQ
YouTube: Jordan Peterson - The Big IQ Controversy
The SAT, GRE, the LSAT, all of those are IQ tests. They are more crystallized than fluid.
IQ
- 145 to 160 - to be the best; 1 in 10,000
- 116 to 130 - 86th to 95th percentile
- 110 to 115 - 73rd to 85th percentile
- below 87 - no jobs
- below 83 - 10% of pop., cannot join the army
YouTube: Udacity: IQ score distribution

