Statistics

See also Probability, where we try to predict future events based on statistics.

Parent Statistics and Probability.

Tell a story: good or bad. Stretch or squeeze the axes.

sports stats
nba.com

Data

collecting
analyzing
presenting

Variability

0,0,0,0,0
0,235,17,5,318

Statistical Questions: to answer, you need to collect data with variability

Population and Sample

subset
Simple random sample - random number generator, for example
Stratified sample - by age or gender
Clustered sample - randomly choose clusters (classrooms in school), then take the whole cluster
Voluntary - likely to introduce bias, non-random
Convenience - non-random
Block design - variation of stratified sample
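
The sampling methods above can be sketched in a few lines. This is a minimal illustration, not a survey tool: the 12-student population, the field names `classroom` and `age_group`, and the helper function names are all made up for the example.

```python
import random

# Hypothetical population: 12 students, each tagged with a classroom and an age group.
population = [
    {"id": i, "classroom": "ABC"[i % 3], "age_group": "young" if i < 6 else "old"}
    for i in range(12)
]

rng = random.Random(0)  # fixed seed so the run is repeatable


def simple_random_sample(items, k):
    """Pick k items; every item is equally likely to be chosen."""
    return rng.sample(items, k)


def stratified_sample(items, key, k_per_stratum):
    """Split items into strata by `key`, then sample k from each stratum."""
    strata = {}
    for item in items:
        strata.setdefault(item[key], []).append(item)
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, k_per_stratum))
    return picked


def cluster_sample(items, key, n_clusters):
    """Randomly choose whole clusters and keep every member of each one."""
    clusters = sorted({item[key] for item in items})
    chosen = set(rng.sample(clusters, n_clusters))
    return [item for item in items if item[key] in chosen]


srs = simple_random_sample(population, 4)
strat = stratified_sample(population, "age_group", 2)
clus = cluster_sample(population, "classroom", 1)
```

The stratified sample always contains both age groups; the cluster sample contains one entire classroom and nobody else.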

Bias

voluntary response sampling (non random)
response bias (persons do not want to answer truthfully)
undercoverage (missing certain groups)
convenience sampling
nonresponse (not everyone sampled actually responds)
wording of a survey (influences the answers)

Correlation vs causality

types of statistical studies

Sample Study - select sample from population
Observational Study - looking for correlation, not causality
Experiment - look for causality, use control group

“margin of error”

Experiment Design

explanatory variable
response variable
blind
double-blind
triple-blind, people analyzing the data don't know which is which
matched pairs experiment, switch the control and treatment groups after a period of time

Correlation Coefficient (r): Calculated as

\begin{align} r = \frac{1}{n-1} \sum \left ( \frac{x_i - \overline{x} }{s_x} \right ) \left ( \frac{y_i - \overline{y} }{s_y} \right ) \end{align}
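
A direct translation of the formula into Python, as a sketch. The data points are invented for illustration; `stdev` from the standard library is the sample standard deviation (divisor n-1), which is what $s_x$ and $s_y$ stand for here.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation coefficient r: average product of the z-scores, divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


xs = [1, 2, 3, 4, 5]          # made-up data
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)       # roughly 0.77 for this data
```

Points that lie exactly on a rising line give r = 1; on a falling line, r = -1.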

conventions

\begin{align} y &= \text{actual } y \\ \hat{y} &= \text{predicted } y \\ \overline{y} &= \text{mean of all } y\text{'s} \end{align}

Coefficient of Determination

Represented by r-squared.

What percentage of the total variation in y is described by the variation in x?

What percentage of the variation is described by the line?

How good is the line as a predictor?

Total variation is the squared error of y from the mean of y.

\begin{align} r^2 &= 1 - \frac{SE_{LINE}}{SE_{\bar{y}}} \\ SE_{LINE} &= \text{variation not described by the line (squared error from the line)} \\ SE_{\bar{y}} &= \text{total variation in y from the mean} \end{align}

Squared error for the line, the lower the value, the better the fit.

The coefficient of determination normalizes the squared error by expressing it as a fraction of the total variation, so it reads as a percentage (it is not a probability).

The smaller the squared error, the higher the coefficient, and the better the line fits the data.
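
A small worked example of r-squared, assuming the line is the ordinary least-squares fit. The data and the helper names `fit_line` and `r_squared` are illustrative only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return slope, y_bar - slope * x_bar


def r_squared(xs, ys):
    """1 minus (squared error from the line) / (squared error from the mean of y)."""
    slope, intercept = fit_line(xs, ys)
    y_bar = mean(ys)
    se_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    se_mean = sum((y - y_bar) ** 2 for y in ys)
    return 1 - se_line / se_mean


xs = [1, 2, 3, 4, 5]          # made-up data
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)   # 0.6 and 2.2 for this data
r2 = r_squared(xs, ys)                # 0.6: the line describes 60% of the variation in y
```

For this data the squared error from the line is 2.4 and the total squared error from the mean is 6, so r-squared is 1 - 2.4/6.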

Root mean square error (RMSE), also called root mean square deviation (RMSD): standard deviation of the residuals

Covariance: expected value can be

arithmetic mean
probability weighted sum, or probability weighted integral (in a continuous distribution)

Coefficient: can mean multiplier, factor, scalar

Confidence Intervals

Khan Unit: Confidence Intervals

95% confidence based on 2 standard deviations

the margin of error is 2 standard deviations

the confidence interval is the $\hat{p}$ plus or minus the margin of error

we want to know the proportion of the population that favors a candidate

sample the population

N = 100,000
n = 100
take multiple samples
calc the proportion for all of the samples
assume the mean of the sample proportions = the proportion of the population
assume the sample proportions are normally distributed, the sampling distribution
the standard deviation can be calculated by a formula

\begin{align} \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \\ \end{align}

calculate $\hat{p}$ as the proportion of the sample that favors your candidate

from that $\hat{p}$, n and 95%, calculate the margin of error
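
The calculation can be sketched as follows. The poll result (54 of 100 in favor) is a made-up number, and $\hat{p}$ stands in for the unknown $p$ inside the standard deviation formula, which is an approximation. The multiplier 2 follows the "2 standard deviations" rule of thumb above.

```python
from math import sqrt


def proportion_confidence_interval(successes, n, z=2.0):
    """Return (p_hat, margin of error, interval) for a sample proportion."""
    p_hat = successes / n
    std_err = sqrt(p_hat * (1 - p_hat) / n)  # p_hat used in place of the unknown p
    margin = z * std_err                     # about 95% confidence when z = 2
    return p_hat, margin, (p_hat - margin, p_hat + margin)


# Hypothetical sample: 54 of n = 100 people favor the candidate.
p_hat, margin, interval = proportion_confidence_interval(54, 100)
```

Here the margin of error comes out near 10 percentage points, so the interval includes 50% and the sample alone does not settle who is ahead.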

ratio, proportion, percentage, percentage proportion

ratio - comparison of two quantities

part-to-part ratio

proportion - equality of two ratios, can be used for interpolation

percentage - a fraction with 100 in the denominator

percentage proportion - an equality of two ratios where the second ratio has a denominator of 100

discrete, Binomial, two-point, bernoulli distributions

binomial - the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome:

success (with probability p) or
failure (with probability q = 1 − p).

A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment,

and a sequence of outcomes is called a Bernoulli process;

for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution.

The binomial distribution is the basis for the popular binomial test of statistical significance.

two-point - a random variable which can take 1 of 2 possible values

bernoulli - a two-point distribution where the two possible outcomes are 0 and 1

a random variable which takes the value 0 or 1
success, yes, true, one
failure, no, false, zero

Random Variables

variable x = a single value

random variable X = a variable whose value is the numerical result of a random experiment; it can take many values, each with a probability P(X = x)

Distribution

median, range, IQR

box-and-whisker plot

box plot vs dot plot

Frequency Distribution

dot plot

mean absolute deviation (MAD)

central tendency

mean, aka arithmetic mean, $\mu$ = average
median = middle number of rank ordered set
mode = most frequent number

spread

range = max - min
interquartile range (IQR) = diff between top half median and bottom half median
variance, sigma squared, $\sigma^2$ = $\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}$
standard deviation, sigma, $\sigma$ = $\sqrt{\sigma^2}$

with a tight distribution, mean and standard deviation work best

with a skewed distribution, median and interquartile range work best
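
A quick illustration of why median and IQR suit skewed data. Both data sets are invented; the second is the first with one extreme value. The `iqr` helper follows the definition used in these notes (median of the top half minus median of the bottom half) and assumes at least two values; other quartile conventions exist.

```python
from statistics import mean, median


def iqr(values):
    """Median of the top half minus median of the bottom half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]  # skips the middle value when the count is odd
    return median(upper) - median(lower)


symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]
skewed = [4, 5, 5, 6, 6, 6, 7, 7, 80]   # same data, one extreme value

mean_shift = mean(skewed) - mean(symmetric)        # the mean moves from 6 to 14
median_shift = median(skewed) - median(symmetric)  # the median does not move
iqr_shift = iqr(skewed) - iqr(symmetric)           # the IQR does not move
```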

do algebra on the variance formula

\begin{align} \mu &= \frac{\sum_{i=1}^{N} x_i}{N} &&\text{mean formula}\\ \sigma^2 &= \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N} &&\text{variance formula 1}\\ \sigma^2 &= \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2 &&\text{variance formula 2}\\ \sigma^2 &= \frac{\sum_{i=1}^{N} x_i^2}{N} - \frac{\left [\sum_{i=1}^{N} x_i\right ]^2}{N^2} &&\text{variance formula 3}\\ \end{align}
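
A numeric check that the three variance formulas agree. The data set is made up; the function names simply mirror the formula labels.

```python
def variance_1(xs):
    """Mean of squared deviations from the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


def variance_2(xs):
    """Mean of the squares minus the square of the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x ** 2 for x in xs) / n - mu ** 2


def variance_3(xs):
    """Same as formula 2 with the mean written out as a sum."""
    n = len(xs)
    return sum(x ** 2 for x in xs) / n - sum(xs) ** 2 / n ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data: mean 5, variance 4
results = [variance_1(data), variance_2(data), variance_3(data)]
```

Formulas 2 and 3 need only running totals of x and x squared, so they can be computed in one pass over the data, though they can lose precision in floating point when the mean is large relative to the spread.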

Population vs Sample

We can work with an entire population.
Or we can work with a sample from the population.
The math is the same except for two things.

The symbols are different.
- For a population, we use
  - $\mu$ for the mean,
  - $\sigma$ for standard deviation, and
  - $N$ for the number of points.
- For a sample, we use
  - $\bar{x}$ for the mean,
  - $s_x$ for standard deviation, and
  - $n$ for the number of points.
The divisor in the standard deviation formula is different.
- For a population, we divide by N.
- For a sample, we divide by n-1.
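
Python's standard library has both versions: `statistics.pstdev` divides by N (population) and `statistics.stdev` divides by n-1 (sample). A short comparison on made-up data:

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data, sum of squared deviations = 32

sigma = pstdev(data)  # treats data as the whole population: sqrt(32 / 8)
s_x = stdev(data)     # treats data as a sample: sqrt(32 / 7)
```

The sample version is always a little larger, and the gap shrinks as n grows.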

Relative Frequency

y axis is percent instead of raw frequency count

Density Curve

Make histogram bars more and more narrow until the top becomes a line.

The data points can take on any value in a continuum, as opposed to being lumped into coarse buckets.

Area under the curve is 100%.

The curve will never go negative.

Measure the area under the density curve between two values, to get the percentage of data points falling between those two values. This can sometimes be estimated by calculating the area of the rectangle.

You cannot calculate the percent of a single value, because there is no rectangle. The line has no width.
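
The narrow-rectangle idea can be sketched numerically. This example assumes a normal density as the curve and sums thin midpoint rectangles; the function names and the rectangle count are arbitrary choices for illustration.

```python
from math import exp, pi, sqrt


def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the normal density curve at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def area_between(density, a, b, rectangles=10_000):
    """Approximate the area under `density` from a to b with thin rectangles."""
    width = (b - a) / rectangles
    heights = (density(a + (i + 0.5) * width) for i in range(rectangles))
    return sum(heights) * width


share = area_between(normal_density, -1.0, 1.0)   # close to 0.68
single = area_between(normal_density, 1.0, 1.0)   # zero width, so zero area
```

The second call shows the point about single values: with no width there is no area, whatever the height of the curve.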

Symmetric Density Curve

mean and median are equal, both cut the area under the curve in half

in the bell curve, the mode is also equal to the mean and median

picture a bimodal symmetric curve

the area under the left half is equal to the area under the right half

asymmetric density curve

aka skewed

the mean is pulled toward the long tail, away from the median

long tail to the right ⇒ right-skewed distribution

and vice versa

Probability Distribution

Distribution
Frequency Distribution
Binomial Distribution
Percentile
Density Curve
Probability Distribution
Cumulative Probability Distribution

frequency - table of actual results

relative frequency - division to create percentage

probability = theoretical probability = relative frequency of the entire population

$$ \text{probability} = \frac{\text{number of successful outcomes}}{\text{number of possible outcomes}}$$

$$ \text{relative frequency} = \frac{\text{number of successful outcomes}}{\text{number of trials}}$$
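
A simulation sketch comparing the two quantities for rolling a six on a fair die. The trial count and seed are arbitrary; the point is only that the relative frequency settles near the theoretical probability as the number of trials grows.

```python
import random


def relative_frequency_of_six(trials, seed=0):
    """Roll a simulated fair die `trials` times; return the share of sixes."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    sixes = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)
    return sixes / trials


theoretical = 1 / 6                             # 1 successful outcome of 6 possible
observed = relative_frequency_of_six(60_000)    # successes divided by trials
```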

Binomial Random Variable

each trial has boolean result, ie, success or failure

trial results are independent

fixed number of trials

same probability in each trial

Geometric Random Variable

number of trials until success

Binomial Distribution

uses factorial in the formula

is a discrete function

Geometric Distribution

right-skewed

Random Variable: Binomial vs Geometric

	Binomial	Geometric
each trial has boolean result	x	x
trial results are independent	x	x
same probability in each trial	x	x
fixed number of trials	x	not

Examples

Binomial: How many sixes in 12 rolls of the die?

Geometric: How many rolls of a die until one six?
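
Both examples can be computed directly, assuming a fair die. The geometric function below uses the "number of trials until the first success" convention (k = 1, 2, 3, ...); the specific questions (exactly 2 sixes, first six on roll 3) are chosen only for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def geometric_pmf(k, p):
    """P(first success happens on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p


p_six = 1 / 6
two_sixes_in_12 = binomial_pmf(2, 12, p_six)    # fixed number of trials
first_six_on_roll_3 = geometric_pmf(3, p_six)   # (5/6) * (5/6) * (1/6) = 25/216
```

The binomial probabilities for k = 0 to 12 sum to 1, while the geometric probabilities keep going for every k and only approach 1 in total.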

Distribution

Look at past events and organize them into patterns which tell a story and allow us to understand how, when, and why things happen.

The graph of a distribution is a curve.

There are several kinds of distributions.

discrete vs continuous

distributions

linear growth, errors, offsets
- normal, Gaussian
exponential growth, prices, incomes, populations
- log-normal, a single quantity whose log is normally distributed
- Pareto, a single quantity whose log is exponentially distributed
uniformly distributed
- discrete uniform distribution, for a finite set of values, coin, die
- continuous uniform distribution, for continuously distributed values
Bernoulli trials
- Bernoulli distribution (success/failure, yes/no)
- Binomial distribution, ?
- negative binomial distribution,
- geometric distribution, number of failures before the first success
categorical outcomes (events with K possible outcomes)
Poisson process (events that occur independently with a given rate)
absolute values of vectors with normally distributed components
normally distributed quantities operated with sum of squares
as conjugate prior distribution in Bayesian inference
some specialized applications

Naive Bayes Classifier in AI is based on Bayes' Theorem in probability theory.

Wikipedia: Probability distribution

Normal Distribution

aka Bell Curve or Gaussian Distribution.

Bell Curve

The basic shape of the _normal distribution_ is given by the equation $$y = e^{\frac{-x^2}{2}}$$

When we actually have input data, we will use this equation. $$y = \frac{1}{\sigma \sqrt{2\pi }}e^{-\frac{(x-\mu )^2}{2\sigma ^2}}$$

Characteristics:

The curve is symmetric about the mean (the y axis when $\mu = 0$).
The center portion is dome-shaped (concave down) and has one maximum point.
To either side, one standard deviation from the mean, lies an inflection point where the curve becomes concave up.
The curve stretches to the left and right, approaching the limit of zero.
The area under the curve totals 1.0, the total probability of all possible outcomes.

Median

Standard Deviation

Salman Khan uses problems from the ck12.org open source flex book: AP Statistics

Empirical Rule: 68 - 95 - 99.7

Values within plus-or-minus 1 standard deviation account for about 68% of the universe.
Values within plus-or-minus 2 standard deviations account for about 95% of the universe.
Values within plus-or-minus 3 standard deviations account for about 99.7% of the universe.
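
The three percentages can be checked with the error function from the standard library, assuming an exactly normal distribution. The helper name `share_within` is just for this example.

```python
from math import erf, sqrt


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))


shares = [share_within(k) for k in (1, 2, 3)]   # roughly 0.683, 0.954, 0.997
```

The rule's 95% is a rounding: 2 standard deviations actually cover slightly more than 95%.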

Standard Normal Distribution

a distribution where

$$\mu = 0 \text{ and } \sigma = 1$$

z-score = number of standard deviations away from the mean

$$ \text{z-score} = \frac{x - \mu}{\sigma}$$

allows comparison of values on different scales and distributions, like LSAT and MCAT

Standard Normal Table, aka z-table - a table based on the Standard Normal Distribution

gives cumulative probability for any z-score

where cumulative probability is the area under the curve to the left of the z-score

How to find the cumulative probability of a value in the distribution.

Calculate the z-score
Look up the z-score in the standard normal table
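
The two steps can be sketched with `statistics.NormalDist`, which plays the role of the z-table here. The score, mean, and standard deviation in the example are hypothetical numbers, not taken from any real test.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def cumulative_probability(x, mu, sigma):
    """Area under the normal curve to the left of x."""
    z = z_score(x, mu, sigma)
    return NormalDist().cdf(z)  # standard normal: mu = 0, sigma = 1


# Hypothetical test: mean 100, standard deviation 15, score 130.
z = z_score(130, 100, 15)                      # 2 standard deviations above the mean
cum = cumulative_probability(130, 100, 15)     # a little under 0.98
```

Because the lookup uses only the z-score, two scores from different tests can be compared by comparing their z-scores.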

Normal vs Pareto

Walter Scheidel: The Great Leveler https://press.princeton.edu/books/paperback/9780691183251/the-great-leveler

the only things that can level wealth inequality are: war, revolution, state collapse, plague

normal: IQ, conscientiousness, openness

Pareto: productivity, creative output

Price's Law

Derek de Solla Price, 1960s study: a vanishingly small proportion of scientists operating in a given domain produce half the output.

A tiny number of people produce almost all of everything.

aka the square root law: the square root of the number of people in a domain produce half the output

if you have 10 employees, 3 of them produce half the output
if you have 100 employees, 10 of them produce half the output
if you have 10,000 employees, 100 of them produce half the output

The Matthew Principle

To those who have everything, more will be given. From those who have nothing, all will be taken. (An economic principle, borrowing language from the New Testament.)

IQ, conscientiousness, and openness are normally distributed and are good predictors of long-term performance, but creative output is NOT normally distributed.

2017 Personality 21: Biology & Traits: Performance Prediction https://www.youtube.com/watch?v=Q7GKmznaqsQ&t=0s

2017 Personality 18: Biology & Traits: Openness, Intelligence, Creativity https://www.youtube.com/watch?v=D7Kn5p7TP_Y&t=5393s

YouTube: Jordan Peterson - IQ and the Job Market

YouTube: Jordan Peterson - Controversial Facts about IQ

YouTube: Jordan Peterson - The Big IQ Controversy

The SAT, GRE, the LSAT, all of those are IQ tests. They are more crystallized than fluid.

IQ

145 to 160 - to be the best. 1 in 10,000
116 to 130 - 95-86 percentile
115-110 - 85-73 percentile
below 87 - no jobs
below 83 - 10% of pop., cannot join the army

YouTube: Udacity: IQ score distribution

Python Programming for Statistics and Probability

spreadsheet of usa deaths 2015 to 2020
