Table of Contents
Notes in Progress
Resources to consume
- <https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi> 3blue1brown youtube channel with numerous videos on math with animations
- Neural Net Deep Learning 1, ft hot chick
- Neural Net Deep Learning 2, done
- Neural Net Deep Learning 3, Back Propagation, not public yet
- <http://neuralnetworksanddeeplearning.com> online html book
- <https://www.deepbootcamp.io/> website on Google DeepMind bootcamp on Reinforcement Learning
- <https://www.youtube.com/watch?v=bsuvM1jO-4w&feature=youtu.be> youtube from RL bootcamp
- <http://colah.github.io/> Christopher Olah's blog, multiple articles, visualizations, article on MNIST
- <https://distill.pub> Distill blog, co-edited by Christopher Olah
- <https://www.tensorflow.org/get_started/get_started> tensorflow getting started, 1/3 of the way thru the first tutorial
great thoughts
TensorFlow and Caffe
Theano, python package for machine learning supported by the Montreal Institute for Learning Algorithms (MILA) from 2010 to 2017. Eclipsed by TensorFlow and Caffe.
Caffe, python package for machine learning supported by Berkeley AI Research (BAIR).
Yoshua Bengio, Head of the Montreal Institute for Learning Algorithms (MILA)
Yangqing Jia (贾扬清), created Caffe at Berkeley, went to Google Brain and worked on TensorFlow, now at Facebook working on Caffe2 and ONNX. [ONNX is the attempt by Facebook and Microsoft to keep themselves relevant under the onslaught of TensorFlow.]
equation drawing with rmarkdown, latex http://www.statpower.net/Content/310/R%20Stuff/SampleMarkdown.html
python is installed here
C:\Users\John\AppData\Local\Programs\Python\Python36
language imprecision
In week 1, he uses the term “dimensional” incorrectly.
In week 2, Andrew Ng uses the term “multivariant linear regression” incorrectly.
m = number of rows in the training data
n = number of independent variables in each row
independent variable - aka predictor, aka feature
there is no such thing as a 4-dimensional vector. a vector is a 1-dimensional matrix and a 1-dimensional tensor; this vector has 4 thingies occurring in one dimension
elements of an array; members of a set
vector in physics vs vector in matrix math
vector = array = a list of things, members, elements, numbers
with only two features, the cost function can be represented by a surface or contour
feature scaling = normalization. get every feature into -1 < x < +1, or -3 < x < +3, or 0 < x < +1 (this last is called normalization); -0.5 < x < +0.5 is called mean normalization, where the mean value is zero
feature scaling improves the efficiency of gradient descent. demonstrated by the 2-variable case: if one variable has a range of 1 to 2000 and the other variable has a range of 1 to 5, the resulting contour will be long and skinny; gradient descent works much better if the contour is roughly circular, so the squared errors of each variable should be roughly similar in scale to one another
if feature scaling is not done to the data beforehand, it will be implicitly done by the weights; so by feature scaling up front, we give more meaning to the resulting weights
alpha = learning rate. try .003, .03, .3, 3. plot the curve of the cost function (y) by number of iterations (x); it should be the kind of curve that approaches an asymptote (a curve that never reaches 0 on either axis)
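A rough sketch of this in NumPy/matplotlib (the tiny dataset and the particular alphas shown converging are my own; with unscaled data, the larger alphas like .3 and 3 overshoot and diverge):

```python
# Sketch: batch gradient descent for one-feature linear regression,
# plotting cost J (y-axis) against iteration count (x-axis) for a few alphas.
# The dataset is made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

X = np.c_[np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]]   # bias column + one feature
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])
m = len(y)

def cost(theta):
    r = X @ theta - y
    return (r @ r) / (2 * m)

def descend(alpha, iters=100):
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m   # gradient step
        history.append(cost(theta))
    return history

for alpha in (0.003, 0.03, 0.1):      # 0.3 and 3 diverge on this unscaled data
    plt.plot(descend(alpha), label=f"alpha = {alpha}")
plt.xlabel("iterations")
plt.ylabel("cost J")
plt.legend()
plt.show()
```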
polynomial regression: using a curve instead of a line (quadratic, cubic, etc.); same use of least squares and gradient descent. feature scaling becomes important because of the squares and cubes in the equation, which increase the artificial weight of each feature
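A quick polynomial-fit sketch using NumPy's polyfit (my choice of tool, not the course's; the data points are made up to lie near y = 2x^2 + 1):

```python
# Sketch: polynomial (quadratic) regression by least squares with NumPy.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 32.8])   # made-up data, roughly y = 2x^2 + 1

coeffs = np.polyfit(x, y, deg=2)            # least-squares fit of a degree-2 polynomial
predict = np.poly1d(coeffs)

print(coeffs)        # approximately [2, 0, 1]
print(predict(5.0))  # extrapolate to x = 5
```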
what is the difference between regression and neural net? in a neural net, each neuron implements a regression analysis
the normal equation
in the beginning Andrew Ng uses the notation X_{ij}, and later X^{(i)}_j, for an element in a matrix
regression analysis
linear regression, non-linear regression, fitting data to a line or curve, least squares, normal equation, gradient descent
1809: Carl Friedrich Gauss formalized the least squares method. Many predecessors had worked through the 1700s refining the concept of minimizing the total error between observations and predictions of planetary orbits. The word “regression” was not used until the 1920s, when it was first applied to the process of using the least squares method to fit data to a line or curve.
https://en.wikipedia.org/wiki/Gallery_of_curves
a parabola has two x's for every y. what about a curve that has two y's for every x? the function f(x) would have two results for every x, like a circle, for example. how do you graph a circle? x^2 + y^2 = r^2
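One way to graph a curve that is not a function of x is the parametric form; a minimal matplotlib sketch (my own example):

```python
# Sketch: graphing a circle of radius r, which is not a function y = f(x),
# using the parametric form x = r*cos(t), y = r*sin(t).
# (Equivalently, plot the two branches y = +sqrt(r^2 - x^2) and y = -sqrt(r^2 - x^2).)
import numpy as np
import matplotlib.pyplot as plt

r = 1.0
t = np.linspace(0, 2 * np.pi, 200)
plt.plot(r * np.cos(t), r * np.sin(t))
plt.gca().set_aspect("equal")   # keep the circle round
plt.show()
```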
if the data matches an algebraic curve, non-linear data can sometimes be solved with either linear or nonlinear regression; for the former, we fudge (transform) the data until it fits a straight line (see the sketch after the links below)
linear, nonlinear, curvilinear
http://blog.minitab.com/blog/adventures-in-statistics-2/what-is-the-difference-between-linear-and-nonlinear-equations-in-regression-analysis
http://blog.minitab.com/blog/adventures-in-statistics-2/curve-fitting-with-linear-and-nonlinear-regression
https://en.wikipedia.org/wiki/Curvilinear_coordinates
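A sketch of the "fudge it until it fits a straight line" idea: made-up exponential data is linearized by taking the log of y, then fitted with ordinary straight-line least squares:

```python
# Sketch: fit y = a * exp(b*x) by transforming to log(y) = log(a) + b*x,
# then doing ordinary linear least squares on the transformed data.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * np.exp(0.5 * x)                    # made-up noiseless exponential data

b, log_a = np.polyfit(x, np.log(y), deg=1)   # straight-line fit in log space
a = np.exp(log_a)
print(a, b)                                  # recovers roughly a = 2.0, b = 0.5
```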
worked example of feature scaling:
x = 5184
range = max - min = 8836 - 4761 = 4075
mean = (7921 + 5184 + 8836 + 4761) / 4 = 6675.5
scaled = x / range = 5184 / 4075 = 1.2721472392638036
mean normalized = (x - mean) / range = (5184 - 6675.5) / 4075 ≈ -0.366
scaled min = 4761 / 4075 ≈ 1.168, scaled max = 8836 / 4075 ≈ 2.168
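The same arithmetic as a small NumPy sketch (numbers taken from the note above):

```python
# Sketch: mean normalization of one feature value, using the numbers above.
import numpy as np

values = np.array([7921.0, 5184.0, 8836.0, 4761.0])
x = 5184.0

rng = values.max() - values.min()       # 8836 - 4761 = 4075
mean = values.mean()                    # 6675.5

scaled = x / rng                        # about 1.272
mean_normalized = (x - mean) / rng      # about -0.366

print(rng, mean, scaled, mean_normalized)
```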
this is the Octave command to calculate theta using the normal equation: theta = pinv(X'*X)*X'*y
feature scaling is NOT necessary with the normal equation method
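The same formula in NumPy (a sketch; the rows are example housing-style numbers, and X is assumed to already contain a leading column of ones for the intercept):

```python
# Sketch: normal equation theta = pinv(X' * X) * X' * y in NumPy.
import numpy as np

X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 3.0],
              [1.0, 1416.0, 2.0]])              # example (bias, size, bedrooms) rows
y = np.array([400.0, 330.0, 369.0, 232.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y       # no feature scaling needed here
print(theta)
```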
Sigmoid - old school, replaced by ReLU
ReLU, Rectified Linear Unit function: a straight horizontal line up to a point, then an angle up
source minute 17:00 of https://www.youtube.com/watch?v=aircAruvnKk
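The two activation functions as a quick Python sketch:

```python
# Sketch: the sigmoid (older) and ReLU activation functions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes any input into (0, 1)

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, identity for positive

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))
```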
Notes
“Moreover, the holders of this much data remain in the hands of the private sector in the big six: Amazon, Facebook, Google, Microsoft, Apple and Baidu.”
—-
GAN Generative Adversarial Network, see Wikipedia
used to generate additional data. more data helps the training. “augmented data”
—–
Datasets
News
The web
A crawler
Output of a crawler, which links proved fruitful
Pandora music
Build an AI that can create data sets
Datasets to build
Map shapes
Treasures
Crawler - predict which links to follow
Teach an AI to play plunder
Let the AI identify boring vs addictive game play
How does OCR work? Pattern matching
How to train a human
Human gets tired
Machine does not
Need efficient training techniques
How is a glossary done in Wikipedia?
Look at everything humans do and figure out how to do it by machine.
Massage data into a training set.
Decide what to do, vs how to do it.
Technology
Logic
Prayer
Matrix
Variables × weights
Vector, matrix. Now what is a tensor?
—-
https://www.technocracy.news/index.php/2017/08/27/147-transnational-companies-run-world/
Find the Illuminati
Draw the hierarchy of humans
Hold them to account
Keep the population under control
Keep the environment conducive to human life
Really? But what about the future life forms?
Attempts to freeze a moment of happiness, to create a Utopia, backfire by repressing “natural” development.
Nothing needs to be done. Everything is perfect.
Enjoy.
Be.
—-
Use AI to geocode and chronocode (timestamp) articles
to find historical maps, georeference, break into shapes, chronocode, timestamp, label
so that AGI can put human population movements into historical perspective
AGI needs historical perspective.
Therefore, all data must be timestamped.
all data must be timestamped and geocoded
perhaps the AGI will be able to figure out why people tweet
—-
Game-playing API.
Create multiple clones of itself. Have the clones play against each other in leagues and tournaments. Observe the different personalities and styles of play that develop. Bring the learning of all the clones back into itself, delete the clones, and go defeat the humans.
Meta learning. Use one neural network to create and test multiple neural networks to pick the best one. https://arxiv.org/abs/1703.01041
Sense of self
Ego
Metaphysics
Tools
Tools become machines
Machines become intelligent
Child supersedes the father
Mimic the brain
Mimic natural selection
As of 2016, AlphaGo's algorithm uses a combination of machine learning and tree search techniques,
combined with extensive training, both from human and computer play.
It uses Monte Carlo tree search, guided by a “value network” and a “policy network,”
both implemented using deep neural network technology.[2][10]
A limited amount of game-specific feature detection pre-processing
(for example, to highlight whether a move matches a nakade pattern)
is applied to the input before it is sent to the neural networks.
from Wikipedia: AlphaGo
In uncharted territory—where one would expect learning to be most beneficial—an agent must be able to learn from its own experience.
An endless proliferation of 3D environments: In the past couple of years there have been a bunch of new large-scale AI-training environments released ranging from Microsoft's Minecraft-based Malmo to DeepMind's Quake-based 'DeepMind Lab', to the Doom-based VizDoom.
three types of machine learning: supervised, unsupervised, reinforcement
applications: robotics, game playing, operations management in a factory
agent in an environment: reward, state, action
explore (new territory or strategy) vs exploit (what it already knows)
Markov Decision Process
goal: maximize reward over time
policy: the agent's behavior function; optimize reward over time
value function: how good is each state and/or action
model: the agent's representation of the environment
Q-learning: r(s,a) = immediate reward, Q(s,a) = learned values (see the sketch below)
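A minimal tabular Q-learning sketch on a made-up 5-state chain (the environment, rewards, and hyperparameters are all invented for illustration; only the Q(s,a) update rule is the standard one):

```python
# Sketch: tabular Q-learning on a made-up 5-state chain environment.
# Actions: 0 = move left, 1 = move right. Reward 1.0 for reaching the last state.
import random

N_STATES = 5
ACTIONS = (0, 1)
alpha, gamma, epsilon = 0.1, 0.9, 0.5      # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # the Q(s, a) table

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0   # immediate reward r(s, a)
    return s_next, r

for episode in range(300):
    s = 0
    for _ in range(100):                   # cap the episode length
        # explore (try something new) or exploit (use what it already knows)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s_next, r = step(s, a)
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
        if s == N_STATES - 1:
            break

print(Q)   # "move right" should end up with the higher value in every state
```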
features defined by an expert vs. deep learning, which discovers features on its own
https://deepmind.com/blog/deep-reinforcement-learning/
https://en.wikipedia.org/wiki/Markov_decision_process#Reinforcement_learning
Kurzweil, Ray (2005). The Singularity is Near. New York: Penguin Group. ISBN 9780715635612.
Is there some kind of thinking that is not programmed?
evolutionary algorithm = genetic algorithm
3 parts; loop through these three steps:
selection, crossover, mutation
population breeds; the fittest survive
genome - something we want to breed and improve over time
multiple genes; gene = parameter
selection
fitness function decides which genomes can breed
crossover
breeding offspring, new population
mutation
mutate the new population; the mutation function randomly modifies parameters
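A minimal sketch of the selection, crossover, and mutation loop (the genome, fitness function, and all numbers here are made up for illustration):

```python
# Sketch: selection -> crossover -> mutation loop of a genetic algorithm.
# Genome: a list of parameters (genes); fitness: made-up target matching.
import random

TARGET = [3.0, -1.0, 2.0, 0.5]             # made-up "ideal" genome

def fitness(genome):
    # higher is better: negative squared distance from the target
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def crossover(mom, dad):
    cut = random.randrange(1, len(mom))     # single-point crossover
    return mom[:cut] + dad[cut:]

def mutate(genome, rate=0.1):
    # mutation: randomly nudge some parameters
    return [g + random.gauss(0, 0.5) if random.random() < rate else g for g in genome]

population = [[random.uniform(-5, 5) for _ in TARGET] for _ in range(50)]
for generation in range(100):
    # selection: the fitness function decides which genomes can breed
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    # crossover + mutation: breed offspring to form the new population
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(50)]

print(max(population, key=fitness))         # should end up close to TARGET
```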
since the 80s; an alternative to back propagation and an alternative to reinforcement learning; neural network zoom; “evolution strategies as an alternative to reinforcement learning”
classification of environments
Deterministicness (deterministic, stochastic, or non-deterministic): An environment is deterministic if the next state is perfectly predictable given knowledge of the previous state and the agent's action; or “strategic” if there is one other agent, like in a chess game.
Staticness (static or dynamic): Static environments do not change while the agent deliberates.
Observability (full or partial): A fully observable environment is one in which the agent has access to all information in the environment relevant to its task.
Agency (single or multiple): If there is at least one other agent in the environment, it is a multi-agent environment. Other agents might be apathetic, cooperative, or competitive.
Knowledge (known or unknown): An environment is considered to be “known” if the agent understands the laws that govern the environment's behavior. For example, in chess, the agent would know that when a piece is “taken” it is removed from the game. On a street, the agent might know that when it rains, the streets get slippery.
Episodicness (episodic or sequential): Sequential environments require memory of past actions to determine the next best action. Episodic environments are a series of one-shot actions, and only the current (or recent) percept is relevant. An AI that looks at radiology images to determine if there is a sickness is an example of an episodic environment. One image has nothing to do with the next.
Discreteness (discrete or continuous): A discrete environment has fixed locations or time intervals. A continuous environment could be measured quantitatively to any level of precision.
Simulated: a separate program is used to simulate an environment, feed percepts to agents, evaluate performance, etc.
feed forward NN: input, hidden, and output layers, with two weight matrices between them
input times weight, bias, activate
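A sketch of one forward pass (input to hidden to output, two weight matrices, a bias per layer; all shapes and values are made up):

```python
# Sketch: one forward pass of a feed-forward net: input -> hidden -> output.
# Two weight matrices (input-to-hidden, hidden-to-output), a bias per layer,
# and an activation after the hidden layer. Sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input vector (3 features)

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input-to-hidden weights and bias
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden-to-output weights and bias

relu = lambda z: np.maximum(0.0, z)

hidden = relu(W1 @ x + b1)                      # input times weight, plus bias, activate
output = W2 @ hidden + b2                       # output layer (no activation here)
print(output)
```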
RNN: for sequential data, like video or audio; a third weight matrix connects the hidden layer back to itself
vanishing gradient problem; exploding gradient problem
an RNN cell can be replaced with an LSTM cell
x * weight plus bias, activate
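A sketch of a single RNN step, showing the extra hidden-to-hidden weight matrix (shapes and values are made up):

```python
# Sketch: one step of a simple RNN cell. W_x maps the input; W_h is the third
# weight matrix that feeds the previous hidden state back in, which is how the
# cell "remembers" its previous state. Shapes are made up.
import numpy as np

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3

W_x = rng.normal(size=(hidden_size, input_size))
W_h = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # x times weight, plus bias, activate -- with the previous hidden state mixed in
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):         # a made-up sequence of 5 inputs
    h = rnn_step(x_t, h)
print(h)
```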
common activation functions: sigmoid, tanh, ReLU
RNNs are used for translation and for question-and-answer chatbots
an RNN cell remembers its previous state
blog comments: classification by sentiment: positive, negative, neutral