Table of Contents
Notes in Progress
Resources to consume
- <https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi> 3blue1brown youtube channel with numerous videos on math with animations
- Neural Net Deep Learning 1, ft hot chick
- Neural Net Deep Learning 2, done
- Neural Net Deep Learning 3, Back Propagation, not public yet
- <http://neuralnetworksanddeeplearning.com> online html book
- <https://www.deepbootcamp.io/> website on Google DeepMind bootcamp on Reinforcement Learning
- <https://www.youtube.com/watch?v=bsuvM1jO-4w&feature=youtu.be> youtube from RL bootcamp
- <http://colah.github.io/> Christopher Olah's blog, multiple articles, visualizations, article on MNIST
- <https://distill.pub> Distill blog, co-edited by Christopher Olah
- <https://www.tensorflow.org/get_started/get_started> tensorflow getting started, 1/3 of the way thru the first tutorial
great thoughts
TensorFlow and Caffe
Theano, python package for machine learning supported by the Montreal Institute for Learning Algorithms (MILA) from 2010 to 2017. Eclipsed by TensorFlow and Caffe.
Caffe, python package for machine learning supported by Berkeley AI Research (BAIR).
Yoshua Bengio, Head of the Montreal Institute for Learning Algorithms (MILA)
Yangqing Jia (贾扬清), created Caffe at Berkeley, went to Google Brain and worked on TensorFlow, now at Facebook working on Caffe2 and ONNX. [ONNX is the attempt by Facebook and Microsoft to keep themselves relevant under the onslaught of TensorFlow.]
equation drawing with rmarkdown, latex http://www.statpower.net/Content/310/R%20Stuff/SampleMarkdown.html
python is installed here
C:\Users\John\AppData\Local\Programs\Python\Python36
language imprecision
In week 1, he uses the term “dimensional” incorrectly.
In week 2, Andrew Ng uses the term “multivariant linear regression” incorrectly.
m = number of rows in the training data
n = number of independent variables in each row
independent variable - aka predictor, aka feature
there is no such thing as a 4-dimensional vector. a vector is a 1-dimensional matrix and a 1-dimensional tensor; this vector has 4 thingies occurring in one dimension
elements of an array; members of a set
vector in physics vs vector in matrix math
vector = array = a list of things, members, elements, numbers
with only two features, the cost function can be represented by a surface or contour
feature scaling = normalization. get every feature into -1 < x < +1, or -3 < x < +3, or 0 < x < +1 (this last is called normalization); -0.5 < x < +0.5 is called mean normalization, where the mean value is zero
feature scaling improves the efficiency of gradient descent. demonstrated by the 2-variable case: if one variable has a range of 1 to 2000 and the other variable has a range of 1 to 5, the resulting contour will be long and skinny; gradient descent works much better if the contour is roughly circular, so the squared errors of each variable should be roughly similar in scale to one another
if feature scaling is not done to the data beforehand, it will be implicitly done by the weights; so by feature scaling up front, we give more meaning to the resulting weights
alpha = learning rate. try .003, .03, .3, 3. plot the curve of the cost function (y) by number of iterations (x); it should be the kind of curve that approaches an asymptote (a curve that never reaches 0 on either axis)
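A rough sketch of this in NumPy/matplotlib (the tiny dataset and the particular alphas shown converging are my own; with unscaled data, the larger alphas like .3 and 3 overshoot and diverge):

```python
# Sketch: batch gradient descent for one-feature linear regression,
# plotting cost J (y-axis) against iteration count (x-axis) for a few alphas.
# The dataset is made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

X = np.c_[np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]]   # bias column + one feature
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])
m = len(y)

def cost(theta):
    r = X @ theta - y
    return (r @ r) / (2 * m)

def descend(alpha, iters=100):
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m   # gradient step
        history.append(cost(theta))
    return history

for alpha in (0.003, 0.03, 0.1):      # 0.3 and 3 diverge on this unscaled data
    plt.plot(descend(alpha), label=f"alpha = {alpha}")
plt.xlabel("iterations")
plt.ylabel("cost J")
plt.legend()
plt.show()
```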
polynomial regression: using a curve instead of a line (quadratic, cubic, etc.); same use of least squares and gradient descent. feature scaling becomes important because of the squares and cubes in the equation, which increase the artificial weight of each feature
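A quick polynomial-fit sketch using NumPy's polyfit (my choice of tool, not the course's; the data points are made up to lie near y = 2x^2 + 1):

```python
# Sketch: polynomial (quadratic) regression by least squares with NumPy.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 32.8])   # made-up data, roughly y = 2x^2 + 1

coeffs = np.polyfit(x, y, deg=2)            # least-squares fit of a degree-2 polynomial
predict = np.poly1d(coeffs)

print(coeffs)        # approximately [2, 0, 1]
print(predict(5.0))  # extrapolate to x = 5
```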
what is the difference between regression and neural net? in a neural net, each neuron implements a regression analysis
the normal equation
in the beginning Andrew Ng uses the notation X_{ij}, and later X^{(i)}_j, for an element in a matrix
regression analysis
linear regression, non-linear regression, fitting data to a line or curve, least squares, normal equation, gradient descent
1809: Carl Friedrich Gauss formalized the least squares method. Many predecessors had worked through the 1700s refining the concept of minimizing the total error between observations and predictions of planetary orbits. The word “regression” was not used until the 1920s, when it was first applied to the process of using the least squares method to fit data to a line or curve.
https://en.wikipedia.org/wiki/Gallery_of_curves
a parabola has two x's for every y. what about a curve that has two y's for every x? the function f(x) would have two results for every x, like a circle, for example. how do you graph a circle? x^2 + y^2 = r^2
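One way to graph a curve that is not a function of x is the parametric form; a minimal matplotlib sketch (my own example):

```python
# Sketch: graphing a circle of radius r, which is not a function y = f(x),
# using the parametric form x = r*cos(t), y = r*sin(t).
# (Equivalently, plot the two branches y = +sqrt(r^2 - x^2) and y = -sqrt(r^2 - x^2).)
import numpy as np
import matplotlib.pyplot as plt

r = 1.0
t = np.linspace(0, 2 * np.pi, 200)
plt.plot(r * np.cos(t), r * np.sin(t))
plt.gca().set_aspect("equal")   # keep the circle round
plt.show()
```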
if the data matches an algebraic curve, non-linear data can sometimes be solved with either linear or nonlinear regression; for the former, we fudge (transform) the data until it fits a straight line (see the sketch after the links below)
linear, nonlinear, curvilinear
http://blog.minitab.com/blog/adventures-in-statistics-2/what-is-the-difference-between-linear-and-nonlinear-equations-in-regression-analysis
http://blog.minitab.com/blog/adventures-in-statistics-2/curve-fitting-with-linear-and-nonlinear-regression
https://en.wikipedia.org/wiki/Curvilinear_coordinates
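A sketch of the "fudge it until it fits a straight line" idea: made-up exponential data is linearized by taking the log of y, then fitted with ordinary straight-line least squares:

```python
# Sketch: fit y = a * exp(b*x) by transforming to log(y) = log(a) + b*x,
# then doing ordinary linear least squares on the transformed data.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * np.exp(0.5 * x)                    # made-up noiseless exponential data

b, log_a = np.polyfit(x, np.log(y), deg=1)   # straight-line fit in log space
a = np.exp(log_a)
print(a, b)                                  # recovers roughly a = 2.0, b = 0.5
```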
worked example of feature scaling:
x = 5184
range = max - min = 8836 - 4761 = 4075
mean = (7921 + 5184 + 8836 + 4761) / 4 = 6675.5
scaled = x / range = 5184 / 4075 = 1.2721472392638036
mean normalized = (x - mean) / range = (5184 - 6675.5) / 4075 ≈ -0.366
scaled min = 4761 / 4075 ≈ 1.168, scaled max = 8836 / 4075 ≈ 2.168
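The same arithmetic as a small NumPy sketch (numbers taken from the note above):

```python
# Sketch: mean normalization of one feature value, using the numbers above.
import numpy as np

values = np.array([7921.0, 5184.0, 8836.0, 4761.0])
x = 5184.0

rng = values.max() - values.min()       # 8836 - 4761 = 4075
mean = values.mean()                    # 6675.5

scaled = x / rng                        # about 1.272
mean_normalized = (x - mean) / rng      # about -0.366

print(rng, mean, scaled, mean_normalized)
```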
this is the Octave command to calculate theta using the normal equation: theta = pinv(X'*X)*X'*y
feature scaling is NOT necessary with the normal equation method
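The same formula in NumPy (a sketch; the rows are example housing-style numbers, and X is assumed to already contain a leading column of ones for the intercept):

```python
# Sketch: normal equation theta = pinv(X' * X) * X' * y in NumPy.
import numpy as np

X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 3.0],
              [1.0, 1416.0, 2.0]])              # example (bias, size, bedrooms) rows
y = np.array([400.0, 330.0, 369.0, 232.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y       # no feature scaling needed here
print(theta)
```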
Sigmoid - old school, replaced by ReLU
ReLU, Rectified Linear Unit function: a straight horizontal line up to a point, then an angle up
source minute 17:00 of https://www.youtube.com/watch?v=aircAruvnKk
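The two activation functions as a quick Python sketch:

```python
# Sketch: the sigmoid (older) and ReLU activation functions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes any input into (0, 1)

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, identity for positive

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))
```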
Notes
“Moreover, the holders of this much data remain in the hands of the private sector in the big six: Amazon, Facebook, Google, Microsoft, Apple and Baidu.”
—-
GAN Generative Adversarial Network, see Wikipedia
used to generate additional data. more data helps the training. “augmented data”
—–
Datasets
News
The web
A crawler
Output of a crawler, which links proved fruitful
Pandora music
Build an AI that can create data sets
Datasets to build
Map shapes
Treasures
Crawler - predict which links to follow
Teach an AI to play plunder
Let the AI identify boring vs addictive game play
How does OCR work? Pattern matching
How to train a human
Human gets tired
Machine does not
Need efficient training techniques
How is a glossary done in Wikipedia?
Look at everything humans do and figure out how to do it by machine.
Massage data into a training set.
Decide what to do, vs how to do it.
Technology
Logic
Prayer
Matrix
Variables × weights
Vector, matrix. Now what is a tensor?
—-
https://www.technocracy.news/index.php/2017/08/27/147-transnational-companies-run-world/
Find the Illuminati
Draw the hierarchy of humans
Hold them to account
Keep the population under control
Keep the environment conducive to human life
Really? But what about the future life forms?
Attempts to freeze a moment of happiness, to create a Utopia, backfire by repressing “natural” development.
Nothing needs to be done. Everything is perfect.
Enjoy.
Be.
—-
Use AI to geocode and chronocode (timestamp) articles
to find historical maps, georeference, break into shapes, chronocode, timestamp, label
so that AGI can put human population movements into historical perspective
AGI needs historical perspective.
Therefore, all data must be timestamped.
all data must be timestamped and geocoded
perhaps the AGI will be able to figure out why people tweet
—-
Game-playing API.
Create multiple clones of itself. Have the clones play against each other in leagues and tournaments. Observe the different personalities and styles of play that develop. Bring the learning of all the clones back into itself, delete the clones, and go defeat the humans.
Meta learning. Use one neural network to create and test multiple neural networks to pick the best one. https://arxiv.org/abs/1703.01041
Sense of self
Ego
Metaphysics
Tools
Tools become machines
Machines become intelligent
Child supersedes the father
Mimic the brain
Mimic natural selection
As of 2016, AlphaGo's algorithm uses a combination of machine learning and tree search techniques,
combined with extensive training, both from human and computer play.
It uses Monte Carlo tree search, guided by a “value network” and a “policy network,”
both implemented using deep neural network technology.[2][10]
A limited amount of game-specific feature detection pre-processing
(for example, to highlight whether a move matches a nakade pattern)
is applied to the input before it is sent to the neural networks.
from Wikipedia: AlphaGo
In uncharted territory—where one would expect learning to be most beneficial—an agent must be able to learn from its own experience.
An endless proliferation of 3D environments: In the past couple of years there have been a bunch of new large-scale AI-training environments released ranging from Microsoft's Minecraft-based Malmo to DeepMind's Quake-based 'DeepMind Lab', to the Doom-based VizDoom.
three types of machine learning: supervised, unsupervised, reinforcement
applications: robotics, game playing, operations management in a factory
agent in an environment: reward, state, action
explore (new territory or strategy) vs exploit (what it already knows)
Markov Decision Process
goal: maximize reward over time
policy: the agent's behavior function; optimize reward over time
value function: how good is each state and/or action
model: the agent's representation of the environment
Q-learning: r(s,a) = immediate reward, Q(s,a) = learned values (see the sketch below)
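A minimal tabular Q-learning sketch on a made-up 5-state chain (the environment, rewards, and hyperparameters are all invented for illustration; only the Q(s,a) update rule is the standard one):

```python
# Sketch: tabular Q-learning on a made-up 5-state chain environment.
# Actions: 0 = move left, 1 = move right. Reward 1.0 for reaching the last state.
import random

N_STATES = 5
ACTIONS = (0, 1)
alpha, gamma, epsilon = 0.1, 0.9, 0.5      # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # the Q(s, a) table

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0   # immediate reward r(s, a)
    return s_next, r

for episode in range(300):
    s = 0
    for _ in range(100):                   # cap the episode length
        # explore (try something new) or exploit (use what it already knows)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s_next, r = step(s, a)
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
        if s == N_STATES - 1:
            break

print(Q)   # "move right" should end up with the higher value in every state
```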
features defined by an expert vs. deep learning, which discovers features on its own
https://deepmind.com/blog/deep-reinforcement-learning/
https://en.wikipedia.org/wiki/Markov_decision_process#Reinforcement_learning
Kurzweil, Ray (2005). The Singularity is Near. New York: Penguin Group. ISBN 9780715635612.
Is there some kind of thinking that is not programmed?
evolutionary algorithm = genetic algorithm
3 parts; loop through these three steps:
selection, crossover, mutation
population breeds; the fittest survive
genome - something we want to breed and improve over time
multiple genes; gene = parameter
selection
fitness function decides which genomes can breed
crossover
breeding offspring, new population
mutation
mutate the new population; the mutation function randomly modifies parameters
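A minimal sketch of the selection, crossover, and mutation loop (the genome, fitness function, and all numbers here are made up for illustration):

```python
# Sketch: selection -> crossover -> mutation loop of a genetic algorithm.
# Genome: a list of parameters (genes); fitness: made-up target matching.
import random

TARGET = [3.0, -1.0, 2.0, 0.5]             # made-up "ideal" genome

def fitness(genome):
    # higher is better: negative squared distance from the target
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def crossover(mom, dad):
    cut = random.randrange(1, len(mom))     # single-point crossover
    return mom[:cut] + dad[cut:]

def mutate(genome, rate=0.1):
    # mutation: randomly nudge some parameters
    return [g + random.gauss(0, 0.5) if random.random() < rate else g for g in genome]

population = [[random.uniform(-5, 5) for _ in TARGET] for _ in range(50)]
for generation in range(100):
    # selection: the fitness function decides which genomes can breed
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    # crossover + mutation: breed offspring to form the new population
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(50)]

print(max(population, key=fitness))         # should end up close to TARGET
```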
since the 80s; an alternative to back propagation and an alternative to reinforcement learning; neural network zoom; “evolution strategies as an alternative to reinforcement learning”
classification of environments
Deterministicness (deterministic, stochastic, or non-deterministic): An environment is deterministic if the next state is perfectly predictable given knowledge of the previous state and the agent's action; or “strategic” if there is one other agent, like in a chess game.
Staticness (static or dynamic): Static environments do not change while the agent deliberates.
Observability (full or partial): A fully observable environment is one in which the agent has access to all information in the environment relevant to its task.
Agency (single or multiple): If there is at least one other agent in the environment, it is a multi-agent environment. Other agents might be apathetic, cooperative, or competitive.
Knowledge (known or unknown): An environment is considered to be “known” if the agent understands the laws that govern the environment's behavior. For example, in chess, the agent would know that when a piece is “taken” it is removed from the game. On a street, the agent might know that when it rains, the streets get slippery.
Episodicness (episodic or sequential): Sequential environments require memory of past actions to determine the next best action. Episodic environments are a series of one-shot actions, and only the current (or recent) percept is relevant. An AI that looks at radiology images to determine if there is a sickness is an example of an episodic environment. One image has nothing to do with the next.
Discreteness (discrete or continuous): A discrete environment has fixed locations or time intervals. A continuous environment could be measured quantitatively to any level of precision.
Simulated: a separate program is used to simulate an environment, feed percepts to agents, evaluate performance, etc.
feed forward NN: input, hidden, and output layers, with two weight matrices between them
input times weight, bias, activate
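A sketch of one forward pass (input to hidden to output, two weight matrices, a bias per layer; all shapes and values are made up):

```python
# Sketch: one forward pass of a feed-forward net: input -> hidden -> output.
# Two weight matrices (input-to-hidden, hidden-to-output), a bias per layer,
# and an activation after the hidden layer. Sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input vector (3 features)

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input-to-hidden weights and bias
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden-to-output weights and bias

relu = lambda z: np.maximum(0.0, z)

hidden = relu(W1 @ x + b1)                      # input times weight, plus bias, activate
output = W2 @ hidden + b2                       # output layer (no activation here)
print(output)
```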
RNN: for sequential data, like video or audio; a third weight matrix connects the hidden layer back to itself
vanishing gradient problem; exploding gradient problem
an RNN cell can be replaced with an LSTM cell
x * weight plus bias, activate
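A sketch of a single RNN step, showing the extra hidden-to-hidden weight matrix (shapes and values are made up):

```python
# Sketch: one step of a simple RNN cell. W_x maps the input; W_h is the third
# weight matrix that feeds the previous hidden state back in, which is how the
# cell "remembers" its previous state. Shapes are made up.
import numpy as np

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3

W_x = rng.normal(size=(hidden_size, input_size))
W_h = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # x times weight, plus bias, activate -- with the previous hidden state mixed in
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):         # a made-up sequence of 5 inputs
    h = rnn_step(x_t, h)
print(h)
```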
common activation functions: sigmoid, tanh, ReLU
RNNs are used for translation and for question-and-answer chatbots
an RNN cell remembers its previous state
blog comments: classification by sentiment: positive, negative, neutral