====== Notes in Progress ======

==== Resources to consume ====
  * <https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi>  3blue1brown youtube channel with numerous videos on math with animations\\
    * Neural Net Deep Learning 1, with a guest segment at the end\\
    * Neural Net Deep Learning 2, done\\
    * Neural Net Deep Learning 3, Back Propagation, not public yet\\
  * <http://neuralnetworksanddeeplearning.com>  online html book\\
  * <https://www.deepbootcamp.io/>  website for the Deep RL Bootcamp (Berkeley, August 2017) on deep Reinforcement Learning
    * <https://www.youtube.com/watch?v=bsuvM1jO-4w&feature=youtu.be>  youtube from RL bootcamp
  * <http://colah.github.io/>  Christopher Olah's blog, multiple articles, visualizations, article on MNIST
  * <https://distill.pub>  Distill Blog, co-edited by Christopher Olah
  * https://www.tensorflow.org/get_started/get_started  tensorflow getting started, 1/3 of the way thru the first tutorial

==== great thoughts ====

TensorFlow and Caffe

Theano, python package for machine learning supported by Montreal Institute for Learning Algorithms (MILA) from 2010 to 2017.  Eclipsed by TensorFlow and Caffe.

Caffe, deep learning framework (written in C++, with a python interface) developed by Berkeley AI Research (BAIR).


Yoshua Bengio, Head of Montreal Institute for Learning Algorithms (MILA)

Yangqing Jia (贾扬清), created Caffe at Berkeley, went to Google Brain and worked on TensorFlow, now at Facebook working on Caffe2 and ONNX.  [ONNX is the attempt by Facebook and Microsoft to keep themselves relevant under the onslaught of TensorFlow.]

equation drawing with rmarkdown, latex http://www.statpower.net/Content/310/R%20Stuff/SampleMarkdown.html\\

python is installed here\\
C:\Users\John\AppData\Local\Programs\Python\Python36\\

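language imprecision\\
In week 1, Andrew Ng uses the term "dimensional" loosely.
In week 2, Andrew Ng uses the term "multivariate linear regression" for regression with several features; "multiple linear regression" is the more precise name ("multivariate" strictly means several dependent variables)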

m = number of rows in the training data
n = number of independent variables in each row

independent variable - aka predictor, aka feature

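"4-dimensional vector" is ambiguous: it means a vector with 4 elements (a point in 4-dimensional space), not an array with 4 axes
as an array, a vector has only one axis: it is a 1-dimensional array (a rank-1 tensor), while a matrix has two axes
this vector has 4 elements occurring along one dimension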

elements of an array
member of a set

vector in physics vs vector in matrix math

vector = array = a list of things, members, elements, numbers
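
A small Python sketch of the two meanings of "dimension" in these notes: the number of elements in a vector versus the number of axes of an array. The helper name shape is made up for illustration, and it assumes nested lists that are rectangular.

```python
def shape(a):
    """Return the length along each axis of a nested list (assumed rectangular)."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        a = a[0] if a else None
    return tuple(dims)

vector = [7, 1, 4, 9]            # 4 elements along one axis
matrix = [[1, 2, 3], [4, 5, 6]]  # 2 rows by 3 columns, two axes

print(shape(vector), len(shape(vector)))  # (4,) 1
print(shape(matrix), len(shape(matrix)))  # (2, 3) 2
```

So the vector has 4 elements but only 1 axis, and the matrix has 2 axes.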

with only two features, the cost function can be represented by a surface or contour

feature scaling = normalization
get every feature into
-1 < x < +1 or
-3 < x < +3 or
0 < x < +1  - this last is called normalization
-.5 < x < +.5  - this is called mean normalization, where the mean value is zero


feature scaling improves the efficiency of gradient descent
demonstrated by the 2 variable case, where if one variable has a range of 1 to 2000 and the other variable has a range of 1 to 5,
the resulting contour will be long and skinny
and gradient descent works much better if the contour is roughly circular,
which happens when the features (and so their contributions to the squared error) are roughly similar in scale to one another

if feature scaling is not done to the data beforehand
the weights have to absorb the differences in scale
so by feature scaling up front, we give more meaning to the resulting weights

alpha = learning rate
try .003, .03, .3, 3
plot the curve of the cost function on the y axis, by number of iterations on the x axis
it should be a curve that drops quickly and then flattens out, approaching its minimum asymptotically (like exponential decay, not a parabola); if the cost rises instead, alpha is too large
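
A minimal Python sketch of trying several learning rates and watching the cost per iteration. The data set is made up (y = 1 + 2x on an already scaled feature) and the function names are my own, not from the course.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent from theta = 0; returns the cost after each iteration."""
    m = len(xs)
    theta0 = theta1 = 0.0
    history = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0   # simultaneous update of both parameters
        theta1 -= alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
    return history

xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # made-up feature values
ys = [1.0, 1.5, 2.0, 2.5, 3.0]     # y = 1 + 2x

for alpha in (0.003, 0.03, 0.3, 3):
    history = gradient_descent(xs, ys, alpha, 50)
    print(alpha, history[0], history[-1])
```

On this toy data the three smaller values make the cost fall (the smallest ones very slowly), while alpha = 3 overshoots and the cost grows on every iteration.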


polynomial regression
using a curve instead of a line: quadratic (square), cubic, square root, etc
same use of least squares and gradient descent
feature scaling becomes important, because squaring and cubing a feature greatly magnifies its range
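
A short Python sketch of why scaling matters here, assuming a single made-up feature expanded into x, x^2 and x^3. The function name polynomial_features is hypothetical.

```python
def polynomial_features(x, degree):
    """Expand one value x into [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]

sizes = [1, 10, 100, 1000]   # made-up raw feature values
rows = [polynomial_features(x, 3) for x in sizes]

# range (max - min) of each generated column
ranges = [max(col) - min(col) for col in zip(*rows)]
print(ranges)  # [999, 999999, 999999999]
```

The cubic column spans a range a million times wider than the linear one, which is the situation feature scaling is meant to repair.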


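what is the difference between regression and neural net?
in a neural net, each neuron computes a weighted sum of its inputs plus a bias, like a regression equation, and then applies an activation function; with a sigmoid activation a single neuron is a logistic regression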


the normal equation
in the beginning Andrew Ng uses the notation X_{ij}, and later X^{(i)}_j for an element in a matrix


regression analysis
	linear regression
	non-linear regression
	fitting data to a line or curve
	least squares
	normal equation
	gradient descent


1809, Carl Friedrich Gauss published the least squares method (Legendre had published it independently in 1805).  Many predecessors had worked through the 1700s refining the concept of minimizing the total error in observations vs predictions of planetary orbits.
1880s, Francis Galton coined the word "regression" (regression toward the mean).  It was later extended to refer to the process of using the least squares method to fit data to a line or curve.


https://en.wikipedia.org/wiki/Gallery_of_curves

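a parabola y = x^2 has two x's for every y > 0
what about a curve that has two y's for every x?  then y is not a function of x, because f(x) would have two results for every x.  like a circle for example.
how do you graph a circle?  x^2 + y^2 = r^2, so y = +sqrt(r^2 - x^2) and y = -sqrt(r^2 - x^2), or parametrically x = r cos(t), y = r sin(t)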
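
A Python sketch of two ways to get points for graphing a circle, assuming the equation x^2 + y^2 = r^2 for a circle of radius r centered at the origin. Both function names are made up.

```python
import math

def circle_points(r, n):
    """Parametric form: n points (r cos t, r sin t) spread evenly around the circle."""
    return [(r * math.cos(2 * math.pi * k / n), r * math.sin(2 * math.pi * k / n))
            for k in range(n)]

def circle_ys(x, r):
    """Solve x^2 + y^2 = r^2 for y: two values when |x| < r, one when |x| == r."""
    if abs(x) > r:
        return []
    y = math.sqrt(r * r - x * x)
    return [y, -y] if y > 0 else [0.0]

print(circle_ys(3, 5))        # [4.0, -4.0]
print(circle_points(1, 4))
```

The parametric form avoids the "two results for every x" problem because each value of t gives exactly one point.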


if the data matches an algebraic curve
non-linear data can sometimes be solved with either linear or nonlinear regression
for the former, we transform the data (e.g. take logs, or add squared terms) until it fits a straight line

linear
nonlinear
curvilinear

http://blog.minitab.com/blog/adventures-in-statistics-2/what-is-the-difference-between-linear-and-nonlinear-equations-in-regression-analysis
http://blog.minitab.com/blog/adventures-in-statistics-2/curve-fitting-with-linear-and-nonlinear-regression
https://en.wikipedia.org/wiki/Curvilinear_coordinates


---------------
x = 5184
range = max - min = 8836 - 4761 = 4075
mean = (7921+5184+8836+4761) / 4 = 6675.5
smean =
scaled = 5184/(8836 - 4761) = 1.2721472392638036
normalized = (x - mean) / range
= (5184 - 6675.5) / 4075
= -1491.5 / 4075 = -0.366 (approximately)

min
4761 / 4075
8836 / 4075

-----------
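
The scratch arithmetic above can be checked with a few lines of Python. This is only a sketch using the four values from the notes; the variable names are mine.

```python
values = [7921, 5184, 8836, 4761]
x = 5184

value_range = max(values) - min(values)   # 8836 - 4761
mean = sum(values) / len(values)

scaled = x / value_range                  # divide by the range only
normalized = (x - mean) / value_range     # mean normalization: subtract the mean first

print(value_range, mean)                  # 4075 6675.5
print(scaled, normalized)
```

Note the parentheses: x - mean / range without them divides only the mean by the range.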


this is the Octave command to calculate theta using normal equation
theta = pinv(X'*X)*X'*y

feature scaling is NOT necessary in normal equation method
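
For comparison, a pure Python sketch of the same normal equation. It is not a translation of pinv: it solves (X'X) theta = X'y by Gauss-Jordan elimination, so it assumes X'X is invertible. The data and all function names are made up for illustration.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes a is square and invertible."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def normal_equation(X, y):
    """theta = (X'X)^-1 X'y, computed without forming the inverse."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(v * t for v, t in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

# made-up data: y = 1 + 2x; the first column of X is the intercept term
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]
theta = normal_equation(X, y)
print(theta)   # close to [1.0, 2.0]
```

No learning rate and no iterations are involved, which is why feature scaling is not needed for this method.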

Sigmoid - old school, largely replaced by ReLU in hidden layers\\
ReLU  Rectified Linear Unit  function, max(0, x): flat at zero for negative inputs, then a straight line with slope 1 for positive inputs\\
source minute 17:00 of https://www.youtube.com/watch?v=aircAruvnKk\\
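
A minimal Python sketch of the two activation functions, using their standard definitions. The sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Squashes any real number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified Linear Unit: zero for negative inputs, x itself otherwise."""
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), relu(x))
```
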

==== Notes   ====
"Moreover, the holders of this much data remain in the hands of the private sector in the big six: Amazon, Facebook, Google, Microsoft, Apple and Baidu."\\
----\\
GAN Generative Adversarial Network, see Wikipedia\\
used to generate additional data.  more data helps the training.  "augmented data"\\
-----\\
Datasets\\
News\\
The web\\
A crawler\\
Output of a crawler, which links proved fruitful\\
Pandora music\\
\\
Build an AI that can create data sets\\
\\
Datasets to build\\
Map shapes\\
Treasures\\
Crawler - predict which links to follow\\
\\
Teach an AI to play plunder\\
Let the ai identify boring vs addictive game play\\
\\
How does ocr work? Pattern matching\\
\\
How to train a human\\
Human gets tired\\
Machine does not\\
Need efficient training techniques\\
\\
How is a glossary done in Wikipedia?\\
\\
Look at everything humans do and figure out how to do it by machine.\\
\\
Massage data into a training set.\\
Decide what to do, vs how to do it.\\
\\
Technology\\
Logic\\
Prayer\\
\\
Matrix\\
Variables × weights\\
\\
Vector, matrix. Now what is a tensor?\\
\\
----\\
\\
\\
https://www.technocracy.news/index.php/2017/08/27/147-transnational-companies-run-world/\\
\\
Find the Illuminati\\
\\
Draw the hierarchy of humans\\
Hold them to account\\
\\
Keep the population under control\\
Keep the environment conducive to human life\\
\\
Really? But what about the future life forms?\\
Attempts to freeze a moment of happiness, to create a Utopia, backfire by repressing "natural" development.\\
\\
Nothing needs to be done. Everything is perfect. \\
Enjoy.\\
Be.\\
\\
----\\
\\
Use AI to geocode and chronocode, timestamp articles\\
to find historical maps, georeference, break into shapes, chronocode, timestamp, label\\
so that AGI can put human population movements into historical perspective\\
\\
AGI needs historical perspective.\\
Therefore, all data must be timestamped.\\
\\
all data must be timestamped and geocoded\\
\\
perhaps the AGI will be able to figure out why people tweet\\
\\
----\\
\\
Game-playing AI.\\
Create multiple clones of itself.  Have the clones play against each other in leagues and tournaments.  Observe the different personalities and styles of play that develop.  Bring the learning of all the clones back into itself, delete the clones, and go defeat the humans.\\
\\
Meta learning.  Use one neural network to create and test multiple neural networks to pick the best one.  https://arxiv.org/abs/1703.01041\\
\\
Sense of self\\
Ego\\
Metaphysics\\
\\
Tools\\
Tools become machines\\
Machines become intelligent\\
Child supersedes the father\\
\\
\\
Mimic the brain\\
\\
Mimic natural selection\\
\\
\\
As of 2016, AlphaGo's algorithm uses a combination of machine learning and tree search techniques, 
combined with extensive training, both from human and computer play. 
It uses Monte Carlo tree search, guided by a "value network" and a "policy network," 
both implemented using deep neural network technology.[2][10] 
A limited amount of game-specific feature detection pre-processing 
(for example, to highlight whether a move matches a nakade pattern) 
is applied to the input before it is sent to the neural networks. 
from Wikipedia: alphago

In uncharted territory—where one would expect learning to be
most beneficial—an agent must be able to learn from its own experience.


An endless proliferation of 3D environments: 
In the past couple of years there have been a bunch of new large-scale AI-training environments released ranging from 
Microsoft's Minecraft-based Malmo to 
DeepMind's Quake-based 'DeepMind Lab', to the 
Doom-based VizDoom. 

three types of machine learning:
supervised
unsupervised
reinforcement
	application
		robotics
		game playing
		operations management in a factory
	agent in environment
		reward
		state
	action
		explore (new territory or strategy)
		exploit (what he already knows)
	Markov Decision Process
	goal: maximize reward over time
	policy: optimize reward over time
	
	policy: agent's behavior function
	value function: how good is each state and/or action
	model: agent's representation of the environment
	
	q-learning
		r(s,a) immediate reward
		Q(s,a) values

	features defined by expert
	deep learning to discover features on its own
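
A minimal tabular q-learning sketch in Python, tying together the outline above: state, action, immediate reward r(s,a), Q(s,a) values, explore vs exploit. The environment is a made-up row of 5 states with the reward at the right end; all names and parameter values are assumptions chosen for illustration.

```python
import random

N_STATES = 5         # hypothetical toy world: states 0..4 in a row, 4 is the goal
ACTIONS = (-1, +1)   # step left or step right

def step(state, action):
    """Environment: return (next state, immediate reward r(s,a), done flag)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def choose(Q, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose(Q, state, epsilon, rng)
            nxt, reward, done = step(state, action)
            future = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
            # move Q(s,a) toward reward + discounted value of the next state
            Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
            state = nxt
    return Q

Q = q_learning()
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)   # learned action for states 0..3
```

Here the policy is simply "pick the action with the highest Q value in each state", and the features (the state number) are handed to the agent rather than discovered by deep learning.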
		

https://deepmind.com/blog/deep-reinforcement-learning/

https://en.wikipedia.org/wiki/Markov_decision_process#Reinforcement_learning

Kurzweil, Ray (2005). The Singularity is Near. New York: Penguin Group. ISBN 9780715635612.

Is there some kind of thinking that is not programmed?

--------------

genetic algorithm = the best-known kind of evolutionary algorithm

3 parts, loop thru these three steps
	selection
	crossover
	mutation

population
breeds
fittest survive

genome - something we want to breed and improve over time
   multiple genes
   gene = parameter

   
selection
   fitness function decides which genomes can breed

crossover   
   breeding
   offspring, new population
   
mutation
   mutate the new population
   mutation function
   randomly modifying parameters
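
The selection, crossover, mutation loop above can be sketched in Python as follows. The fitness function, the TARGET parameters, and every rate and size are made up for illustration; this is one simple variant, not a standard implementation.

```python
import random

TARGET = [0.5, -1.0, 2.0]   # hypothetical "ideal" parameters, for illustration only

def fitness(genome):
    """Higher is better: negative squared distance from TARGET."""
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def select(population, keep):
    """Selection: only the fittest genomes survive and may breed."""
    return sorted(population, key=fitness, reverse=True)[:keep]

def crossover(mom, dad, rng):
    """Crossover: each gene (parameter) is copied from one parent or the other."""
    return [rng.choice(pair) for pair in zip(mom, dad)]

def mutate(genome, rng, rate=0.2, scale=0.3):
    """Mutation: randomly nudge some of the genes."""
    return [g + rng.gauss(0, scale) if rng.random() < rate else g for g in genome]

def evolve(generations=60, size=30, keep=10, seed=1):
    rng = random.Random(seed)
    population = [[rng.uniform(-3, 3) for _ in TARGET] for _ in range(size)]
    best_history = []
    for _ in range(generations):
        parents = select(population, keep)
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(size - keep)]
        population = parents + children   # parents are carried over unchanged
        best_history.append(fitness(parents[0]))
    return select(population, 1)[0], best_history

best, history = evolve()
print(best, history[0], history[-1])
```

Because the surviving parents are kept in the next population, the best fitness can never get worse from one generation to the next.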
   
since the 80s
alternative to back propagation
alternative to reinforcement learning
neural network zoom
"Evolution Strategies as a Scalable Alternative to Reinforcement Learning" (OpenAI, 2017)

-----------

classification of environments

Deterministicness (deterministic or stochastic or Non-deterministic): An environment is deterministic if the next state is perfectly predictable given knowledge of the previous state and the agent's action.
or "strategic" if the environment is deterministic except for the actions of other agents, like a chess game.

Staticness (static or dynamic): Static environments do not change while the agent deliberates.

Observability (full or partial): A fully observable environment is one in which the agent has access to all information in the environment relevant to its task.

Agency (single or multiple): If there is at least one other agent in the environment, it is a multi-agent environment. Other agents might be apathetic, cooperative, or competitive.

Knowledge (known or unknown): An environment is considered to be "known" if the agent understands the laws that govern the environment's behavior. For example, in chess, the agent would know that when a piece is "taken" it is removed from the game. On a street, the agent might know that when it rains, the streets get slippery.

Episodicness (episodic or sequential): Sequential environments require memory of past actions to determine the next best action. Episodic environments are a series of one-shot actions, and only the current (or recent) percept is relevant. An AI that looks at radiology images to determine if there is a sickness is an example of an episodic environment. One image has nothing to do with the next.

Discreteness (discrete or continuous): A discrete environment has fixed locations or time intervals. A continuous environment could be measured quantitatively to any level of precision.

Simulated : a separate program is used to simulate an environment, feed percepts to agents, evaluate performance, etc.

----------------

feed forward NN - input, hidden, output layers, with two weight matrices between them
	input times weight, add bias, activate
RNN - for sequential data, like video or audio
a third weight matrix connects the hidden layer back to itself

vanishing gradient problem
exploding gradient problem

an RNN cell can be replaced with an LSTM cell


x * weight plus bias, activate
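
A Python sketch of "x * weight plus bias, activate" applied layer by layer in a tiny feed forward network. The layer sizes and all weight values are made up, and sigmoid is just one possible choice of activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases, activation):
    """One layer: for each neuron, inputs times weights, plus bias, then activate."""
    return [activation(sum(x * w for x, w in zip(inputs, neuron_w)) + b)
            for neuron_w, b in zip(weights, biases)]

# made-up weights: 2 inputs -> 3 hidden neurons -> 1 output
W_hidden = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b_hidden = [0.0, 0.1, -0.1]
W_output = [[0.7, -0.5, 0.2]]
b_output = [0.05]

x = [1.0, 2.0]
hidden = layer(x, W_hidden, b_hidden, sigmoid)
output = layer(hidden, W_output, b_output, sigmoid)
print(hidden, output)
```

W_hidden and W_output are the two weight matrices of a network with one hidden layer.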

common activation functions:
	sigmoid
	tanh
	ReLU

RNNs are used for translation and question-answering chatbots

an RNN cell remembers its previous state
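
A Python sketch of how a plain RNN cell carries its previous state forward, assuming the common form h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The weights are made up and the names W_xh, W_hh and rnn_step are my own labels.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """New hidden state from the current input and the previous hidden state."""
    return [math.tanh(sum(w * xi for w, xi in zip(W_xh[j], x))
                      + sum(w * hi for w, hi in zip(W_hh[j], h_prev))
                      + b_h[j])
            for j in range(len(b_h))]

W_xh = [[0.5], [-0.3]]             # input (1 value) -> hidden (2 units), made up
W_hh = [[0.1, 0.2], [0.0, -0.4]]   # hidden -> hidden, the recurrent weight matrix
b_h = [0.0, 0.0]

def run(sequence):
    """Feed a sequence one value at a time, starting from a zero state."""
    h = [0.0, 0.0]
    for value in sequence:
        h = rnn_step([value], h, W_xh, W_hh, b_h)
    return h

print(run([1.0, 0.0]))
print(run([0.0, 0.0]))
```

Both sequences end with the same input 0.0, yet the final states differ, because the first one still carries a trace of the earlier 1.0.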

blog comments: classification by sentiment: positive negative neutral