Natural Language Processing (NLP)
see WordNet
a lexical database maintained at Princeton
https://wordnet.princeton.edu/
(Note: three pages on WordNet have been copied into the WordNet project and should be deleted from here.)
Data Science Instructor at Metis, Chicago, Illinois
MS, Northwestern University, Evanston, Illinois
two hour tutorial
https://www.youtube.com/watch?v=xvqsFTUsOmc
three products
python package for Web Scraping
Text data formats
procedure
two output formats:
sentiment analysis 1:08:56
sentiment
WordNet, compiled at Princeton, with columns for each word:
example: movie review database
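The lexicon approach above can be sketched in a few lines of stdlib Python: count positive and negative words and normalize by document length. The word lists here are tiny illustrative stand-ins, not a real sentiment lexicon.

```python
# Minimal lexicon-based sentiment scorer (stdlib only).
# POSITIVE/NEGATIVE are toy stand-ins for a real lexicon.
POSITIVE = {"great", "wonderful", "enjoyable", "best"}
NEGATIVE = {"boring", "terrible", "worst", "dull"}

def sentiment_score(text):
    """Return (#positive - #negative) / #tokens; >0 means positive."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(sentiment_score("A wonderful, enjoyable movie."))  # positive
print(sentiment_score("The worst, most boring film."))   # negative
```

On a movie review database, each review would be scored this way (or with a trained classifier) and compared against its star rating.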
topic modeling 1:22:52
Latent Dirichlet Allocation (LDA)
goal: learn the topic mix in each document, and the word mix in each topic
other techniques, also available in gensim:
id2word = dict1
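The id2word mapping gensim expects can be sketched with the stdlib alone: assign each vocabulary word an integer id, then represent each document as (word_id, count) pairs — the bag-of-words input LDA trains on. This is a hand-rolled sketch of what gensim's Dictionary class does, not gensim's own API.

```python
# Build an id2word mapping and bag-of-words vectors (stdlib only).
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]

# word -> integer id, assigned in first-seen order
word2id = {}
for doc in docs:
    for w in doc:
        word2id.setdefault(w, len(word2id))

# id -> word, the mapping LDA uses to print readable topics
id2word = {i: w for w, i in word2id.items()}

def bow(doc):
    """Represent a document as sorted (word_id, count) pairs."""
    counts = Counter(word2id[w] for w in doc)
    return sorted(counts.items())

for doc in docs:
    print(bow(doc))
```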
Part of speech tag set
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
text generation 1:44:50
input: corpus, include punctuation
Markov Chains, the current word predicts the next word
LSTM, the current string of words predicts the next word
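The Markov-chain idea — the current word predicts the next word — fits in a few lines of stdlib Python. The corpus here is a toy stand-in; a real run would feed in the full punctuation-included corpus noted above.

```python
# Minimal first-order Markov-chain text generator (stdlib only).
import random
from collections import defaultdict

corpus = "the cat sat . the cat ran . the dog sat .".split()

# transitions: word -> list of words observed to follow it
transitions = defaultdict(list)
for cur, nxt in zip(corpus, corpus[1:]):
    transitions[cur].append(nxt)

def generate(start, length, seed=0):
    """Walk the chain: repeatedly sample a follower of the last word."""
    random.seed(seed)
    words = [start]
    for _ in range(length - 1):
        followers = transitions.get(words[-1])
        if not followers:
            break
        words.append(random.choice(followers))
    return " ".join(words)

print(generate("the", 8))
```

An LSTM replaces the single-word lookup table with a learned model that conditions on the whole preceding string.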
https://www.youtube.com/watch?v=bDxFvr1gpSU
History
Feed forward networks
a vanilla neural network, such as a multilayer perceptron with fully connected layers. A feed-forward network treats all input features as unique, discrete, and independent of one another.
Convolutional networks
In image processing, adjacent pixels are related, and similar patterns repeated across the image are related. Proximity matters.
Recurrent networks
Process a string of words. Predict the end of the sentence given the beginning of the sentence.
LSTM networks, a variant of RNNs.
Attention networks
Can match a pronoun to its noun antecedent.
Transformer
Encoder
https://m.youtube.com/watch?v=4Z_TzZJ-v3o
Any time you're trying to do something with text.
F1 Score. A number from 0 to 1; higher is better. The harmonic mean of precision and recall, used to evaluate classification problems.
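The F1 computation is short enough to write out from the definition (harmonic mean of precision and recall over true/false positives and false negatives):

```python
# F1 score from scratch, for a binary classification problem.
def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

In practice a library routine (e.g. scikit-learn's `f1_score`) would be used; this just shows the arithmetic.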
Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as the long short-term memory (LSTM).
Both BERT and GPT-2 are based on transformers.
corpus, corpora
Sam, coach, conversation, info capture, python server, SQL, Ajax
Python
Recognize Thai handwriting
This could be an academic project that would result in a dictionary and corpus. Which academic institutions are working on NLP for Thai?
Teach vocabulary, grammar, and many subject domains simultaneously, with principles of repetition, reinforcement, and building gradually on previously mastered material.
Let the teacher learn even while teaching.
Let the teacher teach like a parent: continuously, while going about the day.
Databit
Network of databits
Chat server
Ajax server
Webserver
Ajax Chat client
GitHub frug Ajax chat
Uses Ruby socket server
Client Uses js-flash bridge or fall-back to Ajax polling
Polling every 1 sec
Degrade to 5 sec after disuse
Requires Ajax server
A2hosting
Run webserver
Run Python scripts from an Ajax POST
Access PostgreSQL from the Python scripts
Gensen
Gencon
Thirst for knowledge about person
Ask about friends
Compare answers from multiple persons
Database structure for personal data
Dialog
A. Gather info
B. Drill student
Empathy: know the other's vocabulary, use it, help him expand it
เข้าร่วม. Join
เข้าสู่ระบบ. Login
Thai Corpus - use to calculate level
Thai WordNet
English WordNet: Princeton
Chulalongkorn U. Bangkok
http://www.arts.chula.ac.th/ling/tnc/works/
TNC Online (broken PHP)
http://www.arts.chula.ac.th/~ling/TNCII/corp.php
Research paper about TNC:
Aroonmanakun, Wirote; Tansiri, Kachen; Nittayanuparp, Pairit (2009). Thai National Corpus. pp. 153-158. doi:10.3115/1690299.1690321.
https://www.researchgate.net/publication/271429101_Thai_National_Corpus
Khun Wannaphong, Khon Kaen U., Using Thani, 2017-2020
http://thainlp.wannaphong.com/
Dictionary
https://lexitron.nectec.or.th/2009_1/
Wordlist
https://www.expatden.com/thai/thai-frequency-lists-with-english-definitions/