NLP

Natural Language Processing (NLP)

see WordNet
a lexical database of English maintained at Princeton
https://wordnet.princeton.edu/
(Note: three pages on WordNet have been copied into the wordnet project and should be deleted from here.)

Alice Zhao

Data Science Instructor at Metis, Chicago, Illinois
MS Northwestern University, Evanston, Ill

two hour tutorial
https://www.youtube.com/watch?v=xvqsFTUsOmc

three products

sentiment analysis
topic modeling
text generation

Python packages for web scraping

Requests, make HTTP requests
Beautiful Soup, parse HTML documents
Pickle, serialize Python objects for later use
Pandas, data analysis, DataFrame = table
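
A minimal sketch of how the first three packages might fit together, assuming the transcript text sits in the page's <p> tags (that choice, the function names, and the file name are illustrative, not from the tutorial). The download function needs the third-party packages requests and beautifulsoup4 and is not called here; only the pickle round trip runs, on placeholder data.

```python
import pickle
import tempfile
from pathlib import Path


def fetch_transcript(url):
    """Download a page and return the text of its <p> tags.

    Which tags hold the transcript depends on the site; <p> is an assumption.
    """
    # Imported here so the rest of the sketch runs without these packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return " ".join(p.get_text() for p in soup.find_all("p"))


def save_corpus(corpus, path):
    """Serialize the corpus so the pages need not be scraped again."""
    with open(path, "wb") as f:
        pickle.dump(corpus, f)


def load_corpus(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Offline demonstration of the pickle round trip with placeholder data.
corpus = {"author_a": "placeholder transcript text"}
path = Path(tempfile.gettempdir()) / "corpus_example.pkl"
save_corpus(corpus, path)
restored = load_corpus(path)
```

Only unpickle files you created yourself; loading a pickle from an untrusted source can execute arbitrary code.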

Text data formats

corpus, prep as Pandas DataFrame, two-column table: author, transcript
Document-term matrix

procedure

clean: remove punctuation, lowercase, remove numbers
tokenize: words
remove stop words (very common words such as articles)
matricize: columns=words, rows=documents, cells=word counts
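
The four steps above can be sketched in plain Python. This is an illustration only: the tutorial uses library tools for this, and the stop-word list and sample documents here are made up.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real lists (nltk, scikit-learn) are much longer.
STOP_WORDS = {"a", "an", "the", "and", "is", "of", "to"}


def clean(text):
    """Lowercase, then replace digits and punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    return re.sub(r"[^\w\s]", " ", text)


def tokenize(text):
    """Split cleaned text into words and drop stop words."""
    return [w for w in clean(text).split() if w not in STOP_WORDS]


def document_term_matrix(corpus):
    """corpus: dict of document name -> raw text. Returns (vocabulary, rows)."""
    counts = {name: Counter(tokenize(text)) for name, text in corpus.items()}
    vocabulary = sorted(set().union(*counts.values()))
    # One row per document, one column per word, cells are word counts.
    rows = {name: [c[w] for w in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


corpus = {
    "doc1": "The cat sat on the mat.",
    "doc2": "The dog chased the cat, and the cat ran 2 miles!",
}
vocabulary, rows = document_term_matrix(corpus)
```

Here vocabulary holds the column labels and rows["doc2"] has a 2 in the "cat" column.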

two output formats:

corpus, original text
document-term matrix

sentiment analysis 1:08:56

input: corpus
nltk: natural language toolkit
Python library: TextBlob, built on top of nltk

sentiment

from textblob import TextBlob
TextBlob("I love Naiyana").sentiment
output: Sentiment(polarity=0.5, subjectivity=0.6)
polarity: -1 to +1
subjectivity: 0 to +1, higher score means opinionated
TextBlob uses a sentiment lexicon labeled by Tom De Smedt

WordNet, compiled at Princeton, columns for each word:

word form: great
wordnet id: a-01123879
POS: JJ
Sense: very good
Polarity: 0.8
Subjectivity: 1.0
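
A toy sketch of how a lexicon like this can score a sentence: look up each word and average the polarity and subjectivity of the words found. This is a simplification for illustration, not TextBlob's actual code, which as far as I know also handles negation and intensifiers. Only the "great" entry comes from the notes above; the other lexicon values are made up.

```python
import re

# word -> (polarity, subjectivity). Only "great" is taken from the notes;
# the remaining entries are invented for the example.
LEXICON = {
    "great": (0.8, 1.0),
    "terrible": (-1.0, 1.0),
    "okay": (0.5, 0.5),
}


def sentiment(text):
    """Average the scores of all words that appear in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    if not scores:
        return 0.0, 0.0  # no known words: treat as neutral and objective
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity


polarity, subjectivity = sentiment("The acting was great but the ending was terrible.")
```

The mixed review comes out slightly negative and fully subjective, because the two known words pull in opposite directions.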

example: movie review database

topic modeling 1:22:52

input: document-term matrix
Python libraries:
nltk, for pos tagging
gensim, built by Radim Rehurek specifically for topic modeling

Latent Dirichlet Allocation (LDA)

latent = hidden
Dirichlet = a type of probability distribution

goal: learn the topic mix in each document, and the word mix in each topic

input: document-term matrix, number of topics, number of iterations
output: the top words in each topic
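
To make the inputs and outputs concrete, here is a small pure-Python sketch of LDA fitted by collapsed Gibbs sampling. That is one common fitting method and a swapped-in technique for illustration; gensim fits the model differently. The function takes token lists rather than a document-term matrix (the same information, expanded), and alpha, beta, and the sample documents are assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iterations, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists. Returns (doc_topic_mix, topic_top_words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]   # tokens of doc d in topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start with a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all the other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Topic mix in each document (smoothed proportions that sum to 1).
    doc_topic_mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Top three words in each topic.
    topic_top_words = [
        sorted(vocab, key=lambda w: -topic_word[t][w])[:3]
        for t in range(num_topics)
    ]
    return doc_topic_mix, topic_top_words


docs = [
    ["cat", "dog", "pet", "cat"],
    ["dog", "pet", "vet"],
    ["stock", "market", "trade"],
    ["market", "stock", "price", "trade"],
]
mix, top_words = lda_gibbs(docs, num_topics=2, iterations=50)
```

As in the tutorial, the topics come back as unlabeled word lists; deciding what each topic means is left to the reader.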

other techniques, also available in gensim:

Latent Semantic Indexing (LSI)
Non-Negative Matrix Factorization (NMF)

id2word = dict((v, k) for k, v in cv.vocabulary_.items())
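
The id2word line inverts a word-to-index mapping so that each column index of the document-term matrix can be turned back into its word. A small sketch, assuming cv is a fitted scikit-learn CountVectorizer (its vocabulary_ attribute maps each word to a column index); a plain dict stands in for it here.

```python
# Stand-in for cv.vocabulary_ (word -> column index); the words are made up.
vocabulary = {"cat": 0, "dog": 1, "mat": 2}

# Invert the mapping: column index -> word.
id2word = dict((v, k) for k, v in vocabulary.items())
```

gensim's topic models accept such a mapping through their id2word parameter, which is what lets them print words instead of column numbers.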

Part of speech tag set
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

text generation 1:44:50
input: corpus, include punctuation

Markov Chains, the current word predicts the next word
LSTM, the current string of words predicts the next word
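
A minimal sketch of the Markov chain approach: record which words follow each word in the corpus, then walk the chain by repeatedly picking a random follower of the current word. The sample text, function names, and seed are illustrative assumptions.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length=10, seed=None):
    """Walk the chain from `start`, choosing each next word at random."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word never had a successor
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Punctuation is kept as tokens so the chain learns where sentences end.
chain = build_chain("the cat sat on the mat . the dog sat on the rug .")
sentence = generate(chain, "the", length=8, seed=1)
```

Followers are stored with repeats, so a word that often follows another is proportionally more likely to be chosen.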

Siraj, NLP

https://www.youtube.com/watch?v=bDxFvr1gpSU

History

Feed forward networks
A vanilla neural network, such as a multilayer perceptron with fully connected layers. A feed-forward network treats all input features as discrete: unique and independent of one another.

Convolutional networks
In image processing, adjacent pixels are related, and similar patterns repeated across the image are related. Proximity matters.

Recurrent networks
Process a string of words. Predict the end of the sentence given the beginning of the sentence.

LSTM networks, a variant of RNN.

Attention networks

Can match a pronoun to its noun antecedent.

Transformer
Encoder

Jesse Moore, Using BERT to Accelerate NLP

https://m.youtube.com/watch?v=4Z_TzZJ-v3o

BERT (Google), Bidirectional Encoder Representations from Transformers
GPT-2 (OpenAI), for story-telling

Any time you're trying to do something with text.

Classify it.
Make use of it
Translate
Sentence completion
Auto complete
Story telling

F1 score. A number from 0 to 1; higher is better. It is the harmonic mean of precision and recall, and is used to evaluate classification models.
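
A short worked example of the F1 score for a binary classifier, computed from counts of true positives, false positives, and false negatives. The labels below are made up.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)  # share of predicted positives that are right
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.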

Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as long short-term memory (LSTM).

Both BERT and GPT-2 are based on transformers.

nltk, natural language toolkit, python library

corpus, corpora

text
concordance
common_contexts
dispersion_plot
generate
set
len
sorted
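
The last three entries are ordinary Python built-ins applied to a token sequence. A small sketch on a plain list of tokens (the sentence is made up; an nltk Text object should behave the same way, since it can be iterated and measured like a list):

```python
tokens = "the cat sat on the mat and the dog sat too".split()

# Distinct words, in alphabetical order.
vocabulary = sorted(set(tokens))

# Share of tokens that are distinct words, a rough measure of lexical richness.
lexical_diversity = len(set(tokens)) / len(tokens)
```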

notes

Sam, coach, conversation, info capture, python server, SQL, Ajax

Python

Read config.ini
psycopg2
Calc level (to be used in rapgen)
Load grammar table. Not just grammar but infonetgrab
Interrogation
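
A hedged sketch of the first two items: read database settings from config.ini with the standard library, then hand them to psycopg2. The section name, keys, and values are all assumptions, since the notes do not say what the file contains; the connection itself is left as a comment so the sketch runs without a database.

```python
import configparser

# Hypothetical contents of config.ini; section and key names are assumptions.
EXAMPLE_INI = """
[postgresql]
host = localhost
port = 5432
dbname = example_db
user = example_user
"""


def load_db_settings(ini_text):
    """Parse INI text and return the [postgresql] section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)  # use parser.read("config.ini") for a file
    return dict(parser["postgresql"])


settings = load_db_settings(EXAMPLE_INI)

# With psycopg2 installed and a server running, the settings could be
# passed on as keyword arguments:
#   import psycopg2
#   connection = psycopg2.connect(**settings)
```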

Recognize Thai handwriting
This could be an academic project that would result in a dictionary and corpus. Which academic institutions are working on NLP for Thai?

Teach vocabulary, grammar, and many subject domains simultaneously, using the principles of repetition, reinforcement, and building gradually on previously mastered material.

Let the teacher learn even while teaching.
Let the teacher teach like a parent: continuously, while going about your day.

Databit

Questions to elicit this databit
Answers: valid values

Network of databits

Which question to ask next?

Chat server

Ajax server

Webserver

Ajax Chat client

GitHub frug Ajax chat
Uses Ruby socket server
Client uses a js-flash bridge or falls back to Ajax polling

Poll every 1 sec
Degrade to every 5 sec after disuse
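
The polling rule can be sketched as a small function. The 1 and 5 second intervals come from the notes; the 60 second idle threshold is an assumption, since the notes do not say how long "disuse" is. The client in the notes is JavaScript, so this Python version only illustrates the logic.

```python
ACTIVE_INTERVAL = 1.0  # seconds between polls while the chat is in use
IDLE_INTERVAL = 5.0    # seconds between polls after disuse
IDLE_AFTER = 60.0      # assumption: seconds without activity that count as disuse


def polling_interval(seconds_since_last_activity):
    """Return how long to wait before the next poll."""
    if seconds_since_last_activity >= IDLE_AFTER:
        return IDLE_INTERVAL
    return ACTIVE_INTERVAL
```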

Requires Ajax server

A2hosting
Run webserver
Run python scripts from Ajax post
Access postgresql from python post

con

Gensen
Gencon
Thirst for knowledge about person
Ask about friends
Compare answers from multiple persons
Database structure for personal data

Dialog A. Gather info
B. Drill student

Empathy, know the other's vocabulary, use it, help him expand it

Hm

เข้าร่วม. Join
เข้าสู่ระบบ. Login

Thai Corpus - use to calculate level

Thai wordnet,
English wordnet: Princeton

Thai National Corpus (TNC)

Chulalongkorn U. Bangkok
http://www.arts.chula.ac.th/ling/tnc/works/

TNC Online, broken php
http://www.arts.chula.ac.th/~ling/TNCII/corp.php

Research paper about TNC:
Aroonmanakun, Wirote & Tansiri, Kachen & Nittayanuparp, Pairit. (2009). Thai National Corpus. 153-158. 10.3115/1690299.1690321.
https://www.researchgate.net/publication/271429101_Thai_National_Corpus

Resources

Khun Wannaphong, Khon Kaen U., Using Thani, 2017-2020
http://thainlp.wannaphong.com/

Dictionary
https://lexitron.nectec.or.th/2009_1/

Wordlist
https://www.expatden.com/thai/thai-frequency-lists-with-english-definitions/

¹⁾ dict((v, k) for k, v in cv.vocabulary_.items())
