[[projects:projects]]:[[mai]]

====== NLP ======

Natural Language Processing (NLP)

See WordNet, \\
a lexical database of English maintained at Princeton \\
https://wordnet.princeton.edu/ \\
(Note: three pages on WordNet have been copied into the wordnet project and should be deleted from here.)

===== Alice Zhao =====

Data Science Instructor at Metis, Chicago, Illinois \\
MS, Northwestern University, Evanston, Illinois

Two-hour tutorial \\
https://www.youtube.com/watch?v=xvqsFTUsOmc

Three products:
  * sentiment analysis
  * topic modeling
  * text generation

Python packages for web scraping:
  * Requests, make HTTP requests
  * Beautiful Soup, parse HTML documents
  * Pickle, serialize Python objects for later use
  * Pandas, data analysis; a DataFrame is a table

Text data formats:
  - corpus, prepared as a Pandas DataFrame, a two-column table: author, transcript
  - document-term matrix

Procedure:
  * clean: remove punctuation, lowercase, remove numbers
  * tokenize: split into words
  * remove stop words (very common words such as articles)
  * matricize: columns = words, rows = documents, cells = word counts

Two output formats:
  * corpus, the original text
  * document-term matrix

Sentiment analysis (1:08:56)
  * input: corpus
  * nltk: Natural Language Toolkit
  * Python library: TextBlob, built on top of nltk

Sentiment
  * from textblob import TextBlob
  * TextBlob("I love Naiyana").sentiment
  * output: Sentiment(polarity=0.5, subjectivity=0.6)
  * polarity: -1 to +1
  * subjectivity: 0 to +1; a higher score means more opinionated
  * TextBlob uses a sentiment lexicon labeled by Tom De Smedt

The lexicon is keyed to WordNet, compiled at Princeton. Columns for each word:
  * word form: great
  * WordNet id: a-01123879
  * POS: JJ
  * sense: very good
  * polarity: 0.8
  * subjectivity: 1.0

Example: movie review database

Topic modeling (1:22:52)
  * input: document-term matrix
  * Python libraries:
    * nltk, for POS tagging
    * gensim, built by Radim Rehurek specifically for topic modeling

Latent Dirichlet Allocation (LDA)
  * latent = hidden
  * Dirichlet = a type of probability distribution

Goal: learn the topic mix in each document, and the word mix in each topic
  * input: document-term matrix, number of topics, number of iterations
  * output: the top words in each topic

Other techniques, also available in gensim:
  * Latent Semantic Indexing (LSI)
  * Non-Negative Matrix Factorization (NMF)

id2word = dict((v,k) for k, v in cv.vocabulary_.items())

Part-of-speech tag set \\
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Text generation (1:44:50) \\
input: corpus, including punctuation

Markov chains: the current word predicts the next word \\
LSTM: the current string of words predicts the next word

===== Siraj, NLP =====

https://www.youtube.com/watch?v=bDxFvr1gpSU

History

Feed-forward networks \\
A vanilla neural network, such as a multilayer perceptron with fully connected layers. A feed-forward network treats all input features as discrete, unique, and independent of one another.

Convolutional networks \\
In image processing, adjacent pixels are related, and similar patterns repeated across the image are related. Proximity matters.

Recurrent networks \\
Process a string of words. Predict the end of a sentence given its beginning.

LSTM networks \\
A variant of RNN.

Attention networks \\
Can match a pronoun to its noun antecedent.

Transformer \\
Encoder

===== Jesse Moore, Using BERT to Accelerate NLP =====

https://m.youtube.com/watch?v=4Z_TzZJ-v3o

  * BERT (Google): Bidirectional Encoder Representations from Transformers
  * GPT-2 (OpenAI): for storytelling

Useful any time you're trying to do something with text:
  * Classify it
  * Make use of it
  * Translate
  * Sentence completion
  * Autocomplete
  * Storytelling

F1 score: a number from 0 to 1, where higher is better. A way to evaluate classification models.

Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as the long short-term memory (LSTM).

Both BERT and GPT-2 are based on transformers.
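
The F1 score mentioned in the notes above is the harmonic mean of precision and recall. A minimal sketch in plain Python, assuming binary labels; the function name ''f1_score'' and the toy labels are made up for illustration and are not taken from the talk.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: F1 is defined here as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = f1_score(y_true, y_pred)
print(round(score, 2))  # 0.75 (precision 3/4, recall 3/4)
```

Because it combines precision and recall, F1 only approaches 1 when a classifier both finds most positives and raises few false alarms.
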
===== nltk, natural language toolkit, python library =====

corpus, corpora
  * text
  * concordance
  * common_contexts
  * dispersion_plot
  * generate
  * set
  * len
  * sorted

===== notes =====

Sam, coach, conversation, info capture, python server, SQL, Ajax

Python
  * Read config.ini
  * psycopg2
  * Calc level (to be used in rapgen)
  * Load grammar table. Not just grammar but infonetgrab
  * Interrogation

Recognize Thai handwriting \\
This could be an academic project that would result in a dictionary and corpus.

Which academic institutions are working on NLP for Thai?

Teach vocabulary, grammar, and many subject domains simultaneously, using the principles of repetition, reinforcement, and building gradually on previously mastered material.

Let the teacher learn even while teaching. \\
Let the teacher teach like a parent: continuously, while going about the day.

Databit
  * Questions to elicit this databit
  * Answers: valid values

Network of databits
  * Which question to ask next?

Chat server

Ajax server

Webserver

Ajax Chat client

GitHub frug Ajax chat \\
Uses Ruby socket server \\
Client uses js-flash bridge or falls back to Ajax polling

Polling every 1 sec \\
Degrade to 5 sec after disuse

Requires Ajax server

A2hosting \\
Run webserver \\
Run python scripts from Ajax post \\
Access postgresql from python post

=== con ===

Gensen \\
Gencon \\
Thirst for knowledge about person \\
Ask about friends \\
Compare answers from multiple persons \\
Database structure for personal data \\
Dialog

A. Gather info \\
B. Drill student \\
Empathy: know the other's vocabulary, use it, help him expand it

=== Hm ===

เข้าร่วม. Join \\
เข้าสู่ระบบ. Login

Thai corpus: use to calculate level

Thai wordnet, \\
English wordnet: Princeton

===== Thai National Corpus (TNC) =====

Chulalongkorn U., Bangkok \\
http://www.arts.chula.ac.th/ling/tnc/works/

TNC Online, broken php \\
http://www.arts.chula.ac.th/~ling/TNCII/corp.php

Research paper about TNC: \\
Aroonmanakun, Wirote; Tansiri, Kachen; Nittayanuparp, Pairit (2009). Thai National Corpus. pp. 153-158. doi:10.3115/1690299.1690321. \\
https://www.researchgate.net/publication/271429101_Thai_National_Corpus

===== Resources =====

Khun Wannaphong, Khon Kaen U., Using Thani, 2017-2020 \\
http://thainlp.wannaphong.com/

Dictionary \\
https://lexitron.nectec.or.th/2009_1/

Wordlist \\
https://www.expatden.com/thai/thai-frequency-lists-with-english-definitions/
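
The notes mention using a Thai corpus to calculate a level, and the wordlist above is a frequency list. The notes do not say how the level is defined, so the following is only one possible reading: take the level of a text to be the frequency rank a reader must know to cover a given share of its words. The function name ''text_level'', the 80 percent threshold, and the toy ranks are all hypothetical. Words are assumed to be already segmented; Thai is written without spaces between words, so a real pipeline would need a tokenizer first.

```python
def text_level(words, rank_of, coverage_percent=80):
    """Return the frequency rank needed to cover `coverage_percent` of `words`.

    rank_of maps a word to its frequency rank (1 = most frequent).
    Words missing from rank_of are treated as rarer than every listed word.
    """
    if not words:
        return 0
    unknown_rank = len(rank_of) + 1
    ranks = sorted(rank_of.get(w, unknown_rank) for w in words)
    # Position of the word sitting at the coverage quantile (integer ceiling).
    idx = (coverage_percent * len(ranks) + 99) // 100 - 1
    return ranks[max(0, idx)]


# Toy ranks invented for illustration; a real list would come from a corpus.
rank_of = {
    "ไป": 1,           # go
    "กิน": 2,          # eat
    "เข้าร่วม": 3,      # join
    "เข้าสู่ระบบ": 4,   # login
}
words = ["ไป", "กิน", "ไป", "เข้าร่วม", "เข้าสู่ระบบ"]
level = text_level(words, rank_of)
print(level)  # 3: knowing the top 3 words covers 4 of the 5 tokens
```

A quantile is used rather than the maximum rank so that a single rare word does not push an otherwise easy text to the top level.
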