====== Natural Language Processing ======

This article is about NLP as studied within the field of AI.  See also [[Language]] for more general information about language as studied in the fields of Psychology and Linguistics.

===== Human vs Computer Language =====

Computer languages are orderly and elegant.  One command has one meaning.  Each command has a fixed set of arguments.

Human languages are comparatively messy.  Many words have multiple meanings.  The meanings change from time to time.  New words are invented constantly.  Words can be arranged into sentences in a variety of ways.  There are rules, but they are constantly changing.


Two halves:

  * input: Natural Language Understanding (NLU) - aka information extraction
  * output: Natural Language Generation (NLG)

Scale:
  * large-scale: Computation Linguistics
  * small-scale: Conversational NLP

Applications:


===== Computational Linguistics =====

Applications:
  * speech recognition
  * email filters: spam, social, promotional, etc.
  * search results
  * question answering
  * predictive text: autocorrect, autocomplete, etc.
  * language translation
  * sentiment analysis
  * intent classification
  * urgency detection
  * assistants: chatbots, Siri, Alexa

Corpora

Techniques
  * data cleaning, prep
  * pattern recognition


===== Conversational NLP =====

If a human wants to use a computer, he has to learn the computer's language.

If a computer wants to interact with a human, it must learn the human's language.

==== Assistant ====

One form of conversational NLP is the virtual assistant.

  * Siri from Apple
  * Alexa from Amazon
  * Home from Google
  * Assistant from Google


We’ll deep dive into our NLU pipeline, custom components like Google’s BERT and Recurrent Embedding Dialogue Policy (REDP), and approach concepts like context, attention, and non-linear conversation.

Natural Language Understanding (NLU) is a subset of NLP that turns natural language into structured data. NLU is able to do two things — intent classification and entity extraction.

For the purposes of this article, we will use the Rasa, an open source stack that provides tools to build contextual AI assistants. There are two main components in the Rasa stack that will help us build a travel assistant — Rasa NLU and Rasa core.

Conversational AI refers to the use of messaging apps, speech-based assistants and chatbots to automate communication with customers.

speech-based assistants like Amazon Alexa and Google Home.

Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.

Speech recognition

Computational linguistics

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Sentiment

Rasa

==== Terms ====

corpus

tokenize

tag

stemming

lemma

chunking

chinking

stop word removal

==== Parsing ====

  * result is tree
  * can be diagrammed
  * based on a grammar
  * based on syntax

Two kinds:
  * Constituency Parsing 
    * based on phrase structure grammar (PSG), aka context-free grammar
    * sentence > noun phrase + verb phrase
  * Dependency Parsing
    * based on dependency grammar (DG)
    * verb has two dependencies: subject and object

Java Stanford CoreNLP
AllenNLP
Python StanfordNLP


==== Named-Entity Recognition ====

Bank of America is one name\\
America is also one name

Two parts:
  * detection, aka chunking
  * classification

Classification ontology hierarchies:
  * BBN categories, proposed in 2002, is used for question answering and consists of 29 types and 64 subtypes.
  * Sekine's extended hierarchy, proposed in 2002, is made of 200 subtypes.
  * in 2011 Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.


Classifications
  * person
  * place
  * organization
  * temporal expressions
  * numerical expressions


Approach
  * manual, using grammar
  * neural net classifiers


===== Corpus =====

Treebank, wikipedia

https://en.wikipedia.org/wiki/Treebank 

Most syntactic treebanks annotate variants of either phrase structure (left) or dependency structure (right).

The exploitation of treebank data has been important ever since the first large-scale treebank, The Penn Treebank, was published

However, two main groups can be distinguished: treebanks that annotate phrase structure (for example the Penn Treebank or ICE-GB) and those that annotate dependency structure (for example the Prague Dependency Treebank or the Quranic Arabic Dependency Treebank).

A very long list of treebanks

--

https://en.wikipedia.org/wiki/Interlinear_gloss 

In its simplest form, an interlinear gloss is simply a literal, word-for-word translation of the source text.

--

https://en.wikipedia.org/wiki/Parallel_text 

Rosetta stone

Sentence alignment

--

Corpus of Contemporary American English (COCA)

https://www.english-corpora.org/coca/ 


Fourth, you can search for phrases and strings. And because the corpus is optimized for speed, searches for substrings (*ism, un*able) and phrases are very fast, e.g.: got VERB-ed, BUY * ADJ NOUN, "gorgeous" NOUN -- and even high frequency phrases like: from ADJ to ADJ, phrasal verbs, or NOUN NOUN.

--

Non-English, Parallel & Multilingual Corpora

http://martinweisser.org/corpora_site/corpora2.html

--

Word databases, lexicons and n-gram databases from Lexical Computing

https://www.lexicalcomputing.com/lexical-computing/

==== Thai Language Corpus ====

=== Thai National Corpus ===

https://www.researchgate.net/publication/271429101_Thai_National_Corpus

Aroonmanakun, W. 2007.

TNC  web.  http://www.arts.chula.ac.th/~ling/TNC/ [Accessed 2009-04-24]. 


==== Trie ====

A data structure sometimes used in NLP.

trie - a digital tree

tree - a binary tree


===== Tools =====

[[NLTK]], Natural Language Toolkit, a Python package for English language NLP

[[PyThaiNLP]], a Python package for Thai language NLP


===== Resources =====

Online Sentence Parser, Carnegie Mellon University, 2003
https://www.link.cs.cmu.edu/link/index.html

Wordnet, Princeton University, 2005
https://wordnet.princeton.edu/

Stanford Parser, 2002-2020
https://nlp.stanford.edu/software/lex-parser.shtml

Ai2 The Allen Institute for AI, founded by Paul Allen
https://allenai.org/

The Turing Center, University of Washington
http://turing.cs.washington.edu/

Open IE, Open Information Extraction, entity-relation
https://openie.allenai.org/

NLTK, Python package for NLP

PyThaiNLP, Python package for NLP in Thai language

Penn Treebank

British National Corpus
http://www.natcorp.ox.ac.uk/