Table of Contents
PyThaiNLP
A Python package for Thai language NLP.
The github page, https://github.com/PyThaiNLP/pythainlp
The project home page, https://www.thainlp.org/
Tutorials, https://www.thainlp.org/pythainlp/tutorials/
Project founder, Wannaphong Phatthiyaphaibun, Khon Kaen University Nong Khai Campus https://iam.wannaphong.com/
Corpus
Word Lists:
- countries
- provinces
- thai_family_names
- thai_female_names
- thai_male_names
- thai_negations
- thai_stopwords
- thai_syllables
- thai_words
Capabilities
util.isthai()
util.countthai()
util.collate() - sort
util.thai_strftime() - format date and time in thai formats
util.thai_time() - spellout time as words
sent_tokenize() - split text into sentences, multiple algorithms
word_tokenize() - split text into words, multiple algorithms and dictionaries
dict_trie() - create a custom dictionary trie for use with word_tokenizei()
subword_tokenize() - syllable_tokenize() tokenize.tcc.segment() tokenize.tcc.tcc_pos()
transliterate.romanize() transliterate.transliterate()
util.normalize() - reorder vowels and tone marks, remove spaces, remove repeating vowels, remove dangling characters
util.arabic_digit_to_thai_digit() util.thai_digit_to_arabic_digit() util.digit_to_text()
Soundex: lk82, metasound, udom83
Spellchecking: Peter Novig's algorithm with word frequency from Thai National Corpus (TNC) spell() correct() Can use custom dictionary.
Thai National Corpus Thai Textbook Corpus
pythainlp.corpus.ttc - Thai Textbook Corpus
Part-of-Speech Tagging pos_tag() pos_tag_sents()
Named-Entity Tagging
Word Vector
Number Spell Out
Resources
paper about the Thai National Corpus
authors:
- Wirote Aroonmanakun, Chulalongkorn University, Bangkok, 1917, 37,000 students
- Kachen Tansiri, Kasetsart University, Bangkok, 1943, 86,000 students
- Pairit Nittayanuparp, Chulalongkorn University
https://www.researchgate.net/publication/271429101_Thai_National_Corpus