PyThaiNLP

A Python package for Thai language NLP.

GitHub repository: https://github.com/PyThaiNLP/pythainlp

Project home page: https://www.thainlp.org/

Tutorials: https://www.thainlp.org/pythainlp/tutorials/

Project founder: Wannaphong Phatthiyaphaibun, Khon Kaen University, Nong Khai Campus, https://iam.wannaphong.com/

Corpus

Word Lists:

  • countries
  • provinces
  • thai_family_names
  • thai_female_names
  • thai_male_names
  • thai_negations
  • thai_stopwords
  • thai_syllables
  • thai_words

Capabilities

util.isthai() - check whether a string consists of Thai characters

util.countthai() - percentage of Thai characters in a string
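
As a rough illustration of what checks like isthai() and countthai() involve, the sketch below tests characters against the Thai Unicode block (U+0E00 to U+0E7F). The names is_thai_text and thai_char_percent are hypothetical, and this is not PyThaiNLP's implementation; the library may treat spaces, digits and punctuation differently.

```python
# Illustrative sketch only; not PyThaiNLP's code.
THAI_BLOCK_START = 0x0E00  # first code point of the Thai Unicode block
THAI_BLOCK_END = 0x0E7F    # last code point of the Thai Unicode block


def is_thai_char(ch: str) -> bool:
    """Return True if a single character lies in the Thai Unicode block."""
    return THAI_BLOCK_START <= ord(ch) <= THAI_BLOCK_END


def is_thai_text(text: str) -> bool:
    """Return True if text is non-empty and every character is Thai."""
    return bool(text) and all(is_thai_char(ch) for ch in text)


def thai_char_percent(text: str) -> float:
    """Percentage of non-whitespace characters that are Thai (assumed rule)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    thai = sum(1 for ch in chars if is_thai_char(ch))
    return 100.0 * thai / len(chars)
```

Ignoring whitespace in the percentage is an assumption made for this sketch.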

util.collate() - sort a list of strings in Thai dictionary order

util.thai_strftime() - format dates and times in Thai formats
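
Thai date formats commonly use the Buddhist Era (BE), which is the Gregorian year plus 543, together with Thai month names. The sketch below shows only that conversion. format_thai_date is a hypothetical helper, not the thai_strftime() API, which works with format directives.

```python
from datetime import date

# Full Thai month names, January through December.
THAI_MONTHS = [
    "มกราคม", "กุมภาพันธ์", "มีนาคม", "เมษายน",
    "พฤษภาคม", "มิถุนายน", "กรกฎาคม", "สิงหาคม",
    "กันยายน", "ตุลาคม", "พฤศจิกายน", "ธันวาคม",
]

BE_OFFSET = 543  # Buddhist Era year = Gregorian year + 543


def format_thai_date(d: date) -> str:
    """Format a date as 'day month-name BE-year' (hypothetical helper)."""
    return f"{d.day} {THAI_MONTHS[d.month - 1]} {d.year + BE_OFFSET}"
```

For example, format_thai_date(date(2021, 1, 28)) gives "28 มกราคม 2564".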

util.thai_time() - spell out the time in Thai words

sent_tokenize() - split text into sentences, multiple algorithms

word_tokenize() - split text into words, multiple algorithms and dictionaries

dict_trie() - create a custom dictionary trie for use with word_tokenize()
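
Thai is written without spaces between words, so dictionary-based tokenizers look up candidate words in a trie. The sketch below is a minimal greedy longest-matching tokenizer over a hand-built trie. TrieNode, build_trie and longest_match_tokenize are hypothetical names; this is not how dict_trie() or word_tokenize() are implemented, and the library's tokenization engines are more sophisticated.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One trie node: child nodes keyed by character, plus an end-of-word flag."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every dictionary word into a fresh trie and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def longest_match_tokenize(text: str, trie: TrieNode) -> List[str]:
    """Greedy left-to-right segmentation, taking the longest dictionary word."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        node = trie
        longest_end = -1
        j = i
        # Walk the trie as far as the text allows, remembering the last word end.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_word:
                longest_end = j
        if longest_end == -1:
            # No dictionary word starts here: emit a single character.
            longest_end = i + 1
        tokens.append(text[i:longest_end])
        i = longest_end
    return tokens
```

With a dictionary containing ฉัน, รัก, ภาษา, ไทย and ภาษาไทย, the text ฉันรักภาษาไทย is split into ฉัน, รัก and ภาษาไทย, because the longer entry wins over ภาษา followed by ไทย.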

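subword_tokenize(), syllable_tokenize(), tokenize.tcc.segment(), tokenize.tcc.tcc_pos() - split text into subword units such as syllables and Thai Character Clusters (TCC)

transliterate.romanize(), transliterate.transliterate() - render Thai text in Latin script or as a phonetic transcription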

util.normalize() - reorder vowels and tone marks, remove spaces, remove repeating vowels, remove dangling characters
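
Two of the clean-ups listed above are easy to sketch: removing zero-width characters and collapsing a tone mark that was typed more than once. The sketch also repairs the common typo of two SARA E characters (เเ) typed in place of one SARA AE (แ). simple_normalize is a hypothetical name and covers only a small part of what util.normalize() is described as doing; it does not reorder characters.

```python
import re

# Invisible characters that often sneak into copied text.
ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"
# Thai tone marks: MAI EK, MAI THO, MAI TRI, MAI CHATTAWA.
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

_ZERO_WIDTH_TABLE = {ord(ch): None for ch in ZERO_WIDTH}
_REPEATED_TONE = re.compile(f"([{TONE_MARKS}])\\1+")


def simple_normalize(text: str) -> str:
    """Apply a few illustrative clean-ups to Thai text (not the library's rules)."""
    # Drop zero-width characters.
    text = text.translate(_ZERO_WIDTH_TABLE)
    # Two SARA E in a row are almost always a mistyped SARA AE.
    text = text.replace("\u0e40\u0e40", "\u0e41")
    # Collapse a tone mark repeated back to back into a single mark.
    return _REPEATED_TONE.sub(r"\1", text)
```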

util.arabic_digit_to_thai_digit(), util.thai_digit_to_arabic_digit(), util.digit_to_text() - convert between Arabic and Thai digits, and spell out digits as Thai words
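
Digit conversion is a one-to-one character mapping between 0-9 and the Thai digits ๐-๙ (U+0E50 to U+0E59). The sketch below does this with str.translate. to_thai_digits and to_arabic_digits are hypothetical names used for illustration, not the library functions listed above.

```python
ARABIC_DIGITS = "0123456789"
THAI_DIGITS = "๐๑๒๓๔๕๖๗๘๙"

# Translation tables built once and reused on every call.
_TO_THAI = str.maketrans(ARABIC_DIGITS, THAI_DIGITS)
_TO_ARABIC = str.maketrans(THAI_DIGITS, ARABIC_DIGITS)


def to_thai_digits(text: str) -> str:
    """Replace Arabic digits with Thai digits; other characters are untouched."""
    return text.translate(_TO_THAI)


def to_arabic_digits(text: str) -> str:
    """Replace Thai digits with Arabic digits; other characters are untouched."""
    return text.translate(_TO_ARABIC)
```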

Soundex: lk82, metasound, udom83

Spellchecking: Peter Norvig's algorithm with word frequencies from the Thai National Corpus (TNC). Functions: spell(), correct(). A custom dictionary can be used.
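
The core of Norvig's approach is to generate every string one edit away from the input (deletion, transposition, replacement, insertion), keep those that are known words, and pick the most frequent. The sketch below shows that idea with a toy frequency table. edits1 and correct_word are hypothetical names, the frequencies are made up for illustration, and this is not the code behind spell() or correct().

```python
from typing import Dict, Set


def edits1(word: str, alphabet: Set[str]) -> Set[str]:
    """All strings that are one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word: str, freq: Dict[str, int]) -> str:
    """Return the most frequent known word within one edit, else the input."""
    if word in freq:
        return word
    # Assumption: the alphabet is whatever characters occur in the vocabulary.
    alphabet = {ch for known in freq for ch in known}
    candidates = [w for w in edits1(word, alphabet) if w in freq]
    if not candidates:
        return word
    return max(candidates, key=lambda w: freq[w])
```

A custom dictionary fits in naturally here: it is just a different freq table.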

Corpora: Thai National Corpus (TNC), Thai Textbook Corpus (TTC)

pythainlp.corpus.ttc - Thai Textbook Corpus

Part-of-Speech Tagging: pos_tag(), pos_tag_sents()

Named-Entity Tagging

Word Vector

Number Spell Out

Resources

Paper about the Thai National Corpus
Authors:

  • Wirote Aroonmanakun, Chulalongkorn University (Bangkok, founded 1917, 37,000 students)
  • Kachen Tansiri, Kasetsart University (Bangkok, founded 1943, 86,000 students)
  • Pairit Nittayanuparp, Chulalongkorn University

https://www.researchgate.net/publication/271429101_Thai_National_Corpus

pythainlp.txt · Last modified: 2021/01/28 05:46 by 127.0.0.1

Except where otherwise noted, content on this wiki is licensed under the following license: Public Domain