
Natural language processing in examples and games

Computational linguistics

Welcome to the main page of my project! This repository stores examples of linguistics problems.

My name is Daria. I'm a software engineer with skills in natural language processing. My main research interests are knowledge bases and fact extraction. These are very important analysis tools that provide semantic analysis and text mining.

The project has the following sections:

  • Pre-morphology
  • Phonology
  • Morphology
  • Syntax
  • Knowledge engineering
  • N-grams applications
  • Games

Three languages are currently supported in the source code: English, Russian and Finnish. I hope that upcoming problems will soon implement NLP algorithms for more languages.

Source code:

Pre-morphology

  • Russian tokenizer
  • Sentence boundary detection
  • Transliteration Russian <=> Latin (with spell-checker)
  • Word decomposition
  • Camel case segmenter
  • Distance to anagram
  • Russian number2text converter
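As an illustration of the kind of task listed above, here is a minimal sketch of a camel case segmenter. It is not the code from this repository: the function name `split_camel_case` and the regular expression are assumptions made for the example, and only ASCII letters and digits are handled.

```python
import re

# Hypothetical pattern, tried left to right:
#   1. a run of capitals that is followed by a capitalized word ("HTTP" in "HTTPResponse")
#   2. an optionally capitalized lowercase word ("camel", "Case")
#   3. any remaining run of capitals ("ID" at the end of "userID")
#   4. a run of digits
_SEGMENT = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_camel_case(identifier: str) -> list[str]:
    """Split a camelCase or PascalCase identifier into its parts."""
    return _SEGMENT.findall(identifier)


print(split_camel_case("camelCaseSegmenter"))
print(split_camel_case("parseHTTPResponse"))
```

Characters that match none of the alternatives (underscores, non-ASCII letters) are simply skipped in this sketch.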

Phonology

  • Soundex Algorithm Implementation
  • Syllable module (syllable count for Russian/English/Finnish words and syllable list for Russian/Finnish words)
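For readers unfamiliar with Soundex, the following is a minimal sketch of the standard American Soundex rules for English names. It is an illustration only and may differ from the implementation in this repository; the names `soundex` and `_CODES` are assumptions of the example.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y have no digit.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of an English word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated digit is kept after it
    return (result + "000")[:4]  # pad with zeros or truncate to 4 characters


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, which is what makes the algorithm useful for fuzzy name matching.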

Morphology

  • Russian patronymic generator
  • Russian diminutive names generator
  • Russian cases generator (dative)
  • Russian cognate words checker
  • English Adjective Comparisoner
  • Common English question generator
  • Finnish Predicative Sentences
  • Finnish POS-tagger
  • Finnish case tagger
  • Russian POS-tagger

Syntax

  • Syntax analyzer for simple sentences

Knowledge engineering

  • Family tree
  • Abstract ontology for company
  • Simple timetable QA-system
  • Bookshelf

N-grams applications

  • N-gram dictionary (for spelling/for language modeling)
  • Simple English word filler
  • N-gram language model
  • Collocations
  • Russian diminutive names generator with RNN
  • Russian character RNN (non-smoothing)
  • Russian joking language model (PI Day)
  • Simple spell-checker (based on n-grams and Damerau-Levenshtein distance)
  • Advanced spell-checker based on:
    • a dictionary of words from good texts with a 2-3-gram index;
    • a 2-gram language model trained on good texts;
    • candidate retrieval with Damerau-Levenshtein distance;
    • selection of the candidate with the maximum bigram probability: max{ P(prev_word, candidate) : candidate in candidates }
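The spell-checker steps listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the repository's code: all function names are hypothetical, the distance is the restricted (optimal string alignment) variant of Damerau-Levenshtein, candidates are found by scanning the whole dictionary instead of using a 2-3-gram index, and raw bigram counts stand in for P(prev_word, candidate), which gives the same argmax.

```python
from collections import Counter


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance (adjacent transpositions cost 1)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def train_bigrams(sentences):
    """Count words and adjacent word pairs in whitespace-tokenized 'good' texts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def correct(prev_word, word, unigrams, bigrams, max_distance=1):
    """Return the dictionary word near `word` that best follows `prev_word`."""
    if word in unigrams:
        return word  # known words are left unchanged
    candidates = [w for w in unigrams if osa_distance(word, w) <= max_distance]
    if not candidates:
        return word  # nothing close enough in the dictionary
    # Highest bigram count wins; the unigram count breaks ties.
    return max(candidates, key=lambda c: (bigrams[(prev_word, c)], unigrams[c]))


corpus = ["the cat sat on the mat", "the cat ate the fish", "a cap on the table"]
unigrams, bigrams = train_bigrams(corpus)
print(correct("the", "caz", unigrams, bigrams))
print(correct("a", "caz", unigrams, bigrams))
```

In the toy corpus the misspelling "caz" is one edit away from both "cat" and "cap", so the previous word decides which candidate is chosen. A real system would add smoothing so that unseen bigrams do not all collapse to a count of zero.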

Games

  • Russian Cities
  • Guess City
  • Guess Number
  • Secret Letter
  • Opposites