Finding Basic Forms of Finnish Words
September 18, 2014
In information retrieval and machine learning it is often useful to find the basic forms (also known as canonical, dictionary or citation forms) of the words appearing in a document. When the inflections themselves are of little or no concern, we can discard them and convert the document into a form in which each inflected word is replaced with its basic form. This coarser representation allows for simpler indexing and comparison of document contents. And by comparing the basic-form words of documents, for example by turning the documents into bags of words, we can quite easily find common classifying keywords and establish a similarity metric between documents.
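As a minimal sketch of the bag-of-words comparison mentioned above (a toy example in plain Python, not tied to any particular library), two documents can be scored by the overlap of their word sets:

```python
from collections import Counter

def bag_of_words(text):
    """Count how many times each (basic-form) word occurs in a document."""
    return Counter(text.lower().split())

def jaccard_similarity(doc_a, doc_b):
    """Similarity of two documents: shared words divided by all words."""
    words_a, words_b = set(bag_of_words(doc_a)), set(bag_of_words(doc_b))
    if not (words_a or words_b):
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

# Documents sharing basic-form words score higher than unrelated ones.
print(jaccard_similarity("dog chases cat", "cat chases mouse"))   # 0.5
print(jaccard_similarity("dog chases cat", "stock prices fell"))  # 0.0
```

Note that this kind of exact matching is precisely why inflected words must first be reduced to a common form: "tahdoimme" and "tahtoisitko" would otherwise never match.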
In English documents it is often sufficient to use stemming (in Finnish: stemmaus or informally "typistäminen") to find stems, bases or root forms of inflected words. Stems may then be compared to determine the similarity of two documents. An example of stemming, which was produced using an online stemmer, is shown below.
Input:
I like fishing but I love swimming
Output:
i like fish but i love swim
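The general idea behind such a stemmer can be sketched with a toy suffix-stripper (this is a deliberate simplification; real stemmers such as Porter's apply many ordered rules with conditions, and the online tool above likely uses a Porter-style algorithm):

```python
# Toy suffix-stripping stemmer: remove the longest matching suffix.
# "ming" is a crude stand-in for the consonant-doubling handling
# ("swimming" -> "swim") that real stemmers detect with proper rules.
SUFFIXES = ("ming", "ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require a reasonably long remaining stem to avoid mangling
        # short words like "is" or "us".
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

sentence = "i like fishing but i love swimming"
print(" ".join(toy_stem(w) for w in sentence.split()))
# i like fish but i love swim
```

Even this handful of rules reproduces the English example above, which hints at why stemming works tolerably well for English.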
However, when applied to Finnish, a language with numerous grammatical cases, complex inflections, many compound words and productive word formation, simple rule-based stemming doesn't seem to produce good enough results, at least in my experience.
For example, consider the conversion outlined below, which was produced with the same stemming tool as before, but with the Finnish option turned on.
Input:
tahtoisitko tahtoa tahtokaamme tahdoimme tahdoitko
Output:
tahtois tahto tahtok tahdoi tahdoi
Each input word is an inflection of the basic form "tahtoa", yet the produced stems mostly differ from each other. In any standard data analysis or NLP technique, these stems would be treated as non-matching terms.
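The mismatch is easy to verify: exact string matching, which is what bag-of-words representations rely on, sees four distinct terms where a single one was intended (a quick check in plain Python):

```python
# The stems produced by the Finnish stemmer above.
stems = "tahtois tahto tahtok tahdoi tahdoi".split()

# Exact matching finds four distinct terms instead of one.
print(len(set(stems)))   # 4
print(sorted(set(stems)))
```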
A more complex method called lemmatization (in Finnish: perusmuotoistaminen) has worked better for me. This method actually produces the lemmas (perusmuodot), i.e. the basic forms of the words, but to do this it requires more advanced computation and typically some lexical or corpus data for the language under consideration. While stemmers operate on the surface form of a word alone, lemmatizers require additional morphological knowledge and sometimes the context in which the word appears.
A while back I tried looking for Finnish lemmatizers that are free-of-charge as well as open-source. I stumbled across the academic package Omorfi, but I never understood how to operate it. I also found Lingsoft, a company that provides a closed-source commercial package with a limited trial demo available. A while later I found the SeCo Lexical Analysis Services, which provides lemmatization (as well as other language analysis tools) as a web service. It is based on Omorfi, and it has been packaged for easy use.
Given the same inflected words as input as in the previous example, the following output is produced:
Lemmatized: { "locale":"fi", "baseform":"tahtoa tahtoa tahtoa tahtoa tahtoa" }
Finally, results that one would expect!
For the time being, the SeCo Lexical Analysis Services is the best free-of-charge tool that I know of. I wouldn't necessarily use it to convert millions of documents, as that could be construed as a denial-of-service attack. Hopefully, some day someone with the know-how will wrap the Omorfi package into a simple-to-install and simple-to-use command line tool or R package that could be used locally to lemmatize millions of documents.