Voikko consists of a set of separately released components that form a stack of layers as illustrated in the picture below.
In this project we develop the components shown with a blue background, and some of the components shown with a yellow background:
The upstream version of Malaga by Björn Beutel is used to compile and debug Suomi-malaga. The Malaga implementation within libvoikko is also based on the original Malaga, but it has been modified to make it more suitable for our purposes.
This project started in late 2005 under the name Hunspell-fi, with the aim of creating Finnish vocabulary and affix files for Hunspell. The Hunspell-based implementation was developed for roughly six months. There were no serious problems, but it was also evident that the work progressed rather slowly. In early 2006 Hannu Väisänen published Suomi-malaga, which contained a vocabulary that was (depending on how one defines "word") roughly ten times larger than the Hunspell-fi vocabulary at that time. Additionally, the Hunspell-fi implementation did not support compound words and supported only a few derived word forms, both of which were supported by Suomi-malaga.
Suomi-malaga had a lot of correctness problems from the spellchecking point of view that did not exist in our Finnish Hunspell dictionary, but with the limited resources we had at that time we really could not afford to ignore the huge amount of work that had gone into producing Suomi-malaga. Using the vocabulary of Suomi-malaga in Hunspell was not possible due to the different semantics of word classification between these projects. It would be somewhat easier now that the data has been moved to Joukahainen and the classification has been modernised.
There are still some problems with the Malaga-based approach that might not exist in Hunspell. Malaga is not thread safe (this is going to be fixed within libvoikko), and the performance is sufficient but not great. Writing an accurate Finnish morphology with Malaga is not easy, but there are currently only a few cases (mostly involving inflection within compound words) where no satisfactory solution has been found yet. However, it is unlikely that Hunspell is any better in this regard. The COMPOUNDRULE patterns in Hunspell would make some things easier that are somewhat complicated to do with Malaga, but there are other major limitations in Hunspell (or at least there have been; some may have been fixed in recent versions) that should be considered:
All of the problems above could definitely be solved within Hunspell, but might require a lot of work. Compromising quality just to become compatible with Hunspell is not an option, because Finnish people have come to expect really good results from their spell checkers (we have had advanced compound word checking in commercial text editors for well over ten years). Moving to another implementation is certainly a possibility, and it will be easier after the vocabulary and inflection data have been completely moved to an application-independent format. The most likely replacement for Malaga would involve using some sort of finite state devices. Something along these lines is already being tried in the Omorfi project. If the results from Omorfi seem good enough and there is interest in integrating finite state tools in Hunspell, it would be more likely that we would make the switch.
The core parts of Voikko are all hidden behind the public interface of libvoikko, which is designed to be distributed as a shared library and used by any number of applications in the operating system. Our goal is to get the software shipped as a part of various Linux distributions so that Finnish writing aids work out of the box for anyone who needs them. In the best case users would not even know that they are using Voikko. The source packages released by us should be suitable for easy packaging in different distributions (if not, tell us and we will try to improve them). Just make sure that you package a compatible set of modules.
Note that the interface between libvoikko and Suomi-malaga is currently not considered fully stable, although it has remained unchanged for quite a long time. We still have some requirements left that cannot be implemented without changing this interface. We will do our best to make libvoikko handle missing or incorrect versions of Suomi-malaga lexicon files as gracefully as possible. However, we think that binary packages of libvoikko should have a dependency on the binary package of Suomi-malaga (commonly called voikko-fi), since the library is essentially useless without it.
The suggestion above implies that the Enchant provider plugin should not be distributed in the same binary package as the Enchant main library. Otherwise the dependency chain will drag Suomi-malaga binaries (the largest component of Voikko) onto almost every Linux desktop on the planet, regardless of the installation language. Luckily the provider plugin can easily be shipped in a separate package, since Enchant detects and loads provider plugins at runtime with dlopen. This way no Voikko-specific material gets installed on systems where Finnish spell checking is not needed.
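The runtime detection described above can be illustrated with a minimal sketch. This is not Enchant's actual code. It only shows the general pattern of trying to load optional shared libraries and silently skipping those that are not installed. The function name and the library file name used here are hypothetical. Python's ctypes.CDLL is used because it calls dlopen on POSIX systems.

```python
import ctypes
from typing import Dict, Iterable


def load_available_providers(candidates: Iterable[str]) -> Dict[str, ctypes.CDLL]:
    """Try to load each candidate shared library and skip the absent ones.

    The candidate names are supplied by the caller; nothing here reflects
    the real file names or search paths used by Enchant.
    """
    loaded: Dict[str, ctypes.CDLL] = {}
    for name in candidates:
        try:
            # On POSIX systems ctypes.CDLL loads the library via dlopen().
            loaded[name] = ctypes.CDLL(name)
        except OSError:
            # The library is not installed or cannot be loaded:
            # treat the provider as absent instead of failing.
            continue
    return loaded


# Hypothetical provider file name, for illustration only.
providers = load_available_providers(["libhypothetical_finnish_provider.so"])
print(sorted(providers))
```

Because a missing plugin is simply skipped, the main library has no hard dependency on it, which is what allows the plugin and everything it depends on to live in a separate package.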
We do not have official reference packaging available, but Fedora packages of Voikko follow the guidelines above and could be used as a starting point for packages for other distributions.
To make application integration easier, it would be preferable to have a unified standard interface between linguistic tools and the applications using them. The proposed freedesktop.org Desktop Language Checking Spec is a step in this direction.
For Windows and OS X the packaging may have to be done a bit differently, as neither of them natively supports software packaged this way. However, on OS X one can use similar third-party packaging systems such as Fink and MacPorts.