Stemming and Spell Checking in R
March 21, 2016
Last week we introduced the new hunspell R package. This week a new version was released which adds support for additional languages and text analysis features.
By default hunspell uses the US English dictionary en_US, but the new version allows for checking and analyzing in other languages as well. The ?hunspell help page has detailed instructions on how to install additional dictionaries.
> library(hunspell)
> hunspell_info("ru_RU")
$dict
[1] "/Users/jeroen/workspace/hunspell/tests/testdict/ru_RU.dic"

$encoding
[1] "UTF-8"

$wordchars
[1] NA

> hunspell("чёртова карова", dict = "ru_RU")[[1]]
[1] "карова"
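For comparison, here is a minimal sketch of the same kind of check against the default en_US dictionary, which ships with the package. The example words are made up for illustration. The dict argument is spelled out only to show where another language code would go.

```r
library(hunspell)

# Illustrative words; "wiskey" is a deliberate misspelling
words <- c("beer", "wiskey", "wine")

# One logical per word: TRUE if the dictionary accepts it
ok <- hunspell_check(words, dict = "en_US")

# Suggestions for the words that were rejected (a list, one element per word)
suggestions <- hunspell_suggest(words[!ok], dict = "en_US")
print(ok)
print(suggestions)
```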
It turned out this feature was much more difficult to implement than I expected. Much of the Hunspell library dates from before UTF-8 became popular, so many dictionaries use local 8-bit character encodings such as ISO-8859-1 for English and KOI8-R for Russian. To spell check in these languages, the character encoding of the document text has to match that of the dictionary. However, R only supports UTF-8, so we need to convert strings in C with iconv, which opens up a new can of worms. Anyway, it should all work now.
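To get a feel for the conversion problem, the following sketch uses base R's iconv() function to round-trip a UTF-8 string through an 8-bit encoding. It is not the package's C code, only an illustration. The variable names are hypothetical, and it assumes the platform's iconv supports KOI8-R.

```r
# Assumed dictionary encoding (illustrative; a real dictionary declares its own)
dict_encoding <- "KOI8-R"

# A six-letter Cyrillic word, written with \u escapes so the script stays ASCII
word <- "\u043a\u043e\u0440\u043e\u0432\u0430"

# Convert to the dictionary encoding; toRaw = TRUE returns a list of raw vectors
koi8_bytes <- iconv(word, from = "UTF-8", to = dict_encoding, toRaw = TRUE)[[1]]

# KOI8-R uses one byte per Cyrillic letter, UTF-8 uses two
n_koi8 <- length(koi8_bytes)
n_utf8 <- length(charToRaw(enc2utf8(word)))

# Convert results back to UTF-8 before handing them to R
back <- iconv(list(koi8_bytes), from = dict_encoding, to = "UTF-8")
print(c(koi8 = n_koi8, utf8 = n_utf8))
print(back == word)
```

Part of the can of worms is that a character with no equivalent in the target encoding cannot be converted at all, so every conversion needs a check for failure.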
Text analysis and wordclouds
In last week's post we showed how to parse and spell check a LaTeX file:
# Check an entire latex document
library(hunspell)
setwd(tempdir())
download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz", mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))
The new version also exposes the parser directly, so you can easily extract words and derive the stems to summarize some text, for example to display in a wordcloud.
# Summarize text by stems (e.g. for wordcloud)
allwords <- hunspell_parse(text, format = "latex")
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- head(sort(table(stems), decreasing = TRUE), 200)
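The same pipeline can be tried on a few lines of plain text without downloading anything. This is a rough sketch: the sample sentences and the column names stem and count are made up for illustration, and the resulting data frame is just one convenient shape for passing to a plotting or wordcloud package.

```r
library(hunspell)

# Made-up sample text, one element per line
sample_text <- c("The dogs were running while another dog runs.",
                 "Cats love running too.")

# Split each line into words (the default format is plain text)
allwords <- hunspell_parse(sample_text)

# hunspell_stem returns a list: zero or more stems for each word
stems <- unlist(hunspell_stem(unlist(allwords)))

# Count the stems and store them as a two-column data frame
freq <- sort(table(stems), decreasing = TRUE)
df <- data.frame(stem = names(freq), count = as.integer(freq),
                 stringsAsFactors = FALSE)
print(head(df))
```

Words that the dictionary does not recognize yield no stems and drop out of the counts, so the table reflects known words only.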