Hunspell: Spell Checker and Text Parser for R

March 14, 2016


Hunspell is the spell checker library used in LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, and a few more. Base R has some spell checking functionality via the aspell function which wraps the aspell or hunspell command line program on supported systems. The new hunspell R package on the other hand directly links to the hunspell c++ library and works on all platforms without installing additional dependencies.

Basic tools

The hunspell_check function takes a vector of words and checks each individual word for correctness.

library(hunspell)
words <- c("beer", "wiskey", "wine")
hunspell_check(words)
## [1]  TRUE FALSE  TRUE

The hunspell function takes a character vector with text (in plain, latex or man format) and returns a list with incorrect words for each line.

bad_words <- hunspell("spell checkers are not neccessairy for langauge ninja's")
print(bad_words)
## [1] "neccessairy" "langauge"    "ninja's"    

Finally hunspell_suggest is used to suggest correct alternatives for each (incorrect) input word.

hunspell_suggest(bad_words[[1]])
## [[1]]
## [1] "necessary"    "necessarily"  "necessaries"  "recessionary" "accessory"    "incarcerate" 
##
## [[2]]
## [1] "language"  "Langeland" "Lagrange"  "Lange"     "gaugeable" "linkage"   "Langland" 
##
## [[3]]
## [1] "ninjas"   "Janina's" "Nina's"   "ninja"    "Janine's" "meninx"   "nark's"

Parsing text

The first challenge in spell-checking is extracting individual words from formatted text. The hunspell function supports three parsers via the format parameter: plain text, latex and man. For example to check the OpenCPU paper for spelling errors we use the latex source code:

download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
words <- hunspell(text, format = "latex")
sort(unique(unlist(words)))

Base R also has a few filters to extract words from R, Sweave or Rd code, see RdTextFilter, SweaveTeXFilter in tools. For example to check your R package manual for typos (assuming you are in the pkg source dir)

for(file in list.files("man", full.names = TRUE)){
  cat("\nFile", file, ":\n  ")
  txt <- tools::RdTextFilter(file, keepSpacing = FALSE)
  cat(sQuote(sort(unique(unlist(hunspell(txt))))), sep =", ")
}

Morphological analysis

A cool feature in hunspell is the morphological analysis. The hunspell_analyze function will show you how a word breaks down into a valid stem plus affix. Hunspell uses a special dictionary format to determine if a stem+affix combination is valid in a given language.

For example suppose we take a few variations of the word love. To get the possible stems+affix for each word:

hunspell_analyze(c("love", "loving", "lovingly", "loved", "lover", "lovely", "love"))
## [1] " st:love"
## [1] " st:loving"    " st:love fl:G"
## [1] " st:lovingly"
## [1] " st:loved"     " st:love fl:D"
## [1] " st:lover"     " st:love fl:R"
## [1] " st:lovely"    " st:love fl:Y"
## [1] " st:love"

Alternatively the hunspell_stem returns only the stem. Not sure how you would use this but it’s certainly cool.

Thanks!

Thanks to Daniel Falbel for suggesting this package on the rOpenSci forums!