A sentence passes from the original words to the observed words through a noisy channel.
- We observe a misspelled word x
- Find the correct word w
ŵ = argmax_w P(x|w)P(w)
From this equation, the best correction ŵ maximizes the product of the channel probability P(x|w) (how likely w would be mistyped as x) and the prior probability P(w) from the language model.
- words with similar spelling
- words with similar pronunciation
We get the candidate list by computing the edit distance between strings.
Compute the minimal edit distance between two strings, where the allowed edits are:
- Insertion of a letter
- Deletion of a letter
- Substitution of one letter for another
- Transposition of two adjacent letters
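This minimal edit distance (insertion, deletion, substitution, and adjacent transposition is the restricted Damerau-Levenshtein distance) can be sketched with dynamic programming; a minimal illustration, not a production implementation:

```python
def damerau_levenshtein(s: str, t: str) -> int:
    """Minimal edit distance allowing insertion, deletion, substitution,
    and transposition of two adjacent letters (restricted Damerau-Levenshtein)."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(damerau_levenshtein("acress", "actress"))  # 1 (insert 't')
print(damerau_levenshtein("acress", "caress"))   # 1 (transpose 'ac' -> 'ca')
```

Candidates for a misspelling are typically the dictionary words within distance 1 (about 80% of real errors) or 2.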
- Unigram, bigram, trigram
- Web-scale spelling correction
- Stupid back-off
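Stupid backoff (Brants et al.) scores an n-gram by its relative frequency if it was seen, and otherwise backs off to the next-shorter context with a fixed weight (0.4 in the original paper). A minimal sketch, assuming a toy corpus; the function name and count-building code are illustrative:

```python
from collections import Counter

# Toy corpus standing in for web-scale counts.
corpus = "the cat sat on the mat the cat ate".split()
counts = Counter()
for n in (1, 2, 3):
    for j in range(len(corpus) - n + 1):
        counts[tuple(corpus[j:j + n])] += 1
total = len(corpus)

def stupid_backoff_score(words, i, counts, total, alpha=0.4, max_order=3):
    """Score words[i] given its history: use the relative n-gram frequency
    if the n-gram was seen, otherwise back off to the (n-1)-gram score,
    multiplying by alpha at each backoff step. Not a true probability."""
    start = min(max_order, i + 1)
    for order in range(start, 1, -1):
        ngram = tuple(words[i - order + 1:i + 1])
        context = ngram[:-1]
        if counts[ngram] > 0:
            return (alpha ** (start - order)) * counts[ngram] / counts[context]
    # Unigram base case: relative frequency in the corpus.
    return (alpha ** (start - 1)) * counts[(words[i],)] / total

print(stupid_backoff_score("the cat sat".split(), 2, counts, total))  # 0.5
print(stupid_backoff_score("a cat sat".split(), 2, counts, total))    # 0.2 (backed off to bigram)
```

Because the scores are not normalized, stupid backoff is cheap enough for web-scale language models.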
For example, with a unigram model, P(w) = count(w) / N, the word's relative frequency in the corpus.
P(x|w) = probability of the edit (deletion/insertion/substitution/transposition)
We should compute error probabilities: how often does each of these mistakes happen?
This table shows the confusion matrix for spelling errors.
Generating confusion lists by P(x|w)
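For instance, for a substitution error, P(x|w) is estimated from the confusion matrix as sub[x, w] / count(w), following Kernighan et al. (1990). A minimal sketch with toy counts (the numbers and variable names below are made up, not real data):

```python
# sub_counts[(a, b)] = number of times letter 'b' was typed as 'a'
sub_counts = {("e", "o"): 93}    # toy count: 'o' mistyped as 'e'
char_counts = {"o": 10_000}      # toy count: occurrences of 'o' in the corpus

def p_substitution(typed: str, intended: str) -> float:
    """P(typed | intended) for a single substitution error:
    sub[typed, intended] / count(intended)."""
    return sub_counts.get((typed, intended), 0) / char_counts[intended]

print(p_substitution("e", "o"))  # 0.0093
```

Analogous tables (del, ins, trans) give the probabilities for the other three edit types.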
Channel model for ‘acress’
Noisy channel probability for acress
This table shows the process of computing the noisy channel probability for ‘acress’.
- List the candidate words generated by insertion, deletion, substitution, and transposition
- Get P(w) from a language model such as the unigram model
- From a confusion matrix (the probability of typing ‘a’ as ‘b’, of inserting ‘a’ after ‘b’, etc.), get P(x|w) for each candidate word
- ŵ = argmax_w P(x|w) * P(w)
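The four steps above can be sketched end to end. This is a minimal illustration: the corpus is a toy stand-in, and instead of a real confusion matrix the channel model is a flat per-edit probability p_edit (an assumption for brevity):

```python
from collections import Counter
import string

# Toy corpus standing in for a large language-model corpus.
WORDS = Counter("the actress walked across the field the actress smiled".split())
N = sum(WORDS.values())

def edits1(word):
    """All strings one insertion, deletion, substitution,
    or transposition away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(x, p_edit=0.01):
    """argmax over candidate words w of P(x|w) * P(w). Here the channel
    model is a flat per-edit probability p_edit rather than confusion-matrix
    estimates, so the language model does the real work."""
    candidates = [w for w in edits1(x) if w in WORDS] or [x]
    return max(candidates, key=lambda w: p_edit * WORDS[w] / N)

print(correct("acress"))  # actress
```

With real confusion-matrix counts, p_edit would differ per candidate and could change the ranking.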
For more accuracy, we may replace the unigram model with a bigram model.
The above process then changes:
P(word) => P(word | previous word)
A trigram model improves the result further.
Two corpora are necessary:
- A corpus of spelling errors, for the probabilities of the edit operations (insertion, deletion, substitution, transposition)
- A corpus for the language model, i.e., the probability of a candidate word appearing after the previous word or the two previous words
For real-word spelling correction:
We don’t know which word in the sentence is wrong, so we must generate a set of candidates for each word.
- For each word in sentence
- Generate candidate set
- the word itself
- all single-letter edits that are English words
- words that are homophones
- Choose best candidates
- Noisy channel model
- Task-specific classifier
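The real-word procedure above can be sketched as follows. The vocabulary VOCAB and its word probabilities are toy stand-ins, and the channel is simplified: the word itself keeps probability alpha ("no error"), and the remaining mass is split evenly among its single-edit candidates:

```python
# Toy vocabulary with made-up unigram probabilities.
VOCAB = {"two": 0.01, "to": 0.03, "too": 0.005, "of": 0.02,
         "off": 0.002, "the": 0.06, "three": 0.001, "thew": 0.00001}

def edits1(word):
    """Single-edit variants: deletion, transposition, substitution, insertion."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set([L + R[1:] for L, R in splits if R] +
               [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1] +
               [L + c + R[1:] for L, R in splits if R for c in letters] +
               [L + c + R for L, R in splits for c in letters])

def correct_sentence(words, alpha=0.95):
    """For each word, choose the candidate maximizing P(x|w) * P(w),
    where the word itself gets channel probability alpha (no error)."""
    out = []
    for x in words:
        candidates = {w for w in edits1(x) if w in VOCAB}
        if x in VOCAB:
            candidates.add(x)
        if not candidates:
            out.append(x)
            continue
        def score(w):
            channel = alpha if w == x else (1 - alpha) / max(len(candidates) - 1, 1)
            return channel * VOCAB[w]
        out.append(max(candidates, key=score))
    return out

print(correct_sentence("two of thew".split()))  # ['two', 'of', 'the']
```

Note how alpha keeps valid words like "two" intact while still correcting the rare real word "thew" to "the".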
Consider the probability of no error:
the original word being correct accounts for most of the probability mass.
Improvements to channel model
- Allow richer edits
- ent -> ant
- ph -> f
- le -> al
- Incorporate pronunciation into channel
- Factors that could influence p(misspelling|word)
- The source letter
- The target letter
- Surrounding letters
- The position in word
- Nearby keys on the keyboard
- Homology on the keyboard
- Likely morpheme transformations