Article Preview
Buy Now
COLUMN
Soundex and Metaphone
Algorithms for matching words based on sound
Issue: 2.6 (July/August 2004)
Author: Thomas Reed
Author Bio: Thomas Reed has been programming as a hobbyist for more than 20 years, and fell in love with the Mac in 1984.
Article Description: No description available.
Article Length (in bytes): 7,857
Starting Page Number: 34
Article Number: 2616
Resource File(s):
2616.zip Updated: 2013-03-11 19:07:58
Related Link(s): None
Excerpt of article text...
"I overheard Marty saying I 'have the ethics of Cthulhu' last night. I wonder who that is? Well, I'll just look in my electronic dictionary. Let's see... C-U-T-H-O-O-L-O-O. Hmm, no matches. Maybe I spelled it wrong. K-U-T-H-O-O-L-O-O. No? C-U-T-H-U-L-O-O. Ahh, I give up. It must just be the name of some famous Indian philosopher or something. I'll have to be sure to tell him thanks."
Poor Curt. He's about to make a fool of himself, all because he doesn't know how to spell a word. If only he could have found a reference to Cthulhu, he would have discovered his friend was comparing him to an evil monster-deity from H. P. Lovecraft's imagination. His embarrassment would be averted if only his electronic dictionary used an algorithm that grouped similar sounds when searching for words! Perhaps he could even turn the tables on his acquaintance and inform him that, according to Lovecraft, the name would be more aptly pronounced "Khlul'-hloo".
Soundex is an algorithm that allows for the matching of words based on how they sound rather than how they are spelled. It is an extremely old algorithm, as such things go, with the first patent on the Soundex technique dating back to 1918. As such, it is also a fairly limited algorithm. It only takes into consideration US English spellings and pronunciations, and is of no use whatsoever with non-ASCII characters. However, it has seen quite wide use, and remains of interest even today.
Soundex is actually a hashing algorithm. Given a word, Soundex will return a very short (4 character) hash of that word. To generate this hash, Soundex assigns numbers to each of the letters in the alphabet based on which letters have similar sounds. (See Table 1 for a list of the codes and which letter each is assigned to.) Vowels and any other letters with a code of 0 are omitted from the hash, and double letters (such as 'LL' or 'PP') are coded only once in the hash.
The first character in the hash is always the first letter of the word. The remaining three characters in the hash are numeric codes generated from the remainder of the string, with characters omitted as described above. The hash is finished when it reaches four characters in length -- any remaining characters in the string are omitted from the hash. If the hash ends up being less than 4 characters long, it is padded with zeros at the end.
Let's consider an example. The name "Smith" would be converted into the Soundex code "S530". The initial "S" forms the first character of the code. The "M" is represented by the number 5, the "I" is omitted, the "T" is represented by 3 and the final "H" is omitted. Since the result is only 3 characters long, the result is padded with an additional 0 at the end.
...End of Excerpt. Please purchase the magazine to read the full article.