Special

Introducing the “Welcome to Xojo” Bundle!

New to Xojo and looking for guidance? We've put together a terrific bundle to welcome you! Xojo Bundle

This bundle includes six back issues of the magazine -- all of year 21 in printed book and digital formats -- plus a one-year subscription (beginning with 22.1) so you'll be learning all about Xojo for the next year. It's the perfect way to get started programming with Xojo. And you save as much as $35 over the non-bundle price!

This offer is only available for a limited time as supplies are limited, so hurry today and order this special bundle before the offer goes away!

Article Preview


Buy Now

Issue 2.6

COLUMN

Soundex and Metaphone

Algorithms for matching words based on sound

Issue: 2.6 (July/August 2004)
Author: Thomas Reed
Author Bio: Thomas Reed has been programming as a hobbyist for more than 20 years, and fell in love with the Mac in 1984.
Article Description: No description available.
Article Length (in bytes): 7,857
Starting Page Number: 34
Article Number: 2616
Resource File(s):

Download Icon 2616.zip Updated: 2013-03-11 19:07:58

Related Link(s): None

Excerpt of article text...

"I overheard Marty saying I 'have the ethics of Cthulhu' last night. I wonder who that is? Well, I'll just look in my electronic dictionary. Let's see... C-U-T-H-O-O-L-O-O. Hmm, no matches. Maybe I spelled it wrong. K-U-T-H-O-O-L-O-O. No? C-U-T-H-U-L-O-O. Ahh, I give up. It must just be the name of some famous Indian philosopher or something. I'll have to be sure to tell him thanks."

Poor Curt. He's about to make a fool of himself, all because he doesn't know how to spell a word. If only he could have found a reference to Cthulhu, he would have discovered his friend was comparing him to an evil monster-deity from H. P. Lovecraft's imagination. His embarrassment would be averted if only his electronic dictionary used an algorithm that grouped similar sounds when searching for words! Perhaps he could even turn the tables on his acquaintance and inform him that, according to Lovecraft, the name would be more aptly pronounced "Khlul'-hloo".

Soundex is an algorithm that allows for the matching of words based on how they sound rather than how they are spelled. It is an extremely old algorithm, as such things go, with the first patent on the Soundex technique dating back to 1918. As such, it is also a fairly limited algorithm. It only takes into consideration US English spellings and pronunciations, and is of no use whatsoever with non-ASCII characters. However, it has seen quite wide use, and remains of interest even today.

Soundex is actually a hashing algorithm. Given a word, Soundex will return a very short (4 character) hash of that word. To generate this hash, Soundex assigns numbers to each of the letters in the alphabet based on which letters have similar sounds. (See Table 1 for a list of the codes and which letter each is assigned to.) Vowels and any other letters with a code of 0 are omitted from the hash, and double letters (such as 'LL' or 'PP') are coded only once in the hash.

The first character in the hash is always the first letter of the word. The remaining three characters in the hash are numeric codes generated from the remainder of the string, with characters omitted as described above. The hash is finished when it reaches four characters in length -- any remaining characters in the string are omitted from the hash. If the hash ends up being less than 4 characters long, it is padded with zeros at the end.

Let's consider an example. The name "Smith" would be converted into the Soundex code "S530". The initial "S" forms the first character of the code. The "M" is represented by the number 5, the "I" is omitted, the "T" is represented by 3 and the final "H" is omitted. Since the result is only 3 characters long, the result is padded with an additional 0 at the end.

...End of Excerpt. Please purchase the magazine to read the full article.