Squeak SmalltalkJoker Squeak Smalltalk : Visoracle : prevnext String Search Fuzzy Soundex Metaphone

does anybody know of a way of finding strings that match a given
pattern closely, but not necessarily exactly (like the Levinshtein
distance) available in Smalltalk ?

You might take a look at String>>correctAgainstDictionary:continuedFrom:,
which is called when you type a variable/method name that doesn't
exist in the system, and computes a list of likely spellings.

Not off-hand.  Check the list archives for the work Scott Crosby did
on text indexing; I'm not sure if it included the type of "near-miss"
indexing that you are interested in.

How about using the Soundex algorithm? A quick Google search found this
brief explanation <http://www.frontiernet.net/~rjacob/soundex.htm>,
Soundex in Ruby <http://raa.ruby-lang.org/list.rhtml?name=Soundex>, and
this C code <http://physics.nist.gov/cuu/Reference/soundex.html>.

  Martin,

  I happened to have written something like this while ago.

  According to the class comment, it is based on:
----------
Baeza-Yates, R. A., and Gonnet, G.H., A new approach to text
searcing, {\it Communications of the ACM}, 35, 10 (October
1992), pp 74-82.

Wu, S., and Manber, U., Fast text searching allowing errors,
{\it Communications of the ACM}, 35, 10 (October 1992), pp 82-91.
----------

  I don't remember how it worked well, but I had an application that
used this, so maybe not too bad if you know what's its limitation^^;

-- Yoshiki

Soundex:
It works for any words because it is based on how they sound. I have
read about one problem with the algorithm, though: you need different
sets of characters and weightings for different languages. For example,
I think you would want "j" and "h" to map to the same sound in Mexican
Spanish. (Forgive me if that's a bad example. The only Spanish I've
ever learned was "May I have another beer, please?" and "Where is the
bathroom?")

Jim

The other problem with it, as I recall, is that you the first letter
needs to be the same.  So a name/word that starts with 'ph' won't ever
match a word that starts with 'f', for example, even if they sound the
same.  Other than that, though, it works great: we used it for a sales
system and it allowed users to stop asking people to spell their names
over the phone.  I've tried typing in every convoluted spelling of my
name I can think of and it always finds me :)

Julian

> Other than that, though, it works great: we used it for a sales
> system and it allowed users to stop asking people to spell their names
> over the phone.  I've tried typing in every convoluted spelling of my
> name I can think of and it always finds me :)

And the code we used was (man, this is pretty old, these days I would use
streams...):

initialize

	SoundexMap _ Dictionary new.
	#('bpfv' 'cskgjqxz' 'dt' 'l' 'mn' 'r') doWithIndex:
            [:r :i | r do: [:c | SoundexMap at: c put: i asString first]].
	'aeiouyhw ' do: [:c | SoundexMap at: c put: nil].

soundex: aString

	|soundex i lastCode|
	soundex _ '0000' copy.
	aString isEmpty ifTrue: [^ soundex].
	soundex at: 1 put: (aString asUppercase at: 1).
	lastCode _ i _ 1.
	aString asLowercase allButFirst do:
		[:c | |code|
		(i < 4) ifTrue:
			[code _ SoundexMap at: c ifAbsent:[].
			 ((code = lastCode) or: [code isNil]) ifFalse:
                             [i := i + 1.
			     soundex at: i put: code].
			 lastCode _ code]].
	^ soundex

There is also the metaphone algorithm, which may or may not have these
same problems:
http://aspell.sourceforge.net/metaphone/

Chris,

        There is also an algorithm called 'Metaphone' that was originally
published in Computer Language in 1990.  It does a somewhat better job
of matching similar sounding words (at least in English).  The principal
weakness of soundex is that it always uses the first letter of the word,
which can often be spelled differently.

        You might also try searches on 'agrep' ("approximate grep") and
'string similarity' and 'approximate string matching' or
'approximate pattern matching' for other references.

Here are a few fairly good references:

        http://www.bitmechanic.com/mail-archives/mysql/Jan1998/0666.html
        http://aspell.net/metaphone/metaphone-kuhn.txt
        http://www.dcc.ufmg.br/~ghuiban/paa/tp3/node18.html

                                        -Dean