Module mspace

Introduction

This module provides efficient similarity search by implementing different data structures for indexing metric spaces. You can use it to index an arbitrary number of objects and later search the index for objects "similar" to a given query object.

For this module to be useful, you need a function with metric properties (see below for a definition) which you can apply to the type of objects you want to search for. The only pre-defined metric function in this module is levenshtein, which is a common metric to calculate the (dis)similarity of two strings.

Beware that the indexing phase might take a long time, depending on the number of objects you want to index and the runtime of your metric distance function. Indexing only pays off when you are planning to do a lot of searches.

Definitions

A metric space is a pair (S, f) where S is an arbitrary (possibly infinite) set of objects and f is a function which returns a "distance" for any two objects x, y in S. This distance is a number expressing the similarity of the two objects: the larger the distance, the more they differ, and a distance of zero means they are considered equal.

The function f has to be a "metric", i.e. the following conditions have to be met for all x, y, z in S:

  • f(x, y) >= 0 (non-negativity)
  • f(x, y) == 0 <=> x == y (identity)
  • f(x, y) == f(y, x) (symmetry)
  • f(x, z) <= f(x, y) + f(y, z) (triangle inequality)

This definition matches most people's intuitive understanding of the distance between two points in a Euclidean space (of arbitrary dimensionality). [1] Imagine you have three points in a 2-dimensional space like this:

A-------B
 \     /
  \   /
   \ /
    C

It is evident that the distance between two of these points is necessarily larger than (or equal to) zero, and that the direction you take for measuring (from A to B or vice versa) has no influence on the distance. The third condition, the triangle inequality, is easy to grasp, too: if you know the distances between A and B and between B and C, you can be sure that their sum is equal to or larger than the distance between A and C. Voilà, there's your metric space.
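
You can check these properties with the levenshtein function included in this module (described below), for example:

>>> import mspace
>>> mspace.levenshtein('KITTEN', 'SITTING')
3
>>> mspace.levenshtein('SITTING', 'KITTEN')   # symmetry
3
>>> d = mspace.levenshtein
>>> d('KITTEN', 'MITTEN') <= d('KITTEN', 'SITTING') + d('SITTING', 'MITTEN')
True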

In many cases, metric distance functions take a long time to compute. Because of this, the data structures used to index metric spaces are designed to minimize the number of distance computations necessary for searching the whole space: they use the properties of the metric function (most notably the triangle inequality) to discard as many objects as possible after every computation.
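
To illustrate the principle (this is only a sketch of the idea, not the code any index structure in this module actually uses): assume every indexed object o comes with a precomputed distance to some pivot object p. The triangle inequality guarantees |d(q, p) - d(p, o)| <= d(q, o), so for a query q every o with |d(q, p) - d(p, o)| > max_dist can be discarded after a single new distance computation:

>>> import mspace
>>> def prune_candidates(pivot, indexed, query, max_dist, metric):
...     """Keep only objects that can still be within max_dist of query.
...     indexed is a list of (object, distance-to-pivot) pairs."""
...     d_qp = metric(query, pivot)  # the only new distance computation
...     return [o for (o, d_po) in indexed if abs(d_qp - d_po) <= max_dist]
...
>>> indexed = [(w, mspace.levenshtein('WOOD', w))
...            for w in ['GOOD', 'WOOL', 'FOOT', 'BANANA']]
>>> prune_candidates('WOOD', indexed, 'MOOD', 1, mspace.levenshtein)
['GOOD', 'WOOL', 'FOOT']

The survivors still have to be verified with the real metric ('FOOT' actually has distance 2 to 'MOOD' and would be rejected then), but 'BANANA' was ruled out without any further distance computation.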

[1] In fact, Euclidean spaces are one type of metric space.

Use cases & caveats

Metric distance functions can be defined for objects other than points in Euclidean space, too. One well-known distance function is the Levenshtein distance, which is applicable to arbitrary strings. Since it is quite popular and indexing strings is one of the major use cases for metric spaces, an implementation of this function is included in this module. Other metric distance functions for strings are the Damerau-Levenshtein variant and the Hamming distance. These can be used for a variety of tasks, such as spell-checking, record linkage (or "deduplication"), error detection etc.

Please note that indexing a large number of objects usually takes a considerable amount of time and memory. Indexing only pays off if you are going to do a lot of searches in the index. If you just want to search a set of objects for all objects similar to one or a few other objects, you are better off comparing them sequentially by hand with your distance function.
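
Such a sequential scan needs no index at all, for example:

>>> import mspace
>>> words = ['GOOD', 'WOOD', 'WOOL', 'FOOT']
>>> [w for w in words if mspace.levenshtein('WOOD', w) <= 1]
['GOOD', 'WOOD', 'WOOL']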

Installation

This file is a stand-alone module for Python 2.4 and later which you only have to save somewhere in your PYTHONPATH. [2] No installation is necessary. The only hard dependencies are the modules sys, random and weakref, which are most probably already included in your Python installation.

If you are using this on a 32-bit i386 machine running Windows or Linux, you probably also want to install Psyco, which speeds up execution quite noticeably (in exchange for slightly increased memory usage). Psyco will be used automatically if it is available.
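
The module does this with a guarded import. If you want the same behaviour in your own code, the usual idiom (using Psyco's documented psyco.full() call) looks like this:

>>> try:
...     import psyco
...     psyco.full()    # JIT-compile all functions
... except ImportError:
...     pass            # Psyco not available, run at normal speed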

[2] Run python -c "import sys; print sys.path" to learn where Python currently looks for modules.

The most current version is always available from this site:

http://well-adjusted.de/mspace.py/

The project is currently kept in SVN and may be obtained from the following URL:

svn://svn.well-adjusted.de/mspace/trunk

Basic usage

Index creation

Say you have a dictionary file containing one or more words per line. You can make an index of all the words in it by doing:

>>> import mspace
>>> mydict = file('dictionary.txt')
>>> words = mspace.tokenizer(mydict)
>>> index = mspace.VPTree(words, mspace.levenshtein)

You can delay the time-consuming construction phase by creating the index without any parameters and explicitly calling construct at a later time:

>>> index = mspace.VPTree()
>>> index.construct(words, mspace.levenshtein)

Please note that the index's contents are discarded on every call to construct. The index is built from scratch with only the objects and distance function passed as parameters.

Performing range-searches

After you have built the index (and got yourself some coffee if the index is large), you may search it for all objects whose distance to a given query object lies between arbitrary bounds:

>>> index.range_search('WOOD', min_dist=0, max_dist=1)
['GOOD', 'WOOD', 'WOOL', 'MOOD']

In this case, you received four results: the object you searched for (apparently it has been in your dictionary, too) and three other objects "GOOD", "WOOL" and "MOOD", all of which have the requested maximum distance of one to your query object. Result lists are unordered and do not contain information about their contents' distance to the query object. If you need this, you have to calculate it by hand.

Both min_dist and max_dist are optional parameters defaulting to zero. Thus, if you leave them out, a search for objects which are exactly equal (as defined by your distance function) to the query object is performed. For historical reasons, and since range-searching with a minimum distance of zero and a maximum greater than zero is quite common, there is a shortcut to search with min_dist=0:

>>> index.search('WOOD', 1)
['GOOD', 'WOOD', 'WOOL', 'MOOD']

Performing nearest-neighbour-searches

The second type of search you can perform is "k-nearest-neighbour" search. It returns a given number of objects from the index that are closest to the query object. The results are guaranteed never to contain an object that is farther away from the query object than any object left out of the results.

Result lists are composed of (object, distance) tuples, sorted in ascending order of distance. If you don't specify a maximum number of matches to return, it defaults to one:

>>> index.nn_search('WOOD')
[('WOOD', 0)]
>>> index.nn_search('WOOD', 5)
[('WOOD', 0), ('WOOL', 1), ('GOOD', 1), ('MOOD', 1), ('FOOT', 2)]

Note that the index may contain other objects with a distance of two to the query object 'WOOD' (e.g. 'FOOL'). They have just been cut off because a maximum of five objects has been requested. You have no influence on the choice of representatives that is made.

Note that you must not expect this method to always return a list of length num, because you may be asking for more objects than are currently in the index.
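
For illustration, a tiny (made-up) index with only two objects simply returns both when asked for five nearest neighbours:

>>> import mspace
>>> tiny = mspace.VPTree(['GOOD', 'WOOD'], mspace.levenshtein)
>>> tiny.nn_search('WOOD', 5)
[('WOOD', 0), ('GOOD', 1)]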

Advanced usage

Indexing complex objects

If you have "complex" objects which you want to index by only one specific attribute, you can write a simple wrapper around levenshtein (or some other function applicable to that attribute) which extracts the attribute from your objects and returns the distance between the attributes' values. This way you can search for and retrieve complex objects from the index while comparing only a single attribute:

>>> import mspace
>>>
>>> class Record(object):
...     def __init__(self, firstname, surname):
...         self._firstname, self._surname = firstname, surname
...     def __repr__(self):
...         return "<Record: '%s %s'>" % (self._firstname, self._surname)
...
>>> def firstnameLev(r1, r2):
...     return mspace.levenshtein(r1._firstname, r2._firstname)
...
>>> # listOfRecords: a pre-existing list of Record objects
>>> index = mspace.VPTree(listOfRecords, firstnameLev)
>>> rec = Record("Paul", "Brady")
>>> index.search(rec, 2)
[<Record: 'Paula Bean'>, <Record: 'Raoul Perky'>, <Record: 'Paul Simon'>]

Of course you can also use more advanced techniques, like writing a function factory which returns a function that extracts an arbitrary attribute from your objects and applies a metric function to its values:

>>> def metric_using_attr(metric, attr):
...     def f(one, other, attr=attr):
...         attr1, attr2 = getattr(one, attr), getattr(other, attr)
...         return metric(attr1, attr2)
...     return f
...
>>> metric = metric_using_attr(mspace.levenshtein, "_firstname")
>>> metric( Record("Paul", "Brady"), Record("Paul", "Simon") )
0

(Note that the inner function f in this example deviates from the standard protocol by accepting an optional third parameter. This is done here only to pull the name attr into the inner function's local namespace, which saves some time when looking up its associated value. No index structure in this module will ever call its distance function with more than two parameters, so whatever you passed as attr to metric_using_attr when creating your function will be used whenever this function gets called by an index. Got it?)

Reverse indexes

Of course you can use a similar technique to avoid full indexes and instead create an index which only contains references to your data (in a database, a text file, somewhere on the net etc.). Your distance function would then use the supplied references to fetch the objects to compare from your data source. But, I cannot stress this enough, your distance function's performance is absolutely crucial for efficient indexing and searching. Therefore you should make sure that your way of accessing the data is really fast.
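
A minimal sketch of the idea, with a plain dictionary standing in for your actual data source (only the keys get indexed):

>>> import mspace
>>> data = {1: 'GOOD', 2: 'WOOD', 3: 'WOOL', 4: 'FOOT'}
>>> def keyLev(k1, k2):
...     return mspace.levenshtein(data[k1], data[k2])
...
>>> index = mspace.VPTree(data.keys(), keyLev)
>>> sorted(index.search(2, 1))    # keys whose value is at most 1 away from 'WOOD'
[1, 2, 3]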

Choice of data structure

This module currently gives you two data structures to choose from. While the underlying algorithms are different, their search results (given the same set of objects to index, distance function and maximum distance) are exactly the same. If you find a scenario where this is not the case, please let me know because this would be a serious bug.

Capabilities

The algorithms implemented in this module have different capabilities which are displayed in the table below:

                        VPTree   BKTree
  non-discrete metric      X
  insertion                         X
  deletion

The first row is about the values returned by the distance function the index uses. BKTrees are not prepared to handle non-discrete distances (floating point values, for example) and will most probably perform really badly when they occur.

The other two rows describe what you can do with the indexes after they have been constructed. Please note that neither of them can shrink over time, since deletion is not supported. Only BKTrees are able to handle insertion efficiently.
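
For example (assuming insert takes a single object, as the MetricTree description below suggests):

>>> import mspace
>>> index = mspace.BKTree(['GOOD', 'WOOD'], mspace.levenshtein)
>>> index.insert('WOOL')
>>> sorted(index.search('WOOD', 1))
['GOOD', 'WOOD', 'WOOL']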

Performance

[In short: you are probably better off using BKTrees unless you need a non-discrete metric. When in doubt, test both options.]

Obviously, the whole point of this module is to speed up the process of finding objects similar to one another. And, as you can imagine, the different algorithms perform differently when exposed to a specific problem.

Here's an example: I took the file american-english-large from the Debian package wamerican-large and did some time measurements of indexing and searching a random subset of it. The machine I used is a 1.1GHz Pentium3 running Python 2.4 from Debian stable with Psyco enabled. Please note that the following numbers have been determined in a completely unscientific manner; I only show them to give you an idea of what to expect. For serious benchmarking, I should have used the timeit module. Nevertheless, here it is:

                BKTree                      VPTree
   size    time  per node  height     time  per node  height
   5000    2.92  0.000584      11     7.40  0.001480      21
  10000    6.32  0.000632      14    16.02  0.001602      22
  15000    9.95  0.000663      14    28.35  0.001890      24
  20000   13.70  0.000685      14    41.40  0.002070      24
  25000   17.46  0.000699      15    50.63  0.002025      25
  30000   21.81  0.000727      15    55.47  0.001849      25
  35000   25.77  0.000736      16    64.43  0.001841      26
  40000   29.40  0.000735      16    75.45  0.001886      26
  45000   41.28  0.000917      16    96.36  0.002141      26
  50000   37.62  0.000752      16    95.31  0.001906      28

This table shows construction times (total and per node in seconds) for data sets of a given size. Additionally, you can see the height [3] of the trees. Apparently, BKTrees can be constructed a lot faster than VPTrees. Both of them need an increasing time per node with growing data sets. This is expected since construction complexity is O(n log(n)) in both cases.

                 BKTree (size 50,000)    VPTree (size 50,000)
  k    results      time     # dist        time     # dist
  0       0.89    0.0008       8.14      0.0019      17.28
  1       4.07    0.1670    1583.77      0.2801    1933.64
  2      38.10    1.1687   10353.31      1.4413   13845.67
  3     304.57    2.5614   22202.28      3.0497   27514.51
  4    1584.86    3.8727   32376.54      4.1518   36877.62
  5    5317.03    4.4182   39616.04      4.9935   42720.38

This table shows, for growing error tolerance k, the runtime of searches done in the trees (in seconds), the number of distance calculations done, and the average number of results. All values are given as averages over 100 random searches (from the same file, but not necessarily from the set of indexed objects). As you can see, search runtime (tightly coupled to the number of distance calculations) explodes for larger values of k. This growth only fades when an increasing part of the complete search space is examined (the number of distance calculations equals the number of nodes compared with the query object).

As you can also see, long search runtimes for large values of k don't actually hurt usability very much, since you only get a usable number of results for small k anyway. This is of course due to the nature of the data set and the distance function used in this example; your application may vary greatly.

You also have to keep in mind that the dictionary I used contains almost no duplicates. If you used the words of a real document as your data set, your tree would have significantly fewer nodes than the number of words in the document, since you typically repeat a lot of words very often. A quick test with my diploma thesis revealed only 4,400 distinct words in a document of 14,000 words (including LaTeX commands). This makes searching much faster because equal objects (as defined by the distance function) are only evaluated once.

[3] If you look closely, you can see that the VPTree's height doesn't grow continuously as the data set grows. The reason for this phenomenon is that this implementation does not try very hard to pick a good vantage point, which is the key factor for well-balanced trees. So, in some cases, a vantage point may be chosen that results in trees higher than strictly necessary.

Optimization potential (or: Why the f*ck is this so slow!?)

Ok, you have tried to index your small dictionary file and start to wonder why this easy task takes several minutes. Here are a couple of (possible) reasons:

  • Your metric distance function is slow: its runtime dominates both construction and searching, and it is called O(n log(n)) times during construction alone.
  • You are indexing a lot of objects, many of them duplicates that could be collapsed into a single node.
  • You are running plain CPython on a platform where Psyco is available but not installed.
  • Your distance function fetches the data it compares from a slow source (see "Reverse indexes" above).

Possible solutions include:

  • making your distance function as fast as possible, e.g. by keeping the data it needs in memory,
  • installing Psyco,
  • using a BKTree unless you need a non-discrete metric, and
  • indexing each distinct object only once.

Contact

For comments, patches, flames and hugs send mail to: jrspieker@well-adjusted.de.


Version: $Revision: 119 $

Date: $Date: 2012-09-17 02:13:52 +0200 (Mon, 17 Sep 2012) $

Author: $Author: jrschulz $

License: GPL

Classes
  MetricTree
Base class for metric space indexes which have tree-like properties. Its implementation is not complete enough to make it actually usable, but it should make it quite easy to implement other indexing and searching strategies. In the ideal case, all you have to do is derive from this class, implement the methods construct, _get_child_candidates and optionally insert, and make children an attribute or property returning all child nodes of the current node. If you need an __init__ method, it should call super(YourClass, self).__init__(objects, func, parent) at its end as well.
  BKTree
"Burkhard-Keller Trees" are unbalanced multiway trees and may grow over time. They belong to the first general solutions to index arbitrary metric spaces.
  FastBKTree
Simpler version of BKTree.
  VPTree
A "Vantage Point Tree" is a static, balanced, binary tree which can be used to index metric spaces with a non-discrete distance function. It has been independently described by both Jeffrey K. Uhlmann (1991) and Peter Yianilos (1993).
  UnindexableObjectError
This Exception is thrown when the call to the distance function in use by a metric space throws an exception. This should not happen during regular usage and is a hint that either your distance function is buggy or you tried to index objects which your distance function cannot handle.
Functions
 
levenshtein(x, y)
Compute Levenshtein (or "edit") distance between two strings. Levenshtein distance between two strings x and y is defined as the minimal number of operations on single characters that it takes to transform x into y (or vice versa).
 
tokenizer(textfile, separator=None)
Simple generator that tokenizes sequences of strings (or file-like objects) into uppercase words using the optional separator. This separator is passed directly to str.split so split's behaviour concerning None as separator applies here as well.
Variables
  dist_ctr = 0
  __package__ = None
  e = ImportError('No module named psyco',)
Function Details

levenshtein(x, y)


Compute Levenshtein (or "edit") distance between two strings. Levenshtein distance between two strings x and y is defined as the minimal number of operations on single characters that it takes to transform x into y (or vice versa).

Allowed operations are:

  • Insertion,
  • Deletion and
  • Replacement

of a single character.

Levenshtein distance has all the properties of a strictly positive metric. That means for all x, y, z it holds:

  • x == y <=> levenshtein(x, y) == 0
  • x != y <=> levenshtein(x, y) > 0 (positivity)
  • levenshtein(x, y) == levenshtein(y, x) (symmetry)
  • levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z) (triangle inequality)

The upper bound for Levenshtein distance is the length of the longer string: max(len(x), len(y)). A general lower bound is abs(len(x) - len(y)); it is attained, for example, when one string is a prefix or suffix of the other.

Incidentally, this implementation works not only for strings but for all types of sequences. Objects in the given sequences are compared for equality using '==' to determine whether an edit operation is needed.
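
For example, applied to lists of integers:

>>> import mspace
>>> mspace.levenshtein([1, 2, 3], [1, 3, 4])
2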

tokenizer(textfile, separator=None)


Simple generator that tokenizes sequences of strings (or file-like objects) into uppercase words using the optional separator. This separator is passed directly to str.split so split's behaviour concerning None as separator applies here as well.

After the last token has been returned, there is an attempt to close textfile. If this yields an AttributeError, it will be silently ignored.
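
For example (assuming tokens are yielded in input order; a plain list of strings works just as well as a file object, the failed close attempt is ignored):

>>> import mspace
>>> list(mspace.tokenizer(['good wood', 'wool']))
['GOOD', 'WOOD', 'WOOL']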