Module mspace

Introduction

This module provides efficient similarity search by implementing different data structures for indexing metric spaces. You can use it to index an arbitrary number of objects and later search the index for objects "similar" to a given query object.

For this module to be useful, you need a function with metric properties (see below for a definition) which you can apply to the type of objects you want to search for. The only pre-defined metric function in this module is levenshtein, which is a common metric to calculate the (dis)similarity of two strings.

Beware that the indexing phase might take a long time, depending on the number of objects you want to index and the runtime of your metric distance function. Indexing only pays off when you are planning to do a lot of searches.

Definitions

A metric space is a pair (S, f) where S is an arbitrary (possibly infinite) set of objects and f is a function which returns a "distance" for any two objects x, y in S. This distance is a number expressing the similarity of the two objects: the larger the distance, the more they differ, and a distance of zero means they are considered equal.

The function f has to be a "metric", i.e. the following conditions have to be met for all x, y, z in S:

  • f(x, y) >= 0 (non-negativity)
  • f(x, y) == 0 <=> x == y (identity)
  • f(x, y) == f(y, x) (symmetry)
  • f(x, z) <= f(x, y) + f(y, z) (triangle inequality)

This definition matches most people's intuitive understanding of the distance between two points in a Euclidean space (of arbitrary dimensionality). [1] Imagine you have three points in a 2-dimensional space like this:

A-------B
 \     /
  \   /
   \ /
    C

It is evident that the distance between two of these points is necessarily larger than (or equal to) zero, and that the direction you take for measuring (from A to B or vice versa) has no influence on the distance. The third condition, the triangle inequality, is easy to grasp, too: if you know the distances between A and B and between B and C, you can be sure that their sum is equal to or larger than the distance between A and C. Voilà, there's your metric space.
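
You can check these properties with the levenshtein function included in this module (described below), for example:

>>> import mspace
>>> mspace.levenshtein('KITTEN', 'SITTING')
3
>>> mspace.levenshtein('SITTING', 'KITTEN')   # symmetry
3
>>> d = mspace.levenshtein
>>> d('KITTEN', 'MITTEN') <= d('KITTEN', 'SITTING') + d('SITTING', 'MITTEN')
True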

In many cases, metric distance functions take a long time to compute. Because of this, the data structures used to index metric spaces are designed to minimize the number of distance computations necessary for searching the whole space: they use the properties of the metric function (most notably the triangle inequality) to discard as many objects as possible after every computation.
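
To illustrate the principle (this is only a sketch of the idea, not the code any index structure in this module actually uses): assume every indexed object o comes with a precomputed distance to some pivot object p. The triangle inequality guarantees |d(q, p) - d(p, o)| <= d(q, o), so for a query q every o with |d(q, p) - d(p, o)| > max_dist can be discarded after a single new distance computation:

>>> import mspace
>>> def prune_candidates(pivot, indexed, query, max_dist, metric):
...     """Keep only objects that can still be within max_dist of query.
...     indexed is a list of (object, distance-to-pivot) pairs."""
...     d_qp = metric(query, pivot)  # the only new distance computation
...     return [o for (o, d_po) in indexed if abs(d_qp - d_po) <= max_dist]
...
>>> indexed = [(w, mspace.levenshtein('WOOD', w))
...            for w in ['GOOD', 'WOOL', 'FOOT', 'BANANA']]
>>> prune_candidates('WOOD', indexed, 'MOOD', 1, mspace.levenshtein)
['GOOD', 'WOOL', 'FOOT']

The survivors still have to be verified with the real metric ('FOOT' actually has distance 2 to 'MOOD' and would be rejected then), but 'BANANA' was ruled out without any further distance computation.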

[1] In fact, Euclidean spaces are one type of metric space.

Use cases & caveats

Metric distance functions can be defined for objects other than points in Euclidean space, too. One well-known distance function is the Levenshtein distance, which is applicable to arbitrary strings. Since it is quite popular and indexing strings is one of the major use cases for metric spaces, an implementation of this function is included in this module. Other metric distance functions for strings are the Damerau-Levenshtein variant and the Hamming distance. These can be used for a variety of tasks, such as spell-checking, record linkage (or "deduplication"), error detection etc.

Please note that indexing a large number of objects usually takes a considerable amount of time and memory. Indexing only pays off if you are going to do a lot of searches in the index. If you just want to search a set of objects for all objects similar to one or a few other objects, you are better off comparing them sequentially by hand with your distance function.
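
Such a sequential scan needs no index at all, for example:

>>> import mspace
>>> words = ['GOOD', 'WOOD', 'WOOL', 'FOOT']
>>> [w for w in words if mspace.levenshtein('WOOD', w) <= 1]
['GOOD', 'WOOD', 'WOOL']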

Installation

This file is a stand-alone module for Python 2.4 and later which you only have to save somewhere in your PYTHONPATH. [2] No installation is necessary. The only hard dependencies are the modules sys, random and weakref, which are most probably already included in your Python installation.

If you are using this on a 32-bit i386 machine running Windows or Linux, you probably also want to install Psyco, which speeds up execution quite noticeably (in exchange for slightly increased memory usage). Psyco will be used automatically if it is available.
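
The module does this with a guarded import. If you want the same behaviour in your own code, the usual idiom (using Psyco's documented psyco.full() call) looks like this:

>>> try:
...     import psyco
...     psyco.full()    # JIT-compile all functions
... except ImportError:
...     pass            # Psyco not available, run at normal speed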

[2] Run python -c "import sys; print sys.path" to learn where Python currently looks for modules.

The most current version is always available from this site:

http://well-adjusted.de/mspace.py/

The project is currently kept in SVN and may be obtained from the following URL:

svn://svn.well-adjusted.de/mspace/trunk

Basic usage

Index creation

Say you have a dictionary file containing one or more words per line. You can make an index of all the words in it by doing:

>>> import mspace
>>> mydict = file('dictionary.txt')
>>> words = mspace.tokenizer(mydict)
>>> index = mspace.VPTree(words, mspace.levenshtein)

You can delay the time-consuming construction phase by creating the index without any parameters and explicitly calling construct at a later time:

>>> index = mspace.VPTree()
>>> index.construct(words, mspace.levenshtein)

Please note that the index's contents are discarded on every call to construct. The index is built from scratch with only the objects and distance function passed as parameters.

Performing range-searches

After you have built the index (and got yourself some coffee if the index is large), you may search it for all objects whose distance to a given query object lies between arbitrary bounds:

>>> index.range_search('WOOD', min_dist=0, max_dist=1)
['GOOD', 'WOOD', 'WOOL', 'MOOD']

In this case, you received four results: the object you searched for (apparently it has been in your dictionary, too) and three other objects "GOOD", "WOOL" and "MOOD", all of which have the requested maximum distance of one to your query object. Result lists are unordered and do not contain information about their contents' distance to the query object. If you need this, you have to calculate it by hand.

Both min_dist and max_dist are optional parameters defaulting to zero. Thus, if you leave them out, a search for objects which are exactly equal (as defined by your distance function) to the query object is performed. For historical reasons, and since range-searching with a minimum distance of zero and a maximum greater than zero is quite common, there is a shortcut to search with min_dist=0:

>>> index.search('WOOD', 1)
['GOOD', 'WOOD', 'WOOL', 'MOOD']

Performing nearest-neighbour-searches

The second type of search you can perform is "k-nearest-neighbour" search. It returns a given number of objects from the index that are closest to the query object. The results are guaranteed never to contain an object that is farther away from the query object than any object left out of the results.

Result lists are composed of (object, distance) tuples, sorted in ascending order of distance. If you don't specify a maximum number of matches to return, it defaults to one:

>>> index.nn_search('WOOD')
[('WOOD', 0)]
>>> index.nn_search('WOOD', 5)
[('WOOD', 0), ('WOOL', 1), ('GOOD', 1), ('MOOD', 1), ('FOOT', 2)]

Note that the index may contain other objects with a distance of two to the query object 'WOOD' (e.g. 'FOOL'). They have just been cut off because a maximum of five objects has been requested. You have no influence on the choice of representatives that is made.

Note that you must not expect this method to always return a list of length num, because you may be asking for more objects than are currently in the index.
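
For illustration, a tiny (made-up) index with only two objects simply returns both when asked for five nearest neighbours:

>>> import mspace
>>> tiny = mspace.VPTree(['GOOD', 'WOOD'], mspace.levenshtein)
>>> tiny.nn_search('WOOD', 5)
[('WOOD', 0), ('GOOD', 1)]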

Advanced usage

Indexing complex objects

If you have "complex" objects which you want to index by only one specific attribute, you can write a simple wrapper around levenshtein (or some other function applicable to that attribute) which extracts the attribute from your objects and returns the distance between the attributes' values. This way you can search for and retrieve complex objects from the index while comparing only a single attribute:

>>> import mspace
>>>
>>> class Record(object):
...     def __init__(self, firstname, surname):
...         self._firstname, self._surname = firstname, surname
...     def __repr__(self):
...         return "<Record: '%s %s'>" % (self._firstname, self._surname)
...
>>> def firstnameLev(r1, r2):
...     return mspace.levenshtein(r1._firstname, r2._firstname)
...
>>> # listOfRecords: a pre-existing list of Record objects
>>> index = mspace.VPTree(listOfRecords, firstnameLev)
>>> rec = Record("Paul", "Brady")
>>> index.search(rec, 2)
[<Record: 'Paula Bean'>, <Record: 'Raoul Perky'>, <Record: 'Paul Simon'>]

Of course you can also use more advanced techniques, like writing a function factory which returns a function that extracts an arbitrary attribute from your objects and applies a metric function to its values:

>>> def metric_using_attr(metric, attr):
...     def f(one, other, attr=attr):
...         attr1, attr2 = getattr(one, attr), getattr(other, attr)
...         return metric(attr1, attr2)
...     return f
...
>>> metric = metric_using_attr(mspace.levenshtein, "_firstname")
>>> metric( Record("Paul", "Brady"), Record("Paul", "Simon") )
0

(Note that the inner function f in this example deviates from the standard protocol by accepting an optional third parameter. This is done here only to pull the name attr into the inner function's local namespace, which saves some time when looking up its associated value. No index structure in this module will ever call its distance function with more than two parameters, so whatever you passed as attr to metric_using_attr when creating your function will be used whenever this function gets called by an index. Got it?)

Reverse indexes

Of course you can use a similar technique to avoid full indexes and instead create an index which only contains references to your data (in a database, a text file, somewhere on the net etc.). Your distance function would then use the supplied references to fetch the objects to compare from your data source. But, I cannot stress this enough, your distance function's performance is absolutely crucial for efficient indexing and searching. Therefore you should make sure that your way of accessing the data is really fast.
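
A minimal sketch of the idea, with a plain dictionary standing in for your actual data source (only the keys get indexed):

>>> import mspace
>>> data = {1: 'GOOD', 2: 'WOOD', 3: 'WOOL', 4: 'FOOT'}
>>> def keyLev(k1, k2):
...     return mspace.levenshtein(data[k1], data[k2])
...
>>> index = mspace.VPTree(data.keys(), keyLev)
>>> sorted(index.search(2, 1))    # keys whose value is at most 1 away from 'WOOD'
[1, 2, 3]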

Choice of data structure

This module currently gives you two data structures to choose from. While the underlying algorithms are different, their search results (given the same set of objects to index, distance function and maximum distance) are exactly the same. If you find a scenario where this is not the case, please let me know because this would be a serious bug.

Capabilities

The algorithms implemented in this module have different capabilities which are displayed in the table below:

                        VPTree   BKTree
  non-discrete metric      X
  insertion                         X
  deletion

The first row is about the values returned by the distance function the index uses. BKTrees are not prepared to handle non-discrete distances (floating point values, for example) and will most probably perform really badly when they occur.

The other two rows describe what you can do with the indexes after they have been constructed. Please note that neither of them can shrink over time, since deletion is not supported. Only BKTrees are able to handle insertion efficiently.
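
For example (assuming insert takes a single object, as the MetricTree description below suggests):

>>> import mspace
>>> index = mspace.BKTree(['GOOD', 'WOOD'], mspace.levenshtein)
>>> index.insert('WOOL')
>>> sorted(index.search('WOOD', 1))
['GOOD', 'WOOD', 'WOOL']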

Performance

[In short: you are probably better off using BKTrees unless you need a non-discrete metric. When in doubt, test both options.]

Obviously, the whole point of this module is to speed up the process of finding objects similar to one another. And, as you can imagine, the different algorithms perform differently when exposed to a specific problem.

Here's an example: I took the file american-english-large from the Debian package wamerican-large and did some time measurements of indexing and searching a random subset of it. The machine I used is a 1.1GHz Pentium3 running Python 2.4 from Debian stable with Psyco enabled. Please note that the following numbers have been determined in a completely unscientific manner; I only show them to give you an idea of what to expect. For serious benchmarking, I should have used the timeit module. Nevertheless, here it is:

                BKTree                      VPTree
   size    time  per node  height     time  per node  height
   5000    2.92  0.000584      11     7.40  0.001480      21
  10000    6.32  0.000632      14    16.02  0.001602      22
  15000    9.95  0.000663      14    28.35  0.001890      24
  20000   13.70  0.000685      14    41.40  0.002070      24
  25000   17.46  0.000699      15    50.63  0.002025      25
  30000   21.81  0.000727      15    55.47  0.001849      25
  35000   25.77  0.000736      16    64.43  0.001841      26
  40000   29.40  0.000735      16    75.45  0.001886      26
  45000   41.28  0.000917      16    96.36  0.002141      26
  50000   37.62  0.000752      16    95.31  0.001906      28

This table shows construction times (total and per node in seconds) for data sets of a given size. Additionally, you can see the height [3] of the trees. Apparently, BKTrees can be constructed a lot faster than VPTrees. Both of them need an increasing time per node with growing data sets. This is expected since construction complexity is O(n log(n)) in both cases.

                 BKTree (size 50,000)    VPTree (size 50,000)
  k    results      time     # dist        time     # dist
  0       0.89    0.0008       8.14      0.0019      17.28
  1       4.07    0.1670    1583.77      0.2801    1933.64
  2      38.10    1.1687   10353.31      1.4413   13845.67
  3     304.57    2.5614   22202.28      3.0497   27514.51
  4    1584.86    3.8727   32376.54      4.1518   36877.62
  5    5317.03    4.4182   39616.04      4.9935   42720.38

This table shows, for growing error tolerance k, the runtime of searches done in the trees (in seconds), the number of distance calculations done, and the average number of results. All values are given as averages over 100 random searches (from the same file, but not necessarily from the set of indexed objects). As you can see, search runtime (tightly coupled to the number of distance calculations) explodes for larger values of k. This growth only fades when an increasing part of the complete search space is examined (the number of distance calculations equals the number of nodes compared with the query object).

As you can also see, long search runtimes for large values of k don't actually hurt usability very much, since you only get a usable number of results for small k anyway. This is of course due to the nature of the data set and the distance function used in this example; your application may vary greatly.

You also have to keep in mind that the dictionary I used contains almost no duplicates. If you used the words of a real document as your data set, your tree would have significantly fewer nodes than the number of words in the document, since you typically repeat a lot of words very often. A quick test with my diploma thesis revealed only 4,400 distinct words in a document of 14,000 words (including LaTeX commands). This makes searching much faster because equal objects (as defined by the distance function) are only evaluated once.

[3] If you look closely, you can see that the VPTree's height doesn't grow continuously as the data set grows. The reason for this phenomenon is that this implementation does not try very hard to pick a good vantage point, which is the key factor for well-balanced trees. So, in some cases, a vantage point may be chosen that results in trees higher than strictly necessary.

Optimization potential (or: Why the f*ck is this so slow!?)

Ok, you have tried to index your small dictionary file and start to wonder why this easy task takes several minutes. Here are a couple of (possible) reasons:

  • Your metric distance function is slow: its runtime dominates both construction and searching, and it is called O(n log(n)) times during construction alone.
  • You are indexing a lot of objects, many of them duplicates that could be collapsed into a single node.
  • You are running plain CPython on a platform where Psyco is available but not installed.
  • Your distance function fetches the data it compares from a slow source (see "Reverse indexes" above).

Possible solutions include:

  • making your distance function as fast as possible, e.g. by keeping the data it needs in memory,
  • installing Psyco,
  • using a BKTree unless you need a non-discrete metric, and
  • indexing each distinct object only once.

Contact

For comments, patches, flames and hugs send mail to: jrspieker@well-adjusted.de.


Version: $Revision: 119 $

Date: $Date: 2012-09-17 02:13:52 +0200 (Mon, 17 Sep 2012) $

Author: $Author: jrschulz $

License: GPL

Classes
  MetricTree
Base class for metric space indexes which have tree-like properties. Its implementation is not complete enough to make it actually usable, but it should make it quite easy to implement other indexing and searching strategies. In the ideal case, all you have to do is derive from this class, implement the methods construct, _get_child_candidates and optionally insert, and make children an attribute or property returning all child nodes of the current node. If you need an __init__ method, it should call super(YourClass, self).__init__(objects, func, parent) at its end as well.
  BKTree
"Burkhard-Keller Trees" are unbalanced multiway trees and may grow over time. They belong to the first general solutions to index arbitrary metric spaces.
  FastBKTree
Simpler version of BKTree.
  VPTree
A "Vantage Point Tree" is a static, balanced, binary tree which can be used to index metric spaces with a non-discrete distance function. It has been independently described by both Jeffrey K. Uhlmann (1991) and Peter Yianilos (1993).
  UnindexableObjectError
This Exception is thrown when the call to the distance function in use by a metric space throws an exception. This should not happen during regular usage and is a hint that either your distance function is buggy or you tried to index objects which your distance function cannot handle.
Functions
 
levenshtein(x, y)
Compute Levenshtein (or "edit") distance between two strings. Levenshtein distance between two strings x and y is defined as the minimal number of operations on single characters that it takes to transform x into y (or vice versa).
 
tokenizer(textfile, separator=None)
Simple generator that tokenizes sequences of strings (or file-like objects) into uppercase words using the optional separator. This separator is passed directly to str.split so split's behaviour concerning None as separator applies here as well.
Variables
  dist_ctr = 0
  __package__ = None
  e = ImportError('No module named psyco',)
Function Details

levenshtein(x, y)


Compute Levenshtein (or "edit") distance between two strings. Levenshtein distance between two strings x and y is defined as the minimal number of operations on single characters that it takes to transform x into y (or vice versa).

Allowed operations are:

  • Insertion,
  • Deletion and
  • Replacement

of a single character.

Levenshtein distance has all the properties of a strictly positive metric. That means for all x, y, z it holds:

  • x == y <=> levenshtein(x, y) == 0
  • x != y <=> levenshtein(x, y) > 0 (positivity)
  • levenshtein(x, y) == levenshtein(y, x) (symmetry)
  • levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z) (triangle inequality)

The upper bound for Levenshtein distance is the length of the longer string: max(len(x), len(y)). A general lower bound is abs(len(x) - len(y)); it is attained, for example, when one string is a prefix or suffix of the other.

Incidentally, this implementation works not only for strings but for all types of sequences. Objects in the given sequences are compared for equality using '==' to determine whether an edit operation is needed.
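
For example, applied to lists of integers:

>>> import mspace
>>> mspace.levenshtein([1, 2, 3], [1, 3, 4])
2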

tokenizer(textfile, separator=None)


Simple generator that tokenizes sequences of strings (or file-like objects) into uppercase words using the optional separator. This separator is passed directly to str.split so split's behaviour concerning None as separator applies here as well.

After the last token has been returned, there is an attempt to close textfile. If this yields an AttributeError, it will be silently ignored.
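
For example (assuming tokens are yielded in input order; a plain list of strings works just as well as a file object, the failed close attempt is ignored):

>>> import mspace
>>> list(mspace.tokenizer(['good wood', 'wool']))
['GOOD', 'WOOD', 'WOOL']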