"""

Metric Spaces
==============

.. contents::

Introduction
-------------

This module provides efficient similarity search by implementing
different data structures for indexing metric spaces. You can use it to
index an arbitrary number of objects which you later want to search for
objects "similar" to a given other object.

For this module to be useful, you need a function with metric properties
(see below for a definition) which you can apply to the type of objects
you want to search for. The only pre-defined metric function in this
module is `levenshtein`, which is a common metric to calculate the
(dis)similarity of two strings.

Beware that the indexing phase might take a long time, depending on the
number of objects you want to index and the runtime of your metric
distance function. Indexing only pays off when you are planning to do a
lot of searches.

Definitions
............

A metric space is a pair ``(S, f)`` where ``S`` is an arbitrary
(possibly infinite) set of objects and ``f`` is a function which returns
a "distance" for any two objects ``x``, ``y`` in ``S``. This distance is
a number used to express the similarity between the two objects. The
larger the distance, the more they differ. A distance of zero means they
are considered equal.

The function ``f`` has to be a "metric", i.e. the following conditions
have to be met for all ``x``, ``y``, ``z`` in ``S``:

- ``f(x, y) >= 0``

  (positivity)

- ``f(x, y) == f(y, x)``

  (symmetry)

- ``f(x, z) <= f(x, y) + f(y, z)``

  (triangle inequality)

This definition matches most people's intuitive understanding of the
distance between two points in a Euclidean space (of arbitrary
dimensionality). [#euclid]_ Imagine you have three points in a
2-dimensional space like this::

    A-------B
     \     /
      \   /
       \ /
        C

It is evident that the distance between two of these points is
necessarily larger than (or equal to) zero, and that the direction you
take for measuring (from ``A`` to ``B`` or vice versa) has no influence
on the distance. The third condition, the triangle inequality, is easy
to grasp, too: if you know the distances between ``A`` and ``B`` and
between ``B`` and ``C``, you can be sure that their sum is equal to or
larger than the distance between ``A`` and ``C``. Voila, there's your
metric space.
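To make the three conditions concrete, here is a small self-contained
sketch (not part of this module; the function name and the sample points
are invented for illustration) that checks all of them for the Euclidean
distance in the plane:

```python
import math

def euclidean(p, q):
    # Euclidean distance between two points in the plane
    return math.hypot(p[0] - q[0], p[1] - q[1])

A, B, C = (0.0, 0.0), (4.0, 0.0), (2.0, 3.0)

assert euclidean(A, B) >= 0                                  # positivity
assert euclidean(A, B) == euclidean(B, A)                    # symmetry
assert euclidean(A, C) <= euclidean(A, B) + euclidean(B, C)  # triangle inequality
```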

In many cases, metric distance functions take a long time to compute.
Because of this, data structures used to index metric spaces are
designed to minimize the number of distance computations necessary for
searching the whole space by using the properties of its metric function
(most notably the triangle inequality) to discard as many objects as
possible after every computation.

.. [#euclid] In fact, Euclidean spaces are one type of metric space.


Use cases & caveats
....................

Metric distance functions can be defined for objects other than points
in Euclidean space, too. One very well known distance function is the
Levenshtein distance, which is applicable to arbitrary strings. Since it
is quite popular and indexing strings is one of the major use cases for
metric spaces, an implementation of this function is included in this
module. Other metric distance functions for strings are the
Damerau-Levenshtein variant and the Hamming distance. These can be used
for a variety of tasks, such as spell-checking, record linkage (or
"deduplication"), error detection etc.

Please note that indexing a large number of objects usually takes a
considerable amount of time *and* memory. Indexing only pays off if you
are going to do *a lot* of searches in the index. If you just want to
search a set of objects for all objects similar to one or a few other
objects, you are better off comparing them sequentially by hand with
your distance function.
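That sequential comparison is trivial to write yourself; a minimal
sketch (the function name and the toy numeric metric are invented for
illustration):

```python
def linear_search(objects, query, func, max_dist):
    # No index: simply compare every object to the query once.
    return [o for o in objects if func(query, o) <= max_dist]

# Toy example using absolute difference as the metric.
print(linear_search([1, 5, 7, 12], 6, lambda x, y: abs(x - y), 1))  # [5, 7]
```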

Installation
.............

.. _Psyco: http://psyco.sourceforge.net/

This file is a stand-alone module for Python 2.4 and later which you
only have to save somewhere in your ``PYTHONPATH``. [#pythonpath]_ No
installation is necessary. The only hard dependencies are the modules
``sys``, ``random`` and ``weakref``, which are most probably already
included in your Python installation.

If you are using this on a 32-bit i386 machine running Windows or Linux,
you probably also want to install Psyco_, which speeds up execution
quite noticeably (in exchange for a slightly increased memory usage).
Psyco will be used automatically if it is available.

.. [#pythonpath] Run ``python -c "import sys; print sys.path"`` to learn
   where Python currently looks for modules.

The most current version is always available from this site:

    http://well-adjusted.de/mspace.py/

The project is currently kept in SVN and may be obtained from the
following URL:

    svn://svn.well-adjusted.de/mspace/trunk
Basic usage
------------

Index creation
...............

Say you have a dictionary file containing one or more words per line.
You can make an index of all the words in it by doing:

.. python::

    >>> import mspace
    >>> mydict = file('dictionary.txt')
    >>> words = mspace.tokenizer(mydict)
    >>> index = mspace.VPTree(words, mspace.levenshtein)

You can delay the time-consuming construction phase by creating the
index without any parameters and explicitly calling ``construct`` at a
later time:

.. python::

    >>> index = mspace.VPTree()
    >>> index.construct(words, mspace.levenshtein)

Please note that the index's contents are discarded on every call to
``construct``. The index is built from scratch with only the objects and
distance function passed as parameters.

Performing range-searches
..........................

After you have built the index (and got yourself some coffee if the
index is large) you may search it for all objects whose distance to a
given other object lies between arbitrary bounds:

.. python::

    >>> index.range_search('WOOD', min_dist=0, max_dist=1)
    ['GOOD', 'WOOD', 'WOOL', 'MOOD']

In this case, you received four results: the object you searched for
(apparently it has been in your dictionary, too) and three other objects
"GOOD", "WOOL" and "MOOD", all of which have the requested maximum
distance of one to your query object. Result lists are unordered and do
not contain information about their contents' distance to the query
object. If you need this, you have to calculate it by hand.

Both ``min_dist`` and ``max_dist`` are optional parameters defaulting to
zero. Thus, if you leave them out, a search for objects which are
exactly equal (as defined by your distance function) to the query object
is performed. For historical reasons, and since range-searching with a
minimum distance of zero and a maximum greater than zero is quite
common, there is a shortcut to search with ``min_dist=0``:

.. python::

    >>> index.search('WOOD', 1)
    ['GOOD', 'WOOD', 'WOOL', 'MOOD']


Performing nearest-neighbour-searches
......................................

The second type of search you can perform is "k-nearest-neighbour"
search. It returns a given number of objects from the index that are
closest to the query object. Search results are guaranteed never to
contain an object that is farther away from the query object than any
other indexed object that was left out of the results.

Result lists are composed of ``(object, distance)`` tuples, sorted in
ascending order of distance. If you don't specify a maximum number of
matches to return, it defaults to one:

.. python::

    >>> index.nn_search('WOOD')
    [('WOOD', 0)]
    >>> index.nn_search('WOOD', 5)
    [('WOOD', 0), ('WOOL', 1), ('GOOD', 1), ('MOOD', 1), ('FOOT', 2)]

Note that the index may contain other objects with a distance of two to
the query object ``'WOOD'`` (e.g. ``'FOOL'``). They have just been cut
off because a maximum of five objects has been requested. You have no
influence on the choice of representatives that is made.

Note also that you must not expect this method to always return a list
of length ``num`` because you might ask for more objects than are
currently in the index.
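The semantics described above can be pinned down with a brute-force
reference implementation (a hypothetical sketch for illustration, not
the algorithm this module uses; the function name and the toy numeric
metric are invented):

```python
def nn_search_bruteforce(objects, query, func, num=1):
    # Compute every distance, sort ascending, keep the first `num` pairs.
    pairs = sorted((func(query, o), o) for o in objects)
    return [(o, d) for d, o in pairs[:num]]

# Toy example using absolute difference as the metric.
print(nn_search_bruteforce([3, 8, 9, 15], 10, lambda x, y: abs(x - y), 2))
# [(9, 1), (8, 2)]
```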


Advanced usage
---------------

Indexing complex objects
.........................

If you have "complex" objects which you want to index by only one
specific attribute, you can write a simple wrapper around `levenshtein`
(or some other function applicable to the attribute) which extracts this
attribute from your objects and returns the distance between the
attributes' values. This way you can search for and retrieve complex
objects from the index while comparing only a single attribute:

.. python::

    >>> import mspace
    >>>
    >>> class Record(object):
    ...     def __init__(self, firstname, surname):
    ...         self._firstname, self._surname = firstname, surname
    ...
    >>> def firstnameLev(r1, r2):
    ...     return mspace.levenshtein(r1._firstname, r2._firstname)
    ...
    >>> index = mspace.VPTree(listOfRecords, firstnameLev)
    >>> rec = Record("Paul", "Brady")
    >>> index.search(rec, 2)
    [<Record: 'Paula Bean'>, <Record: 'Raoul Perky'>, <Record: 'Paul Simon'>]

Of course you can also use more advanced techniques like writing a
function factory which returns a function that extracts an arbitrary
attribute from your objects and applies a metric function to it:

.. python::

    >>> def metric_using_attr(metric, attr):
    ...     def f(one, other, attr=attr):
    ...         attr1, attr2 = getattr(one, attr), getattr(other, attr)
    ...         return metric(attr1, attr2)
    ...     return f
    ...
    >>> metric = metric_using_attr(mspace.levenshtein, "_firstname")
    >>> metric( Record("Paul", "Brady"), Record("Paul", "Simon") )
    0

(Note that the inner function ``f`` in this example deviates from the
standard protocol by accepting an optional third parameter. This is
done here only to pull the name ``attr`` into the inner function's
namespace to save some time when looking up its associated value. No
index structure in this module will ever call its distance function with
more than two parameters anyway, so whatever you passed as ``attr``
to ``metric_using_attr`` when creating your function will be used when
this function gets called by an index. Got it?)

Reverse indexes
................

Of course you can use a similar technique to avoid full indexes and
instead create an index which only contains references to your data (in
a database, text file, somewhere on the net etc.). Your distance
function would then use the supplied values to fetch the objects to
compare from your data source. But, I cannot stress this enough, your
distance function's performance is absolutely crucial to efficient
indexing and searching. Therefore you should make sure that your way of
accessing the data is really fast.
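A sketch of the idea (the dict standing in for an external data source,
the key-based metric and all names are invented for illustration; a real
application would fetch from a database or file):

```python
# Stand-in for an external data source, keyed by record id.
datasource = {1: 'WOOD', 2: 'WOOL', 3: 'FOOT'}

def key_distance(key1, key2):
    # Fetch the real objects behind the keys before comparing them.
    # The speed of this lookup dominates indexing and searching!
    s1, s2 = datasource[key1], datasource[key2]
    # Toy metric: Hamming distance on equal-length strings.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

# The index would then be built over the keys only, e.g.:
#     index = mspace.VPTree(list(datasource), key_distance)
print(key_distance(1, 2))  # 1
```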

Choice of data structure
-------------------------

This module currently gives you two data structures to choose from.
While the underlying algorithms are different, their search results
(given the same set of objects to index, distance function and maximum
distance) are exactly the same. If you find a scenario where this is not
the case, please let me know because this would be a serious bug.

Capabilities
.............

The algorithms implemented in this module have different capabilities
which are displayed in the table below:

+--------------------+-------+-------+
|                    |VPTree |BKTree |
+====================+=======+=======+
|non-discrete metric |   X   |       |
+--------------------+-------+-------+
|insertion           |       |   X   |
+--------------------+-------+-------+
|deletion            |       |       |
+--------------------+-------+-------+

The first row is about the value returned by the distance function the
index uses. BKTrees are not prepared to handle non-discrete distances
(for example floats) and will most probably perform really badly when
they occur.

The other two rows describe what you can do with the indexes *after*
they have been constructed. Please note that neither of them may shrink
over time. Only BKTrees are able to handle insertion efficiently.

Performance
............

[In short: you are probably better off using BKTrees unless you need a
non-discrete metric. When in doubt, test both options.]

Obviously, the whole point of this module is to speed up the process of
finding objects similar to one another. And, as you can imagine, the
different algorithms perform differently when exposed to a specific
problem.

Here's an example: I took the file ``american-english-large`` from the
Debian package ``wamerican-large`` and did some time measurements of
indexing and searching a random subset of it. The machine I used is a
1.1GHz Pentium3 running Python 2.4 from Debian stable **with Psyco
enabled**. Please note that the following numbers have been determined
in a completely unscientific manner. I only show them to give you an
idea of what to expect. For serious benchmarking, I should have used the
``timeit`` module. Nevertheless, here it is:


+------+------------------------+------------------------+
|      |BKTree                  |VPTree                  |
+======+======+=========+=======+=======+========+=======+
|size  |time  |per node |height |time   |per node|height |
+------+------+---------+-------+-------+--------+-------+
| 5000 |  2.92|0.000584 |11     |   7.40|0.001480|     21|
+------+------+---------+-------+-------+--------+-------+
|10000 |  6.32|0.000632 |14     |  16.02|0.001602|     22|
+------+------+---------+-------+-------+--------+-------+
|15000 |  9.95|0.000663 |14     |  28.35|0.001890|     24|
+------+------+---------+-------+-------+--------+-------+
|20000 | 13.70|0.000685 |14     |  41.40|0.002070|     24|
+------+------+---------+-------+-------+--------+-------+
|25000 | 17.46|0.000699 |15     |  50.63|0.002025|     25|
+------+------+---------+-------+-------+--------+-------+
|30000 | 21.81|0.000727 |15     |  55.47|0.001849|     25|
+------+------+---------+-------+-------+--------+-------+
|35000 | 25.77|0.000736 |16     |  64.43|0.001841|     26|
+------+------+---------+-------+-------+--------+-------+
|40000 | 29.40|0.000735 |16     |  75.45|0.001886|     26|
+------+------+---------+-------+-------+--------+-------+
|45000 | 41.28|0.000917 |16     |  96.36|0.002141|     26|
+------+------+---------+-------+-------+--------+-------+
|50000 | 37.62|0.000752 |16     |  95.31|0.001906|     28|
+------+------+---------+-------+-------+--------+-------+


This table shows construction times (total and per node, in seconds) for
data sets of a given size. Additionally, you can see the height
[#height]_ of the trees. Apparently, BKTrees can be constructed a lot
faster than VPTrees. Both of them need an increasing time per node with
growing data sets. This is expected since construction complexity is
``O(n log(n))`` in both cases.

+---------------+-----------------+-----------------+
|               |BK, size: 50,000 |VP, size: 50,000 |
+------+--------+-------+---------+-------+---------+
|k     |results | time  | # dist  | time  | # dist  |
+------+--------+-------+---------+-------+---------+
|0     |   0.89 |0.0008 |    8.14 |0.0019 |   17.28 |
+------+--------+-------+---------+-------+---------+
|1     |   4.07 |0.1670 | 1583.77 |0.2801 | 1933.64 |
+------+--------+-------+---------+-------+---------+
|2     |  38.10 |1.1687 |10353.31 |1.4413 |13845.67 |
+------+--------+-------+---------+-------+---------+
|3     | 304.57 |2.5614 |22202.28 |3.0497 |27514.51 |
+------+--------+-------+---------+-------+---------+
|4     |1584.86 |3.8727 |32376.54 |4.1518 |36877.62 |
+------+--------+-------+---------+-------+---------+
|5     |5317.03 |4.4182 |39616.04 |4.9935 |42720.38 |
+------+--------+-------+---------+-------+---------+

This table shows the runtime of searches done in the trees (in seconds),
the number of distance calculations done and the number of results for
growing error tolerance. All values are given as averages over 100
random searches (from the same file, but not necessarily from the set of
indexed objects). As you can see, search runtime (tightly connected to
the number of distance calculations) literally explodes for larger
values of ``k``. This growth only fades when an increased part of the
complete search space is examined (the number of distance calculations
equals the number of nodes compared with the query object).

As you can see, too, long search runtimes for large values of ``k``
don't actually hurt usability very much, since you only get a usable
number of results for small ``k`` anyway. This is of course due to the
nature of the data set and the distance function used in this example.
Your application may vary greatly.

You also have to keep in mind that the dictionary I used contains almost
no duplicates. If you used the words of a real document as your data
set, your tree would have significantly fewer nodes than the number of
words in your document, since you typically repeat a lot of words very
often. A quick test with my diploma thesis revealed only 4400 distinct
words in a document with 14,000 words (including LaTeX commands). This
makes searching much faster because equal objects (as defined by the
distance function) are only evaluated once.

.. [#height] If you look closely, you can see that the VPTree's height
   doesn't grow continuously when the data set grows. The reason for
   this phenomenon is that this implementation does not try very hard
   to pick a good vantage point, which is the key factor in getting
   well-balanced trees. So, in some cases, a vantage point may be
   chosen that results in trees higher than strictly necessary.
Optimization potential (or: Why the f*ck is this so slow!?)
------------------------------------------------------------

Ok, you have tried to index your small dictionary file and start to
wonder why this easy task takes several minutes. Here are a couple of
(possible) reasons:

- Indexing takes ``O(n log(n))`` time for all data structures currently
  implemented. So yes, doubling the number of indexed objects will
  necessarily more than double your runtime for indexing, sorry.

- Python is slow. I have written an implementation very similar to this
  one in Java which is significantly faster (by a factor of about 15 to
  25, even when using Psyco_!). But the Java implementation eats more
  memory and unfortunately I cannot release it under a free license.

- Recursion is slow in Python. Recursion is the most natural way to
  create and search trees and this code uses it a lot. Most of the
  recursion in this module is "tail recursion", but the Python
  interpreter is not able to optimize it into loops.

  [Note: as of revision 60, searching in both trees and inserting into
  BKTrees has been rewritten using loops instead of recursion.
  Performance gain is quite small, though.]

- The distance distribution in your metric space is too dense. This
  leads to the data structures being unable to discard large parts of
  the indexed space while searching. In pathological cases, you may end
  up with data structures resembling linked lists.

- Your distance function is non-discrete, i.e. it returns floats. How
  well this is supported depends on the index structure in use.

- Your metric function is very expensive. Remember that this function
  has to be called *very often* during indexing. You may use this
  module's attribute ``dist_ctr`` to get an idea how often this is done.
  It is incremented by one on every call to your distance function and
  is otherwise unused.

- You are searching for non-similar objects. If you, for example, have
  an index of strings with an average length of six characters and you
  are continuously searching for strings with a maximum distance of
  three or more, do not expect the search to be significantly faster
  than a linear scan of the complete search space. It may even be
  slower.

Possible solutions include:

- Rewrite in C. Since I cannot be bothered to learn C, someone else
  would have to do this.

- Use Pyrex or Psyco_. Quick tests with Psyco suggest that indexing
  becomes about 3-4 times faster. This is as easy as doing an ``import
  psyco; psyco.full()`` in your program. Read Psyco's manual for tuning
  options.

  [Note: as of about revision 46, on module loading an attempt is made
  to import and enable Psyco for levenshtein and the tree classes. If it
  doesn't succeed, you'll get a performance warning on ``stderr`` but
  everything will continue to work flawlessly.]

- Pickle trees for static data sets. Pickling and unpickling indexes
  using Python's standard ``pickle`` module is quite fast. But beware
  that a pickled index takes a lot more space than your plain data.

- Rewrite tree construction and searching using loops. I am not sure
  whether this will be faster, but it will probably take less memory
  than the current approach and fix the problem with Python's recursion
  limit. I fear that code readability will suffer, though. The current
  implementations are quite close to the original descriptions of the
  algorithms.

  [Note: partially done, see above.]

- Optimize the distance function in use. Since I do not know your
  distance function, I cannot help you with this.

  As for `levenshtein`: the algorithm implemented in this module is the
  classical dynamic programming variant with ``O(|n|*|m|)`` time
  complexity (and ``O(min(|n|, |m|))`` space). Other algorithms for
  Levenshtein try not to compute the whole matrix but only the values
  around the diagonal of the matrix that are strictly necessary. By
  doing this, the average ``O(|n|*|m|)`` runtime of the classical
  dynamic programming algorithm can be improved to something like
  ``O(k*|n|)``. While it isn't obvious how much wall clock time this
  approach saves in reality, it may in fact be a whole lot faster
  because ``k`` is usually only a fraction of the length of the input
  strings.

  And if you take a close look at the algorithms, you may realize that
  searching may be sped up by using an algorithm that doesn't actually
  compute the distance between two strings, but one that only answers
  whether two strings have a distance less than or equal to a specified
  maximum distance. This can be done in ``O(k^2)`` and should boost
  search performance quite noticeably.

.. _Citeseer: http://citeseer.ist.psu.edu/navarro99guided.html


If you are interested in any of these optimizations, I suggest reading
Gonzalo Navarro's excellent survey "Guided Tour To Approximate String
Matching" which you can get from Citeseer_.

Currently, this module always uses the same function for searching and
indexing. If you want to test your own implementations, you have to
replace your index object's ``_func`` attribute after indexing and
before searching.
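For reference, the classical dynamic-programming variant with one-row
storage reads roughly like this (a sketch for illustration; the module's
own `levenshtein` may differ in details):

```python
def levenshtein(s, t):
    # Keep only one row of the DP matrix: O(min(|s|, |t|)) space.
    if len(s) < len(t):
        s, t = t, s                     # make t the shorter string
    previous = list(range(len(t) + 1))  # distances from the empty prefix
    for i, cs in enumerate(s, 1):
        current = [i]
        for j, ct in enumerate(t, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (cs != ct)))   # substitution
        previous = current
    return previous[-1]

print(levenshtein('WOOD', 'WOOL'))  # 1
```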


Contact
--------

For comments, patches, flames and hugs send mail to:
jrspieker@well-adjusted.de.

"""

__docformat__ = 'restructuredtext en'
__author__ = '$Author: jrschulz $'
__version__ = '$Revision: 119 $'
__date__ = '$Date: 2012-09-17 02:13:52 +0200 (Mon, 17 Sep 2012) $'
__license__ = "GPL"

import sys, random, weakref


class MetricTree(object):
    """

    Base class for metric space indexes which have tree-like properties.
    Its implementation is not complete enough to make it actually
    usable, but it should make it quite easy to implement other indexing
    and searching strategies. In an ideal case, all you have to do is to
    derive from this class, implement the methods `construct`,
    `_get_child_candidates` and optionally `insert`, and make `children`
    an attribute or property returning all child nodes of the current
    node. If you need an ``__init__`` method, you should call
    ``super(YourClass, self).__init__(objects, func, parent)`` at its
    end as well.

    Instances of this class (and its subclasses, of course) are supposed
    to be recursive data structures. This means that they may be treated
    like the root of a tree whether they have a parent or not.
    Accordingly, every node has:

    - a list of `_values`, with every element of that list having a
      distance of zero to all other elements in the list. Because of
      the triangle inequality, you only need to check this condition
      against one element of this list.

    - a `_parent` (`None` for root nodes)

    - a list of `children` (empty for leaf nodes)

    - a `_height` (one for trees consisting of only one node)

    - a number of nodes `_num_nodes` (the total number of nodes in this
      subtree)

    - a `_size` (the total number of objects stored in this subtree)

    Note that in many cases, implementors don't have to fiddle with
    these values directly. Instead, there are some helper methods like
    `_incr_node_counter` and `_incr_size` that may be called at
    appropriate times. If in doubt, just implement what you strictly
    need first and then take a look whether node, size and height
    counting works as expected (and whether you care).

    As a bonus, this class implements some of Python's magic methods so
    that you can iterate over its instances' content and use the ``in``
    operator to test for the existence of an object using the distance
    function.

    """

    _parent = None
    """

    The parent of this node. ``None`` for root nodes.

    """

    _values = list()
    """

    A list of values in this subtree. All objects in this list are
    considered equal by the distance function in use.

    """

    _size = 0
    """

    The number of objects in this subtree.

    """

    _func = None
    """

    The distance function used for indexing and searching. It has to
    accept any two objects from the list of objects you are indexing as
    arguments and return a non-negative integer (or long).

    """

    _height = 0
    """

    The height of this tree. One for trees consisting of only a single
    node, zero for empty trees.

    """

    _num_nodes = 0
    """

    The total number of nodes in this subtree.

    """

    def __init__(self, objects=None, func=None, parent=None):
        """

        Create a new metric tree. If ``objects`` and ``func`` are given,
        the given objects will be indexed immediately using the distance
        function, which makes it possible to immediately start searching
        for other objects in it. Otherwise, you have to call `construct`
        later in order to make use of this metric tree.

        """
        self._values = list()
        self._size = 0
        self._height = 1
        self._func = func
        self.parent = parent
        self._num_nodes = 0
        self._incr_node_counter()
        self._calculate_height()
        if objects and func:
            self.construct(objects, func)
681
683 """
684
685 Return a list of all objects in this subtree whose distance to
686 `obj` is at least `min_dist` and at most `max_dist`. `min_dist`
687 and `max_dist` have to satisfy the condition ``0 <= min_dist <=
688 max_dist``.
689
690 If this metric tree is empty, an empty list is returned,
691 regardless of whether this tree has been previously constructed
692 or not.
693
694 If calling the distance function fails for any reason,
695 `UnindexableObjectError` will be raised.
696
697 """
698 assert( 0 <= min_dist <= max_dist )
699 if not self: return list()
700 result, candidates = list(), [self]
701 while candidates:
702 node = candidates.pop()
703 distance = node._get_dist(obj)
704 if min_dist <= distance <= max_dist:
705 result.extend(node._values)
706 candidates.extend( node._get_child_candidates(
707 distance, min_dist, max_dist))
708 return result

    def search(self, obj, max_dist):
        """

        Equivalent to ``range_search(obj, min_dist=0, max_dist=max_dist)``.

        """
        return self.range_search(obj, max_dist=max_dist)

    def nn_search(self, obj, num=1):
        """

        Perform a k-nearest-neighbour search and return a sorted list of
        at most ``num`` objects together with their distance to the
        query object (``(object, distance)`` tuples). Sorting is done by
        the distance (ascending). Of course, `num` has to be an ``int``
        (or ``long``, for that matter) larger than zero.

        For obvious reasons, this method cannot return more objects than
        are currently present in the tree. That means, if ``num >
        len(self)``, only ``len(self)`` pairs of ``(object, distance)``
        will be returned.

        If this metric tree is empty, an empty list is returned,
        regardless of whether this tree has been previously constructed
        or not.

        If several objects in this tree share the same maximum distance
        to `obj`, no guarantee is given about which of these objects
        will be returned and which are left out.

        If calling the distance function fails for any reason,
        `UnindexableObjectError` will be raised.

        """
        if num < 1: return list()
        if not self: return list()
        neighbours = list()
        candidates = [self]
        def cmp_neighbours(a, b): return cmp(a[1], b[1])
        while candidates:
            cand = candidates.pop()
            distance = cand._get_dist(obj)
            candidate_is_neighbour = bool([1 for n in neighbours
                                           if distance < n[1]])
            if candidate_is_neighbour or len(neighbours) < num:
                neighbours.extend([(v, distance) for v in cand._values])
                neighbours.sort(cmp_neighbours)
                if len(neighbours) > num:
                    neighbours = neighbours[:num]
            if len(neighbours) == num:
                max_dist = neighbours[-1][1]
                if max_dist == 0:
                    break
                candidates.extend(
                    cand._get_child_candidates(distance, 0, max_dist))
            else:
                candidates.extend(cand.children)
        return neighbours

    def _get_child_candidates(self, distance, min_dist, max_dist):
        """

        Return a sequence of child nodes that may contain objects with a
        distance between (inclusive) ``min_dist`` and ``max_dist`` to a
        certain query object. Note that the query object is *not*
        passed to this method. Instead, ``distance`` is the query
        object's previously calculated distance to this node.

        """
        raise NotImplementedError()

    def construct(self, objects, func):
        """

        (Re)Index this space with the given ``objects`` using the
        distance function ``func``. Previous contents will be
        discarded. ``objects`` has to be a sequence or an iterable. The
        distance function ``func`` needs to be applicable to all objects
        contained in ``objects``.

        If calling the distance function fails for any reason,
        `UnindexableObjectError` will be raised.

        """
        raise NotImplementedError()

    def insert(self, obj):
        """

        Insert a single object into the metric tree. Returns ``self``,
        i.e. the tree itself.

        This method may not be implemented by all subclasses of
        `MetricTree` since not all data structures allow to do this
        efficiently. `NotImplementedError` will be raised when this is
        the case.

        If calling the distance function fails for any reason,
        `UnindexableObjectError` will be raised.

        """
        raise NotImplementedError()
812
    def is_root(self):
        """

        Answer whether this node is the root of a tree (i.e. it has no
        parent).

        """
        return self.parent is None

    def is_leaf(self):
        """

        Answer whether this node is a leaf node (i.e. it has no
        children).

        """
        return not self.children

    def __children(self):
        """

        A sequence of this node's children.

        The possible number of children per node is
        implementation-dependent. Leaf nodes return an empty sequence.

        """
        raise NotImplementedError()
    children = property(__children)

    def __num_nodes(self):
        """

        The number of nodes in this tree.

        This may differ from the number of objects contained in this
        tree in two cases: when two or more of these objects are
        considered equal by the distance function in use (i.e., for two
        objects ``p`` and ``q``, calling ``self._func(p, q)`` returns
        ``0``), and when the tree is empty, i.e. there is one node (the
        root node) but it does not contain any values.

        """
        return self._num_nodes
    num_nodes = property(__num_nodes)

    def __height(self):
        """

        The height of this tree.

        Empty trees have a height of ``0``, trees containing one or more
        objects have a height ``>= 1``.

        """
        return self._height
    height = property(__height)

    def __parent(self):
        """

        The parent of this node. `None` if this node is the root of a
        tree.

        """
        return self._parent

    def __set_parent(self, parent):
        """

        Set the parent of this node.

        Parent references are stored as "weak references" to avoid
        circular references which the garbage collector cannot dissolve
        by itself.

        """
        if parent is None:
            self._parent = None
        else:
            # Clear the reference when the parent is garbage-collected.
            callback = lambda proxy: self.__set_parent(None)
            self._parent = weakref.proxy(parent, callback)

    parent = property(__parent, __set_parent)

    def _incr_size(self, incr=1):
        """

        Increment the size counter for this node and all its parents
        recursively.

        Should be called whenever an object is inserted into the tree.

        """
        def f(node, incr=incr): node._size += incr
        self._apply_upwards(f)

        """

        Set this node's height to one and (if `recursive` is `True`)
        propagate this change upwards in the tree.

        """
        self._height = height = 1
        if recursive:
            node = self.parent
            while node is not None:
                height += 1
                if node._height < height:
                    node._height = height
                    node = node.parent
                else:
                    node = None

        """

        Increment the node counter for this node and all its parents
        recursively.

        Should be called whenever a new child of this node is created.

        """
        def f(node, incr=incr): node._num_nodes += incr
        self._apply_upwards(f)

    def _apply_upwards(self, func, **args):
        """

        Helper method to apply a function to this node and all its
        parents recursively. The given function must accept one node as
        the first parameter and may accept arbitrary keyword parameters
        as well.

        """
        node = self
        func(node, **args)
        while not node.is_root():
            node = node.parent
            func(node, **args)

    def _get_dist(self, obj):
        """

        Apply this node's distance function to the given object and one
        of this node's values.

        Raises `UnindexableObjectError` when distance computation fails.

        """
        global dist_ctr
        try:
            distance = self._func(self._values[0], obj)
            dist_ctr += 1
        except IndexError, e:
            sys.stderr.write("Node is empty, cannot calculate distance!\n")
            raise e
        except Exception, e:
            raise UnindexableObjectError(e, "Cannot calculate distance"
                                         " between objects %s and %s using %s"
                                         % (self._values[0], obj, self._func))
        return distance

    def __iter__(self):
        """

        A generator that yields all objects in this node and its
        children by doing recursive pre-order traversal. Implementors
        might choose to use another traversal method which better suits
        their data structure.

        Note that objects are returned in no specific order.

        This implementation will always return objects in the same order
        as long as the tree's content does not change and the
        implementation of `children` always returns the children in the
        same order. If `children` is not implemented at all,
        `NotImplementedError` will be raised.

        """
        for obj in self._values:
            yield obj
        for child in self.children:
            for obj in child:
                yield obj

    itervalues = __iter__

    def iternodes(self):
        """

        A generator that yields all nodes in this subtree by doing
        recursive pre-order traversal. Implementors might choose to use
        another traversal method which better suits their data
        structure.

        Note that nodes are returned unordered.

        This implementation will always return nodes in the same order
        as long as the tree's content does not change and the
        implementation of `children` always returns the children in the
        same order. If `children` is not implemented at all,
        `NotImplementedError` will be raised.

        """
        yield self
        for child in self.children:
            for node in child.iternodes():
                yield node

    def __nonzero__(self):
        """

        Return True if this node contains any objects.

        """
        return len(self._values) > 0

    def __len__(self):
        """

        Return the number of objects in this subtree.

        """
        return self._size

    def __contains__(self, item):
        """

        Search for objects with a distance of zero to `item` and return
        True if something is found, otherwise False.

        Note that this does **not** check for object identity! Instead,
        the definition of equality is delegated to the distance function
        in use.

        """
        return len(self.range_search(item)) > 0

        if self:
            return str(self.__class__) + ": " + str(self._values[0])
        else:
            return "<empty node>"


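The pruning rule behind `_get_child_candidates` rests on the triangle inequality: if the query ``q`` has distance ``d`` to a node's value ``p``, then every object within ``max_dist`` of ``q`` must lie at a distance between ``d - max_dist`` and ``d + max_dist`` from ``p``. A minimal standalone sketch, using absolute difference on numbers as a stand-in metric (all names below are illustrative, not part of this module):

```python
def metric(a, b):
    # Absolute difference is a metric on numbers.
    return abs(a - b)

objects = [1, 4, 7, 9, 15, 20]
pivot, query, max_dist = 7, 9, 3
d = metric(pivot, query)

# Brute-force hits within max_dist of the query ...
hits = [o for o in objects if metric(query, o) <= max_dist]
# ... all lie inside the band the triangle inequality allows around the pivot.
band = [o for o in objects if d - max_dist <= metric(pivot, o) <= d + max_dist]

assert hits == [7, 9]
assert band == [4, 7, 9]
assert set(hits) <= set(band)
```

Everything outside the band can be discarded without ever computing its distance to the query, which is where metric trees save work.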
class BKTree(MetricTree):
    """

    "Burkhard-Keller trees" are unbalanced multiway trees which may grow
    over time. They were among the first general solutions for indexing
    arbitrary metric spaces.

    Every node in a BKTree stores a list of objects which all have a
    distance of zero to each other, i.e. all of these objects are
    considered equal by the distance function in use. Additionally,
    every node keeps a dictionary of its children, each child keyed by
    its distance to the parent.

    Essentially, every node in a BKTree divides its data set ``S`` into
    subsets ``S^i`` so that every element in ``S^i`` has the distance
    ``i`` to one object arbitrarily picked from ``S`` (and stored in
    this node). For each ``S^i``, a new child node is created and its
    parent keeps a reference to this child together with its distance.

    Insertion of a single object ``o`` into a node ``n`` works like
    this:

    1. If ``n`` is empty, store ``o`` in ``n``. That's it.

    2. Otherwise, calculate the distance between ``o`` and the objects
       stored in ``n``.

    3. If the distance is zero, append ``o`` to the list of objects in
       this node and return.

    4. Otherwise, look for a child of ``n`` stored under the calculated
       distance. If there is such a child, go back to 1. with this child
       being the new ``n``.

    5. Otherwise, create a new node containing ``o`` and store it as a
       child of ``n`` under its calculated distance.

    Searching is done recursively by first calculating the distance of
    the query object to the current node and then using the triangle
    inequality to determine which children may contain further search
    results.

    Runtime complexity for the construction phase is ``O(n log(n))``,
    searching is ``O(n^a)`` where ``0 <= a <= 1``. Space requirement is
    ``O(n)``.

    This implementation is close to the original description of the
    algorithm and can only handle discrete distances. If your distance
    function returns floating point values, it will appear to work at
    indexing time but will most probably be horribly slow when
    searching.

    """

    __slots__ = [ '_parent', '_values', '_size', '_func', '_height',
                  '_num_nodes', '_children' ]

    def __init__(self, objects=None, func=None, parent=None):
        self._children = {}
        if callable(objects):
            # Materialize the factory's result and shuffle it in place;
            # the shuffled list must be kept, not discarded.
            objects = list(objects())
            random.shuffle(objects)
        super(BKTree, self).__init__(objects, func, parent)

    def construct(self, objects, func):
        self._func = func
        self._children = {}
        for o in objects:
            self.insert(o)
        return self

    def insert(self, obj):
        if not self:
            self._values.append(obj)
            self._incr_size()
            return self
        node = self
        while True:
            distance = node._get_dist(obj)
            if distance == 0:
                node._values.append(obj)
                node._incr_size()
                break
            child = node._children.get(distance, None)
            if child is None:
                child = BKTree([obj], node._func, node)
                node._children[distance] = child
                break
            node = child
        return self

    def _get_child_candidates(self, distance, min_dist, max_dist):
        assert min_dist <= max_dist
        return (child for dist, child in self._children.iteritems()
                if distance - max_dist <= dist <= distance + max_dist)

    def __children(self):
        return self._children.values()
    children = property(__children)

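The insertion and search steps from the `BKTree` docstring can be sketched with plain dictionaries. This is a hedged, self-contained illustration; `bk_insert`, `bk_search` and the integer metric are inventions for this example, not part of the module:

```python
def bk_insert(node, obj, metric):
    # A node is {'values': [...], 'children': {distance: node}}.
    if node is None:
        return {'values': [obj], 'children': {}}
    d = metric(node['values'][0], obj)
    if d == 0:                       # equal objects share a node
        node['values'].append(obj)
    elif d in node['children']:      # descend into the matching child
        bk_insert(node['children'][d], obj, metric)
    else:                            # new child keyed by its distance
        node['children'][d] = {'values': [obj], 'children': {}}
    return node

def bk_search(node, obj, max_dist, metric):
    d = metric(node['values'][0], obj)
    hits = list(node['values']) if d <= max_dist else []
    for dist, child in node['children'].items():
        # Triangle inequality: only children keyed within
        # [d - max_dist, d + max_dist] can contain results.
        if d - max_dist <= dist <= d + max_dist:
            hits.extend(bk_search(child, obj, max_dist, metric))
    return hits

metric = lambda a, b: abs(a - b)
root = None
for n in [5, 1, 9, 3, 7]:
    root = bk_insert(root, n, metric)
assert sorted(bk_search(root, 4, 1, metric)) == [3, 5]
```

The class below implements the same idea with real node objects plus bookkeeping (size, height, parent links).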
class FastBKTree(object):
    """

    Simpler version of `BKTree`. The tree's content is kept in nested
    dictionaries where a node's value is stored at key ``-1`` and its
    children are stored using their distance to that value as the key.
    Example::

        self.root = { -1: 'ABC',
                       0: { -1: 'ABC' },
                       1: { -1: 'ABCD',
                            1: { -1: 'ABCE' } },
                       2: { -1: 'A',
                            1: { -1: 'B',
                                 1: { -1: 'C' } } } }

    """

    def __init__(self, objects, metric):
        self.root = None
        self.metric = metric
        self.size = 0
        self.height = 0
        self.num_nodes = 0
        for o in objects:
            self.insert(o)

    def insert(self, obj):
        global dist_ctr
        self.size += 1
        if self.root is None:
            self.root = { -1: obj }
            return
        node = self.root
        while True:
            distance = self.metric(node[-1], obj)
            dist_ctr += 1
            if distance in node:
                node = node[distance]
            else:
                node[distance] = { -1: obj }
                break

    def range_search(self, obj, min_dist=0, max_dist=0):
        global dist_ctr
        candidates = [self.root]
        result = []
        while candidates:
            node = candidates.pop()
            distance = self.metric(node[-1], obj)
            dist_ctr += 1
            if min_dist <= distance <= max_dist:
                result.append(node[-1])
            candidates.extend(child for dist, child in node.iteritems()
                              if distance - max_dist <= dist <= distance + max_dist
                              and dist != -1)
        return result

    def search(self, obj, max_dist):
        return self.range_search(obj, min_dist=0, max_dist=max_dist)

    def nn_search(self, obj, num=1):
        if self.root is None: return list()
        global dist_ctr
        neighbours = list()
        candidates = [self.root]
        def cmp_neighbours(a, b): return cmp(a[1], b[1])
        while candidates:
            cand = candidates.pop()
            distance = self.metric(cand[-1], obj)
            dist_ctr += 1
            candidate_is_neighbour = any(distance < n[1]
                                         for n in neighbours)
            if candidate_is_neighbour or len(neighbours) < num:
                neighbours.append((cand[-1], distance))
                neighbours.sort(cmp_neighbours)
                if len(neighbours) > num:
                    neighbours = neighbours[:num]
            elif len(neighbours) == num:
                max_dist = max([n[1] for n in neighbours])
                candidates.extend(
                    child for dist, child in cand.iteritems()
                    if distance - max_dist <= dist <= distance + max_dist
                    and dist != -1)
            else:
                candidates.extend(child for dist, child in
                                  cand.iteritems() if dist != -1)
        return neighbours

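The nested-dictionary layout used by `FastBKTree` can be reproduced in a few lines. A standalone sketch; `fbk_insert` and the integer metric are made up for illustration:

```python
def fbk_insert(root, obj, metric):
    # The node's value lives at key -1; children are keyed by distance.
    if root is None:
        return {-1: obj}
    node = root
    while True:
        d = metric(node[-1], obj)
        if d in node:
            node = node[d]          # descend into the matching subtree
        else:
            node[d] = {-1: obj}     # attach a new leaf
            break
    return root

metric = lambda a, b: abs(a - b)
root = None
for n in [5, 1, 9]:
    root = fbk_insert(root, n, metric)

# 1 sits at distance 4 from 5; 9 sits at distance 8 from 1.
assert root[-1] == 5
assert root[4][-1] == 1
assert root[4][8][-1] == 9
```

Trading node objects for raw dictionaries removes attribute lookups and method-call overhead, which is the whole point of the "fast" variant.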


class VPTree(MetricTree):
    """

    .. _Jeffrey K. Uhlmann: http://www.pnylab.com/pny/papers/vptree/vptree/

    .. _Peter Yianilos: http://www.intermemory.net/pny/papers/vptree/vptree.ps

    A "Vantage Point Tree" is a static, balanced, binary tree which can
    be used to index metric spaces with a non-discrete distance
    function. It has been independently described by both `Jeffrey K.
    Uhlmann`_ (1991) and `Peter Yianilos`_ (1993).

    Construction is done by picking one object (the "vantage point")
    from the set of objects to index and then determining the distance
    of all remaining objects to the vantage point. The median distance
    is calculated and the set of remaining objects is split in two:
    objects whose distance to the vantage point is smaller than the
    median distance, and the rest (objects whose distance to the vantage
    point is larger than or equal to the median). Yianilos called this
    process "ball decomposition". For both of the resulting sets new
    child nodes, called "left" and "right", are created recursively. The
    vantage point and the previously calculated median distance are
    stored in the current node.

    When searching a node, the distance between the current node's value
    and the search object is calculated. If it is ``<= k`` (the given
    maximum distance), this node's value is added to the list of search
    results. Searching proceeds recursively into the left subtree if
    ``distance - k < median`` and into the right subtree if ``distance +
    k >= median``.

    Since the construction phase takes advantage of the median of all
    distances, VPTrees do a fairly good job of balancing their subtrees.
    However, all objects whose distance to the vantage point equals the
    median distance must end up on the same (right) side, so with
    discrete distance functions VPTrees tend to lean towards the side
    containing these objects.

    Runtime complexity for the construction phase is ``O(n log(n))``,
    searching is ``O(log(n))``. Space requirement is ``O(n)``.

    """

    __slots__ = [ '_parent', '_values', '_size', '_func', '_height',
                  '_num_nodes', '_median', '_leftchild', '_rightchild' ]

    def __init__(self, objects=None, func=None, parent=None):


        if len(objects) > 15:
            sample = objects[:5]
            max_diff = -1
            vp = None
            for o in sample:
                dists = [ self._func(other, o) for other in sample
                          if other is not o ]
                diff = max(dists) - min(dists)
                if diff > max_diff:
                    max_diff, vp = diff, o
            objects.remove(vp)
            return vp
        else:
            return objects.pop()

        """

        Perform the process called "ball decomposition" by Peter
        Yianilos.

        `objects` has to be an iterable that yields objects applicable
        to the metric function in use. The return value is a tuple of
        two lists: the first list contains all objects having a distance
        smaller than the median distance to this node's value, the
        second list contains all objects whose distance is equal to or
        larger than the median distance of all given `objects` to this
        node's value.

        """
        dist_per_obj = list()
        for obj in objects:
            distance = self._get_dist(obj)
            if distance == 0:
                self._values.append(obj)
            else:
                dist_per_obj.append( (distance, obj) )
        if dist_per_obj:
            self._median = VPTree.determine_median(zip(*dist_per_obj)[0])
            left = [ obj for dist, obj in dist_per_obj
                     if dist < self._median ]
            right = [ obj for dist, obj in dist_per_obj
                      if dist >= self._median ]
        else:
            left, right = None, None
        return left, right

    @staticmethod
    def determine_median(distances):
        return sorted(distances)[len(distances) // 2]

    def __children(self):
        return [ c for c in (self._leftchild, self._rightchild)
                 if c is not None ]
    children = property(__children)

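The "ball decomposition" step can also be sketched on its own. A hedged illustration; `ball_decompose` and the integer metric are not part of this module:

```python
def ball_decompose(objects, metric):
    # Pick the first object as vantage point, then split the rest at the
    # median distance: strictly smaller goes left, the rest goes right.
    vp, rest = objects[0], objects[1:]
    dists = sorted(metric(vp, o) for o in rest)
    median = dists[len(dists) // 2]
    left = [o for o in rest if metric(vp, o) < median]
    right = [o for o in rest if metric(vp, o) >= median]
    return vp, median, left, right

metric = lambda a, b: abs(a - b)
vp, median, left, right = ball_decompose([10, 2, 4, 6, 14, 18], metric)
assert (vp, median) == (10, 6)
assert left == [6, 14]       # distances 4 and 4 are below the median
assert right == [2, 4, 18]   # distances 8, 6 and 8 are >= the median
```

Note how the two objects at distance 6 from the vantage point both land on the right: ties with the median always go to the same side, which is the lean described in the class docstring.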
class UnindexableObjectError(Exception):
    """

    This exception is raised when the call to the distance function in
    use by a metric space throws an exception. This should not happen
    during regular usage and is a hint that either your distance
    function is buggy or you tried to index objects which your distance
    function cannot handle.

    """

    orig_exception = None
    """ The exception that has triggered this exception. """

def levenshtein(x, y):
    """

    Compute the Levenshtein (or "edit") distance between two strings.
    The Levenshtein distance between two strings ``x`` and ``y`` is
    defined as the minimal number of operations on single characters
    needed to transform ``x`` into ``y`` (or vice versa).

    Allowed operations are:

    - Insertion,
    - Deletion and
    - Replacement

    of a single character.

    Levenshtein distance has all the properties of a strictly positive
    metric. That means for all ``x``, ``y``, ``z`` it holds:

    - ``x == y <=> levenshtein(x, y) == 0``
    - ``x != y <=> levenshtein(x, y) > 0`` (positivity)
    - ``levenshtein(x, y) == levenshtein(y, x)`` (symmetry)
    - ``levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z)``
      (triangle inequality)

    The upper bound for the Levenshtein distance is the length of the
    longer string: ``max(len(x), len(y))``. A general lower bound is
    ``abs(len(x) - len(y))``; it is attained when one string is a
    prefix or suffix of the other.

    Incidentally, this implementation not only works for strings but
    for all types of sequences. Objects in the given sequences are
    compared for equality using ``==`` to determine whether an edit
    operation is needed.

    """
    if x == y:
        return 0
    m, n = len(x), len(y)
    if not (m and n):
        return m or n
    if n > m:
        x, y = y, x
        m, n = n, m
    cur = range(n+1)
    prev = [0] * (n+1)
    for i, char_x in enumerate(x):
        prev[0] = i+1
        for j, char_y in enumerate(y):
            if char_x == char_y:
                prev[j+1] = cur[j]
            else:
                prev[j+1] = min(prev[j], cur[j], cur[j+1]) + 1
        cur, prev = prev, cur
    return cur[-1]
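The metric properties and bounds claimed above are easy to spot-check. The sketch below uses its own tiny DP re-implementation (named `lev` to avoid clashing with the function defined here):

```python
def lev(x, y):
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cx != cy)))   # replacement
        prev = cur
    return prev[-1]

assert lev("kitten", "sitting") == 3
assert lev("ABC", "ABC") == 0
assert lev("ABC", "CBA") == lev("CBA", "ABC")           # symmetry
for x, y in [("abc", "abcd"), ("", "xyz"), ("flaw", "lawn")]:
    d = lev(x, y)
    # Lower and upper bounds from the docstring.
    assert abs(len(x) - len(y)) <= d <= max(len(x), len(y))
```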

def tokenizer(textfile, separator=None):
    """

    Simple generator that tokenizes sequences of strings (or file-like
    objects) into uppercase words using the optional `separator`. The
    `separator` is passed directly to ``str.split``, so ``split``'s
    behaviour concerning `None` as separator applies here as well.

    After the last token has been returned, there is an attempt to close
    `textfile`. A resulting `AttributeError` will be silently ignored.

    """
    for line in textfile:
        for token in line.split(separator):
            yield token.upper()
    try:
        textfile.close()
    except AttributeError:
        pass

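A quick usage sketch of the tokenizer's behaviour, self-contained with its own copy of the generator (named `tokenize`) so the example runs standalone:

```python
import io

def tokenize(textfile, separator=None):
    # Mirrors `tokenizer`: uppercase tokens, then try to close the source.
    for line in textfile:
        for token in line.split(separator):
            yield token.upper()
    try:
        textfile.close()
    except AttributeError:
        pass

f = io.StringIO(u"alpha beta\ngamma")
tokens = list(tokenize(f))
assert tokens == ['ALPHA', 'BETA', 'GAMMA']
assert f.closed                      # the generator closed the stream

# Plain lists of strings work too; the missing close() is silently ignored.
assert list(tokenize(["one two"])) == ['ONE', 'TWO']
```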
dist_ctr = 0

try:
    import psyco
    psyco.bind(levenshtein)
    psyco.bind(VPTree)
    psyco.bind(BKTree)
    psyco.bind(FastBKTree)
except ImportError:
    sys.stderr.write("Psyco not available, proceeding without it.\n")

if __name__ == '__main__':
    import codecs, time
    dumpfilename = "vptree.dump"
    f = codecs.open(sys.argv[1], encoding="iso-8859-1")
    names = tokenizer(f)
    t = BKTree()
    t.construct(names, levenshtein)
    f.close()
    print type(t)
    raw_input("press enter")
    print "t height: %s" % t.height
    print "t size: %s" % len(t)
    print "t nodes: %s" % t.num_nodes

    f = codecs.open(sys.argv[1], encoding="iso-8859-1")
    names = list(tokenizer(f))
    random.shuffle(names)

    start = time.time()
    for i in range(10):
        t.search(names.pop(), 1)
    end = time.time()
    print end - start