1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23 """
24
25 Metric Spaces
26 ==============
27
28 .. contents::
29
30 Introduction
31 -------------
32
33 This module provides efficient similarity search by implementing
34 different data structures for indexing metric spaces. You can use it to
35 index an arbitrary number of objects which you later want to search for
36 objects "similar" to a given other object.
37
38 For this module to be useful, you need a function with metric properties
39 (see below for a definition) which you can apply to the type of objects
40 you want to search for. The only pre-defined metric funtion in this
41 module is `levenshtein`, which is a common metric to calculate the
42 (dis)similarity of two strings.
43
44 Beware that the indexing phase might take a long time, depending on the
45 number of objects you want to index and the runtime of your metric
46 distance function. Indexing only pays off when you are planning to do a
47 lot of searches.
48
49 Definitions
50 ............
51
52 A metric space is a pair ``(S, f)`` where ``S`` is an abritrary
53 (possibly infinite) set of objects and ``f`` is a function which returns
54 a "distance" for any two objects ``x``, ``y`` in ``S``. This distance is
55 a number used to express the similarity between the two objects. The
56 larger the distance, the more they differ. A distance of zero means they
57 are considered equal.
58
59 The function ``f`` has to be a "metric", i.e. the following conditions
60 have to be met for all ``x``, ``y``, ``z`` in ``S``:
61
62 - ``f(x, y) >= 0``
63
64 (positivity)
65
66 - ``f(x, y) == f(y, x)``
67
68 (symmetry)
69
70 - ``f(x, z) <= f(x, y) + f(y, z)``
71
72 (triangle inequality)
73
74 This definition adheres to most people's intuitive understanding of the
75 distance between two points in a Euclidian space (of arbitrary
76 dimensionality). [#euclid]_ Imagine you have three points in a
77 2-dimensional space like this::
78
79 A-------B
80 \ /
81 \ /
82 \ /
83 C
84
85 It is evident that the distance between two of these points is
86 necessarily larger than (or equal to) zero, and that the direction you
87 take for measuring (from ``A`` to ``B`` or vice versa) has no influence
88 on the distance. The third condition, the triangle inequality is easy to
89 grasp, too: if you know the distances between ``A`` and ``B`` and
90 between ``B`` and ``C``, you can be sure that their sum is equal to or
91 larger than the distance between ``A`` and ``C``. Voila, there's your
92 metric space.
93
94 In many cases, metric distance functions take a long time to compute.
95 Because of this, data structures used to index metric spaces are
96 designed to minimize the number of distance computations necessary for
97 searching the whole space by using the properties of its metric function
98 (most notably the triangle inequality) to discard as many objects as
99 possible after every computation.
100
101 .. [#euclid] In fact, Euclidian spaces are one type of metric space.
102
103
104 Use cases & caveats
105 ....................
106
107 Metric distance functions can be defined for objects other than points
108 in Euclidian space, too. One very well known distance function is the
109 Levenshtein distance, which is applicable to arbitrary strings. Since it
110 is quite popular and indexing strings is one of the major use cases for
111 metric spaces, an implementation of this function is included in this
112 module. Other metric distance functions for strings are the
113 Damerau-Levenshtein variant and Hamming distance. These can be used for
114 a variety of tasks, such as spell-checking, record linkage (or
115 "deduplication"), error detection etc.
116
117 Please note that indexing a large number of objects usually takes a
118 considerable amount of time *and* memory. Indexing only pays off if you
119 are going to do *a lot* of searches in the index. If you just want to
120 search a set of objects for all objects similar to one or a few other
121 objects, you are better off comparing them sequentially by hand with
122 your distance function.
123
124 Installation
125 .............
126
127 .. _Psyco: http://psyco.sourceforge.net/
128
129 This file is a stand-alone module for Python 2.4 and later which you
130 only have to save somewhere in your ``PYTHONPATH``. [#pythonpath]_ No
131 installation is necessary. The only tight dependencies are the modules
132 ``sys``, ``random`` and ``weakref`` which are most probably already
133 included in your Python installation.
134
135 If you are using this on a 32 Bit i386 machine running Windows or Linux,
136 you probably also want to install Psyco_ which speeds up execution times
137 quite noticeably (in exchange for only a slightly increased memory
138 usage). Psyco will be used automatically if it is available.
139
140 .. [#pythonpath] Run ``python -c "import sys; print sys.path"`` to learn
141 where Python currently looks for modules.
142
143 Basic usage
144 ------------
145
146 Index creation
147 ...............
148
149 Say you have a dictionary file containing one or more words per line.
150 You can make an index of all the words in it by doing:
151
152 .. python::
153
154 >>> import mspace
155 >>> mydict = file('dictionary.txt')
156 >>> words = mspace.tokenizer(mydict)
157 >>> index = mspace.VPTree(words, mspace.levenshtein)
158
159 You can delay the time-comsuming construction phase by creating the
160 index without any parameters and explicitly calling ``construct`` at a
161 later time:
162
163 .. python::
164
165 >>> index = mspace.VPTree()
166 >>> index.construct(words, mspace.levenshtein)
167
168 Please note that the index' content is dismissed on every call to
169 ``construct``. The index is built from scratch with only the objects and
170 distance function passed as parameters.
171
172 Performing range-searches
173 ..........................
174
175 After you have built the index (and got yourself some coffee if the
176 index is large) you may search it for all objects having distance to a
177 given other object between arbitrary bounds:
178
179 .. python::
180
181 >>> index.range_search('WOOD', min_dist=0, max_dist=1)
182 ['GOOD', 'WOOD', 'WOOL', 'MOOD']
183
184 In this case, you received four results: the object you searched for
185 (apparently it has been in your dictionary, too) and two other objects
186 "GOOD" and "WOOL", both of which have the requested maximum distance of
187 one to your query object. Result lists are unordered and do not contain
188 information about their contents' distance to the query object. If you
189 need this, you have to calculate it by hand.
190
191 Both ``min_dist`` and ``max_dist`` are optional parameters defaulting to
192 zero. Thus, if you leave them out, a search for objects which are
193 exactly equal (as defined by your distance function) to the query object
194 is performed. For historical reasons, and since range-searching with a
195 minimum distance of zero and a maximum greater than zero is quite
196 common, there is a shortcut to search with ``min_dist=0``:
197
198 .. python::
199
200 >>> index.search('WOOD', 1)
201 ['GOOD', 'WOOD', 'WOOL', 'MOOD']
202
203
204 Performing nearest-neighbour-searches
205 ......................................
206
207 The second type of search you can perform is "k-nearest-neighbour"
208 search. It returns a given number of objects from the index that are
209 closest to the query object. Search results are guaranteed to never
210 contain an object with a distance to the query object that is larger
211 than that of any other object in the tree.
212
213 Result lists are composed of ``(object, distance)`` tuples, sorted
214 ascendingly by the distance. If you don't specify a maximum number of
215 matches to return, it defaults to one:
216
217 .. python::
218
219 >>> index.nn_search('WOOD')
220 [('WOOD', 0)]
221 >>> index.nn_search('WOOD', 5)
222 [('WOOD', 0), ('WOOL', 1), ('GOOD', 1), ('MOOD', 1), ('FOOT', 2)]
223
224 Note that the index may contain other objects with a distance of two to
225 the query object ``'WOOD'`` (e.g. ``'FOOL'``). They have just been cut
226 off because a maximum of five objects has been requested. You have no
227 influence on the choice of representatives that is made.
228
229 Note that you must not expect this method to always return a list of
230 length ``num`` because you might ask for more objects than are currently
231 in the index.
232
233
234 Advanced usage
235 ---------------
236
237 Indexing complex objects
238 .........................
239
240 If you have "complex" objects which you want to index by only one
241 specific attribute, you can write a simple wrapper around `levenshtein`
242 (or some other function applicable to the attribute) which extracts this
243 attribute from your objects and returns the distance between their
244 attributes' value. This way you can search for and retrieve complex
245 objects from the index while comparing only a single attribute
246
247 .. python::
248
249 >>> import mspace
250 >>>
251 >>> class Record(object):
252 ... def __init__(self, firstname, surname):
253 ... self._firstname, self._surname = firstname, surname
254 ...
255 >>> def firstnameLev(r1, r2):
256 ... return mspace.levenshtein(r1._firstname, r2._firstname)
257 ...
258 >>> index = mspace.VPTree(listOfRecords, firstnameLev)
259 >>> rec = Record("Paul", "Brady")
260 >>> index.search(rec, 2)
261 [<Record: 'Paula Bean'>, <Record: 'Raoul Perky'>, <Record: 'Paul Simon'>]
262
263 Of course you can also use more advanced techniques like writing a
264 function factory which returns a function that extracts arbitrary
265 attributes from your objects and applies a metric function to it:
266
267 .. python::
268
269 >>> def metric_using_attr(metric, attr):
270 ... def f(one, other, attr=attr):
271 ... attr1, attr2 = getattr(one, attr), getattr(other, attr)
272 ... return metric(attr1, attr2)
273 ... return f
274 ...
275 >>> metric = metric_using_attr(mspace.levenshtein, "_firstname")
276 >>> metric( Record("Paul", "Brady"), Record("Paul", "Simon") )
277 0
278
279 (Note that the inner function ``f`` in this example deviates from the
280 standard protocol by accepting an optional third parameter. This is
281 done here only to pull the name ``attr`` into the inner function's
282 namespace to save some time when looking up its associated value. No
283 index structure in this module will ever call its distance function with
284 more than two parameters anyway, so that whatever you passed as ``attr``
285 to ``metric_using_attr`` when creating your function will be used when
286 this function gets called by an index. Got it?)
287
288 Reverse indexes
289 ................
290
291 Of course you can use a similar technique to avoid full indexes and
292 instead create an index which only contains references to your data (in
293 a database, text file, somewhere on the net etc.). Your distance
294 function would then use the supplied values to fetch the objects to
295 compare from your data source. But, I cannot stress this enough, your
296 distance function's performance is absolutely crucial to efficient
297 indexing and searching. Therefore you should make sure that your way of
298 accessing the data is really fast.
299
300 Choice of data structure
301 -------------------------
302
303 This module currently gives you two data structures to choose from.
304 While the underlying algorithms are different, their search results
305 (given the same set of objects to index, distance function and maximum
306 distance) are exactly the same. If you find a scenario where this is not
307 the case, please let me know because this would be a serious bug.
308
309 Capabilities
310 .............
311
312 The algorithms implemented in this module have different capabilities
313 which are displayed in the table below:
314
315 +--------------------+-------+-------+
316 | |VPTree |BKTree |
317 +====================+=======+=======+
318 |non-discrete metric | X | |
319 +--------------------+-------+-------+
320 |insertion | | X |
321 +--------------------+-------+-------+
322 |deletion | | |
323 +--------------------+-------+-------+
324
325 The first row is about the value returned by the distance function the
326 index uses. BKTrees are not prepared to handle non-discrete distances
327 (for example floats) and will most probably perform really bad when they
328 occur.
329
330 The other two rows describe what you can do with the indexes *after*
331 they have been constructed. Please note that neither of them may shrink
332 over time. Only BKTrees are able to handle insertion efficiently.
333
334 Performance
335 ............
336
337 [In short: you are probably better off using BKTrees unless you need a
338 non-discrete metric. When in doubt, test both options.]
339
340 Obviously, the whole point of this module is to speed up the process of
341 finding objects similar to one another. And, as you can imagine, the
342 different algorithms perform differently when exposed to a specific
343 problem.
344
345 Here's an example: I took the file ``american-english-large`` from the
346 Debian package ``wamerican-large`` and did some time measurements of
347 indexing and searching a random subset of it. The machine I used is an
348 1.1GHz Pentium3 running Python 2.4 from Debian stable **with Psyco
349 enabled**. Please note that the following numbers have been determined
350 completely unscientifical. I only show them to give you an idea of what
351 to expect. For serious benchmarking, I should have used the ``timeit``
352 module. Nevertheless, this is it:
353
354
355 +------+------------------------+------------------------+
356 | |BKTree |VPTree |
357 +======+======+=========+=======+=======+========+=======+
358 |size |time |per node |height |time |per node|height |
359 +------+------+---------+-------+-------+--------+-------+
360 | 5000 | 2.92|0.000584 |11 | 7.40|0.001480| 21|
361 +------+------+---------+-------+-------+--------+-------+
362 |10000 | 6.32|0.000632 |14 | 16.02|0.001602| 22|
363 +------+------+---------+-------+-------+--------+-------+
364 |15000 | 9.95|0.000663 |14 | 28.35|0.001890| 24|
365 +------+------+---------+-------+-------+--------+-------+
366 |20000 | 13.70|0.000685 |14 | 41.40|0.002070| 24|
367 +------+------+---------+-------+-------+--------+-------+
368 |25000 | 17.46|0.000699 |15 | 50.63|0.002025| 25|
369 +------+------+---------+-------+-------+--------+-------+
370 |30000 | 21.81|0.000727 |15 | 55.47|0.001849| 25|
371 +------+------+---------+-------+-------+--------+-------+
372 |35000 | 25.77|0.000736 |16 | 64.43|0.001841| 26|
373 +------+------+---------+-------+-------+--------+-------+
374 |40000 | 29.40|0.000735 |16 | 75.45|0.001886| 26|
375 +------+------+---------+-------+-------+--------+-------+
376 |45000 | 41.28|0.000917 |16 | 96.36|0.002141| 26|
377 +------+------+---------+-------+-------+--------+-------+
378 |50000 | 37.62|0.000752 |16 | 95.31|0.001906| 28|
379 +------+------+---------+-------+-------+--------+-------+
380
381
382 This table shows construction times (total and per node in seconds) for
383 data sets of a given size. Additionally, you can see the height
384 [#height]_ of the trees. Apparently, BKTrees can be constructed a lot
385 faster than VPTrees. Both of them need an increasing time per node with
386 growing data sets. This is expected since construction complexity is
387 ``O(n log(n))`` in both cases.
388
389 +---------------+-----------------+-----------------+
390 | |BK, size: 50,000 |VP, size: 50,000 |
391 +------+--------+-------+---------+-------+---------+
392 |k |results | time | # dist | time | # dist |
393 +------+--------+-------+---------+-------+---------+
394 |0 | 0.89 |0.0008 | 8.14 | 0.0019| 17.28|
395 +------+--------+-------+---------+-------+---------+
396 |1 | 4.07 |0.1670 | 1583.77 | 0.2801| 1933.64|
397 +------+--------+-------+---------+-------+---------+
398 |2 | 38.10 |1.1687 |10353.31 | 1.4413| 13845.67|
399 +------+--------+-------+---------+-------+---------+
400 |3 | 304.57 |2.5614 |22202.28 | 3.0497| 27514.51|
401 +------+--------+-------+---------+-------+---------+
402 |4 |1584.86 |3.8727 |32376.54 | 4.1518| 36877.62|
403 +------+--------+-------+---------+-------+---------+
404 |5 |5317.03 |4.4182 |39616.04 | 4.9935| 42720.38|
405 +------+--------+-------+---------+-------+---------+
406
407 This table shows the runtime of searches done in the trees (in seconds),
408 the number of distance calculations done and the number of results for
409 growing error tolerance. All values are given as average over 100 random
410 searches (from the same file, but not necessarily from the set of
411 indexed objects). As you can see, search runtime (tightly connected to
412 the number of distance calculations) literally explodes for larger
413 values of ``k``. This growth only fades when an increased part of the
414 complete search space is examined (the number of distance calculations
415 equals the number of nodes compared with the query object).
416
417 As you can see, too, long search runtimes for large values of ``k``
418 don't actually hurt usability very much since you only get a usable
419 number of results for small ``k`` anyway. This is of course due to the
420 nature of the data set and the distance function used in this example.
421 Your application may vary greatly.
422
423 You also have to keep in mind that the dictionary I used contains almost
424 no duplicates. If you used the words of a real document as your data
425 set, your tree would have significantly less nodes than the number of
426 words in your document since you typically repeat a lot of words very
427 often. A quick test with my diploma thesis revealed only 4400 distinct
428 words in a document with 14,000 words (including LaTeX commands). This
429 makes searching much faster because equal objects (as defined by the
430 distance function) are only evaluated once.
431
432 .. [#height] If you look closely, you can see that the VPTree's height
433 doesn't grow continuously when the data set grows. The reason for
434 that phenomenon is that this implementation does not try very hard to
435 pick a good vantage point which is the key factor to get
436 well-balanced trees. So, in some cases, a vantage point may be chosen
437 that may result in trees which are higher than strictly necessary.
438
439 Optimization potential (or: Why the f*ck is this so slow!?)
440 ------------------------------------------------------------
441
442 Ok, you have tried to index your small dictionary file and start to
443 wonder why this easy task takes several minutes. Here are a couple of
444 (possible) reasons:
445
446 - Indexing takes ``O(n log(n))`` for all data structures currently
447 implemented. So yes, doubling the number of indexed objects will
448 necessarily more than double your runtime for indexing, sorry.
449
450 - Python is slow. I have written an implementation very similar to this
451 one in Java which is significantly faster (by a factor of about 15 to
452 25, even when using Psyco_!). But the java implementation eats more
453 memory and unfortunately I cannot release it under a free license.
454
455 - Recursion is slow in Python. Recursion is the most natural way to
456 create and search trees and this code uses it a lot. Most of the
457 recursion in this module is "tail recursion", but the Python
458 interpreter is not able to optimize it into loops.
459
460 [Note: as of revision 60, searching in both trees and inserting into
461 BKTrees has been rewritten using loops instead of recursion.
462 Perfomance gain is quite small, though.]
463
464 - The distance distribution in your metric space is too dense. This
465 leads to the data structures being unable to discard large parts of
466 the indexed space while searching. In pathological cases, you may end
467 up with data structures resembling linked lists.
468
469 - Your distance function is non-discrete, i.e. it returns floats. How
470 well this is supported depends on the index structure in use.
471
472 - Your metric function is very expensive. Remember that this function
473 has to be called *very often* during indexing. You may use this
474 module's attribute ``dist_ctr`` to get an idea how often this is done.
475 It is incremented by one on every call to your distance function and
476 is otherwise unused.
477
478 - You are searching for non-similar objects. If you, for example, have
479 an index of strings with an average length of six characters and you
480 are continuously searching for strings with a maximum distance of
481 three or more, do not expect the search to be significantly faster
482 than linear testing of the complete search space. It may even be
483 slower.
484
485 Possible solutions include:
486
487 - Rewrite in C. Since I cannot be bothered to learn C, someone else
488 would have to do this.
489
490 - Use Pyrex or Psyco_. Quick tests with Psyco suggest that indexing
491 becomes about 3-4 times faster. This is as easy as doing an ``import
492 psyco; psyco.full()`` in your program. Read Psyco's manual for tuning
493 options.
494
495 [Note: as of about revision 46, on module loading an attempt is made
496 to import and enable Psyco for levenshtein and the tree classes. If it
497 doesn't succeed, you'll get a performance warning on ``stderr`` but
498 everything will continue to work flawlessly.]
499
500 - Pickle trees for static data sets. Pickling and unpickling indexes
501 using Python's standard ``pickle`` module is quite fast. But beware
502 that a pickled index takes a lot more space than your plain data.
503
504 - Rewrite tree construction and searching using loops. I am not sure
505 whether this will be faster, but it will probably take less memory
506 than the current approach and fix the problem with Python's recursion
507 limit. I fear that code readibility will suffer, though. The current
508 implementations are quite close to the original descriptions of the
509 algorithms.
510
511 [Note: partially done, see above.]
512
513 - Optimize the distance function in use. Since I do not know your
514 distance function, I cannot help you with this.
515
516 As for `levenshtein`: the algorithm implemented in this module is the
517 classical dynamic programming variant with ``O(|n|*|m|)`` time
518 complexity (and ``O(min(|n|, |m|))`` for space). Other algorithms for
519 Levenshtein try not to compute the whole matrix but only the values
520 around the diagonal of the matrix that are strictly necessary. By
521 doing this, the average ``O(|n|*|m|)`` of the classical dynamic
522 programming algorithm could be improved to something like
523 ``O(k*|n|)``. While it isn't obvious how much wall clock time this
524 approarch saves in reality, it may be in fact a whole lot faster
525 because ``k`` is usually only a fraction of the length of the input
526 strings.
527
528 And if you take a close look at the algorithms, you may realize that
529 searching may be sped up by using an algorithm that doesn't actually
530 compute the distance between two strings, but one that only answers
531 whether two strings have a distance less than or equal to a specified
532 maximum distance. This can be done in ``O(k^2)`` and should boost
533 search performance quite noticeably.
534
535 .. _Citeseer: http://citeseer.ist.psu.edu/navarro99guided.html
536
537
538 If you are interested in any of these optimizations, I suggest reading
539 Gonzalo Navarro's excellent survey "Guided Tour To Approximate String
540 Matching" which you can get from Citeseer_.
541
542 Currently, this module always uses the same function for searching and
543 indexing. If you want to test your own implementations, you have to
544 replace your index object's ``_func`` attribute after indexing and
545 before searching.
546
547
548 Contact
549 --------
550
551 For comments, patches, flames and hugs send mail to:
552 jrschulz@well-adjusted.de.
553
554 """
555
556 __docformat__ = 'restructuredtext en'
557 __author__ = '$Author: jrschulz $'
558 __version__ = '$Revision: 117 $'
559 __date__ = '$Date: 2008-07-23 23:01:46 +0200 (Wed, 23 Jul 2008) $'
560 __license__ = "GPL"
561
562 import sys, random, weakref
565 """
566
567 Base class for metric space indexes which have tree-like properties.
568 It's implementation is not complete enough to make it actually
569 usable, but it should make it quite easy to implement other indexing
570 and searching strategies. In an ideal case, all you have to do is to
571 derive from this class, implement the methods `construct`,
572 `_get_child_candidates` and optionally `insert` and make `children`
573 an attribute or property returning all child nodes of the current
574 node. If you need an ``__init__`` method, you should call
575 ``super(YourClass, self).__init__(objects, func, parent)`` in its
576 end as well.
577
578 Instances of this class (and its subclasses, of course) are supposed
579 to be recursive data structures. This means that they may be treated
580 like the root of a tree whether they have a parent or not. This
581 means every node has:
582
583 - a list of `_values` with every element of that list having a
584 distance of zero to all other elements in the list. Because of
585 triangle inequality, you only need to check this condition against
586 one element of this list.
587
588 - a `_parent` (`None` for root nodes)
589
590 - a list of `children` (empty list for leaf nodes)
591
592 - a `_height` (one for trees consisting of only one node)
593
594 - a number of nodes `_num_nodes` (the total number of nodes in this
595 subtree)
596
597 - a `_size` (the total number of objects stored in this subtree)
598
599 Note that in many (most? some?) cases, implementors don't have to
600 fiddle with these values directly. Instead, there are some helper
601 methods like `_incr_node_counter` and `_incr_size` that may be
602 called at appropriate times. If in doubt, just implement what you
603 strictly need first and then take a look whether node, size and
604 height counting works as expected (and whether you care).
605
606 As a bonus, this class implements some of python's magic methods so
607 that you can iterate over its instances' content and use the ``in``
608 operator to test for the existence of an object using the distance
609 function.
610
611 """
612
613 _parent = None
614 """
615
616 The parent of this node. ``None`` for root nodes.
617
618 """
619
620 _values = list()
621 """
622
623 A list of values in this subtree. All objects in this list are
624 considered equal by the distance function in use.
625
626 """
627
628 _size = 0
629 """
630
631 The number of objects in this subtree.
632
633 """
634
635 _func = 0
636 """
637
638 The distance function used for indexing and searching. It has to
639 accept any two objects from the list of objects you are indexing as
640 arguments and return a non-negative integer (or long).
641
642 """
643
644 _height = 0
645 """
646 """
647
648 _num_nodes = 0
649 """
650 """
651
652 - def __init__(self, objects=None, func=None, parent=None):
653 """
654
655 Create a new metric tree. If ``objects`` and ``func`` are given,
656 the given objects will be indexed immediately using the distance
657 function which makes it possible to immediately start so search
658 for other objects in it. Otherwise, you have to call `construct`
659 later in order to make use of this metric tree.
660
661 """
662 self._values = list()
663 self._size = 0
664 self._height = 1
665 self._func = func
666 self.parent = parent
667 self._num_nodes = 0
668 self._incr_node_counter()
669 self._calculate_height()
670 if objects and func:
671 self.construct(objects, func)
672
674 """
675
676 Return a list of all objects in this subtree whose distance to
677 `obj` is at least `min_dist` and at most `max_dist`. `min_dist`
678 and `max_dist` have to satisfy the condition ``0 <= min_dist <=
679 max_dist``.
680
681 If this metric tree is empty, an empty list is returned,
682 regardless of whether this tree has been previously constructed
683 or not.
684
685 If calling the distance function fails for any reason,
686 `UnindexableObjectError` will be raised.
687
688 """
689 assert( 0 <= min_dist <= max_dist )
690 if not self: return list()
691 result, candidates = list(), [self]
692 while candidates:
693 node = candidates.pop()
694 distance = node._get_dist(obj)
695 if min_dist <= distance <= max_dist:
696 result.extend(node._values)
697 candidates.extend( node._get_child_candidates(
698 distance, min_dist, max_dist))
699 return result
700
701 - def search(self, obj, max_dist):
702 """
703
704 Equivalent to range_search(obj, min_dist=0, max_dist).
705
706 """
707 return self.range_search(obj, max_dist=max_dist)
708
710 """
711
712 Perform a k-nearest-neighbour search and return a sorted list of
713 at most ``num`` objects together with their distance to the
714 query object (``(object, distance)`` tuples). Sorting is done by
715 the distance (ascending). Of course, `num` has to be an ``int``
716 (or ``long``, for that matter) larger than zero.
717
718 For obvious reasons, this method cannot return more objects than
719 are currently present in the tree. That means, if ``num >
720 len(self)``, only ``len(self)`` pairs of ```(object, distance)``
721 will be returned.
722
723 If this metric tree is empty, an empty list is returned,
724 regardless of whether this tree has been previously constructed
725 or not.
726
727 If several objects in this tree share the same maximum distance
728 to `obj`, no guarantee is given about which of these objects
729 will be returned and which are left out.
730
731 If calling the distance function fails for any reason,
732 `UnindexableObjectError` will be raised.
733
734 """
735 if num < 1: return list()
736 if not self: return list()
737 neighbours = list()
738 candidates = [self]
739 def cmp_neighbours(a, b): return cmp(a[1], b[1])
740 while candidates:
741 cand = candidates.pop()
742 distance = cand._get_dist(obj)
743 candidate_is_neighbour = bool([1 for n in neighbours
744 if distance < n[1]])
745 if candidate_is_neighbour or len(neighbours) < num:
746 neighbours.extend([(v, distance) for v in cand._values])
747 neighbours.sort(cmp_neighbours)
748 if len(neighbours) > num:
749 neighbours = neighbours[:num]
750 elif len(neighbours) == num:
751 max_dist = neighbours[-1][1]
752 if max_dist == 0:
753 break
754 candidates.extend(
755 node._get_child_candidates( distance, 0, max_dist))
756 else:
757 candidates.extend(node.children)
758 return neighbours
759
761 """
762
763 Return a sequence of child nodes that may contain objects with a
764 distance difference between (inclusive) ``min`` and ``max`` to a
765 certain query object. Note that the query object is *not*
766 passed to this method. Instead, ``distance`` is the query
767 object's previously calculated distance to this node.
768
769 """
770 raise NotImplementedError()
771
773 """
774
775 (Re)Index this space with the given ``objects`` using the
776 distance function ``func``. Previous contents will be
777 discarded. ``objects`` has to be a sequence or an iterable. The
778 distance function ``func`` needs to be applicable to all objects
779 contained in ``objects``.
780
781 If calling the distance function fails for any reason,
782 `UnindexableObjectError` will be raised.
783
784 """
785 raise NotImplementedError()
786
788 """
789
790 Insert a single object into the metric tree. Returns ``self``,
791 i.e. the tree itself.
792
793 This method may not be implemented by all subclasses of
794 `MetricTree` since not all data structures allow to do this
795 efficiently. `NotImplementedError` will be raised when this is
796 the case.
797
798 If calling the distance function fails for any reason,
799 `UnindexableObjectError` will be raised.
800
801 """
802 raise NotImplementedError()
803
805 """
806
807 Answer whether this node is the root of a tree (i.e. it has no
808 parent).
809
810 """
811 return self.parent is None
812
814 """
815
816 Answer whether this node is a leaf node (i.e. it has no
817 children)
818
819 """
820 return not self.children
821
823 """
824
825 A sequence of this node's children.
826
827 The possible number of children per node is
828 implementation-dependent. Leaf nodes return an empty sequence.
829
830 """
831 raise NotImplementedError()
832 children = property(__children)
833
835 """
836
837 The number of nodes in this tree.
838
839 This may be different from the number of objects contained in
840 this tree in cases where two or more of these objects are
841 considered equal by the distance function in use (i.e., for two
842 objects ``p`` and ``q``, calling ``self._func(p, q)`` returns
843 ``0`` and when the tree is empty, i.e. there is one node (the
844 root node) but it doesn't contain any values.
845
846 """
847 return self._num_nodes
848 num_nodes = property(__num_nodes)
849
850
852 """
853
854 The height of this tree.
855
856 Empty trees have a height of ``0``, trees containing one or more
857 objects have a height ``>= 1``.
858
859 """
860 return self._height
861 height = property(__height)
862
864 """
865
866 The parent of this node. `None` if this node is the root
867 of a tree.
868
869 """
870 return self._parent
871
873 """
874
875 Set the parent of this node.
876
877 Parent references are stored as "weak references" to avoid
878 circular references which the garbage collector cannot dissolve
879 by itself.
880
881 """
882 if parent is None:
883 self._parent = None
884 else:
885
886
887
888 callback = lambda proxy: self.__set_parent(None)
889 self._parent = weakref.proxy(parent, callback)
890
891 parent = property(__parent, __set_parent)
892
894 """
895
896 Increment the size counter for this node and all its parents
897 recursively.
898
899 Should be called whenever an object is inserted into the tree.
900
901 """
902 def f(node, incr=incr): node._size += incr
903 self._apply_upwards(f)
904
906 """
907
908 Set this node's height to one and (if `recursive` is `True`)
909 propagate this change upwards in the tree.
910
911 """
912 self._height = height = 1
913 if recursive:
914 node = self.parent
915 while node is not None:
916 height += 1
917 if node._height < height:
918 node._height = height
919 node = node.parent
920 else:
921 node = None
922
924 """
925
926 Increment the node counter for this node and all its parents
927 recursively.
928
929 Should be called whenever a new child of this node is created.
930
931 """
932 def f(node, incr=incr): node._num_nodes += incr
933 self._apply_upwards(f)
934
936 """
937
938 Helper method to apply a function to this node and all its
939 parents recursively. The given function must accept one node as
940 the first parameter and may accept arbitrary keyword parameters
941 as well.
942
943 """
944 node = self
945 func(node, **args)
946 while not node.is_root():
947 node = node.parent
948 func(node, **args)
949
951 """
952
953 Apply this node's distance function to the given object and one
954 of this node's values.
955
956 Raises `UnindexableObjectError` when distance computation fails.
957
958 """
959 global dist_ctr
960 try:
961 distance = self._func(self._values[0], obj)
962 dist_ctr += 1
963 except IndexError, e:
964 sys.stderr.write("Node is empty, cannot calculate distance!\n")
965 raise e
966 except Exception, e:
967 raise UnindexableObjectError(e, "Cannot calculate distance"
968 + " between objects %s and %s using %s" \
969 % (self._values[0], obj, self._func))
970 return distance
971
973 """
974
975 A generator that yields all objects in this node and its
976 children by doing recursive pre-order traversal. Implementors
977 might choose to use another traversal method which better suits
978 their data structure.
979
980 Note that objects are returned in no specific order.
981
982 This implementation will always return objects in the same order
983 as long as the tree's content does not change and the
984 implementation of `children` always returns the children in the
985 same order. If `children` is not implemented at all,
986 `NotImplementedError` will be raised.
987
988 """
989 for obj in self._values:
990 yield obj
991 for child in self.children:
992 for obj in child:
993 yield obj
994
995 itervalues = __iter__
996
998 """
999
1000 A generator that yields all nodes in this subtree by doing
1001 recursive pre-order traversal. Implementors might choose to use
1002 another traversal method which better suits their data
1003 structure.
1004
1005 Note that objects are returned unordered.
1006
1007 This implementation will always return nodes in the same order
1008 as long as the tree's content does not change and the
1009 implementation of `children` always returns the children in the
1010 same order. If `children` is not implemented at all,
1011 `NotImplementedError` will be raised.
1012
1013 """
1014 yield self
1015 for child in self.children:
1016 for node in child.iternodes():
1017 yield node
1018
1020 """
1021
1022 Return True if this node contains any objects.
1023
1024 """
1025 return len(self._values) > 0
1026
1028 """
1029
1030 Return the number of objects in this subtree
1031
1032 """
1033 return self._size
1034
1036 """
1037
1038 Search for objects with a distance of zero to `item` and return
1039 True if something is found, otherwise False.
1040
1041 Note that this does **not** check for object identity! Instead,
1042 the definition of equality is delegated to the distance function
1043 in use.
1044
1045 """
1046 return len(self.range_search(item)) > 0
1047
1049 if self:
1050 return str(self.__class__) + ": " + str(self._values[0])
1051 else:
1052 return "<empty node>"
1053
1054
1055 -class BKTree(MetricTree):
1056 """
1057
1058 "Burkhard-Keller Trees" are unbalanced multiway trees and may grow
1059 over time. They belong to the first general solutions to index
1060 arbitrary metric spaces.
1061
1062 Every node in a BKTree stores a list of objects which all have a
1063 distance of zero to each other, i.e. all objects are considered to
1064 be equal by the distance function in use. Additionally, every node
1065 keeps a dictionary. In this dictionary, every child is stored,
1066 referenceable by its distance to its parent.
1067
1068 Essentially, every node in a BKTree divides its data set ``S`` into
1069 subsets ``S^i`` so that every element in ``S^i`` has the distance
1070 ``i`` to one object arbitrarily picked from ``S`` (and stored in
1071 this node). For each ``S^i``, a new child node is created and its
1072 parent keeps a reference to this child together with its distance.
1073
1074 Insertion of a single object ``o`` in a node ``n`` is quite easy:
1075
1076 1. If ``n`` is empty, store ``o`` in ``n``. That's it.
1077
1078 2. Otherwise, calculate its distance to ``o``.
1079
1080 3. If the distance is zero, append this object to the list of
1081 objects in this node and return.
1082
1083 4. Otherwise, look for a child of ``n`` with the calculated distance
1084 to ``n``. If there is such a child, go back to 1. with this child
1085 being the new ``n``.
1086
1087 5. Otherwise, create a new node containing ``o`` and store it as a
1088 child of ``n`` with it's calculated distance.
1089
1090 Searching is done recursively by first calculating the distance of
1091 the query object to the current node and then using the triangle
1092 inequality to determine which children may contain other search
1093 results.
1094
1095 Runtime complexity for the construction phase is ``O(n log(n))``,
1096 searching is ``O(n^a)`` where ``0 <= a <= 1``. Space requirement is
1097 ``O(n)``.
1098
1099 This implementation is close to the original description of the
1100 algorithm and can only handle discrete distances. If your distance
1101 function returns floating point values, it will appear to work at
1102 indexing time but will most probably be horribly slow when
1103 searching.
1104
1105 """
1106
1107 __slots__ = [ '_parent', '_values', '_size', '_func', '_height',
1108 '_num_nodes', '_children' ]
1109
1110 - def __init__(self, objects=None, func=None, parent=None):
1111 self._children = {}
1112 if callable(objects):
1113 random.shuffle(list(objects()))
1114 super(BKTree, self).__init__(objects, func, parent)
1115
1117 self._func = func
1118 self._children = {}
1119 for o in objects:
1120 self.insert(o)
1121 return self
1122
1124 if not self:
1125 self._values.append(obj)
1126 self._incr_size()
1127 return self
1128 node = self
1129 while True:
1130 distance = node._get_dist(obj)
1131 if distance == 0:
1132 node._values.append(obj)
1133 node._incr_size()
1134 break
1135 child = node._children.get(distance, None)
1136 if child is None:
1137 child = BKTree([obj], node._func, node)
1138 node._children[distance] = child
1139 break
1140 node = child
1141 return self
1142
1144 assert( min_dist <= max_dist )
1145 return (child for dist, child in self._children.iteritems()
1146 if distance - max_dist <= dist <= distance + max_dist)
1147
1149 children = self._children
1150 sorted_dists = sorted(children.keys())
1151 return [children[dist] for dist in sorted_dists]
1152 children = property(__children)
1153
1155 """
1156
1157 Simpler version of BKTree. The tree's content is kept in nested
1158 dictionaries where a "node"'s value is simply stored at key -1 and
1159 its children are stored using their distance to the value as the
1160 key. Example:
1161
1162 .. python::
1163
1164 self.root = { -1: 'ABC',
1165 0: { -1: 'ABC' }
1166 1: { -1: 'ABCD',
1167 1: { -1: 'ABCE' }
1168 }
1169 2: { -1: 'A',
1170 1: { -1: 'B',
1171 1: { -1: 'C' }
1172 }
1173 }
1174 }
1175
1176 """
1177
1178
1180 self.root = None
1181 self.metric = metric
1182 self.size = 0
1183 self.height = 0
1184 self.num_nodes = 0
1185 for o in objects:
1186 self.insert(o)
1187
1189 global dist_ctr
1190
1191
1192 self.size += 1
1193 if self.root is None:
1194 self.root = { -1: obj }
1195 return
1196 node = self.root
1197 while True:
1198 distance = self.metric(node[-1], obj)
1199 dist_ctr += 1
1200 if distance in node:
1201 node = node[distance]
1202 else:
1203 node[distance] = { -1: obj }
1204 break
1205
1207 global dist_ctr
1208 candidates = [self.root]
1209 result = []
1210 while candidates:
1211 node = candidates.pop()
1212 distance = self.metric(node[-1], obj)
1213 dist_ctr += 1
1214 if min_dist <= distance <= max_dist:
1215 result.append(node[-1])
1216 candidates.extend( child for dist, child in node.iteritems()
1217 if distance - max_dist <= dist <= distance + max_dist
1218 and dist != -1)
1219 return result
1220
1221 - def search(self, obj, max_dist):
1222 return self.range_search(obj, min_dist=0, max_dist=max_dist)
1223
1225 if not self: return list()
1226 global dist_ctr
1227 neighbours = list()
1228 candidates = [self.root]
1229 def cmp_neighbours(a, b): return cmp(a[1], b[1])
1230 while candidates:
1231 cand = candidates.pop()
1232 distance =self.metric(node[-1], obj)
1233 dist_ctr += 1
1234 candidate_is_neighbour = any(1 for n in neighbours
1235 if distance < n[1])
1236 if candidate_is_neighbour or len(neighbours) < num:
1237 neighbours.append(cand[-1])
1238 neighbours.sort(cmp_neighbours)
1239 if len(neighbours) > num:
1240 neighbours = neighbours[:num]
1241 elif len(neighbours) == num:
1242 max_dist = max([n[1] for n in neighbours])
1243 candidates.extend(
1244 child for dist, child in node.iteritems()
1245 if distance - max_dist <= dist <= distance + max_dist
1246 and dist != -1)
1247 else:
1248 candidates.extend(child for child, dist in
1249 node.iteritems() if child != 1)
1250 return neighbours
1251
1252
1255
1256
1257 -class VPTree(MetricTree):
1258
1259 """
1260
1261 .. _Jeffrey K. Uhlmann: http://www.pnylab.com/pny/papers/vptree/vptree/
1262
1263 .. _Peter Yianilos: http://www.intermemory.net/pny/papers/vptree/vptree.ps
1264
1265 A "Vantage Point Tree" is a static, balanced, binary tree which can
1266 be used to index metric spaces with a non-discrete distance
1267 function. It has been independently described by both `Jeffrey K.
1268 Uhlmann`_ (1991) and `Peter Yianilos`_ (1993).
1269
1270 Construction is done by picking one object (the "vantage point")
1271 from the set of objects to index and then determining the distance
1272 of all remaining objects to the vantage point. The median distance
1273 is calculated and the set of remaining objects is split in two:
1274 objects whose distance to the vantage point is smaller than the
1275 median distance and the rest (objects whose distance to the vantage
1276 point is larger than or equal to the median). Yianilos called this
1277 process "ball decomposition". For both of the resulting sets new
1278 child nodes, called "left" and "right", are created recursively. The
1279 vantage point and the previously calculated median distance are
1280 stored in the current node.
1281
1282 When searching a node, the distance between the current node's value
1283 and the search object is calculated. If it is ``<= k`` (the given
1284 maximum distance), this node's value is added to the list of search
1285 results. Searching proceeds recursively into the left subtree if
1286 ``distance - k < median`` and into the right subtree if ``distance +
1287 k >= median``.
1288
1289 Since the construction phase takes advantage of the median of all
1290 distances, VPTrees do a fairly good job of balancing their subtrees.
1291 Only the fact that you have to put all objects whose distance to the
1292 vantage point equals the median distance always have to be put on
1293 the same side makes VPTrees tend to hang to the side containing
1294 these objects when using discrete distance functions.
1295
1296 Runtime complexity for the construction phase is ``O(n log(n))``,
1297 searching is ``O(log(n))``. Space requirement is ``O(n)``.
1298
1299 """
1300
1301 __slots__ = [ '_parent', '_values', '_size', '_func', '_height',
1302 '_num_nodes', '_median', '_leftchild', '_rightchild' ]
1303
1304 - def __init__(self, objects=None, func=None, parent=None):
1305 self._median = None
1306 self._leftchild = None
1307 self._rightchild = None
1308 super(VPTree, self).__init__(objects, func, parent)
1309
1311 self._func = func
1312 if objects:
1313 if self.is_root():
1314
1315
1316
1317 objects = list(objects)
1318
1319 self._values = [self._pick_VP(objects)]
1320 left, right = self._decompose(objects)
1321 del objects
1322
1323 self._incr_size(len(self._values))
1324 if left:
1325 self._leftchild = VPTree(left, func, self)
1326 if right:
1327 self._rightchild = VPTree(right, func, self)
1328 return self
1329
1331
1332
1333 if len(objects) > 15:
1334 sample = objects[:5]
1335 max_diff = -1
1336 vp = None
1337 for o in sample:
1338 dists = [ self._func(other, o) for other in sample
1339 if other is not o]
1340 diff = max(dists) - min(dists)
1341 if diff > max_diff:
1342 max_diff, vp = diff, o
1343 objects.remove(vp)
1344 return vp
1345 else:
1346 return objects.pop()
1347
1349 """
1350
1351 Perform the process called "ball decomposition" by Peter
1352 Yianilos.
1353
1354 `objects` has to be an iterable that yields objects applicable
1355 to the metric function in use. The return value is a tuple of
1356 two lists: one list that contains all elements having a distance
1357 smaller than the median distance to this node's value, the
1358 second list contains all objects whose distance is equal to or
1359 larger than the median distance of all given `objects` to this
1360 node's value.
1361
1362 """
1363 dist_per_obj = list()
1364 for obj in objects:
1365 distance = self._get_dist(obj)
1366 if distance == 0:
1367 self._values.append(obj)
1368 else:
1369 dist_per_obj.append( (distance, obj) )
1370 if dist_per_obj:
1371 self._median = VPTree.determine_median(zip(*dist_per_obj)[0])
1372 left = [ obj for dist, obj in dist_per_obj
1373 if dist < self._median ]
1374 right = [ obj for dist, obj in dist_per_obj
1375 if dist >= self._median ]
1376 else:
1377 left, right = None, None
1378 return left, right
1379
1380 @staticmethod
1392
1394 if self._leftchild and distance - max_dist < self._median:
1395 yield self._leftchild
1396 if self._rightchild and distance + max_dist >= self._median:
1397 yield self._rightchild
1398
1400 return [child for child in (self._leftchild, self._rightchild)
1401 if child]
1402 children = property(__children)
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473 -class UnindexableObjectError(Exception):
1474 """
1475
1476 This Exception is thrown when the call to the distance function in
1477 use by a metric space throws an exception. This should not happen
1478 during regular usage and is a hint that either your distance
1479 function is buggy or you tried to index objects which your distance
1480 function cannot handle.
1481
1482 """
1483
1487
1488 orig_exception = None
1489 """ The exception that has triggered this exception. """
1490
1491 pass
1492
1495 """
1496
1497 Compute Levenshtein (or "edit") distance between two strings.
1498 Levenshtein distance between two strings ``x`` and ``y`` is defined
1499 as the minimal number of operations on single characters that it
1500 takes to transform ``x`` into ``y`` (or vice versa).
1501
1502 Allowed operations are:
1503
1504 - Insertion,
1505 - Deletion and
1506 - Replacement
1507
1508 of a single character.
1509
1510 Levenshtein distance has all the properties of a strictly positive
1511 metric. That means for all x, y, z it holds:
1512
1513 - x == y <=> levenshtein(x, y) == 0
1514 - x != y <=> levenshtein(x, y) > 0 (positivity)
1515 - levenshtein(x, y) == levenshtein(y, x) (symmetry)
1516 - levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z)
1517
1518 The upper bound for Levenshtein distance is the length of the longer
1519 string: ``max(len(x), len(y))``. A general lower bound is
1520 ``abs(len(x) - len(y))``. This is the case where one string is the
1521 pre-/postfix of the other one.
1522
1523 Incidentally, this implementation not only works for strings, but
1524 for all types of sequences. Objects in the given sequences are
1525 compared for equality using '==' to determine whether an edit
1526 operation is needed.
1527
1528 """
1529
1530 if s1 == s2: return 0
1531 if len(s1) > len(s2):
1532 s1, s2 = s2, s1
1533 r1 = range(len(s2) + 1)
1534 r2 = [0] * len(r1)
1535 i = 0
1536 for c1 in s1:
1537 r2[0] = i + 1
1538 j = 0
1539 for c2 in s2:
1540 if c1 == c2:
1541 r2[j+1] = r1[j]
1542 else:
1543 a1 = r2[j]
1544 a2 = r1[j]
1545 a3 = r1[j+1]
1546 if a1 > a2:
1547 if a2 > a3:
1548 r2[j+1] = 1 + a3
1549 else:
1550 r2[j+1] = 1 + a2
1551 else:
1552 if a1 > a3:
1553 r2[j+1] = 1 + a3
1554 else:
1555 r2[j+1] = 1 + a1
1556 j += 1
1557 aux = r1; r1 = r2; r2 = aux
1558 i += 1
1559 return r1[-1]
1560
1562 """
1563
1564 Simple generator that tokenizes sequences of strings (or file-like
1565 objects) into uppercase words using the optional `separator`. This
1566 `separator` is passed directly to str.split so ``split``'s behaviour
1567 concerning `None` as separator applies here as well.
1568
1569 After the last token has been returned, there is an attempt to close
1570 `textfile`. If this yields an `AttributeError`, it will be silently
1571 ignored.
1572
1573 """
1574 for line in textfile:
1575 for token in line.split(separator):
1576 yield token.upper()
1577 try:
1578 textfile.close()
1579 except AttributeError, e:
1580 pass
1581
1582 dist_ctr = 0
1583
1584 try:
1585 import psyco
1586 psyco.bind(levenshtein)
1587 psyco.bind(VPTree)
1588 psyco.bind(BKTree)
1589 psyco.bind(FastBKTree)
1590 except ImportError, e:
1591 sys.stderr.write("Psyco not available, proceeding without it.\n")
1592
1593 if __name__ == '__main__':
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618 pass
1619
1620
1621