mspace

1 #!/usr/bin/python -tt 2 # -*- coding: utf-8 -*- 3 4 # mspace.py - An implementation of metric space indexes for efficient 5 # similarity search 6 # 7 # Copyright (C) 2007s2012 Jochen Spieker 8 # 9 # This program is free software; you can redistribute it and/or modify 10 # it under the terms of the GNU General Public License version 2 as 11 # published by the Free Software Foundation. 12 # 13 # This program is distributed in the hope that it will be useful, but 14 # WITHOUT ANY WARRANTY; without even the implied warranty of 15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 16 # General Public License for more details. 17 # 18 # You should have received a copy of the GNU General Public License 19 # along with this program; if not, write to the Free Software 20 # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, 21 # USA. 22 23 """ 24 25 Metric Spaces 26 ============== 27 28 .. contents:: 29 30 Introduction 31 ------------- 32 33 This module provides efficient similarity search by implementing 34 different data structures for indexing metric spaces. You can use it to 35 index an arbitrary number of objects which you later want to search for 36 objects "similar" to a given other object. 37 38 For this module to be useful, you need a function with metric properties 39 (see below for a definition) which you can apply to the type of objects 40 you want to search for. The only pre-defined metric funtion in this 41 module is `levenshtein`, which is a common metric to calculate the 42 (dis)similarity of two strings. 43 44 Beware that the indexing phase might take a long time, depending on the 45 number of objects you want to index and the runtime of your metric 46 distance function. Indexing only pays off when you are planning to do a 47 lot of searches. 48 49 Definitions 50 ............ 51 52 A metric space is a pair ``(S, f)`` where ``S`` is an abritrary 53 (possibly infinite) set of objects and ``f`` is a function which returns 54 a "distance" for any two objects ``x``, ``y`` in ``S``. This distance is 55 a number used to express the similarity between the two objects. The 56 larger the distance, the more they differ. A distance of zero means they 57 are considered equal. 58 59 The function ``f`` has to be a "metric", i.e. the following conditions 60 have to be met for all ``x``, ``y``, ``z`` in ``S``: 61 62 - ``f(x, y) >= 0`` 63 64 (positivity) 65 66 - ``f(x, y) == f(y, x)`` 67 68 (symmetry) 69 70 - ``f(x, z) <= f(x, y) + f(y, z)`` 71 72 (triangle inequality) 73 74 This definition adheres to most people's intuitive understanding of the 75 distance between two points in a Euclidian space (of arbitrary 76 dimensionality). [#euclid]_ Imagine you have three points in a 77 2-dimensional space like this:: 78 79 A-------B 80 \ / 81 \ / 82 \ / 83 C 84 85 It is evident that the distance between two of these points is 86 necessarily larger than (or equal to) zero, and that the direction you 87 take for measuring (from ``A`` to ``B`` or vice versa) has no influence 88 on the distance. The third condition, the triangle inequality is easy to 89 grasp, too: if you know the distances between ``A`` and ``B`` and 90 between ``B`` and ``C``, you can be sure that their sum is equal to or 91 larger than the distance between ``A`` and ``C``. Voila, there's your 92 metric space. 93 94 In many cases, metric distance functions take a long time to compute. 95 Because of this, data structures used to index metric spaces are 96 designed to minimize the number of distance computations necessary for 97 searching the whole space by using the properties of its metric function 98 (most notably the triangle inequality) to discard as many objects as 99 possible after every computation. 100 101 .. [#euclid] In fact, Euclidian spaces are one type of metric space. 102 103 104 Use cases & caveats 105 .................... 106 107 Metric distance functions can be defined for objects other than points 108 in Euclidian space, too. One very well known distance function is the 109 Levenshtein distance, which is applicable to arbitrary strings. Since it 110 is quite popular and indexing strings is one of the major use cases for 111 metric spaces, an implementation of this function is included in this 112 module. Other metric distance functions for strings are the 113 Damerau-Levenshtein variant and Hamming distance. These can be used for 114 a variety of tasks, such as spell-checking, record linkage (or 115 "deduplication"), error detection etc. 116 117 Please note that indexing a large number of objects usually takes a 118 considerable amount of time *and* memory. Indexing only pays off if you 119 are going to do *a lot* of searches in the index. If you just want to 120 search a set of objects for all objects similar to one or a few other 121 objects, you are better off comparing them sequentially by hand with 122 your distance function. 123 124 Installation 125 ............. 126 127 .. _Psyco: http://psyco.sourceforge.net/ 128 129 This file is a stand-alone module for Python 2.4 and later which you 130 only have to save somewhere in your ``PYTHONPATH``. [#pythonpath]_ No 131 installation is necessary. The only tight dependencies are the modules 132 ``sys``, ``random`` and ``weakref`` which are most probably already 133 included in your Python installation. 134 135 If you are using this on a 32 Bit i386 machine running Windows or Linux, 136 you probably also want to install Psyco_ which speeds up execution times 137 quite noticeably (in exchange for only a slightly increased memory 138 usage). Psyco will be used automatically if it is available. 139 140 .. [#pythonpath] Run ``python -c "import sys; print sys.path"`` to learn 141 where Python currently looks for modules. 142 143 The most current version is always available from this site: 144 145 http://well-adjusted.de/mspace.py/ 146 147 The project is currently kept in SVN and may be obtained from the 148 following URL: 149 150 svn://svn.well-adjusted.de/mspace/trunk 151 152 Basic usage 153 ------------ 154 155 Index creation 156 ............... 157 158 Say you have a dictionary file containing one or more words per line. 159 You can make an index of all the words in it by doing: 160 161 .. python:: 162 163 >>> import mspace 164 >>> mydict = file('dictionary.txt') 165 >>> words = mspace.tokenizer(mydict) 166 >>> index = mspace.VPTree(words, mspace.levenshtein) 167 168 You can delay the time-comsuming construction phase by creating the 169 index without any parameters and explicitly calling ``construct`` at a 170 later time: 171 172 .. python:: 173 174 >>> index = mspace.VPTree() 175 >>> index.construct(words, mspace.levenshtein) 176 177 Please note that the index' content is dismissed on every call to 178 ``construct``. The index is built from scratch with only the objects and 179 distance function passed as parameters. 180 181 Performing range-searches 182 .......................... 183 184 After you have built the index (and got yourself some coffee if the 185 index is large) you may search it for all objects having distance to a 186 given other object between arbitrary bounds: 187 188 .. python:: 189 190 >>> index.range_search('WOOD', min_dist=0, max_dist=1) 191 ['GOOD', 'WOOD', 'WOOL', 'MOOD'] 192 193 In this case, you received four results: the object you searched for 194 (apparently it has been in your dictionary, too) and two other objects 195 "GOOD" and "WOOL", both of which have the requested maximum distance of 196 one to your query object. Result lists are unordered and do not contain 197 information about their contents' distance to the query object. If you 198 need this, you have to calculate it by hand. 199 200 Both ``min_dist`` and ``max_dist`` are optional parameters defaulting to 201 zero. Thus, if you leave them out, a search for objects which are 202 exactly equal (as defined by your distance function) to the query object 203 is performed. For historical reasons, and since range-searching with a 204 minimum distance of zero and a maximum greater than zero is quite 205 common, there is a shortcut to search with ``min_dist=0``: 206 207 .. python:: 208 209 >>> index.search('WOOD', 1) 210 ['GOOD', 'WOOD', 'WOOL', 'MOOD'] 211 212 213 Performing nearest-neighbour-searches 214 ...................................... 215 216 The second type of search you can perform is "k-nearest-neighbour" 217 search. It returns a given number of objects from the index that are 218 closest to the query object. Search results are guaranteed to never 219 contain an object with a distance to the query object that is larger 220 than that of any other object in the tree. 221 222 Result lists are composed of ``(object, distance)`` tuples, sorted 223 ascendingly by the distance. If you don't specify a maximum number of 224 matches to return, it defaults to one: 225 226 .. python:: 227 228 >>> index.nn_search('WOOD') 229 [('WOOD', 0)] 230 >>> index.nn_search('WOOD', 5) 231 [('WOOD', 0), ('WOOL', 1), ('GOOD', 1), ('MOOD', 1), ('FOOT', 2)] 232 233 Note that the index may contain other objects with a distance of two to 234 the query object ``'WOOD'`` (e.g. ``'FOOL'``). They have just been cut 235 off because a maximum of five objects has been requested. You have no 236 influence on the choice of representatives that is made. 237 238 Note that you must not expect this method to always return a list of 239 length ``num`` because you might ask for more objects than are currently 240 in the index. 241 242 243 Advanced usage 244 --------------- 245 246 Indexing complex objects 247 ......................... 248 249 If you have "complex" objects which you want to index by only one 250 specific attribute, you can write a simple wrapper around `levenshtein` 251 (or some other function applicable to the attribute) which extracts this 252 attribute from your objects and returns the distance between their 253 attributes' value. This way you can search for and retrieve complex 254 objects from the index while comparing only a single attribute 255 256 .. python:: 257 258 >>> import mspace 259 >>> 260 >>> class Record(object): 261 ... def __init__(self, firstname, surname): 262 ... self._firstname, self._surname = firstname, surname 263 ... 264 >>> def firstnameLev(r1, r2): 265 ... return mspace.levenshtein(r1._firstname, r2._firstname) 266 ... 267 >>> index = mspace.VPTree(listOfRecords, firstnameLev) 268 >>> rec = Record("Paul", "Brady") 269 >>> index.search(rec, 2) 270 [<Record: 'Paula Bean'>, <Record: 'Raoul Perky'>, <Record: 'Paul Simon'>] 271 272 Of course you can also use more advanced techniques like writing a 273 function factory which returns a function that extracts arbitrary 274 attributes from your objects and applies a metric function to it: 275 276 .. python:: 277 278 >>> def metric_using_attr(metric, attr): 279 ... def f(one, other, attr=attr): 280 ... attr1, attr2 = getattr(one, attr), getattr(other, attr) 281 ... return metric(attr1, attr2) 282 ... return f 283 ... 284 >>> metric = metric_using_attr(mspace.levenshtein, "_firstname") 285 >>> metric( Record("Paul", "Brady"), Record("Paul", "Simon") ) 286 0 287 288 (Note that the inner function ``f`` in this example deviates from the 289 standard protocol by accepting an optional third parameter. This is 290 done here only to pull the name ``attr`` into the inner function's 291 namespace to save some time when looking up its associated value. No 292 index structure in this module will ever call its distance function with 293 more than two parameters anyway, so that whatever you passed as ``attr`` 294 to ``metric_using_attr`` when creating your function will be used when 295 this function gets called by an index. Got it?) 296 297 Reverse indexes 298 ................ 299 300 Of course you can use a similar technique to avoid full indexes and 301 instead create an index which only contains references to your data (in 302 a database, text file, somewhere on the net etc.). Your distance 303 function would then use the supplied values to fetch the objects to 304 compare from your data source. But, I cannot stress this enough, your 305 distance function's performance is absolutely crucial to efficient 306 indexing and searching. Therefore you should make sure that your way of 307 accessing the data is really fast. 308 309 Choice of data structure 310 ------------------------- 311 312 This module currently gives you two data structures to choose from. 313 While the underlying algorithms are different, their search results 314 (given the same set of objects to index, distance function and maximum 315 distance) are exactly the same. If you find a scenario where this is not 316 the case, please let me know because this would be a serious bug. 317 318 Capabilities 319 ............. 320 321 The algorithms implemented in this module have different capabilities 322 which are displayed in the table below: 323 324 +--------------------+-------+-------+ 325 | |VPTree |BKTree | 326 +====================+=======+=======+ 327 |non-discrete metric | X | | 328 +--------------------+-------+-------+ 329 |insertion | | X | 330 +--------------------+-------+-------+ 331 |deletion | | | 332 +--------------------+-------+-------+ 333 334 The first row is about the value returned by the distance function the 335 index uses. BKTrees are not prepared to handle non-discrete distances 336 (for example floats) and will most probably perform really bad when they 337 occur. 338 339 The other two rows describe what you can do with the indexes *after* 340 they have been constructed. Please note that neither of them may shrink 341 over time. Only BKTrees are able to handle insertion efficiently. 342 343 Performance 344 ............ 345 346 [In short: you are probably better off using BKTrees unless you need a 347 non-discrete metric. When in doubt, test both options.] 348 349 Obviously, the whole point of this module is to speed up the process of 350 finding objects similar to one another. And, as you can imagine, the 351 different algorithms perform differently when exposed to a specific 352 problem. 353 354 Here's an example: I took the file ``american-english-large`` from the 355 Debian package ``wamerican-large`` and did some time measurements of 356 indexing and searching a random subset of it. The machine I used is an 357 1.1GHz Pentium3 running Python 2.4 from Debian stable **with Psyco 358 enabled**. Please note that the following numbers have been determined 359 completely unscientifical. I only show them to give you an idea of what 360 to expect. For serious benchmarking, I should have used the ``timeit`` 361 module. Nevertheless, this is it: 362 363 364 +------+------------------------+------------------------+ 365 | |BKTree |VPTree | 366 +======+======+=========+=======+=======+========+=======+ 367 |size |time |per node |height |time |per node|height | 368 +------+------+---------+-------+-------+--------+-------+ 369 | 5000 | 2.92|0.000584 |11 | 7.40|0.001480| 21| 370 +------+------+---------+-------+-------+--------+-------+ 371 |10000 | 6.32|0.000632 |14 | 16.02|0.001602| 22| 372 +------+------+---------+-------+-------+--------+-------+ 373 |15000 | 9.95|0.000663 |14 | 28.35|0.001890| 24| 374 +------+------+---------+-------+-------+--------+-------+ 375 |20000 | 13.70|0.000685 |14 | 41.40|0.002070| 24| 376 +------+------+---------+-------+-------+--------+-------+ 377 |25000 | 17.46|0.000699 |15 | 50.63|0.002025| 25| 378 +------+------+---------+-------+-------+--------+-------+ 379 |30000 | 21.81|0.000727 |15 | 55.47|0.001849| 25| 380 +------+------+---------+-------+-------+--------+-------+ 381 |35000 | 25.77|0.000736 |16 | 64.43|0.001841| 26| 382 +------+------+---------+-------+-------+--------+-------+ 383 |40000 | 29.40|0.000735 |16 | 75.45|0.001886| 26| 384 +------+------+---------+-------+-------+--------+-------+ 385 |45000 | 41.28|0.000917 |16 | 96.36|0.002141| 26| 386 +------+------+---------+-------+-------+--------+-------+ 387 |50000 | 37.62|0.000752 |16 | 95.31|0.001906| 28| 388 +------+------+---------+-------+-------+--------+-------+ 389 390 391 This table shows construction times (total and per node in seconds) for 392 data sets of a given size. Additionally, you can see the height 393 [#height]_ of the trees. Apparently, BKTrees can be constructed a lot 394 faster than VPTrees. Both of them need an increasing time per node with 395 growing data sets. This is expected since construction complexity is 396 ``O(n log(n))`` in both cases. 397 398 +---------------+-----------------+-----------------+ 399 | |BK, size: 50,000 |VP, size: 50,000 | 400 +------+--------+-------+---------+-------+---------+ 401 |k |results | time | # dist | time | # dist | 402 +------+--------+-------+---------+-------+---------+ 403 |0 | 0.89 |0.0008 | 8.14 | 0.0019| 17.28| 404 +------+--------+-------+---------+-------+---------+ 405 |1 | 4.07 |0.1670 | 1583.77 | 0.2801| 1933.64| 406 +------+--------+-------+---------+-------+---------+ 407 |2 | 38.10 |1.1687 |10353.31 | 1.4413| 13845.67| 408 +------+--------+-------+---------+-------+---------+ 409 |3 | 304.57 |2.5614 |22202.28 | 3.0497| 27514.51| 410 +------+--------+-------+---------+-------+---------+ 411 |4 |1584.86 |3.8727 |32376.54 | 4.1518| 36877.62| 412 +------+--------+-------+---------+-------+---------+ 413 |5 |5317.03 |4.4182 |39616.04 | 4.9935| 42720.38| 414 +------+--------+-------+---------+-------+---------+ 415 416 This table shows the runtime of searches done in the trees (in seconds), 417 the number of distance calculations done and the number of results for 418 growing error tolerance. All values are given as average over 100 random 419 searches (from the same file, but not necessarily from the set of 420 indexed objects). As you can see, search runtime (tightly connected to 421 the number of distance calculations) literally explodes for larger 422 values of ``k``. This growth only fades when an increased part of the 423 complete search space is examined (the number of distance calculations 424 equals the number of nodes compared with the query object). 425 426 As you can see, too, long search runtimes for large values of ``k`` 427 don't actually hurt usability very much since you only get a usable 428 number of results for small ``k`` anyway. This is of course due to the 429 nature of the data set and the distance function used in this example. 430 Your application may vary greatly. 431 432 You also have to keep in mind that the dictionary I used contains almost 433 no duplicates. If you used the words of a real document as your data 434 set, your tree would have significantly less nodes than the number of 435 words in your document since you typically repeat a lot of words very 436 often. A quick test with my diploma thesis revealed only 4400 distinct 437 words in a document with 14,000 words (including LaTeX commands). This 438 makes searching much faster because equal objects (as defined by the 439 distance function) are only evaluated once. 440 441 .. [#height] If you look closely, you can see that the VPTree's height 442 doesn't grow continuously when the data set grows. The reason for 443 that phenomenon is that this implementation does not try very hard to 444 pick a good vantage point which is the key factor to get 445 well-balanced trees. So, in some cases, a vantage point may be chosen 446 that may result in trees which are higher than strictly necessary. 447 448 Optimization potential (or: Why the f*ck is this so slow!?) 449 ------------------------------------------------------------ 450 451 Ok, you have tried to index your small dictionary file and start to 452 wonder why this easy task takes several minutes. Here are a couple of 453 (possible) reasons: 454 455 - Indexing takes ``O(n log(n))`` for all data structures currently 456 implemented. So yes, doubling the number of indexed objects will 457 necessarily more than double your runtime for indexing, sorry. 458 459 - Python is slow. I have written an implementation very similar to this 460 one in Java which is significantly faster (by a factor of about 15 to 461 25, even when using Psyco_!). But the java implementation eats more 462 memory and unfortunately I cannot release it under a free license. 463 464 - Recursion is slow in Python. Recursion is the most natural way to 465 create and search trees and this code uses it a lot. Most of the 466 recursion in this module is "tail recursion", but the Python 467 interpreter is not able to optimize it into loops. 468 469 [Note: as of revision 60, searching in both trees and inserting into 470 BKTrees has been rewritten using loops instead of recursion. 471 Perfomance gain is quite small, though.] 472 473 - The distance distribution in your metric space is too dense. This 474 leads to the data structures being unable to discard large parts of 475 the indexed space while searching. In pathological cases, you may end 476 up with data structures resembling linked lists. 477 478 - Your distance function is non-discrete, i.e. it returns floats. How 479 well this is supported depends on the index structure in use. 480 481 - Your metric function is very expensive. Remember that this function 482 has to be called *very often* during indexing. You may use this 483 module's attribute ``dist_ctr`` to get an idea how often this is done. 484 It is incremented by one on every call to your distance function and 485 is otherwise unused. 486 487 - You are searching for non-similar objects. If you, for example, have 488 an index of strings with an average length of six characters and you 489 are continuously searching for strings with a maximum distance of 490 three or more, do not expect the search to be significantly faster 491 than linear testing of the complete search space. It may even be 492 slower. 493 494 Possible solutions include: 495 496 - Rewrite in C. Since I cannot be bothered to learn C, someone else 497 would have to do this. 498 499 - Use Pyrex or Psyco_. Quick tests with Psyco suggest that indexing 500 becomes about 3-4 times faster. This is as easy as doing an ``import 501 psyco; psyco.full()`` in your program. Read Psyco's manual for tuning 502 options. 503 504 [Note: as of about revision 46, on module loading an attempt is made 505 to import and enable Psyco for levenshtein and the tree classes. If it 506 doesn't succeed, you'll get a performance warning on ``stderr`` but 507 everything will continue to work flawlessly.] 508 509 - Pickle trees for static data sets. Pickling and unpickling indexes 510 using Python's standard ``pickle`` module is quite fast. But beware 511 that a pickled index takes a lot more space than your plain data. 512 513 - Rewrite tree construction and searching using loops. I am not sure 514 whether this will be faster, but it will probably take less memory 515 than the current approach and fix the problem with Python's recursion 516 limit. I fear that code readibility will suffer, though. The current 517 implementations are quite close to the original descriptions of the 518 algorithms. 519 520 [Note: partially done, see above.] 521 522 - Optimize the distance function in use. Since I do not know your 523 distance function, I cannot help you with this. 524 525 As for `levenshtein`: the algorithm implemented in this module is the 526 classical dynamic programming variant with ``O(|n|*|m|)`` time 527 complexity (and ``O(min(|n|, |m|))`` for space). Other algorithms for 528 Levenshtein try not to compute the whole matrix but only the values 529 around the diagonal of the matrix that are strictly necessary. By 530 doing this, the average ``O(|n|*|m|)`` of the classical dynamic 531 programming algorithm could be improved to something like 532 ``O(k*|n|)``. While it isn't obvious how much wall clock time this 533 approarch saves in reality, it may be in fact a whole lot faster 534 because ``k`` is usually only a fraction of the length of the input 535 strings. 536 537 And if you take a close look at the algorithms, you may realize that 538 searching may be sped up by using an algorithm that doesn't actually 539 compute the distance between two strings, but one that only answers 540 whether two strings have a distance less than or equal to a specified 541 maximum distance. This can be done in ``O(k^2)`` and should boost 542 search performance quite noticeably. 543 544 .. _Citeseer: http://citeseer.ist.psu.edu/navarro99guided.html 545 546 547 If you are interested in any of these optimizations, I suggest reading 548 Gonzalo Navarro's excellent survey "Guided Tour To Approximate String 549 Matching" which you can get from Citeseer_. 550 551 Currently, this module always uses the same function for searching and 552 indexing. If you want to test your own implementations, you have to 553 replace your index object's ``_func`` attribute after indexing and 554 before searching. 555 556 557 Contact 558 -------- 559 560 For comments, patches, flames and hugs send mail to: 561 jrspieker@well-adjusted.de. 562 563 """ 564 565 __docformat__ = 'restructuredtext en' 566 __author__ = '$Author: jrschulz $' 567 __version__ = '$Revision: 119 $' 568 __date__ = '$Date: 2012-09-17 02:13:52 +0200 (Mon, 17 Sep 2012) $' 569 __license__ = "GPL" 570 571 import sys, random, weakref

572 573 -class MetricTree(object):

574 """ 575 576 Base class for metric space indexes which have tree-like properties. 577 It's implementation is not complete enough to make it actually 578 usable, but it should make it quite easy to implement other indexing 579 and searching strategies. In an ideal case, all you have to do is to 580 derive from this class, implement the methods `construct`, 581 `_get_child_candidates` and optionally `insert` and make `children` 582 an attribute or property returning all child nodes of the current 583 node. If you need an ``__init__`` method, you should call 584 ``super(YourClass, self).__init__(objects, func, parent)`` in its 585 end as well. 586 587 Instances of this class (and its subclasses, of course) are supposed 588 to be recursive data structures. This means that they may be treated 589 like the root of a tree whether they have a parent or not. This 590 means every node has: 591 592 - a list of `_values` with every element of that list having a 593 distance of zero to all other elements in the list. Because of 594 triangle inequality, you only need to check this condition against 595 one element of this list. 596 597 - a `_parent` (`None` for root nodes) 598 599 - a list of `children` (empty list for leaf nodes) 600 601 - a `_height` (one for trees consisting of only one node) 602 603 - a number of nodes `_num_nodes` (the total number of nodes in this 604 subtree) 605 606 - a `_size` (the total number of objects stored in this subtree) 607 608 Note that in many (most? some?) cases, implementors don't have to 609 fiddle with these values directly. Instead, there are some helper 610 methods like `_incr_node_counter` and `_incr_size` that may be 611 called at appropriate times. If in doubt, just implement what you 612 strictly need first and then take a look whether node, size and 613 height counting works as expected (and whether you care). 614 615 As a bonus, this class implements some of python's magic methods so 616 that you can iterate over its instances' content and use the ``in`` 617 operator to test for the existence of an object using the distance 618 function. 619 620 """ 621 622 _parent = None 623 """ 624 625 The parent of this node. ``None`` for root nodes. 626 627 """ 628 629 _values = list() 630 """ 631 632 A list of values in this subtree. All objects in this list are 633 considered equal by the distance function in use. 634 635 """ 636 637 _size = 0 638 """ 639 640 The number of objects in this subtree. 641 642 """ 643 644 _func = 0 645 """ 646 647 The distance function used for indexing and searching. It has to 648 accept any two objects from the list of objects you are indexing as 649 arguments and return a non-negative integer (or long). 650 651 """ 652 653 _height = 0 654 """ 655 """ 656 657 _num_nodes = 0 658 """ 659 """ 660

661 - def __init__(self, objects=None, func=None, parent=None):

662 """ 663 664 Create a new metric tree. If ``objects`` and ``func`` are given, 665 the given objects will be indexed immediately using the distance 666 function which makes it possible to immediately start so search 667 for other objects in it. Otherwise, you have to call `construct` 668 later in order to make use of this metric tree. 669 670 """ 671 self._values = list() 672 self._size = 0 673 self._height = 1 674 self._func = func 675 self.parent = parent 676 self._num_nodes = 0 677 self._incr_node_counter() 678 self._calculate_height() 679 if objects and func: 680 self.construct(objects, func)

681

682 - def range_search(self, obj, min_dist=0, max_dist=0):

683 """ 684 685 Return a list of all objects in this subtree whose distance to 686 `obj` is at least `min_dist` and at most `max_dist`. `min_dist` 687 and `max_dist` have to satisfy the condition ``0 <= min_dist <= 688 max_dist``. 689 690 If this metric tree is empty, an empty list is returned, 691 regardless of whether this tree has been previously constructed 692 or not. 693 694 If calling the distance function fails for any reason, 695 `UnindexableObjectError` will be raised. 696 697 """ 698 assert( 0 <= min_dist <= max_dist ) 699 if not self: return list() 700 result, candidates = list(), [self] 701 while candidates: 702 node = candidates.pop() 703 distance = node._get_dist(obj) 704 if min_dist <= distance <= max_dist: 705 result.extend(node._values) 706 candidates.extend( node._get_child_candidates( 707 distance, min_dist, max_dist)) 708 return result

709

710 - def search(self, obj, max_dist):

711 """ 712 713 Equivalent to range_search(obj, min_dist=0, max_dist). 714 715 """ 716 return self.range_search(obj, max_dist=max_dist)

717

718 - def nn_search(self, obj, num=1):

719 """ 720 721 Perform a k-nearest-neighbour search and return a sorted list of 722 at most ``num`` objects together with their distance to the 723 query object (``(object, distance)`` tuples). Sorting is done by 724 the distance (ascending). Of course, `num` has to be an ``int`` 725 (or ``long``, for that matter) larger than zero. 726 727 For obvious reasons, this method cannot return more objects than 728 are currently present in the tree. That means, if ``num > 729 len(self)``, only ``len(self)`` pairs of ```(object, distance)`` 730 will be returned. 731 732 If this metric tree is empty, an empty list is returned, 733 regardless of whether this tree has been previously constructed 734 or not. 735 736 If several objects in this tree share the same maximum distance 737 to `obj`, no guarantee is given about which of these objects 738 will be returned and which are left out. 739 740 If calling the distance function fails for any reason, 741 `UnindexableObjectError` will be raised. 742 743 """ 744 if num < 1: return list() 745 if not self: return list() 746 neighbours = list() # list of (value, distance) tuples 747 candidates = [self] 748 def cmp_neighbours(a, b): return cmp(a[1], b[1]) 749 while candidates: 750 cand = candidates.pop() 751 distance = cand._get_dist(obj) 752 candidate_is_neighbour = bool([1 for n in neighbours 753 if distance < n[1]]) 754 if candidate_is_neighbour or len(neighbours) < num: 755 neighbours.extend([(v, distance) for v in cand._values]) 756 neighbours.sort(cmp_neighbours) 757 if len(neighbours) > num: 758 neighbours = neighbours[:num] 759 elif len(neighbours) == num: 760 max_dist = neighbours[-1][1] 761 if max_dist == 0: 762 break 763 candidates.extend( 764 node._get_child_candidates( distance, 0, max_dist)) 765 else: 766 candidates.extend(node.children) 767 return neighbours

768

769 - def _get_child_candidates(self, distance, min_dist, max_dist):

770 """ 771 772 Return a sequence of child nodes that may contain objects with a 773 distance difference between (inclusive) ``min`` and ``max`` to a 774 certain query object. Note that the query object is *not* 775 passed to this method. Instead, ``distance`` is the query 776 object's previously calculated distance to this node. 777 778 """ 779 raise NotImplementedError()

780

781 - def construct(self, objects, func):

782 """ 783 784 (Re)Index this space with the given ``objects`` using the 785 distance function ``func``. Previous contents will be 786 discarded. ``objects`` has to be a sequence or an iterable. The 787 distance function ``func`` needs to be applicable to all objects 788 contained in ``objects``. 789 790 If calling the distance function fails for any reason, 791 `UnindexableObjectError` will be raised. 792 793 """ 794 raise NotImplementedError()

795

796 - def insert(self, obj):

797 """ 798 799 Insert a single object into the metric tree. Returns ``self``, 800 i.e. the tree itself. 801 802 This method may not be implemented by all subclasses of 803 `MetricTree` since not all data structures allow to do this 804 efficiently. `NotImplementedError` will be raised when this is 805 the case. 806 807 If calling the distance function fails for any reason, 808 `UnindexableObjectError` will be raised. 809 810 """ 811 raise NotImplementedError()

812

813 - def is_root(self):

814 """ 815 816 Answer whether this node is the root of a tree (i.e. it has no 817 parent). 818 819 """ 820 return self.parent is None

821

822 - def is_leaf(self):

823 """ 824 825 Answer whether this node is a leaf node (i.e. it has no 826 children) 827 828 """ 829 return not self.children

830

831 - def __children(self):

832 """ 833 834 A sequence of this node's children. 835 836 The possible number of children per node is 837 implementation-dependent. Leaf nodes return an empty sequence. 838 839 """ 840 raise NotImplementedError()

841 children = property(__children) 842

843 - def __num_nodes(self):

844 """ 845 846 The number of nodes in this tree. 847 848 This may be different from the number of objects contained in 849 this tree in cases where two or more of these objects are 850 considered equal by the distance function in use (i.e., for two 851 objects ``p`` and ``q``, calling ``self._func(p, q)`` returns 852 ``0`` and when the tree is empty, i.e. there is one node (the 853 root node) but it doesn't contain any values. 854 855 """ 856 return self._num_nodes

857 num_nodes = property(__num_nodes) 858 859

860 - def __height(self):

861 """ 862 863 The height of this tree. 864 865 Empty trees have a height of ``0``, trees containing one or more 866 objects have a height ``>= 1``. 867 868 """ 869 return self._height

870 height = property(__height) 871

872 - def __parent(self):

873 """ 874 875 The parent of this node. `None` if this node is the root 876 of a tree. 877 878 """ 879 return self._parent

880

881 - def __set_parent(self, parent):

882 """ 883 884 Set the parent of this node. 885 886 Parent references are stored as "weak references" to avoid 887 circular references which the garbage collector cannot dissolve 888 by itself. 889 890 """ 891 if parent is None: 892 self._parent = None 893 else: 894 # the callback ensures that the weakref is deleted 895 # as soon as the parent node disappears so that this 896 # node recognizes it is the new root. 897 callback = lambda proxy: self.__set_parent(None) 898 self._parent = weakref.proxy(parent, callback)

899 900 parent = property(__parent, __set_parent) 901

902 - def _incr_size(self, incr=1):

903 """ 904 905 Increment the size counter for this node and all its parents 906 recursively. 907 908 Should be called whenever an object is inserted into the tree. 909 910 """ 911 def f(node, incr=incr): node._size += incr 912 self._apply_upwards(f)

913

914 - def _calculate_height(self, recursive=True):

915 """ 916 917 Set this node's height to one and (if `recursive` is `True`) 918 propagate this change upwards in the tree. 919 920 """ 921 self._height = height = 1 922 if recursive: 923 node = self.parent 924 while node is not None: 925 height += 1 926 if node._height < height: 927 node._height = height 928 node = node.parent 929 else: 930 node = None

931

932 - def _incr_node_counter(self, incr=1):

933 """ 934 935 Increment the node counter for this node and all its parents 936 recursively. 937 938 Should be called whenever a new child of this node is created. 939 940 """ 941 def f(node, incr=incr): node._num_nodes += incr 942 self._apply_upwards(f)

943

944 - def _apply_upwards(self, func, **args):

945 """ 946 947 Helper method to apply a function to this node and all its 948 parents recursively. The given function must accept one node as 949 the first parameter and may accept arbitrary keyword parameters 950 as well. 951 952 """ 953 node = self 954 func(node, **args) 955 while not node.is_root(): 956 node = node.parent 957 func(node, **args)

958

959 - def _get_dist(self, obj):

960 """ 961 962 Apply this node's distance function to the given object and one 963 of this node's values. 964 965 Raises `UnindexableObjectError` when distance computation fails. 966 967 """ 968 global dist_ctr 969 try: 970 distance = self._func(self._values[0], obj) 971 dist_ctr += 1 972 except IndexError, e: 973 sys.stderr.write("Node is empty, cannot calculate distance!\n") 974 raise e 975 except Exception, e: 976 raise UnindexableObjectError(e, "Cannot calculate distance" 977 + " between objects %s and %s using %s" \ 978 % (self._values[0], obj, self._func)) 979 return distance

980

981 - def __iter__(self):

982 """ 983 984 A generator that yields all objects in this node and its 985 children by doing recursive pre-order traversal. Implementors 986 might choose to use another traversal method which better suits 987 their data structure. 988 989 Note that objects are returned in no specific order. 990 991 This implementation will always return objects in the same order 992 as long as the tree's content does not change and the 993 implementation of `children` always returns the children in the 994 same order. If `children` is not implemented at all, 995 `NotImplementedError` will be raised. 996 997 """ 998 for obj in self._values: 999 yield obj 1000 for child in self.children: 1001 for obj in child: 1002 yield obj

1003 1004 itervalues = __iter__ 1005

1006 - def iternodes(self):

1007 """ 1008 1009 A generator that yields all nodes in this subtree by doing 1010 recursive pre-order traversal. Implementors might choose to use 1011 another traversal method which better suits their data 1012 structure. 1013 1014 Note that objects are returned unordered. 1015 1016 This implementation will always return nodes in the same order 1017 as long as the tree's content does not change and the 1018 implementation of `children` always returns the children in the 1019 same order. If `children` is not implemented at all, 1020 `NotImplementedError` will be raised. 1021 1022 """ 1023 yield self 1024 for child in self.children: 1025 for node in child.iternodes(): 1026 yield node

1027

1028 - def __nonzero__(self):

1029 """ 1030 1031 Return True if this node contains any objects. 1032 1033 """ 1034 return len(self._values) > 0

1035

1036 - def __len__(self):

1037 """ 1038 1039 Return the number of objects in this subtree 1040 1041 """ 1042 return self._size

1043

1044 - def __contains__(self, item):

1045 """ 1046 1047 Search for objects with a distance of zero to `item` and return 1048 True if something is found, otherwise False. 1049 1050 Note that this does **not** check for object identity! Instead, 1051 the definition of equality is delegated to the distance function 1052 in use. 1053 1054 """ 1055 return len(self.range_search(item)) > 0

1056

1057 - def __repr__(self):

1058 if self: 1059 return str(self.__class__) + ": " + str(self._values[0]) 1060 else: 1061 return "<empty node>"

1062

1063 1064 -class BKTree(MetricTree):

1065 """ 1066 1067 "Burkhard-Keller Trees" are unbalanced multiway trees and may grow 1068 over time. They belong to the first general solutions to index 1069 arbitrary metric spaces. 1070 1071 Every node in a BKTree stores a list of objects which all have a 1072 distance of zero to each other, i.e. all objects are considered to 1073 be equal by the distance function in use. Additionally, every node 1074 keeps a dictionary. In this dictionary, every child is stored, 1075 referenceable by its distance to its parent. 1076 1077 Essentially, every node in a BKTree divides its data set ``S`` into 1078 subsets ``S^i`` so that every element in ``S^i`` has the distance 1079 ``i`` to one object arbitrarily picked from ``S`` (and stored in 1080 this node). For each ``S^i``, a new child node is created and its 1081 parent keeps a reference to this child together with its distance. 1082 1083 Insertion of a single object ``o`` in a node ``n`` is quite easy: 1084 1085 1. If ``n`` is empty, store ``o`` in ``n``. That's it. 1086 1087 2. Otherwise, calculate its distance to ``o``. 1088 1089 3. If the distance is zero, append this object to the list of 1090 objects in this node and return. 1091 1092 4. Otherwise, look for a child of ``n`` with the calculated distance 1093 to ``n``. If there is such a child, go back to 1. with this child 1094 being the new ``n``. 1095 1096 5. Otherwise, create a new node containing ``o`` and store it as a 1097 child of ``n`` with it's calculated distance. 1098 1099 Searching is done recursively by first calculating the distance of 1100 the query object to the current node and then using the triangle 1101 inequality to determine which children may contain other search 1102 results. 1103 1104 Runtime complexity for the construction phase is ``O(n log(n))``, 1105 searching is ``O(n^a)`` where ``0 <= a <= 1``. Space requirement is 1106 ``O(n)``. 1107 1108 This implementation is close to the original description of the 1109 algorithm and can only handle discrete distances. If your distance 1110 function returns floating point values, it will appear to work at 1111 indexing time but will most probably be horribly slow when 1112 searching. 1113 1114 """ 1115 1116 __slots__ = [ '_parent', '_values', '_size', '_func', '_height', 1117 '_num_nodes', '_children' ] 1118

1119 - def __init__(self, objects=None, func=None, parent=None):

1120 self._children = {} 1121 if callable(objects): 1122 random.shuffle(list(objects())) 1123 super(BKTree, self).__init__(objects, func, parent)

1124

1125 - def construct(self, objects, func):

1126 self._func = func 1127 self._children = {} 1128 for o in objects: 1129 self.insert(o) 1130 return self

1131

1132 - def insert(self, obj):

1133 if not self: 1134 self._values.append(obj) 1135 self._incr_size() 1136 return self 1137 node = self 1138 while True: 1139 distance = node._get_dist(obj) 1140 if distance == 0: 1141 node._values.append(obj) 1142 node._incr_size() 1143 break 1144 child = node._children.get(distance, None) 1145 if child is None: 1146 child = BKTree([obj], node._func, node) 1147 node._children[distance] = child 1148 break 1149 node = child 1150 return self

1151

1152 - def _get_child_candidates(self, distance, min_dist, max_dist):

1153 assert( min_dist <= max_dist ) 1154 return (child for dist, child in self._children.iteritems() 1155 if distance - max_dist <= dist <= distance + max_dist)

1156

1157 - def __children(self):

1158 children = self._children 1159 sorted_dists = sorted(children.keys()) 1160 return [children[dist] for dist in sorted_dists]

1161 children = property(__children)

1162

1163 -class FastBKTree(object):

1164 """ 1165 1166 Simpler version of BKTree. The tree's content is kept in nested 1167 dictionaries where a "node"'s value is simply stored at key -1 and 1168 its children are stored using their distance to the value as the 1169 key. Example: 1170 1171 self.root = { -1: 'ABC', 1172 0: { -1: 'ABC' } 1173 1: { -1: 'ABCD', 1174 1: { -1: 'ABCE' } 1175 } 1176 2: { -1: 'A', 1177 1: { -1: 'B', 1178 1: { -1: 'C' } 1179 } 1180 } 1181 } 1182 1183 """ 1184 1185

1186 - def __init__(self, objects, metric):

1187 self.root = None 1188 self.metric = metric 1189 self.size = 0 1190 self.height = 0 1191 self.num_nodes = 0 1192 for o in objects: 1193 self.insert(o)

1194

1195 - def insert(self, obj):

1196 global dist_ctr 1197 #value_of = lambda node: node[-1] 1198 #new_node = lambda obj: { -1: obj } 1199 self.size += 1 1200 if self.root is None: 1201 self.root = { -1: obj } 1202 return 1203 node = self.root 1204 while True: 1205 distance = self.metric(node[-1], obj) 1206 dist_ctr += 1 1207 if distance in node: 1208 node = node[distance] 1209 else: 1210 node[distance] = { -1: obj } 1211 break

1212

1213 - def range_search(self, obj, min_dist=0, max_dist=0):

1214 global dist_ctr 1215 candidates = [self.root] 1216 result = [] 1217 while candidates: 1218 node = candidates.pop() 1219 distance = self.metric(node[-1], obj) 1220 dist_ctr += 1 1221 if min_dist <= distance <= max_dist: 1222 result.append(node[-1]) 1223 candidates.extend( child for dist, child in node.iteritems() 1224 if distance - max_dist <= dist <= distance + max_dist 1225 and dist != -1) 1226 return result

1227

1228 - def search(self, obj, max_dist):

1229 return self.range_search(obj, min_dist=0, max_dist=max_dist)

1230

1231 - def nn_search(self, obj, num=1):

1232 if not self: return list() 1233 global dist_ctr 1234 neighbours = list() # list of (value, distance) tuples 1235 candidates = [self.root] 1236 def cmp_neighbours(a, b): return cmp(a[1], b[1]) 1237 while candidates: 1238 cand = candidates.pop() 1239 distance =self.metric(node[-1], obj) 1240 dist_ctr += 1 1241 candidate_is_neighbour = any(1 for n in neighbours 1242 if distance < n[1]) 1243 if candidate_is_neighbour or len(neighbours) < num: 1244 neighbours.append(cand[-1]) 1245 neighbours.sort(cmp_neighbours) 1246 if len(neighbours) > num: 1247 neighbours = neighbours[:num] 1248 elif len(neighbours) == num: 1249 max_dist = max([n[1] for n in neighbours]) 1250 candidates.extend( 1251 child for dist, child in node.iteritems() 1252 if distance - max_dist <= dist <= distance + max_dist 1253 and dist != -1) 1254 else: 1255 candidates.extend(child for child, dist in 1256 node.iteritems() if child != 1) 1257 return neighbours

1258 1259

1260 - def __len__(self):

1261 return self.size

1262

1263 1264 -class VPTree(MetricTree):

1265 1266 """ 1267 1268 .. _Jeffrey K. Uhlmann: http://www.pnylab.com/pny/papers/vptree/vptree/ 1269 1270 .. _Peter Yianilos: http://www.intermemory.net/pny/papers/vptree/vptree.ps 1271 1272 A "Vantage Point Tree" is a static, balanced, binary tree which can 1273 be used to index metric spaces with a non-discrete distance 1274 function. It has been independently described by both `Jeffrey K. 1275 Uhlmann`_ (1991) and `Peter Yianilos`_ (1993). 1276 1277 Construction is done by picking one object (the "vantage point") 1278 from the set of objects to index and then determining the distance 1279 of all remaining objects to the vantage point. The median distance 1280 is calculated and the set of remaining objects is split in two: 1281 objects whose distance to the vantage point is smaller than the 1282 median distance and the rest (objects whose distance to the vantage 1283 point is larger than or equal to the median). Yianilos called this 1284 process "ball decomposition". For both of the resulting sets new 1285 child nodes, called "left" and "right", are created recursively. The 1286 vantage point and the previously calculated median distance are 1287 stored in the current node. 1288 1289 When searching a node, the distance between the current node's value 1290 and the search object is calculated. If it is ``<= k`` (the given 1291 maximum distance), this node's value is added to the list of search 1292 results. Searching proceeds recursively into the left subtree if 1293 ``distance - k < median`` and into the right subtree if ``distance + 1294 k >= median``. 1295 1296 Since the construction phase takes advantage of the median of all 1297 distances, VPTrees do a fairly good job of balancing their subtrees. 1298 Only the fact that you have to put all objects whose distance to the 1299 vantage point equals the median distance always have to be put on 1300 the same side makes VPTrees tend to hang to the side containing 1301 these objects when using discrete distance functions. 1302 1303 Runtime complexity for the construction phase is ``O(n log(n))``, 1304 searching is ``O(log(n))``. Space requirement is ``O(n)``. 1305 1306 """ 1307 1308 __slots__ = [ '_parent', '_values', '_size', '_func', '_height', 1309 '_num_nodes', '_median', '_leftchild', '_rightchild' ] 1310

1311 - def __init__(self, objects=None, func=None, parent=None):

1312 self._median = None 1313 self._leftchild = None 1314 self._rightchild = None 1315 super(VPTree, self).__init__(objects, func, parent)

1316

1317 - def construct(self, objects, func):

1318 self._func = func 1319 if objects: 1320 if self.is_root(): 1321 # when building the root of the tree, we make sure `objects` 1322 # is a shuffled list to improve VP picking and make 1323 # decomposing easier. 1324 objects = list(objects) 1325 #random.shuffle(objects) 1326 self._values = [self._pick_VP(objects)] 1327 left, right = self._decompose(objects) 1328 del objects # we don't need that list anymore so release it 1329 # before doing recursive calls 1330 self._incr_size(len(self._values)) 1331 if left: 1332 self._leftchild = VPTree(left, func, self) 1333 if right: 1334 self._rightchild = VPTree(right, func, self) 1335 return self

1336

1337 - def _pick_VP(self, objects):

1338 # this probably makes no sense whatsoever, simply pop()ing would 1339 # do just as well, I guess. Need to think about good strategies. 1340 if len(objects) > 15: 1341 sample = objects[:5] 1342 max_diff = -1 1343 vp = None 1344 for o in sample: 1345 dists = [ self._func(other, o) for other in sample 1346 if other is not o] 1347 diff = max(dists) - min(dists) 1348 if diff > max_diff: 1349 max_diff, vp = diff, o 1350 objects.remove(vp) 1351 return vp 1352 else: 1353 return objects.pop()

1354

1355 - def _decompose(self, objects):

1356 """ 1357 1358 Perform the process called "ball decomposition" by Peter 1359 Yianilos. 1360 1361 `objects` has to be an iterable that yields objects applicable 1362 to the metric function in use. The return value is a tuple of 1363 two lists: one list that contains all elements having a distance 1364 smaller than the median distance to this node's value, the 1365 second list contains all objects whose distance is equal to or 1366 larger than the median distance of all given `objects` to this 1367 node's value. 1368 1369 """ 1370 dist_per_obj = list() 1371 for obj in objects: 1372 distance = self._get_dist(obj) 1373 if distance == 0: 1374 self._values.append(obj) 1375 else: 1376 dist_per_obj.append( (distance, obj) ) 1377 if dist_per_obj: 1378 self._median = VPTree.determine_median(zip(*dist_per_obj)[0]) 1379 left = [ obj for dist, obj in dist_per_obj 1380 if dist < self._median ] 1381 right = [ obj for dist, obj in dist_per_obj 1382 if dist >= self._median ] 1383 else: 1384 left, right = None, None 1385 return left, right

1386 1387 @staticmethod

1388 - def determine_median(numbers):

1389 """ 1390 1391 Determine the median from a sequence of numbers (or anything 1392 else that can be ``sorted()``). 1393 1394 This does not use an optimal ``O(n)`` algorithm but instead 1395 relies on CPython's speed when sorting (``O(n log(n))``). 1396 1397 """ 1398 return sorted(numbers)[ len(numbers) / 2 ]

1399

1400 - def _get_child_candidates(self, distance, min_dist, max_dist):

1401 if self._leftchild and distance - max_dist < self._median: 1402 yield self._leftchild 1403 if self._rightchild and distance + max_dist >= self._median: 1404 yield self._rightchild

1405

1406 - def __children(self):

1407 return [child for child in (self._leftchild, self._rightchild) 1408 if child]

1409 children = property(__children)

1410

1411 1412 #class BSTree(object): 1413 # 1414 # class Sector(object): 1415 # """ 1416 # 1417 # * _values contains objects with distance == 0 to each other 1418 # 1419 # * _radius is the largest distance of all objects to this node's 1420 # values. 1421 # 1422 # * _child is a reference to a BSTree. 1423 # 1424 # * _node is a reference to the BSTree this Sector belongs to. 1425 # 1426 # """ 1427 # 1428 # __slots__ = [ '_values', '_radius', '_child', '_node' ] 1429 # 1430 # def __init__(self, node, *values): 1431 # self._node = node 1432 # self._values = list(values) 1433 # 1434 # def insert(self, obj): 1435 # if not self._values: 1436 # self._values = [obj] 1437 # return self._node 1438 # else: 1439 # distance = MetricTree._get_dist(self, obj) 1440 # self._radius = max(self._radius, distance) 1441 # return self._child.insert(obj) 1442 # 1443 # def __contains__(self, obj): 1444 # return MetricTree.__contains__(self, obj) 1445 # 1446 # 1447 # __slots__ = [ 1448 # '_parent', '_size', '_func', '_height', '_num_nodes', 1449 # '_left_sector', '_right_sector' 1450 # ] 1451 # 1452 # def __init__(self, objects=None, func=None, parent=None): 1453 # self._left_sector = Sector() 1454 # self._right_sector = Sector() 1455 # super(BSTree, self).__init__(objects, func, parent) 1456 # 1457 # def construct(self, objects, func): 1458 # split = ([], []) 1459 # if not self._left_values or not self._right_values: 1460 # pass # TODO 1461 # for obj in objects: 1462 # to_left = this._func(self._left_values[0]) 1463 # to_right = this._func(self._right_values[0]) 1464 # 1465 # def decompose(self, objects): 1466 # return objects.pop() 1467 # 1468 # def pick_good_sector_values(self, objects): 1469 # return objects.pop(), objects.pop() 1470 # 1471 # def __nonzero__(self): 1472 # pass 1473 # 1474 # def __iter__(self): 1475 # pass 1476 # 1477 # def __repr__(self): 1478 # pass 1479 1480 -class UnindexableObjectError(Exception):

1481 """ 1482 1483 This Exception is thrown when the call to the distance function in 1484 use by a metric space throws an exception. This should not happen 1485 during regular usage and is a hint that either your distance 1486 function is buggy or you tried to index objects which your distance 1487 function cannot handle. 1488 1489 """ 1490

1491 - def __init__(self, msg, e):

1492 self.orig_exception = e 1493 Exception.__init__(self, msg)

1494 1495 orig_exception = None 1496 """ The exception that has triggered this exception. """ 1497 1498 pass

1499

1500 1501 -def levenshtein(x, y):

1502 """ 1503 1504 Compute Levenshtein (or "edit") distance between two strings. 1505 Levenshtein distance between two strings ``x`` and ``y`` is defined 1506 as the minimal number of operations on single characters that it 1507 takes to transform ``x`` into ``y`` (or vice versa). 1508 1509 Allowed operations are: 1510 1511 - Insertion, 1512 - Deletion and 1513 - Replacement 1514 1515 of a single character. 1516 1517 Levenshtein distance has all the properties of a strictly positive 1518 metric. That means for all x, y, z it holds: 1519 1520 - x == y <=> levenshtein(x, y) == 0 1521 - x != y <=> levenshtein(x, y) > 0 (positivity) 1522 - levenshtein(x, y) == levenshtein(y, x) (symmetry) 1523 - levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z) 1524 1525 The upper bound for Levenshtein distance is the length of the longer 1526 string: ``max(len(x), len(y))``. A general lower bound is 1527 ``abs(len(x) - len(y))``. This is the case where one string is the 1528 pre-/postfix of the other one. 1529 1530 Incidentally, this implementation not only works for strings, but 1531 for all types of sequences. Objects in the given sequences are 1532 compared for equality using '==' to determine whether an edit 1533 operation is needed. 1534 1535 """ 1536 if x == y: 1537 return 0 # equal strings have distance zero 1538 m, n = len(x), len(y) 1539 if not (m and n): # if one of the lengths is zero, 1540 return m or n # we can return the other one. 1541 if n > m: 1542 x, y = y, x # switch m and n so that n 1543 m, n = n, m # is the smaller one (saves space for cur and prev) 1544 cur = range(n+1) 1545 prev = [0] * (n+1) 1546 for i, char_x in enumerate(x): 1547 prev[0] = i+1 1548 for j, char_y in enumerate(y): 1549 if char_x == char_y: 1550 prev[j+1] = cur[j] 1551 else: 1552 prev[j+1] = min(prev[j], cur[j], cur[j+1]) + 1 1553 cur, prev = prev, cur 1554 return cur[-1]

1555

1556 -def tokenizer(textfile, separator=None):

1557 """ 1558 1559 Simple generator that tokenizes sequences of strings (or file-like 1560 objects) into uppercase words using the optional `separator`. This 1561 `separator` is passed directly to str.split so ``split``'s behaviour 1562 concerning `None` as separator applies here as well. 1563 1564 After the last token has been returned, there is an attempt to close 1565 `textfile`. If this yields an `AttributeError`, it will be silently 1566 ignored. 1567 1568 """ 1569 for line in textfile: 1570 for token in line.split(separator): 1571 yield token.upper() 1572 try: 1573 textfile.close() 1574 except AttributeError, e: 1575 pass

1576 1577 dist_ctr = 0 1578 1579 try: 1580 import psyco 1581 psyco.bind(levenshtein) 1582 psyco.bind(VPTree) 1583 psyco.bind(BKTree) 1584 psyco.bind(FastBKTree) 1585 except ImportError, e: 1586 sys.stderr.write("Psyco not available, proceeding without it.\n") 1587 1588 if __name__ == '__main__': 1589 import codecs, time 1590 dumpfilename = "vptree.dump" 1591 f = codecs.open(sys.argv[1], encoding="iso-8859-1") 1592 names = tokenizer(f) 1593 t = BKTree() 1594 t.construct(names, levenshtein) 1595 f.close() 1596 print type(t) 1597 s = raw_input("press enter") 1598 print "t height: %s" % t.height 1599 print "t size: %s" % len(t) 1600 print "t nodes %s" % t.num_nodes 1601 1602 #print "results: %s " % t.search("MARTIN", 1) 1603 1604 f = codecs.open(sys.argv[1], encoding="iso-8859-1") 1605 names = list(tokenizer(f)) 1606 random.shuffle(names) 1607 1608 start = time.time() 1609 for i in range(0, 10): 1610 t.search(names.pop(), 1) 1611 end = time.time() 1612 print end - start 1613 1614 # vim: set et ts=4 sw=4 tw=72 nu: 1615

Source Code for Module mspace