Module mspace
[hide private]
[frames] | no frames]

Source Code for Module mspace

   1  #!/usr/bin/python -tt 
   2  # -*- coding: utf-8 -*- 
   3   
   4  # mspace.py - An implementation of metric space indexes for efficient 
   5  # similarity search 
   6  # 
   7  # Copyright (C) 2007,2008 Jochen Schulz 
   8  # 
   9  # This program is free software; you can redistribute it and/or modify 
  10  # it under the terms of the GNU General Public License version 2 as 
  11  # published by the Free Software Foundation. 
  12  # 
  13  # This program is distributed in the hope that it will be useful, but 
  14  # WITHOUT ANY WARRANTY; without even the implied warranty of 
  15  # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU 
  16  # General Public License for more details. 
  17  # 
  18  # You should have received a copy of the GNU General Public License 
  19  # along with this program; if not, write to the Free Software 
  20  # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, 
  21  # USA. 
  22   
  23  """ 
  24   
  25  Metric Spaces 
  26  ============== 
  27   
  28  .. contents:: 
  29   
  30  Introduction 
  31  ------------- 
  32   
  33  This module provides efficient similarity search by implementing 
  34  different data structures for indexing metric spaces.  You can use it to 
  35  index an arbitrary number of objects which you later want to search for 
  36  objects "similar" to a given other object. 
  37   
  38  For this module to be useful, you need a function with metric properties 
  39  (see below for a definition) which you can apply to the type of objects 
  40  you want to search for. The only pre-defined metric funtion in this 
  41  module is `levenshtein`, which is a common metric to calculate the 
  42  (dis)similarity of two strings. 
  43   
  44  Beware that the indexing phase might take a long time, depending on the 
  45  number of objects you want to index and the runtime of your metric 
  46  distance function. Indexing only pays off when you are planning to do a 
  47  lot of searches. 
  48   
  49  Definitions 
  50  ............ 
  51   
  52  A metric space is a pair ``(S, f)`` where ``S`` is an abritrary 
  53  (possibly infinite) set of objects and ``f`` is a function which returns 
  54  a "distance" for any two objects ``x``, ``y`` in ``S``. This distance is 
  55  a number used to express the similarity between the two objects. The 
  56  larger the distance, the more they differ. A distance of zero means they 
  57  are considered equal. 
  58   
  59  The function ``f`` has to be a "metric", i.e. the following conditions 
  60  have to be met for all ``x``, ``y``, ``z`` in ``S``: 
  61   
  62  -  ``f(x, y) >= 0`` 
  63   
  64     (positivity) 
  65   
  66  -  ``f(x, y) == f(y, x)`` 
  67   
  68     (symmetry) 
  69   
  70  -  ``f(x, z) <= f(x, y) + f(y, z)`` 
  71   
  72     (triangle inequality) 
  73   
  74  This definition adheres to most people's intuitive understanding of the 
  75  distance between two points in a Euclidian space (of arbitrary 
  76  dimensionality).  [#euclid]_ Imagine you have three points in a 
  77  2-dimensional space like this:: 
  78   
  79      A-------B 
  80       \     / 
  81        \   / 
  82         \ / 
  83          C 
  84   
  85  It is evident that the distance between two of these points is 
  86  necessarily larger than (or equal to) zero, and that the direction you 
  87  take for measuring (from ``A`` to ``B`` or vice versa) has no influence 
  88  on the distance. The third condition, the triangle inequality is easy to 
  89  grasp, too: if you know the distances between ``A`` and ``B`` and 
  90  between ``B`` and ``C``, you can be sure that their sum is equal to or 
  91  larger than the distance between ``A`` and ``C``. Voila, there's your 
  92  metric space. 
  93   
  94  In many cases, metric distance functions take a long time to compute. 
  95  Because of this, data structures used to index metric spaces are 
  96  designed to minimize the number of distance computations necessary for 
  97  searching the whole space by using the properties of its metric function 
  98  (most notably the triangle inequality) to discard as many objects as 
  99  possible after every computation. 
 100   
 101  .. [#euclid] In fact, Euclidian spaces are one type of metric space. 
 102   
 103   
 104  Use cases & caveats 
 105  .................... 
 106   
 107  Metric distance functions can be defined for objects other than points 
 108  in Euclidian space, too.  One very well known distance function is the 
 109  Levenshtein distance, which is applicable to arbitrary strings. Since it 
 110  is quite popular and indexing strings is one of the major use cases for 
 111  metric spaces, an implementation of this function is included in this 
 112  module. Other metric distance functions for strings are the 
 113  Damerau-Levenshtein variant and Hamming distance. These can be used for 
 114  a variety of tasks, such as spell-checking, record linkage (or 
 115  "deduplication"), error detection etc. 
 116   
 117  Please note that indexing a large number of objects usually takes a 
 118  considerable amount of time *and* memory. Indexing only pays off if you 
 119  are going to do *a lot* of searches in the index. If you just want to 
 120  search a set of objects for all objects similar to one or a few other 
 121  objects, you are better off comparing them sequentially by hand with 
 122  your distance function. 
 123   
 124  Installation 
 125  ............. 
 126   
 127  .. _Psyco: http://psyco.sourceforge.net/ 
 128   
 129  This file is a stand-alone module for Python 2.4 and later which you 
 130  only have to save somewhere in your ``PYTHONPATH``. [#pythonpath]_ No 
 131  installation is necessary. The only tight dependencies are the modules 
 132  ``sys``, ``random`` and ``weakref`` which are most probably already 
 133  included in your Python installation. 
 134   
 135  If you are using this on a 32 Bit i386 machine running Windows or Linux, 
 136  you probably also want to install Psyco_ which speeds up execution times 
 137  quite noticeably (in exchange for only a slightly increased memory 
 138  usage). Psyco will be used automatically if it is available. 
 139   
 140  .. [#pythonpath] Run ``python -c "import sys; print sys.path"`` to learn 
 141    where Python currently looks for modules. 
 142   
 143  Basic usage 
 144  ------------ 
 145   
 146  Index creation 
 147  ............... 
 148   
 149  Say you have a dictionary file containing one or more words per line. 
 150  You can make an index of all the words in it by doing: 
 151   
 152  .. python:: 
 153   
 154      >>> import mspace 
 155      >>> mydict = file('dictionary.txt') 
 156      >>> words = mspace.tokenizer(mydict) 
 157      >>> index = mspace.VPTree(words, mspace.levenshtein) 
 158   
 159  You can delay the time-comsuming construction phase by creating the 
 160  index without any parameters and explicitly calling ``construct`` at a 
 161  later time: 
 162   
 163  .. python:: 
 164   
 165      >>> index = mspace.VPTree() 
 166      >>> index.construct(words, mspace.levenshtein) 
 167   
 168  Please note that the index' content is dismissed on every call to 
 169  ``construct``. The index is built from scratch with only the objects and 
 170  distance function passed as parameters. 
 171   
 172  Performing range-searches 
 173  .......................... 
 174   
 175  After you have built the index (and got yourself some coffee if the 
 176  index is large) you may search it for all objects having distance to a 
 177  given other object between arbitrary bounds: 
 178   
 179  .. python:: 
 180   
 181      >>> index.range_search('WOOD', min_dist=0, max_dist=1) 
 182      ['GOOD', 'WOOD', 'WOOL', 'MOOD'] 
 183   
 184  In this case, you received four results: the object you searched for 
 185  (apparently it has been in your dictionary, too) and two other objects 
 186  "GOOD" and "WOOL", both of which have the requested maximum distance of 
 187  one to your query object. Result lists are unordered and do not contain 
 188  information about their contents' distance to the query object. If you 
 189  need this, you have to calculate it by hand. 
 190   
 191  Both ``min_dist`` and ``max_dist`` are optional parameters defaulting to 
 192  zero. Thus, if you leave them out, a search for objects which are 
 193  exactly equal (as defined by your distance function) to the query object 
 194  is performed. For historical reasons, and since range-searching with a 
 195  minimum distance of zero and a maximum greater than zero is quite 
 196  common, there is a shortcut to search with ``min_dist=0``: 
 197   
 198  .. python:: 
 199   
 200      >>> index.search('WOOD', 1) 
 201      ['GOOD', 'WOOD', 'WOOL', 'MOOD'] 
 202   
 203   
 204  Performing nearest-neighbour-searches 
 205  ...................................... 
 206   
 207  The second type of search you can perform is "k-nearest-neighbour" 
 208  search. It returns a given number of objects from the index that are 
 209  closest to the query object. Search results are guaranteed to never 
 210  contain an object with a distance to the query object that is larger 
 211  than that of any other object in the tree. 
 212   
 213  Result lists are composed of ``(object, distance)`` tuples, sorted 
 214  ascendingly by the distance. If you don't specify a maximum number of 
 215  matches to return, it defaults to one: 
 216   
 217  .. python:: 
 218   
 219      >>> index.nn_search('WOOD') 
 220      [('WOOD', 0)] 
 221      >>> index.nn_search('WOOD', 5) 
 222      [('WOOD', 0), ('WOOL', 1), ('GOOD', 1), ('MOOD', 1), ('FOOT', 2)] 
 223   
 224  Note that the index may contain other objects with a distance of two to 
 225  the query object ``'WOOD'`` (e.g. ``'FOOL'``). They have just been cut 
 226  off because a maximum of five objects has been requested. You have no 
 227  influence on the choice of representatives that is made. 
 228   
 229  Note that you must not expect this method to always return a list of 
 230  length ``num`` because you might ask for more objects than are currently 
 231  in the index. 
 232   
 233   
 234  Advanced usage 
 235  --------------- 
 236   
 237  Indexing complex objects 
 238  ......................... 
 239   
 240  If you have "complex" objects which you want to index by only one 
 241  specific attribute, you can write a simple wrapper around `levenshtein` 
 242  (or some other function applicable to the attribute) which extracts this 
 243  attribute from your objects and returns the distance between their 
 244  attributes' value.  This way you can search for and retrieve complex 
 245  objects from the index while comparing only a single attribute 
 246   
 247  .. python:: 
 248   
 249      >>> import mspace 
 250      >>> 
 251      >>> class Record(object): 
 252      ...     def __init__(self, firstname, surname): 
 253      ...         self._firstname, self._surname = firstname, surname 
 254      ... 
 255      >>> def firstnameLev(r1, r2): 
 256      ...     return mspace.levenshtein(r1._firstname, r2._firstname) 
 257      ... 
 258      >>> index = mspace.VPTree(listOfRecords, firstnameLev) 
 259      >>> rec = Record("Paul", "Brady") 
 260      >>> index.search(rec, 2) 
 261      [<Record: 'Paula Bean'>, <Record: 'Raoul Perky'>, <Record: 'Paul Simon'>] 
 262   
 263  Of course you can also use more advanced techniques like writing a 
 264  function factory which returns a function that extracts arbitrary 
 265  attributes from your objects and applies a metric function to it: 
 266   
 267  .. python:: 
 268   
 269      >>> def metric_using_attr(metric, attr): 
 270      ...     def f(one, other, attr=attr): 
 271      ...         attr1, attr2 = getattr(one, attr), getattr(other, attr) 
 272      ...         return metric(attr1, attr2) 
 273      ...     return f 
 274      ... 
 275      >>> metric = metric_using_attr(mspace.levenshtein, "_firstname") 
 276      >>> metric( Record("Paul", "Brady"), Record("Paul", "Simon") ) 
 277      0 
 278   
 279  (Note that the inner function ``f`` in this example deviates from the 
 280  standard protocol by accepting an optional third parameter.  This is 
 281  done here only to pull the name ``attr`` into the inner function's 
 282  namespace to save some time when looking up its associated value.  No 
 283  index structure in this module will ever call its distance function with 
 284  more than two parameters anyway, so that whatever you passed as ``attr`` 
 285  to ``metric_using_attr`` when creating your function will be used when 
 286  this function gets called by an index. Got it?) 
 287   
 288  Reverse indexes 
 289  ................ 
 290   
 291  Of course you can use a similar technique to avoid full indexes and 
 292  instead create an index which only contains references to your data (in 
 293  a database, text file, somewhere on the net etc.). Your distance 
 294  function would then use the supplied values to fetch the objects to 
 295  compare from your data source.  But, I cannot stress this enough, your 
 296  distance function's performance is absolutely crucial to efficient 
 297  indexing and searching. Therefore you should make sure that your way of 
 298  accessing the data is really fast. 
 299   
 300  Choice of data structure 
 301  ------------------------- 
 302   
 303  This module currently gives you two data structures to choose from. 
 304  While the underlying algorithms are different, their search results 
 305  (given the same set of objects to index, distance function and maximum 
 306  distance) are exactly the same. If you find a scenario where this is not 
 307  the case, please let me know because this would be a serious bug. 
 308   
 309  Capabilities 
 310  ............. 
 311   
 312  The algorithms implemented in this module have different capabilities 
 313  which are displayed in the table below: 
 314   
 315  +--------------------+-------+-------+ 
 316  |                    |VPTree |BKTree | 
 317  +====================+=======+=======+ 
 318  |non-discrete metric |     X |       | 
 319  +--------------------+-------+-------+ 
 320  |insertion           |       |     X | 
 321  +--------------------+-------+-------+ 
 322  |deletion            |       |       | 
 323  +--------------------+-------+-------+ 
 324   
 325  The first row is about the value returned by the distance function the 
 326  index uses. BKTrees are not prepared to handle non-discrete distances 
 327  (for example floats) and will most probably perform really bad when they 
 328  occur. 
 329   
 330  The other two rows describe what you can do with the indexes *after* 
 331  they have been constructed. Please note that neither of them may shrink 
 332  over time. Only BKTrees are able to handle insertion efficiently. 
 333   
 334  Performance 
 335  ............ 
 336   
 337  [In short: you are probably better off using BKTrees unless you need a 
 338  non-discrete metric. When in doubt, test both options.] 
 339   
 340  Obviously, the whole point of this module is to speed up the process of 
 341  finding objects similar to one another. And, as you can imagine, the 
 342  different algorithms perform differently when exposed to a specific 
 343  problem. 
 344   
 345  Here's an example: I took the file ``american-english-large`` from the 
 346  Debian package ``wamerican-large`` and did some time measurements of 
 347  indexing and searching a random subset of it. The machine I used is an 
 348  1.1GHz Pentium3 running Python 2.4 from Debian stable **with Psyco 
 349  enabled**. Please note that the following numbers have been determined 
 350  completely unscientifical. I only show them to give you an idea of what 
 351  to expect. For serious benchmarking, I should have used the ``timeit`` 
 352  module. Nevertheless, this is it: 
 353   
 354   
 355  +------+------------------------+------------------------+ 
 356  |      |BKTree                  |VPTree                  | 
 357  +======+======+=========+=======+=======+========+=======+ 
 358  |size  |time  |per node |height |time   |per node|height | 
 359  +------+------+---------+-------+-------+--------+-------+ 
 360  | 5000 |  2.92|0.000584 |11     |   7.40|0.001480|     21| 
 361  +------+------+---------+-------+-------+--------+-------+ 
 362  |10000 |  6.32|0.000632 |14     |  16.02|0.001602|     22| 
 363  +------+------+---------+-------+-------+--------+-------+ 
 364  |15000 |  9.95|0.000663 |14     |  28.35|0.001890|     24| 
 365  +------+------+---------+-------+-------+--------+-------+ 
 366  |20000 | 13.70|0.000685 |14     |  41.40|0.002070|     24| 
 367  +------+------+---------+-------+-------+--------+-------+ 
 368  |25000 | 17.46|0.000699 |15     |  50.63|0.002025|     25| 
 369  +------+------+---------+-------+-------+--------+-------+ 
 370  |30000 | 21.81|0.000727 |15     |  55.47|0.001849|     25| 
 371  +------+------+---------+-------+-------+--------+-------+ 
 372  |35000 | 25.77|0.000736 |16     |  64.43|0.001841|     26| 
 373  +------+------+---------+-------+-------+--------+-------+ 
 374  |40000 | 29.40|0.000735 |16     |  75.45|0.001886|     26| 
 375  +------+------+---------+-------+-------+--------+-------+ 
 376  |45000 | 41.28|0.000917 |16     |  96.36|0.002141|     26| 
 377  +------+------+---------+-------+-------+--------+-------+ 
 378  |50000 | 37.62|0.000752 |16     |  95.31|0.001906|     28| 
 379  +------+------+---------+-------+-------+--------+-------+ 
 380   
 381   
 382  This table shows construction times (total and per node in seconds) for 
 383  data sets of a given size. Additionally, you can see the height 
 384  [#height]_ of the trees.  Apparently, BKTrees can be constructed a lot 
 385  faster than VPTrees. Both of them need an increasing time per node with 
 386  growing data sets. This is expected since construction complexity is 
 387  ``O(n log(n))`` in both cases. 
 388   
 389  +---------------+-----------------+-----------------+ 
 390  |               |BK, size: 50,000 |VP, size: 50,000 | 
 391  +------+--------+-------+---------+-------+---------+ 
 392  |k     |results |  time |  # dist |  time |  # dist | 
 393  +------+--------+-------+---------+-------+---------+ 
 394  |0     |   0.89 |0.0008 |    8.14 | 0.0019|    17.28| 
 395  +------+--------+-------+---------+-------+---------+ 
 396  |1     |   4.07 |0.1670 | 1583.77 | 0.2801|  1933.64| 
 397  +------+--------+-------+---------+-------+---------+ 
 398  |2     |  38.10 |1.1687 |10353.31 | 1.4413| 13845.67| 
 399  +------+--------+-------+---------+-------+---------+ 
 400  |3     | 304.57 |2.5614 |22202.28 | 3.0497| 27514.51| 
 401  +------+--------+-------+---------+-------+---------+ 
 402  |4     |1584.86 |3.8727 |32376.54 | 4.1518| 36877.62| 
 403  +------+--------+-------+---------+-------+---------+ 
 404  |5     |5317.03 |4.4182 |39616.04 | 4.9935| 42720.38| 
 405  +------+--------+-------+---------+-------+---------+ 
 406   
 407  This table shows the runtime of searches done in the trees (in seconds), 
 408  the number of distance calculations done and the number of results for 
 409  growing error tolerance. All values are given as average over 100 random 
 410  searches (from the same file, but not necessarily from the set of 
 411  indexed objects).  As you can see, search runtime (tightly connected to 
 412  the number of distance calculations) literally explodes for larger 
 413  values of ``k``.  This growth only fades when an increased part of the 
 414  complete search space is examined (the number of distance calculations 
 415  equals the number of nodes compared with the query object). 
 416   
 417  As you can see, too, long search runtimes for large values of ``k`` 
 418  don't actually hurt usability very much since you only get a usable 
 419  number of results for small ``k`` anyway. This is of course due to the 
 420  nature of the data set and the distance function used in this example. 
 421  Your application may vary greatly. 
 422   
 423  You also have to keep in mind that the dictionary I used contains almost 
 424  no duplicates. If you used the words of a real document as your data 
 425  set, your tree would have significantly less nodes than the number of 
 426  words in your document since you typically repeat a lot of words very 
 427  often. A quick test with my diploma thesis revealed only 4400 distinct 
 428  words in a document with 14,000 words (including LaTeX commands). This 
 429  makes searching much faster because equal objects (as defined by the 
 430  distance function) are only evaluated once. 
 431   
 432  .. [#height] If you look closely, you can see that the VPTree's height 
 433     doesn't grow continuously when the data set grows. The reason for 
 434     that phenomenon is that this implementation does not try very hard to 
 435     pick a good vantage point which is the key factor to get 
 436     well-balanced trees. So, in some cases, a vantage point may be chosen 
 437     that may result in trees which are higher than strictly necessary. 
 438   
 439  Optimization potential (or: Why the f*ck is this so slow!?) 
 440  ------------------------------------------------------------ 
 441   
 442  Ok, you have tried to index your small dictionary file and start to 
 443  wonder why this easy task takes several minutes. Here are a couple of 
 444  (possible) reasons: 
 445   
 446  - Indexing takes ``O(n log(n))`` for all data structures currently 
 447    implemented. So yes, doubling the number of indexed objects will 
 448    necessarily more than double your runtime for indexing, sorry. 
 449   
 450  - Python is slow. I have written an implementation very similar to this 
 451    one in Java which is significantly faster (by a factor of about 15 to 
 452    25, even when using Psyco_!).  But the java implementation eats more 
 453    memory and unfortunately I cannot release it under a free license. 
 454   
 455  - Recursion is slow in Python. Recursion is the most natural way to 
 456    create and search trees and this code uses it a lot. Most of the 
 457    recursion in this module is "tail recursion", but the Python 
 458    interpreter is not able to optimize it into loops. 
 459   
 460    [Note: as of revision 60, searching in both trees and inserting into 
 461    BKTrees has been rewritten using loops instead of recursion. 
 462    Perfomance gain is quite small, though.] 
 463   
 464  - The distance distribution in your metric space is too dense. This 
 465    leads to the data structures being unable to discard large parts of 
 466    the indexed space while searching. In pathological cases, you may end 
 467    up with data structures resembling linked lists. 
 468   
 469  - Your distance function is non-discrete, i.e. it returns floats. How 
 470    well this is supported depends on the index structure in use. 
 471   
 472  - Your metric function is very expensive. Remember that this function 
 473    has to be called *very often* during indexing. You may use this 
 474    module's attribute ``dist_ctr`` to get an idea how often this is done. 
 475    It is incremented by one on every call to your distance function and 
 476    is otherwise unused. 
 477   
 478  - You are searching for non-similar objects. If you, for example,  have 
 479    an index of strings with an average length of six characters and you 
 480    are continuously searching for strings with a maximum distance of 
 481    three or more, do not expect the search to be significantly faster 
 482    than linear testing of the complete search space. It may even be 
 483    slower. 
 484   
 485  Possible solutions include: 
 486   
 487  - Rewrite in C. Since I cannot be bothered to learn C, someone else 
 488    would have to do this. 
 489   
 490  - Use Pyrex or Psyco_. Quick tests with Psyco suggest that indexing 
 491    becomes about 3-4 times faster. This is as easy as doing an ``import 
 492    psyco; psyco.full()`` in your program. Read Psyco's manual for tuning 
 493    options. 
 494   
 495    [Note: as of about revision 46, on module loading an attempt is made 
 496    to import and enable Psyco for levenshtein and the tree classes. If it 
 497    doesn't succeed, you'll get a performance warning on ``stderr`` but 
 498    everything will continue to work flawlessly.] 
 499   
 500  - Pickle trees for static data sets. Pickling and unpickling indexes 
 501    using Python's standard ``pickle`` module is quite fast. But beware 
 502    that a pickled index takes a lot more space than your plain data. 
 503   
 504  - Rewrite tree construction and searching using loops. I am not sure 
 505    whether this will be faster, but it will probably take less memory 
 506    than the current approach and fix the problem with Python's recursion 
 507    limit. I fear that code readibility will suffer, though. The current 
 508    implementations are quite close to the original descriptions of the 
 509    algorithms. 
 510   
 511    [Note: partially done, see above.] 
 512   
 513  - Optimize the distance function in use. Since I do not know your 
 514    distance function, I cannot help you with this. 
 515   
 516    As for `levenshtein`: the algorithm implemented in this module is the 
 517    classical dynamic programming variant with ``O(|n|*|m|)`` time 
 518    complexity (and ``O(min(|n|, |m|))`` for space).  Other algorithms for 
 519    Levenshtein try not to compute the whole matrix but only the values 
 520    around the diagonal of the matrix that are strictly necessary. By 
 521    doing this, the average ``O(|n|*|m|)`` of the classical dynamic 
 522    programming algorithm could be improved to something like 
 523    ``O(k*|n|)``. While it isn't obvious how much wall clock time this 
 524    approarch saves in reality, it may be in fact a whole lot faster 
 525    because ``k`` is usually only a fraction of the length of the input 
 526    strings. 
 527   
 528    And if you take a close look at the algorithms, you may realize that 
 529    searching may be sped up by using an algorithm that doesn't actually 
 530    compute the distance between two strings, but one that only answers 
 531    whether two strings have a distance less than or equal to a specified 
 532    maximum distance. This can be done in ``O(k^2)`` and should boost 
 533    search performance quite noticeably. 
 534   
 535    .. _Citeseer: http://citeseer.ist.psu.edu/navarro99guided.html 
 536   
 537   
 538    If you are interested in any of these optimizations, I suggest reading 
 539    Gonzalo Navarro's excellent survey "Guided Tour To Approximate String 
 540    Matching" which you can get from Citeseer_. 
 541   
 542    Currently, this module always uses the same function for searching and 
 543    indexing. If you want to test your own implementations, you have to 
 544    replace your index object's ``_func`` attribute after indexing and 
 545    before searching. 
 546   
 547   
 548  Contact 
 549  -------- 
 550   
 551  For comments, patches, flames and hugs send mail to: 
 552  jrschulz@well-adjusted.de. 
 553   
 554  """ 
 555   
 556  __docformat__ = 'restructuredtext en' 
 557  __author__    = '$Author: jrschulz $' 
 558  __version__   = '$Revision: 117 $' 
 559  __date__      = '$Date: 2008-07-23 23:01:46 +0200 (Wed, 23 Jul 2008) $' 
 560  __license__   = "GPL" 
 561   
 562  import sys, random, weakref 
563 564 -class MetricTree(object):
565 """ 566 567 Base class for metric space indexes which have tree-like properties. 568 It's implementation is not complete enough to make it actually 569 usable, but it should make it quite easy to implement other indexing 570 and searching strategies. In an ideal case, all you have to do is to 571 derive from this class, implement the methods `construct`, 572 `_get_child_candidates` and optionally `insert` and make `children` 573 an attribute or property returning all child nodes of the current 574 node. If you need an ``__init__`` method, you should call 575 ``super(YourClass, self).__init__(objects, func, parent)`` in its 576 end as well. 577 578 Instances of this class (and its subclasses, of course) are supposed 579 to be recursive data structures. This means that they may be treated 580 like the root of a tree whether they have a parent or not. This 581 means every node has: 582 583 - a list of `_values` with every element of that list having a 584 distance of zero to all other elements in the list. Because of 585 triangle inequality, you only need to check this condition against 586 one element of this list. 587 588 - a `_parent` (`None` for root nodes) 589 590 - a list of `children` (empty list for leaf nodes) 591 592 - a `_height` (one for trees consisting of only one node) 593 594 - a number of nodes `_num_nodes` (the total number of nodes in this 595 subtree) 596 597 - a `_size` (the total number of objects stored in this subtree) 598 599 Note that in many (most? some?) cases, implementors don't have to 600 fiddle with these values directly. Instead, there are some helper 601 methods like `_incr_node_counter` and `_incr_size` that may be 602 called at appropriate times. If in doubt, just implement what you 603 strictly need first and then take a look whether node, size and 604 height counting works as expected (and whether you care). 605 606 As a bonus, this class implements some of python's magic methods so 607 that you can iterate over its instances' content and use the ``in`` 608 operator to test for the existence of an object using the distance 609 function. 610 611 """ 612 613 _parent = None 614 """ 615 616 The parent of this node. ``None`` for root nodes. 617 618 """ 619 620 _values = list() 621 """ 622 623 A list of values in this subtree. All objects in this list are 624 considered equal by the distance function in use. 625 626 """ 627 628 _size = 0 629 """ 630 631 The number of objects in this subtree. 632 633 """ 634 635 _func = 0 636 """ 637 638 The distance function used for indexing and searching. It has to 639 accept any two objects from the list of objects you are indexing as 640 arguments and return a non-negative integer (or long). 641 642 """ 643 644 _height = 0 645 """ 646 """ 647 648 _num_nodes = 0 649 """ 650 """ 651
652 - def __init__(self, objects=None, func=None, parent=None):
653 """ 654 655 Create a new metric tree. If ``objects`` and ``func`` are given, 656 the given objects will be indexed immediately using the distance 657 function which makes it possible to immediately start so search 658 for other objects in it. Otherwise, you have to call `construct` 659 later in order to make use of this metric tree. 660 661 """ 662 self._values = list() 663 self._size = 0 664 self._height = 1 665 self._func = func 666 self.parent = parent 667 self._num_nodes = 0 668 self._incr_node_counter() 669 self._calculate_height() 670 if objects and func: 671 self.construct(objects, func)
672
673 - def range_search(self, obj, min_dist=0, max_dist=0):
674 """ 675 676 Return a list of all objects in this subtree whose distance to 677 `obj` is at least `min_dist` and at most `max_dist`. `min_dist` 678 and `max_dist` have to satisfy the condition ``0 <= min_dist <= 679 max_dist``. 680 681 If this metric tree is empty, an empty list is returned, 682 regardless of whether this tree has been previously constructed 683 or not. 684 685 If calling the distance function fails for any reason, 686 `UnindexableObjectError` will be raised. 687 688 """ 689 assert( 0 <= min_dist <= max_dist ) 690 if not self: return list() 691 result, candidates = list(), [self] 692 while candidates: 693 node = candidates.pop() 694 distance = node._get_dist(obj) 695 if min_dist <= distance <= max_dist: 696 result.extend(node._values) 697 candidates.extend( node._get_child_candidates( 698 distance, min_dist, max_dist)) 699 return result
700
701 - def search(self, obj, max_dist):
702 """ 703 704 Equivalent to range_search(obj, min_dist=0, max_dist). 705 706 """ 707 return self.range_search(obj, max_dist=max_dist)
708
709 - def nn_search(self, obj, num=1):
710 """ 711 712 Perform a k-nearest-neighbour search and return a sorted list of 713 at most ``num`` objects together with their distance to the 714 query object (``(object, distance)`` tuples). Sorting is done by 715 the distance (ascending). Of course, `num` has to be an ``int`` 716 (or ``long``, for that matter) larger than zero. 717 718 For obvious reasons, this method cannot return more objects than 719 are currently present in the tree. That means, if ``num > 720 len(self)``, only ``len(self)`` pairs of ```(object, distance)`` 721 will be returned. 722 723 If this metric tree is empty, an empty list is returned, 724 regardless of whether this tree has been previously constructed 725 or not. 726 727 If several objects in this tree share the same maximum distance 728 to `obj`, no guarantee is given about which of these objects 729 will be returned and which are left out. 730 731 If calling the distance function fails for any reason, 732 `UnindexableObjectError` will be raised. 733 734 """ 735 if num < 1: return list() 736 if not self: return list() 737 neighbours = list() # list of (value, distance) tuples 738 candidates = [self] 739 def cmp_neighbours(a, b): return cmp(a[1], b[1]) 740 while candidates: 741 cand = candidates.pop() 742 distance = cand._get_dist(obj) 743 candidate_is_neighbour = bool([1 for n in neighbours 744 if distance < n[1]]) 745 if candidate_is_neighbour or len(neighbours) < num: 746 neighbours.extend([(v, distance) for v in cand._values]) 747 neighbours.sort(cmp_neighbours) 748 if len(neighbours) > num: 749 neighbours = neighbours[:num] 750 elif len(neighbours) == num: 751 max_dist = neighbours[-1][1] 752 if max_dist == 0: 753 break 754 candidates.extend( 755 node._get_child_candidates( distance, 0, max_dist)) 756 else: 757 candidates.extend(node.children) 758 return neighbours
759
760 - def _get_child_candidates(self, distance, min_dist, max_dist):
761 """ 762 763 Return a sequence of child nodes that may contain objects with a 764 distance difference between (inclusive) ``min`` and ``max`` to a 765 certain query object. Note that the query object is *not* 766 passed to this method. Instead, ``distance`` is the query 767 object's previously calculated distance to this node. 768 769 """ 770 raise NotImplementedError()
771
772 - def construct(self, objects, func):
773 """ 774 775 (Re)Index this space with the given ``objects`` using the 776 distance function ``func``. Previous contents will be 777 discarded. ``objects`` has to be a sequence or an iterable. The 778 distance function ``func`` needs to be applicable to all objects 779 contained in ``objects``. 780 781 If calling the distance function fails for any reason, 782 `UnindexableObjectError` will be raised. 783 784 """ 785 raise NotImplementedError()
786
787 - def insert(self, obj):
788 """ 789 790 Insert a single object into the metric tree. Returns ``self``, 791 i.e. the tree itself. 792 793 This method may not be implemented by all subclasses of 794 `MetricTree` since not all data structures allow to do this 795 efficiently. `NotImplementedError` will be raised when this is 796 the case. 797 798 If calling the distance function fails for any reason, 799 `UnindexableObjectError` will be raised. 800 801 """ 802 raise NotImplementedError()
803
804 - def is_root(self):
805 """ 806 807 Answer whether this node is the root of a tree (i.e. it has no 808 parent). 809 810 """ 811 return self.parent is None
812
813 - def is_leaf(self):
814 """ 815 816 Answer whether this node is a leaf node (i.e. it has no 817 children) 818 819 """ 820 return not self.children
821
822 - def __children(self):
823 """ 824 825 A sequence of this node's children. 826 827 The possible number of children per node is 828 implementation-dependent. Leaf nodes return an empty sequence. 829 830 """ 831 raise NotImplementedError()
832 children = property(__children) 833
834 - def __num_nodes(self):
835 """ 836 837 The number of nodes in this tree. 838 839 This may be different from the number of objects contained in 840 this tree in cases where two or more of these objects are 841 considered equal by the distance function in use (i.e., for two 842 objects ``p`` and ``q``, calling ``self._func(p, q)`` returns 843 ``0`` and when the tree is empty, i.e. there is one node (the 844 root node) but it doesn't contain any values. 845 846 """ 847 return self._num_nodes
848 num_nodes = property(__num_nodes) 849 850
851 - def __height(self):
852 """ 853 854 The height of this tree. 855 856 Empty trees have a height of ``0``, trees containing one or more 857 objects have a height ``>= 1``. 858 859 """ 860 return self._height
861 height = property(__height) 862
863 - def __parent(self):
864 """ 865 866 The parent of this node. `None` if this node is the root 867 of a tree. 868 869 """ 870 return self._parent
871
872 - def __set_parent(self, parent):
873 """ 874 875 Set the parent of this node. 876 877 Parent references are stored as "weak references" to avoid 878 circular references which the garbage collector cannot dissolve 879 by itself. 880 881 """ 882 if parent is None: 883 self._parent = None 884 else: 885 # the callback ensures that the weakref is deleted 886 # as soon as the parent node disappears so that this 887 # node recognizes it is the new root. 888 callback = lambda proxy: self.__set_parent(None) 889 self._parent = weakref.proxy(parent, callback)
890 891 parent = property(__parent, __set_parent) 892
893 - def _incr_size(self, incr=1):
894 """ 895 896 Increment the size counter for this node and all its parents 897 recursively. 898 899 Should be called whenever an object is inserted into the tree. 900 901 """ 902 def f(node, incr=incr): node._size += incr 903 self._apply_upwards(f)
904
905 - def _calculate_height(self, recursive=True):
906 """ 907 908 Set this node's height to one and (if `recursive` is `True`) 909 propagate this change upwards in the tree. 910 911 """ 912 self._height = height = 1 913 if recursive: 914 node = self.parent 915 while node is not None: 916 height += 1 917 if node._height < height: 918 node._height = height 919 node = node.parent 920 else: 921 node = None
922
923 - def _incr_node_counter(self, incr=1):
924 """ 925 926 Increment the node counter for this node and all its parents 927 recursively. 928 929 Should be called whenever a new child of this node is created. 930 931 """ 932 def f(node, incr=incr): node._num_nodes += incr 933 self._apply_upwards(f)
934
935 - def _apply_upwards(self, func, **args):
936 """ 937 938 Helper method to apply a function to this node and all its 939 parents recursively. The given function must accept one node as 940 the first parameter and may accept arbitrary keyword parameters 941 as well. 942 943 """ 944 node = self 945 func(node, **args) 946 while not node.is_root(): 947 node = node.parent 948 func(node, **args)
949
950 - def _get_dist(self, obj):
951 """ 952 953 Apply this node's distance function to the given object and one 954 of this node's values. 955 956 Raises `UnindexableObjectError` when distance computation fails. 957 958 """ 959 global dist_ctr 960 try: 961 distance = self._func(self._values[0], obj) 962 dist_ctr += 1 963 except IndexError, e: 964 sys.stderr.write("Node is empty, cannot calculate distance!\n") 965 raise e 966 except Exception, e: 967 raise UnindexableObjectError(e, "Cannot calculate distance" 968 + " between objects %s and %s using %s" \ 969 % (self._values[0], obj, self._func)) 970 return distance
971
972 - def __iter__(self):
973 """ 974 975 A generator that yields all objects in this node and its 976 children by doing recursive pre-order traversal. Implementors 977 might choose to use another traversal method which better suits 978 their data structure. 979 980 Note that objects are returned in no specific order. 981 982 This implementation will always return objects in the same order 983 as long as the tree's content does not change and the 984 implementation of `children` always returns the children in the 985 same order. If `children` is not implemented at all, 986 `NotImplementedError` will be raised. 987 988 """ 989 for obj in self._values: 990 yield obj 991 for child in self.children: 992 for obj in child: 993 yield obj
994 995 itervalues = __iter__ 996
997 - def iternodes(self):
998 """ 999 1000 A generator that yields all nodes in this subtree by doing 1001 recursive pre-order traversal. Implementors might choose to use 1002 another traversal method which better suits their data 1003 structure. 1004 1005 Note that objects are returned unordered. 1006 1007 This implementation will always return nodes in the same order 1008 as long as the tree's content does not change and the 1009 implementation of `children` always returns the children in the 1010 same order. If `children` is not implemented at all, 1011 `NotImplementedError` will be raised. 1012 1013 """ 1014 yield self 1015 for child in self.children: 1016 for node in child.iternodes(): 1017 yield node
1018
1019 - def __nonzero__(self):
1020 """ 1021 1022 Return True if this node contains any objects. 1023 1024 """ 1025 return len(self._values) > 0
1026
1027 - def __len__(self):
1028 """ 1029 1030 Return the number of objects in this subtree 1031 1032 """ 1033 return self._size
1034
1035 - def __contains__(self, item):
1036 """ 1037 1038 Search for objects with a distance of zero to `item` and return 1039 True if something is found, otherwise False. 1040 1041 Note that this does **not** check for object identity! Instead, 1042 the definition of equality is delegated to the distance function 1043 in use. 1044 1045 """ 1046 return len(self.range_search(item)) > 0
1047
1048 - def __repr__(self):
1049 if self: 1050 return str(self.__class__) + ": " + str(self._values[0]) 1051 else: 1052 return "<empty node>"
1053
1054 1055 -class BKTree(MetricTree):
1056 """ 1057 1058 "Burkhard-Keller Trees" are unbalanced multiway trees and may grow 1059 over time. They belong to the first general solutions to index 1060 arbitrary metric spaces. 1061 1062 Every node in a BKTree stores a list of objects which all have a 1063 distance of zero to each other, i.e. all objects are considered to 1064 be equal by the distance function in use. Additionally, every node 1065 keeps a dictionary. In this dictionary, every child is stored, 1066 referenceable by its distance to its parent. 1067 1068 Essentially, every node in a BKTree divides its data set ``S`` into 1069 subsets ``S^i`` so that every element in ``S^i`` has the distance 1070 ``i`` to one object arbitrarily picked from ``S`` (and stored in 1071 this node). For each ``S^i``, a new child node is created and its 1072 parent keeps a reference to this child together with its distance. 1073 1074 Insertion of a single object ``o`` in a node ``n`` is quite easy: 1075 1076 1. If ``n`` is empty, store ``o`` in ``n``. That's it. 1077 1078 2. Otherwise, calculate its distance to ``o``. 1079 1080 3. If the distance is zero, append this object to the list of 1081 objects in this node and return. 1082 1083 4. Otherwise, look for a child of ``n`` with the calculated distance 1084 to ``n``. If there is such a child, go back to 1. with this child 1085 being the new ``n``. 1086 1087 5. Otherwise, create a new node containing ``o`` and store it as a 1088 child of ``n`` with it's calculated distance. 1089 1090 Searching is done recursively by first calculating the distance of 1091 the query object to the current node and then using the triangle 1092 inequality to determine which children may contain other search 1093 results. 1094 1095 Runtime complexity for the construction phase is ``O(n log(n))``, 1096 searching is ``O(n^a)`` where ``0 <= a <= 1``. Space requirement is 1097 ``O(n)``. 1098 1099 This implementation is close to the original description of the 1100 algorithm and can only handle discrete distances. If your distance 1101 function returns floating point values, it will appear to work at 1102 indexing time but will most probably be horribly slow when 1103 searching. 1104 1105 """ 1106 1107 __slots__ = [ '_parent', '_values', '_size', '_func', '_height', 1108 '_num_nodes', '_children' ] 1109
1110 - def __init__(self, objects=None, func=None, parent=None):
1111 self._children = {} 1112 if callable(objects): 1113 random.shuffle(list(objects())) 1114 super(BKTree, self).__init__(objects, func, parent)
1115
1116 - def construct(self, objects, func):
1117 self._func = func 1118 self._children = {} 1119 for o in objects: 1120 self.insert(o) 1121 return self
1122
1123 - def insert(self, obj):
1124 if not self: 1125 self._values.append(obj) 1126 self._incr_size() 1127 return self 1128 node = self 1129 while True: 1130 distance = node._get_dist(obj) 1131 if distance == 0: 1132 node._values.append(obj) 1133 node._incr_size() 1134 break 1135 child = node._children.get(distance, None) 1136 if child is None: 1137 child = BKTree([obj], node._func, node) 1138 node._children[distance] = child 1139 break 1140 node = child 1141 return self
1142
1143 - def _get_child_candidates(self, distance, min_dist, max_dist):
1144 assert( min_dist <= max_dist ) 1145 return (child for dist, child in self._children.iteritems() 1146 if distance - max_dist <= dist <= distance + max_dist)
1147
1148 - def __children(self):
1149 children = self._children 1150 sorted_dists = sorted(children.keys()) 1151 return [children[dist] for dist in sorted_dists]
1152 children = property(__children)
1153
1154 -class FastBKTree(object):
1155 """ 1156 1157 Simpler version of BKTree. The tree's content is kept in nested 1158 dictionaries where a "node"'s value is simply stored at key -1 and 1159 its children are stored using their distance to the value as the 1160 key. Example: 1161 1162 .. python:: 1163 1164 self.root = { -1: 'ABC', 1165 0: { -1: 'ABC' } 1166 1: { -1: 'ABCD', 1167 1: { -1: 'ABCE' } 1168 } 1169 2: { -1: 'A', 1170 1: { -1: 'B', 1171 1: { -1: 'C' } 1172 } 1173 } 1174 } 1175 1176 """ 1177 1178
1179 - def __init__(self, objects, metric):
1180 self.root = None 1181 self.metric = metric 1182 self.size = 0 1183 self.height = 0 1184 self.num_nodes = 0 1185 for o in objects: 1186 self.insert(o)
1187
1188 - def insert(self, obj):
1189 global dist_ctr 1190 #value_of = lambda node: node[-1] 1191 #new_node = lambda obj: { -1: obj } 1192 self.size += 1 1193 if self.root is None: 1194 self.root = { -1: obj } 1195 return 1196 node = self.root 1197 while True: 1198 distance = self.metric(node[-1], obj) 1199 dist_ctr += 1 1200 if distance in node: 1201 node = node[distance] 1202 else: 1203 node[distance] = { -1: obj } 1204 break
1205
1206 - def range_search(self, obj, min_dist=0, max_dist=0):
1207 global dist_ctr 1208 candidates = [self.root] 1209 result = [] 1210 while candidates: 1211 node = candidates.pop() 1212 distance = self.metric(node[-1], obj) 1213 dist_ctr += 1 1214 if min_dist <= distance <= max_dist: 1215 result.append(node[-1]) 1216 candidates.extend( child for dist, child in node.iteritems() 1217 if distance - max_dist <= dist <= distance + max_dist 1218 and dist != -1) 1219 return result
1220
1221 - def search(self, obj, max_dist):
1222 return self.range_search(obj, min_dist=0, max_dist=max_dist)
1223
1224 - def nn_search(self, obj, num=1):
1225 if not self: return list() 1226 global dist_ctr 1227 neighbours = list() # list of (value, distance) tuples 1228 candidates = [self.root] 1229 def cmp_neighbours(a, b): return cmp(a[1], b[1]) 1230 while candidates: 1231 cand = candidates.pop() 1232 distance =self.metric(node[-1], obj) 1233 dist_ctr += 1 1234 candidate_is_neighbour = any(1 for n in neighbours 1235 if distance < n[1]) 1236 if candidate_is_neighbour or len(neighbours) < num: 1237 neighbours.append(cand[-1]) 1238 neighbours.sort(cmp_neighbours) 1239 if len(neighbours) > num: 1240 neighbours = neighbours[:num] 1241 elif len(neighbours) == num: 1242 max_dist = max([n[1] for n in neighbours]) 1243 candidates.extend( 1244 child for dist, child in node.iteritems() 1245 if distance - max_dist <= dist <= distance + max_dist 1246 and dist != -1) 1247 else: 1248 candidates.extend(child for child, dist in 1249 node.iteritems() if child != 1) 1250 return neighbours
1251 1252
1253 - def __len__(self):
1254 return self.size
1255
1256 1257 -class VPTree(MetricTree):
1258 1259 """ 1260 1261 .. _Jeffrey K. Uhlmann: http://www.pnylab.com/pny/papers/vptree/vptree/ 1262 1263 .. _Peter Yianilos: http://www.intermemory.net/pny/papers/vptree/vptree.ps 1264 1265 A "Vantage Point Tree" is a static, balanced, binary tree which can 1266 be used to index metric spaces with a non-discrete distance 1267 function. It has been independently described by both `Jeffrey K. 1268 Uhlmann`_ (1991) and `Peter Yianilos`_ (1993). 1269 1270 Construction is done by picking one object (the "vantage point") 1271 from the set of objects to index and then determining the distance 1272 of all remaining objects to the vantage point. The median distance 1273 is calculated and the set of remaining objects is split in two: 1274 objects whose distance to the vantage point is smaller than the 1275 median distance and the rest (objects whose distance to the vantage 1276 point is larger than or equal to the median). Yianilos called this 1277 process "ball decomposition". For both of the resulting sets new 1278 child nodes, called "left" and "right", are created recursively. The 1279 vantage point and the previously calculated median distance are 1280 stored in the current node. 1281 1282 When searching a node, the distance between the current node's value 1283 and the search object is calculated. If it is ``<= k`` (the given 1284 maximum distance), this node's value is added to the list of search 1285 results. Searching proceeds recursively into the left subtree if 1286 ``distance - k < median`` and into the right subtree if ``distance + 1287 k >= median``. 1288 1289 Since the construction phase takes advantage of the median of all 1290 distances, VPTrees do a fairly good job of balancing their subtrees. 1291 Only the fact that you have to put all objects whose distance to the 1292 vantage point equals the median distance always have to be put on 1293 the same side makes VPTrees tend to hang to the side containing 1294 these objects when using discrete distance functions. 1295 1296 Runtime complexity for the construction phase is ``O(n log(n))``, 1297 searching is ``O(log(n))``. Space requirement is ``O(n)``. 1298 1299 """ 1300 1301 __slots__ = [ '_parent', '_values', '_size', '_func', '_height', 1302 '_num_nodes', '_median', '_leftchild', '_rightchild' ] 1303
1304 - def __init__(self, objects=None, func=None, parent=None):
1305 self._median = None 1306 self._leftchild = None 1307 self._rightchild = None 1308 super(VPTree, self).__init__(objects, func, parent)
1309
1310 - def construct(self, objects, func):
1311 self._func = func 1312 if objects: 1313 if self.is_root(): 1314 # when building the root of the tree, we make sure `objects` 1315 # is a shuffled list to improve VP picking and make 1316 # decomposing easier. 1317 objects = list(objects) 1318 #random.shuffle(objects) 1319 self._values = [self._pick_VP(objects)] 1320 left, right = self._decompose(objects) 1321 del objects # we don't need that list anymore so release it 1322 # before doing recursive calls 1323 self._incr_size(len(self._values)) 1324 if left: 1325 self._leftchild = VPTree(left, func, self) 1326 if right: 1327 self._rightchild = VPTree(right, func, self) 1328 return self
1329
1330 - def _pick_VP(self, objects):
1331 # this probably makes no sense whatsoever, simply pop()ing would 1332 # do just as well, I guess. Need to think about good strategies. 1333 if len(objects) > 15: 1334 sample = objects[:5] 1335 max_diff = -1 1336 vp = None 1337 for o in sample: 1338 dists = [ self._func(other, o) for other in sample 1339 if other is not o] 1340 diff = max(dists) - min(dists) 1341 if diff > max_diff: 1342 max_diff, vp = diff, o 1343 objects.remove(vp) 1344 return vp 1345 else: 1346 return objects.pop()
1347
1348 - def _decompose(self, objects):
1349 """ 1350 1351 Perform the process called "ball decomposition" by Peter 1352 Yianilos. 1353 1354 `objects` has to be an iterable that yields objects applicable 1355 to the metric function in use. The return value is a tuple of 1356 two lists: one list that contains all elements having a distance 1357 smaller than the median distance to this node's value, the 1358 second list contains all objects whose distance is equal to or 1359 larger than the median distance of all given `objects` to this 1360 node's value. 1361 1362 """ 1363 dist_per_obj = list() 1364 for obj in objects: 1365 distance = self._get_dist(obj) 1366 if distance == 0: 1367 self._values.append(obj) 1368 else: 1369 dist_per_obj.append( (distance, obj) ) 1370 if dist_per_obj: 1371 self._median = VPTree.determine_median(zip(*dist_per_obj)[0]) 1372 left = [ obj for dist, obj in dist_per_obj 1373 if dist < self._median ] 1374 right = [ obj for dist, obj in dist_per_obj 1375 if dist >= self._median ] 1376 else: 1377 left, right = None, None 1378 return left, right
1379 1380 @staticmethod
1381 - def determine_median(numbers):
1382 """ 1383 1384 Determine the median from a sequence of numbers (or anything 1385 else that can be ``sorted()``). 1386 1387 This does not use an optimal ``O(n)`` algorithm but instead 1388 relies on CPython's speed when sorting (``O(n log(n))``). 1389 1390 """ 1391 return sorted(numbers)[ len(numbers) / 2 ]
1392
1393 - def _get_child_candidates(self, distance, min_dist, max_dist):
1394 if self._leftchild and distance - max_dist < self._median: 1395 yield self._leftchild 1396 if self._rightchild and distance + max_dist >= self._median: 1397 yield self._rightchild
1398
1399 - def __children(self):
1400 return [child for child in (self._leftchild, self._rightchild) 1401 if child]
1402 children = property(__children)
1403
1404 1405 #class BSTree(object): 1406 # 1407 # class Sector(object): 1408 # """ 1409 # 1410 # * _values contains objects with distance == 0 to each other 1411 # 1412 # * _radius is the largest distance of all objects to this node's 1413 # values. 1414 # 1415 # * _child is a reference to a BSTree. 1416 # 1417 # * _node is a reference to the BSTree this Sector belongs to. 1418 # 1419 # """ 1420 # 1421 # __slots__ = [ '_values', '_radius', '_child', '_node' ] 1422 # 1423 # def __init__(self, node, *values): 1424 # self._node = node 1425 # self._values = list(values) 1426 # 1427 # def insert(self, obj): 1428 # if not self._values: 1429 # self._values = [obj] 1430 # return self._node 1431 # else: 1432 # distance = MetricTree._get_dist(self, obj) 1433 # self._radius = max(self._radius, distance) 1434 # return self._child.insert(obj) 1435 # 1436 # def __contains__(self, obj): 1437 # return MetricTree.__contains__(self, obj) 1438 # 1439 # 1440 # __slots__ = [ 1441 # '_parent', '_size', '_func', '_height', '_num_nodes', 1442 # '_left_sector', '_right_sector' 1443 # ] 1444 # 1445 # def __init__(self, objects=None, func=None, parent=None): 1446 # self._left_sector = Sector() 1447 # self._right_sector = Sector() 1448 # super(BSTree, self).__init__(objects, func, parent) 1449 # 1450 # def construct(self, objects, func): 1451 # split = ([], []) 1452 # if not self._left_values or not self._right_values: 1453 # pass # TODO 1454 # for obj in objects: 1455 # to_left = this._func(self._left_values[0]) 1456 # to_right = this._func(self._right_values[0]) 1457 # 1458 # def decompose(self, objects): 1459 # return objects.pop() 1460 # 1461 # def pick_good_sector_values(self, objects): 1462 # return objects.pop(), objects.pop() 1463 # 1464 # def __nonzero__(self): 1465 # pass 1466 # 1467 # def __iter__(self): 1468 # pass 1469 # 1470 # def __repr__(self): 1471 # pass 1472 1473 -class UnindexableObjectError(Exception):
1474 """ 1475 1476 This Exception is thrown when the call to the distance function in 1477 use by a metric space throws an exception. This should not happen 1478 during regular usage and is a hint that either your distance 1479 function is buggy or you tried to index objects which your distance 1480 function cannot handle. 1481 1482 """ 1483
1484 - def __init__(self, msg, e):
1485 self.orig_exception = e 1486 Exception.__init__(self, msg)
1487 1488 orig_exception = None 1489 """ The exception that has triggered this exception. """ 1490 1491 pass
1492
1493 1494 -def levenshtein(s1, s2):
1495 """ 1496 1497 Compute Levenshtein (or "edit") distance between two strings. 1498 Levenshtein distance between two strings ``x`` and ``y`` is defined 1499 as the minimal number of operations on single characters that it 1500 takes to transform ``x`` into ``y`` (or vice versa). 1501 1502 Allowed operations are: 1503 1504 - Insertion, 1505 - Deletion and 1506 - Replacement 1507 1508 of a single character. 1509 1510 Levenshtein distance has all the properties of a strictly positive 1511 metric. That means for all x, y, z it holds: 1512 1513 - x == y <=> levenshtein(x, y) == 0 1514 - x != y <=> levenshtein(x, y) > 0 (positivity) 1515 - levenshtein(x, y) == levenshtein(y, x) (symmetry) 1516 - levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z) 1517 1518 The upper bound for Levenshtein distance is the length of the longer 1519 string: ``max(len(x), len(y))``. A general lower bound is 1520 ``abs(len(x) - len(y))``. This is the case where one string is the 1521 pre-/postfix of the other one. 1522 1523 Incidentally, this implementation not only works for strings, but 1524 for all types of sequences. Objects in the given sequences are 1525 compared for equality using '==' to determine whether an edit 1526 operation is needed. 1527 1528 """ 1529 # This function is designed for Psyco 1530 if s1 == s2: return 0 # this is fast in Python 1531 if len(s1) > len(s2): 1532 s1, s2 = s2, s1 1533 r1 = range(len(s2) + 1) 1534 r2 = [0] * len(r1) 1535 i = 0 1536 for c1 in s1: 1537 r2[0] = i + 1 1538 j = 0 1539 for c2 in s2: 1540 if c1 == c2: 1541 r2[j+1] = r1[j] 1542 else: 1543 a1 = r2[j] 1544 a2 = r1[j] 1545 a3 = r1[j+1] 1546 if a1 > a2: 1547 if a2 > a3: 1548 r2[j+1] = 1 + a3 1549 else: 1550 r2[j+1] = 1 + a2 1551 else: 1552 if a1 > a3: 1553 r2[j+1] = 1 + a3 1554 else: 1555 r2[j+1] = 1 + a1 1556 j += 1 1557 aux = r1; r1 = r2; r2 = aux 1558 i += 1 1559 return r1[-1]
1560
1561 -def tokenizer(textfile, separator=None):
1562 """ 1563 1564 Simple generator that tokenizes sequences of strings (or file-like 1565 objects) into uppercase words using the optional `separator`. This 1566 `separator` is passed directly to str.split so ``split``'s behaviour 1567 concerning `None` as separator applies here as well. 1568 1569 After the last token has been returned, there is an attempt to close 1570 `textfile`. If this yields an `AttributeError`, it will be silently 1571 ignored. 1572 1573 """ 1574 for line in textfile: 1575 for token in line.split(separator): 1576 yield token.upper() 1577 try: 1578 textfile.close() 1579 except AttributeError, e: 1580 pass
1581 1582 dist_ctr = 0 1583 1584 try: 1585 import psyco 1586 psyco.bind(levenshtein) 1587 psyco.bind(VPTree) 1588 psyco.bind(BKTree) 1589 psyco.bind(FastBKTree) 1590 except ImportError, e: 1591 sys.stderr.write("Psyco not available, proceeding without it.\n") 1592 1593 if __name__ == '__main__': 1594 #import codecs, time 1595 #dumpfilename = "vptree.dump" 1596 #f = codecs.open(sys.argv[1], encoding="iso-8859-1") 1597 #names = tokenizer(f) 1598 #t = BKTree() 1599 #t.construct(names, levenshtein) 1600 #f.close() 1601 #print type(t) 1602 #s = raw_input("press enter") 1603 #print "t height: %s" % t.height 1604 #print "t size: %s" % len(t) 1605 #print "t nodes %s" % t.num_nodes 1606 1607 ##print "results: %s " % t.search("MARTIN", 1) 1608 1609 #f = codecs.open(sys.argv[1], encoding="iso-8859-1") 1610 #names = list(tokenizer(f)) 1611 #random.shuffle(names) 1612 1613 #start = time.time() 1614 #for i in range(0, 10): 1615 # t.search(names.pop(), 1) 1616 #end = time.time() 1617 #print end - start 1618 pass 1619 1620 # vim: set et ts=4 sw=4 tw=72 nu: 1621