Module mspace

Source Code for Module mspace

   1  #!/usr/bin/python -tt 
   2  # -*- coding: utf-8 -*- 
   3   
   4  # mspace.py - An implementation of metric space indexes for efficient 
   5  # similarity search 
   6  # 
   7  # Copyright (C) 2007-2012 Jochen Spieker 
   8  # 
   9  # This program is free software; you can redistribute it and/or modify 
  10  # it under the terms of the GNU General Public License version 2 as 
  11  # published by the Free Software Foundation. 
  12  # 
  13  # This program is distributed in the hope that it will be useful, but 
  14  # WITHOUT ANY WARRANTY; without even the implied warranty of 
  15  # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU 
  16  # General Public License for more details. 
  17  # 
  18  # You should have received a copy of the GNU General Public License 
  19  # along with this program; if not, write to the Free Software 
  20  # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, 
  21  # USA. 
  22   
  23  """ 
  24   
  25  Metric Spaces 
  26  ============== 
  27   
  28  .. contents:: 
  29   
  30  Introduction 
  31  ------------- 
  32   
  33  This module provides efficient similarity search by implementing 
  34  different data structures for indexing metric spaces.  You can use it to 
  35  index an arbitrary number of objects which you later want to search for 
  36  objects "similar" to a given other object. 
  37   
  38  For this module to be useful, you need a function with metric properties 
  39  (see below for a definition) which you can apply to the type of objects 
   40  you want to search for. The only pre-defined metric function in this 
  41  module is `levenshtein`, which is a common metric to calculate the 
  42  (dis)similarity of two strings. 
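The classical dynamic-programming algorithm behind such a Levenshtein function can be sketched in a few lines. This is a simplified, self-contained illustration (the name ``levenshtein_sketch`` is made up here), not the module's own implementation:

```python
def levenshtein_sketch(s, t):
    """Edit distance via the classical dynamic-programming table.

    Only one row is kept at a time, giving O(len(s) * len(t)) time and
    O(min(len(s), len(t))) space.
    """
    if len(t) > len(s):
        s, t = t, s                       # keep the shorter string in the row
    prev = list(range(len(t) + 1))        # distances from the empty prefix
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For example, ``levenshtein_sketch("WOOD", "WOOL")`` is 1 (one substitution) and ``levenshtein_sketch("kitten", "sitting")`` is 3.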
  43   
  44  Beware that the indexing phase might take a long time, depending on the 
  45  number of objects you want to index and the runtime of your metric 
  46  distance function. Indexing only pays off when you are planning to do a 
  47  lot of searches. 
  48   
  49  Definitions 
  50  ............ 
  51   
   52  A metric space is a pair ``(S, f)`` where ``S`` is an arbitrary 
  53  (possibly infinite) set of objects and ``f`` is a function which returns 
  54  a "distance" for any two objects ``x``, ``y`` in ``S``. This distance is 
  55  a number used to express the similarity between the two objects. The 
  56  larger the distance, the more they differ. A distance of zero means they 
  57  are considered equal. 
  58   
  59  The function ``f`` has to be a "metric", i.e. the following conditions 
  60  have to be met for all ``x``, ``y``, ``z`` in ``S``: 
  61   
  62  -  ``f(x, y) >= 0`` 
  63   
  64     (positivity) 
  65   
  66  -  ``f(x, y) == f(y, x)`` 
  67   
  68     (symmetry) 
  69   
  70  -  ``f(x, z) <= f(x, y) + f(y, z)`` 
  71   
  72     (triangle inequality) 
  73   
  74  This definition adheres to most people's intuitive understanding of the 
   75  distance between two points in a Euclidean space (of arbitrary 
  76  dimensionality).  [#euclid]_ Imagine you have three points in a 
  77  2-dimensional space like this:: 
  78   
  79      A-------B 
  80       \     / 
  81        \   / 
  82         \ / 
  83          C 
  84   
  85  It is evident that the distance between two of these points is 
  86  necessarily larger than (or equal to) zero, and that the direction you 
  87  take for measuring (from ``A`` to ``B`` or vice versa) has no influence 
   88  on the distance. The third condition, the triangle inequality, is easy to 
  89  grasp, too: if you know the distances between ``A`` and ``B`` and 
  90  between ``B`` and ``C``, you can be sure that their sum is equal to or 
  91  larger than the distance between ``A`` and ``C``. Voila, there's your 
  92  metric space. 
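These three conditions can be checked mechanically on a small sample. A minimal sketch, using the absolute difference of numbers (which is itself a metric):

```python
import itertools

def f(x, y):
    return abs(x - y)   # absolute difference: a simple metric on numbers

S = [0, 2, 5, 9]
for x, y, z in itertools.product(S, repeat=3):
    assert f(x, y) >= 0                      # positivity
    assert f(x, y) == f(y, x)                # symmetry
    assert f(x, z) <= f(x, y) + f(y, z)      # triangle inequality
```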
  93   
  94  In many cases, metric distance functions take a long time to compute. 
  95  Because of this, data structures used to index metric spaces are 
  96  designed to minimize the number of distance computations necessary for 
  97  searching the whole space by using the properties of its metric function 
  98  (most notably the triangle inequality) to discard as many objects as 
  99  possible after every computation. 
 100   
  101  .. [#euclid] In fact, Euclidean spaces are one type of metric space. 
 102   
 103   
 104  Use cases & caveats 
 105  .................... 
 106   
 107  Metric distance functions can be defined for objects other than points 
  108  in Euclidean space, too.  One very well-known distance function is the 
 109  Levenshtein distance, which is applicable to arbitrary strings. Since it 
 110  is quite popular and indexing strings is one of the major use cases for 
 111  metric spaces, an implementation of this function is included in this 
 112  module. Other metric distance functions for strings are the 
 113  Damerau-Levenshtein variant and Hamming distance. These can be used for 
 114  a variety of tasks, such as spell-checking, record linkage (or 
 115  "deduplication"), error detection etc. 
 116   
 117  Please note that indexing a large number of objects usually takes a 
 118  considerable amount of time *and* memory. Indexing only pays off if you 
 119  are going to do *a lot* of searches in the index. If you just want to 
 120  search a set of objects for all objects similar to one or a few other 
 121  objects, you are better off comparing them sequentially by hand with 
 122  your distance function. 
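Such a sequential scan is only a few lines. The following sketch uses made-up names and a toy metric (counting differing positions of equal-length words) rather than this module's API:

```python
def linear_search(objects, query, dist, max_dist):
    """Compare every object directly: O(n) distance computations per query."""
    return [o for o in objects if dist(query, o) <= max_dist]

def substitutions(a, b):
    # toy metric for the illustration: differing positions of equal-length words
    return sum(x != y for x, y in zip(a, b))

words = ["WOOD", "GOOD", "FOOT", "WOOL"]
linear_search(words, "WOOD", substitutions, 1)   # ['WOOD', 'GOOD', 'WOOL']
```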
 123   
 124  Installation 
 125  ............. 
 126   
 127  .. _Psyco: http://psyco.sourceforge.net/ 
 128   
 129  This file is a stand-alone module for Python 2.4 and later which you 
 130  only have to save somewhere in your ``PYTHONPATH``. [#pythonpath]_ No 
 131  installation is necessary. The only tight dependencies are the modules 
 132  ``sys``, ``random`` and ``weakref`` which are most probably already 
 133  included in your Python installation. 
 134   
 135  If you are using this on a 32 Bit i386 machine running Windows or Linux, 
 136  you probably also want to install Psyco_ which speeds up execution times 
 137  quite noticeably (in exchange for only a slightly increased memory 
 138  usage). Psyco will be used automatically if it is available. 
 139   
 140  .. [#pythonpath] Run ``python -c "import sys; print sys.path"`` to learn 
 141    where Python currently looks for modules. 
 142   
 143  The most current version is always available from this site: 
 144   
 145  http://well-adjusted.de/mspace.py/ 
 146   
 147  The project is currently kept in SVN and may be obtained from the 
 148  following URL: 
 149   
 150  svn://svn.well-adjusted.de/mspace/trunk 
 151   
 152  Basic usage 
 153  ------------ 
 154   
 155  Index creation 
 156  ............... 
 157   
 158  Say you have a dictionary file containing one or more words per line. 
 159  You can make an index of all the words in it by doing: 
 160   
 161  .. python:: 
 162   
 163      >>> import mspace 
 164      >>> mydict = file('dictionary.txt') 
 165      >>> words = mspace.tokenizer(mydict) 
 166      >>> index = mspace.VPTree(words, mspace.levenshtein) 
 167   
  168  You can delay the time-consuming construction phase by creating the 
 169  index without any parameters and explicitly calling ``construct`` at a 
 170  later time: 
 171   
 172  .. python:: 
 173   
 174      >>> index = mspace.VPTree() 
 175      >>> index.construct(words, mspace.levenshtein) 
 176   
  177  Please note that the index's content is discarded on every call to 
 178  ``construct``. The index is built from scratch with only the objects and 
 179  distance function passed as parameters. 
 180   
 181  Performing range-searches 
 182  .......................... 
 183   
 184  After you have built the index (and got yourself some coffee if the 
 185  index is large) you may search it for all objects having distance to a 
 186  given other object between arbitrary bounds: 
 187   
 188  .. python:: 
 189   
 190      >>> index.range_search('WOOD', min_dist=0, max_dist=1) 
 191      ['GOOD', 'WOOD', 'WOOL', 'MOOD'] 
 192   
  193  In this case, you received four results: the object you searched for 
  194  (apparently it has been in your dictionary, too) and three other objects 
  195  "GOOD", "WOOL" and "MOOD", all of which are within the requested maximum 
  196  distance of one to your query object. Result lists are unordered and do not contain 
 197  information about their contents' distance to the query object. If you 
 198  need this, you have to calculate it by hand. 
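Calculating the distances by hand costs one extra call per result. A sketch, again with a stand-in metric since the real distance function is whatever you indexed with:

```python
def substitutions(a, b):
    # stand-in metric for equal-length words, used only for this illustration
    return sum(x != y for x, y in zip(a, b))

results = ['GOOD', 'WOOD', 'WOOL', 'MOOD']    # as returned by a range search
query = 'WOOD'
pairs = sorted((substitutions(query, r), r) for r in results)
# pairs == [(0, 'WOOD'), (1, 'GOOD'), (1, 'MOOD'), (1, 'WOOL')]
```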
 199   
 200  Both ``min_dist`` and ``max_dist`` are optional parameters defaulting to 
 201  zero. Thus, if you leave them out, a search for objects which are 
 202  exactly equal (as defined by your distance function) to the query object 
 203  is performed. For historical reasons, and since range-searching with a 
 204  minimum distance of zero and a maximum greater than zero is quite 
 205  common, there is a shortcut to search with ``min_dist=0``: 
 206   
 207  .. python:: 
 208   
 209      >>> index.search('WOOD', 1) 
 210      ['GOOD', 'WOOD', 'WOOL', 'MOOD'] 
 211   
 212   
 213  Performing nearest-neighbour-searches 
 214  ...................................... 
 215   
 216  The second type of search you can perform is "k-nearest-neighbour" 
 217  search. It returns a given number of objects from the index that are 
  218  closest to the query object. Search results are guaranteed never to 
  219  contain an object that is farther from the query object than any 
  220  object left out of the result. 
 221   
 222  Result lists are composed of ``(object, distance)`` tuples, sorted 
  223  by distance in ascending order. If you don't specify a maximum number of 
 224  matches to return, it defaults to one: 
 225   
 226  .. python:: 
 227   
 228      >>> index.nn_search('WOOD') 
 229      [('WOOD', 0)] 
 230      >>> index.nn_search('WOOD', 5) 
 231      [('WOOD', 0), ('WOOL', 1), ('GOOD', 1), ('MOOD', 1), ('FOOT', 2)] 
 232   
 233  Note that the index may contain other objects with a distance of two to 
 234  the query object ``'WOOD'`` (e.g. ``'FOOL'``). They have just been cut 
 235  off because a maximum of five objects has been requested. You have no 
 236  influence on the choice of representatives that is made. 
 237   
  238  Note that you should not expect this method to always return a list of 
 239  length ``num`` because you might ask for more objects than are currently 
 240  in the index. 
 241   
 242   
 243  Advanced usage 
 244  --------------- 
 245   
 246  Indexing complex objects 
 247  ......................... 
 248   
 249  If you have "complex" objects which you want to index by only one 
 250  specific attribute, you can write a simple wrapper around `levenshtein` 
 251  (or some other function applicable to the attribute) which extracts this 
 252  attribute from your objects and returns the distance between their 
  253  attributes' values.  This way you can search for and retrieve complex 
  254  objects from the index while comparing only a single attribute: 
 255   
 256  .. python:: 
 257   
 258      >>> import mspace 
 259      >>> 
 260      >>> class Record(object): 
 261      ...     def __init__(self, firstname, surname): 
 262      ...         self._firstname, self._surname = firstname, surname 
 263      ... 
 264      >>> def firstnameLev(r1, r2): 
 265      ...     return mspace.levenshtein(r1._firstname, r2._firstname) 
 266      ... 
 267      >>> index = mspace.VPTree(listOfRecords, firstnameLev) 
 268      >>> rec = Record("Paul", "Brady") 
 269      >>> index.search(rec, 2) 
 270      [<Record: 'Paula Bean'>, <Record: 'Raoul Perky'>, <Record: 'Paul Simon'>] 
 271   
 272  Of course you can also use more advanced techniques like writing a 
 273  function factory which returns a function that extracts arbitrary 
  274  attributes from your objects and applies a metric function to them: 
 275   
 276  .. python:: 
 277   
 278      >>> def metric_using_attr(metric, attr): 
 279      ...     def f(one, other, attr=attr): 
 280      ...         attr1, attr2 = getattr(one, attr), getattr(other, attr) 
 281      ...         return metric(attr1, attr2) 
 282      ...     return f 
 283      ... 
 284      >>> metric = metric_using_attr(mspace.levenshtein, "_firstname") 
 285      >>> metric( Record("Paul", "Brady"), Record("Paul", "Simon") ) 
 286      0 
 287   
 288  (Note that the inner function ``f`` in this example deviates from the 
 289  standard protocol by accepting an optional third parameter.  This is 
 290  done here only to pull the name ``attr`` into the inner function's 
 291  namespace to save some time when looking up its associated value.  No 
 292  index structure in this module will ever call its distance function with 
 293  more than two parameters anyway, so that whatever you passed as ``attr`` 
 294  to ``metric_using_attr`` when creating your function will be used when 
 295  this function gets called by an index. Got it?) 
 296   
 297  Reverse indexes 
 298  ................ 
 299   
 300  Of course you can use a similar technique to avoid full indexes and 
 301  instead create an index which only contains references to your data (in 
 302  a database, text file, somewhere on the net etc.). Your distance 
 303  function would then use the supplied values to fetch the objects to 
 304  compare from your data source.  But, I cannot stress this enough, your 
 305  distance function's performance is absolutely crucial to efficient 
 306  indexing and searching. Therefore you should make sure that your way of 
 307  accessing the data is really fast. 
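A distance function for such a reverse index might look like the following sketch; the names and the dict-based data source are made up for the illustration, standing in for a database or file:

```python
datastore = {1: "GOOD", 2: "WOOD", 3: "FOOT"}   # stands in for a DB or file

def substitutions(a, b):
    # stand-in metric for equal-length words
    return sum(x != y for x, y in zip(a, b))

def key_distance(key1, key2):
    # Only the keys are indexed; the values are fetched on *every* distance
    # computation, so a slow lookup here dominates indexing and searching.
    return substitutions(datastore[key1], datastore[key2])

key_distance(1, 2)   # distance between "GOOD" and "WOOD"
```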
 308   
 309  Choice of data structure 
 310  ------------------------- 
 311   
 312  This module currently gives you two data structures to choose from. 
 313  While the underlying algorithms are different, their search results 
 314  (given the same set of objects to index, distance function and maximum 
 315  distance) are exactly the same. If you find a scenario where this is not 
 316  the case, please let me know because this would be a serious bug. 
 317   
 318  Capabilities 
 319  ............. 
 320   
 321  The algorithms implemented in this module have different capabilities 
 322  which are displayed in the table below: 
 323   
 324  +--------------------+-------+-------+ 
 325  |                    |VPTree |BKTree | 
 326  +====================+=======+=======+ 
 327  |non-discrete metric |     X |       | 
 328  +--------------------+-------+-------+ 
 329  |insertion           |       |     X | 
 330  +--------------------+-------+-------+ 
 331  |deletion            |       |       | 
 332  +--------------------+-------+-------+ 
 333   
 334  The first row is about the value returned by the distance function the 
 335  index uses. BKTrees are not prepared to handle non-discrete distances 
  336  (for example floats) and will most probably perform very badly when 
  337  they occur. 
 338   
 339  The other two rows describe what you can do with the indexes *after* 
 340  they have been constructed. Please note that neither of them may shrink 
 341  over time. Only BKTrees are able to handle insertion efficiently. 
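To illustrate how BKTrees support cheap insertion, here is a minimal, self-contained BK-tree sketch (not this module's ``BKTree`` class): every child hangs off its parent under an edge labelled with their mutual distance, so inserting just walks down matching edges, and the triangle inequality prunes whole subtrees during search.

```python
class BKNode(object):
    """Minimal BK-tree sketch for a discrete metric (illustration only)."""
    def __init__(self, value, dist):
        self.value, self.dist = value, dist
        self.children = {}                  # edge distance -> child node

    def insert(self, obj):
        d = self.dist(obj, self.value)
        if d == 0:
            return                          # equal object is already covered
        if d in self.children:
            self.children[d].insert(obj)    # descend along the matching edge
        else:
            self.children[d] = BKNode(obj, self.dist)

    def search(self, obj, max_dist):
        d = self.dist(obj, self.value)
        found = [self.value] if d <= max_dist else []
        # triangle inequality: only subtrees whose edge label lies in
        # [d - max_dist, d + max_dist] can contain matches
        for edge, child in self.children.items():
            if d - max_dist <= edge <= d + max_dist:
                found.extend(child.search(obj, max_dist))
        return found

def substitutions(a, b):                    # stand-in metric
    return sum(x != y for x, y in zip(a, b))

root = BKNode("WOOD", substitutions)
for word in ["GOOD", "WOOL", "MOOD", "FOOT"]:
    root.insert(word)
sorted(root.search("WOOD", 1))   # ['GOOD', 'MOOD', 'WOOD', 'WOOL']
```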
 342   
 343  Performance 
 344  ............ 
 345   
 346  [In short: you are probably better off using BKTrees unless you need a 
 347  non-discrete metric. When in doubt, test both options.] 
 348   
 349  Obviously, the whole point of this module is to speed up the process of 
 350  finding objects similar to one another. And, as you can imagine, the 
 351  different algorithms perform differently when exposed to a specific 
 352  problem. 
 353   
 354  Here's an example: I took the file ``american-english-large`` from the 
 355  Debian package ``wamerican-large`` and did some time measurements of 
  356  indexing and searching a random subset of it. The machine I used is a 
 357  1.1GHz Pentium3 running Python 2.4 from Debian stable **with Psyco 
 358  enabled**. Please note that the following numbers have been determined 
  359  completely unscientifically. I only show them to give you an idea of what 
 360  to expect. For serious benchmarking, I should have used the ``timeit`` 
 361  module. Nevertheless, this is it: 
 362   
 363   
 364  +------+------------------------+------------------------+ 
 365  |      |BKTree                  |VPTree                  | 
 366  +======+======+=========+=======+=======+========+=======+ 
 367  |size  |time  |per node |height |time   |per node|height | 
 368  +------+------+---------+-------+-------+--------+-------+ 
 369  | 5000 |  2.92|0.000584 |11     |   7.40|0.001480|     21| 
 370  +------+------+---------+-------+-------+--------+-------+ 
 371  |10000 |  6.32|0.000632 |14     |  16.02|0.001602|     22| 
 372  +------+------+---------+-------+-------+--------+-------+ 
 373  |15000 |  9.95|0.000663 |14     |  28.35|0.001890|     24| 
 374  +------+------+---------+-------+-------+--------+-------+ 
 375  |20000 | 13.70|0.000685 |14     |  41.40|0.002070|     24| 
 376  +------+------+---------+-------+-------+--------+-------+ 
 377  |25000 | 17.46|0.000699 |15     |  50.63|0.002025|     25| 
 378  +------+------+---------+-------+-------+--------+-------+ 
 379  |30000 | 21.81|0.000727 |15     |  55.47|0.001849|     25| 
 380  +------+------+---------+-------+-------+--------+-------+ 
 381  |35000 | 25.77|0.000736 |16     |  64.43|0.001841|     26| 
 382  +------+------+---------+-------+-------+--------+-------+ 
 383  |40000 | 29.40|0.000735 |16     |  75.45|0.001886|     26| 
 384  +------+------+---------+-------+-------+--------+-------+ 
 385  |45000 | 41.28|0.000917 |16     |  96.36|0.002141|     26| 
 386  +------+------+---------+-------+-------+--------+-------+ 
 387  |50000 | 37.62|0.000752 |16     |  95.31|0.001906|     28| 
 388  +------+------+---------+-------+-------+--------+-------+ 
 389   
 390   
 391  This table shows construction times (total and per node in seconds) for 
 392  data sets of a given size. Additionally, you can see the height 
 393  [#height]_ of the trees.  Apparently, BKTrees can be constructed a lot 
 394  faster than VPTrees. Both of them need an increasing time per node with 
 395  growing data sets. This is expected since construction complexity is 
 396  ``O(n log(n))`` in both cases. 
 397   
 398  +---------------+-----------------+-----------------+ 
 399  |               |BK, size: 50,000 |VP, size: 50,000 | 
 400  +------+--------+-------+---------+-------+---------+ 
 401  |k     |results |  time |  # dist |  time |  # dist | 
 402  +------+--------+-------+---------+-------+---------+ 
 403  |0     |   0.89 |0.0008 |    8.14 | 0.0019|    17.28| 
 404  +------+--------+-------+---------+-------+---------+ 
 405  |1     |   4.07 |0.1670 | 1583.77 | 0.2801|  1933.64| 
 406  +------+--------+-------+---------+-------+---------+ 
 407  |2     |  38.10 |1.1687 |10353.31 | 1.4413| 13845.67| 
 408  +------+--------+-------+---------+-------+---------+ 
 409  |3     | 304.57 |2.5614 |22202.28 | 3.0497| 27514.51| 
 410  +------+--------+-------+---------+-------+---------+ 
 411  |4     |1584.86 |3.8727 |32376.54 | 4.1518| 36877.62| 
 412  +------+--------+-------+---------+-------+---------+ 
 413  |5     |5317.03 |4.4182 |39616.04 | 4.9935| 42720.38| 
 414  +------+--------+-------+---------+-------+---------+ 
 415   
 416  This table shows the runtime of searches done in the trees (in seconds), 
 417  the number of distance calculations done and the number of results for 
 418  growing error tolerance. All values are given as average over 100 random 
 419  searches (from the same file, but not necessarily from the set of 
 420  indexed objects).  As you can see, search runtime (tightly connected to 
 421  the number of distance calculations) literally explodes for larger 
 422  values of ``k``.  This growth only fades when an increased part of the 
 423  complete search space is examined (the number of distance calculations 
 424  equals the number of nodes compared with the query object). 
 425   
 426  As you can see, too, long search runtimes for large values of ``k`` 
 427  don't actually hurt usability very much since you only get a usable 
 428  number of results for small ``k`` anyway. This is of course due to the 
 429  nature of the data set and the distance function used in this example. 
 430  Your application may vary greatly. 
 431   
 432  You also have to keep in mind that the dictionary I used contains almost 
 433  no duplicates. If you used the words of a real document as your data 
  434  set, your tree would have significantly fewer nodes than the number of 
 435  words in your document since you typically repeat a lot of words very 
 436  often. A quick test with my diploma thesis revealed only 4400 distinct 
 437  words in a document with 14,000 words (including LaTeX commands). This 
 438  makes searching much faster because equal objects (as defined by the 
 439  distance function) are only evaluated once. 
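The effect is easy to see with Python's built-in set type:

```python
text = "the cat sat on the mat the end"    # stand-in for a real document
words = text.split()
len(words), len(set(words))                # 8 words, but only 6 distinct
```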
 440   
 441  .. [#height] If you look closely, you can see that the VPTree's height 
 442     doesn't grow continuously when the data set grows. The reason for 
 443     that phenomenon is that this implementation does not try very hard to 
 444     pick a good vantage point which is the key factor to get 
  445     well-balanced trees. So, in some cases, a vantage point may be chosen 
  446     that results in trees which are higher than strictly necessary. 
 447   
 448  Optimization potential (or: Why the f*ck is this so slow!?) 
 449  ------------------------------------------------------------ 
 450   
 451  Ok, you have tried to index your small dictionary file and start to 
 452  wonder why this easy task takes several minutes. Here are a couple of 
 453  (possible) reasons: 
 454   
 455  - Indexing takes ``O(n log(n))`` for all data structures currently 
 456    implemented. So yes, doubling the number of indexed objects will 
 457    necessarily more than double your runtime for indexing, sorry. 
 458   
 459  - Python is slow. I have written an implementation very similar to this 
 460    one in Java which is significantly faster (by a factor of about 15 to 
  461  25, even when using Psyco_!).  But the Java implementation eats more 
 462    memory and unfortunately I cannot release it under a free license. 
 463   
 464  - Recursion is slow in Python. Recursion is the most natural way to 
 465    create and search trees and this code uses it a lot. Most of the 
 466    recursion in this module is "tail recursion", but the Python 
 467    interpreter is not able to optimize it into loops. 
 468   
 469    [Note: as of revision 60, searching in both trees and inserting into 
 470    BKTrees has been rewritten using loops instead of recursion. 
  471  Performance gain is quite small, though.] 
 472   
 473  - The distance distribution in your metric space is too dense. This 
 474    leads to the data structures being unable to discard large parts of 
 475    the indexed space while searching. In pathological cases, you may end 
 476    up with data structures resembling linked lists. 
 477   
 478  - Your distance function is non-discrete, i.e. it returns floats. How 
 479    well this is supported depends on the index structure in use. 
 480   
 481  - Your metric function is very expensive. Remember that this function 
 482    has to be called *very often* during indexing. You may use this 
 483    module's attribute ``dist_ctr`` to get an idea how often this is done. 
 484    It is incremented by one on every call to your distance function and 
 485    is otherwise unused. 
 486   
 487  - You are searching for non-similar objects. If you, for example,  have 
 488    an index of strings with an average length of six characters and you 
 489    are continuously searching for strings with a maximum distance of 
 490    three or more, do not expect the search to be significantly faster 
 491    than linear testing of the complete search space. It may even be 
 492    slower. 
 493   
 494  Possible solutions include: 
 495   
 496  - Rewrite in C. Since I cannot be bothered to learn C, someone else 
 497    would have to do this. 
 498   
 499  - Use Pyrex or Psyco_. Quick tests with Psyco suggest that indexing 
 500    becomes about 3-4 times faster. This is as easy as doing an ``import 
 501    psyco; psyco.full()`` in your program. Read Psyco's manual for tuning 
 502    options. 
 503   
 504    [Note: as of about revision 46, on module loading an attempt is made 
 505    to import and enable Psyco for levenshtein and the tree classes. If it 
 506    doesn't succeed, you'll get a performance warning on ``stderr`` but 
 507    everything will continue to work flawlessly.] 
 508   
 509  - Pickle trees for static data sets. Pickling and unpickling indexes 
 510    using Python's standard ``pickle`` module is quite fast. But beware 
 511    that a pickled index takes a lot more space than your plain data. 
 512   
 513  - Rewrite tree construction and searching using loops. I am not sure 
 514    whether this will be faster, but it will probably take less memory 
 515    than the current approach and fix the problem with Python's recursion 
  516  limit. I fear that code readability will suffer, though. The current 
 517    implementations are quite close to the original descriptions of the 
 518    algorithms. 
 519   
 520    [Note: partially done, see above.] 
 521   
 522  - Optimize the distance function in use. Since I do not know your 
 523    distance function, I cannot help you with this. 
 524   
 525    As for `levenshtein`: the algorithm implemented in this module is the 
 526    classical dynamic programming variant with ``O(|n|*|m|)`` time 
 527    complexity (and ``O(min(|n|, |m|))`` for space).  Other algorithms for 
 528    Levenshtein try not to compute the whole matrix but only the values 
 529    around the diagonal of the matrix that are strictly necessary. By 
 530    doing this, the average ``O(|n|*|m|)`` of the classical dynamic 
 531    programming algorithm could be improved to something like 
 532    ``O(k*|n|)``. While it isn't obvious how much wall clock time this 
  533  approach saves in reality, it may in fact be a whole lot faster 
 534    because ``k`` is usually only a fraction of the length of the input 
 535    strings. 
 536   
 537    And if you take a close look at the algorithms, you may realize that 
 538    searching may be sped up by using an algorithm that doesn't actually 
 539    compute the distance between two strings, but one that only answers 
 540    whether two strings have a distance less than or equal to a specified 
 541    maximum distance. This can be done in ``O(k^2)`` and should boost 
 542    search performance quite noticeably. 
 543   
 544    .. _Citeseer: http://citeseer.ist.psu.edu/navarro99guided.html 
 545   
 546   
 547    If you are interested in any of these optimizations, I suggest reading 
 548    Gonzalo Navarro's excellent survey "Guided Tour To Approximate String 
 549    Matching" which you can get from Citeseer_. 
 550   
 551    Currently, this module always uses the same function for searching and 
 552    indexing. If you want to test your own implementations, you have to 
 553    replace your index object's ``_func`` attribute after indexing and 
 554    before searching. 
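The bounded-check idea from the list above can at least be approximated with an early-exit variant of the classical dynamic-programming algorithm: abort as soon as no cell of the current row can stay within the limit. This sketch (made-up name, and not the ``O(k^2)`` algorithm from the survey) illustrates the idea:

```python
def within_distance(s, t, k):
    """True iff the edit distance between s and t is at most k."""
    if abs(len(s) - len(t)) > k:
        return False                        # length difference is a lower bound
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        if min(curr) > k:
            # the row minimum never decreases in later rows, so we can stop
            return False
        prev = curr
    return prev[-1] <= k
```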
 555   
 556   
 557  Contact 
 558  -------- 
 559   
 560  For comments, patches, flames and hugs send mail to: 
 561  jrspieker@well-adjusted.de. 
 562   
 563  """ 
 564   
 565  __docformat__ = 'restructuredtext en' 
 566  __author__    = '$Author: jrschulz $' 
 567  __version__   = '$Revision: 119 $' 
 568  __date__      = '$Date: 2012-09-17 02:13:52 +0200 (Mon, 17 Sep 2012) $' 
 569  __license__   = "GPL" 
 570   
 571  import sys, random, weakref 
572 573 -class MetricTree(object):
574 """ 575 576 Base class for metric space indexes which have tree-like properties. 577 It's implementation is not complete enough to make it actually 578 usable, but it should make it quite easy to implement other indexing 579 and searching strategies. In an ideal case, all you have to do is to 580 derive from this class, implement the methods `construct`, 581 `_get_child_candidates` and optionally `insert` and make `children` 582 an attribute or property returning all child nodes of the current 583 node. If you need an ``__init__`` method, you should call 584 ``super(YourClass, self).__init__(objects, func, parent)`` in its 585 end as well. 586 587 Instances of this class (and its subclasses, of course) are supposed 588 to be recursive data structures. This means that they may be treated 589 like the root of a tree whether they have a parent or not. This 590 means every node has: 591 592 - a list of `_values` with every element of that list having a 593 distance of zero to all other elements in the list. Because of 594 triangle inequality, you only need to check this condition against 595 one element of this list. 596 597 - a `_parent` (`None` for root nodes) 598 599 - a list of `children` (empty list for leaf nodes) 600 601 - a `_height` (one for trees consisting of only one node) 602 603 - a number of nodes `_num_nodes` (the total number of nodes in this 604 subtree) 605 606 - a `_size` (the total number of objects stored in this subtree) 607 608 Note that in many (most? some?) cases, implementors don't have to 609 fiddle with these values directly. Instead, there are some helper 610 methods like `_incr_node_counter` and `_incr_size` that may be 611 called at appropriate times. If in doubt, just implement what you 612 strictly need first and then take a look whether node, size and 613 height counting works as expected (and whether you care). 
614 615 As a bonus, this class implements some of python's magic methods so 616 that you can iterate over its instances' content and use the ``in`` 617 operator to test for the existence of an object using the distance 618 function. 619 620 """ 621 622 _parent = None 623 """ 624 625 The parent of this node. ``None`` for root nodes. 626 627 """ 628 629 _values = list() 630 """ 631 632 A list of values in this subtree. All objects in this list are 633 considered equal by the distance function in use. 634 635 """ 636 637 _size = 0 638 """ 639 640 The number of objects in this subtree. 641 642 """ 643 644 _func = 0 645 """ 646 647 The distance function used for indexing and searching. It has to 648 accept any two objects from the list of objects you are indexing as 649 arguments and return a non-negative integer (or long). 650 651 """ 652 653 _height = 0 654 """ 655 """ 656 657 _num_nodes = 0 658 """ 659 """ 660
661 - def __init__(self, objects=None, func=None, parent=None):
662 """ 663 664 Create a new metric tree. If ``objects`` and ``func`` are given, 665 the given objects will be indexed immediately using the distance 666 function which makes it possible to immediately start so search 667 for other objects in it. Otherwise, you have to call `construct` 668 later in order to make use of this metric tree. 669 670 """ 671 self._values = list() 672 self._size = 0 673 self._height = 1 674 self._func = func 675 self.parent = parent 676 self._num_nodes = 0 677 self._incr_node_counter() 678 self._calculate_height() 679 if objects and func: 680 self.construct(objects, func)
681
682 - def range_search(self, obj, min_dist=0, max_dist=0):
683 """ 684 685 Return a list of all objects in this subtree whose distance to 686 `obj` is at least `min_dist` and at most `max_dist`. `min_dist` 687 and `max_dist` have to satisfy the condition ``0 <= min_dist <= 688 max_dist``. 689 690 If this metric tree is empty, an empty list is returned, 691 regardless of whether this tree has been previously constructed 692 or not. 693 694 If calling the distance function fails for any reason, 695 `UnindexableObjectError` will be raised. 696 697 """ 698 assert( 0 <= min_dist <= max_dist ) 699 if not self: return list() 700 result, candidates = list(), [self] 701 while candidates: 702 node = candidates.pop() 703 distance = node._get_dist(obj) 704 if min_dist <= distance <= max_dist: 705 result.extend(node._values) 706 candidates.extend( node._get_child_candidates( 707 distance, min_dist, max_dist)) 708 return result
709
710 - def search(self, obj, max_dist):
        """

        Equivalent to ``range_search(obj, min_dist=0, max_dist=max_dist)``.

        """
        return self.range_search(obj, max_dist=max_dist)
717
718 - def nn_search(self, obj, num=1):
        """

        Perform a k-nearest-neighbour search and return a sorted list of
        at most ``num`` objects together with their distance to the
        query object (``(object, distance)`` tuples). Sorting is done by
        the distance (ascending). Of course, `num` has to be an ``int``
        (or ``long``, for that matter) larger than zero.

        For obvious reasons, this method cannot return more objects than
        are currently present in the tree. That means that if ``num >
        len(self)``, only ``len(self)`` pairs of ``(object, distance)``
        will be returned.

        If this metric tree is empty, an empty list is returned,
        regardless of whether this tree has been previously constructed
        or not.

        If several objects in this tree share the same maximum distance
        to `obj`, no guarantee is given about which of these objects
        will be returned and which are left out.

        If calling the distance function fails for any reason,
        `UnindexableObjectError` will be raised.

        """
        if num < 1: return list()
        if not self: return list()
        neighbours = list() # list of (value, distance) tuples
        candidates = [self]
        def cmp_neighbours(a, b): return cmp(a[1], b[1])
        while candidates:
            cand = candidates.pop()
            distance = cand._get_dist(obj)
            candidate_is_neighbour = bool([1 for n in neighbours
                                           if distance < n[1]])
            if candidate_is_neighbour or len(neighbours) < num:
                neighbours.extend([(v, distance) for v in cand._values])
                neighbours.sort(cmp_neighbours)
                if len(neighbours) > num:
                    neighbours = neighbours[:num]
            if len(neighbours) == num:
                max_dist = neighbours[-1][1]
                if max_dist == 0:
                    break
                candidates.extend(
                    cand._get_child_candidates(distance, 0, max_dist))
            else:
                candidates.extend(cand.children)
        return neighbours
768
769 - def _get_child_candidates(self, distance, min_dist, max_dist):
        """

        Return a sequence of child nodes that may contain objects with a
        distance between ``min_dist`` and ``max_dist`` (inclusive) to a
        certain query object. Note that the query object is *not*
        passed to this method. Instead, ``distance`` is the query
        object's previously calculated distance to this node.

        """
        raise NotImplementedError()
780
781 - def construct(self, objects, func):
782 """ 783 784 (Re)Index this space with the given ``objects`` using the 785 distance function ``func``. Previous contents will be 786 discarded. ``objects`` has to be a sequence or an iterable. The 787 distance function ``func`` needs to be applicable to all objects 788 contained in ``objects``. 789 790 If calling the distance function fails for any reason, 791 `UnindexableObjectError` will be raised. 792 793 """ 794 raise NotImplementedError()
795
796 - def insert(self, obj):
797 """ 798 799 Insert a single object into the metric tree. Returns ``self``, 800 i.e. the tree itself. 801 802 This method may not be implemented by all subclasses of 803 `MetricTree` since not all data structures allow to do this 804 efficiently. `NotImplementedError` will be raised when this is 805 the case. 806 807 If calling the distance function fails for any reason, 808 `UnindexableObjectError` will be raised. 809 810 """ 811 raise NotImplementedError()
812
813 - def is_root(self):
814 """ 815 816 Answer whether this node is the root of a tree (i.e. it has no 817 parent). 818 819 """ 820 return self.parent is None
821
822 - def is_leaf(self):
        """

        Answer whether this node is a leaf node (i.e. it has no
        children).

        """
        return not self.children
830
831 - def __children(self):
832 """ 833 834 A sequence of this node's children. 835 836 The possible number of children per node is 837 implementation-dependent. Leaf nodes return an empty sequence. 838 839 """ 840 raise NotImplementedError()
841 children = property(__children) 842
843 - def __num_nodes(self):
        """

        The number of nodes in this tree.

        This may be different from the number of objects contained in
        this tree: two or more objects share one node when they are
        considered equal by the distance function in use (i.e., for two
        objects ``p`` and ``q``, calling ``self._func(p, q)`` returns
        ``0``), and an empty tree still consists of one node (the root
        node) that does not contain any values.

        """
        return self._num_nodes
857 num_nodes = property(__num_nodes) 858 859
860 - def __height(self):
861 """ 862 863 The height of this tree. 864 865 Empty trees have a height of ``0``, trees containing one or more 866 objects have a height ``>= 1``. 867 868 """ 869 return self._height
870 height = property(__height) 871
872 - def __parent(self):
873 """ 874 875 The parent of this node. `None` if this node is the root 876 of a tree. 877 878 """ 879 return self._parent
880
881 - def __set_parent(self, parent):
882 """ 883 884 Set the parent of this node. 885 886 Parent references are stored as "weak references" to avoid 887 circular references which the garbage collector cannot dissolve 888 by itself. 889 890 """ 891 if parent is None: 892 self._parent = None 893 else: 894 # the callback ensures that the weakref is deleted 895 # as soon as the parent node disappears so that this 896 # node recognizes it is the new root. 897 callback = lambda proxy: self.__set_parent(None) 898 self._parent = weakref.proxy(parent, callback)
899 900 parent = property(__parent, __set_parent) 901
902 - def _incr_size(self, incr=1):
903 """ 904 905 Increment the size counter for this node and all its parents 906 recursively. 907 908 Should be called whenever an object is inserted into the tree. 909 910 """ 911 def f(node, incr=incr): node._size += incr 912 self._apply_upwards(f)
913
914 - def _calculate_height(self, recursive=True):
        """

        Set this node's height to one and (if `recursive` is `True`)
        propagate this change upwards in the tree.

        """
        self._height = height = 1
        if recursive:
            node = self.parent
            while node is not None:
                height += 1
                if node._height < height:
                    node._height = height
                node = node.parent
931
932 - def _incr_node_counter(self, incr=1):
933 """ 934 935 Increment the node counter for this node and all its parents 936 recursively. 937 938 Should be called whenever a new child of this node is created. 939 940 """ 941 def f(node, incr=incr): node._num_nodes += incr 942 self._apply_upwards(f)
943
944 - def _apply_upwards(self, func, **args):
945 """ 946 947 Helper method to apply a function to this node and all its 948 parents recursively. The given function must accept one node as 949 the first parameter and may accept arbitrary keyword parameters 950 as well. 951 952 """ 953 node = self 954 func(node, **args) 955 while not node.is_root(): 956 node = node.parent 957 func(node, **args)
958
959 - def _get_dist(self, obj):
        """

        Apply this node's distance function to the given object and one
        of this node's values.

        Raises `UnindexableObjectError` when distance computation fails.

        """
        global dist_ctr
        try:
            distance = self._func(self._values[0], obj)
            dist_ctr += 1
        except IndexError, e:
            sys.stderr.write("Node is empty, cannot calculate distance!\n")
            raise e
        except Exception, e:
            # UnindexableObjectError expects the message first and the
            # triggering exception second.
            raise UnindexableObjectError(
                "Cannot calculate distance between objects %s and %s using %s"
                % (self._values[0], obj, self._func), e)
        return distance
980
981 - def __iter__(self):
982 """ 983 984 A generator that yields all objects in this node and its 985 children by doing recursive pre-order traversal. Implementors 986 might choose to use another traversal method which better suits 987 their data structure. 988 989 Note that objects are returned in no specific order. 990 991 This implementation will always return objects in the same order 992 as long as the tree's content does not change and the 993 implementation of `children` always returns the children in the 994 same order. If `children` is not implemented at all, 995 `NotImplementedError` will be raised. 996 997 """ 998 for obj in self._values: 999 yield obj 1000 for child in self.children: 1001 for obj in child: 1002 yield obj
1003 1004 itervalues = __iter__ 1005
1006 - def iternodes(self):
1007 """ 1008 1009 A generator that yields all nodes in this subtree by doing 1010 recursive pre-order traversal. Implementors might choose to use 1011 another traversal method which better suits their data 1012 structure. 1013 1014 Note that objects are returned unordered. 1015 1016 This implementation will always return nodes in the same order 1017 as long as the tree's content does not change and the 1018 implementation of `children` always returns the children in the 1019 same order. If `children` is not implemented at all, 1020 `NotImplementedError` will be raised. 1021 1022 """ 1023 yield self 1024 for child in self.children: 1025 for node in child.iternodes(): 1026 yield node
1027
1028 - def __nonzero__(self):
1029 """ 1030 1031 Return True if this node contains any objects. 1032 1033 """ 1034 return len(self._values) > 0
1035
1036 - def __len__(self):
1037 """ 1038 1039 Return the number of objects in this subtree 1040 1041 """ 1042 return self._size
1043
1044 - def __contains__(self, item):
1045 """ 1046 1047 Search for objects with a distance of zero to `item` and return 1048 True if something is found, otherwise False. 1049 1050 Note that this does **not** check for object identity! Instead, 1051 the definition of equality is delegated to the distance function 1052 in use. 1053 1054 """ 1055 return len(self.range_search(item)) > 0
1056
1057 - def __repr__(self):
1058 if self: 1059 return str(self.__class__) + ": " + str(self._values[0]) 1060 else: 1061 return "<empty node>"
1062
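The magic-method contract described in the class docstring (truth value, ``len``, iteration, and ``in`` delegating equality to the distance function rather than to object identity) can be illustrated with a minimal, hypothetical stand-in that does not depend on this module:

```python
# Minimal stand-in illustrating the protocol MetricTree promises.
# FlatIndex and case_insensitive are illustrative names only.
class FlatIndex(object):
    """A deliberately naive 'index': a list plus a distance function."""
    def __init__(self, objects, func):
        self._objects = list(objects)
        self._func = func

    def __len__(self):
        return len(self._objects)

    def __iter__(self):
        return iter(self._objects)

    def __contains__(self, item):
        # 'in' means: some indexed object has distance zero to item
        return any(self._func(o, item) == 0 for o in self._objects)

case_insensitive = lambda a, b: 0 if a.lower() == b.lower() else 1
idx = FlatIndex(["Foo", "Bar"], case_insensitive)
assert "FOO" in idx          # equal under the metric, not identical
assert "Baz" not in idx
assert len(idx) == 2
```

Note how ``__contains__`` never compares objects directly; the definition of equality is entirely the metric's business, exactly as in `MetricTree.__contains__` below.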
1063 1064 -class BKTree(MetricTree):
    """

    "Burkhard-Keller trees" are unbalanced multiway trees which may grow
    over time. They were among the first general solutions for indexing
    arbitrary metric spaces.

    Every node in a BKTree stores a list of objects which all have a
    distance of zero to each other, i.e. all objects are considered to
    be equal by the distance function in use. Additionally, every node
    keeps a dictionary. In this dictionary, every child is stored,
    referenceable by its distance to its parent.

    Essentially, every node in a BKTree divides its data set ``S`` into
    subsets ``S^i`` so that every element in ``S^i`` has the distance
    ``i`` to one object arbitrarily picked from ``S`` (and stored in
    this node). For each ``S^i``, a new child node is created and its
    parent keeps a reference to this child together with its distance.

    Insertion of a single object ``o`` in a node ``n`` is quite easy:

    1. If ``n`` is empty, store ``o`` in ``n``. That's it.

    2. Otherwise, calculate the distance between ``n``'s value and ``o``.

    3. If the distance is zero, append ``o`` to the list of
       objects in this node and return.

    4. Otherwise, look for a child of ``n`` stored under the calculated
       distance. If there is such a child, go back to 2. with this child
       being the new ``n``.

    5. Otherwise, create a new node containing ``o`` and store it as a
       child of ``n`` under its calculated distance.

    Searching is done recursively by first calculating the distance of
    the query object to the current node and then using the triangle
    inequality to determine which children may contain other search
    results.

    Runtime complexity for the construction phase is ``O(n log(n))``,
    searching is ``O(n^a)`` where ``0 <= a <= 1``. Space requirement is
    ``O(n)``.
1107 1108 This implementation is close to the original description of the 1109 algorithm and can only handle discrete distances. If your distance 1110 function returns floating point values, it will appear to work at 1111 indexing time but will most probably be horribly slow when 1112 searching. 1113 1114 """ 1115 1116 __slots__ = [ '_parent', '_values', '_size', '_func', '_height', 1117 '_num_nodes', '_children' ] 1118
1119 - def __init__(self, objects=None, func=None, parent=None):
        self._children = {}
        if callable(objects):
            # materialize a generator factory and shuffle the result to
            # avoid degenerate trees for sorted input
            objects = list(objects())
            random.shuffle(objects)
        super(BKTree, self).__init__(objects, func, parent)
1124
1125 - def construct(self, objects, func):
1126 self._func = func 1127 self._children = {} 1128 for o in objects: 1129 self.insert(o) 1130 return self
1131
1132 - def insert(self, obj):
1133 if not self: 1134 self._values.append(obj) 1135 self._incr_size() 1136 return self 1137 node = self 1138 while True: 1139 distance = node._get_dist(obj) 1140 if distance == 0: 1141 node._values.append(obj) 1142 node._incr_size() 1143 break 1144 child = node._children.get(distance, None) 1145 if child is None: 1146 child = BKTree([obj], node._func, node) 1147 node._children[distance] = child 1148 break 1149 node = child 1150 return self
1151
1152 - def _get_child_candidates(self, distance, min_dist, max_dist):
1153 assert( min_dist <= max_dist ) 1154 return (child for dist, child in self._children.iteritems() 1155 if distance - max_dist <= dist <= distance + max_dist)
1156
1157 - def __children(self):
1158 children = self._children 1159 sorted_dists = sorted(children.keys()) 1160 return [children[dist] for dist in sorted_dists]
1161 children = property(__children)
1162
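The five insertion steps from the BKTree docstring, together with the triangle-inequality pruning used at search time, can be sketched standalone (a minimal illustration with plain lists and dicts, not this module's API; `bk_insert` and `bk_search` are hypothetical names):

```python
# Sketch of the BK-tree insertion rule: equal objects pile up in one
# node, everything else descends by its distance to the node's value.
def bk_insert(node, obj, func):
    """node is [values, {distance: child}]; returns the (possibly new) root."""
    if node is None:
        return [[obj], {}]
    values, children = node
    d = func(values[0], obj)
    if d == 0:
        values.append(obj)
    elif d in children:
        bk_insert(children[d], obj, func)
    else:
        children[d] = [[obj], {}]
    return node

def bk_search(node, obj, max_dist, func):
    """Range search; the triangle inequality prunes whole subtrees."""
    values, children = node
    d = func(values[0], obj)
    result = list(values) if d <= max_dist else []
    for dist, child in children.items():
        if d - max_dist <= dist <= d + max_dist:
            result.extend(bk_search(child, obj, max_dist, func))
    return result

metric = lambda a, b: abs(a - b)   # absolute difference is a metric on ints
root = None
for n in [5, 9, 2, 5, 7]:
    root = bk_insert(root, n, metric)
assert sorted(bk_search(root, 6, 1, metric)) == [5, 5, 7]
```

The pruning condition mirrors `_get_child_candidates` below: a child at distance ``dist`` from this node can only contain results if ``|dist - d| <= max_dist``.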
1163 -class FastBKTree(object):
    """

    Simpler version of BKTree. The tree's content is kept in nested
    dictionaries where a "node"'s value is simply stored at key -1 and
    its children are stored using their distance to the value as the
    key. Example::

        self.root = { -1: 'ABC',
                       0: { -1: 'ABC' },
                       1: { -1: 'ABCD',
                             1: { -1: 'ABCE' }
                           },
                       2: { -1: 'A',
                             1: { -1: 'B',
                                   1: { -1: 'C' }
                                 }
                           }
                    }

    """
1186 - def __init__(self, objects, metric):
1187 self.root = None 1188 self.metric = metric 1189 self.size = 0 1190 self.height = 0 1191 self.num_nodes = 0 1192 for o in objects: 1193 self.insert(o)
1194
1195 - def insert(self, obj):
1196 global dist_ctr 1197 #value_of = lambda node: node[-1] 1198 #new_node = lambda obj: { -1: obj } 1199 self.size += 1 1200 if self.root is None: 1201 self.root = { -1: obj } 1202 return 1203 node = self.root 1204 while True: 1205 distance = self.metric(node[-1], obj) 1206 dist_ctr += 1 1207 if distance in node: 1208 node = node[distance] 1209 else: 1210 node[distance] = { -1: obj } 1211 break
1212
1213 - def range_search(self, obj, min_dist=0, max_dist=0):
        global dist_ctr
        if self.root is None:
            return []
        candidates = [self.root]
        result = []
        while candidates:
            node = candidates.pop()
            distance = self.metric(node[-1], obj)
            dist_ctr += 1
            if min_dist <= distance <= max_dist:
                result.append(node[-1])
            candidates.extend( child for dist, child in node.iteritems()
                               if distance - max_dist <= dist <= distance + max_dist
                               and dist != -1)
        return result
1227
1228 - def search(self, obj, max_dist):
1229 return self.range_search(obj, min_dist=0, max_dist=max_dist)
1230
1231 - def nn_search(self, obj, num=1):
        if not self: return list()
        global dist_ctr
        neighbours = list() # list of (value, distance) tuples
        candidates = [self.root]
        def cmp_neighbours(a, b): return cmp(a[1], b[1])
        while candidates:
            cand = candidates.pop()
            distance = self.metric(cand[-1], obj)
            dist_ctr += 1
            candidate_is_neighbour = any(1 for n in neighbours
                                         if distance < n[1])
            if candidate_is_neighbour or len(neighbours) < num:
                neighbours.append((cand[-1], distance))
                neighbours.sort(cmp_neighbours)
                if len(neighbours) > num:
                    neighbours = neighbours[:num]
            if len(neighbours) == num:
                max_dist = neighbours[-1][1]
                candidates.extend(
                    child for dist, child in cand.iteritems()
                    if distance - max_dist <= dist <= distance + max_dist
                    and dist != -1)
            else:
                candidates.extend(child for dist, child in
                                  cand.iteritems() if dist != -1)
        return neighbours
1258 1259
1260 - def __len__(self):
1261 return self.size
1262
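The nested-dictionary layout from the FastBKTree docstring can be reproduced in a few lines (a sketch of the same idea; `fbk_insert` is a hypothetical name, and the length-difference "metric" is for illustration only, since it is not positive-definite):

```python
# The nested-dict layout: a node's own value lives at key -1,
# children are keyed by their distance to that value.
def fbk_insert(root, obj, metric):
    if root is None:
        return {-1: obj}
    node = root
    while True:
        d = metric(node[-1], obj)
        if d in node:
            node = node[d]      # descend into the existing child
        else:
            node[d] = {-1: obj} # create a new child at this distance
            break
    return root

metric = lambda a, b: abs(len(a) - len(b))  # toy distance: length difference
root = None
for word in ["ABC", "ABCD", "A"]:
    root = fbk_insert(root, word, metric)
assert root[-1] == "ABC"
assert root[1][-1] == "ABCD"   # distance 1 from 'ABC'
assert root[2][-1] == "A"      # distance 2 from 'ABC'
```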
1263 1264 -class VPTree(MetricTree):
1265 1266 """ 1267 1268 .. _Jeffrey K. Uhlmann: http://www.pnylab.com/pny/papers/vptree/vptree/ 1269 1270 .. _Peter Yianilos: http://www.intermemory.net/pny/papers/vptree/vptree.ps 1271 1272 A "Vantage Point Tree" is a static, balanced, binary tree which can 1273 be used to index metric spaces with a non-discrete distance 1274 function. It has been independently described by both `Jeffrey K. 1275 Uhlmann`_ (1991) and `Peter Yianilos`_ (1993). 1276 1277 Construction is done by picking one object (the "vantage point") 1278 from the set of objects to index and then determining the distance 1279 of all remaining objects to the vantage point. The median distance 1280 is calculated and the set of remaining objects is split in two: 1281 objects whose distance to the vantage point is smaller than the 1282 median distance and the rest (objects whose distance to the vantage 1283 point is larger than or equal to the median). Yianilos called this 1284 process "ball decomposition". For both of the resulting sets new 1285 child nodes, called "left" and "right", are created recursively. The 1286 vantage point and the previously calculated median distance are 1287 stored in the current node. 1288 1289 When searching a node, the distance between the current node's value 1290 and the search object is calculated. If it is ``<= k`` (the given 1291 maximum distance), this node's value is added to the list of search 1292 results. Searching proceeds recursively into the left subtree if 1293 ``distance - k < median`` and into the right subtree if ``distance + 1294 k >= median``. 1295 1296 Since the construction phase takes advantage of the median of all 1297 distances, VPTrees do a fairly good job of balancing their subtrees. 
    Only the fact that all objects whose distance to the vantage point
    equals the median distance always end up on the same side makes
    VPTrees tend to lean towards the side containing these objects when
    a discrete distance function is used.

    Runtime complexity for the construction phase is ``O(n log(n))``,
    searching is ``O(log(n))``. Space requirement is ``O(n)``.

    """

    __slots__ = [ '_parent', '_values', '_size', '_func', '_height',
                  '_num_nodes', '_median', '_leftchild', '_rightchild' ]
1311 - def __init__(self, objects=None, func=None, parent=None):
1312 self._median = None 1313 self._leftchild = None 1314 self._rightchild = None 1315 super(VPTree, self).__init__(objects, func, parent)
1316
1317 - def construct(self, objects, func):
        self._func = func
        if objects:
            if self.is_root():
                # when building the root of the tree, materialize
                # `objects` as a list. (Shuffling it here could improve
                # VP picking and decomposing, but is currently disabled.)
                objects = list(objects)
                #random.shuffle(objects)
            self._values = [self._pick_VP(objects)]
            left, right = self._decompose(objects)
            del objects # we don't need that list anymore so release it
                        # before doing recursive calls
            self._incr_size(len(self._values))
            if left:
                self._leftchild = VPTree(left, func, self)
            if right:
                self._rightchild = VPTree(right, func, self)
        return self
1336
1337 - def _pick_VP(self, objects):
1338 # this probably makes no sense whatsoever, simply pop()ing would 1339 # do just as well, I guess. Need to think about good strategies. 1340 if len(objects) > 15: 1341 sample = objects[:5] 1342 max_diff = -1 1343 vp = None 1344 for o in sample: 1345 dists = [ self._func(other, o) for other in sample 1346 if other is not o] 1347 diff = max(dists) - min(dists) 1348 if diff > max_diff: 1349 max_diff, vp = diff, o 1350 objects.remove(vp) 1351 return vp 1352 else: 1353 return objects.pop()
1354
1355 - def _decompose(self, objects):
1356 """ 1357 1358 Perform the process called "ball decomposition" by Peter 1359 Yianilos. 1360 1361 `objects` has to be an iterable that yields objects applicable 1362 to the metric function in use. The return value is a tuple of 1363 two lists: one list that contains all elements having a distance 1364 smaller than the median distance to this node's value, the 1365 second list contains all objects whose distance is equal to or 1366 larger than the median distance of all given `objects` to this 1367 node's value. 1368 1369 """ 1370 dist_per_obj = list() 1371 for obj in objects: 1372 distance = self._get_dist(obj) 1373 if distance == 0: 1374 self._values.append(obj) 1375 else: 1376 dist_per_obj.append( (distance, obj) ) 1377 if dist_per_obj: 1378 self._median = VPTree.determine_median(zip(*dist_per_obj)[0]) 1379 left = [ obj for dist, obj in dist_per_obj 1380 if dist < self._median ] 1381 right = [ obj for dist, obj in dist_per_obj 1382 if dist >= self._median ] 1383 else: 1384 left, right = None, None 1385 return left, right
1386 1387 @staticmethod
1388 - def determine_median(numbers):
1389 """ 1390 1391 Determine the median from a sequence of numbers (or anything 1392 else that can be ``sorted()``). 1393 1394 This does not use an optimal ``O(n)`` algorithm but instead 1395 relies on CPython's speed when sorting (``O(n log(n))``). 1396 1397 """ 1398 return sorted(numbers)[ len(numbers) / 2 ]
1399
1400 - def _get_child_candidates(self, distance, min_dist, max_dist):
1401 if self._leftchild and distance - max_dist < self._median: 1402 yield self._leftchild 1403 if self._rightchild and distance + max_dist >= self._median: 1404 yield self._rightchild
1405
1406 - def __children(self):
1407 return [child for child in (self._leftchild, self._rightchild) 1408 if child]
1409 children = property(__children)
1410
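Yianilos's "ball decomposition" described in the VPTree docstring amounts to splitting the remaining objects around the median distance to the vantage point. A self-contained sketch (hypothetical `decompose`, not this module's `_decompose`):

```python
# Split objects around the median distance to the vantage point:
# strictly closer objects go left, the rest (>= median) go right.
def decompose(vantage, objects, func):
    dists = sorted(func(vantage, o) for o in objects)
    median = dists[len(dists) // 2]   # same upper-median choice as
                                      # determine_median() above
    left = [o for o in objects if func(vantage, o) < median]
    right = [o for o in objects if func(vantage, o) >= median]
    return median, left, right

metric = lambda a, b: abs(a - b)
median, left, right = decompose(10, [2, 8, 11, 15, 30], metric)
assert median == 5                    # distances are 8, 2, 1, 5, 20
assert sorted(left) == [8, 11]        # distance < 5
assert sorted(right) == [2, 15, 30]   # distance >= 5
```

Because elements at exactly the median distance always land on the right, discrete metrics with many ties skew the split, which is the imbalance the docstring mentions.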
1411 1412 #class BSTree(object): 1413 # 1414 # class Sector(object): 1415 # """ 1416 # 1417 # * _values contains objects with distance == 0 to each other 1418 # 1419 # * _radius is the largest distance of all objects to this node's 1420 # values. 1421 # 1422 # * _child is a reference to a BSTree. 1423 # 1424 # * _node is a reference to the BSTree this Sector belongs to. 1425 # 1426 # """ 1427 # 1428 # __slots__ = [ '_values', '_radius', '_child', '_node' ] 1429 # 1430 # def __init__(self, node, *values): 1431 # self._node = node 1432 # self._values = list(values) 1433 # 1434 # def insert(self, obj): 1435 # if not self._values: 1436 # self._values = [obj] 1437 # return self._node 1438 # else: 1439 # distance = MetricTree._get_dist(self, obj) 1440 # self._radius = max(self._radius, distance) 1441 # return self._child.insert(obj) 1442 # 1443 # def __contains__(self, obj): 1444 # return MetricTree.__contains__(self, obj) 1445 # 1446 # 1447 # __slots__ = [ 1448 # '_parent', '_size', '_func', '_height', '_num_nodes', 1449 # '_left_sector', '_right_sector' 1450 # ] 1451 # 1452 # def __init__(self, objects=None, func=None, parent=None): 1453 # self._left_sector = Sector() 1454 # self._right_sector = Sector() 1455 # super(BSTree, self).__init__(objects, func, parent) 1456 # 1457 # def construct(self, objects, func): 1458 # split = ([], []) 1459 # if not self._left_values or not self._right_values: 1460 # pass # TODO 1461 # for obj in objects: 1462 # to_left = this._func(self._left_values[0]) 1463 # to_right = this._func(self._right_values[0]) 1464 # 1465 # def decompose(self, objects): 1466 # return objects.pop() 1467 # 1468 # def pick_good_sector_values(self, objects): 1469 # return objects.pop(), objects.pop() 1470 # 1471 # def __nonzero__(self): 1472 # pass 1473 # 1474 # def __iter__(self): 1475 # pass 1476 # 1477 # def __repr__(self): 1478 # pass 1479 1480 -class UnindexableObjectError(Exception):
1481 """ 1482 1483 This Exception is thrown when the call to the distance function in 1484 use by a metric space throws an exception. This should not happen 1485 during regular usage and is a hint that either your distance 1486 function is buggy or you tried to index objects which your distance 1487 function cannot handle. 1488 1489 """ 1490
1491 - def __init__(self, msg, e):
1492 self.orig_exception = e 1493 Exception.__init__(self, msg)
    orig_exception = None
    """ The exception that has triggered this exception. """
1499
1500 1501 -def levenshtein(x, y):
    """

    Compute Levenshtein (or "edit") distance between two strings.
    Levenshtein distance between two strings ``x`` and ``y`` is defined
    as the minimal number of operations on single characters that it
    takes to transform ``x`` into ``y`` (or vice versa).

    Allowed operations are:

    - Insertion,
    - Deletion and
    - Replacement

    of a single character.

    Levenshtein distance has all the properties of a strictly positive
    metric. That means for all x, y, z it holds:

    - x == y <=> levenshtein(x, y) == 0
    - x != y <=> levenshtein(x, y) > 0 (positivity)
    - levenshtein(x, y) == levenshtein(y, x) (symmetry)
    - levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z)

    The upper bound for Levenshtein distance is the length of the longer
    string: ``max(len(x), len(y))``. A general lower bound is
    ``abs(len(x) - len(y))``; this bound is attained when one string is
    a prefix or suffix of the other one.

    Incidentally, this implementation not only works for strings, but
    for all types of sequences. Objects in the given sequences are
    compared for equality using '==' to determine whether an edit
    operation is needed.

    """
    if x == y:
        return 0 # equal strings have distance zero
    m, n = len(x), len(y)
    if not (m and n): # if one of the lengths is zero,
        return m or n # we can return the other one.
    if n > m:
        x, y = y, x # swap the strings so that y (and n)
        m, n = n, m # is the shorter one (saves space for the rows)
    prev = range(n+1) # DP row for the empty prefix of x
    cur = [0] * (n+1) # DP row currently being computed
    for i, char_x in enumerate(x):
        cur[0] = i+1
        for j, char_y in enumerate(y):
            if char_x == char_y:
                cur[j+1] = prev[j]
            else:
                cur[j+1] = min(cur[j], prev[j], prev[j+1]) + 1
        prev, cur = cur, prev
    return prev[-1]
1555
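As a quick sanity check of the documented properties (metric axioms, upper and lower bounds), here is a self-contained, Python 2/3-compatible copy of the row-based DP above:

```python
# Self-contained copy of the two-row Levenshtein DP, used only to
# double-check the properties stated in the docstring.
def levenshtein(x, y):
    if x == y:
        return 0
    m, n = len(x), len(y)
    if not (m and n):          # one empty sequence: distance is the
        return m or n          # length of the other one
    if n > m:
        x, y, m, n = y, x, n, m  # keep y the shorter sequence
    prev = list(range(n + 1))    # row for the empty prefix of x
    for i, cx in enumerate(x):
        cur = [i + 1] + [0] * n  # row currently being computed
        for j, cy in enumerate(y):
            if cx == cy:
                cur[j + 1] = prev[j]
            else:
                cur[j + 1] = min(cur[j], prev[j], prev[j + 1]) + 1
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
assert levenshtein("", "abc") == 3       # upper bound: max length
assert levenshtein("abc", "abcd") == 1   # lower bound: length difference
assert levenshtein("abc", "cba") == levenshtein("cba", "abc")  # symmetry
```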
1556 -def tokenizer(textfile, separator=None):
        """

        Simple generator that tokenizes a sequence of strings (or a
        file-like object) into uppercase words using the optional
        `separator`. The `separator` is passed directly to ``str.split``,
        so ``split``'s behaviour concerning `None` as separator applies
        here as well.

        After the last token has been returned, there is an attempt to
        close `textfile`. If this raises an `AttributeError`, it will be
        silently ignored.

        """
        for line in textfile:
            for token in line.split(separator):
                yield token.upper()
        try:
            textfile.close()
        except AttributeError, e:
            pass
1576 1577 dist_ctr = 0 1578 1579 try: 1580 import psyco 1581 psyco.bind(levenshtein) 1582 psyco.bind(VPTree) 1583 psyco.bind(BKTree) 1584 psyco.bind(FastBKTree) 1585 except ImportError, e: 1586 sys.stderr.write("Psyco not available, proceeding without it.\n") 1587 1588 if __name__ == '__main__': 1589 import codecs, time 1590 dumpfilename = "vptree.dump" 1591 f = codecs.open(sys.argv[1], encoding="iso-8859-1") 1592 names = tokenizer(f) 1593 t = BKTree() 1594 t.construct(names, levenshtein) 1595 f.close() 1596 print type(t) 1597 s = raw_input("press enter") 1598 print "t height: %s" % t.height 1599 print "t size: %s" % len(t) 1600 print "t nodes %s" % t.num_nodes 1601 1602 #print "results: %s " % t.search("MARTIN", 1) 1603 1604 f = codecs.open(sys.argv[1], encoding="iso-8859-1") 1605 names = list(tokenizer(f)) 1606 random.shuffle(names) 1607 1608 start = time.time() 1609 for i in range(0, 10): 1610 t.search(names.pop(), 1) 1611 end = time.time() 1612 print end - start 1613 1614 # vim: set et ts=4 sw=4 tw=72 nu: 1615