Algorithm Implementation/Strings/Levenshtein distance

The implementations of the Levenshtein algorithm on this page are illustrative only. Applications will, in most cases, use implementations that allocate heap memory sparingly, in particular when large lists of words are compared to each other. The following remarks indicate some of the variations on this and related topics:


 * Most implementations use one- or two-dimensional arrays to store the distances between prefixes of the words compared. In most applications the size of these structures is known in advance. This is the case when, for instance, the distance is relevant only if it is below a certain maximum allowed distance (as happens when words are selected from a dictionary to approximately match a given word). In this case the arrays can be preallocated and reused across successive runs of the algorithm on successive words.
 * Using a maximum allowed distance puts an upper bound on the search time. The search can be stopped as soon as the minimum Levenshtein distance between prefixes of the strings exceeds the maximum allowed distance.
 * Deletion, insertion, and replacement of characters can be assigned different weights. The usual choice is to set all three weights to 1. Different values for these weights allow for more flexible search strategies in lists of words.
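The first two remarks can be illustrated with a minimal Python sketch. The function name `bounded_levenshtein`, the `max_dist` parameter and the optional `prev`/`curr` scratch rows are hypothetical conventions chosen for this example, not part of any implementation on this page. The early exit relies on the fact that the minimum of a row never decreases from one row to the next.

```python
def bounded_levenshtein(a, b, max_dist, prev=None, curr=None):
    """Return the Levenshtein distance of a and b, or None if it exceeds max_dist.

    prev and curr are optional scratch rows (hypothetical convention) that a
    caller can preallocate once and pass in again for every comparison.
    """
    # The distance is at least the difference in lengths.
    if abs(len(a) - len(b)) > max_dist:
        return None
    n = len(b)
    # Allocate only when no sufficiently large scratch row was supplied.
    if prev is None or len(prev) < n + 1:
        prev = [0] * (n + 1)
    if curr is None or len(curr) < n + 1:
        curr = [0] * (n + 1)
    for j in range(n + 1):
        prev[j] = j
    for i, ca in enumerate(a, start=1):
        curr[0] = i
        row_min = curr[0]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
            row_min = min(row_min, curr[j])
        # No later row can contain a value below this row's minimum.
        if row_min > max_dist:
            return None
        prev, curr = curr, prev
    return prev[n] if prev[n] <= max_dist else None


# Usage: filter a word list with two rows allocated once.
words = ["kitten", "mitten", "sitting", "banana"]
row_a, row_b = [0] * 32, [0] * 32
matches = [w for w in words
           if bounded_levenshtein("kitten", w, 2, row_a, row_b) is not None]
print(matches)  # ['kitten', 'mitten']
```

A production version would typically also restrict each row to a diagonal band of width proportional to `max_dist`; that refinement is left out here to keep the sketch short.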

C
The above can be optimized to use O(min(m,n)) space instead of O(mn). The key observation is that we only need to access the contents of the previous column when filling the matrix column-by-column. Hence, we can re-use a single column over and over, overwriting its contents as we proceed.
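The column re-use described above can be sketched as follows, in Python for brevity rather than C. The function name `levenshtein_single_column` and the scalar `diagonal` are illustrative choices, not taken from the C code; the scalar keeps the one value of the previous column that would otherwise be overwritten before it is needed.

```python
def levenshtein_single_column(a, b):
    """Levenshtein distance using one column of min(len(a), len(b)) + 1 cells."""
    # Put the shorter string along the column to minimise the buffer size.
    if len(a) < len(b):
        a, b = b, a
    column = list(range(len(b) + 1))
    for x, ca in enumerate(a, start=1):
        diagonal = column[0]  # cell (x-1, 0) of the conceptual matrix
        column[0] = x
        for y, cb in enumerate(b, start=1):
            above_left = diagonal
            diagonal = column[y]  # old value; serves as the diagonal for y+1
            column[y] = min(column[y] + 1,             # cell (x-1, y)
                            column[y - 1] + 1,         # cell (x, y-1)
                            above_left + (ca != cb))   # cell (x-1, y-1)
    return column[-1]
```

Because the distance with unit costs is symmetric, swapping the arguments to keep the shorter string along the column does not change the result.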

C++
Please note that using a nested, two-dimensional container isn't efficient here, since the algorithm does not need a two-dimensional array.

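Here's another implementation that uses only $$\sim \min(m, n)$$ memory. It also uses the fact that $$\operatorname{lev}_{a,b}(i-1, j-1) \leq \operatorname{lev}_{a,b}(i, j)$$, so when the two current characters match, the diagonal value can be taken directly and there is no need to take the minimum of the three possibilities: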
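The shortcut can be justified in a few lines. This is a short derivation under the usual unit-cost recurrence, writing $$a_i$$ and $$b_j$$ for the $$i$$-th and $$j$$-th characters and $$[a_i \neq b_j]$$ for 1 if they differ and 0 otherwise (notation introduced here for illustration):

$$\operatorname{lev}_{a,b}(i, j) = \min\{\operatorname{lev}_{a,b}(i-1, j) + 1,\ \operatorname{lev}_{a,b}(i, j-1) + 1,\ \operatorname{lev}_{a,b}(i-1, j-1) + [a_i \neq b_j]\}$$

If $$a_i = b_j$$, the third term gives $$\operatorname{lev}_{a,b}(i, j) \leq \operatorname{lev}_{a,b}(i-1, j-1)$$. Combined with $$\operatorname{lev}_{a,b}(i-1, j-1) \leq \operatorname{lev}_{a,b}(i, j)$$, this yields

$$\operatorname{lev}_{a,b}(i, j) = \operatorname{lev}_{a,b}(i-1, j-1) \quad \text{if } a_i = b_j,$$

$$\operatorname{lev}_{a,b}(i, j) = 1 + \min\{\operatorname{lev}_{a,b}(i-1, j),\ \operatorname{lev}_{a,b}(i, j-1),\ \operatorname{lev}_{a,b}(i-1, j-1)\} \quad \text{if } a_i \neq b_j.$$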

Here is an implementation of the generalized Levenshtein distance with different costs for insertion, deletion and replacement:
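As a language-neutral illustration of the same idea, here is a minimal Python sketch. The function name `weighted_levenshtein` and the parameter names `ins_cost`, `del_cost` and `sub_cost` are hypothetical. The distance is measured as the cost of turning `a` into `b`, so deletions remove characters of `a` and insertions add characters of `b`.

```python
def weighted_levenshtein(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Cost of transforming a into b with per-operation weights."""
    # Row 0: build each prefix of b from the empty string by insertions.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Column 0: erase the first i characters of a by deletions.
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                          # delete ca
                curr[j - 1] + ins_cost,                      # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost)  # keep or replace
            ))
        prev = curr
    return prev[-1]
```

With unequal insertion and deletion costs the result is no longer symmetric in its arguments, and a replacement costing more than one deletion plus one insertion is never chosen: `weighted_levenshtein("a", "b", sub_cost=3)` evaluates to 2.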

C#
An implementation with reduced memory usage

The Damerau-Levenshtein distance is computed in asymptotic time O((max + 1) * min(first.length, second.length)).
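For readers unfamiliar with the transposition rule, here is a minimal Python sketch of the restricted variant (optimal string alignment), which adds swaps of adjacent characters to the usual recurrence. It is an assumption that this variant is the relevant one here. The sketch fills the whole table in O(mn) time and does not implement the bounded computation behind the time bound quoted above. The name `osa_distance` is hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    prev2 = None                      # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))    # row i-1
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or replacement
            # Transposition: the last two characters appear swapped.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                curr[j] = min(curr[j], prev2[j - 2] + 1)
        prev2, prev = prev, curr
    return prev[-1]
```

In this restricted variant no substring is edited more than once, so `osa_distance("ca", "abc")` is 3, whereas the unrestricted Damerau-Levenshtein distance of that pair is 2.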

Clojure
Another implementation using a transient data structure, inspired by the Common Lisp version:

Delphi
A simple implementation that can certainly be improved upon.

F#
The inlined min function gives a big speed boost.

And here's a slightly faster lazy version.

Go
This version uses dynamic programming with a time complexity of $$O(mn)$$, where $$m$$ and $$n$$ are the lengths of $$a$$ and $$b$$. The space required is $$n+1$$ integers plus some constant space (i.e. $$O(n)$$).

Groovy
This version is based on the Java version below

Haskell
Tested with GHCi.

And here's a slightly faster version.

For large strings, using arrays is much faster.

As a recursively defined array:

And finally, a fast but cryptic implementation:

JavaScript
Slow recursive version:

Faster approach using dynamic programming:

Java
Not recursive and faster

Julia
Straightforward recursive implementation. Do not use for large strings!

Test with

MySQL
This implementation seems to be broken; but I am not entirely sure how to fix it.

The implementation below, thanks to Jason Rust at Code Janitor, seems to be more complete.

Objective-C
Or with extensions "and C flavor"

PHP
Please note that PHP has provided a standard library function levenshtein since version 4.0.1. In PHP versions before 8.0 it is limited to comparing strings of no more than 255 characters in length, which limits its utility.

plpgsql
Adapted from the PHP version, using a one-dimensional array.

Python
The first version is a Dynamic Programming algorithm, with the added optimization that only the last two rows of the dynamic programming matrix are needed for the computation:

Second version:

(Note that while this implementation is very compact, its runtime is very poor, making it unusable in practice. One way to make it significantly faster would be to add a memoization decorator, but that does not make it optimal because of the many string copies.)
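The memoization idea can be sketched with `functools.lru_cache` from the Python standard library. To sidestep the string copies mentioned above, this sketch recurses on prefix lengths instead of sliced strings; the function name `levenshtein_memo` is hypothetical. It is still recursive, so long inputs can hit Python's recursion limit, and it stores all (len(a) + 1) * (len(b) + 1) subresults.

```python
from functools import lru_cache


def levenshtein_memo(a, b):
    """Recursive Levenshtein distance with memoized subproblems."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # dist(i, j) is the distance between a[:i] and b[:j].
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(dist(i - 1, j) + 1,
                   dist(i, j - 1) + 1,
                   dist(i - 1, j - 1) + cost)

    return dist(len(a), len(b))
```

Defining `dist` inside the outer function gives every call its own cache, so results for one pair of strings are never reused for another.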

Third version (works):

(Note that while compact, the runtime of this implementation is relatively poor.)

4th version: (Note this implementation is O(N*M) time and O(M) space, for N and M the lengths of the two sequences.)

5th, a vectorized version of the 1st, using NumPy. It is about 40% faster on my test case. However, this vectorized version is incorrect: for one input it returns 5 where the correct distance is 4 (two 'a' insertions plus two 'd' deletions). The reason is that the vectorized deletion step does not handle the case where the minimum for an entry results from two consecutive deletions, as it does in this example. See Discussion. (Note that this implementation only works if the weight does not depend on the character edited.)

6th version, from Wikipedia article on Levenshtein Distance; Iterative with two matrix rows.

R/S+
"(Note) this is just one of many implementations of the Levenshtein Distance, but I've not been able to get the others to work

Ruby
This Ruby version is simple, but extremely slow, though it works with any Array with elements that implement '=='.

This memoized version is significantly faster

And another memoized version

A faster implementation of the Levenshtein distance of strings (using C bindings) is also available.

Rust
A simple implementation in Rust (beta2), transposed from the C version. It is a little more general than the bare minimum in that it accepts any type that can be converted to &[u8].
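Comparing byte slices rather than characters changes the result for non-ASCII text, because one character can occupy several bytes in UTF-8. The Python sketch below illustrates the difference; the helper name `levenshtein_seq` is hypothetical, and it accepts any two indexable sequences whose elements can be compared for equality.

```python
def levenshtein_seq(a, b):
    """Levenshtein distance between two sequences of comparable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match or replacement
        prev = curr
    return prev[-1]


word = "na\u00efve"  # "naive" with U+00EF, which is two bytes in UTF-8
by_chars = levenshtein_seq(word, "naive")
by_bytes = levenshtein_seq(word.encode("utf-8"), "naive".encode("utf-8"))
print(by_chars, by_bytes)  # 1 2
```

Whether the byte-level or the character-level distance is the right one depends on the application; for plain ASCII input the two agree.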

You can test it live on Rust Playpen.

Imperative version
A functional version would likely be far more concise.

VBScript
This version is identical to the JavaScript and PHP implementations in this article.

Visual Basic for Applications (no Damerau extension)
This version is identical to the JavaScript and PHP implementations in this article. I had problems when I tried to use the other VBA implementation in this article, so I had to adopt the version below.

The Application.WorksheetFunction.Min method is Excel-specific. If you use this code in other VBA-enabled applications, uncomment the conditional block and comment out the Application.WorksheetFunction.Min line.

MapBasic
This version is identical to the VB implementations in this article.

Teslock
This is the Levenshtein distance calculation in Teslock Machine Language.

    .declare singlecall virtual LevenshteinDistance[args(2) string s1, string s2]: out unsigned int main
    (
        define unsigned int constant "cost_ins" == 1;
        define unsigned int constant "cost_del" == 1;
        define unsigned int constant "cost_sub" == 1;

        define unsigned int variable "n1" == calculate::string_operations>length("s1");
        define unsigned int variable "n2" == calculate::string_operations>length("s2");

        define unsigned int_array(calculate::string_operations>array_instantiation("p")>fixed_length("n2", ++1));
        define unsigned int_array(calculate::string_operations>array_instantiation("q")>fixed_length("n2", ++1));
        define unsigned int_array(calculate::string_operations>array_instantiation("r")>variable_length);

        p>array_vector>0 == 0;
        loop_for>finalized(define unsigned int variable "j" == 1)>break_condition("j" <= "n2")>forward_condition(++j)
        (
            p>array_vector>j == p>array_vector>j::access>+constants::cost_ins;
        )

        q>array_vector>0 == 0;
        loop_for>finalized(define unsigned int variable "i" == 1)>break_condition("i" <= "n1")>forward_condition(++i)
        (
            q>array_vector>0 == p>array_vector>0::access>+constants::cost_del;
            loop_for>finalized(define unsigned int variable "j" == 1)>break_condition("j" <= "n2")>forward_condition(++j)
            (
                define unsigned int variable "d_del" == p>array_vector>j::access>+constants::cost_del;
                define unsigned int variable "d_ins" == q>array_vector>j[delegate_handle::j::access>-1]>+constants::cost_ins;
                define unsigned int variable "d_sub" == p>array_vector>j[delegate_handle::j::access>-1]>+logical_operations>xor_result["s1"::access>"s1"[delegate_handle::j::access>-1]?0>return_handle_as_result constants::"cost_sub"];
                q>array_vector>j == dll_extern::math_interop_singlecall(min)::[args(3) (d_del, d_ins), d_sub;
            )
            local>"r" == "p"(self_typecast::ignore_unsafe_condition);
            local>"p" == "q"(self_typecast);
            local>"q" == "r"(self_typecast);
        )
        logical_result(param p::singlecall)::define>"return" == p>array_vector>"n2";
    )

ABAP
Please note that there is a standard ABAP function distance which calculates the Levenshtein distance without the need for the code below.