Talk:Algorithm Implementation/Strings/Dice's coefficient

These algorithms kind of work, but I'm not 100% sure they are correct. - Francis Tyers 15:01, 28 March 2007 (UTC)

Thanks for your code, it was very helpful. Two notes about the C++ code:

 1. The loops shouldn't use "unsigned" because the code hangs on empty strings.
 2. The code is easy to read and perfect for illustrative purposes. However, for production code comparing large numbers of strings, performance can be dramatically improved by removing all memory allocations. This is possible because the bigram analysis can be done directly on the source strings, using raw pointers to either a 16-bit value (for a pair of single-byte characters) or a 32-bit value (for a pair of Unicode characters). There is no need to create the intermediate bigram arrays. - Jim Beveridge 23:59, 20 August 2007 (UTC)


 * Thanks for the feedback. Yes, I agree that an optimisation would be to use pointers instead of actually calculating the bigram arrays, but as you say it is more for illustration than actual production use :)


 * Second point (your point 1.), I hadn't realised that, and actually it might make sense just to check the input strings first for zero lengths, and then return 0 if either string is 0 length. - Francis Tyers 15:42, 14 November 2007 (UTC)


 * I would like to note that these implementations are incorrect (at least the Python version is, and I assume it is derived from the C++ version). Try it with two strings in which an element occurs more than once and you'll see what I mean: ('aaa', 'aaa') gives you 2.0, while a number between 0.0 and 1.0 is required. A fixed Python version can be found below:

The text on Wikipedia isn't very clear on what should be done with multiple occurrences of an element, but since a set can contain only one occurrence of an element, and the text says the intersection is taken between sets, the above code should be closer to the algorithm. Alex de Landgraaf 12:33, 25 May 2008
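For readers following the thread, here is a minimal sketch of the set-based reading described above. It is not the code that was originally posted; the names `bigrams` and `dice_sets` are made up for illustration, and returning 0.0 for strings too short to form a bigram is an assumption.

```python
def bigrams(s):
    """Return the list of adjacent character pairs in s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_sets(a, b):
    """Dice's coefficient over *sets* of bigrams (duplicates collapse)."""
    set_a, set_b = set(bigrams(a)), set(bigrams(b))
    if not set_a or not set_b:
        # Assumption: strings too short to form a bigram score 0.0.
        return 0.0
    return 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))


print(dice_sets("aaa", "aaa"))      # 1.0
print(dice_sets("night", "nacht"))  # 0.25
```

Because duplicates collapse, the result can never exceed 1.0, but repeated bigrams no longer influence the score at all.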


 * You're right, it should be done on sets. I've changed the code to reflect this. The C++ version appears to have already been updated to fix this. - Francis Tyers (talk) 14:22, 7 June 2008 (UTC)


 * I used something close to the Simon White variant. Basically, a pair is removed from the bigram list (not set) as it is matched. This avoids a problem where "gc" vs "gcgc" evaluates as 100% instead of the correct 50%. My Javascript implementation fakes the removal by setting the bigram array element to null. I have not yet experimented with a fast "proper" Array removal (e.g. something derived from John Resig's Array.remove). Maybe that could be faster. Stormrose (discuss • contribs) 11:58, 6 March 2012 (UTC)


 * No, the wiki page is incorrect. According to Adamson and Boreham, Brew and McKelvie, and Kondrak, the set of bigrams is actually a multiset of bigrams. Adamson and Boreham make it perfectly clear that "Multiple occurrences of the same [bigram] in one word were treated as distinct, ...". This can easily be seen by looking at the similarity between "GG" and "GGGGG", which should obviously not be 1. I suggest that all implementations be adapted. Jfresen (discuss • contribs) 17:46, 10 December 2012 (UTC)
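The multiset reading can be sketched in Python with `collections.Counter`, whose `&` operator keeps the minimum count of each key. This is only an illustration under that reading, not the code on the main page; `dice_multiset` is a hypothetical name, and returning 0.0 when neither string has a bigram is an assumption.

```python
from collections import Counter


def dice_multiset(a, b):
    """Dice's coefficient over *multisets* of bigrams (duplicates count)."""
    count_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    count_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        # Assumption: with no bigrams on either side, score 0.0.
        return 0.0
    # Counter & Counter keeps min(count_a[k], count_b[k]) for each bigram k.
    overlap = sum((count_a & count_b).values())
    return 2.0 * overlap / total


print(dice_multiset("GG", "GGGGG"))  # 0.4
print(dice_multiset("gc", "gcgc"))   # 0.5
print(dice_multiset("aaa", "aaa"))   # 1.0
```

With this definition "GG" vs "GGGGG" scores 2·1/(1+4) = 0.4 rather than 1, and "gc" vs "gcgc" gives the 50% mentioned earlier in the thread.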

I've added a Javascript implementation. It relies on Array.push and String.substr, which are standard. I did try a version that used Array.indexOf instead of the inner "j" loop, but it was slower for the string sizes used, and Array.indexOf requires more recent Javascript libraries. Stormrose (discuss • contribs) 11:58, 6 March 2012 (UTC)
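The match-and-remove idea with an inner loop, as described in this thread, might look roughly like this in Python. It is a sketch only, not a port of the Javascript on the main page; `dice_match_and_remove` is a hypothetical name, and `None` stands in for the Javascript `null` marker.

```python
def dice_match_and_remove(a, b):
    """Dice's coefficient where each bigram of b can be matched only once."""
    pairs_a = [a[i:i + 2] for i in range(len(a) - 1)]
    pairs_b = [b[i:i + 2] for i in range(len(b) - 1)]
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        # Assumption: with no bigrams on either side, score 0.0.
        return 0.0
    matches = 0
    for pair in pairs_a:
        # Inner loop: look for the first bigram of b not yet consumed.
        for j, other in enumerate(pairs_b):
            if other == pair:
                matches += 1
                pairs_b[j] = None  # "fake" removal: mark as consumed
                break
    return 2.0 * matches / total


print(dice_match_and_remove("gc", "gcgc"))  # 0.5
print(dice_match_and_remove("gcgc", "gc"))  # 0.5
```

Marking an entry as consumed avoids shifting the list on every match, at the cost of scanning past consumed entries; the result is the same in either argument order.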

PHP Implementation
I'm new to wikibooks and I couldn't figure out how to add this to the actual page. This is a PHP implementation taken from the second Python implementation:

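 function dice_coefficient($a, $b) {
     // Empty input has no bigrams to compare.
     if (!strlen($a) || !strlen($b)) return 0.0;
     // Pad single-character strings so that they yield one bigram.
     if (strlen($a) === 1) $a .= '.';
     if (strlen($b) === 1) $b .= '.';
     // Collect the adjacent character pairs of each string.
     $a_bg = array();
     $b_bg = array();
     for ($i = 0; $i < strlen($a) - 1; $i++) $a_bg[] = substr($a, $i, 2);
     for ($i = 0; $i < strlen($b) - 1; $i++) $b_bg[] = substr($b, $i, 2);
     // array_intersect keeps every element of $a_bg that also occurs in $b_bg.
     $overlap = count(array_intersect($a_bg, $b_bg));
     return $overlap * 2.0 / (count($a_bg) + count($b_bg));
 }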

Clarify what the algorithm should do, before implementing
Recently the Javascript implementation was changed to use sets, which changes its behaviour when the same substring occurs more than once (this may also be true of other language implementations). For comparing strings, an implementation based on sets is not useful, because equal bigrams have to be counted each time they occur. Try the current (26 July 2021) Javascript implementation and the revision of 21 November 2020: they produce different output for, for example, the arguments "gg" and "ggggg", or "aaaaaaa" and "axaaaaa". I have added a warning in the header. Perhaps some test cases with correct results should be added to the wiki. See the comment by user Jfresen above. --Rilaf (discuss • contribs) 04:30, 26 July 2021 (UTC)
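As a starting point for such test cases, here is a small Python sketch that prints both readings side by side for the string pairs mentioned in this discussion. The function name `dice` and its `multiset` flag are made up for illustration. Which column counts as "correct" depends on the definition adopted; the multiset column follows the Adamson and Boreham reading quoted by Jfresen.

```python
from collections import Counter


def dice(a, b, multiset=True):
    """Dice's coefficient on bigrams, as multisets or (optionally) as sets."""
    count_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    count_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    if not multiset:
        # Collapse every count to 1 to get plain set semantics.
        count_a, count_b = Counter(set(count_a)), Counter(set(count_b))
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        return 0.0  # assumption: no bigrams at all scores 0.0
    return 2.0 * sum((count_a & count_b).values()) / total


CASES = [("gg", "ggggg"), ("aaaaaaa", "axaaaaa"), ("gc", "gcgc")]
for a, b in CASES:
    print(f"{a!r} vs {b!r}: multiset={dice(a, b):.3f} "
          f"set={dice(a, b, multiset=False):.3f}")
# 'gg' vs 'ggggg': multiset=0.400 set=1.000
# 'aaaaaaa' vs 'axaaaaa': multiset=0.667 set=0.500
# 'gc' vs 'gcgc': multiset=0.500 set=0.667
```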