tef
@tef

for some reason that escapes me, i've been nerd sniped by diff algorithms. i think a large part of it is that most of the papers explaining them are absolutely terrible. assuming there is documentation at all.

the mcilroy paper is "here are four single letter variables let's party", the myers diff paper is pretty ok except for "here's some new terminology ambiguously defined", patience diff is a livejournal post, and histogram diff is a comment in a java file

and the comment is wrong, too

[Histogram diff] behaves exactly like Bram Cohen's patience diff whenever there is unique common element available between the two sequences.

It doesn't. I found a counter example.


Diff, to recap, can be defined as "finding the smallest edit sequence to transform one file into another", or equivalently as "finding the longest common subsequence shared by both files": the fewer the edits, the more content the two files share unchanged.
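To make that duality concrete, here's a minimal sketch of the textbook dynamic-programming LCS. No real diff tool ships this exact code; it's just the classic table.

```python
# A minimal sketch of the classic dynamic-programming LCS, to make the
# "fewest edits == longest common subsequence" duality concrete.
def lcs(a, b):
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # walk back through the table to recover one LCS
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# every line not in the LCS becomes a delete from one file or an insert
# into the other
common = lcs("dog cat bike biscuit".split(), "dog juice bike biscuit cat".split())
# -> ['dog', 'bike', 'biscuit']
```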

Patience diff isn't a complete diff algorithm. It's about producing nicer output that's not necessarily optimal in size. The big idea is that instead of finding the LCS across the entire files, we find a LCS of unique lines instead. The notion is that leaving the unique lines unchanged in a diff gives better output.
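The core step can be sketched in a few lines. This is my own sketch of the idea, not bzr's or git's code: keep only the lines unique in *both* files, then patience-sort their position pairs for a longest increasing subsequence. Those matches anchor the diff; the gaps between anchors get diffed recursively.

```python
import bisect
from collections import Counter

# A sketch of patience diff's core step (mine, not bzr's or git's code).
def patience_anchors(a, b):
    count_a, count_b = Counter(a), Counter(b)
    index_a = {line: i for i, line in enumerate(a)}
    # pairs arrive ordered by position in b; we want the LIS of positions in a
    pairs = [(index_a[line], j) for j, line in enumerate(b)
             if count_a[line] == 1 and count_b[line] == 1]
    tails, back = [], {}  # tails[k]: pair ending the best run of length k+1
    for ai, bj in pairs:
        k = bisect.bisect_left([t[0] for t in tails], ai)
        back[(ai, bj)] = tails[k - 1] if k else None
        if k == len(tails):
            tails.append((ai, bj))
        else:
            tails[k] = (ai, bj)
    # follow the back-pointers to read off the anchor lines
    anchors, node = [], tails[-1] if tails else None
    while node:
        anchors.append(a[node[0]])
        node = back[node]
    return anchors[::-1]

# with the example files later in the post, "cat" is the only line unique
# to both sides, so it's the single anchor
anchors = patience_anchors("2 cat 1 dog 2".split(), "1 dog 2 cat 2 dog 1".split())
```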

Histogram diff is somewhat similar. The code explains itself like this:

 * The basic idea of the algorithm is to create a histogram of occurrences for
 * each element of sequence A. Each element of sequence B is then considered in
 * turn. If the element also exists in sequence A, and has a lower occurrence
 * count, the positions are considered as a candidate for the longest common
 * subsequence (LCS). After scanning of B is complete the LCS that has the
 * lowest number of occurrences is chosen as a split point. The region is split
 * around the LCS, and the algorithm is recursively applied to the sections
 * before and after the LCS.
 * <p>
 * By always selecting a LCS position with the lowest occurrence count, this
 * algorithm behaves exactly like Bram Cohen's patience diff whenever there is a
 * unique common element available between the two sequences. When no unique
 * elements exist, the lowest occurrence element is chosen instead. This offers
 * more readable diffs than simply falling back on the standard Myers' O(ND)
 * algorithm would produce.

The problem? This isn't what the code does. I think a better explanation of histogram diff is the following:

  • Create a histogram of the first file. This is a hash table keyed by line, mapping to a list of line numbers and the total number of occurrences.
  • We now iterate through the second file, checking for lines that exist in both files.
  • For every matched line, we check whether the files have the same lines above and below the match point, generating a list of substrings that appear in both files.
  • For each substring, we use the histogram of the first file to find the lowest-occurring element within the substring.
  • We return the longest substring that contains the least common line in the first file.
  • We then use this substring to split the files in two, and repeat the process.
  • If the first file has a line that repeats more than 64 times, give up, to avoid worst-case behaviour.
  • We can also optimize this approach by keeping track of the best split point found so far and skipping over matching lines with a higher occurrence count, only updating when there's a longer common substring, or a common substring with a rarer line in the first file.
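Sketched in code, the steps above look roughly like this. This is my reading of the behaviour, not a port of the Java, and the rarest-line-then-longest-run ordering is how I interpret the description:

```python
from collections import defaultdict

# A sketch following the bullet points above -- my reading of the behaviour,
# not a port of the Java. Returns the common substring whose rarest line
# in the first file occurs least often, preferring longer runs on ties.
MAX_OCCURRENCES = 64  # the cutoff mentioned above, to dodge worst-case inputs

def histogram_split(a, b):
    hist = defaultdict(list)  # line -> positions where it occurs in a
    for i, line in enumerate(a):
        hist[line].append(i)
    best = None  # (rarity, -length, start in a, start in b)
    for j, line in enumerate(b):
        for i in hist.get(line, ()):
            # grow the common substring outwards from the match point
            lo_a, lo_b = i, j
            while lo_a > 0 and lo_b > 0 and a[lo_a - 1] == b[lo_b - 1]:
                lo_a -= 1
                lo_b -= 1
            hi_a, hi_b = i + 1, j + 1
            while hi_a < len(a) and hi_b < len(b) and a[hi_a] == b[hi_b]:
                hi_a += 1
                hi_b += 1
            # the rarest line (counted in the first file) scores the split
            rarity = min(len(hist[a[k]]) for k in range(lo_a, hi_a))
            if rarity > MAX_OCCURRENCES:
                continue  # the real code gives up here instead
            cand = (rarity, -(hi_a - lo_a), lo_a, lo_b)
            if best is None or cand < best:
                best = cand
    if best is None:
        return None
    rarity, neg_len, sa, sb = best
    return a[sa:sa - neg_len]

# "a b" is the run containing lines that are unique in the first file
split = histogram_split("x a b x".split(), "a b".split())
```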

The comment is wrong: The code finds the longest common substring, not the longest common subsequence.

For example: The longest common subsequence of say "dog cat bike biscuit" and "dog juice bike biscuit cat" is "dog bike biscuit". It isn't a contiguous list of words. When you want a contiguous list, you want the longest common substring. In the example above, it's "bike biscuit".
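The contiguous version is easy to demonstrate on the same example. This is the standard rolling-suffix-table approach, nothing histogram-specific:

```python
# A sketch of longest common *substring* -- the standard rolling-suffix
# table. Contiguity is the whole difference from the subsequence version.
def longest_common_substring(a, b):
    best = (0, 0)  # (length, start of the run in a)
    prev = [0] * (len(b) + 1)  # prev[j]: common suffix length at a[i-1], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j])
        prev = cur
    n, start = best
    return a[start:start + n]

words_a = "dog cat bike biscuit".split()
words_b = "dog juice bike biscuit cat".split()
# -> ["bike", "biscuit"], versus the subsequence ["dog", "bike", "biscuit"]
run = longest_common_substring(words_a, words_b)
```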

The other thing to note is that the code doesn't find the least common longest substring across both files, but the longest common substring that has the least common element in the first file.

Which is a bit of a mouthful, sure, but this means we can create two inputs where histogram diff acts differently to patience diff, despite having unique common elements to choose from.

% cat a
2
cat
1
dog
2

% cat b
1
dog
2
cat
2
dog
1

The unique common element between these two files is "cat". Patience diff will calculate the longest common subsequence of unique elements, which here is just "cat".

Histogram diff, on the other hand, only calculates uniqueness in the first input. Here, "cat", "dog" and "1" are all unique. It then finds common substrings, and "dog 2" is the longest substring in both files that contains a unique element ("dog", in this case).

This means patience will split the first file up into "2", "cat", and "1 dog 2". Histogram diff will split the file into "2 cat 1" and "dog 2".

Which eventually leads to completely different output.

% git diff --patience a b
diff --git a/a b/b
index eb9810a..eae195b 100644
--- a/a
+++ b/b
@@ -1,6 +1,8 @@
+1
+dog
 2
 cat
-1
-dog
 2
+dog
+1
 
% git diff --histogram a b
diff --git a/a b/b
index eb9810a..eae195b 100644
--- a/a
+++ b/b
@@ -1,6 +1,8 @@
-2
-cat
 1
 dog
 2
+cat
+2
+dog
+1

In other words: Histogram diff does not behave like patience diff unless there's exactly one unique element shared between both files, and that element is also the only unique element in the first file. Nor does histogram diff find the longest common subsequence.

The comment, and subsequent explanations of the algorithm, are all entirely wrong. Thanks Google.

Oh, and while I was looking all of this up, I found a bug in the other Myers diff paper (A File Comparison Program, Myers & Miller). It has some code to constrain the diagonal search, a "hit the last row, don't look to the left" case, but that case can happen twice, which causes the code to skip over valid searches, or to return the wrong path if you initialised your memory with zeroes.

Thanks Myers, too.



in reply to @tef's post:

people that have been nerd sniped by parsing algorithms are at high risk of being nerd sniped by related fields like diff algorithms and compression algorithms

I wish diff algorithms would do a better job picking where to do splits, it seems like inevitably function bodies get separated from their signatures when you do a refactor because the diff algorithm sees the common { line or something

  1. yep. i already got nerd sniped by parsing, badly

  2. patience diff seems to do a better job, what with picking unique shared points across the files, but it doesn't always do better. histogram claims to do something similar, and i imagine in many cases it does, but it's still "find the longest overlap", which, again, doesn't work when you have multiple equal ones to choose from

myers diff? well, the thing about implementations is they bail out as soon as they find a single midpoint, but in theory, you could have multiple different midpoints to choose from at a given depth. it would be relatively straightforward to make it continue along and return the longest middle snake, rather than the first middle snake it finds at an equal number of edits from both ends of the string.

that might help? but, well, once again, you're still left with the issue where there's multiple possible choices for a longest overlap.

this is why git now has a "sliding diff" optimization that notices when "actually we could move this change one line down" when the prefixes/suffixes of a common section are the same, which does improve matters somewhat
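the slide itself is simple; roughly this (a toy sketch, not git's actual xdiff code -- the real heuristic also scores candidate positions, e.g. by indentation):

```python
# a toy sketch of the "slide" idea, not git's actual xdiff code: an inserted
# (or deleted) run of lines can shift down one line whenever the line just
# past the run equals the run's first line, without changing the file the
# diff produces. git picks among these equivalent positions heuristically.
def slide_down(lines, start, length):
    # lines[start:start+length] is the changed run
    while start + length < len(lines) and lines[start] == lines[start + length]:
        start += 1
    return start

# the run ["a", "b"] at index 0 is ambiguous: it could equally start at 2
hunk = slide_down(["a", "b", "a", "b", "c"], 0, 2)
# -> 2
```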

but, well

there's still several cases where a longer edit script is less confusing, and there's no obvious good way to handle it.

i guess you can scan through the script and go "a longer section of adds and removals is better than an interspersed one", but now, your diff isn't always showing what changed

even so, it might be pleasing to handle blank lines in a "don't bother matching these" way