tef


someone pointed out that i've been doing a diff-of-diffs, so i figured i might as well do another progress report

this time: what are the diff algorithms in git, why are they different


what is a diff, anyway

to recap: diff is about producing a list of changes between two files, and there are several ways to go about it. the first, and perhaps most obvious, way is to search for the smallest number of edits you need to make. the second way, which might not be so obvious, is to search for the longest common subsequence: an ordered list of lines that haven't changed between the two files.

it's worth pointing out here that when i say "subsequence" i don't mean "substring". a substring is a contiguous run of lines in common, and a subsequence is just lines in common, in increasing order, sometimes with gaps.

anyway, the point is this: you can calculate the list of changes in two different ways, one is to find the shortest list of edits, one is to find the longest list of common lines, and given one, you can calculate the other.
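
to make that concrete, here's a tiny python sketch of the arithmetic tying the two together (the file contents here are made up): every old line that isn't in the common subsequence is a delete, every new line that isn't is an insert.

    old = ["a", "b", "c", "d"]
    new = ["a", "x", "c", "d", "e"]
    lcs = ["a", "c", "d"]  # longest common subsequence: in both files, in order

    # everything in old but not in the lcs gets deleted,
    # everything in new but not in the lcs gets inserted
    edits = (len(old) - len(lcs)) + (len(new) - len(lcs))
    assert edits == len(old) + len(new) - 2 * len(lcs)
    print(edits)  # 3: delete "b", insert "x", insert "e"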

the thing that makes diff algorithms different from longest common subsequence algorithms is the inputs they work on. a diff algorithm usually works on text files, source code, where there's a very small list of changes, and a very long list of common lines.

longest common subsequences are usually optimised for the other case: a very very large list of changes, and only a few common elements. lcs algorithms are also regularly implemented for things with a very small list of possible elements, or a fixed alphabet like a genetic sequence, where elements can be A,G,C, or T. diff, by comparison, has to work on files where there is no fixed set of possible lines.

myers diff (aka, git's minimal diff)

with that in mind, we can jump into a quick overview of diff algorithms in git, and let's pick the easiest one to start with, "minimal diff". git's minimal diff is actually myers diff, which, for the people who haven't opened a wikipedia tab yet, is a breadth first greedy algorithm.

the underlying idea is that we can start at the beginning of both files, adding a line or deleting a line whenever the next lines are different, and racing ahead to the next difference whenever the lines are the same. as a result, when your two inputs have lots of lines in common, myers diff races to the finish line in absolutely no time.

the running time of the basic algorithm turns out to be O(nd), where n is the combined length of the files and d is the number of edits, which is a fancy way of saying "it's more quadratic when there's a lot of edits" and "less quadratic when there's almost no edits to find". this is why myers diff is so popular for searching for changes in prose or code, and there are several tricks going on behind the scenes to make things faster, too.

the first trick is hard to explain in text, but i'm too lazy to draw a graph. the long and short of it is that by searching in units of "make an edit, check for common substrings", we can generate a range of possible places we should have reached, and then quickly look back at where we last got to. trust me, it's much easier with a graph, but the important part is that we don't look forward to see what edit comes next in the breadth first search, we look back to see which was the most efficient edit to make, and then search for common substrings.
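
here's a rough python sketch of that forward pass, just computing the edit distance rather than the full list of edits. the names are mine, it's not git's actual code, but the look-back-then-race-ahead shape is the important bit.

    def myers_edit_distance(a, b):
        # v[k] is the furthest x we've reached on diagonal k (where k = x - y)
        n, m = len(a), len(b)
        v = {1: 0}
        for d in range(n + m + 1):          # d = number of edits spent so far
            for k in range(-d, d + 1, 2):
                # look back: did we get further by an insert (k + 1) or a delete (k - 1)?
                if k == -d or (k != d and v[k - 1] < v[k + 1]):
                    x = v[k + 1]            # step down: insert a line from b
                else:
                    x = v[k - 1] + 1        # step right: delete a line from a
                y = x - k
                # race ahead over the common substring (the "snake")
                while x < n and y < m and a[x] == b[y]:
                    x, y = x + 1, y + 1
                v[k] = x
                if x >= n and y >= m:
                    return d                # reached the end of both files in d edits
        return n + m

    print(myers_edit_distance("abcabba", "cbabac"))  # 5, the example from myers' paper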

i only mention this because it tripped me up when i read through the source code, asking myself "WHERE'S THE GODDAM BREADTH FIRST SEARCH THEN, HUH?", but it isn't really that important to understanding the general vibes of the algorithm.

what is important is that myers diff is used in a divide and conquer way. by searching from both ends of the file, we can find a common substring roughly in the centre of the file (or, more technically, the same edit distance from either end), and then use that to bisect the inputs, and repeat the search.

this means we don't have to store every possible breadth first path we take in order to calculate the list of edits. we find the middle point, then the next two mid points between that and the end of the file, and so on, and so on, until we have such small pieces that calculating the edits is easy, and then build up the complete list of edits.

so in summary: breadth first greedy algorithm, goes much faster when the files have things in common, and uses a divide and conquer strategy to avoid filling up memory with possible edits along the way (which does affect the runtime but no-one seems to mind too much)

the last important thing to note here is that myers diff tries to find the shortest diff output, not necessarily the most pleasing. which may come as no surprise to those who have run diff and cursed at the screen.

one example of this is when you add in a new function to a file, and diff doesn't go "aha, there is the new function", it matches the "}" at the end of one function to a "}" at the end of another, and you end up with some real whackytown frolics in the output.

in other words: myers diff doesn't care which substrings overlap, only that the shortest one is picked. when it has multiple possible choices, it always picks the first one it finds. not everyone is happy about this, which is why patience diff came about.

patience diff

patience diff is a relatively simple idea: what if we match up the unique lines in both files first, use that to find common substrings, and then break up the files into smaller pieces.

in the bad example above, patience skips over starting a substring with a '}', and instead latches on to a unique line, usually a function definition, and as a result, produces far more pleasing output.

it's worth mentioning that patience diff isn't really a proper diff algorithm, it's more like a preprocessing step. if there are no unique lines in common, you need to use something else to finish the job.

that said, let's look at the algorithm:

  1. make a list of unique lines in file 1.
  2. make a list of unique lines in file 2.
  3. take the lines that appear on both lists, in the order they appear in file 2, and number each one with the position it appears at in file 1.
  4. find the longest increasing subsequence.

ok ok, that last step is a bit "now draw the owl", but there is a nice algorithm to do it for you, "patience sort".

other people have explained it better, but in essence, you play a game of patience with the lines in file 2, forming stacks of decreasing sequences. when you run out of cards, the number of stacks is the length of the longest increasing subsequence, and through careful backpointers, you can then calculate it.
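
here's a rough python sketch of both halves, the unique-line matching and the patience sort with backpointers. it's the idea, not git's actual implementation.

    from bisect import bisect_left
    from collections import Counter

    def patience_matches(a, b):
        # step one: pair up the lines that appear exactly once in each file,
        # walking file 2 in order and remembering where each line sits in file 1
        count_a, count_b = Counter(a), Counter(b)
        pos_a = {line: i for i, line in enumerate(a) if count_a[line] == 1}
        pairs = [(pos_a[line], j) for j, line in enumerate(b)
                 if count_b[line] == 1 and line in pos_a]

        # step two: patience sort the file-1 positions. each new position goes on
        # the leftmost stack whose top is bigger; the backpointer records whatever
        # was on top of the stack to its left at the time
        tops, stack_tops, prev = [], [], {}
        for i, (x, _) in enumerate(pairs):
            s = bisect_left(tops, x)
            if s == len(tops):
                tops.append(x)
                stack_tops.append(i)
            else:
                tops[s] = x
                stack_tops[s] = i
            prev[i] = stack_tops[s - 1] if s > 0 else None

        # follow the backpointers from the top of the last stack to recover
        # the longest increasing subsequence of matches
        chain, i = [], (stack_tops[-1] if stack_tops else None)
        while i is not None:
            chain.append(pairs[i])
            i = prev[i]
        return chain[::-1]   # (file 1 line number, file 2 line number) pairs, in order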

the running time of patience sort is nlogn: n for the number of cards you place, and logn to find the best stack to place each one on with binary search. which is a pretty good runtime, given it never goes quadratic, but patience diff often feels much slower than myers diff on large files.

in other words, patience diff works best when you only have a handful of lines in common. when you have a lot of unique lines in common, myers diff will beat it, hands down, every time. even though the output won't always be as nice.

which brings us neatly to histogram diff

histogram diff

histogram diff is another "not quite a diff algorithm" preprocessing step, invented by the jgit authors to try to make patience diff a little more performant.

the idea is that instead of doing what myers diff does, finding a midpoint the same edit distance from both ends of the file, we use a midpoint that patience diff might have chosen.

let's just dive into the algorithm:

  1. create a histogram of file 1, but bail out if it gets too big.
  2. for each line in file 2, check for common substrings
  3. return the substring with the most unique elements in file 1, and if there's more than one to choose from, return the longest substring
  4. bail out to myers diff when things get rough.
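
here's a much-simplified python sketch of the idea, skipping the substring extension and the tie-breaking, with a made-up cutoff number:

    from collections import Counter

    MAX_COUNT = 64   # made-up cutoff for the sketch, not jgit's actual limit

    def histogram_split(a, b):
        # the histogram: how often does each line occur in file 1?
        counts = Counter(a)
        if counts and max(counts.values()) > MAX_COUNT:
            return None   # file 1 is too repetitive, bail out to something else

        first_pos = {}
        for i, line in enumerate(a):
            first_pos.setdefault(line, i)

        # walk file 2 and keep the shared line with the lowest count in file 1
        # (a count of 1 is exactly the unique line patience would have picked)
        best = None
        for j, line in enumerate(b):
            if line in counts and (best is None or counts[line] < best[0]):
                best = (counts[line], first_pos[line], j)
        if best is None:
            return None   # nothing in common at all
        _, i, j = best
        return i, j       # bisect here: recurse on a[:i], b[:j] and a[i+1:], b[j+1:]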

the algorithm's pretty fast. no logn's in sight, but it isn't perfect.

it sacrifices finding a short, minimal diff, and it doesn't always pick the same midpoints that patience finds. it's quite possible to have unique lines in file A that aren't unique in file B, and pick something that patience wouldn't even look at. on the other hand, histogram diff still charges on when there's a mostly unique substring to find.

in other words, histogram diff is more of a heuristic than a preprocessing step. it picks a "good enough midpoint" to continue on, which leads to much faster running times.

which brings us to the last type of diff in git, confusingly called "myers"

git's myers diff (a heuristic diff)

if we can heuristic up patience diff to approximate a good solution, we can definitely do the same for myers diff. instead of searching for the first midpoint that's the same edit distance from either end, git's diff just picks the first midpoint that's good enough.

the algorithm isn't much to look at:

  1. try and do myers diff
  2. if you find something close enough to the middle of the files, use that to bisect
  3. but if the search is taking too long, we just give up and pick whatever

the specifics of the heuristics aren't exactly formally defined, but they're mostly just magic numbers, rather than being data dependent like in histogram. it really does turn out that most people do not care about a "minimal diff", just one that's "vaguely readable" you don't have to wait around for.

more importantly: you can always tidy up the output afterwards. which brings us nicely to the final secret of git's diff

git's other heuristics

we've tried heuristics during preprocessing, we've tried heuristics during the search, the only place left for fuckery is postprocessing, and git already does that too.

again, it's an attempt to make diff output more patience-like.

to recap, patience diff picks unique lines to avoid "matching halfway through a function", but we can also just look at the match we got, and slide it up and down to see if there's a better place to put it.

instead of asking "do we have the most unique lines in this match", which sometimes works, git asks "are we starting and ending at a different indent", which is much easier to calculate, and works pretty well.
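
a rough python sketch of that sliding, much simpler than git's actual scoring, just to show the shape of it: a hunk of added or deleted lines can slide up or down over identical lines, and here we keep the position with the shallowest indent at the top of the hunk.

    def indent(line):
        # width of leading whitespace; blank lines just count as no indent
        stripped = line.lstrip()
        return 0 if not stripped else len(line) - len(stripped)

    def slide_hunk(lines, start, end):
        # collect every position the hunk lines[start:end] could slide to
        candidates = [(start, end)]
        s, e = start, end
        while s > 0 and lines[s - 1] == lines[e - 1]:     # slide up
            s, e = s - 1, e - 1
            candidates.append((s, e))
        s, e = start, end
        while e < len(lines) and lines[s] == lines[e]:    # slide down
            s, e = s + 1, e + 1
            candidates.append((s, e))
        # keep the position where the hunk starts at the shallowest indent
        return min(candidates, key=lambda c: indent(lines[c[0]]))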

it turns out it's sometimes much quicker to find a good approximation and tidy things up than it is to do the perfect thing from the outset.

this works so well, in fact, that since git 2.14 the "--indent-heuristic" option has been turned on by default.

one last thing: mcilroy diff

diff didn't always work like this, and myers wasn't always the default. hunt and mcilroy wrote the first implementation of diff, which, instead of finding "the smallest edit script", directly calculated the longest common subsequence.

in some ways, hunt/mcilroy diff is like patience diff, except considering all possible matches, rather than just the unique ones. there's a bit more work needed to make patience sort handle it, but roughly speaking, it's a very similar process underneath.
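
here's a rough python sketch of that "same process underneath", just computing the length of the longest common subsequence. it's the hunt/szymanski flavour of the idea rather than the original 1976 code: the same increasing-subsequence trick as patience, run over every matching pair of positions instead of only the unique lines.

    from bisect import bisect_left
    from collections import defaultdict

    def lcs_length(a, b):
        # every place each line of a could match in b
        positions = defaultdict(list)
        for j, line in enumerate(b):
            positions[line].append(j)

        # tops[k] is the smallest b-position that can end a common subsequence of
        # length k + 1. walking each line's matches in b in *decreasing* order stops
        # two matches of the same line of a from chaining with each other.
        tops = []
        for line in a:
            for j in reversed(positions.get(line, [])):
                k = bisect_left(tops, j)
                if k == len(tops):
                    tops.append(j)
                else:
                    tops[k] = j
        return len(tops)

    print(lcs_length("abcabba", "cbabac"))  # 4 ("caba" is one longest common subsequence)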

hunt/mcilroy uses a lot more memory than patience diff, or myers, but if there's a very very small longest common subsequence (in other words, a very long edit sequence), it will find the answer way way faster than myers diff. this rarely happens in practice, outside of comparing dna sequences.

so you might be asking: why can't we have our cake and eat it? why can't we use myers diff when the files are similar, and hunt/mcilroy diff when the files are distinct? well, we could, but we still have to calculate the distinctness to begin with, which requires running one algorithm or the other to know which one would be the fastest.

well, almost. there are forms of the lcs algorithm that kinda suck equally at long and short sequences. in theory we could use one of them to begin with, find a good midpoint, and then bisect using a different diff algorithm depending on the measured overlap

in practice? it's an awful lot of code for not an awful lot of gain. heuristics are cheap and make the code run fast, and in the end, diff isn't really about finding the shortest output

it's about finding the most readable output, as fast as possible. it just turns out that the shortest output is often the most readable.

final thoughts

i do wonder if there's still room for improvement

could we improve patience diff by looking at unique n-grams, over unique lines? finding unique sets of pairs or triples of lines? would it make things more readable?

or what if we could do a bisecting version of patience diff, that isn't as sloppy as histogram diff? something that considers uniqueness in both files? would it be worth the extra runtime? would it speed things up compared to plain old patience?

perhaps we could use the histogram to pick which diff algorithm to use? if there's a few overlaps, hunt/mcilroy, if there's many, myers diff, and maybe patience in between to even things out.

alternatively, what if we could get myers diff to pick better midpoints, rather than bailing out as soon as it finds one good candidate? maybe we could pick from the possible midpoints using a histogram?

what if we tried trisecting the file when we can't find a good midpoint quickly? rather than just picking one "good enough candidate", we pick one candidate for the forward, and one candidate for the backwards search. given we'll know the edit distance length for two of the pieces, we might be able to make better choices?

and finally, what if there's better metrics than just indentation or uniqueness for human readable output? preprocessing and postprocessing is cheap and we should be doing more of it.



in reply to @tef's post:

I hear a lot of people talking about “language specific diffs” but nobody doing it, presumably because it’s hard. I wonder if some language specific callbacks in those heuristics would be both easy and beneficial (like marking lines that start declarations, over just indentation).

the problem with heuristics is that the 80/20 rule tends to mean "it works most of the time and when it sucks, it sucks horribly, badly, no good"

that's the thing about patience and the sliders, they work pretty well, even when they don't