tef

it's very technical but the simple answer i was looking for is "it's breadth first search tef, but then that's expensive, so people developed tricks to avoid having to store every possible path"

so, for my benefit alone, here's me writing up what i learned today


so diff, right, you have "you love butts" and "i love cats" and you want something that gives you "(-you) (+i) (=love) (-butts) (+cats)" back. this is called an edit sequence

i mean, you could have an edit sequence like "(+i) (+love) (+cats) (-you) (-love) (-butts)", or even "(-you) (-love) (-butts) (+i) (+love) (+cats)" or even "(+i) (-you) (+love) (-love) (+cats) (-butts)", but these are all longer edit sequences than the most obvious one.

a long edit sequence isn't very useful at showing you what has changed. that's why people want to find "the shortest edit sequence"

so, how do you find this edit sequence? well, you could try depth first search.

so you have string one "(-you) (-love) (-butts)" and string two "(+i) (+love) (+cats)" and you keep track of how far you've progressed through each. at each possible pair of positions, you pick one of three actions "do nothing, the words are the same", or more likely "add in the word we wanted" or "delete the word we do not want."

so it might go "ok, (+i), then (-you), then (=love), so now i pick between adding (+cats) or deleting (-butts)", then from either, working out that it has to then do the other one to get to the finish.

and after, say, "(+i) (-you)", it realises that both strings have the next word "(love)", which it can keep for free, and that means a shorter route is possible. you could add in a score so that adding or deleting increases it by 1, and keeping the same word costs nothing. with this score, you can compare potential routes to see which is best.

you only need to store your current route of edits, the best one you've found so far, and its score. this will eventually find the shortest edit sequence, and it doesn't take up a lot of memory, either.

congratulations you have written diff. it will work, but it won't be fast.
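
(here's a rough sketch of that depth first version in python. the pruning check — give up on a route once it's already as expensive as the best one found — is my addition, but it's in the spirit of "compare potential routes". none of these names come from a real library.)

```
# depth first search over edit sequences: at each pair of positions,
# try "keep" (free, if the words match), "delete", or "add",
# and remember the cheapest complete route seen so far.
def dfs_diff(a, b):
    best = {"seq": None, "score": float("inf")}

    def walk(i, j, seq, score):
        # prune: this route is already no better than the best one found
        if score >= best["score"]:
            return
        if i == len(a) and j == len(b):
            best["seq"], best["score"] = list(seq), score
            return
        # keep: both strings have the same next word, costs nothing
        if i < len(a) and j < len(b) and a[i] == b[j]:
            walk(i + 1, j + 1, seq + [("=", a[i])], score)
        # delete the next word from string one, costs 1
        if i < len(a):
            walk(i + 1, j, seq + [("-", a[i])], score + 1)
        # add the next word from string two, costs 1
        if j < len(b):
            walk(i, j + 1, seq + [("+", b[j])], score + 1)

    walk(0, 0, [], 0)
    return best["seq"]

print(dfs_diff("you love butts".split(), "i love cats".split()))
# [('-', 'you'), ('+', 'i'), ('=', 'love'), ('-', 'butts'), ('+', 'cats')]
```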

the diff tools you use employ a different technique. a combination of breadth first search and trickery.

with plain old breadth first search, you keep track of every possible edit at each step before deepening the search, which, as you might imagine, uses up a lot of memory. for example, at step one i could do "(+i)" or "(-you)", and then at step two, from "(+i)" it could try "(+love)" or "(-you)", and from "(-you)" it could try "(+i)" or "(-love)"
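
(roughly like this, as a sketch: keep a whole frontier of part-finished routes and grow it one edit at a time. matching words get skipped for free, but nothing is de-duplicated, which is exactly why the frontier balloons.)

```
# breadth first search over edit sequences: the frontier holds every route
# reachable with d edits, as (position in string one, position in string two,
# edits so far). runs of matching words are followed for free.
def bfs_diff(a, b):
    def slide(i, j, seq):
        while i < len(a) and j < len(b) and a[i] == b[j]:
            seq = seq + (("=", a[i]),)
            i, j = i + 1, j + 1
        return i, j, seq

    frontier = [slide(0, 0, ())]
    while True:
        nxt = []
        for i, j, seq in frontier:
            if i == len(a) and j == len(b):
                return list(seq)          # first complete route found is a shortest one
            if i < len(a):                # delete the next word from string one
                nxt.append(slide(i + 1, j, seq + (("-", a[i]),)))
            if j < len(b):                # add the next word from string two
                nxt.append(slide(i, j + 1, seq + (("+", b[j]),)))
        frontier = nxt                    # roughly doubles every step: hello, memory

print(bfs_diff("you love butts".split(), "i love cats".split()))
# [('-', 'you'), ('+', 'i'), ('=', 'love'), ('-', 'butts'), ('+', 'cats')]
```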

so how do we reduce the memory cost? myers diff uses three tricks at the same time.

first: don't store the path you take, just the lowest score you've found at each point. i.e. "after getting n words into string one and m words into string two, this is the fewest changes i've needed" is enough.

second: search from either end of the file to generate the start and end of the edit sequence, and stop in the middle.

third: now you have the middle point of the edit sequence, you can split the search in two, and repeat to find the next midpoints and so on.

it's kinda ridiculous, but it works out.

if you only store the score at each point, it's an awful awful lot less memory than every possible route. searching from either end means you can quickly find a good middle point without having to store the routes. then breaking the inputs in two means you have far fewer possible routes to consider.
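
(trick one, as a sketch: a table of scores indexed by how far you are through each string, and nothing else. this is the "big 2d array" from the summary below.)

```
# fewest adds+deletes to turn the first i words of string one into the
# first j words of string two, with no routes stored anywhere
def edit_scores(a, b):
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0:
                score[i][j] = j                       # add everything
            elif j == 0:
                score[i][j] = i                       # delete everything
            elif a[i - 1] == b[j - 1]:
                score[i][j] = score[i - 1][j - 1]     # keeping a match is free
            else:
                score[i][j] = 1 + min(score[i - 1][j],    # delete from string one
                                      score[i][j - 1])    # add from string two
    return score

table = edit_scores("you love butts".split(), "i love cats".split())
print(table[3][3])  # 4 — the shortest edit sequence makes four changes
```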

so, that's diff!

  1. skip over the front and end of the two files if they're the same.
  2. keep a big 2d array of scores, indexed by how far you are through each string
  3. breadth first search through every possible edit from either end of the files
  4. now you have a midpoint, split the strings into two pieces, and run the algorithm on each part

edit: well, almost. there's one more trick myers diff uses, but it's very tricky. let's first imagine that we're diffing two completely different strings, with no common elements, just to understand how we're searching things

say we compare "a a a" to "b b b", and we could end up with "-a -a -a +b +b +b" or "+b +b +b -a -a -a" or any permutation. let's imagine picking a path through these two strings.

at the beginning you can pick -a or +b, then after that you could have -a -a, +b +b, or either -a +b or +b -a. the important thing to notice here is that after two steps, we really only have three options. it doesn't matter if we add then remove, or remove then add, the paths cost the same, and we end up at the same position in both strings.

ok? got that? paths overlap. right, then, well, there's more numerical trickery.

now let's say we use two variables to keep track of our search, but instead of saving how far we've processed either string, let's save the depth (or how many steps we've made so far), and an offset k, which we'll define as (deletes - adds)

(if you like, you can draw a graph with string one along the x axis and string two along the y axis, and the line y=x is "these strings are the same", the line y=x+1 is "i had to add one thing", which is k=-1, and the line y=x-1 is "i had to remove one thing", which is k=+1.)

so at step 1, we can make one addition or one deletion: D=1, k=+1 or -1. at step 2, we can make another addition or deletion, and we'd end up with D=2, k=-2, 0, or +2. and at step 3, we can end up with D=3, k=-3, -1, +1, or +3
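
(in code, the set of k values you can land on after D edits is just every other number from -D to +D:)

```
# diagonals reachable after d edits (pretending there are no matching words yet)
for d in (1, 2, 3):
    print(d, list(range(-d, d + 1, 2)))
# 1 [-1, 1]
# 2 [-2, 0, 2]
# 3 [-3, -1, 1, 3]
```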

now, well, we add in an array, and it'll be called V, and V[k] stores how far you are through deleting string one, and then we can work out the adds we've made by doing (deletes - k)

with me so far?

god.

now in a normal breadth first search, we'd choose to add or delete at each point in turn, and unfortunately we'd end up with multiple overlaps as different routes land in the same place. what myers diff does is work out the points we could arrive at, then choose whether deleting or adding would get us there with the lowest score.

you see, given a depth D, you can work out which k's you should land on. then you can look back at the array V and decide if you added or deleted to get there.

this is the bit that i tripped over in the code, i was looking for "try adding, then try deleting", but instead it's "if adding was cheaper to get here, do that, otherwise do a delete".

right. ok.

we can then, well, add back in common subsequences, and modify our algorithm to fit.

so we have the array V, a depth D, and a k. we work out the new k points we could get to at depth D, choose whether we added or deleted to get to each one, then check to see if the two strings have matching words from there, and update V[k] to say how far along we got.

which works out because following matching words moves us the same amount through both strings, ugh, so k stays the same.

right? right?

in essence we've moved from "after making one edit, we could be at either of these two points", to "hey, after making one edit, we should be at this potential set of points in the strings, because we might have found a matching subsequence."
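
(here's the scoring half of all that as one sketch: the V array, the k bookkeeping, the "was adding or deleting cheaper" choice, and the free runs of matching words. it only returns the length of the shortest edit sequence, not the sequence itself, and the names are mine rather than the paper's.)

```
# length of the shortest edit sequence, following the loop described above.
# a and b are lists of words or lines.
def shortest_edit_length(a, b):
    n, m = len(a), len(b)
    # v[k] = how far through string one we can get on diagonal k = (deletes - adds)
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # was adding or deleting the cheaper way to land on this k?
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]          # adding: position in string one stays put
            else:
                x = v[k - 1] + 1      # deleting: move one further through string one
            y = x - k                 # position in string two falls out of the arithmetic
            # follow matching words for free; k stays the same
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m

print(shortest_edit_length("you love butts".split(), "i love cats".split()))  # 4
```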

i honestly understand the diagrams in the paper now and i can see why they're there but jesus this would have been a lot simpler if they'd labeled them a bit better

edit over

the next trick is patience diff

the idea is to do more "divide and conquer": find the lines that appear exactly once in each file, take the longest common subsequence of those unique lines, and then split the files up into pieces around them

you keep splitting until you don't have any unique lines left, then you just run normal diff.
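
(a very rough sketch of the splitting idea, with some assumptions: the inputs are lists of lines, "fallback" is whatever ordinary diff you already have — the depth first sketch from earlier would do, slowly — and this is just the shape of the trick, not git's actual implementation.)

```
from collections import Counter
from bisect import bisect_left

def patience_diff(a, b, fallback):
    # lines that appear exactly once in each file, paired up by content,
    # as (position in a, position in b)
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    anchors = [(i, pos_b[a[i]]) for i in range(len(a))
               if count_a[a[i]] == 1 and a[i] in pos_b]

    # longest run of anchors whose positions increase in both files
    # (patience sorting: a longest increasing subsequence, with backpointers)
    piles, prev = [], [-1] * len(anchors)
    for idx, (_, j) in enumerate(anchors):
        p = bisect_left([anchors[t][1] for t in piles], j)
        if p == len(piles):
            piles.append(idx)
        else:
            piles[p] = idx
        if p > 0:
            prev[idx] = piles[p - 1]
    chain, t = [], piles[-1] if piles else -1
    while t != -1:
        chain.append(anchors[t])
        t = prev[t]
    chain.reverse()

    # no unique common lines left: just run normal diff on this chunk
    if not chain:
        return fallback(a, b)

    # split around each anchor line and recurse on the pieces in between
    result, next_a, next_b = [], 0, 0
    for ai, bj in chain:
        result += patience_diff(a[next_a:ai], b[next_b:bj], fallback)
        result.append(("=", a[ai]))
        next_a, next_b = ai + 1, bj + 1
    result += patience_diff(a[next_a:], b[next_b:], fallback)
    return result
```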

ohhhh.

