Put the knife down and take a green herb, dude. (c) Ruffin Bailey 2001-2021

I've been using git for a while now. It's not bad. The distributed "everyone's a server" setup is really pretty impressive. It's the Gnutella of version control, and that's not necessarily a bad thing.

The one place where I've been unimpressed is in its conflict management. I don't know why, but kdiff3 and SourceTree won't play nicely together on my boxen, and I've had WinMerge save stuff wonkily on Windows as well using git from the command line. With all the merging you're going to be doing in git (no other VCS invites merging and conflicting quite like git), conflict management becomes Job 2, second only to coding.

Once all goes to heck when I'm diffing and merging, I usually just want to grab a conflict file and start editing the problems out, save over whatever was there, and then commit normally without any mergetool wonkiness. WinMerge does this -- sorta. You can only edit the "Yours" file, which is bizarre. If you want to use edits from the "Theirs" file, great. But if you want to use anything from yours, you'll never get the pop-up saying the files are identical. Theirs is read-only. Bizarre. As if there's no such thing as temp files or a Documents folder to put those files in.

I don't think WinMerge handles things when you turn on the original in your conflict files (another example) either. Edit: I'm right. Here's the sample conflict file:

    Shared head
    <<<<<<< HEAD:file.txt
    Code changed by A
    ||||||| merged common ancestors
    Original code
    =======
    Code changed by B
    >>>>>>> 77976da35a11db4580b80ae27e8d65caf5208086:file.txt
    Shared end

And here are the results. Wrong.
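
To make the "conflict file with the original" format concrete, here's a minimal sketch of walking those markers in C#. It is not the unconflict code and not anything git or WinMerge ships; `ConflictSketch`, `ConflictSection`, and `Parse` are hypothetical names, and it assumes one well-formed conflict style (the three-section kind shown above) with no nested conflicts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names, for illustration only.
public enum ConflictSection { Shared, Ours, Base, Theirs }

public static class ConflictSketch
{
    // Tags every non-marker line with the section it belongs to.
    public static List<KeyValuePair<ConflictSection, string>> Parse(string text)
    {
        var result = new List<KeyValuePair<ConflictSection, string>>();
        var current = ConflictSection.Shared;

        foreach (var raw in text.Split('\n'))
        {
            var line = raw.TrimEnd('\r'); // tolerate Windows line endings

            // A marker only counts when it shows up in the state where it makes sense,
            // so a stray "=======" in shared text is kept as an ordinary line.
            if (current == ConflictSection.Shared
                && line.StartsWith("<<<<<<<", StringComparison.Ordinal))
            { current = ConflictSection.Ours; continue; }

            if (current == ConflictSection.Ours
                && line.StartsWith("|||||||", StringComparison.Ordinal))
            { current = ConflictSection.Base; continue; }

            if ((current == ConflictSection.Ours || current == ConflictSection.Base)
                && line == "=======")
            { current = ConflictSection.Theirs; continue; }

            if (current == ConflictSection.Theirs
                && line.StartsWith(">>>>>>>", StringComparison.Ordinal))
            { current = ConflictSection.Shared; continue; }

            result.Add(new KeyValuePair<ConflictSection, string>(current, line));
        }
        return result;
    }
}
```

Feeding it the sample above should tag "Code changed by A" as `Ours`, "Original code" as `Base`, and "Code changed by B" as `Theirs`, with the head and end lines left as `Shared`. A real handler would also want to complain about unterminated conflicts, which this sketch silently ignores.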

I'd really like to have mergetool calls from git work with WinMerge and kdiff3 so I'm not kludging my workflow, and I'd like to handle conflict files with the original. So what do I do? Start writing my own conflict file handler, of course. There's no git mergetool written in C#, I don't believe, so that's a legitimate need, right? I'll at least let that push me away from learning enough C to fix WinMerge. ;^) But first, I just want something that'll let me handle conflict files in a dependable fashion.

After putting in a few fun hours over a few weeks (maybe 4-6 hours, sadly enough), I finally have a line-by-line diff engine that seems to be working okay. I need to add something to create markup for intraline changes, but I'm on the way. Which means it's finally time to slow down and see who else has already invented a better wheel in C#, right?

And though I missed it in my initial Googling, as if on cue, it turns out that there is a much better wheel:

This library implements Myer's diff algorithm which is generally considered to be the best general-purpose diff. A layer of pre-diff speedups and post-diff cleanups surround the diff algorithm, improving both performance and output quality.

This library also implements a Bitap matching algorithm at the heart of a flexible matching and patching strategy.

Okay, well, actually it's an overkill wheel. It's not precisely set up to do line-by-line diffs (though the docs say they're "really easy" to do, which I don't doubt). And I don't think I need fuzzy matching -- do I? I mean, you want to see if the code's the same or not, FULL STOP, right?
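
For a sense of scale, here's roughly what an exact-match, line-by-line diff can look like in C#. This is a sketch under stated assumptions, not the unconflict engine and not how the library above works: it uses a plain longest-common-subsequence table (quadratic in memory, so small files only) instead of Myers' algorithm, and `LineDiffSketch` and `Diff` are made-up names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class LineDiffSketch
{
    // Returns one entry per line, prefixed "  " (unchanged),
    // "- " (only in oldLines) or "+ " (only in newLines).
    public static List<string> Diff(string[] oldLines, string[] newLines)
    {
        // lcs[i, j] = length of the longest common subsequence
        // of oldLines[i..] and newLines[j..].
        var lcs = new int[oldLines.Length + 1, newLines.Length + 1];
        for (int i = oldLines.Length - 1; i >= 0; i--)
        {
            for (int j = newLines.Length - 1; j >= 0; j--)
            {
                lcs[i, j] = oldLines[i] == newLines[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);
            }
        }

        // Walk the table from the top, emitting lines as we go.
        var output = new List<string>();
        int oldIndex = 0, newIndex = 0;
        while (oldIndex < oldLines.Length && newIndex < newLines.Length)
        {
            if (oldLines[oldIndex] == newLines[newIndex])
            {
                output.Add("  " + oldLines[oldIndex]);
                oldIndex++;
                newIndex++;
            }
            else if (lcs[oldIndex + 1, newIndex] >= lcs[oldIndex, newIndex + 1])
            {
                output.Add("- " + oldLines[oldIndex++]);
            }
            else
            {
                output.Add("+ " + newLines[newIndex++]);
            }
        }

        // Whatever is left on either side is a pure deletion or insertion.
        while (oldIndex < oldLines.Length) output.Add("- " + oldLines[oldIndex++]);
        while (newIndex < newLines.Length) output.Add("+ " + newLines[newIndex++]);
        return output;
    }
}
```

Lines are compared with plain string equality, so there's no fuzziness at all: a line is either the same or it isn't. The tradeoff is the table, which grows with the product of the two line counts.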

Now the dude who wrote it (that is, who wrote the original, which was ported from Java to various other langs by others) has a very geeky and very cool blog, and seems like a great, conscientious fellow, so it's certainly nothing personal. That is to say, this guy looks like he's doing good stuff, so I want to use his code, all other things equal. It's just that here, I just want an insanely accessible codebase. /shrug Leaves me wondering whether my not using the lib is for simplicity's sake or simply my abuse of The Programmer's Prerogative.

I'm going to put it here so I don't forget about it, and keep plugging away without it. I'd really like to foreground readability in the unconflict codebase, so that a relatively new but skilled programmer can take a look and very quickly understand what's going on without having to plod through it all and create a mental map first. The above lib is 2300 lines of this:

    /**
     * Parse a textual representation of patches and return a List of Patch
     * objects.
     * @param textline Text representation of patches.
     * @return List of Patch objects.
     * @throws ArgumentException If invalid input.
     */
    public List<Patch> patch_fromText(string textline) {
      List<Patch> patches = new List<Patch>();
      if (textline.Length == 0) {
        return patches;
      }
      string[] text = textline.Split('\n');
      int textPointer = 0;
      Patch patch;
      Regex patchHeader
          = new Regex("^@@ -(\\d+),?(\\d*) \\+(\\d+),?(\\d*) @@$");
      Match m;
      char sign;
      string line;
      while (textPointer < text.Length) {
        m = patchHeader.Match(text[textPointer]);
        if (!m.Success) {
          throw new ArgumentException("Invalid patch string: "
              + text[textPointer]);
        }
        patch = new Patch();
        patches.Add(patch);
        patch.start1 = Convert.ToInt32(m.Groups[1].Value);
        if (m.Groups[2].Length == 0) {
          patch.start1--;
          patch.length1 = 1;
        } else if (m.Groups[2].Value == "0") {
          patch.length1 = 0;
        } else {
          patch.start1--;
          patch.length1 = Convert.ToInt32(m.Groups[2].Value);
        }

        patch.start2 = Convert.ToInt32(m.Groups[3].Value);
        if (m.Groups[4].Length == 0) {
          patch.start2--;
          patch.length2 = 1;
        } else if (m.Groups[4].Value == "0") {
          patch.length2 = 0;
        } else {
          patch.start2--;
          patch.length2 = Convert.ToInt32(m.Groups[4].Value);
        }
        textPointer++;

        while (textPointer < text.Length) {
          try {
            sign = text[textPointer][0];
          } catch (IndexOutOfRangeException) {
            // Blank line?  Whatever.
            textPointer++;
            continue;
          }
          line = text[textPointer].Substring(1);
          line = line.Replace("+", "%2b");
          line = HttpUtility.UrlDecode(line, new UTF8Encoding(false, true));
          if (sign == '-') {
            // Deletion.
            patch.diffs.Add(new Diff(Operation.DELETE, line));
          } else if (sign == '+') {
            // Insertion.
            patch.diffs.Add(new Diff(Operation.INSERT, line));
          } else if (sign == ' ') {
            // Minor equality.
            patch.diffs.Add(new Diff(Operation.EQUAL, line));
          } else if (sign == '@') {
            // Start of next patch.
            break;
          } else {
            // WTF?
            throw new ArgumentException(
                "Invalid patch mode '" + sign + "' in: " + line);
          }
          textPointer++;
        }
      }
      return patches;
    }

(Honestly, that's not all that bad, but I think the point -- that it takes a pretty hefty mind-rethreading effort before the code is skimmable -- stands. Ah, good ole m.Groups, you know? Didn't we set m to patchHeader.Match(text[textPointer]); earlier? Ah, perfect sense. But wasn't patches your cat's name?)
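
One way to take some of the rethreading out of the header-parsing half is named regex groups, which .NET supports out of the box. The sketch below is only an illustration of that idea, not a proposed patch to the library: `HunkHeaderSketch`, `Parse`, `ReadRange`, and the group names are all invented here, and the start/length adjustments simply mirror the branches in the code quoted above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HunkHeaderSketch
{
    // Same pattern as the quoted code, with names instead of group numbers.
    private static readonly Regex Header = new Regex(
        @"^@@ -(?<oldStart>\d+),?(?<oldLength>\d*) \+(?<newStart>\d+),?(?<newLength>\d*) @@$");

    // Returns { oldStart, oldLength, newStart, newLength }.
    public static int[] Parse(string headerLine)
    {
        Match header = Header.Match(headerLine);
        if (!header.Success)
        {
            throw new ArgumentException("Invalid patch string: " + headerLine);
        }

        int oldStart, oldLength, newStart, newLength;
        ReadRange(header.Groups["oldStart"].Value, header.Groups["oldLength"].Value,
                  out oldStart, out oldLength);
        ReadRange(header.Groups["newStart"].Value, header.Groups["newLength"].Value,
                  out newStart, out newLength);
        return new[] { oldStart, oldLength, newStart, newLength };
    }

    // Applies the same three cases the quoted code applies to each side.
    private static void ReadRange(string startText, string lengthText,
                                  out int start, out int length)
    {
        start = int.Parse(startText);
        if (lengthText.Length == 0)
        {
            start--;      // no length given: treated as a single line
            length = 1;
        }
        else if (lengthText == "0")
        {
            length = 0;   // empty range: start is left untouched
        }
        else
        {
            start--;
            length = int.Parse(lengthText);
        }
    }
}
```

So `HunkHeaderSketch.Parse("@@ -21,4 +21,10 @@")` gives back 20, 4, 20, 10. Whether `header.Groups["oldLength"]` really reads faster than `m.Groups[2]` is a matter of taste, and pulling the duplicated branch into `ReadRange` costs a pair of `out` parameters.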

Some day, I'd like to write Moore's Relational Database that uses plain text files to store the information and whose engine was written just as cleanly as I'm proposing here. If it's too slow, too bad. Eventually the processors will catch up with us. But I want it to be easily readable to the point that it can be used heuristically. So little operational code can be. As a self-test, I'll try to keep this unconflict codebase pretty clean as I slowly, painfully find free time to work on it.

EDIT: Looking back over, it sounds like I'm critiquing the beautiful code from the "Diff Match and Patch" lib. I'm not. It's honestly wonderful stuff, which is part of why I'd like to keep this noteToSelf so that I don't lose it. I've even, about a decade ago, taken a few months off to challenge myself to write an app with nobody to blame and no deadlines to force myself to compromise to see how close I could get to well-commented, exceptionally encapsulated code. Eventually, though, I got impatient, picked a few places to take shortcuts (not bad ones, but shortcuts), and BAM -- OMGWTFBBQ!!!1! the stuff's no longer idealistically pure.

The Diff Match and Patch library's code actually stays pretty pure -- excellently factored, afaict, fast, whitespaced, well thought out stuff that even bothers, most of the time, to have descriptive variable names. Doing large things (or small things in a well thought out, slightly complex way) requires some interesting compromises by virtue of being code, I think.

We'll see how well I do, but let me go ahead and admit my code here won't be anything as nice as "Diff Match and Patch"'s. The DMP lib is, afaict after a while looking through it, The Right Way to do diffs. And Matches. And Patches. Meow.

(Ha, I should probably also submit a patch for // text.split('\n') would would temporarily double our memory footprint..)

Labels: c#, git, unconflict


Monday, July 16, 2012
A better mousetrap - "Diff, Match and Patch libraries for Plain Text"
posted by ruffin at 7/16/2012 12:19:00 AM