| 1 | // Copyright 2022 The Go Authors. All rights reserved. |
|---|---|
| 2 | // Use of this source code is governed by a BSD-style |
| 3 | // license that can be found in the LICENSE file. |
| 4 | |
| 5 | // package lcs contains code to find longest-common-subsequences |
| 6 | // (and diffs) |
| 7 | package lcs |
| 8 | |
| 9 | /* |
| 10 | Compute longest-common-subsequences of two slices A, B using |
| 11 | algorithms from Myers' paper. A longest-common-subsequence |
| 12 | (LCS from now on) of A and B is a maximal set of lexically increasing |
| 13 | pairs of subscripts (x,y) with A[x]==B[y]. There may be many LCS, but |
| 14 | they all have the same length. An LCS determines a sequence of edits |
| 15 | that changes A into B. |
| 16 | |
| 17 | The key concept is the edit graph of A and B. |
| 18 | If A has length N and B has length M, then the edit graph has |
| 19 | vertices v[i][j] for 0 <= i <= N, 0 <= j <= M. There is a |
| 20 | horizontal edge from v[i][j] to v[i+1][j] whenever both are in |
| 21 | the graph, and a vertical edge from v[i][j] to f[i][j+1] similarly. |
| 22 | When A[i] == B[j] there is a diagonal edge from v[i][j] to v[i+1][j+1]. |
| 23 | |
| 24 | A path between in the graph between (0,0) and (N,M) determines a sequence |
| 25 | of edits converting A into B: each horizontal edge corresponds to removing |
| 26 | an element of A, and each vertical edge corresponds to inserting an |
| 27 | element of B. |
| 28 | |
| 29 | A vertex (x,y) is on (forward) diagonal k if x-y=k. A path in the graph |
| 30 | is of length D if it has D non-diagonal edges. The algorithms generate |
| 31 | forward paths (in which at least one of x,y increases at each edge), |
| 32 | or backward paths (in which at least one of x,y decreases at each edge), |
| 33 | or a combination. (Note that the orientation is the traditional mathematical one, |
| 34 | with the origin in the lower-left corner.) |
| 35 | |
| 36 | Here is the edit graph for A:"aabbaa", B:"aacaba". (I know the diagonals look weird.) |
| 37 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 38 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
| 39 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 40 | b | | | ___/‾‾‾ | ___/‾‾‾ | | | |
| 41 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 42 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
| 43 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 44 | c | | | | | | | |
| 45 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 46 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
| 47 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 48 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
| 49 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 50 | a a b b a a |
| 51 | |
| 52 | |
| 53 | The algorithm labels a vertex (x,y) with D,k if it is on diagonal k and at |
| 54 | the end of a maximal path of length D. (Because x-y=k it suffices to remember |
| 55 | only the x coordinate of the vertex.) |
| 56 | |
| 57 | The forward algorithm: Find the longest diagonal starting at (0,0) and |
| 58 | label its end with D=0,k=0. From that vertex take a vertical step and |
| 59 | then follow the longest diagonal (up and to the right), and label that vertex |
| 60 | with D=1,k=-1. From the D=0,k=0 point take a horizontal step and the follow |
| 61 | the longest diagonal (up and to the right) and label that vertex |
| 62 | D=1,k=1. In the same way, having labelled all the D vertices, |
| 63 | from a vertex labelled D,k find two vertices |
| 64 | tentatively labelled D+1,k-1 and D+1,k+1. There may be two on the same |
| 65 | diagonal, in which case take the one with the larger x. |
| 66 | |
| 67 | Eventually the path gets to (N,M), and the diagonals on it are the LCS. |
| 68 | |
| 69 | Here is the edit graph with the ends of D-paths labelled. (So, for instance, |
| 70 | 0/2,2 indicates that x=2,y=2 is labelled with 0, as it should be, since the first |
| 71 | step is to go up the longest diagonal from (0,0).) |
| 72 | A:"aabbaa", B:"aacaba" |
| 73 | ⊙ ------- ⊙ ------- ⊙ -------(3/3,6)------- ⊙ -------(3/5,6)-------(4/6,6) |
| 74 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
| 75 | ⊙ ------- ⊙ ------- ⊙ -------(2/3,5)------- ⊙ ------- ⊙ ------- ⊙ |
| 76 | b | | | ___/‾‾‾ | ___/‾‾‾ | | | |
| 77 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ -------(3/5,4)------- ⊙ |
| 78 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
| 79 | ⊙ ------- ⊙ -------(1/2,3)-------(2/3,3)------- ⊙ ------- ⊙ ------- ⊙ |
| 80 | c | | | | | | | |
| 81 | ⊙ ------- ⊙ -------(0/2,2)-------(1/3,2)-------(2/4,2)-------(3/5,2)-------(4/6,2) |
| 82 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
| 83 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 84 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
| 85 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
| 86 | a a b b a a |
| 87 | |
| 88 | The 4-path is reconstructed starting at (4/6,6), horizontal to (3/5,6), diagonal to (3,4), vertical |
| 89 | to (2/3,3), horizontal to (1/2,3), vertical to (0/2,2), and diagonal to (0,0). As expected, |
| 90 | there are 4 non-diagonal steps, and the diagonals form an LCS. |
| 91 | |
| 92 | There is a symmetric backward algorithm, which gives (backwards labels are prefixed with a colon): |
| 93 | A:"aabbaa", B:"aacaba" |
| 94 | ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ |
| 95 | a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | |
| 96 | ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ --------(:0/5,5)-------- ⊙ |
| 97 | b | | | ____/‾‾‾ | ____/‾‾‾ | | | |
| 98 | ⊙ -------- ⊙ -------- ⊙ --------(:1/3,4)-------- ⊙ -------- ⊙ -------- ⊙ |
| 99 | a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | |
| 100 | (:3/0,3)--------(:2/1,3)-------- ⊙ --------(:2/3,3)--------(:1/4,3)-------- ⊙ -------- ⊙ |
| 101 | c | | | | | | | |
| 102 | ⊙ -------- ⊙ -------- ⊙ --------(:3/3,2)--------(:2/4,2)-------- ⊙ -------- ⊙ |
| 103 | a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | |
| 104 | (:3/0,1)-------- ⊙ -------- ⊙ -------- ⊙ --------(:3/4,1)-------- ⊙ -------- ⊙ |
| 105 | a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | |
| 106 | (:4/0,0)-------- ⊙ -------- ⊙ -------- ⊙ --------(:4/4,0)-------- ⊙ -------- ⊙ |
| 107 | a a b b a a |
| 108 | |
| 109 | Neither of these is ideal for use in an editor, where it is undesirable to send very long diffs to the |
| 110 | front end. It's tricky to decide exactly what 'very long diffs' means, as "replace A by B" is very short. |
| 111 | We want to control how big D can be, by stopping when it gets too large. The forward algorithm then |
| 112 | privileges common prefixes, and the backward algorithm privileges common suffixes. Either is an undesirable |
| 113 | asymmetry. |
| 114 | |
| 115 | Fortunately there is a two-sided algorithm, implied by results in Myers' paper. Here's what the labels in |
| 116 | the edit graph look like. |
| 117 | A:"aabbaa", B:"aacaba" |
| 118 | ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ |
| 119 | a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | |
| 120 | ⊙ --------- ⊙ --------- ⊙ --------- (2/3,5) --------- ⊙ --------- (:0/5,5)--------- ⊙ |
| 121 | b | | | ____/‾‾‾‾ | ____/‾‾‾‾ | | | |
| 122 | ⊙ --------- ⊙ --------- ⊙ --------- (:1/3,4)--------- ⊙ --------- ⊙ --------- ⊙ |
| 123 | a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | |
| 124 | ⊙ --------- (:2/1,3)--------- (1/2,3) ---------(2:2/3,3)--------- (:1/4,3)--------- ⊙ --------- ⊙ |
| 125 | c | | | | | | | |
| 126 | ⊙ --------- ⊙ --------- (0/2,2) --------- (1/3,2) ---------(2:2/4,2)--------- ⊙ --------- ⊙ |
| 127 | a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | |
| 128 | ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ |
| 129 | a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | |
| 130 | ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ |
| 131 | a a b b a a |
| 132 | |
| 133 | The algorithm stopped when it saw the backwards 2-path ending at (1,3) and the forwards 2-path ending at (3,5). The criterion |
| 134 | is a backwards path ending at (u,v) and a forward path ending at (x,y), where u <= x and the two points are on the same |
| 135 | diagonal. (Here the edgegraph has a diagonal, but the criterion is x-y=u-v.) Myers proves there is a forward |
| 136 | 2-path from (0,0) to (1,3), and that together with the backwards 2-path ending at (1,3) gives the expected 4-path. |
| 137 | Unfortunately the forward path has to be constructed by another run of the forward algorithm; it can't be found from the |
| 138 | computed labels. That is the worst case. Had the code noticed (x,y)=(u,v)=(3,3) the whole path could be reconstructed |
| 139 | from the edgegraph. The implementation looks for a number of special cases to try to avoid computing an extra forward path. |
| 140 | |
| 141 | If the two-sided algorithm has stop early (because D has become too large) it will have found a forward LCS and a |
| 142 | backwards LCS. Ideally these go with disjoint prefixes and suffixes of A and B, but disjointness may fail and the two |
| 143 | computed LCS may conflict. (An easy example is where A is a suffix of B, and shares a short prefix. The backwards LCS |
| 144 | is all of A, and the forward LCS is a prefix of A.) The algorithm combines the two |
| 145 | to form a best-effort LCS. In the worst case the forward partial LCS may have to |
| 146 | be recomputed. |
| 147 | */ |
| 148 | |
| 149 | /* Eugene Myers paper is titled |
| 150 | "An O(ND) Difference Algorithm and Its Variations" |
| 151 | and can be found at |
| 152 | http://www.xmailserver.org/diff2.pdf |
| 153 | |
| 154 | (There is a generic implementation of the algorithm the the repository with git hash |
| 155 | b9ad7e4ade3a686d608e44475390ad428e60e7fc) |
| 156 | */ |
| 157 |
Members