Codes for Deletion and Insertion Channels With Segmented Errors

We consider deletion channels and insertion channels under an additional segmentation assumption: the input consists of disjoint segments of b consecutive bits, with at most one error per segment. Under this assumption, we demonstrate simple and computationally efficient deterministic encoding and decoding schemes that achieve a high provable rate even under worst case errors. We also consider more complex schemes that experimentally achieve higher rates under random error.


I. INTRODUCTION
Channels that allow deletions and insertions are remarkably challenging. For example, the capacity of the binary i.i.d. deletion channel, where n bits are sent and each bit is deleted with probability d, remains unknown, despite substantial recent progress [4], [5]. Even the case where n bits are sent and just one bit is deleted provides many interesting open problems [8]. While some attempts have been made to design coding schemes for such channels, the work has not led to provable performance guarantees and still seems far from optimal.
In this paper, we consider deletion and insertion channels under an additional segmentation assumption about the location of the errors. Specifically, we assume that the input is naturally grouped into consecutive segments of b bits each, and there is at most one error in each segment. For example, suppose our segments consist of eight bits and at most one deletion occurs per segment. On the input 0001011100101111, which consists of two segments, it would be possible that the fourth and eleventh bits were deleted, so that the received sequence would be 00001110001111, but not that the last two bits were deleted, leaving 00010111001011.
We emphasize that the segments are implicit, and that no segment markers appear in the received sequence. Our goal is to develop efficient codes in this setting.
This additional assumption appears quite natural for many practical settings. Consider the case of disk drives, a commonly given example for synchronization errors. Deletions may occur because of a timing mismatch between the device reading the data and the data layout. In such situations, there might naturally be a minimal gap between deletions, as the drift caused by the timing error may require reading several additional bits before the timing error yields a further deletion. Our model encompasses the case where there is such a minimal gap, although it can also allow nearby deletions that cross a segment boundary. Our model would therefore also include settings such as when data is naturally written out in segments (e.g., bytes) by a writer that might erroneously delete a bit per segment, because of timing or other issues, and the reader must deal with the resulting bit sequence.
Another compelling motivation for considering channels with segmentation is the existing theoretical challenges in handling random or worst-case insertions and deletions. Considering channels with additional assumptions may yield insight into the more general problem.
We find that the segmentation assumption greatly simplifies the problem of dealing with insertions or deletions. Our primary result demonstrates a deterministic coding scheme inspired by the idea of prefix coding in compression. Our coding scheme allows for left-to-right decoding of a message, as long as a small amount of lookahead (corresponding to the next segment) is available. The scheme has provable performance guarantees under the segmentation assumption, even with adversarially chosen errors. As an example, with segments of eight bits (one byte), allowing up to one adversarial deletion per segment, this scheme provides a code with a rate of 44.8%. The same result holds if we instead allow up to one adversarial insertion per segment. Our coding scheme is computationally simple and quite amenable to use in hardware. We believe the resulting transmission rates prove sufficiently high to be useful in practical settings.
We also consider extensions of our approach to give schemes that provide larger transmission rates under random errors, again with the assumption of at most one error per segment. The idea is to allow some ambiguity in the decoding, and then incorporate check bits and checksums to resolve the ambiguities subsequently. Here our results are experimental, but as an example, again with segments of length one byte, we can achieve rates above 54% with very low error rates. Such schemes, however, also take additional computation time over our simpler schemes. Because of space limitations, we only present our deterministic scheme; results from our extended approach appear in the full version of the paper [6].
While our results are generally incomparable with previous results because of our additional assumptions, we note that previous experimental approaches to channels with insertions and deletions generally allowed far fewer errors with nontrivial block error rates [2], [3], [7]. Codes of rate 50% handling only deletions or insertions at a rate of 2 to 6 percent are typical. We believe the performance as well as the simplicity of our schemes represents an advance over previous work.

A. The Communication Model
Formally, our channel transmits binary streams of fixed length $n$, where $n$ is known to the sender and receiver. We write the input as $X = x_1 x_2 \ldots x_n$. We use the notation $X(j, k)$ to refer to the substring $x_j x_{j+1} \ldots x_k$, and similarly for other bit sequences. For the segmented deletion channel, the received sequence $Y = y_1 y_2 \ldots y_m$ is obtained by deleting a number of bits from the input sequence, under the following condition: at most one bit from each set of bits $X(bi + 1, b(i + 1))$ can be deleted by the channel for $i = 0, \ldots, n/b - 1$.
(For convenience we assume that $b$ divides $n$ evenly.) We use $s_i = X(bi + 1, b(i + 1))$ to refer to the bits constituting the $i$th segment in $X$, but we also abuse notation and use $s_i$ to refer to the corresponding received bits in $Y$ where the meaning is clear. We say the $i$th segment $s_i$ starts at position $y_k$ if the first undeleted bit of the $i$th segment occurs at position $y_k$. We emphasize that our scheme functions for any set of deletions satisfying the properties of the segmented deletion channel.
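To make the channel model concrete, the following Python sketch applies a chosen set of deletions while enforcing the one-deletion-per-segment constraint. The function name `segmented_deletion_channel` and its interface are our own illustrative assumptions; positions are 1-based to match the notation $x_1 x_2 \ldots x_n$.

```python
def segmented_deletion_channel(x, b, deleted):
    """Delete the bits of x at the given 1-based positions.

    Raises ValueError unless b divides len(x) and every b-bit segment
    loses at most one bit (the segmentation assumption).
    """
    if len(x) % b != 0:
        raise ValueError("b must divide the input length")
    if any(p < 1 or p > len(x) for p in deleted):
        raise ValueError("deletion position out of range")
    # Segment index (0-based) hit by each deletion.
    segments_hit = [(p - 1) // b for p in deleted]
    if len(set(segments_hit)) != len(segments_hit):
        raise ValueError("more than one deletion in a segment")
    drop = set(deleted)
    return "".join(bit for pos, bit in enumerate(x, start=1) if pos not in drop)


# The example from the introduction: two 8-bit segments, with the
# fourth and eleventh bits deleted.
print(segmented_deletion_channel("0001011100101111", 8, [4, 11]))
```

Deleting the last two bits of the same input is rejected, since both deletions fall in the second segment.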
The case where $b = n$, so that there is just one segment and hence just one deletion, has been considered extensively [8]. Of particular interest is the class of Varshamov-Tenengolts codes, or VT codes [10]. The VT code $VT_a(n)$ consists of all binary vectors $x_1 x_2 \ldots x_n$ satisfying $\sum_{i=1}^{n} i x_i \equiv a \pmod{n + 1}$.
With a VT code, any single deletion can be corrected without error. The codes $VT_0(n)$ are in fact optimal codes for $n$ up to 9; see [8], [9] for more details.
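As a small illustration of the VT definition, the sketch below enumerates $VT_a(n)$ directly from the checksum condition and corrects a single deletion by brute force: it tries every reinsertion and keeps the unique candidate whose checksum matches. This is not the efficient VT decoding algorithm referred to in [8]; the function names are our own assumptions.

```python
from itertools import product


def vt_syndrome(x):
    """Return sum_{i=1}^{n} i * x_i modulo n + 1 for a bit string x."""
    return sum(i * int(bit) for i, bit in enumerate(x, start=1)) % (len(x) + 1)


def vt_code(n, a=0):
    """Enumerate VT_a(n) by checking the syndrome of every n-bit string."""
    words = ("".join(bits) for bits in product("01", repeat=n))
    return [w for w in words if vt_syndrome(w) == a]


def vt_decode_bruteforce(y, a=0):
    """Recover the codeword of VT_a(len(y) + 1) that lost one bit to give y."""
    candidates = {y[:i] + z + y[i:] for i in range(len(y) + 1) for z in "01"}
    matches = [c for c in candidates if vt_syndrome(c) == a]
    if len(matches) != 1:
        raise ValueError("y is not a single-deletion descendant of a codeword")
    return matches[0]


print(len(vt_code(8)), len(vt_code(9)))
```

The two sizes printed correspond to the single-segment benchmarks quoted later for $b = 8$ and $b = 9$.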

B. Encoding and Decoding for Deletions
To explain the reasoning behind the choices made for our encoding and decoding schemes, we walk through them step by step, showing how the properties we require arise naturally from first principles.
In our encoding scheme, each segment will consist of one of a set $C$ of $b$-bit codewords. We refer to $C$ as a code, even though strictly speaking the code for this channel consists of a concatenation of segments with each coming from $C$. We use the same set $C$ for every segment, although this is not a requirement of our approach. For $u \in C$, let $D_1(u)$ be the set of all $(b - 1)$-bit strings that can be obtained by deleting one bit from $u$. We refer to $D_1(u)$ as the set of first order descendants of $u$, or just the descendants of $u$ where the meaning is clear. This follows the notation used in [8]. We also use $D_1(C) = \bigcup_{u \in C} D_1(u)$. The code $C$ is said to be 1-deletion correcting if $D_1(u) \cap D_1(v) = \emptyset$ for all $u, v \in C$ with $u \neq v$. As mentioned previously, such codes are treated extensively in [8]. It is natural that we will want our code $C$ to have this property.
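The descendant sets are easy to compute directly. The following Python sketch, with function names of our own choosing, builds $D_1(u)$ and tests whether a candidate set of codewords is 1-deletion correcting by checking that descendant sets are pairwise disjoint.

```python
from itertools import combinations


def d1(u):
    """First order descendants: all strings obtained by deleting one bit of u."""
    return {u[:i] + u[i + 1:] for i in range(len(u))}


def is_one_deletion_correcting(code):
    """True if D_1(u) and D_1(v) are disjoint for all distinct u, v in code."""
    return all(d1(u).isdisjoint(d1(v)) for u, v in combinations(code, 2))


print(sorted(d1("0010")))
print(is_one_deletion_correcting(["000111", "111000"]))
print(is_one_deletion_correcting(["0001", "0010"]))
```

Note that deleting either bit of a run gives the same descendant, so $D_1(u)$ has one element per run of $u$.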
To see why, we start to explain our decoding process. Our decoder will work from left to right, decoding one segment at a time. Decoding a segment will only require access to the next $2b - 1$ bits in sequence. Consider what might happen as we start from the left on the received sequence $Y$. The first $b - 1$ bits reveal the value of the first segment; indeed, in general, when $C$ is 1-deletion correcting, if $k$ is the starting position of a segment, then by examining bits $Y(k, k + b - 1)$, we can determine the codeword associated with the segment. But there may be some ambiguity as to whether a bit was deleted from the segment or not, so the decoder cannot determine whether to extract the first $b - 1$ or first $b$ bits. For example, if the segments are eight bits, and the first two segments are the strings 00000000 and 00001110, then if the received sequence began with 00000000001110, it would be a mistake to extract 8 bits for the first segment. (As 10 of the first 12 zeroes remain, we can see that one 0 was deleted from each segment.) Doing so would actually remove a bit from the subsequent segment. In general, we may not be sure whether the next segment starts at $y_{b-1}$ or $y_b$. If we did not control this ambiguity, it could increase as we continue decoding; the third segment could conceivably start at $y_{2b-2}$, $y_{2b-1}$, or $y_{2b}$, and so on.
We therefore arrange our code so that this cannot happen. At each step, there will potentially remain some ambiguity; we maintain the invariant that the next segment may start at one of at most two positions, $y_k$ or $y_{k+1}$. This ambiguity is then resolved at the end of the received sequence.
Because our decoder works in this fashion, it is clear that we only need to consider how the decoder works locally. That is, given $(Y, i, k)$ where $Y$ is the received string, $i$ is the segment to be decoded, and $k$ is a starting position such that the $i$th segment must start in position $k$ or $k + 1$, we wish to decode the $i$th segment and determine an appropriate new position $k'$ such that the $(i + 1)$st segment starts at $k'$ or $k' + 1$. We can then iterate through $Y$ to recover $X$. (It should be clear in what follows that at some points in our algorithm we may have no ambiguity, so that we know the $i$th segment must start in some position $k$. The algorithm could be optimized for such situations. We do not consider such optimizations here, as they do not affect our analysis.)
Suppose that we have segment $s_i$ starting at position $k$. There are two cases to consider.
• Case 1: There is no deletion in $s_i$. In this case, the segment ends at $y_{k+b-1}$, and $Y(k + b, k + 2b - 2) \in D_1(s_{i+1})$.
• Case 2: There is exactly one deletion in $s_i$. In this case, the segment ends at $y_{k+b-2}$, and $Y(k + b - 1, k + 2b - 3) \in D_1(s_{i+1})$.
Optimistically, we might hope that by restricting our codebook we can determine which case holds at each point, in which case we can decode segment by segment with no ambiguity. The following provides an equivalent way of viewing this restriction. For a string $x$ of length $k > 1$, let $\mathrm{prefix}(x)$ be the first $k - 1$ bits of $x$, and similarly define $\mathrm{suffix}(x)$ to be the last $k - 1$ bits of $x$. For a set $S$ of strings let $\mathrm{prefix}(S) = \bigcup_{x \in S} \mathrm{prefix}(x)$ and define $\mathrm{suffix}(S)$ similarly. Then for our code $C$ we can require that for all $u, v \in C$ with $u \neq v$, $\mathrm{prefix}(D_1(u)) \cap \mathrm{suffix}(D_1(v)) = \emptyset$. In Case 1, we have $Y(k + b, k + 2b - 3) \in \mathrm{prefix}(D_1(s_{i+1}))$, and in Case 2, we have $Y(k + b, k + 2b - 3) \in \mathrm{suffix}(D_1(s_{i+1}))$.
It seems that we have chosen our code so that we can distinguish Case 1 and Case 2, but this is not quite the case. The problem is that the bits $Y(k + b, k + 2b - 3)$ can indeed be in both $\mathrm{prefix}(D_1(C))$ and $\mathrm{suffix}(D_1(C))$; they simply cannot be in $\mathrm{prefix}(D_1(u))$ and $\mathrm{suffix}(D_1(v))$ for some $u \neq v$ in our code. There is nothing, however, that prevents these bits from being in both $\mathrm{prefix}(D_1(u))$ and $\mathrm{suffix}(D_1(u))$ for some $u \in C$. Moreover, this specific ambiguity seems unavoidable; for any $u \in C$, if we delete the first and last bit, we obtain a subsequence that is both in $\mathrm{prefix}(D_1(u))$ and $\mathrm{suffix}(D_1(u))$.
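The prefix and suffix sets can be checked mechanically. The Python sketch below, whose helper names are our own, tests whether some string lies in both prefix(D_1(u)) and suffix(D_1(v)), and confirms the unavoidable self-ambiguity: for every string u, the string obtained by deleting its first and last bits lies in both prefix(D_1(u)) and suffix(D_1(u)).

```python
from itertools import product


def d1(u):
    """All strings obtained by deleting exactly one bit from u."""
    return {u[:i] + u[i + 1:] for i in range(len(u))}


def prefix_set(strings):
    """Drop the last bit of every string in the set."""
    return {s[:-1] for s in strings}


def suffix_set(strings):
    """Drop the first bit of every string in the set."""
    return {s[1:] for s in strings}


def boundary_conflict(u, v):
    """True if some string is in both prefix(D_1(u)) and suffix(D_1(v))."""
    return not prefix_set(d1(u)).isdisjoint(suffix_set(d1(v)))


def self_ambiguity_holds(b):
    """Check that u minus its first and last bits is in both sets, for all u."""
    for bits in product("01", repeat=b):
        u = "".join(bits)
        if u[1:-1] not in prefix_set(d1(u)) & suffix_set(d1(u)):
            return False
    return True


print(boundary_conflict("000111", "111000"), boundary_conflict("000111", "100011"))
print(self_ambiguity_holds(6))
```

The pair 000111 and 111000 is a hand-picked toy example that happens to avoid the conflict in both orders; it is not taken from the codes reported in this paper.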
Notice, though, that under this restriction, the bits $Y(k + b, k + 2b - 3)$ do determine the segment $s_{i+1}$; that is, there is no ambiguity in what the next segment is, just where it starts and ends. By restricting our codewords slightly further, we can guarantee that this ambiguity does not increase from step to step. We prove this now.
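Theorem 2.1: Consider the segmented deletion channel with segment length $b$. Let $C$ be a subset of $\{0, 1\}^b$ with the following properties:
• for any $u, v \in C$ with $u \neq v$, $D_1(u) \cap D_1(v) = \emptyset$;
• for any $u, v \in C$ with $u \neq v$, $\mathrm{prefix}(D_1(u)) \cap \mathrm{suffix}(D_1(v)) = \emptyset$;
• any string of the form $a^*(ba)^*$ or $a^*(ba)^*b$, where $a, b \in \{0, 1\}$, is not in $C$. (Here $a^*$ is regular expression notation, and in these patterns $b$ denotes a bit rather than the segment length.)
Then, using $C$ as the code for each segment, there exists a linear time decoding scheme for the segmented deletion channel that looks ahead only $O(b)$ bits to decode each block.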
Proof: We follow the outline of our discussion. We decode segment by segment, with the invariant that when decoding the $i$th segment, we know it starts either at position $k$ or position $k + 1$ in $Y$. The possible ending positions of the $i$th segment are $y_{k+b-2}$, $y_{k+b-1}$, or $y_{k+b}$. We must eliminate either the first or third possibility to maintain our invariant, and we must recover the $i$th segment.
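We consider two cases. The simple case is when only one of $Y(k, k + b - 2)$ and $Y(k + 1, k + b - 1)$ is in $D_1(C)$. If $Y(k + 1, k + b - 1) \notin D_1(C)$, then the $i$th segment cannot start at $y_{k+1}$ and must start at $y_k$. In this case we can determine $s_i$ from $Y(k, k + b - 2)$ and the next segment starts either at $y_{k+b-1}$ or $y_{k+b}$. The argument is similar if $Y(k, k + b - 2) \notin D_1(C)$.
If instead both $Y(k, k + b - 2)$ and $Y(k + 1, k + b - 1)$ are in $D_1(C)$, then $Y(k + 1, k + b - 2)$ lies in both $\mathrm{suffix}(D_1(u))$ and $\mathrm{prefix}(D_1(v))$ for some $u, v \in C$, which by our second assumption forces $u = v$. These bits therefore determine $s_i$. We now show using our final assumption on the codewords that the next segment starts either at $y_{k+b-1}$ or $y_{k+b}$ (but not $y_{k+b+1}$). Assume the next segment starts at $y_{k+b+1}$. Then $s_i$ must be the subsequence $Y(k + 1, k + b)$. Further, as $Y(k, k + b - 2) \in D_1(s_i)$, we have that there exists $j$ with $k - 1 \leq j \leq k + b - 2$ and a bit $z$ such that
$y_k y_{k+1} \ldots y_j \, z \, y_{j+1} \ldots y_{k+b-2} = y_{k+1} \ldots y_{k+b}$.
(When $j = k - 1$, the left hand side is $z y_k y_{k+1} \ldots y_{k+b-2}$.) Comparing bit by bit, we have
$y_k = y_{k+1} = \cdots = y_{j+1}$, $z = y_{j+2}$, and $y_t = y_{t+2}$ for $j + 1 \leq t \leq k + b - 2$.
But then $s_i$ is of the form $a^*(ba)^*$ or $a^*(ba)^*b$, contradicting our assumption.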
The restriction on $C$ to exclude certain strings is an unfortunate byproduct of our approach. We emphasize, however, that of the $2^b$ possible codewords, only $O(b)$ of them are excluded. Hence we would expect that this restriction would not dramatically reduce the possible size of the code.
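The $O(b)$ claim can be checked by enumeration. The Python sketch below, with names of our own choosing, matches each string against the two excluded patterns using a regular expression and counts how many strings of a given length are excluded. To avoid clashing with the segment length, the two pattern bits are called `a` and `c` in the code.

```python
import re
from itertools import product


def is_excluded(s):
    """True if s has the form a*(ca)* or a*(ca)*c for some bits a, c."""
    for a, c in product("01", repeat=2):
        # The optional trailing c covers both excluded forms at once.
        if re.fullmatch(f"{a}*({c}{a})*{c}?", s):
            return True
    return False


def count_excluded(length):
    """Number of binary strings of the given length that are excluded."""
    words = ("".join(bits) for bits in product("01", repeat=length))
    return sum(1 for w in words if is_excluded(w))


print(count_excluded(8))
```

For segments of one byte this count is 16 out of 256 strings, and a short counting argument suggests the count is $2b$ for general $b \geq 2$.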
Given these restrictions, finding a valid $C$ for a given segment size $b$ corresponds naturally to an independent set problem, similar to those for 1-bit deletion codes [8]. We take the underlying graph where there is a vertex for each possible codeword, and two codewords are connected by an edge if they cannot simultaneously be in the code according to our restrictions. A valid code corresponds to an independent set on this graph, and we therefore seek a maximum independent set. For small $b$ this can be done by exhaustive calculation, and for larger $b$ heuristic techniques can be used to find large codes. In general, proving optimality for such independent set problems can be difficult; related results appear in [1], [9].
We have exhaustively checked to find optimal codes for $b = 8$ and 9, shown in Figure 1. When $b = 8$, so that segments are bytes, the (unique) optimal code contains 12 codewords, corresponding to a rate of slightly more than 44.8%. It is worth noting that even if segment markers were given at the receiving end, and an optimal 1-deletion correcting code is used per segment, the maximal such code has only 30 codewords [8], corresponding to a rate of slightly more than 61.3%. Our rate of 44.8% is over 73% of this benchmark. For $b = 9$ we found 28 different codes consisting of 20 codewords. Hence for $b = 9$ the rate is over 48%; comparing to the 52 codewords for an optimal 1-deletion correcting code for one segment, our codes achieve over 75% of this rate. We conjecture that the rates for optimal codes satisfying the conditions of Theorem 2.1 increase with $b$. We would also like for the ratio between the size of these codes and the optimal 1-deletion correcting codes to increase with $b$, and for both these ratios to converge to 1, but these conjectures may be too optimistic.
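The rates quoted here follow from the code sizes: a code with $M$ codewords of $b$ bits carries $\log_2 M$ message bits per segment, for a rate of $(\log_2 M)/b$. A short Python check, with a helper name of our own, reproduces the figures.

```python
from math import log2


def rate(num_codewords, b):
    """Rate of a per-segment code with the given size and segment length b."""
    return log2(num_codewords) / b


for label, size, b in [("segmented, b=8", 12, 8), ("1-deletion, b=8", 30, 8),
                       ("segmented, b=9", 20, 9), ("1-deletion, b=9", 52, 9)]:
    print(f"{label}: {rate(size, b):.4f}")

print(f"ratio b=8: {rate(12, 8) / rate(30, 8):.4f}")
print(f"ratio b=9: {rate(20, 9) / rate(52, 9):.4f}")
```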
The inherent limitations of exhaustive search prevent us from finding optimal codes for larger values of $b$. Indeed, [9] reports on the difficulties of finding independent sets for similar graphs arising from coding problems. Nevertheless, we find that using simple randomized greedy heuristics yields codes with good rates. For example, when $b = 16$, so segments are two bytes, we have found a code with 740 codewords, giving a rate of approximately 59.57%, by using a simple greedy strategy: repeatedly choose a remaining element of minimal degree, and delete the element and all of its neighbors from the graph.
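A minimal sketch of the minimal-degree greedy strategy follows, under stated assumptions: the conflict graph is built from the three conditions of Theorem 2.1 as we read them (disjoint descendant sets, disjoint prefix and suffix sets in both orders, and the excluded patterns), and ties are broken lexicographically rather than at random. All function names are our own, and the sketch makes no claim to reproduce the code sizes reported above.

```python
import re
from itertools import combinations, product


def d1(u):
    """All strings obtained by deleting exactly one bit from u."""
    return {u[:i] + u[i + 1:] for i in range(len(u))}


def is_excluded(s):
    """True if s has the form a*(ca)* or a*(ca)*c for some bits a, c."""
    return any(re.fullmatch(f"{a}*({c}{a})*{c}?", s)
               for a, c in product("01", repeat=2))


def conflict(u, v):
    """True if u and v cannot both be codewords."""
    du, dv = d1(u), d1(v)
    if du & dv:
        return True
    pre_u, suf_u = {s[:-1] for s in du}, {s[1:] for s in du}
    pre_v, suf_v = {s[:-1] for s in dv}, {s[1:] for s in dv}
    return bool(pre_u & suf_v) or bool(pre_v & suf_u)


def greedy_code(b):
    """Greedy independent set: repeatedly take a vertex of minimal degree."""
    words = ["".join(bits) for bits in product("01", repeat=b)]
    words = [w for w in words if not is_excluded(w)]
    neighbors = {w: set() for w in words}
    for u, v in combinations(words, 2):
        if conflict(u, v):
            neighbors[u].add(v)
            neighbors[v].add(u)
    code = []
    while neighbors:
        w = min(neighbors, key=lambda x: (len(neighbors[x]), x))
        code.append(w)
        removed = neighbors[w] | {w}
        for r in removed:
            del neighbors[r]
        for remaining in neighbors.values():
            remaining -= removed
    return code


print(greedy_code(6))
```

Replacing the lexicographic tie-break with a random choice and repeating the search several times would give a randomized variant of the same heuristic.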
Our decoding algorithm is particularly amenable to hardware implementation. One possible implementation (in pseudocode) is given as procedure LOCAL-DECODE in Figure 2. Each membership check could be performed by a lookup table, as could the DECODE operation, which decodes sequences to obtain a segment value. While the rates grow larger as $b$ increases, the computational problem of finding a code grows, as does the corresponding size of the lookup tables.
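The following Python sketch is one way the left-to-right decoder might look. It is reconstructed from the proof of Theorem 2.1 rather than from the LOCAL-DECODE pseudocode, and it uses a dictionary as the lookup table. The toy code {000111, 111000} with $b = 6$ is our own hand-checked example satisfying the conditions of the theorem, and all names are illustrative assumptions. Indices in the code are 0-based.

```python
from itertools import product


def d1(u):
    """All strings obtained by deleting exactly one bit from u."""
    return {u[:i] + u[i + 1:] for i in range(len(u))}


def delete_per_segment(x, b, pattern):
    """Apply pattern[i] (an offset in segment i, or None) to each segment."""
    out = []
    for i, pos in enumerate(pattern):
        seg = x[i * b:(i + 1) * b]
        out.append(seg if pos is None else seg[:pos] + seg[pos + 1:])
    return "".join(out)


def decode(y, code, b, num_segments):
    """Decode y segment by segment, keeping the invariant that the
    current segment starts at index k or k + 1."""
    # Lookup table from descendants to codewords; well defined because
    # the code is 1-deletion correcting.
    table = {d: u for u in code for d in d1(u)}
    k = 0
    decoded = []
    for _ in range(num_segments):
        first = y[k:k + b - 1]        # window if the segment starts at k
        second = y[k + 1:k + b]       # window if the segment starts at k + 1
        if first in table:
            # Either only the first window matches, or both do; in both
            # cases the next segment starts at k + b - 1 or k + b.
            decoded.append(table[first])
            k += b - 1
        elif second in table:
            # The segment must start at k + 1; the next one starts at
            # k + b or k + b + 1.
            decoded.append(table[second])
            k += b
        else:
            raise ValueError("input violates the segmented deletion model")
    return decoded


code = ["000111", "111000"]
sent = ["000111", "111000", "111000"]
received = delete_per_segment("".join(sent), 6, [2, None, 5])
print(received, decode(received, code, 6, 3))
```

A window that runs past the end of the received sequence is shorter than $b - 1$ bits and so never matches the table, which is how the sketch handles the final segment.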
For larger values of $b$, the lookup tables can be avoided, at the cost of more computation and perhaps some loss of rate. Specifically, the class of VT codes provides an example of 1-deletion correcting codes with simple decoding algorithms [8]. If one restricts oneself to a code that is a subset of a VT code meeting the required conditions, then one can use the decoding mechanism for VT codes in place of lookup operations. Subsets of VT codes have the further advantage that they are smaller than the entire set of possible codewords, making the search for appropriate maximal independent sets that yield codes easier. On the other hand, restricting oneself to subsets of VT codes will generally reduce the rate.

C. Encoding and Decoding for Insertions
Our approach works entirely similarly for the segmented insertion channel. In this model, the channel transmits a binary stream of fixed length $n$, given by $X = x_1 x_2 \ldots x_n$. The received sequence $Y = y_1 y_2 \ldots y_m$ is obtained by inserting a number of bits into the input sequence, under the following condition: at most one bit is added in each segment of bits $X(bi + 1, b(i + 1))$ for $i = 0, \ldots, n/b - 1$. The bit can be inserted before or after any bit in the sequence. (Note that under this model we can have two bits inserted in a row, but only on either side of a segment boundary.)
As before, under our encoding scheme, each segment will consist of one of a fixed set $C$ of $b$-bit codewords. Paralleling our previous notation, let $I_1(u)$ be the set of all $(b + 1)$-bit strings that can be obtained by inserting one bit into $u$, and $I_1(C) = \bigcup_{u \in C} I_1(u)$. We first show the corresponding version of Theorem 2.1 modified for insertion channels. We then prove something more subtle: our resulting codes for segmented insertion channels and segmented deletion channels are entirely the same.
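Theorem 2.2: Consider the segmented insertion channel with segment length $b$. Let $C$ be a subset of $\{0, 1\}^b$ with the following properties:
• for any $u, v \in C$ with $u \neq v$, $I_1(u) \cap I_1(v) = \emptyset$;
• for any $u, v \in C$ with $u \neq v$, $\mathrm{prefix}(I_1(u)) \cap \mathrm{suffix}(I_1(v)) = \emptyset$;
• any string of the form $a^*(ba)^*$ or $a^*(ba)^*b$, where $a, b \in \{0, 1\}$, is not in $C$.
Then, using $C$ as the code for each segment, there exists a linear time decoding scheme for the segmented insertion channel that looks ahead only $O(b)$ bits to decode each block.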
Proof: The proof follows the same pattern as Theorem 2.1. We decode segment by segment, with the invariant that when decoding the $i$th segment, we know it starts either at position $k$ or position $k + 1$ in $Y$. The possible ending positions of the $i$th segment are $y_{k+b-1}$, $y_{k+b}$, or $y_{k+b+1}$. We must eliminate either the first or third possibility to maintain our invariant, and we must recover the $i$th segment.
ISIT 2007, Nice, France, June 24 - June 29, 2007
As before, the simple case is when only one of $Y(k, k + b)$ and $Y(k + 1, k + b + 1)$ is in $I_1(C)$. In this case we can determine $s_i$ and the two possible starting points of the next segment.
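If instead both $Y(k, k + b)$ and $Y(k + 1, k + b + 1)$ are in $I_1(C)$, then the shared bits $Y(k + 1, k + b)$ lie in both $\mathrm{suffix}(I_1(u))$ and $\mathrm{prefix}(I_1(v))$ for some $u, v \in C$, which forces $u = v$. These bits determine the segment $s_i$. Our additional assumption on the codewords of $C$ will suffice to bound the ambiguity at the next step.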
Theorem 2.2 shows that we can solve a similar independent set problem to find codes for the segmented insertion channel. In fact, however, the codes obtained under Theorem 2.1 and Theorem 2.2 are actually the same. To demonstrate this requires the following straightforward lemma:
Lemma 2.1: For any $u, v \in \{0, 1\}^b$,
$D_1(u) \cap D_1(v) = \emptyset \leftrightarrow I_1(u) \cap I_1(v) = \emptyset$ (3)
and (abbreviating prefix by pre and suffix by suf)
$\mathrm{pre}(D_1(u)) \cap \mathrm{suf}(D_1(v)) = \emptyset \leftrightarrow \mathrm{pre}(I_1(u)) \cap \mathrm{suf}(I_1(v)) = \emptyset$. (4)
Note that, from this lemma, we have that the conditions of Theorem 2.1 and Theorem 2.2 are in fact equivalent, and hence a code derived by Theorem 2.1 for the segmented deletion channel would also be suitable for the segmented insertion channel (and vice versa).
The case where i = j follows similarly, as does the other direction of the equivalence.
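The two equivalences between deletion and insertion conditions can be sanity-checked exhaustively for small segment lengths. The Python sketch below, using helper names of our own, compares the deletion-side and insertion-side conditions over every ordered pair of $b$-bit strings. It is a numerical check under the reading of the lemma given by equation (4) and its companion for descendant sets, not a substitute for the proof.

```python
from itertools import product


def d1(u):
    """All strings obtained by deleting exactly one bit from u."""
    return {u[:i] + u[i + 1:] for i in range(len(u))}


def i1(u):
    """All strings obtained by inserting exactly one bit into u."""
    return {u[:i] + z + u[i:] for i in range(len(u) + 1) for z in "01"}


def pre(strings):
    """Drop the last bit of every string in the set."""
    return {s[:-1] for s in strings}


def suf(strings):
    """Drop the first bit of every string in the set."""
    return {s[1:] for s in strings}


def lemma_holds(b):
    """Check both equivalences for every ordered pair of b-bit strings."""
    words = ["".join(bits) for bits in product("01", repeat=b)]
    for u, v in product(words, repeat=2):
        if d1(u).isdisjoint(d1(v)) != i1(u).isdisjoint(i1(v)):
            return False
        if (pre(d1(u)).isdisjoint(suf(d1(v)))
                != pre(i1(u)).isdisjoint(suf(i1(v)))):
            return False
    return True


print(all(lemma_holds(b) for b in range(2, 7)))
```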

III. CONCLUSION
We have introduced the segmented deletion channel and the segmented insertion channel, new variations of insertion/deletion models motivated by timing considerations. We have demonstrated how to develop codes that allow for greedy left-to-right decoding for these segmented channels, based on controlling the inherent ambiguity. We have shown that such codes can achieve relatively high rates even under adversarial errors satisfying the segmentation condition. Our approach is sufficiently general that it should be applicable to similar channels. In the full paper [6], we further discuss extensions that achieve higher rates under less severe, non-adversarial conditions by allowing more ambiguity in the parsing process; such schemes naturally require significantly more complexity.