Suffix array construction (or, more generally, suffix sorting) is an important task of finding the lexical order of all sub-strings of a given string. It is heavily used in data indexing and BWT compression, which I've been doing for a long time to date.

First of all, it was surprising to me to discover that the subject is possible. By

**linear time**I mean O(N) asymptotic execution time, where N is the input array size. By

**no extra space**I mean that the only memory required is for the input array (N bytes), suffix array (4N bytes) and some constant storage (O(1) bytes).

Induction sort (SA-IS) is the key. Authors did a great work of exploring it. The major breakthrough was to use the induction to sort LMS sub-strings, in addition to using it afterwards to recover the order of all other strings. The only thing missing was a proper memory requirement for the algorithm. Authors claim 2N is the worst-case additional memory. Let's see how we can narrow this value down.

Proof

For an input string of size N, let's define M to be the total number of LMS-type suffixes. The memory (

*R(N,M)*machine words) required for a recursive call consists of three sections:

- Suffix storage = M
- Data storage = M
- Radix storage =
*K(M)*, where*K(M)*is the number of unique LMS sub-strings

*R(N,M) = M + M + K(M) <= 3M/2*

Now, let's split LMS sub-strings according to their length. There can be LMS of size 1. There are

*L2*of size 2,

*L3*of size 3, and so on. Let's now define M and N in terms of these numbers:

*M = Sum(i){ Li }*

*N = Sum(i){ i * Li }*

We know that LMS of different size can not be equal. Therefore, we can safely split

*K(M)*:

*K(M) = Sum(i){ K(Li) } = K(L2) + K(L3) + ... + K(Li) + ...*

*K(Li) <= Li*

We know that the maximum number of unique sub-strings of size 2 can not exceed 2^16, which is a rather small number. For the convenience let's name

*a=L2*and

*b=L3+L4+...= Sum(i>2){ Li }*. It is the time to give a better upper bound to

*R(N,M)*:

*M = a+b*

*N = 2L2 + 3L3 + 4L4 + ... <= 2L2 + 3(L2+L3+...) = 2a + 3b*

*R(N,M) = 2M + K(M) = 2a+2b + K(a) + K(b) <= 2a+2b + 2^16 + b = 2a+3b + 2^16 <=*

**N+O(1)**Well, that's it. By carefully arranging the data in memory and allocating just 2^16 additional words we can perform a recursive call to SA-IS, and, therefore, construct a complete suffix array.

Code

My version of SA-IS is called A7. Advantages to Yuta's SA-IS implementation (v2.4.1):

**5N**is the worst-case memory requirement. It doesn't depend on the input.- No memory allocation/free calls inside the sorting procedure.
- Data squeezing: using smaller data types for recursion if range allows.
- Cleaner C++ code with more comments.

## No comments:

## Post a Comment