Friday, January 6, 2012

Suffix sorting in linear time with no extra space

Intro
Suffix array construction (or, more generally, suffix sorting) is an important task of finding the lexical order of all sub-strings of a given string. It is heavily used in data indexing and BWT compression, which I've been doing for a long time to date.
First of all, it was surprising to me to discover that the subject is possible. By linear time I mean O(N) asymptotic execution time, where N is the input array size. By no extra space I mean that the only memory required is for the input array (N bytes), suffix array (4N bytes) and some constant storage (O(1) bytes).

Induction sort (SA-IS) is the key. Authors did a great work of exploring it. The major breakthrough was to use the induction to sort LMS sub-strings, in addition to using it afterwards to recover the order of all other strings. The only thing missing was a proper memory requirement for the algorithm. Authors claim 2N is the worst-case additional memory. Let's see how we can narrow this value down.

Proof
For an input string of size N, let's define M to be the total number of LMS-type suffixes. The memory (R(N,M) machine words) required for a recursive call consists of three sections:

  • Suffix storage = M
  • Data storage = M
  • Radix storage = K(M), where K(M) is the number of unique LMS sub-strings

R(N,M) = M + M + K(M) <= 3M/2

Now, let's split LMS sub-strings according to their length. There can be LMS of size 1. There are L2 of size 2, L3 of size 3, and so on. Let's now define M and N in terms of these numbers:
M = Sum(i){ Li }
N = Sum(i){ i * Li }

We know that LMS of different size can not be equal. Therefore, we can safely split K(M):
K(M) = Sum(i){ K(Li) } = K(L2) + K(L3) + ... + K(Li) + ...
K(Li) <= Li

We know that the maximum number of unique sub-strings of size 2 can not exceed 2^16, which is a rather small number. For the convenience let's name a=L2 and b=L3+L4+...= Sum(i>2){ Li }. It is the time to give a better upper bound to R(N,M):
M = a+b
N = 2L2 + 3L3 + 4L4 + ... <= 2L2 + 3(L2+L3+...) = 2a + 3b
R(N,M) = 2M + K(M) = 2a+2b + K(a) + K(b) <= 2a+2b + 2^16 + b = 2a+3b + 2^16 <= N+O(1)

Well, that's it. By carefully arranging the data in memory and allocating just 2^16 additional words we can perform a recursive call to SA-IS, and, therefore, construct a complete suffix array.

Code
My version of SA-IS is called A7. Advantages to Yuta's SA-IS implementation (v2.4.1):
  • 5N is the worst-case memory requirement. It doesn't depend on the input.
  • No memory allocation/free calls inside the sorting procedure.
  • Data squeezing: using smaller data types for recursion if range allows.
  • Cleaner C++ code with more comments.

No comments:

Post a Comment