Monday, January 9, 2012

Functional thinking

I started reading a book "Learn You a Haskell". It is truly wonderful, explaining complex things in a simple friendly way. Haskell itself seems to be the functional language. It's pureness makes you feel writing math  lemmas and theorems in a formal language, it is a very different feeling from C++/Boo I used to program in.

Surprisingly enough, functional programming makes us care about the goal, or a shape of the result. You answer questions like "What should it look like? What does it consist of?". At the same time, imperative programming involves asking the question "How?" most of the time.

As a first milestone and a practical task in learning Haskell I wrote the Burrows-Wheeler Transformation (BWT). I never thought it could be implemented in just 10 lines (not counting qsort), and remain nicely readable after that:

qsort :: (Ord a) => [a] -> [a]
qsort [] = []
qsort (x:xs) = qsort left ++ [x] ++ qsort right
where left = filter (<=x) xs
right = filter (>x) xs
bwt :: String -> (Int,String)
bwt input =
let buildMx [] _ = []
buildMx (x:xs) ch = (x:xs,ch) : buildMx xs x
mx = buildMx input (last input)
sorted = qsort mx
output = map snd sorted
base = head [ i | (i,(s,_))<-zip [0..] sorted, s==input ]
in (base,output)

Friday, January 6, 2012

Suffix sorting in linear time with no extra space

Suffix array construction (or, more generally, suffix sorting) is an important task of finding the lexical order of all sub-strings of a given string. It is heavily used in data indexing and BWT compression, which I've been doing for a long time to date.
First of all, it was surprising to me to discover that the subject is possible. By linear time I mean O(N) asymptotic execution time, where N is the input array size. By no extra space I mean that the only memory required is for the input array (N bytes), suffix array (4N bytes) and some constant storage (O(1) bytes).

Induction sort (SA-IS) is the key. Authors did a great work of exploring it. The major breakthrough was to use the induction to sort LMS sub-strings, in addition to using it afterwards to recover the order of all other strings. The only thing missing was a proper memory requirement for the algorithm. Authors claim 2N is the worst-case additional memory. Let's see how we can narrow this value down.

For an input string of size N, let's define M to be the total number of LMS-type suffixes. The memory (R(N,M) machine words) required for a recursive call consists of three sections:

  • Suffix storage = M
  • Data storage = M
  • Radix storage = K(M), where K(M) is the number of unique LMS sub-strings

R(N,M) = M + M + K(M) <= 3M/2

Now, let's split LMS sub-strings according to their length. There can be LMS of size 1. There are L2 of size 2, L3 of size 3, and so on. Let's now define M and N in terms of these numbers:
M = Sum(i){ Li }
N = Sum(i){ i * Li }

We know that LMS of different size can not be equal. Therefore, we can safely split K(M):
K(M) = Sum(i){ K(Li) } = K(L2) + K(L3) + ... + K(Li) + ...
K(Li) <= Li

We know that the maximum number of unique sub-strings of size 2 can not exceed 2^16, which is a rather small number. For the convenience let's name a=L2 and b=L3+L4+...= Sum(i>2){ Li }. It is the time to give a better upper bound to R(N,M):
M = a+b
N = 2L2 + 3L3 + 4L4 + ... <= 2L2 + 3(L2+L3+...) = 2a + 3b
R(N,M) = 2M + K(M) = 2a+2b + K(a) + K(b) <= 2a+2b + 2^16 + b = 2a+3b + 2^16 <= N+O(1)

Well, that's it. By carefully arranging the data in memory and allocating just 2^16 additional words we can perform a recursive call to SA-IS, and, therefore, construct a complete suffix array.

My version of SA-IS is called A7. Advantages to Yuta's SA-IS implementation (v2.4.1):
  • 5N is the worst-case memory requirement. It doesn't depend on the input.
  • No memory allocation/free calls inside the sorting procedure.
  • Data squeezing: using smaller data types for recursion if range allows.
  • Cleaner C++ code with more comments.