Suffix Array

1. Introduction

A Suffix Array is a sorted array of all suffixes of a given (usually long) text string T of length n, where n can be in the order of hundreds of thousands of characters.

The Suffix Array is a simple yet powerful data structure that is used in, among other things, full-text indices, data compression algorithms, and bioinformatics.

This data structure is closely related to the Suffix Tree data structure, and the two are usually studied together.

2. Suffix Array Visualization

The visualization of the Suffix Array is simply a table where each row represents a suffix and each column represents one attribute of that suffix.

The four (basic) attributes of each row i are:

Index i, ranging from 0 to n-1,
SA[i]: the starting index in T of the i-th lexicographically smallest suffix,
LCP[i]: the length of the Longest Common Prefix between the i-th and the (i-1)-th lexicographically smallest suffixes of T (we will see the application of this attribute later), and
Suffix T[SA[i]:]: the i-th lexicographically smallest suffix of T itself, running from index SA[i] to the end (index n-1).

Some operations add more attributes to each row; these are explained when those operations are discussed.
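
To make the four columns concrete, here is a minimal sketch that builds the table for a short string by brute force. The function name suffix_table is hypothetical, and the naive sort and character-by-character LCP loop are for illustration only; they are not the O(n log n) construction discussed later.

```python
def suffix_table(T):
    """Return rows (i, SA[i], LCP[i], suffix) using a naive sort of all suffixes."""
    n = len(T)
    # Sort starting indices by the suffix that begins there (naive, for illustration).
    SA = sorted(range(n), key=lambda j: T[j:])
    rows = []
    for i, start in enumerate(SA):
        lcp = 0
        if i > 0:
            prev = SA[i - 1]
            # Count matching leading characters with the previous row's suffix.
            while start + lcp < n and prev + lcp < n and T[start + lcp] == T[prev + lcp]:
                lcp += 1
        rows.append((i, start, lcp, T[start:]))
    return rows


for row in suffix_table("GATAGACA$"):
    print(row)
```

Each printed tuple corresponds to one row of the visualization table. The row for index 0 is assigned LCP 0 here because it has no previous suffix to compare against.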

3. Available Operations

All available operations on the Suffix Array are listed below.

Construct Suffix Array (SA) is the O(n log n) Suffix Array construction algorithm based on the idea of Karp, Miller, and Rosenberg (1972), which sorts prefixes of the suffixes in increasing length (1, 2, 4, 8, ...).
Search exploits the fact that the suffixes in the Suffix Array are sorted and calls two binary searches in O(m log n) to find the first and the last occurrence of a pattern string P of length m.
Longest Common Prefix (LCP) between two adjacent suffixes (excluding the first suffix) can be computed in O(n) using the Permuted LCP (PLCP) theorem. This algorithm is known as Kasai's algorithm.
Longest Repeated Substring (LRS) is a simple O(n) algorithm that finds the suffix with the highest LCP value.
Longest Common Substring (LCS) is a simple O(n) algorithm that finds the suffix with the highest LCP value among adjacent suffixes that come from two different strings.

3-1. Construct Suffix Array - UI

In this visualization, we show the proper O(n log n) construction of the Suffix Array based on the idea of Karp, Miller, and Rosenberg (1972), which sorts prefixes of the suffixes in increasing length (1, 2, 4, 8, ...), also known as the Prefix Doubling Algorithm.

We limit the input to 12 characters (it cannot be too long because of the available drawing space, although in real applications of the Suffix Tree n can be in the order of hundreds of thousands to millions of characters). The accepted characters are UPPERCASE letters (we delete any lowercase input) and the special terminating symbol '$' (i.e., [A-Z$]). If you do not write a terminating symbol '$' at the end of your input string, we will add it automatically. If you place a '$' in the middle of the input string, it will be ignored. If you enter an empty input string, we will fall back to the default "GATAGACA$".

For convenience, we provide a few classic test case input strings that are commonly found in Suffix Tree/Array lectures, but to showcase the strength of this visualization tool, you are encouraged to enter any 12-character string of your choice (ending with the character '$').

Note that the LCP Array column remains empty in this operation. Its values are computed separately via the Longest Common Prefix operation.

3-2. The Prefix Doubling Algorithm

The Prefix Doubling Algorithm runs in O(log n) iterations. Each iteration ranks the suffixes by the pair of substrings T[SA[i]:SA[i]+k] and T[SA[i]+k:SA[i]+2*k], i.e., first by pairs of single characters, then by the first two characters together with the next two, then by the first four characters together with the next four, and so on.

This algorithm is best explored via the visualization; see ConstructSA("GATAGACA$") in action.

Time complexity: there are O(log n) prefix doubling iterations, and in each iteration we call an O(n) Radix Sort, so the algorithm runs in O(n log n). This is fast enough to handle the long strings, up to n ≤ 200K characters, that appear in typical programming contest problems.
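
The doubling iterations described above can be sketched as follows. This is an illustrative sketch, not the implementation behind the visualization: the function name construct_sa is hypothetical, and it uses Python's built-in comparison sort in place of Radix Sort, so it runs in O(n log^2 n) rather than O(n log n).

```python
def construct_sa(T):
    """Prefix doubling sketch; built-in sort replaces radix sort (assumption)."""
    n = len(T)
    SA = list(range(n))
    rank = [ord(c) for c in T]  # rank by the first character only
    k = 1
    while k < n:
        # Sort key: (rank of the first k characters, rank of the next k characters).
        # A suffix shorter than k characters gets -1 for the missing second half.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        SA.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[SA[j]] = new_rank[SA[j - 1]] + (key(SA[j]) != key(SA[j - 1]))
        rank = new_rank
        if rank[SA[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2
    return SA


print(construct_sa("GATAGACA$"))
```

Replacing the call to SA.sort with two passes of a counting sort over the rank pairs would recover the O(n) per-iteration cost mentioned in the text.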

3-3. Search

After we construct the Suffix Array of T in O(n log n), we can search for the occurrences of a pattern string P of length m in O(m log n) by binary searching the sorted suffixes to find the lower bound (the first row where P is a prefix of the suffix) and the upper bound (the last row where P is a prefix of the suffix).

Time complexity: O(m log n). The search returns an interval of k rows, where k is the total number of occurrences.

For example, on the Suffix Array of T = "GATAGACA$" above, try these scenarios:

P returns a range of rows: Search("GA"), occurrences = {4, 0}
P returns one row only: Search("CA"), occurrences = {6}
P is not found in T: Search("WONKA"), occurrences = {NIL}
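
The two binary searches can be sketched as below. The function name search is hypothetical, and the Suffix Array is built with a naive sort only to keep the example self-contained. Each comparison looks at no more than m characters of a suffix, which is where the O(m log n) bound comes from.

```python
def search(T, SA, P):
    """Return the starting positions of P in T, in Suffix Array row order."""
    n, m = len(SA), len(P)
    # Lower bound: first row whose m-character prefix is >= P.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first row whose m-character prefix is > P.
    hi = n
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return [SA[i] for i in range(first, lo)]


T = "GATAGACA$"
SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive construction, for illustration
print(search(T, SA, "GA"))     # [4, 0]
print(search(T, SA, "WONKA"))  # []
```

An empty list plays the role of NIL in the scenarios above.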

3-4. Longest Common Prefix (LCP)

We can compute the Longest Common Prefix (LCP) of every two adjacent suffixes (in Suffix Array order) in O(n) time using the three phases of Kasai's algorithm. The algorithm exploits the following observation: if two adjacent suffixes (in Suffix Array order) share a long LCP, then after the first character is removed, most of that LCP is still shared with another suffix in positional order.

Phase 1: compute the values of Phi[], where Phi[SA[i]] = SA[i-1], in O(n). This lets the algorithm find, in O(1) time, which suffix comes immediately before suffix-SA[i] in Suffix Array order.

Phase 2: compute the values of PLCP[] between suffix-i, taken in positional order, and suffix-Phi[i] (the one immediately before suffix-i in Suffix Array order). When we advance to the next index i+1 in positional order, we remove the frontmost character of the suffix but may retain most of the LCP value between suffix-(i+1) and suffix-Phi[i+1]. The PLCP theorem (not proven here) shows that the LCP value can be incremented at most n times in total, and therefore can also be decremented at most n times, so the overall complexity of Phase 2 is also O(n).

Phase 3: compute the values of LCP[], where LCP[i] = PLCP[SA[i]], in O(n). These LCP values are the ones we use later in other Suffix Array applications.

Time complexity: Kasai's algorithm relies on the PLCP theorem, under which the total number of increment (and decrement) operations on the LCP value is at most O(n). Thus, Kasai's algorithm runs in O(n) overall. The combination of O(n log n) Suffix Array construction (via the Prefix Doubling algorithm) and O(n) computation of the LCP Array with Kasai's algorithm is therefore sufficient for typical programming contest problems involving long strings (up to n ≤ 200K characters).
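
The three phases can be sketched as follows. The function name compute_lcp is hypothetical, and the Suffix Array in the usage lines is built with a naive sort for brevity. Resetting L to 0 for the suffix that has no predecessor is a conservative choice of this sketch; it keeps the code correct even when T has no terminating symbol.

```python
def compute_lcp(T, SA):
    """Kasai-style LCP computation via Phi[] and PLCP[] (illustrative sketch)."""
    n = len(T)
    # Phase 1: Phi[SA[i]] = SA[i-1]; the first row has no predecessor (-1).
    Phi = [-1] * n
    for i in range(1, n):
        Phi[SA[i]] = SA[i - 1]
    # Phase 2: PLCP in positional order, keeping L-1 matched characters
    # when moving from suffix-i to suffix-(i+1).
    PLCP = [0] * n
    L = 0
    for i in range(n):
        if Phi[i] == -1:
            PLCP[i] = 0
            L = 0
            continue
        while i + L < n and Phi[i] + L < n and T[i + L] == T[Phi[i] + L]:
            L += 1
        PLCP[i] = L
        L = max(L - 1, 0)
    # Phase 3: put the values back into Suffix Array order.
    return [PLCP[SA[i]] for i in range(n)]


T = "GATAGACA$"
SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive construction, for illustration
print(compute_lcp(T, SA))
```

The inner while loop is the only place where L grows, and L shrinks by at most one per outer iteration, which is the counting argument behind the O(n) bound.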

3-5. Longest Repeated Substring (LRS)

After we construct the Suffix Array of T in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one.

This is because each value LCP[i] in the LCP Array is the length of the longest common prefix between two lexicographically adjacent suffixes: Suffix-SA[i] and Suffix-SA[i-1]. A positive value corresponds to an internal vertex of the equivalent Suffix Tree of T that branches out to at least two suffixes, so the common prefix of these adjacent suffixes is a repeated substring.

The longest common (repeated) prefix is the required answer, which can be found in O(n) by going through the LCP array once.

Without further ado, try LRS("GATAGACA$"). We have LRS = "GA".

It is possible that T contains more than one LRS, e.g., try LRS("BANANABAN$").
We have LRS = "ANA" (whose two occurrences overlap) or "BAN" (whose two occurrences do not overlap).
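
A minimal sketch of this scan is shown below. The names lrs and common_prefix_len are hypothetical, and for brevity the sketch builds the Suffix Array with a naive sort and recomputes each LCP value directly instead of using the O(n log n) and O(n) routines described earlier. It collects every distinct substring that attains the maximum LCP value.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def lrs(T):
    """Return all longest repeated substrings of T (empty list if none)."""
    SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive, for illustration
    best_len, found = 0, []
    for i in range(1, len(SA)):
        cur = common_prefix_len(T[SA[i]:], T[SA[i - 1]:])
        candidate = T[SA[i]:SA[i] + cur]
        if cur > best_len:
            best_len, found = cur, [candidate]
        elif cur == best_len and cur > 0 and candidate not in found:
            found.append(candidate)
    return found


print(lrs("GATAGACA$"))   # ['GA']
print(lrs("BANANABAN$"))  # ['ANA', 'BAN']
```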

3-6. Longest Common Substring (LCS)

After we construct the generalized Suffix Array of the concatenation T1$T2# of both strings, of total length n = n1+n2, in O(n log n) and compute its LCP Array in O(n), we can find the Longest Common Substring (LCS) of T1 and T2 by simply iterating through all LCP values and reporting the largest one whose two adjacent suffixes come from different strings.

Without further ado, try LCS("GATAGACA$", "CATA#") on the generalized Suffix Array of string T1 = "GATAGACA$" and T2 = "CATA#". We have LCS = "ATA".
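
The "different strings" check can be sketched as below. The function names are hypothetical, and the sketch assumes that T1 ends with '$', T2 ends with '#', and neither terminator appears elsewhere. As in the other sketches, a naive sort and a direct LCP computation stand in for the faster routines.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def lcs(T1, T2):
    """Longest common substring of T1 (ending in '$') and T2 (ending in '#')."""
    T = T1 + T2
    n1 = len(T1)
    SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive, for illustration
    best = ""
    for i in range(1, len(SA)):
        a, b = SA[i - 1], SA[i]
        # Skip adjacent rows whose suffixes start in the same source string.
        if (a < n1) == (b < n1):
            continue
        cur = common_prefix_len(T[a:], T[b:])
        if cur > len(best):
            best = T[b:b + cur]
    return best


print(lcs("GATAGACA$", "CATA#"))  # ATA
```

Because the terminators differ, a common prefix of suffixes from different strings can never run across the '$' boundary, so the reported substring lies entirely inside both T1 and T2.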

4. Implementation

You are allowed to use and modify our implementation code for fast Suffix Array + LCP construction, sa_lcp.cpp | py | java | ml, to solve programming contest problems that need it.