A Suffix Array is a sorted array of all suffixes of a given (usually long) text string T of length n characters (n can be on the order of hundreds of thousands of characters).
The Suffix Array is a simple yet powerful data structure that is used, among other things, in full-text indices, data compression algorithms, and bioinformatics.
This data structure is closely related to the Suffix Tree data structure, and the two are usually studied together.
The visualization of Suffix Array is simply a table where each row represents a suffix and each column represents the attributes of the suffixes.
The four (basic) attributes of each row i are:
Some operations add more attributes to each row; these are explained when those operations are discussed.
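To make the table concrete, here is a minimal sketch that builds its rows naively by sorting the suffixes directly. The function names `naive_suffix_array` and `suffix_table` are illustrative, not part of the visualization, and the approach is far too slow for long strings (each comparison may take O(n)). It also assumes '$' sorts before the uppercase letters, which holds for ASCII.

```python
def naive_suffix_array(t):
    """Return the start positions of the suffixes of t in sorted order (naive)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def suffix_table(t):
    """Return rows (i, SA[i], suffix starting at SA[i]) for the table view."""
    sa = naive_suffix_array(t)
    return [(i, sa[i], t[sa[i]:]) for i in range(len(t))]


if __name__ == "__main__":
    # The SA of "GATAGACA$" is [8, 7, 5, 3, 1, 6, 4, 0, 2].
    for i, start, suffix in suffix_table("GATAGACA$"):
        print(i, start, suffix)
```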
All available operations on the Suffix Array are listed below.
In this visualization, we show the proper O(n log n) construction of the Suffix Array based on the idea of Karp, Miller, and Rosenberg (1972): sort prefixes of the suffixes in increasing length (1, 2, 4, 8, ...), also known as the prefix doubling algorithm.
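We limit the input to 12 characters (it cannot be too long because of the available drawing space, but in real applications of the Suffix Tree, n can be on the order of hundreds of thousands to a million characters), drawn from the UPPERCASE alphabet (we delete your lowercase input) and the special terminating symbol '$' (i.e., [A-Z$]). If you do not write a terminating symbol '$' at the end of your input string, we will add it automatically. If you place a '$' in the middle of the input string, it will be ignored. If you enter an empty input string, we will default to "GATAGACA$".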
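For convenience, we provide a few classic test case input strings usually found in Suffix Tree/Array lectures, but to showcase the strength of this visualization tool, we encourage you to enter any 12-character string of your choice (ending with the character '$').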
Note that the LCP Array column remains empty in this operation. Its values are computed separately via the Longest Common Prefix operation.
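This prefix doubling algorithm runs in O(log n) iterations. In each iteration, it ranks every suffix by the substring T[SA[i] : SA[i]+k] followed by the next substring T[SA[i]+k : SA[i]+2*k], i.e., it first compares pairs of single characters, then the first two characters with the next two, then the first four characters with the next four, and so on.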
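This algorithm is best explored through the visualization, by watching the construction in action. Time complexity: there are O(log n) prefix doubling iterations, and each iteration calls an O(n) Radix Sort, so the algorithm runs in O(n log n). This is good enough for the long strings (up to n ≤ 200K characters) typically involved in programming contest problems.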
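The prefix doubling idea can be sketched as follows. This is a simplified illustration, not the implementation behind the visualization: it uses Python's built-in comparison sort instead of Radix Sort, so it runs in O(n log² n) rather than O(n log n). The name `build_suffix_array` and the sentinel rank -1 (for positions past the end of T) are assumptions of this sketch.

```python
def build_suffix_array(t):
    """Suffix array by prefix doubling (comparison sort stands in for radix sort)."""
    n = len(t)
    sa = list(range(n))
    rank = [ord(c) for c in t]  # initial rank: the first character of each suffix
    k = 1
    while k < n:
        def key(i):
            # Rank pair: first k characters, then the next k characters.
            # -1 marks "past the end", which sorts before every real rank.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Re-rank: suffixes with equal pairs share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank

        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2
    return sa


if __name__ == "__main__":
    print(build_suffix_array("GATAGACA$"))  # [8, 7, 5, 3, 1, 6, 4, 0, 2]
```

Swapping the `sa.sort(...)` call for a two-pass radix (counting) sort on the rank pairs is what brings each iteration down to O(n).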
After we construct the Suffix Array of T in O(n log n), we can search for the occurrences of a pattern string P of length m in O(m log n) by binary searching the sorted suffixes to find the lower bound (the first occurrence of P as a prefix of any suffix of T) and the upper bound (the last occurrence of P as a prefix of any suffix of T).
Time complexity: O(m log n). The search returns an interval of size k, where k is the total number of occurrences of P in T.
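The two binary searches can be sketched as below. The names `find_interval` and `find_occurrences` are illustrative; the suffix array is built naively here only to keep the example self-contained. The sketch relies on the fact that the length-m prefixes of the sorted suffixes are themselves in sorted order.

```python
def find_interval(t, sa, p):
    """Return the half-open interval [lo, hi) of SA rows whose suffix starts with p."""
    m = len(p)

    # Lower bound: first row whose length-m prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first row whose length-m prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def find_occurrences(t, sa, p):
    """Return the sorted start positions of p in t."""
    lo, hi = find_interval(t, sa, p)
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    T = "GATAGACA$"
    SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive stand-in
    print(find_occurrences(T, SA, "GA"))  # [0, 4]
    print(find_occurrences(T, SA, "A"))   # [1, 3, 5, 7]
    print(find_occurrences(T, SA, "Z"))   # []
```

Each comparison inspects at most m characters and each binary search takes O(log n) steps, which gives the O(m log n) bound.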
For example, on the Suffix Array of T = "GATAGACA$" above, try these scenarios:
We can compute the Longest Common Prefix (LCP) of every two adjacent suffixes (in Suffix Array order) in O(n) time using the three phases of Kasai's algorithm. The algorithm exploits one property: if two adjacent suffixes (in Suffix Array order) share a long LCP, then after the first character is removed, much of that long LCP still overlaps with another suffix in positional order.
First phase: compute the values of Phi[], where Phi[SA[i]] = SA[i-1], in O(n). This lets the algorithm know in O(1) time which suffix comes immediately before Suffix-SA[i] in Suffix Array order.
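Second phase: compute the PLCP[] values between Suffix-i in positional order and Suffix-Phi[i] (the one immediately before Suffix-i in Suffix Array order). When we advance to the next index i+1 in positional order, we remove the front-most character of the suffix but may retain most of the LCP value between Suffix-(i+1) and Suffix-Phi[i+1]. The PLCP theorem (not proven here) shows that the LCP value can be incremented at most n times in total, and therefore can be decremented at most n times, so the overall complexity of the second phase is also O(n).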
Third phase: compute the values of LCP[], where LCP[i] = PLCP[SA[i]], in O(n). These LCP values are the ones we use later in other Suffix Array applications.
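Time complexity: Kasai's algorithm relies on the PLCP theorem, under which the total number of increment (and decrement) operations on the LCP value is at most O(n). Kasai's algorithm therefore runs in O(n) overall. Hence, O(n log n) Suffix Array construction (via the prefix doubling algorithm) followed by O(n) computation of the LCP Array with Kasai's algorithm is sufficient for typical programming contest problems involving long strings (up to n ≤ 200K characters).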
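The three phases can be sketched as follows. The function name `compute_lcp` is illustrative, and the suffix array is passed in as an argument (built naively in the usage example to keep it self-contained). As a conservative choice of this sketch, the running length is reset to 0 at the suffix that has no predecessor in Suffix Array order.

```python
def compute_lcp(t, sa):
    """Return LCP[], where LCP[i] is the LCP of the suffixes at SA[i-1] and SA[i]."""
    n = len(t)

    # Phase 1: Phi[SA[i]] = SA[i-1]; the smallest suffix has no predecessor.
    phi = [-1] * n
    for i in range(1, n):
        phi[sa[i]] = sa[i - 1]

    # Phase 2: PLCP[] in positional order, reusing the previous length minus one.
    plcp = [0] * n
    length = 0
    for i in range(n):
        if phi[i] == -1:
            length = 0  # no predecessor: nothing to compare against
            continue
        j = phi[i]
        while i + length < n and j + length < n and t[i + length] == t[j + length]:
            length += 1
        plcp[i] = length
        length = max(length - 1, 0)  # dropping the first character loses at most 1

    # Phase 3: permute back into Suffix Array order.
    return [plcp[sa[i]] for i in range(n)]


if __name__ == "__main__":
    T = "GATAGACA$"
    SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive stand-in
    print(compute_lcp(T, SA))  # [0, 0, 1, 1, 1, 0, 0, 2, 0]
```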
After we construct the Suffix Array of T in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one.
This is because each value LCP[i] in the LCP Array is the length of the longest common prefix of two lexicographically adjacent suffixes: Suffix-SA[i] and Suffix-SA[i-1]. This corresponds to an internal vertex of the equivalent Suffix Tree of T that branches out to at least two suffixes, so the common prefix of these adjacent suffixes is repeated in T.
The longest common (repeated) prefix is the required answer, which can be found in O(n) by going through the LCP array once.
For example, on T = "GATAGACA$" we have LRS = "GA". It is possible that T contains more than one LRS, e.g., a text in which both "ANA" (whose two occurrences overlap) and "BAN" (without overlap) are longest repeated substrings.
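The scan over the LCP values can be sketched as below. To stay self-contained, this sketch sorts the suffixes naively and compares adjacent suffixes directly instead of using the O(n log n) construction and the O(n) LCP computation, so only the final scan reflects the method described above. The name `longest_repeated_substring` is illustrative. When several substrings tie, it reports the first one in Suffix Array order.

```python
def longest_repeated_substring(t):
    """Return one longest substring that occurs at least twice in t."""
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # naive stand-in for the SA

    best_len, best_pos = 0, 0
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        # LCP of the two adjacent suffixes (naive stand-in for LCP[i]).
        length = 0
        while a + length < n and b + length < n and t[a + length] == t[b + length]:
            length += 1
        if length > best_len:  # strict: keeps the first maximum found
            best_len, best_pos = length, b
    return t[best_pos:best_pos + best_len]


if __name__ == "__main__":
    print(longest_repeated_substring("GATAGACA$"))  # GA
```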
After we construct the generalized Suffix Array of the concatenation of both strings, T1$T2# of length n = n1+n2, in O(n log n) and compute its LCP Array in O(n), we can find the Longest Common Substring (LCS) of T1 and T2 by iterating through all LCP values and reporting the largest one whose two adjacent suffixes come from different strings.
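A minimal sketch of this idea follows. It assumes that T1 ends with '$', that T2 ends with '#', and that neither terminator appears elsewhere, so a common prefix can never run across the boundary between the two strings. As a simplification, the suffix array and the adjacent LCP values are computed naively. The name `longest_common_substring` is illustrative.

```python
def longest_common_substring(t1, t2):
    """Return one longest substring common to t1 (ending in '$') and t2 (ending in '#')."""
    t = t1 + t2
    n, n1 = len(t), len(t1)
    sa = sorted(range(n), key=lambda i: t[i:])  # naive generalized suffix array

    best_len, best_pos = 0, 0
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        # Only consider adjacent suffixes that start in different strings.
        if (a < n1) == (b < n1):
            continue
        length = 0
        while a + length < n and b + length < n and t[a + length] == t[b + length]:
            length += 1
        if length > best_len:
            best_len, best_pos = length, b
    return t[best_pos:best_pos + best_len]


if __name__ == "__main__":
    print(longest_common_substring("GATAGACA$", "CATA#"))  # ATA
```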
For example, on the generalized Suffix Array of the strings T1 = "GATAGACA$" and T2 = "CATA#", we have LCS = "ATA".
You are allowed to use/modify our implementation code for fast Suffix Array+LCP: sa_lcp.cpp | py | java | ml to solve programming contest problems that need it.