程式扎記: [ Algorithm in Java ] 搜尋 : KMP 搜尋法

前言 :
字串搜尋演算法即是在一份文件中尋找一段字串。由於文件的內容可能相當龐大或者複雜，因此有效率的字串搜尋演算法經常扮演著重要的角色.

暴力法 :
我們可以定義一個索引i（或指標）用來指向文件中開始比對的起點字元。當比對過程中遇到某個字元不符合時，就移動此索引到下一個字元，並重新比對。這個類似地毯式搜尋的方法稱為暴力搜尋法. 考慮下面範例, 在 i=0 時，只有前三個字元是相符的，所以該次比對結果為失敗。下一次比對就從T[1]（i=1）開始。以此類推，當i=13時，我們找到了與P相符的字串{a, b, c, d, a, d}。總共經過34次比對 :

- 複雜度分析
假設字串P的長度為M，文件長度為N。欲找到相符字串其在最壞情形下，可能需要啟動(N-M+1)次搜尋。如果考慮字串P的長度，那麼最多共需要M(N-M+1)次字元比對，才
能確認字串搜尋的結果。一般而言，M會比N小很多，所以暴力字串搜尋法大概需要經過 MN 次字元比對!

隨著字串處理的需求日益增加，暴力搜尋法在實際應用上的表現O(MN) ，顯得差強人意。於是有些更快速的字串搜尋演算法利用字串的前置處理，減少字串比對的次數. 通常這些演算法在實際應用上，可以達到 O(M+N) 的運算複雜度.

Knuth-Morris-Pratt 演算法 : (wiki)
KMP演算法的原理在當字元比對不符時，我們可以從比對字串 P 中得知某些字元可以略過不需檢查. 如此就不必從文件T的下一個字元重新開始比對 :

- 演算法說明
KMP 演算法的關鍵在於建構一個部分相符表格，好用來計算比對字串位移的大小. 由於我們是在比對失敗時，才參考這個表格，所以此表格又被稱為失誤函數. 該函數定義一表格next[M]，當比對失敗發生在 P[j] 時，可以拿 P[next[j]] 作為下一個重新比對的起點. 以P={a,b, c, d, a, d}為例，next表格的內容為{-1, 0, 0, 0, 0, 1}.

在建構失誤函數時, 當字元比對失敗在P[j]，而且 P[j-k]…P[j-1] 正好等於 P[0]…P[k-1] 時，我們可以得到 next[j]=k, 這樣的過程請參考下面說明 :

而對應的代碼我們使用函數 buildKMP() 來建立失誤表格 (next[]), 代碼如下 :

- 函數 buildKMP() 代碼 :

view plain copy to clipboard print ?

public static int[] buildKMP(char[] ptnChars)

{

    int kmpMtx[] = new int[ptnChars.length+1];

    int i=0, j=kmpMtx[0]=-1;

    while(i

    {

        /*Reset j*/

        while(j>-1 && ptnChars[i]!=ptnChars[j])

        {

            j = kmpMtx[j]; /*Total match matched j time, let j = match_next[j]*/

        }

        i++;

        j++;

        if(i

        {

            /*If ptnChars[i]==ptnChars[j], keep increasing j (matched j time)*/

            /*Let kmpMtx[i]=kmpMtx[j], so */

            kmpMtx[i] = j>0?kmpMtx[j]:0;

        }

        else

        {

            kmpMtx[i] = j;

        }

    }

    return kmpMtx;

}

有了失誤表格, 接著我們就可以繼續 KMP 搜尋演算法的說明, 接著我們使用一個類別 KMPSearchAlg 並將目的字串傳入建構子, 接著便可以呼叫函示 search() 並傳入搜尋的 Pattern 字串 (P[]) 找尋目的字串是否含 Pattern 字串, 如果有則傳回 Pattern 字串在目的字串中出現的第一個位置, 否則傳回 -1. 完整代碼如下 :

- KMPSearchAlg.java :

view plain copy to clipboard print ?

package alg.ssearch;



public class KMPSearchAlg {

    public String str;

    public KMPSearchAlg(String oStr)

    {

        str = oStr;

    }

    public static int[] buildKMP(char[] ptnChars)

    {

        System.out.println("\t[Test] Ptn="+String.valueOf(ptnChars));

        int kmpMtx[] = new int[ptnChars.length+1];

        int i=0, j=kmpMtx[0]=-1;

        while(i

        {

            /*Reset j*/

            while(j>-1 && ptnChars[i]!=ptnChars[j])

            {

                System.out.printf("\t[Reset] j=kmpMtx[%d]=%d...\n", j, kmpMtx[j]);

                j = kmpMtx[j]; /*Total match matched j time, let j = match_next[j]*/

            }

            i++;

            j++;

            System.out.printf("\t[Test] Comparing p[i=%d]=%c with p[j=%d]=%c...", i, i'*',j,ptnChars[j]);

            if(i

            {

                /*If ptnChars[i]==ptnChars[j], keep increasing j (matched j time)*/

                /*Let kmpMtx[i]=kmpMtx[j], so */

                //kmpMtx[i] = j>0?kmpMtx[j]:0;

                kmpMtx[i] = kmpMtx[j];

                System.out.printf("(1)kmpMtx[%d]=%d\n", i, kmpMtx[i]);

            }

            else

            {

                kmpMtx[i] = j;

                System.out.printf("(2)kmpMtx[%d]=%d\n", i, kmpMtx[i]);

            }

        }

        return kmpMtx;

    }



    public int search(String ptn)

    {

        char[] oStrChars = str.toCharArray();

        char[] ptnChars = ptn.toCharArray();

        int main_index=0;

        int match_index=0;



        /*First round comparing*/

        while(main_index

        {

            if(oStrChars[main_index]==ptnChars[0])

            {

                main_index++;

                match_index=1;

                while(main_index

                {

                    if(oStrChars[main_index]!=ptnChars[match_index]) break;

                    main_index++;

                    match_index++;

                }

                break;

            }

            main_index++;

        }

        if(match_index==ptn.length()) return main_index-match_index; /*Already matched*/

        int kmpMtx[] = KMPSearchAlg.buildKMP(ptn.toCharArray());



        /*Second round comparing*/

        while(main_index

        {

            char c = oStrChars[main_index];

            while(ptnChars[match_index]!=c)

            {

                match_index=kmpMtx[match_index];

                if(match_index<0) break;

                if(ptnChars[match_index]==c)System.out.printf("\t[KMP] Start searching on match_index=%d...\n", match_index);

            }

            main_index++;

            match_index++;

            if(match_index==0)

            {

                /*Simple string matching*/

                while(main_index

                {

                    if(oStrChars[main_index]==ptnChars[0])

                    {

                        main_index++;

                        match_index=1;

                        while(main_index

                        {

                            if(oStrChars[main_index]!=ptnChars[match_index]) break;

                            main_index++;

                            match_index++;

                        }

                        break;

                    }

                    main_index++;

                }

            }

        }



        if(match_index==ptn.length()) return main_index-match_index; /*Matched*/

        return -1;

    }



    public static void testSearch()

    {

        //testKMPMtx();

        KMPSearchAlg kmpAlg = new KMPSearchAlg("ABCABCDABABCDABCDAD");

        int mIndex = kmpAlg.search("ABCDAD");

        if(mIndex>0)

        {

            System.out.println("\t[KMP] Match!");

            System.out.println(kmpAlg.str);

            for(int i=0; i" ");

            System.out.println("^");

        }

        else

        {

            System.out.println("\t[KMP] Mismatch!");

        }

    }



    public static void main(String args[])

    {

        testSearch();

    }

}

執行結果如下 :

- 演算法分析
建構next表格的複雜度為O(M)，所以 KMP 演算法的整體效率為 O(M+N)!
在實際應用上，KMP演算法可能不會比暴力法快多少。原因在於很少有文件具備重複的內容，而且字串P的內容也很少有重疊的樣式。KMP法在搜尋存放在硬碟的大型文件時，還是有顯著的好處。因為索引i不會遞減的特性，就可以減少在搜尋字串時，存取硬碟的次數.

程式扎記

標籤

2011年9月30日星期五

[ Algorithm in Java ] 搜尋 : KMP 搜尋法

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2011年9月30日 星期五