程式扎記: [ IR Class ] Overall View on IR : Frequency-Based Indexing Method

2012年4月18日星期三

[ IR Class ] Overall View on IR : Frequency-Based Indexing Method

Preface :
在進行文件檢索最簡單的方法, 便是透過查詢的字串與文件比對. 並透過出現次數來進行 weighting. (某個查詢 term 在文件中出現次數多, 代表越相關.). 而這樣的 Model 我們稱之為 Frequency-Based Indexing Method. 底下將簡單對此方法進行介紹.

Term-Frequency Consideration :
在 IR 有個常見的代號 tf (term frequency) 說明某個 term 在某個 doc 中的出現次數. 但在實務上我們會對這些 term 簡單進行以下分類 :
- Function words

* for example, "and", "or", "of", "but", …
* the frequencies of these words are high in all texts

- Content words

* words that actually relate to document content
* varying frequencies in the different texts of a collection
* indicate term importance for content

而我們有興趣的是 Content words, 因此在 IR 開始進行 indexing 前, 通常會有所謂的 stop list 用來移除這些 functional words 以避免對後續查詢造成偏差.

A Frequency-Based Indexing Steps :
底下是使用 Frequency-Based approach 的步驟說明 :

- Log-frequency weighting
通常當我們再進行資料檢索時, 會使用查詢的字對所有文件進行 "計分" 的動作, 而最後便將那些分數較高的文件當作相關文件回傳. 而這個計分的動作我們可以把它想成一種 Model, 不同的 Model 會有不同的計分公式與方式. 而我們的 Model 是透過查詢的字元在文件的出現頻率當作比對的依據 :

在這裡我們使用下面的公式將 "查詢" 對所有 "文件" 進行排序 (Ranking) :

Example :
這裡我寫了一個簡單的教學工具 IRToolkit.jar (下載), 你可以將他加入你的 Java 專案的 Libraries 後便可以如下使用 :

- FBModel .java :

view plain copy to clipboard print ?

package ir.prac;



import java.util.List;

import john.ir.fbi.Model;

import john.ir.fbi.doc.Doc;



public class FBModel {



    /**

     * @param args

     */

    public static void main(String[] args) {

        /*建立 Model*/

        Model m = new Model();



        /*添加文件*/

        m.addDoc("A B C A A B F");

        m.addDoc("C C C D E G H I");

        m.addDoc("Z Z A C E J U");

        m.addDoc("E E C A D D Q P");

        m.addDoc("V A Z X");

        m.addDoc("S P O I A");

        m.addDoc("X B");

        m.addDoc("T K K M");



        String qs = "A V M"; /*查詢字串*/

        int top = 6; /*取出 top 3 relevant docs*/



        System.out.printf("\t[Info] Query String='%s'...\n", qs);

        List relvDocs = m.query(qs, top);

        System.out.printf("\t[Info] Using Model (tf):\n");

        for(Doc d:relvDocs)

        {

            System.out.printf("\t[Info] Retrieve : %s\n", d);

        }

        System.out.println();

    }

}

執行過程訊息如下 :

[Info] Query String='A V M'...
[Info] [Doc001] Hit A->3 : Score=1.477121...
[Info] [Doc001] Miss V...
[Info] [Doc001] Miss M...
[Info] [Doc001] Total Score=1.48...
...
[Info] Using Model (tf):
[Info] Retrieve : 005{V:1 A:1 X:1 Z:1 }:Score=2.00
[Info] Retrieve : 001{F:1 A:3 B:2 C:1 }:Score=1.48
[Info] Retrieve : 003{U:1 E:1 A:1 C:1 J:1 Z:2 }:Score=1.00
[Info] Retrieve : 004{D:2 E:2 Q:1 P:1 A:1 C:1 }:Score=1.00
[Info] Retrieve : 006{A:1 P:1 S:1 O:1 I:1 }:Score=1.00
[Info] Retrieve : 008{T:1 M:1 K:2 }:Score=1.00

可以看出第 5 號文件有最高的 Ranking : 005{V:1 A:1 X:1 Z:1 }:Score=2.00

Supplement :
* 台大資工課程 > 資訊檢索與擷取

程式扎記

標籤

2012年4月18日星期三

[ IR Class ] Overall View on IR : Frequency-Based Indexing Method

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2012年4月18日 星期三

[ IR Class ] Overall View on IR : Frequency-Based Indexing Method

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

2012年4月18日星期三