程式扎記: [ IR Class ] IR Model : Probabilistic Model

2012年4月19日星期四

[ IR Class ] IR Model : Probabilistic Model

Preface :
在傳統常見的 IR Model 約有以下幾種 :
- Boolean Model

The Boolean model of information retrieval is a classical information retrieval (IR) model and, at the same time, the first and most adopted one. It is used by virtually all commercial IR systems today.

- Vector Model

Vector space model (or term vector model) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

- Probabilistic Model

Given a query, there is an ideal answer set which a set of documents which contains exactly the relevant documents and no other. And query process is a process of specifying the properties of an ideal answer set.

這邊要介紹的是 Probabilistic Model, 並透過簡單範例介紹如何利用此 Model 進行查詢 Ranking.

Probabilistic Principle :
使用 Probabilistic Model 的流程如下, 透過 "查詢" 來計算文件是相關文件的機率 :

簡單來說就是會有一個初始的 Rating, 透過第一次取回的文件讓使用者確認那些是相關, 與不相關文件. 接著透過 Feedback 的相關文件再計算下次文件的 Ranking. 而 Ranking 使用的 weighting 如下所示 (dj 為 Docj, q 為查詢) :

Similarity :
在開始檢視用來 Ranking 的公式前, 先來看看使用到的 Terminology 與符號說明 :

在 Probabilistic Model 透過下面公式 sim(dj, q) 來計算 "查詢" 與 "文件" 的相似度, 越相似代表越相關 (Ranking 越高) :

透過推導可以得到下面最終的公式長相 :

- Initial guess
上面的公式 R 是指所有相關文件的 collection. 問題是在一開始的時候根本不知道 R 是什麼, 因此需要一個 initial guess 來讓公式可以求值 :

- Initial ranking
有了 initial guess, 接下來我們便可以進行 Ranking, 並且將取回的 answer set 用下面符號作為後續公式推導 :
* V

a subset of the documents initially retrieved and ranked by the probabilistic model (top r documents)

* Vi

subset of V composed of documents which contain the index term ki

- Feedback ranking
所以我們可以算出在 Initial Guess 後的 Initial Ranking :

Smoothing :
Smoothing 是一個很大的 topic, 一般是解決 Sparse Matrix 的現象. 這邊就公式而言 feedback 的 V 可能是 0, 這樣就會有無窮大的問題發生, 因此為了避免這個問題, 透過 Smoothing 的技巧如下 :

這樣就可以避免分母是 0 的狀況使得 Ranking 的動作可以繼續.

Example :
接著我們來看一個簡單範例, 考慮有 Query 與三個文件如下 (N=3) :

Query: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

第一步就是把 functional word 如 a, of 等排除在 feature terms 外. 並找出 key words 與算出 idf (Inverse of Document Frequency) :

接著選出關鍵字並將之編號 ki :

- Initial Guess

- Feedback search
接著有 Initial Guess 與 Initial Ranking 後, 假設我們選擇 d2 為 Answer set, 則 (使用 Alternative 1 的 Smoothing):

接著我們可以使用套件 IRToolkit.jar 來驗證上面的說明 :

- ProbModel.java :

view plain copy to clipboard print ?

package ir.prac;



import java.util.List;



import john.ir.fbi.doc.Doc;

import john.ir.pi.Model;



public class ProbModel {

    public static void main(String[] args) {

        Model m = new Model();

        m.FEEDBACK_COUNT = 1;  /*設定 Feedback 的文件數*/



        /*添加文件*/

        m.addDoc("Shipment of gold damaged in a fire", false);

        m.addDoc("Delivery of silver arrived in a silver truck", false);

        m.addDoc("Shipment of gold arrived in a truck", false);



        String qs = "gold silver truck";

        int top = 3;



        /*進行查詢*/

        System.out.printf("\t[Info] Query String='%s'...\n", qs);

        List relvDocs = m.query(qs, top, 2); /*第三個參數說明 loop 兩次*/

        System.out.printf("\t[Info] Using Probabilistic Model :\n");

        for(Doc d:relvDocs)

        {

            System.out.printf("\t[Info] Retrieve : %s\n", d);

        }

        System.out.println();

    }

}

執行過程如下 :

[Info] Query String='gold silver truck'...
Loop 1/2...
[Info] Term('gold') score=-0.301030...
[Info] For Doc1 : Score=-0.301030

[Info] Term('silver') score=0.301030...
[Info] Term('truck') score=-0.301030...
[Info] For Doc2 : Score=0.000000

[Info] Term('gold') score=-0.301030...
[Info] Term('truck') score=-0.301030...
[Info] For Doc3 : Score=-0.602060

[Info] Feedback > 002{Delivery:1 truck:1 silver:2 arrived:1 }:Score=0.00... # 在 loop1 使用 Doc002 當作 feedback relevant doc
Loop 2/2...
[Info] Term('gold') score=-1.176091...
[Info] For Doc1 : Score=-1.176091

[Info] Term('silver') score=1.176091...
[Info] Term('truck') score=0.477121...
[Info] For Doc2 : Score=1.653213

[Info] Term('gold') score=-1.176091...
[Info] Term('truck') score=0.477121...
[Info] For Doc3 : Score=-0.698970

[Info] Feedback > 002{Delivery:1 truck:1 silver:2 arrived:1 }:Score=1.65...
[Info] Using Probabilistic Model :
[Info] Retrieve : 002{Delivery:1 truck:1 silver:2 arrived:1 }:Score=1.65
[Info] Retrieve : 003{truck:1 arrived:1 gold:1 Shipment:1 }:Score=-0.70
[Info] Retrieve : 001{damaged:1 gold:1 fire:1 Shipment:1 }:Score=-1.18

Supplement :
* 台大資工課程 > 資訊檢索與擷取 (陳信希教授)

程式扎記

標籤

2012年4月19日星期四

[ IR Class ] IR Model : Probabilistic Model

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2012年4月19日 星期四

[ IR Class ] IR Model : Probabilistic Model

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

2012年4月19日星期四