## Thursday, April 19, 2012

### [ IR Class ] IR Model : Probabilistic Model

Preface :

- Boolean Model
The Boolean model of information retrieval is a classical information retrieval (IR) model and, at the same time, the first and most widely adopted one. It is used by virtually all commercial IR systems today.

- Vector Model
The vector space model (or term vector model) is an algebraic model for representing text documents (and objects in general) as vectors of identifiers such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy ranking. Its first use was in the SMART Information Retrieval System.

- Probabilistic Model
Given a query, there is an ideal answer set: a set of documents that contains exactly the relevant documents and no others. The querying process can then be seen as a process of specifying the properties of this ideal answer set.

Probabilistic Principle :

Similarity :
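
A sketch of the ranking function in the usual binary independence form, assuming binary term weights $w_{i,q}, w_{i,j} \in \{0, 1\}$ and a base-10 logarithm (both assumptions are consistent with the per-term scores printed in the run log further down). Here $R$ is the set of relevant documents, $\bar{R}$ its complement, and $k_i$ an index term:

```latex
sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j}
  \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
       + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
```

Only index terms that occur in both the query and the document contribute to the sum, and each such term contributes once regardless of its frequency in the document.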

- Initial guess
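
Before any documents have been retrieved, the two probabilities have to be guessed. A sketch of the usual starting estimates, where $n_i$ is the number of documents containing $k_i$ and $N$ is the total number of documents (this choice is inferred from the Loop 1 scores in the run log further down, which it reproduces):

```latex
P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}
```

For example, a term with $n_i = 2$ and $N = 3$ scores $\log_{10}\frac{0.5}{0.5} + \log_{10}\frac{1/3}{2/3} = \log_{10} 0.5 \approx -0.301030$.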

- Initial ranking

* V
a subset of the documents initially retrieved and ranked by the probabilistic model (top r documents)

* Vi
subset of V composed of documents which contain the index term ki

- Feedback ranking
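
With $V$ and $V_i$ also denoting the sizes of the sets defined above, the usual unsmoothed re-estimates after an initial ranking can be sketched as follows (the standard textbook form, assumed here rather than taken from the text):

```latex
P(k_i \mid R) = \frac{V_i}{V}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}
```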

Smoothing :
Smoothing is a large topic; in general it is used to deal with sparse matrices. In the feedback formulas here, V may be 0, which would produce infinite values. To avoid this problem, smoothing is applied to the estimates.
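
A sketch of one common smoothing, which adds 0.5 to each count and 1 to each total. This particular variant is inferred from the Loop 2 scores in the run log further down (it reproduces them) and is not stated in the text:

```latex
P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad
P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}
```

For example, with $N = 3$, $V = 1$, $n_i = 2$ and $V_i = 0$, this gives $P(k_i \mid R) = 0.25$ and $P(k_i \mid \bar{R}) = 2.5/3$, so the term scores $\log_{10}\frac{1}{3} + \log_{10} 0.2 \approx -1.176091$.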

Example :

Query: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

- Initial Guess
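
The initial-guess step for this example can be sketched in Java as below. This is a minimal sketch under assumptions, not the `john.ir.pi.Model` implementation: `InitialGuessDemo` and its methods are hypothetical names, terms are split on whitespace and compared case-insensitively, each matching term counts once per document, and the estimates are P(ki|R) = 0.5 and P(ki|~R) = ni/N with a base-10 logarithm.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class InitialGuessDemo {
    /** Score of one matching term with P(ki|R) = 0.5 and P(ki|~R) = ni/N (assumed initial guess). */
    static double termScore(int ni, int n) {
        double p = 0.5;
        double q = (double) ni / n;   // note: ni == n gives q = 1 and a score of -Infinity
        return Math.log10(p / (1 - p)) + Math.log10((1 - q) / q);
    }

    /** Distinct lower-cased terms of a text, split on whitespace. */
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    /** Scores every document against the query; only terms present in both contribute. */
    static double[] rank(String query, String[] docs) {
        int n = docs.length;
        double[] scores = new double[n];
        for (String k : terms(query)) {
            int ni = 0;                       // number of documents containing k
            for (String d : docs) {
                if (terms(d).contains(k)) ni++;
            }
            for (int j = 0; j < n; j++) {
                if (terms(docs[j]).contains(k)) scores[j] += termScore(ni, n);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
            "Shipment of gold arrived in a truck"
        };
        double[] s = rank("gold silver truck", docs);
        for (int j = 0; j < s.length; j++) {
            System.out.printf("Doc%d : Score=%.6f%n", j + 1, s[j]);
        }
    }
}
```

Under these assumptions the scores come out at about -0.301 for D1, 0 for D2 and -0.602 for D3, so D2 ranks first.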

- Feedback search
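
The feedback round for this example can be sketched in Java as below, taking D2 (the top document of the initial ranking) as the feedback set V. Again this is a sketch under assumptions rather than the `john.ir.pi.Model` implementation: `FeedbackRankDemo` and its methods are hypothetical names, and the smoothed estimates P(ki|R) = (Vi + 0.5)/(V + 1) and P(ki|~R) = (ni - Vi + 0.5)/(N - V + 1) with a base-10 logarithm are assumed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FeedbackRankDemo {
    /** Smoothed score of one matching term: vi of the v feedback docs and ni of all n docs contain it. */
    static double termScore(int vi, int v, int ni, int n) {
        double p = (vi + 0.5) / (v + 1.0);            // assumed estimate of P(ki|R)
        double q = (ni - vi + 0.5) / (n - v + 1.0);   // assumed estimate of P(ki|~R)
        return Math.log10(p / (1 - p)) + Math.log10((1 - q) / q);
    }

    /** Distinct lower-cased terms of a text, split on whitespace. */
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    /** Scores every document, treating the documents at the given indices as the feedback set V. */
    static double[] rank(String query, String[] docs, int[] feedback) {
        int n = docs.length;
        double[] scores = new double[n];
        for (String k : terms(query)) {
            int ni = 0;                               // documents containing k
            for (String d : docs) {
                if (terms(d).contains(k)) ni++;
            }
            int vi = 0;                               // feedback documents containing k
            for (int f : feedback) {
                if (terms(docs[f]).contains(k)) vi++;
            }
            for (int j = 0; j < n; j++) {
                if (terms(docs[j]).contains(k)) {
                    scores[j] += termScore(vi, feedback.length, ni, n);
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
            "Shipment of gold arrived in a truck"
        };
        double[] s = rank("gold silver truck", docs, new int[] {1});  // index 1 is D2
        for (int j = 0; j < s.length; j++) {
            System.out.printf("Doc%d : Score=%.6f%n", j + 1, s[j]);
        }
    }
}
```

Under these assumptions the scores come out at about -1.176 for D1, 1.653 for D2 and -0.699 for D3, so the order D2, D3, D1 matches the final ranking in the log.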

- ProbModel.java :
```java
package ir.prac;

import java.util.List;

import john.ir.fbi.doc.Doc;
import john.ir.pi.Model;

public class ProbModel {
    public static void main(String[] args) {
        Model m = new Model();
        m.FEEDBACK_COUNT = 1;  /* number of documents used as feedback */

        /* Add the documents */
        m.addDoc("Shipment of gold damaged in a fire", false);
        m.addDoc("Delivery of silver arrived in a silver truck", false);
        m.addDoc("Shipment of gold arrived in a truck", false);

        String qs = "gold silver truck";
        int top = 3;

        /* Run the query */
        System.out.printf("\t[Info] Query String='%s'...\n", qs);
        List<Doc> relvDocs = m.query(qs, top, 2); /* the third argument asks for two loops */
        System.out.printf("\t[Info] Using Probabilistic Model :\n");
        for (Doc d : relvDocs) {
            System.out.printf("\t[Info] Retrieve : %s\n", d);
        }
        System.out.println();
    }
}
```

[Info] Query String='gold silver truck'...
Loop 1/2...
[Info] Term('gold') score=-0.301030...
[Info] For Doc1 : Score=-0.301030

[Info] Term('silver') score=0.301030...
[Info] Term('truck') score=-0.301030...
[Info] For Doc2 : Score=0.000000

[Info] Term('gold') score=-0.301030...
[Info] Term('truck') score=-0.301030...
[Info] For Doc3 : Score=-0.602060

[Info] Feedback > 002{Delivery:1 truck:1 silver:2 arrived:1 }:Score=0.00...
# Loop 1 uses Doc002 as the feedback relevant doc
Loop 2/2...
[Info] Term('gold') score=-1.176091...
[Info] For Doc1 : Score=-1.176091

[Info] Term('silver') score=1.176091...
[Info] Term('truck') score=0.477121...
[Info] For Doc2 : Score=1.653213

[Info] Term('gold') score=-1.176091...
[Info] Term('truck') score=0.477121...
[Info] For Doc3 : Score=-0.698970

[Info] Feedback > 002{Delivery:1 truck:1 silver:2 arrived:1 }:Score=1.65...
[Info] Using Probabilistic Model :
[Info] Retrieve : 002{Delivery:1 truck:1 silver:2 arrived:1 }:Score=1.65
[Info] Retrieve : 003{truck:1 arrived:1 gold:1 Shipment:1 }:Score=-0.70
[Info] Retrieve : 001{damaged:1 gold:1 fire:1 Shipment:1 }:Score=-1.18

