程式扎記

Preface :
在 Frequency-Based Indexing Method 使用了 tf (term frequency) 來當作資訊檢索時的 weighting, 這樣用沒有什麼不好, 只是會有盲點. 試想如果某些文件字特別多, 自然包含查詢的字元的機率會比一般小文件大的許多. 但是同樣次數的 terms 在小文件與大文件的權重應該是一樣的嗎? 再者某些 terms 在許多文件都有出現與某些 terms 只有在少數文件出現的權重應該也是不同的! (那些 terms 只在少數文件出現的, 相對於在多數文件都有出現的 terms 較有區別度, 故預期應該有比較大的權重.)

Document Frequency :
為了解決上面說的問題, 這邊引入新的符號 df (document frequency) 說明某個 term 在整個文件集有多少文件包含該 term (即使某文件含有多個該 term, 也只算一次.). 底下是對該符號的討論 :

Inverse Document Frequency :
為了將剛剛的說明, 有較少文件包含的 term 應該有較高的權重, 引入了 idf :

接著如果考慮 term 在文件中出現的頻率加上該 term 的權重, 便形成下面公式 :

這個公式帶來以下好處 :

* Eliminating common function words
* Computing the value of wij for each term Tj in document Di
* Assigning to the documents of a collection all terms with sufficiently high (tf × idf) factors

tf-idf weighting :
另一個常見的符號 tf-idf 的 weighting 公式如下 :

底下範例我們將使用此 weighting 來進行檢索的 Ranking 計算.

Example :
這邊的 Score 計算, 是將查詢中的每個 term 在文件中的 tf-idf 加起來, 並以此 Score 進行 Ranking (需要引入套件 IRToolkit.jar):

- DfidfModel.java :

view plain copy to clipboard print ?

package ir.prac;



import java.util.List;



import john.ir.fbi.Model2;

import john.ir.fbi.doc.Doc;



public class DfidfModel {

    public static void main(String args[])

    {

        Model2 m = new Model2();

        m.addDoc("A B C A A B F");

        m.addDoc("C C C D E G H I");

        m.addDoc("Z Z A C E J A");

        m.addDoc("E E C A D D Q P");

        m.addDoc("V A Z X");

        m.addDoc("S P O I A");

        m.addDoc("T K K M");

        String qs = "A B X";

        int top = 5;



        List relvDocs = m.query(qs, top);

        System.out.printf("\t[Info] Query String='%s'...\n", qs);

        System.out.printf("\t[Info] Using Model2 (tf-idf):\n");

        for(Doc d:relvDocs)

        {

            System.out.printf("\t[Info] Retrieve : %s\n", d);

        }

        System.out.println();

    }

}

運行訊息如下 :

[Info][Doc001] Term(A) : tf-idf=0.098889...
[Info][Doc001] Term(B) : tf-idf=0.707849...
[Info][Doc001] Total score=0.81!
[Info][Doc002] Total score=0.00!
...
[Info] Query String='A B X'...
[Info] Using Model2 (tf-idf):
[Info] Retrieve : 001{F:1 A:3 B:2 C:1 }:Score=0.81
[Info] Retrieve : 005{V:1 A:1 X:1 Z:1 }:Score=0.61
[Info] Retrieve : 003{E:1 A:2 C:1 J:1 Z:2 }:Score=0.09
[Info] Retrieve : 004{D:2 E:2 Q:1 P:1 A:1 C:1 }:Score=0.07
[Info] Retrieve : 006{A:1 P:1 S:1 O:1 I:1 }:Score=0.07

這邊值得注意的是 Doc005(一個 A 與一個 X) 雖然與 Doc003(兩個 A) 同樣都包含兩個查詢的 term, 但是因為 "X" 字元在文件集有較少的文件包含, 故有較高的 idf 值. 所以 Doc005 的 Ranking 比 Doc003 高!

Supplement :
* 台大資工課程 > 資訊檢索與擷取

程式扎記

標籤

2012年4月18日星期三

[ IR Class ] Overall View on IR : tf-idf weighting

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2012年4月18日 星期三