程式扎記: [ Java 文章收集 ] 局部敏感哈希 (Locality-Sensitive Hashing, LSH) 方法介紹

Source From Here
Preface
本文主要介紹一種用於海量高維數據的近似最近鄰快速查找技術——局部敏感哈希 (Locality-Sensitive Hashing, LSH)，內容包括了 LSH 的原理、LSH 哈希函數集、以及 LSH 的一些參考資料。

局部敏感哈希 LSH
在很多應用領域中，我們面對和需要處理的數據往往是海量並且具有很高的維度，怎樣快速地從海量的高維數據集合中找到與某個數據最相似（距離最近）的一個數據或多個數據成為了一個難點和問題。如果是低維的小數據集，我們通過線性查找（Linear Search）就可以容易解決，但如果是對一個海量的高維數據集採用線性查找匹配的話，會非常耗時，因此，為了解決該問題，我們需要採用一些類似索引的技術來加快查找過程，通常這類技術稱為最近鄰查找（Nearest Neighbor ,AN），例如Kd tree；或近似最近鄰查找（Approximate Nearest Neighbor, ANN），例如 Kd tree with BBF, Randomized Kd-trees, Hierarchical K-means Tree。而 LSH 是ANN中的一類方法。

我們知道，通過建立Hash Table的方式我們能夠得到 O(1) 的查找時間性能，其中關鍵在於選取一個 hash function，將原始數據映射到相對應的桶內（bucket, hash bin），例如對數據求模：h = x mod w，w通常為一個素數。在對數據集進行hash 的過程中，會發生不同的數據被映射到了同一個桶中（即發生了衝突collision），這一般通過再次哈希將數據映射到其他空桶內來解決。這是普通Hash方法或者叫傳統Hash方法，其與LSH有些不同之處。

局部敏感哈希示意圖（from: Here）

LSH的基本思想是：將原始數據空間中的兩個相鄰數據點通過相同的映射或投影變換（projection）後，這兩個數據點在新的數據空間中仍然相鄰的概率很大，而不相鄰的數據點被映射到同一個桶的概率很小。也就是說，如果我們對原始數據進行一些hash映射後，我們希望原先相鄰的兩個數據能夠被hash到相同的桶內，具有相同的桶號。對原始數據集合中所有的數據都進行hash映射後，我們就得到了一個hash table，這些原始數據集被分散到了hash table的桶內，每個桶會落入一些原始數據，屬於同一個桶內的數據就有很大可能是相鄰的，當然也存在不相鄰的數據被hash到了同一個桶內。因此，如果我們能夠找到這樣一些 hash functions，使得經過它們的哈希映射變換後，原始空間中相鄰的數據落入相同的桶內的話，那麼我們在該數據集合中進行近鄰查找就變得容易了，我們只需要將查詢數據進行哈希映射得到其桶號，然後取出該桶號對應桶內的所有數據，再進行線性匹配即可查找到與查詢數據相鄰的數據。換句話說，我們通過hash function映射變換操作，將原始數據集合分成了多個子集合，而每個子集合中的數據間是相鄰的且該子集合中的元素個數較小，因此將一個在超大集合內查找相鄰元素的問題轉化為了在一個很小的集合內查找相鄰元素的問題，顯然計算量下降了很多。

那具有怎樣特點的 hash functions 才能夠使得原本相鄰的兩個數據點經過 hash 變換後會落入相同的桶內？這些 hash function 需要滿足以下兩個條件：

1）如果d(x,y) ≤ d1，則 h(x) = h(y) 的概率至少為 p1；
2）如果d(x,y) ≥ d2，則 h(x) = h(y) 的概率至多為 p2；

其中 d(x,y) 表示 x 和 y 之間的距離，d1 < d2， h(x) 和 h(y) 分別表示對x和 y 進行 hash 變換。滿足以上兩個條件的 hash functions 稱為 (d1,d2,p1,p2)-sensitive。而通過一個或多個 (d1,d2,p1,p2)-sensitive 的 hash function 對原始數據集合進行 hashing 生成一個或多個 hash table 的過程稱為 Locality-sensitive Hashing。

使用 LSH 進行對海量數據建立索引（Hash table）並通過索引來進行近似最近鄰查找的過程如下：
1. 離線建立索引

（1）選取滿足 (d1,d2,p1,p2)-sensitive 的 LSH hash functions；
（2）根據對查找結果的準確率（即相鄰的數據被查找到的概率）確定 hash table 的個數 L，每個 table 內的 hash functions 的個數 K，以及跟 LSH hash function 自身有關的參數；
（3）將所有數據經過 LSH hash function 哈希到相應的桶內，構成了一個或多個hash table；

2. 在線查找

（1）將查詢數據經過LSH hash function哈希得到相應的桶號；
（2）將桶號中對應的數據取出；（為了保證查找速度，通常只需要取出前2L個數據即可）；
（3）計算查詢數據與這2L個數據之間的相似度或距離，返回最近鄰的數據；

LSH 在線查找時間由兩個部分組成：（1）通過 LSH hash functions 計算hash值（桶號）的時間；（2）將查詢數據與桶內的數據進行比較計算的時間。因此，LSH 的查找時間至少是一個 sublinear 時間。為什麼是“至少”？因為我們可以通過對桶內的屬於建立索引來加快匹配速度，這時第（2）部分的耗時就從 O(N) 變成了O(logN) 或 O(1)（取決於採用的索引方法）。 LSH 為我們提供了一種在海量的高維數據集中查找與查詢數據點（query data point）近似最相鄰的某個或某些數據點。需要注意的是，LSH 並不能保證一定能夠查找到與query data point最相鄰的數據，而是減少需要匹配的數據點個數的同時保證查找到最近鄰的數據點的概率很大。

二、LSH的應用
LSH 的應用場景很多，凡是需要進行大量數據之間的相似度（或距離）計算的地方都可以使用 LSH 來加快查找匹配速度，下面列舉一些應用：
（1）查找網絡上的重複網頁

互聯網上由於各式各樣的原因（例如轉載、抄襲等）會存在很多重複的網頁，因此為了提高搜索引擎的檢索質量或避免重複建立索引，需要查找出重複的網頁，以便進行一些處理。其大致的過程如下：將互聯網的文檔用一個集合或詞袋向量來表徵，然後通過一些hash運算來判斷兩篇文檔之間的相似度，常用的有minhash+LSH、simhash。

（2）查找相似新聞網頁或文章

與查找重複網頁類似，可以通過hash的方法來判斷兩篇新聞網頁或文章是否相似，只不過在表達新聞網頁或文章時利用了它們的特點來建立表徵該文檔的集合。

（3）圖像檢索

在圖像檢索領域，每張圖片可以由一個或多個特徵向量來表達，為了檢索出與查詢圖片相似的圖片集合，我們可以對圖片數據庫中的所有特徵向量建立LSH索引，然後通過查找LSH索引來加快檢索速度。目前圖像檢索技術在最近幾年得到了較大的發展，有興趣的讀者可以查看基於內容的圖像檢索引擎的相關介紹。

（4）音樂檢索

對於一段音樂或音頻信息，我們提取其音頻指紋（Audio Fingerprint）來表徵該音頻片段，採用音頻指紋的好處在於其能夠保持對音頻發生的一些改變的魯棒性，例如壓縮，不同的歌手錄製的同一條歌曲等。為了快速檢索到與查詢音頻或歌曲相似的歌曲，我們可以對數據庫中的所有歌曲的音頻指紋建立LSH索引，然後通過該索引來加快檢索速度。

（5）指紋匹配

一個手指指紋通常由一些細節來表徵，通過對比較兩個手指指紋的細節的相似度就可以確定兩個指紋是否相同或相似。類似於圖片和音樂檢索，我們可以對這些細節特徵建立 LSH 索引，加快指紋的匹配速度。

三、LSH family
我們在第一節介紹了LSH的原理和LSH hash function需要滿足的條件，回顧一下：滿足以下兩個條件的 hash functions 稱為 (d1,d2,p1,p2)-sensitive：

1）如果d(x,y) ≤ d1，則h(x) = h(y)的概率至少為p1；
2）如果d(x,y) ≥ d2，則h(x) = h(y)的概率至多為p2；

d(x,y) 是 x 和 y 之間的一個距離度量（distance measure），需要說明的是，並不是所有的距離度量都能夠找到滿足 locality-sensitive 的 hash functions。下面我們介紹一些滿足不同距離度量方式下的 locality-sensitive 的 hash functions：
1. Jaccard distance

Jaccard distance： (1 - Jaccard similarity)，而Jaccard similarity = (A intersection B) / (A union B)，Jaccard similarity通常用來判斷兩個集合的相似性。 Jaccard distance 對應的 LSH hash function為：minhash，其是(d1,d2,1-d1,1-d2)-sensitive的。

2. Hamming distance

Hamming distance：兩個具有相同長度的向量中對應位置處值不同的次數。Hamming distance對應的 LSH hash function為：H(V) =向量 V 的第 i 位上的值，其是(d1,d2,1-d1/d,1-d2/d)-sensitive 的。

3. Cosine distance

Cosine distance：cos(theta) = A · B / |A||B| ，常用來判斷兩個向量之間的夾角，夾角越小，表示它們越相似。Cosine distance 對應的 LSH hash function為：H(V) = sign(V · R)，R 是一個隨機向量。V · R可以看做是將V向R上進行投影操作。其是 (d1,d2,(180-d1)180,(180-d2)/180)-sensitive 的。
理解：利用隨機的超平面（random hyperplane）將原始數據空間進行劃分，每一個數據被投影后會落入超平面的某一側，經過多個隨機的超平面劃分後，原始空間被劃分為了很多cell，而位於每個 cell 內的數據被認為具有很大可能是相鄰的（即原始數據之間的cosine distance很小）。

4. normal Euclidean distance

Euclidean distance 是衡量D維空間中兩個點之間的距離的一種距離度量方式。

Euclidean distance 對應的 LSH hash function 為：H(V) = |V·R + b| / a，R是一個隨機向量，a 是桶寬，b 是一個在 [0,a] 之間均勻分佈的隨機變量。V·R 可以看做是將 V 向 R 上進行投影操作。其是(a/2,2a,1/2,1/3)-sensitive的。

理解：將原始數據空間中的數據投影到一條隨機的直線（random line）上，並且該直線由很多長度等於a的線段組成，每一個數據被投影后會落入該直線上的某一個線段上（對應的桶內），將所有數據都投影到直線上後，位於同一個線段內的數據將被認為具有很大可能是相鄰的（即原始數據之間的 Euclidean distance 很小）。

四、增強LSH（Amplifying LSH）
通過 LSH hash functions 我們能夠得到一個或多個 hash table，每個桶內的數據之間是近鄰的可能性很大。我們希望原本相鄰的數據經過 LSH hash 後，都能夠落入到相同的桶內，而不相鄰的數據經過 LSH hash 後，都能夠落入到不同的桶中。如果相鄰的數據被投影到了不同的桶內，我們稱為 false negtive；如果不相鄰的數據被投影到了相同的桶內，我們稱為 false positive。因此，我們在使用 LSH 中，我們希望能夠盡量降低 false negtive rate 和 false positive rate。

通常，為了能夠增強 LSH，即使得 false negtive rate 和/或 false positive rate 降低，我們有兩個途徑來實現：1）在一個hash table內使用更多的 LSH hash functio n；2）建立多個hash table。下面介紹一些常用的增強LSH的方法：

1. 使用多個獨立的 hash table
每個 hash table 由 k 個LSH hash function創建，每次選用 k 個LSH hash function（同屬於一個LSH function family）就得到了一個 hash table，重複多次，即可創建多個 hash table。多個 hash table 的好處在於能夠降低 false positive rate。

2. AND 與操作
從同一個 LSH function family 中挑選出 k 個 LSH function，H(X) = H(Y) 有且僅當這 k 個 Hi(X) = Hi(Y) 都滿足。也就是說只有當兩個數據的這 k 個 hash 值都對應相同時，才會被投影到相同的桶內，只要有一個不滿足就不會被投影到同一個桶內。

AND與操作能夠使得找到近鄰數據的 p1 概率保持高概率的同時降低 p2 概率，即降低了false negtive rate。

3. OR 或操作
從同一個 LSH function family 中挑選出 k 個 LSH function，H(X) = H(Y) 有且僅當存在一個以上的 Hi(X) = Hi(Y)。也就是說只要兩個數據的這 k 個 hash 值中有一對以上相同時，就會被投影到相同的桶內，只有當這 k 個 hash 值都不相同時才不被投影到同一個桶內。

OR或操作能夠使得找到近鄰數據的p1概率變的更大（越接近1）的同時保持p2概率較小，即降低了 false positive rate。

4. AND 和 OR 的級聯
將與操作和或操作級聯在一起，產生更多的 hahs table，這樣的好處在於能夠使得 p1 更接近 1，而 p2 更接近0。

除了上面介紹的增強 LSH 的方法外，有時候我們希望將多個 LSH hash function 得到的 hash 值組合起來，在此基礎上得到新的hash值，這樣做的好處在於減少了存儲 hash table 的空間。下面介紹一些常用方法：
1. 求模運算
new hash value = old hash value % N

2. 隨機投影
假設通過k個LSH hash function得到了k個hash值：h1, h2..., hk。那麼新的hash值採用如下公式求得：
new hash value = h1*r1 + h2*r2 + ... + hk*rk，其中r1, r2, ..., rk是一些隨機數。

3. XOR異或
假設通過k個LSH hash function得到了k個hash值：h1, h2..., hk。那麼新的hash值採用如下公式求得：
new hash value = h1 XOR h2 XOR h3 ... XOR hk

Java - LSH
A Java implementation of Locality Sensitive Hashing (LSH). This library implements Locality Sensitive Hashing (LSH), as described in Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", Cambridge University Press. Are currently implemented:

* MinHash algorithm for Jaccard index;
* Super-Bit algorithm for cosine similarity.

MinHash
MinHash is a hashing scheme that tents to produce similar signatures for sets that have a high Jaccard similarity. The Jaccard similarity between two sets is the relative number of elements these sets have in common: J(A, B) = |A ∩ B| / |A ∪ B|. A MinHash signature is a sequence of numbers produced by multiple hash functions hi. It can be shown that the Jaccard similarity between two sets is also the probability that this hash result is the same for the two sets: J(A, B) = Pr[hi(A) = hi(B)]. Therefore, MinHash signatures can be used to estimate Jaccard similarity between two sets. Moreover, it can be shown that the expected estimation error is O(1 / sqrt(n)), where n is the size of the signature (the number of hash functions that are used to produce the signature).

Binning
下面範例代碼示範該套件的使用, 流程如下:

* 建立 count=10,000 筆測試的 vector, 每個 vector 長度為 n = 50.
* 設定 MinHash 的 stages = 2; buckets = 500.
* 開始進行 Hash 並將結果存放在 sgMap (為兩層的 Map, 第一層的 key 為 stage; 第二層的 key 為 bucket 號碼, value 為 vector 的位置) 與 hrMap 中, 提供後續處理.
* 顯示 buckets 的存放狀況.
* 隨機顯示一個 bucket 的內容 (每個 bucket 存放 vector 位置的集合)
* 隨機產生一個 vector, 透過剛剛建立的 Hash 集合找出最相近的 vector.

- SimpleLSHMinHashExample.java

view plaincopy to clipboardprint?
package jlsh;  
  
import info.debatty.java.lsh.LSHMinHash;  
import jlsh.SimpleLSHMinHashEx2.DefaultHRMap;  
import jlsh.SimpleLSHMinHashEx2.DefaultSGMap;  
  
import java.util.ArrayList;  
import java.util.HashMap;  
import java.util.List;  
import java.util.Map;  
import java.util.Random;  
  
public class SimpleLSHMinHashExample {  
    public static class DefaultHRMap extends HashMap {  
        protected V defaultValue;  
  
        public DefaultHRMap() {}  
  
        @Override  
        public V get(Object k) {  
            if(containsKey(k)) return super.get(k);  
            else  
            {  
                List v = new ArrayList();  
                super.put((K)k, (V)v);  
                return (V)v;  
            }             
        }  
    }  
      
    public static class DefaultSGMap extends HashMap {  
        protected V defaultValue;  
  
        public DefaultSGMap() {}  
  
        @Override  
        public V get(Object k) {  
            if(containsKey(k)) return super.get(k);  
            else  
            {  
                Map> v = new DefaultHRMap>();  
                super.put((K)k, (V)v);  
                return (V)v;  
            }             
        }  
    }  
      
    public static void main(String args[])  
    {  
        // proportion of 0's in the vectors  
        // if the vectors are dense (lots of 1's), the average jaccard similarity  
        // will be very high (especially for large vectors), and LSH  
        // won't be able to distinguish them  
        // as a result, all vectors will be binned in the same bucket...  
        double sparsity = 0.75;  
          
        // Number of sets  
        int count = 10000;  
          
        // Size of vectors  
        int n = 50;  
          
        // LSH parameters  
        // the number of stages is also sometimes called thge number of bands  
        int stages = 2;  
          
        // Attention: to get relevant results, the number of elements per bucket  
        // should be at least 100  
        int buckets = 500;  
          
        // Let's generate some random sets  
        boolean[][] vectors = new boolean[count][n];  
        Random rand = new Random();  
          
        for (int i = 0; i < count; i++) {  
            for (int j = 0; j < n; j++) {  
                vectors[i][j] = rand.nextDouble() > sparsity;  
            }  
        }  
          
        // Create and configure LSH algorithm  
        LSHMinHash lsh = new LSHMinHash(stages, buckets, n);  
          
        int[][] counts = new int[stages][buckets];  
        for (int i = 0; i < stages; i++) {  
            for(int j=0; j0;  
        }  
          
        // Perform hashing  
        int vi = 0; // Vector index  
        DefaultHRMap> hrMap = new DefaultHRMap>();  
        Map>> sgMap = new DefaultSGMap>>();  
        for (boolean[] vector : vectors) {            
            int[] hash = lsh.hash(vector);  
            //hrMap.get(String.valueOf(hash[0])).add(vi);  
            hrMap.get(arr2str(hash)).add(vi);  
            for (int i = 0; i < hash.length; i++) {  
                counts[i][hash[i]]++;  
                sgMap.get(i).get(hash[i]).add(vi);  
            }  
              
            print(vector);  
            System.out.printf(" : %s", arr2str(hash));              
            System.out.print("\n");  
            vi++;  
        }  
          
        System.out.println("Number of elements per bucket at each stage:");  
        for (int i = 0; i < stages; i++) {  
            print(counts[i]);  
            System.out.print("\n");  
        }  
        System.out.println();  
          
        String bis = "(0 0)";  
        int s=0, b=0;  
          
        while(hrMap.get(bis).size()==0)  
        {  
            b++;  
            if(b==buckets) {s++; b = 0;}  
            bis = String.format("(%d %d)", s, b);             
        }  
          
        // Randomly show one bucket  
        System.out.printf("Show records in bin%s (%d):\n", bis, hrMap.get(bis).size());  
        for(int v:hrMap.get(bis))  
        {  
            print(vectors[v]); System.out.println();  
        }  
        b++;  
        if(b==buckets) {s++; b = 0;}  
        bis = String.format("(%d %d)", s, b);  
        System.out.println();  
          
        boolean tvec[] = new boolean[n];  // Target vector  
        for (int j = 0; j < n; j++) {  
            tvec[j] = rand.nextDouble() > sparsity;  
        }  
        System.out.printf("Target vector:\n"); print(tvec); System.out.println();  
        int[] hash = lsh.hash(tvec);  
        System.out.printf("Hash: %s\n", arr2str(hash));  
        double cs = Double.MIN_VALUE; int tvi=0; int cc=0;  
        for(int i=0; i
        {  
            System.out.printf("Look for bucket=%d at stage=%d (%d)...\n", hash[i], i, sgMap.get(i).get(hash[i]).size());  
            for(int j:sgMap.get(i).get(hash[i]))  
            {  
                double ccs = CosineSim(tvec, vectors[j]);  
                if(ccs>cs)  
                {  
                    cs = ccs; tvi = j;  
                    print(vectors[j]); System.out.printf(": %.02f\n", ccs);  
                }  
                cc++;  
            }  
        }  
        System.out.printf("Search count=%,d!\n", cc);  
    }  
      
    static double CosineSim(boolean[] a, boolean[] b)  
    {  
        double pdv = 0.0;  
        double av = 0.0, bv = 0.0;  
        for(int i=0; i
        {  
            if(a[i] && b[i]) pdv++;  
            if(a[i]) av++; if(b[i]) bv++;  
        }  
        if(pdv==0) return 0.0;  
        else  
        {             
            return pdv/(Math.sqrt(av)*Math.sqrt(bv));  
        }  
    }  
      
    static void print(int[] array) {  
        System.out.print("[ ");  
        for (int v : array) {  
            System.out.print("" + v + " ");  
        }  
        System.out.print("]");  
    }  
      
    static String arr2str(int[] arr)  
    {  
        StringBuffer strBuf = new StringBuffer();  
        for(int v:arr) strBuf.append(String.format("%d ", v));  
        return String.format("(%s)", strBuf.toString().trim());  
    }  
      
    static void print(boolean[] array) {  
        System.out.print("[");  
        for (boolean v : array) {  
            System.out.print(v ? "1" : "0");  
        }  
        System.out.print("]");  
    }  
}  

一個可能的執行結果如下:

...
[10100010000111001001000000111010011110001010000000] : (248 0)
[00110010000000000110001000000000001000010000101010] : (370 248)
[00100001000001100001011001100000100000100000001110] : (437 370) // Hash 表的建立過程
Number of elements per bucket at each stage:
[ ... ] // Hash 表中 Buckets 的存放狀況

Show records in bin(47 429) (1):
[00000000100000000010000010110100000001000000000100] // 顯示 (47 429) Bucket 的內容. 因為 stage=2, 所以 Bucket 是二維的.

Target vector:
[00100001000010010010000001001101011001000100000101] // 產生測試 vector
Hash: (311 248) // 測試 vector 計算出來的 hash, 說明我們可以去 stage=1 的第 311 號 bucket 與 stage=2 的 248 號 bucket 去找相似的 vector.
Look for bucket=311 at stage=0 (1243)...
[10000001000000001100101010111011000100000000000000]: 0.21
[10101000010010101011110011000000000010011110001001]: 0.35
[00000011001000001010000000000101001010000100100000]: 0.47
[10000001000011000010000011001101100011001010010101]: 0.61
[00000001000010010100101100001001010001000100100001]: 0.62
[00000001011010011000100001001101001001100100001000]: 0.65
Look for bucket=248 at stage=1 (1581)...
[01100001000010000000000001001000010001000101001101]: 0.72
Search count=2,824! // 經過 2,824 次的比對後, 找出最相近的 vector 與 target vector 的 Cosine similarity 為 0.72

透過 LSH 你不需要比對過 10,000 筆的 vectors, 就可以找出最相近的 vector.

Supplement
* LSH.9 Locality-sensitive hashing: how it works
* LSH.10 False positive and negative errors of LSH
* LSH.11 Hash-code length and number of hashtables
* How to understand Locality Sensitive Hashing?
* Java - LSH - A Java implementation of Locality Sensitive Hashing (LSH).

程式扎記

標籤

2017年10月13日星期五

[ Java 文章收集 ] 局部敏感哈希 (Locality-Sensitive Hashing, LSH) 方法介紹

3 則留言:

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2017年10月13日 星期五