程式扎記

Source From Here
Preface
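Search is a feature that gets requested once software reaches a certain stage, and sometimes right at the start of a project. There are many ways to implement it. Besides the LIKE operator of traditional relational databases, I have worked with Microsoft Index Service and HP IDOL, and today I will introduce Elasticsearch.

What is Elasticsearch?
Elasticsearch (ES for short) is an open source search project built on Apache Lucene. Shay Banon started the project in February 2010, and it is released under the Apache 2.0 license. Because it is distributed and real-time by design, many people use it as a database, and many others use it to store logs. Its release caused quite a stir in the Lucene and Solr communities, and SolrCloud in Solr 4.0+ also absorbed many of Elasticsearch's features.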

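Basic terminology

* Node: Elasticsearch can run on a single host or as a cluster, and either way it presents itself as one service that provides search. Each host with the Elasticsearch service installed is called a Node.
* Cluster: As you would expect, a group of hosts running the Elasticsearch service is called a Cluster.
* Field: The smallest unit of data that Elasticsearch stores, similar to a Column in a relational database.
* Document: A set of Fields makes up a Document, similar to a Row in a relational database. Every Document has a unique ID that identifies it.
* Type: Every Document must belong to a Type, similar to a Table in a relational database.
* Index: Every Type must belong to an Index, similar to a Database in a relational database.

Unlike a relational database, Elasticsearch does not fix the number or the names of the Fields in a Document. Documents stored under the same Type may each have a different number of Fields with different names.

* Shard: This is the basis of Elasticsearch's distributed search. A complete Index is split into several parts that are stored on the same Node or on different Nodes, and each of these parts is called a Shard.
* Replica: Much the same idea as replication, a Replica is a copy of a Shard. The total number of Shard copies in an Index is therefore Shards × (1 + Replicas); for example, 3 Shards with 1 Replica gives 3 × (1 + 1) = 6.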
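
The Shard and Replica arithmetic above can be written down as a tiny Java sketch. The class and method names (ShardCount, totalShards) are hypothetical helpers made up for illustration; they are not part of any Elasticsearch API.

```java
// Hypothetical helper, not an Elasticsearch API: counts the shard copies of one index.
final class ShardCount {
    /** Total shard copies = primary shards * (1 + replicas per primary). */
    static int totalShards(int primaryShards, int replicasPerPrimary) {
        if (primaryShards < 1 || replicasPerPrimary < 0) {
            throw new IllegalArgumentException("need at least 1 primary and 0 or more replicas");
        }
        return primaryShards * (1 + replicasPerPrimary);
    }

    public static void main(String[] args) {
        // 3 primaries with 1 replica each: 3 primary copies + 3 replica copies
        System.out.println(totalShards(3, 1)); // prints 6
        // 5 primaries with 1 replica each
        System.out.println(totalShards(5, 1)); // prints 10
    }
}
```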

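Distributed characteristics - no duplicate copies on a Node
I drew a diagram. Suppose an Elasticsearch Cluster has 3 Nodes and I split the Index into 3 Shards with 1 Replica. It would look roughly like the diagram below.

The diagram above shows one property. In my example, no Node ever holds two Shards with the same number, regardless of whether a Shard is a master (Elasticsearch's own term is primary) or a replica. What if there are too many Shard copies and too few Nodes? Elasticsearch simply does not allocate the extra copies to any Node, but every master Shard is still guaranteed to be allocated once.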
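
The "no duplicate shard number on a Node" rule can be illustrated with a toy allocation in Java. This is only a sketch under simplifying assumptions: it places copies round-robin and is not how Elasticsearch's real allocator decides. The class name and the labels "P0" and "R0" (primary and replica of shard 0) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: Elasticsearch's real allocator weighs many more factors.
final class ShardAllocationSketch {
    /**
     * Returns one list per node holding labels such as "P0" (primary of shard 0)
     * or "R0" (a replica of shard 0). Copies that cannot be placed without
     * putting the same shard number twice on a node go into `unassigned`.
     */
    static List<List<String>> allocate(int nodes, int primaries, int replicas,
                                       List<String> unassigned) {
        List<List<String>> layout = new ArrayList<List<String>>();
        for (int n = 0; n < nodes; n++) {
            layout.add(new ArrayList<String>());
        }
        for (int shard = 0; shard < primaries; shard++) {
            for (int copy = 0; copy <= replicas; copy++) {
                String label = (copy == 0 ? "P" : "R") + shard;
                if (copy < nodes) {
                    // Offsetting by the copy index keeps copies of one shard on distinct nodes.
                    layout.get((shard + copy) % nodes).add(label);
                } else {
                    // More copies than nodes: leave the extra copy unallocated.
                    unassigned.add(label);
                }
            }
        }
        return layout;
    }

    public static void main(String[] args) {
        List<String> unassigned = new ArrayList<String>();
        List<List<String>> layout = allocate(3, 3, 1, unassigned);
        for (int n = 0; n < layout.size(); n++) {
            System.out.println("Node" + (n + 1) + ": " + layout.get(n));
        }
        System.out.println("Unassigned: " + unassigned);
    }
}
```

For 3 Nodes, 3 Shards and 1 Replica this model yields Node1: [P0, R2], Node2: [R0, P1], Node3: [R1, P2] with nothing unassigned. With a single Node, the primaries are placed and every replica ends up in the unassigned list.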

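Distributed characteristics - automatic data recovery
Suppose that one day Node2 in the Cluster goes down. The Cluster then starts reallocating Shards, and the Shards that were lost are rebuilt from the copies on the other Nodes so that the data stays complete. After reallocation it looks roughly like this. Convenient, isn't it?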
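
The recovery idea can also be sketched as a toy simulation in Java. It assumes the same invented "P0"/"R0" labels for primary and replica copies, promotes a surviving replica when a primary is lost, and re-creates the missing replica on the least-loaded surviving Node that does not already hold that shard number. The names are hypothetical and the real Elasticsearch recovery logic is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of shard recovery after one node fails; not Elasticsearch's real algorithm.
final class ShardRecoverySketch {
    /** Returns the layout of the surviving nodes after node index `failed` is lost. */
    static List<List<String>> recover(List<List<String>> layout, int failed) {
        List<List<String>> survivors = new ArrayList<List<String>>();
        for (int n = 0; n < layout.size(); n++) {
            if (n != failed) {
                survivors.add(new ArrayList<String>(layout.get(n)));
            }
        }
        for (String lost : layout.get(failed)) {
            String shard = lost.substring(1);
            if (lost.startsWith("P") && !promoteReplica(survivors, shard)) {
                continue; // no surviving copy: in this model the shard's data is gone
            }
            // Re-create the missing replica where this shard number is not yet present.
            List<String> target = null;
            for (List<String> node : survivors) {
                boolean holdsShard = node.contains("P" + shard) || node.contains("R" + shard);
                if (!holdsShard && (target == null || node.size() < target.size())) {
                    target = node;
                }
            }
            if (target != null) {
                target.add("R" + shard);
            }
        }
        return survivors;
    }

    /** Turns the first surviving replica of `shard` into its primary, if one exists. */
    private static boolean promoteReplica(List<List<String>> nodes, String shard) {
        for (List<String> node : nodes) {
            int i = node.indexOf("R" + shard);
            if (i >= 0) {
                node.set(i, "P" + shard);
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> before = new ArrayList<List<String>>();
        before.add(Arrays.asList("P0", "R2")); // Node1
        before.add(Arrays.asList("R0", "P1")); // Node2, the one that fails
        before.add(Arrays.asList("R1", "P2")); // Node3
        System.out.println(recover(before, 1));
    }
}
```

In this run both surviving Nodes end up holding one copy of every shard number: [[P0, R2, R1], [P1, P2, R0]].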

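Installing the Elasticsearch distributed search system on CentOS 7 (link)
Now let's install it on CentOS. The whole process and the commands used are recorded below. The prerequisite is of course a working CentOS installation. I am using CentOS 7, and a minimal install is enough.

Install the tools used

* net-tools: yum -y install net-tools. This package includes ifconfig, which shows network interface information, and netstat, which is used later to check the listening port.
* wget: yum -y install wget. Used to download files from the network.

Install JDK 1.8.0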

# yum -y install java-1.8.0-openjdk.x86_64 // Install JDK 1.8.0
# rpm -qa | grep java-1.8.0-openjdk // Query the installed JDK packages
java-1.8.0-openjdk-headless-1.8.0.131-3.b12.el7_3.x86_64
java-1.8.0-openjdk-1.8.0.131-3.b12.el7_3.x86_64
# rpm -ql java-1.8.0-openjdk-1.8.0.131-3.b12.el7_3.x86_64 // Check the installed paths
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.131-3.b12.el7_3.x86_64/jre/bin/policytool
...
# ls -hl /usr/lib/jvm/ // Double-check the exact folder to point to
...
lrwxrwxrwx. 1 root root 35 Jun 20 05:06 jre-1.8.0-openjdk -> /etc/alternatives/jre_1.8.0_openjdk
...
# echo -e "\nexport JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk" >> ~/.bashrc
# . ~/.bashrc // Make the setting effective
# echo $JAVA_HOME
/usr/lib/jvm/jre-1.8.0-openjdk

Install Elasticsearch 2.1.1

# cd ~
# wget https://download.elasticsearch.org/elasticsearch/r.../2.1.1/elasticsearch-2.1.1.rpm // Download the Elasticsearch 2.1.1 RPM
# rpm -ivh elasticsearch-2.1.1.rpm // Install Elasticsearch

// Install the kopf plugin
# cd /usr/share/elasticsearch/bin/
# ./plugin install lmenezes/elasticsearch-kopf

Change the network.host setting
Set network.host to _non_loopback:ipv4_. For what the possible network.host values mean, see the Network Settings article on the official site.

# cd /etc/elasticsearch
# vi elasticsearch.yml

# network.host: 192.168.0.1

network.host: _non_loopback:ipv4_

Run Elasticsearch as a background service

// Enable the Elasticsearch service to start on server boot
# sudo systemctl daemon-reload
# sudo systemctl enable elasticsearch.service

Open firewall ports 9200 and 9300
Elasticsearch needs 2 ports:

* HTTP traffic: 9200 by default, taken from the range 9200-9300.
* Node-to-node traffic: 9300 by default, taken from the range 9300-9400.

// Allow ports 9200 and 9300
# firewall-cmd --permanent --add-port={9200/tcp,9300/tcp}
# firewall-cmd --reload

Start Elasticsearch

// Start Elasticsearch service
# sudo systemctl start elasticsearch.service
# netstat -tunlp | grep 9200
tcp6 0 0 192.168.1.100:9200 :::* LISTEN 4131/java

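Working with Elasticsearch

Elasticsearch is written in Java, so it naturally provides a Java API, but here I will start with the simpler RESTful API. It is fine if you are not familiar with RESTful APIs at this point; all you need to know is that we will use curl. Still, it is worth reading up on HTTP methods: 淺談 HTTP Method：表單中的 GET 與 POST 有什麼差別？ – Soul & Shell Blog (a brief look at HTTP methods and the difference between GET and POST in forms). Elasticsearch listens on port 9200 by default; you can change it in config/elasticsearch.yml if needed (with the RPM install above, that file is /etc/elasticsearch/elasticsearch.yml).

The basic format of an Elasticsearch request is:

# curl -X<METHOD> 'http://<host>:<port>/<index>/<type>/[<id>]'

The id is optional, but the HTTP method differs when you leave it out: PUT with an explicit id, POST to let Elasticsearch generate one. Now let's add some data to Elasticsearch. For example, we create a user in twitter (Elasticsearch creates the index for us automatically):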

# curl -XPUT 'http://172.17.0.2:9200/twitter/user/Noob?pretty' -d '{"name": "Noob"}'
{
"_index" : "twitter",
"_type" : "user",
"_id" : "Noob",
"_version" : 1,
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"created" : true
}
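
To make the request format and the id rule concrete, here is a small Java sketch that only builds the request line and does not send anything. EsRequestSketch and indexRequestLine are hypothetical names invented for this example, and the sketch assumes the index, type and id are already URL-safe.

```java
// Hypothetical helper: builds "<METHOD> <URL>" for indexing one document over REST.
final class EsRequestSketch {
    /**
     * With an id the document is stored under that id using PUT.
     * With a null id, POST is used and Elasticsearch generates the id.
     */
    static String indexRequestLine(String host, int port, String index, String type, String id) {
        String base = "http://" + host + ":" + port + "/" + index + "/" + type;
        if (id == null) {
            return "POST " + base;
        }
        return "PUT " + base + "/" + id;
    }

    public static void main(String[] args) {
        // Explicit id, as in the curl example above
        System.out.println(indexRequestLine("172.17.0.2", 9200, "twitter", "user", "Noob"));
        // No id: let Elasticsearch pick one
        System.out.println(indexRequestLine("172.17.0.2", 9200, "twitter", "user", null));
    }
}
```

The first call yields PUT http://172.17.0.2:9200/twitter/user/Noob and the second yields POST http://172.17.0.2:9200/twitter/user.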

What is ?pretty? Try leaving it out: it only makes the response nicely formatted. Next, use the GET method to retrieve the data, for example:
# curl -XGET 'http://172.17.0.2:9200/twitter/user/Noob?pretty'

{  
  "_index" : "twitter",  
  "_type" : "user",  
  "_id" : "Noob",  
  "_version" : 1,  
  "found" : true,  
  "_source":{"name": "Noob"}  
}  

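Since it is called Elasticsearch, searching should be the most important part, right? It also uses the GET method. If you have multiple documents, you can search like this: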

# curl -XGET 'http://172.17.0.2:9200/twitter/tweet/_search?pretty'
{
"took" : 41,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}

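The responses are all nested JSON, which is easy to process. There are of course other ways to search; see the official documentation. Finally, if you are interested in Elasticsearch, take a look at the documentation on GitHub: elastic/elasticsearch.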

Java API
Before we jump straight to how to use the main Java API features, we need to initiate the client using TransportClient:

import java.net.InetAddress;  
import java.util.Arrays;  
import java.util.List;  
  
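import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

import com.google.gson.Gson;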
...  
    public static Client GetClient() throws Exception  
    {  
        Client client = TransportClient.builder().build().addTransportAddress(new InetSocketTransportAddress(  
                InetAddress.getByName("220.134.109.53"),  
                9300));  
        return client;  
    }  

Indexing Documents
The prepareIndex() method lets you store an arbitrary JSON document and make it searchable (see the Index API):

public static class People{  
    String  name;  
    int     age;  
      
    public People(String n, int a){this.name=n; this.age=a;}  
      
    public String toJson() throws Exception    
    {   
        Gson gson = new Gson();    
        return gson.toJson(this);  
    }  
}  
public static void IndexEx() throws Exception  
{  
    Client client = GetClient();  
    People p = new People("Mary", 26);  
    System.out.printf("\t[Info] Index data as:\n%s\n\n", p.toJson());  
    IndexResponse resp = client.prepareIndex("twitter", "user").setSource(p.toJson()).get();  
              
    System.out.printf("\t[Info] Is done? %s\n", resp.isCreated());  
    if(resp.isCreated())  
    {        
        System.out.printf("\t\tid=%s\n", resp.getId());  
        System.out.printf("\t\tindex=%s\n", resp.getIndex());  
        System.out.printf("\t\ttype=%s\n", resp.getType());  
        System.out.printf("\t\tversion=%d\n\n", resp.getVersion());  
    }  
    //assertTrue(resp.isCreated());  
    client.close();  
}  

Execution Result:

[Info] Index data as:
{"name":"Mary","age":26}

[Info] Is done? true
id=AVza45v4XSehhTKMFkXT
index=twitter
type=user
version=1

Querying Indexed Documents
Now that we have a typed searchable JSON document indexed, we can proceed and search using the prepareSearch() method:

public static void QueryEx() throws Exception  
{  
    Client client = GetClient();  
    SearchResponse resp = client.prepareSearch("twitter").setTypes("user").execute().actionGet();     
    List<SearchHit> sh = Arrays.asList(resp.getHits().getHits());
    System.out.printf("\t[Info] %d hit(s)!\n", sh.size());  
    for(SearchHit hit:sh)  
    {  
        System.out.printf("\t[Info] Hit (%s/%s/%s)\n%s\n\n", hit.index(), hit.getType(), hit.getId(), hit.getSourceAsString());           
    }  
    client.close();  
    System.out.printf("\t[Info] Done!\n");  
}  

The SearchResponse returned by actionGet() contains the hits; each hit is a JSON document that matches the search request. We can refine the request with additional parameters that customize the query using the QueryBuilders methods:

public static void QueryEx2() throws Exception  
{  
    Client client = GetClient();  
    SearchResponse resp = client.prepareSearch()  
              .setTypes()  
              .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)  
              .setPostFilter(QueryBuilders.rangeQuery("age").from(10).to(20))  
              .execute()  
              .actionGet();       
    List<SearchHit> sh = Arrays.asList(resp.getHits().getHits());
    System.out.printf("\t[Info] %d hit(s)!\n", sh.size());  
    for(SearchHit hit:sh)  
    {  
        System.out.printf("\t[Info] Hit (%s/%s/%s)\n%s\n", hit.index(), hit.getType(), hit.getId(), hit.getSourceAsString());             
    }  
    client.close();  
    System.out.printf("\t[Info] Done!\n");  
}  

Execution output:

[Info] 2 hit(s)!
[Info] Hit (twitter/user/AVza4JV6XSehhTKMFkXS)
{"name":"Ken","age":18}
[Info] Hit (twitter/user/John)
{"name": "John", "age": 16}
[Info] Done!

Supplement
* CentOS 7 and ELK (Elasticsearch + Logstash + Kibana)
* Use Elasticsearch in your Java applications
* Guide to Elasticsearch in Java
* How to convert Java object to / from JSON (Gson)
* A Guide to FastJson

FastJson is a lightweight Java library used to effectively convert JSON strings to Java objects and vice versa. In this article we’re going to dive into several concrete and practical applications of the FastJson library.

Tuesday, June 20, 2017

[ ELK ] Introduction to the Elasticsearch distributed search system