程式扎記: [ InAction Note ] Ch2. Building a search index

Preface:
Not all documents and fields are created equal—or at least you can make sure that’s the case by using boosting. Boosting may be done during indexing, as we describe here, or during searching, as described in section 5.7. Search-time boosting is more dynamic, because every search can separately choose to boost or not to boost with different factors, but also may be somewhat more CPU intensive. Because it’s so dynamic, search-time boosting also allows you to expose the choice to the user, such as a checkbox that asks “Boost recently modified documents?”.

Regardless of whether you boost during indexing or searching, take caution: too much boosting, especially without corresponding transparency in the user interface explaining that certain documents were boosted, can quickly and catastrophically erode the user’s trust. Iterate carefully to choose appropriate boosting values and to ensure you’re not doing so much boosting that your users are forced to browse irrelevant results. In this section we’ll show you how to selectively boost documents or fields during indexing, then describe how boost information is recorded into the index using norms.

Boosting documents:
Imagine you have to write an application that indexes and searches corporate email. Perhaps the requirement is to give company employees’ emails more importance than other email messages when sorting search results. How would you go about doing this?

Document boosting is a feature that makes such a requirement simple to implement. By default, all documents have no boost (or, rather, they all have the same boost factor of 1.0). By changing a document’s boost factor, you can instruct Lucene to consider it more or less important with respect to other documents in the index when computing relevance. The API for doing this consists of a single method, setBoost(float), which can be used as shown in listing 2.4. (Note that certain methods, like getSenderEmail and isImportant, aren’t defined in this fragment, but are included in the full example sources that come with the book.)
- Listing 2.4 Selectively boosting documents and fields

Document doc = new Document();  
String senderEmail = getSenderEmail();  
String senderName = getSenderName();  
String subject = getSubject();  
String body = getBody();  
doc.add(new Field("senderEmail", senderEmail, Field.Store.YES,  
        Field.Index.NOT_ANALYZED));  
doc.add(new Field("senderName", senderName, Field.Store.YES,  
        Field.Index.ANALYZED));  
doc.add(new Field("subject", subject, Field.Store.YES,  
        Field.Index.ANALYZED));  
doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));  
String lowerDomain = getSenderDomain().toLowerCase();  
if (isImportant(lowerDomain)) {  // 1) Good domain -> boost factor=1.5  
    doc.setBoost(1.5F);  
} else if (isUnimportant(lowerDomain)) { // 2) Bad domain -> boost factor=0.1  
    doc.setBoost(0.1F);  
}  
writer.addDocument(doc);  

In this example, we check the domain name of the email message sender to determine whether the sender is a company employee. During searching, Lucene will silently increase or decrease the scores of documents according to their boost.

Boosting fields:
Just as you can boost documents, you can also boost individual fields. When you boost a document, Lucene internally uses the same boost factor to boost each of its fields. Imagine that another requirement for the email-indexing application is to consider the subject field more important than the field with a sender’s name. In other words, search matches made in the subject field should be more valuable than equivalent matches in the senderName field in our earlier example. To achieve this behavior, we use the setBoost(float) method of the Field class:

Field subjectField = new Field("subject", subject,  
        Field.Store.YES,  
        Field.Index.ANALYZED);  
subjectField.setBoost(1.2F);  

In this example, we arbitrarily picked a boost factor of 1.2, just as we arbitrarily picked document boost factors of 1.5 and 0.1 earlier. The boost factor values you should use depend on what you’re trying to achieve; you’ll need to do some experimentation and tuning to get the desired effect. But remember: when you want to change the boost on a field or document, you’ll have to fully remove and then re-add the entire document, or use the updateDocument method, which does the same thing.

Document and field boosting come into play at search time, as you’ll learn in section 3.3.1. Lucene’s search results are ranked according to how closely each document matches the query, and each matching document is assigned a score. Lucene’s scoring formula consists of a number of factors, and the boost factor is one of them.

Norms:
During indexing, all sources of index-time boosts are combined into a single floating point number for each indexed field in the document. The document may have its own boost; each field may have a boost; and Lucene computes an automatic boost based on the number of tokens in the field (shorter fields have a higher boost). These boosts are combined and then compactly encoded (quantized) into a single byte, which is stored per field per document. During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.
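To make the quantization step concrete, here is a minimal plain-Java sketch with no Lucene dependency. It combines a document boost, a field boost, and a length factor of 1/sqrt(numTokens), then squeezes the result into one byte using a 3-bit mantissa and 5-bit exponent. The class name NormSketch and its method names are made up for illustration, and the exact bit layout is an assumption modeled on what Lucene 3.x is understood to do, so treat it as a sketch rather than the library's actual code.

```java
public class NormSketch {

    // Assumed combination rule: document boost * field boost * 1/sqrt(token count).
    static float computeNorm(float docBoost, float fieldBoost, int numTokens) {
        return docBoost * fieldBoost * (float) (1.0 / Math.sqrt(numTokens));
    }

    // Quantize a float into one byte: 3 mantissa bits, 5 exponent bits.
    // Extra mantissa bits are simply truncated, so precision is lost.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);          // keep sign, exponent, top 3 mantissa bits
        int zeroPoint = (63 - 15) << 3;        // smallest representable exponent
        if (small <= zeroPoint) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;  // zero/negative -> 0, underflow -> 1
        }
        if (small >= zeroPoint + 0x100) {
            return (byte) -1;                  // overflow -> largest byte value
        }
        return (byte) (small - zeroPoint);
    }

    // Expand the stored byte back into a float for use at search time.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Hypothetical document: doc boost 1.5, field boost 1.2, 4 tokens in the field.
        float norm = computeNorm(1.5f, 1.2f, 4);
        byte stored = encode(norm);
        System.out.println("exact norm   = " + norm);
        System.out.println("decoded norm = " + decode(stored));
        System.out.println("1.2 alone decodes to " + decode(encode(1.2f)));
    }
}
```

With this encoding, a norm of about 0.9 comes back as 0.875, and a boost of 1.2 on its own collapses to 1.0. That is worth keeping in mind when picking boost factors: small differences between boosts may not survive the single-byte encoding.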

Even though norms are initially computed during indexing, it’s also possible to change them later using IndexReader’s setNorm method. setNorm is an advanced method that requires you to recompute your own norm factor, but it’s a potentially powerful way to factor in highly dynamic boost factors, such as document recency or click-through popularity.

One problem often encountered with norms is their high memory usage at search time. This is because the full array of norms, which requires one byte per document per separate field searched, is loaded into RAM. For a large index with many fields per document, this can quickly add up to a lot of RAM. Fortunately, you can easily turn norms off by either using one of the NO_NORMS indexing options in Field.Index or by calling Field.setOmitNorms(true) before indexing the document containing that field. Doing so will potentially affect scoring, because no index-time boost information will be used during searching, but it’s possible the effect is trivial, especially when the fields tend to be roughly the same length and you’re not doing any boosting on your own.
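To put the one-byte-per-document-per-field cost in perspective, here is a back-of-the-envelope estimate in plain Java. The index size (10 million documents) and the number of searched fields (20) are hypothetical figures chosen for illustration, and the class and method names are invented for this sketch.

```java
public class NormsMemoryEstimate {

    // One byte per document for each field that is searched with norms enabled.
    static long normsBytes(long numDocs, int searchedFields) {
        return numDocs * searchedFields;
    }

    public static void main(String[] args) {
        long bytes = normsBytes(10_000_000L, 20);   // hypothetical index
        System.out.println(bytes + " bytes (~" + bytes / (1024 * 1024) + " MiB)");
    }
}
```

For these assumed numbers the norms alone take 200,000,000 bytes, roughly 190 MiB of heap, before any other search-time structure is loaded.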

Beware: if you decide partway through indexing to turn norms off, you must rebuild the entire index. If even a single document has that field indexed with norms enabled, segment merging will “spread” this so that every document consumes one byte for that field, including documents that were indexed with norms disabled. This happens because Lucene doesn’t use sparse storage for norms.

Sunday, October 21, 2012

[ InAction Note ] Ch2. Building a search index - Boosting documents and fields
