程式扎記: [ InAction Note ] Ch2. Building a search index

Friday, May 9, 2014

[ InAction Note ] Ch2. Building a search index - Indexing numbers, dates, and times

Preface:
Although most content is textual in nature, in many cases handling numeric or date/time values is crucial. In a commerce setting, the product’s price, and perhaps other numeric attributes like weight and height, are clearly important. A video search engine may index the duration of each video. Press releases and articles have a timestamp. These are just a few examples of important numeric attributes that modern search applications face.

Indexing numbers:
There are two common scenarios in which indexing numbers is important. In one scenario, numbers are embedded in the text to be indexed, and you want to make sure those numbers are preserved and indexed as their own tokens so that you can use them later as ordinary tokens in searches. For instance, your documents may contain sentences like "Be sure to include Form 1099 in your tax return": you want to be able to search for the number 1099 just as you can search for the phrase “tax return” and retrieve the document that contains the exact number.

To enable this, simply pick an analyzer that doesn’t discard numbers. As we discuss in section 4.2.3, WhitespaceAnalyzer and StandardAnalyzer are two possible candidates. If you feed them the "Be sure to include Form 1099 in your tax return" sentence, they’ll extract 1099 as a token and pass it on for indexing, allowing you to later search for 1099 directly. On the other hand, SimpleAnalyzer and StopAnalyzer discard numbers from the token stream, which means the search for 1099 won’t match any documents. If in doubt, use Luke, which is a wonderful tool for inspecting all details of a Lucene index, to check whether numbers survived your analyzer and were added to the index. Luke is described in more detail in section 8.1.
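The difference between the two groups of analyzers can be illustrated without Lucene at all. The following is a minimal sketch in plain Java, not the analyzers' actual implementation: `whitespaceTokens` and `letterTokens` are hypothetical helpers that only imitate "split on whitespace" and "keep letters only" behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class NumberTokenDemo {
    // Rough stand-in for whitespace tokenization: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Rough stand-in for letter-only tokenization: lowercase, then split on
    // every run of non-letter characters, so digits never end up in a token.
    static List<String> letterTokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
    }

    public static void main(String[] args) {
        String s = "Be sure to include Form 1099 in your tax return";
        System.out.println(whitespaceTokens(s)); // includes the token 1099
        System.out.println(letterTokens(s));     // 1099 is gone
    }
}
```

With the second approach the number never reaches the index, so no query for 1099 can match.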

In the other scenario, you have a field that contains a single number and you want to index it as a numeric value and then use it for precise (equals) matching, range searching, and/or sorting. For example, you might be indexing products in a retail catalog, where each product has a numeric price and you must let your users restrict a search by price range.

In past releases, Lucene could only operate on textual terms. This required careful preprocessing of numbers, such as zero-padding or advanced number-to-text encodings, to turn them into Strings so that sorting and range searching by the textual terms worked properly. Fortunately, as of version 2.9, Lucene includes easy-to-use built-in support for numeric fields, starting with the new NumericField class. You simply create a NumericField, record the value with one of its setter methods (setIntValue, setLongValue, setFloatValue, or setDoubleValue, each of which returns the field itself so calls can be chained), and then add the NumericField to your document just like any other Field. Here’s an example:

doc.add(new NumericField("price").setDoubleValue(19.99));  
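To see why the older text-only approach needed preprocessing such as zero-padding, consider how Strings compare. The sketch below is plain Java and independent of Lucene; `pad` is a hypothetical helper, and the width of 10 digits is an assumption that only suits non-negative int values.

```java
import java.util.Locale;

class ZeroPadDemo {
    // Left-pad a non-negative int to 10 digits so that lexicographic
    // (String) order agrees with numeric order.
    static String pad(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("non-negative values only");
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "10" sorts before "9" because '1' < '9'.
        System.out.println("9".compareTo("10") > 0);
        // Padded: "0000000009" sorts before "0000000010", as intended.
        System.out.println(pad(9).compareTo(pad(10)) < 0);
    }
}
```

NumericField removes the need for this kind of manual encoding.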

Under the hood, Lucene works some serious magic to ensure numeric values are indexed to allow for efficient range searching and numeric sorting. Each numeric value is indexed using a trie structure, which logically assigns a single numeric value to larger and larger predefined brackets. Each bracket is assigned a unique term in the index, so that retrieving all documents within a single bracket is fast. At search time, the requested range is translated into an equivalent union of these brackets, resulting in a high-performance range search or filter.
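The bracket idea can be sketched in a few lines of plain Java. This is a simplified illustration only, not Lucene's actual term encoding: the `brackets` helper, the "shift:prefix" label format, and the step of 4 bits are all assumptions made for the example, and it handles non-negative ints only.

```java
import java.util.ArrayList;
import java.util.List;

class TrieBracketDemo {
    // For each level, drop `shift` low-order bits; values that share the
    // remaining prefix fall into the same bracket at that level.
    static List<String> brackets(int value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add(shift + ":" + (value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // 423 and 429 differ at the finest level (shift 0) but share the
        // bracket at shift 4, because both are in the range 416..431.
        System.out.println(brackets(423, 4));
        System.out.println(brackets(429, 4));
    }
}
```

A range query can then cover a wide span with a handful of coarse brackets plus a few fine ones at the edges, instead of enumerating every distinct value.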

Although each NumericField instance accepts only a single numeric value, you’re allowed to add multiple instances, with the same field name, to the document. The resulting NumericRangeQuery and NumericRangeFilter will logically "or" together all the values. But the effect on sorting is undefined. If you require sorting by the field, you’ll have to index a separate NumericField that has only one occurrence for that field name.
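The "or" semantics for multi-valued fields can be pictured as follows. This is a plain Java sketch of the matching rule only, with a hypothetical `matchesRange` helper; it does not call Lucene and says nothing about how the index evaluates the range.

```java
class MultiValueRangeDemo {
    // A document matches an inclusive range if ANY of its values for the
    // field lies within [min, max].
    static boolean matchesRange(long[] values, long min, long max) {
        for (long v : values) {
            if (v >= min && v <= max) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] prices = {5, 20, 75}; // three values under the same field name
        System.out.println(matchesRange(prices, 10, 30));  // true, 20 is in range
        System.out.println(matchesRange(prices, 30, 60));  // false, no value in range
    }
}
```

The same picture shows why sorting is ambiguous here: with three values there is no single obvious sort key for the document.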

Indexing dates and times:
Email messages include sent and received dates, files have several timestamps associated with them, and HTTP responses have a Last-Modified header that includes the date of the requested page’s last modification. Chances are, like many other Lucene users, you’ll need to index dates and times. Such values are easily handled by first converting them to an equivalent int or long value, and then indexing that value as a number. The simplest approach is to use Date.getTime to get the equivalent value, in millisecond precision, for a Java Date object:

doc.add(new NumericField("timestamp").setLongValue(new Date().getTime()));  

Alternatively, if you don’t require full millisecond resolution for your dates, you can simply quantize them. If you need to quantize down to seconds, minutes, hours, or days, it’s straight division:

// Date.getTime() returns milliseconds, so divide by the milliseconds in one day
doc.add(new NumericField("day").setIntValue((int) (new Date().getTime() / (24L * 3600 * 1000))));
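Because Date.getTime returns milliseconds since the epoch, each coarser unit is one more division. The helper names below are hypothetical, and the sample timestamp (midnight UTC on May 9, 2014) is chosen only for illustration; the sketch uses plain Java with no Lucene calls.

```java
class QuantizeDemo {
    static final long MILLIS_PER_SECOND = 1000L;
    static final long MILLIS_PER_MINUTE = 60 * MILLIS_PER_SECOND;
    static final long MILLIS_PER_HOUR = 60 * MILLIS_PER_MINUTE;
    static final long MILLIS_PER_DAY = 24 * MILLIS_PER_HOUR;

    // Each method truncates an epoch-millisecond timestamp to a coarser unit.
    static long toSeconds(long millis) { return millis / MILLIS_PER_SECOND; }
    static long toMinutes(long millis) { return millis / MILLIS_PER_MINUTE; }
    static long toHours(long millis)   { return millis / MILLIS_PER_HOUR; }
    static long toDays(long millis)    { return millis / MILLIS_PER_DAY; }

    public static void main(String[] args) {
        long millis = 1_399_593_600_000L; // 2014-05-09T00:00:00Z
        System.out.println(toSeconds(millis));
        System.out.println(toHours(millis));
        System.out.println(toDays(millis));
    }
}
```

Day and hour counts for present-day timestamps fit comfortably in an int; second counts are safer kept as a long.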

If you need to quantize further, to month or year, or perhaps you’d like to index hour of day or day of week or month, you’ll have to create a Calendar instance and get fields from it:

Calendar cal = Calendar.getInstance();  
cal.setTime(date);  
doc.add(new NumericField("dayOfMonth").setIntValue(cal.get(Calendar.DAY_OF_MONTH)));  
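Two details are easy to miss when pulling fields from a Calendar: the result depends on the calendar's time zone, and Calendar.MONTH is zero-based. The sketch below is plain Java; pinning the zone to UTC and the `utcCalendar` helper name are assumptions for the example, not something the text above prescribes.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

class CalendarFieldsDemo {
    // Build a calendar fixed to UTC so extracted fields do not vary with
    // the default time zone of the machine doing the indexing.
    static Calendar utcCalendar(Date date) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTime(date);
        return cal;
    }

    public static void main(String[] args) {
        Calendar cal = utcCalendar(new Date(1_399_593_600_000L)); // 2014-05-09T00:00:00Z
        System.out.println(cal.get(Calendar.YEAR));
        System.out.println(cal.get(Calendar.MONTH) + 1);   // MONTH is zero-based
        System.out.println(cal.get(Calendar.DAY_OF_MONTH));
        System.out.println(cal.get(Calendar.DAY_OF_WEEK) == Calendar.FRIDAY);
    }
}
```

Each extracted value can then be indexed with setIntValue in the same way as dayOfMonth above.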

As you’ve seen, Lucene makes it trivial to index numeric fields. You’ve seen several approaches for converting dates and times into equivalent numeric values for indexing.
