Although most content is textual in nature, in many cases handling numeric or date/time values is crucial. In a commerce setting, the product’s price, and perhaps other numeric attributes like weight and height, are clearly important. A video search engine may index the duration of each video. Press releases and articles have a timestamp. These are just a few examples of important numeric attributes that modern search applications face.
Indexing numbers:
There are two common scenarios in which indexing numbers is important. In one scenario, numbers are embedded in the text to be indexed, and you want to make sure those numbers are preserved and indexed as their own tokens so that you can use them later as ordinary tokens in searches. For instance, your documents may contain sentences like "Be sure to include Form 1099 in your tax return": you want to be able to search for the number 1099 just as you can search for the phrase “tax return” and retrieve the document that contains the exact number.
To enable this, simply pick an analyzer that doesn’t discard numbers. As we discuss in section 4.2.3, WhitespaceAnalyzer and StandardAnalyzer are two possible candidates. If you feed them the "Be sure to include Form 1099 in your tax return" sentence, they’ll extract 1099 as a token and pass it on for indexing, allowing you to later search for 1099 directly. On the other hand, SimpleAnalyzer and StopAnalyzer discard numbers from the token stream, which means the search for 1099 won’t match any documents. If in doubt, use Luke, which is a wonderful tool for inspecting all details of a Lucene index, to check whether numbers survived your analyzer and were added to the index. Luke is described in more detail in section 8.1.
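The difference between the two groups of analyzers can be illustrated without Lucene at all. The following is a minimal plain-Java sketch, not Lucene's actual analyzer code: the class NumberTokenSketch and its two methods are hypothetical names that only imitate the behavior described above, one splitting on whitespace and one keeping runs of letters only.

```java
import java.util.ArrayList;
import java.util.List;

class NumberTokenSketch {
    // Imitates a whitespace-splitting analyzer: every run of
    // non-whitespace characters becomes a token, so numbers survive.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Imitates a letter-only analyzer: tokens are lowercased runs of
    // letters, and any other character (digits included) is a separator.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "Be sure to include Form 1099 in your tax return";
        System.out.println(whitespaceTokens(sentence));
        System.out.println(letterTokens(sentence));
    }
}
```

With the whitespace-style tokenizer the token 1099 is kept; with the letter-only tokenizer it never reaches the index, so a later search for it cannot match.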
In the other scenario, you have a field that contains a single number and you want to index it as a numeric value and then use it for precise (equals) matching, range searching, and/or sorting. For example, you might be indexing products in a retail catalog, where each product has a numeric price and users must be able to restrict a search by price range.
In past releases, Lucene could only operate on textual terms. This required careful preprocessing of numbers, such as zero-padding or advanced number-to-text encodings, to turn them into Strings so that sorting and range searching by the textual terms worked properly. Fortunately, as of version 2.9, Lucene includes easy-to-use built-in support for numeric fields, starting with the new NumericField class. You simply create a NumericField, use one of its setter methods (such as setIntValue, setLongValue, or setDoubleValue) to record the value, and then add the NumericField to your document like any other field:
- doc.add(new NumericField("price").setDoubleValue(19.99));
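To see why the older text-only approach needed preprocessing such as zero-padding, consider how Strings compare. The sketch below is plain Java and does not use Lucene; ZeroPadSketch, pad, and sortedAsText are hypothetical names chosen for illustration, and the padding handles only non-negative integers.

```java
import java.util.Arrays;

class ZeroPadSketch {
    // Left-pads a non-negative integer with zeros to a fixed width so
    // that lexicographic (String) order agrees with numeric order.
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String digits = Integer.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in the given width");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    // Sorts terms the way a purely textual index would order them.
    static String[] sortedAsText(String... terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Unpadded terms sort in the wrong numeric order.
        System.out.println(Arrays.toString(sortedAsText("7", "10", "200")));
        // Padded terms sort in numeric order.
        System.out.println(Arrays.toString(sortedAsText(pad(7, 5), pad(10, 5), pad(200, 5))));
    }
}
```

Without padding, the term "10" sorts before "7" because comparison proceeds character by character. A numeric field spares you from choosing a width and handling negative or fractional values yourself.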
Although each NumericField instance accepts only a single numeric value, you’re allowed to add multiple instances, with the same field name, to the document. The resulting NumericRangeQuery and NumericRangeFilter will logically "or" together all the values. But the effect on sorting is undefined. If you require sorting by the field, you’ll have to index a separate NumericField that has only one occurrence for that field name.
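The "or" behavior for a multivalued field can be pictured with a small plain-Java model. This is only an illustration of the matching rule described above, not Lucene's implementation; MultiValuedRangeSketch and matchesRange are hypothetical names, and inclusive bounds are an assumption of the sketch.

```java
class MultiValuedRangeSketch {
    // Returns true if at least one of the document's values for the field
    // lies within [min, max]; both bounds are treated as inclusive here.
    static boolean matchesRange(double[] values, double min, double max) {
        for (double v : values) {
            if (v >= min && v <= max) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[] prices = {5.0, 25.0};  // two values under the same field name
        System.out.println(matchesRange(prices, 20.0, 30.0));
        System.out.println(matchesRange(prices, 10.0, 20.0));
    }
}
```

A document with prices 5.0 and 25.0 matches the range 20 to 30 because one value falls inside it. The same document has no single well-defined sort position, which is why sorting needs a separate single-valued field.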
Indexing dates and times:
Email messages include sent and received dates, files have several timestamps associated with them, and HTTP responses have a Last-Modified header that includes the date of the requested page’s last modification. Chances are, like many other Lucene users, you’ll need to index dates and times. Such values are easily handled by first converting them to an equivalent int or long value, and then indexing that value as a number. The simplest approach is to use Date.getTime to get the equivalent value, in millisecond precision, for a Java Date object:
- doc.add(new NumericField("timestamp").setLongValue(new Date().getTime()));
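If you don't need full millisecond resolution, you can quantize the value before indexing it. Because getTime returns milliseconds, dividing by the number of milliseconds in a day yields a day count:
- doc.add(new NumericField("day").setIntValue((int) (new Date().getTime() / (24L * 3600 * 1000))));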
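To index a calendar field instead, such as the day of the month, read it from a Calendar instance:
- Date date = new Date();  // the date to index
- Calendar cal = Calendar.getInstance();
- cal.setTime(date);
- doc.add(new NumericField("dayOfMonth").setIntValue(cal.get(Calendar.DAY_OF_MONTH)));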
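The date-to-number conversions can be checked in isolation before any value reaches the index. The following plain-Java sketch uses only java.util; DateQuantizeSketch, epochDay, and dayOfMonthUtc are hypothetical helper names. Fixing the time zone to UTC is an assumption made so that results do not depend on the machine's default zone.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

class DateQuantizeSketch {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Whole days since 1970-01-01 UTC. Integer division truncates toward
    // zero, so this is intended for dates at or after the epoch.
    static int epochDay(Date date) {
        return (int) (date.getTime() / MILLIS_PER_DAY);
    }

    // Day of the month (1 to 31) as seen in UTC.
    static int dayOfMonthUtc(Date date) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTime(date);
        return cal.get(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        Date now = new Date();
        System.out.println("millis: " + now.getTime());
        System.out.println("day:    " + epochDay(now));
        System.out.println("dom:    " + dayOfMonthUtc(now));
    }
}
```

Each of these values could then be passed to the matching NumericField setter, a long for the raw timestamp and an int for the quantized forms. Coarser values produce fewer distinct terms, at the cost of resolution.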