Sunday, October 7, 2012

[ InAction Note ] Ch1. Meet Lucene - Understanding the core searching/indexing classes


Understanding the core indexing classes:
As you saw in our Indexer class (A simple application), you need the following classes to perform the simplest indexing procedure:
IndexWriter
Directory
Analyzer
Document
Field


Figure 1.5 shows how these classes each participate in the indexing process. What follows is a brief overview of each of these classes, to give you a rough idea of their role in Lucene. We’ll use these classes throughout this book.


- IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.

- Directory
The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to IndexWriter’s constructor.
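To make the "abstract class with interchangeable storage" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's Directory API: the names ToyDirectory, ToyRamDirectory, ToyFsDirectory and the write/read methods are all hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy storage abstraction (hypothetical; NOT Lucene's Directory API).
// Callers only see write/read; each subclass decides where the bytes live.
abstract class ToyDirectory {
    abstract void write(String name, byte[] data);
    abstract byte[] read(String name);

    public static void main(String[] args) {
        ToyDirectory dir = new ToyRamDirectory();
        dir.write("segment-data", new byte[] {1, 2, 3});
        System.out.println(Arrays.toString(dir.read("segment-data")));
    }
}

// Keeps named byte arrays in memory.
class ToyRamDirectory extends ToyDirectory {
    private final Map<String, byte[]> files = new HashMap<>();

    @Override
    void write(String name, byte[] data) {
        files.put(name, data.clone());
    }

    @Override
    byte[] read(String name) {
        // Assumes the name was written before; a missing name is not handled here.
        return files.get(name).clone();
    }
}

// Stores each name as a real file under a base directory on the file system.
class ToyFsDirectory extends ToyDirectory {
    private final Path base;

    ToyFsDirectory(Path base) {
        this.base = base;
    }

    @Override
    void write(String name, byte[] data) {
        try {
            Files.write(base.resolve(name), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    byte[] read(String name) {
        try {
            return Files.readAllBytes(base.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Code written against ToyDirectory works unchanged with either subclass, which is the same reason IndexWriter can accept any Directory.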

Lucene includes a number of interesting Directory implementations, covered in section 2.10. IndexWriter can’t index text unless it’s first been broken into separate words, using an analyzer.

- Analyzer
Before text is indexed, it’s passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting those tokens out of text that should be indexed and eliminating the rest. If the content to be indexed isn’t plain text, you should first extract plain text from it before indexing. Chapter 7 shows how to use Tika to extract text from the most common rich-media document formats. Analyzer is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don’t help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren’t case sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design. You’ll learn much more about them in chapter 4.
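The two jobs mentioned above, dropping stop words and lowercasing tokens, can be sketched in a few lines of plain Java. This is a toy stand-in, not Lucene's Analyzer API: the class name ToyAnalyzer, the analyze method, and the stop-word list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer (hypothetical; NOT Lucene's Analyzer API).
class ToyAnalyzer {
    // Illustrative stop-word list; real analyzers ship their own lists.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "in", "on");

    // Split on anything that is not a letter or digit, lowercase each token,
    // and drop empty tokens and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox jumped on a Log"));
    }
}
```

Only the surviving tokens would end up in the index, which is why the choice of analyzer affects what a search can and cannot match.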

The analysis process requires a document, containing separate fields to be indexed.

- Document
The Document class represents a collection of fields. Think of it as a virtual document—a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time. Fields of a document represent the document or metadata associated with that document. The original source (such as a database record, a Microsoft Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. It’s the text that you extract from such binary documents, and add as a Field instance, that Lucene processes. The metadata (such as author, title, subject and date modified) is indexed and stored separately as fields of a document.

Lucene only deals with text and numbers. Lucene’s core doesn’t itself handle anything but java.lang.String, java.io.Reader, and native numeric types (such as int or float). Although various types of documents can be indexed and made searchable, processing them isn’t as straightforward as processing purely textual or numeric content. You’ll learn more about handling nontext documents in chapter 7.

In our Indexer, we’re concerned with indexing text files. So, for each text file we find, we create a new instance of the Document class, populate it with fields (described next), and add that document to the index, effectively indexing the file. Similarly, in your application, you must carefully design how a Lucene document and its fields will be constructed to match specific needs of your content sources and application.

A document is simply a container for multiple fields; Field is the class that holds the textual content to be indexed.

- Field
Each document in an index contains one or more named fields, embodied in a class called Field. Each field has a name and corresponding value, and a bunch of options, described in section 2.4, that control precisely how Lucene will index the field’s value. A document may have more than one field with the same name. In this case, the values of the fields are appended, during indexing, in the order they were added to the document. When searching, it’s exactly as if the text from all the fields were concatenated and treated as a single text field.
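The behaviour of repeated field names can be pictured with a small plain-Java model. This is only a sketch under stated assumptions: ToyDocument, add, and indexedText are hypothetical names, not Lucene's Document or Field classes, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a document whose field names may repeat
// (hypothetical; NOT Lucene's Document/Field classes).
class ToyDocument {
    // Field name -> values, kept in the order they were added.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // What a search over the field effectively sees:
    // every value with that name, joined in insertion order.
    String indexedText(String name) {
        return String.join(" ", fields.getOrDefault(name, List.of()));
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add("author", "alice");
        doc.add("title", "notes");
        doc.add("author", "bob");
        System.out.println(doc.indexedText("author"));
    }
}
```

A search for either value in the author field would match this document, just as if a single author field had held both values.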

You’ll apply this handful of classes most often when using Lucene for indexing. To implement basic search functionality, you need to be familiar with an equally small and simple set of Lucene search classes.

Understanding the core searching classes:
The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation:
IndexSearcher
Term
Query
TermQuery
TopDocs


The following sections provide a brief introduction to these classes. Chapter 3 will have more advanced topics on them.

- IndexSearcher
IndexSearcher is to searching what IndexWriter is to indexing: the central link to the index that exposes several search methods. You can think of IndexSearcher as a class that opens an index in a read-only mode. It requires a Directory instance, holding the previously created index, and then offers a number of search methods, some of which are implemented in its abstract parent class Searcher; the simplest takes a Query object and an int topN count as parameters and returns a TopDocs object. A typical use of this method looks like this:
  // Open the index that was created earlier (the path is just an example).
  Directory dir = FSDirectory.open(new File("/tmp/index"));
  IndexSearcher searcher = new IndexSearcher(dir);
  // Match documents whose "contents" field contains the term "lucene".
  Query q = new TermQuery(new Term("contents", "lucene"));
  // Ask for the 10 best-scoring hits.
  TopDocs hits = searcher.search(q, 10);
  searcher.close();
We cover the details of IndexSearcher in chapter 3, along with more advanced information in chapters 5 and 6. Now we’ll visit the fundamental unit of searching, Term.

- Term
Term is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the word (text value) of that field. Note that Term objects are also involved in the indexing process. However, they’re created by Lucene’s internals, so you typically don’t need to think about them while indexing. During searching, you may construct Term objects and use them together with TermQuery:
  Query q = new TermQuery(new Term("contents", "lucene"));
  TopDocs hits = searcher.search(q, 10);
This code instructs Lucene to find the top 10 documents that contain the word lucene in a field named contents, sorting the documents by descending relevance.
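Conceptually, a term lookup is an exact match on the pair (field name, term text). The plain-Java sketch below models that with a tiny inverted index. It is a toy, not how Lucene stores its index: ToyTermIndex, add, and lookup are hypothetical names, and relevance scoring is left out entirely.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index keyed by (field name, term text)
// (hypothetical; NOT Lucene's implementation, and without scoring).
class ToyTermIndex {
    // field -> term text -> IDs of documents containing that term in that field.
    private final Map<String, Map<String, SortedSet<Integer>>> postings = new HashMap<>();

    // Record that document docId contains `text` in `field`.
    void add(int docId, String field, String text) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(text, t -> new TreeSet<>())
                .add(docId);
    }

    // Exact match on both the field name and the term text.
    SortedSet<Integer> lookup(String field, String text) {
        return postings.getOrDefault(field, Collections.emptyMap())
                       .getOrDefault(text, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.add(0, "contents", "lucene");
        index.add(1, "title", "lucene");
        index.add(2, "contents", "lucene");
        System.out.println(index.lookup("contents", "lucene"));
    }
}
```

Document 1 is not returned for the contents field because its term lives in a different field, which mirrors why a Term always carries a field name.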

- Query
Lucene comes with a number of concrete Query subclasses. So far in this chapter we’ve mentioned only the most basic Lucene Query: TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, TermRangeQuery, NumericRangeQuery, FilteredQuery, and SpanQuery. All of these are covered in chapters 3 and 5. Query is the common, abstract parent class. It contains several utility methods, the most interesting of which is setBoost(float), which enables you to tell Lucene that certain subqueries should have a stronger contribution to the final relevance score than other subqueries. The setBoost() method is described in section 3.5.12. Next we cover TermQuery, which is the building block for most complex queries in Lucene.
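The effect of a boost can be pictured with a deliberately simplified model in plain Java, where each subquery's score is multiplied by its boost before the scores are summed. This is an assumption made for illustration only: Lucene's actual scoring involves more factors and is not reproduced here, and ToyBoost, Clause, and combinedScore are hypothetical names.

```java
import java.util.List;

// Simplified picture of boosting (hypothetical; NOT Lucene's scoring formula).
class ToyBoost {
    // One subquery: its unboosted score for a document and its boost factor.
    static final class Clause {
        final float rawScore;
        final float boost;

        Clause(float rawScore, float boost) {
            this.rawScore = rawScore;
            this.boost = boost;
        }
    }

    // Sum of boost * rawScore over all clauses.
    static float combinedScore(List<Clause> clauses) {
        float total = 0f;
        for (Clause c : clauses) {
            total += c.boost * c.rawScore;
        }
        return total;
    }

    public static void main(String[] args) {
        // The first clause is boosted to count twice as much as the second.
        List<Clause> clauses = List.of(new Clause(0.5f, 2.0f), new Clause(0.25f, 1.0f));
        System.out.println(combinedScore(clauses));
    }
}
```

In this model, raising one clause's boost raises the final score of every document that matches that clause, which is the kind of influence the paragraph above describes.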

- TermQuery
TermQuery is the most basic type of query supported by Lucene, and it’s one of the primitive query types. It’s used for matching documents that contain fields with specific values, as you’ve seen in the last few paragraphs. Finally, wrapping up our brief tour of the core classes used for searching, we touch on TopDocs, which represents the result set returned by searching.

- TopDocs
The TopDocs class is a simple container of pointers to the top N ranked search results (documents that match a given query). For each of the top N results, TopDocs records the int docID (which you can use to retrieve the document) as well as the float score. Chapter 3 describes TopDocs in more detail.
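As a rough picture of what "top N ranked results, each with a docID and a score" means, here is a plain-Java sketch. It is not Lucene's TopDocs class: ToyTopDocs, Hit, and topN are hypothetical names, and the tie-breaking rule (lower docId first) is an assumption chosen to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a top-N result container (hypothetical; NOT Lucene's TopDocs).
class ToyTopDocs {
    // One hit: the document ID plus its relevance score.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Return at most n hits, best score first. Assumes n >= 0.
    static List<Hit> topN(Map<Integer, Float> scoresByDocId, int n) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<Integer, Float> e : scoresByDocId.entrySet()) {
            hits.add(new Hit(e.getKey(), e.getValue()));
        }
        // Descending score; ties broken by ascending docId (an assumption).
        Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
        hits.sort(byScoreDesc.thenComparingInt(h -> h.docId));
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    public static void main(String[] args) {
        Map<Integer, Float> scores = new HashMap<>();
        scores.put(7, 0.2f);
        scores.put(3, 0.9f);
        scores.put(5, 0.5f);
        for (Hit hit : topN(scores, 2)) {
            System.out.println("doc=" + hit.docId + " score=" + hit.score);
        }
    }
}
```

The container holds only IDs and scores; fetching the matching document itself is a separate step that uses the docId.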

This message was edited 27 times. Last update was at 08/10/2012 10:31:46
