Sunday, October 7, 2012

[ InAction Note ] Ch1. Meet Lucene - Understanding the core searching/indexing classes


Understanding the core indexing classes :
As you saw in our Indexer class (A simple application), you need the following classes to perform the simplest indexing procedure:
IndexWriter
Directory
Analyzer
Document
Field


Figure 1.5 shows how these classes each participate in the indexing process. What follows is a brief overview of each of these classes, to give you a rough idea of their role in Lucene. We’ll use these classes throughout this book.


- IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.

- Directory
The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to IndexWriter’s constructor.

Lucene includes a number of interesting Directory implementations, covered in section 2.10. IndexWriter can’t index text unless it’s first been broken into separate words, using an analyzer.

- Analyzer
Before text is indexed, it’s passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting those tokens out of text that should be indexed and eliminating the rest. If the content to be indexed isn’t plain text, you should first extract plain text from it before indexing. Chapter 7 shows how to use Tika to extract text from the most common rich-media document formats. Analyzer is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don’t help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren’t case sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design. You’ll learn much more about them in chapter 4.
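To make the analyzer's role more concrete, here is a minimal plain-Java sketch of the two filtering steps mentioned above: lowercasing and stop-word removal. It deliberately does not use Lucene's Analyzer API; the class name ToyAnalyzer, the method analyze, and the stop-word set are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: NOT Lucene's Analyzer API.
class ToyAnalyzer {
    // Assumed stop-word list, taken from the examples in the text.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    // Splits on runs of non-letter, non-digit characters, lowercases
    // each token, and drops the tokens that are stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty string for a leading delimiter
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Calling ToyAnalyzer.analyze on a sentence returns only the lowercased tokens that would be worth indexing. Real Lucene analyzers are built from tokenizers and token filters and do considerably more than this.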

The analysis process requires a document, containing separate fields to be indexed.

- Document
The Document class represents a collection of fields. Think of it as a virtual document—a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time. Fields of a document represent the document or metadata associated with that document. The original source (such as a database record, a Microsoft Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. It’s the text that you extract from such binary documents, and add as a Field instance, that Lucene processes. The metadata (such as author, title, subject and date modified) is indexed and stored separately as fields of a document.

Lucene only deals with text and numbers. Lucene’s core doesn’t itself handle anything but java.lang.String, java.io.Reader, and native numeric types (such as int or float). Although various types of documents can be indexed and made searchable, processing them isn’t as straightforward as processing purely textual or numeric content. You’ll learn more about handling nontext documents in chapter 7.

In our Indexer, we’re concerned with indexing text files. So, for each text file we find, we create a new instance of the Document class, populate it with fields (described next), and add that document to the index, effectively indexing the file. Similarly, in your application, you must carefully design how a Lucene document and its fields will be constructed to match specific needs of your content sources and application.

A document is simply a container for multiple fields; Field is the class that holds the textual content to be indexed.

- Field
Each document in an index contains one or more named fields, embodied in a class called Field. Each field has a name and corresponding value, and a bunch of options, described in section 2.4, that control precisely how Lucene will index the field’s value. A document may have more than one field with the same name. In this case, the values of the fields are appended, during indexing, in the order they were added to the document. When searching, it’s exactly as if the text from all the fields were concatenated and treated as a single text field.
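The behavior of same-named fields can be pictured with a small plain-Java sketch. This is not Lucene's Document or Field class; ToyDocument, add, and searchableText are hypothetical names, and joining values with a single space is an assumption made only to show the "appended in order" idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: NOT Lucene's Document or Field classes.
class ToyDocument {
    // Field name -> values, kept in the order they were added.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Adds a value under the given field name; repeated names are allowed.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // What a search over this field effectively sees: every value added
    // under the same name, appended in insertion order.
    String searchableText(String name) {
        return String.join(" ", fields.getOrDefault(name, new ArrayList<>()));
    }
}
```

Adding "lucene" and then "search" under the same field name makes searchableText return them as one piece of text, which mirrors how a search treats a multivalued field.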

You’ll apply this handful of classes most often when using Lucene for indexing. To implement basic search functionality, you need to be familiar with an equally small and simple set of Lucene search classes.

Understanding the core searching classes :
The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation:
IndexSearcher
Term
Query
TermQuery
TopDocs


The following sections provide a brief introduction to these classes. Chapter 3 will have more advanced topics on them.

- IndexSearcher
IndexSearcher is to searching what IndexWriter is to indexing: the central link to the index that exposes several search methods. You can think of IndexSearcher as a class that opens an index in a read-only mode. It requires a Directory instance, holding the previously created index, and then offers a number of search methods, some of which are implemented in its abstract parent class Searcher; the simplest takes a Query object and an int topN count as parameters and returns a TopDocs object. A typical use of this method looks like this:
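  Directory dir = FSDirectory.open(new File("/tmp/index"));  // location of the existing index
  IndexSearcher searcher = new IndexSearcher(dir);           // read-only access to the index
  Query q = new TermQuery(new Term("contents", "lucene"));   // match "lucene" in field "contents"
  TopDocs hits = searcher.search(q, 10);                     // top 10 matches
  searcher.close();                                          // release the searcher's resources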
We cover the details of IndexSearcher in chapter 3, along with more advanced information in chapters 5 and 6. Now we’ll visit the fundamental unit of searching, Term.

- Term
Term is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the word (text value) of that field. Note that Term objects are also involved in the indexing process. However, they’re created by Lucene’s internals, so you typically don’t need to think about them while indexing. During searching, you may construct Term objects and use them together with TermQuery:
  Query q = new TermQuery(new Term("contents", "lucene"));   // term = (field name, text value)
  TopDocs hits = searcher.search(q, 10);                     // top 10 matches
This code instructs Lucene to find the top 10 documents that contain the word lucene in a field named contents, sorting the documents by descending relevance.
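One way to picture why a term pairs a field name with a text value is a lookup table keyed by that pair. The sketch below is plain Java and is not Lucene's Term, TermQuery, or index format; ToyTermIndex, addTerm, and termQuery are hypothetical names, and the "field:text" string key is an assumption chosen for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: NOT Lucene's Term, TermQuery, or index format.
class ToyTermIndex {
    // Key is "field:text", standing in for a (field name, text value) term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    private static String key(String field, String text) {
        return field + ":" + text;
    }

    // Records that the given document contains the term (field, text).
    void addTerm(int docId, String field, String text) {
        List<Integer> docs =
                postings.computeIfAbsent(key(field, text), k -> new ArrayList<>());
        if (!docs.contains(docId)) {
            docs.add(docId);
        }
    }

    // Returns the IDs of documents containing exactly this text in this field,
    // in the order the documents were added. No scoring is done here.
    List<Integer> termQuery(String field, String text) {
        return postings.getOrDefault(key(field, text), Collections.emptyList());
    }
}
```

Because the field name is part of the key, the word lucene in a title field and the same word in a contents field are different terms. Unlike Lucene, this sketch does no relevance ranking.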

- Query
Lucene comes with a number of concrete Query subclasses. So far in this chapter we’ve mentioned only the most basic Lucene Query: TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, TermRangeQuery, NumericRangeQuery, FilteredQuery, and SpanQuery. All of these are covered in chapters 3 and 5. Query is the common, abstract parent class. It contains several utility methods, the most interesting of which is setBoost(float), which enables you to tell Lucene that certain subqueries should have a stronger contribution to the final relevance score than other subqueries. The setBoost() method is described in section 3.5.12. Next we cover TermQuery, which is the building block for most complex queries in Lucene.

- TermQuery
TermQuery is the most basic type of query supported by Lucene, and it’s one of the primitive query types. It’s used for matching documents that contain fields with specific values, as you’ve seen in the last few paragraphs. Finally, wrapping up our brief tour of the core classes used for searching, we touch on TopDocs, which represents the result set returned by searching.

- TopDocs
The TopDocs class is a simple container of pointers to the top N ranked search results, that is, documents that match a given query. For each of the top N results, TopDocs records the int docID (which you can use to retrieve the document) as well as the float score. Chapter 3 describes TopDocs in more detail.
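The idea of keeping only the top N results, each as a docID plus a score, can be sketched in plain Java as follows. This is not Lucene's TopDocs or ScoreDoc; ToyTopDocs, Hit, and topN are hypothetical names, and sorting the full hit list is an assumption made for clarity rather than a description of how Lucene collects results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: NOT Lucene's TopDocs or ScoreDoc classes.
class ToyTopDocs {
    // One search hit: a document ID and its relevance score.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Returns at most n hits, ordered by descending score.
    static List<Hit> topN(List<Hit> allHits, int n) {
        List<Hit> sorted = new ArrayList<>(allHits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.max(0, Math.min(n, sorted.size()));
        return new ArrayList<>(sorted.subList(0, limit));
    }
}
```

The result holds only lightweight pointers (IDs and scores), not the documents themselves, which is the same design idea the text attributes to TopDocs.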

This message was edited 27 times. Last update was at 08/10/2012 10:31:46

    1. Dear Karthika, many thanks for your feedback, and I wish you a great day! If you are interested in Lucene, I strongly encourage you to read the book "Lucene in Action" (https://www.manning.com/books/lucene-in-action-second-edition). This post is only my personal note for future reference and may not be complete or systematic enough for learning Lucene. :p

