Solr Introduction – Lucene Indexing Guide

Solr Introduction – What Is Solr?

Welcome to this Solr introduction. Here we’ll look at what Solr is and give an overview of how Lucene indexing works.

Solr is an open source enterprise search server. It is a mature product powering search for public sites such as CNET, Zappos, and Netflix, as well as countless government and corporate intranet sites. It is written in Java, and that language is used to further extend and modify Solr through simple plugin interfaces. However, because Solr is a server that communicates using standards such as HTTP, XML, and JSON, knowledge of Java is useful but not a requirement. In addition to the standard ability to return a list of search results for some query, Solr has numerous other features such as result highlighting, faceted navigation (as seen on most e-commerce sites), query spell correction, query completion, and a “more like this” feature for finding similar documents.

The Underlying Engine: Lucene

Before describing Solr, it is best to start with Apache Lucene, the core technology underlying it. Lucene is an open source, high-performance text search engine library. It was developed and open sourced by Doug Cutting in 2000, has evolved and matured since then with a strong online community, and is the most widely deployed search technology today. Being just a code library, Lucene is not a server, and it certainly isn’t a web crawler either. This is an important fact. There aren’t even any configuration files.

In order to use Lucene, you write your own search code using its API, starting with indexing: you supply documents to it. A document in Lucene is simply a collection of fields, which are name-value pairs containing text or numbers. You configure Lucene with a text analyzer that tokenizes a field’s text from a single string into a series of tokens (words) and further transforms them by chopping off word stems (called stemming), substituting synonyms, and/or performing other processing. The final tokens are said to be the terms. The whole process, starting with the analyzer, is referred to as text analysis. Lucene indexes each document into its index stored on disk. The index is an inverted index, which means it stores a mapping of a field’s terms to the associated documents, along with the ordinal word position from the original text. Finally, you search for documents with a user-provided query string that Lucene parses according to its syntax. Lucene assigns a numeric relevancy score to each matching document, and only the top-scoring documents are returned.
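For the code-minded, here is a minimal sketch of that whole cycle using the Lucene Java API. Treat it as a hedged illustration rather than canonical usage: the field names, index path and sample text are invented for the example, and exact class names vary a little between Lucene versions.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class LuceneHelloWorld {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get("demo-index"));
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Indexing: analyze the document's fields and write the resulting terms
            // to the inverted index on disk.
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
            Document doc = new Document();                 // a document is just a bag of fields
            doc.add(new TextField("title", "Quarterly report", Field.Store.YES));
            doc.add(new TextField("content", "Sales grew strongly in the third quarter.", Field.Store.NO));
            writer.addDocument(doc);
            writer.close();

            // Searching: parse a user query against the "content" field and rank matches by score.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            Query query = new QueryParser("content", analyzer).parse("sales quarter");
            TopDocs hits = searcher.search(query, 10);     // the 10 top-scoring documents
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title") + "  score=" + hit.score);
            }
        }
    }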

How Does Lucene Indexing Work?

So far, we’ve mentioned some fancy words but not really gotten into how Lucene indexes files. Here we’ll go into a little more detail on the subject. The following concepts are fundamental to understanding how Lucene (and therefore Solr) is used to build groovy searching applications.

In a nutshell, Lucene does two things:

  1. It creates indexes
  2. It lets you search content in those indexes

What you decide to put in the index is entirely up to you. This data can be anything from HTML pages and database records to Word documents; the sky is the limit! Essentially, any kind of data object can be made searchable through Lucene indexes. For this walkthrough, we’ll use a set of Word documents as our example.

Creating The Lucene Index

So, step one is to create the index for our set of Word documents. To do this, we need to write some code that takes the contents of the Word documents and turns them into a searchable index. The only way to do this is by brute force: we’ll have to iterate over the Word documents, examining each one and converting it into the information pieces that Lucene needs to work with when it creates the index.
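Lucene itself won’t read .docx files, so that brute-force loop needs some text-extraction help. The sketch below is purely illustrative: it assumes Apache Tika for pulling plain text out of each file (one common choice, not something Lucene requires), and the folder name is made up.

    import org.apache.tika.Tika;
    import java.io.File;

    public class WordDocExtractor {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            File[] wordFiles = new File("word-docs").listFiles((dir, name) -> name.endsWith(".docx"));
            if (wordFiles == null) return;               // folder not found
            for (File f : wordFiles) {
                String text = tika.parseToString(f);     // plain text of the Word document
                String title = f.getName();              // crude stand-in for the real document title
                // ... next: wrap text and title in the Lucene pieces described below ...
            }
        }
    }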

What are the pieces that Lucene needs to create the index? There are two:

  1. Documents
  2. Fields

These two abstractions are so key to Lucene that Lucene represents them with two top level Java classes, Document and Field. A Document, not to be confused with our actual Word documents, is a Java class that represents a searchable item in Lucene. By searchable item, we mean that a Document is the thing that you find when you search. It’s up to you to create these Documents.

Lucky for us, it’s a pretty clear step from an actual Word document to a Lucene Document. I think anyone would agree that it is the Word documents that our users will want to find when they conduct a search. This makes our processing rather simple: we will create a single Lucene Document for each of our actual Word documents.

Creating The Lucene Document and Fields

But how do we do that? It’s actually very easy. First, we make the Document object with the new Document() constructor; simple. But at this point the Document is meaningless. We now have to decide what Fields to add to it. Here we need to put on our thinking caps and decide what information we want to store in the fields. A Document is made up of any number of Fields, and each Field has a name and a value (kind of like a hashmap). That’s all there is to it.

Two fields are created almost universally by developers creating Lucene indexes. The most important is the “content” field. This is the Field that holds the content of the Word document for which we are creating the Lucene Document. Bear in mind, the name of the Field is entirely arbitrary, but most people call one of the Fields “content” and stick the actual content of the real-world searchable object, the Word document in our case, into the value of that Field. Keep in mind, a Field is simply a {name:value} pair.

Another very common Field that developers create is the “title” Field. This field’s value will be the title of the Word document. What other information about the Word document might we want to keep in our index? Other common fields are things like “author”, “creation_date”, “keywords”, etc. The identification of the fields that you will need is entirely driven by your business requirements.

So, for each Word document that we want to make searchable, we will have to create a Lucene Document, with Fields such as those we outlined above. Once we have created the Document with those Fields, we then add it to the Lucene index writer and ask it to write our index. That’s it! We now have a searchable index. But hang on, not so fast – we have glossed over a couple of Field details. Let’s take a closer look at the Lucene Fields themselves.
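Here is a hedged sketch of that loop in Java. The ExtractedDoc holder, the field names and the index path are our own illustrative choices; the TextField and Field.Store choices it makes are exactly the details we dig into next.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;
    import java.util.List;

    public class WordDocIndexer {
        // Hypothetical holder for data already pulled out of one Word file.
        public record ExtractedDoc(String title, String author, String text) {}

        public static void indexAll(List<ExtractedDoc> docs) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("word-doc-index")),
                    new IndexWriterConfig(new StandardAnalyzer()));
            for (ExtractedDoc d : docs) {
                Document doc = new Document();                                  // one Lucene Document per Word document
                doc.add(new TextField("content", d.text(), Field.Store.NO));    // searchable body text
                doc.add(new TextField("title", d.title(), Field.Store.YES));    // searchable, returned with hits
                doc.add(new TextField("author", d.author(), Field.Store.YES));  // more searchable metadata
                writer.addDocument(doc);                                        // hand the Document to the index writer
            }
            writer.close();                                                     // finish writing the index
        }
    }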

Lucene Fields – Either Stored or Indexed

A Field can be kept in the index in more than one way. The most obvious way, and perhaps the only way whose existence you might at first suspect, is the searchable way. In our example, not surprisingly, we expect that if the user types in a word that exists in the contents of one of the Word documents, then the search will return that Word document in the search results. To do this, Lucene must index that Field. The nomenclature is a bit confusing at first, but note that it is entirely possible to “store” a Field in the index without making it searchable. In other words, it’s possible to “store” a Field but not “index” it. We’ll get to why you might want to do that in a minute.

The first distinction that Lucene makes between the ways it can keep a Field in the index is whether the Field is stored or indexed. If we expect a match on a Field’s value to cause the Document to be hit by the search, then we must index the Field. If we only store the Field, its value can’t be reached by the search queries. Why then store a Field? Simple: when we hit the Document, via one of the indexed fields, Lucene will return us the entire Document object. All stored Fields will be available on that Document object; indexed Fields will not be on that object. An indexed Field is information used to find a Document; a stored Field is information returned with the Document. This is an important fact to remember.

This means that while we might not make searches based upon the contents of a given Field, we might still be able to make use of that Field’s value when the Document is returned by the search. The most obvious use case I can think of is a “url” Field for a web based Document. It makes no sense to search for the value of a URL, but you will definitely want to know the URL for the documents that your search returns. How else would your results page be able to steer the user to the hit page? This is a very important point: a stored Field’s value will be available on the Document returned by a search, but only an indexed Field’s value can actually be used as the target of a search.
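In recent Lucene releases this distinction shows up in the field class you choose. A hedged sketch, reusing the “content” and “url” examples from this section:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;

    public class StoredVsIndexed {
        public static Document example() {
            Document doc = new Document();
            // Indexed: its terms can match a query and make this Document a hit.
            doc.add(new TextField("content", "full text of the page ...", Field.Store.NO));
            // Stored only: never matched by a query, but returned with the hit so the
            // results page can link back to the original.
            doc.add(new StoredField("url", "http://example.com/page.html"));
            return doc;
        }
    }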

Technically, stored Fields are kept within the Lucene index. But we must keep track of the fact that an indexed Field is different from a stored Field. This is a slightly unfortunate naming convention, but we now understand the principle, which is what matters.

Lucene Indexed Fields Analyzed

Alas, our example is about to become slightly more complex (nothing we can’t handle though!). The new concept? An indexed Field can be indexed in two different fashions: analyzed or non-analyzed. First, we can index the value of the Field as a single chunk. In other words, we might have a “phone number” Field. When we search for phone numbers, we need to match the entire value or nothing. This makes perfect sense. So, for a Field like phone number, we index the entire value ATOMICALLY into the Lucene index – this is a non-analyzed Lucene field.

But let’s consider the “content” Field of the Word document. Do we want the user to have to match that entire Field? Definitely not. We want the contents of the Word document to be broken down into searchable tokens. This process is known as analysis. We can start by throwing out all of the unimportant (stop) words like “a”, “the”, “and”, etc. (there are a few of these processes, discussed in the introductory paragraph, including stemming). There are many other optimizations we can make, but the bottom line is that a Field like “content” should be analyzed by Lucene. This produces a targeted, lightweight index. This is how search becomes efficient and powerful.
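To make “analyzed” concrete, here is a hedged sketch of what an analyzer actually produces. EnglishAnalyzer is just one stock choice (it lowercases, drops English stop words and applies stemming), and exact constructors vary a little between Lucene versions.

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalysisDemo {
        public static void main(String[] args) throws Exception {
            EnglishAnalyzer analyzer = new EnglishAnalyzer();
            // Run a sentence through the analyzer as if it were the "content" field.
            TokenStream stream = analyzer.tokenStream("content", "The engineer was searching the indexes");
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());   // stemmed, lowercased terms, e.g. engin / search / index
            }
            stream.end();
            stream.close();
            analyzer.close();
        }
    }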

In the APIs, this comes down to the fact that when we create a Field, we must specify

  1. Whether to STORE it or not
  2. Whether to INDEX it or not
    • If indexing, whether to ANALYZE it or not

Now we should be clear on the details of Fields. Importantly, we can both store and index a given Field. It’s not an either/or choice.
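Pulling those three decisions together, here is a hedged sketch of how each combination is typically expressed with the stock field classes in recent Lucene releases (older releases used a single Field constructor with Store and Index flags instead); the field names and values are just the examples from this guide.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class FieldChoices {
        public static Document example() {
            Document doc = new Document();
            // Indexed + analyzed (optionally stored too): free text broken down into terms.
            doc.add(new TextField("content", "the full text of the Word document ...", Field.Store.NO));
            // Indexed but NOT analyzed: the whole value is one term, matched exactly or not at all.
            doc.add(new StringField("phone_number", "+1-555-0100", Field.Store.YES));
            // Stored but NOT indexed: never searchable, only returned with a hit.
            doc.add(new StoredField("url", "http://example.com/doc.docx"));
            return doc;
        }
    }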

Indexing Summary

When we have added all the Documents to the index, we simply tell the index writer to create the index. From this point on, we can search according to the indexed Fields of any of our Documents. Look out for an upcoming Solr entry giving a high-level overview of searching for things in a Lucene index.
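In code, “telling the index writer to create the index” typically amounts to no more than a commit and a close on the IndexWriter built earlier; a tiny sketch, assuming the writer from the previous examples:

    import org.apache.lucene.index.IndexWriter;

    public class IndexFinalizer {
        // commit() makes all added Documents visible to searchers;
        // close() commits any pending changes (by default) and releases the index lock.
        static void finish(IndexWriter writer) throws Exception {
            writer.commit();
            writer.close();
        }
    }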

