In this Solr glossary we’ll cover some key Solr concepts and terms. Also see the Solr blog posts for more interesting articles.
A Solr collection is a set of documents, kind of like a SQL table, but with documents and fields rather than columns and rows.
A Solr core is the representation, storage, and runtime support for the portion of a Solr collection that lives on a single node or machine. More often than not the terms core and collection are used interchangeably, and some people only use collection when referring to a multi-node, multi-shard SolrCloud cluster.
A Solr schema – called schema.xml – is the metadata that describes a Solr collection in terms of the fields and field types used in the collection.
A Solr config file – called solrconfig.xml – describes the environment settings for accessing a Solr collection.
A document, based on the underlying Lucene search technology, is analogous to a SQL row (but can be anything, including Word documents, web pages, etc) – it is the unit of search and retrieval in Solr. Documents can be added to the index, matched by queries, and returned by Solr in search results.
A document is composed of a set of fields, which are simply key-value pairs. Each field has a type, which is also defined in the schema. Solr is very flexible, so not all fields need to have values in every document. Solr also supports dynamic fields, which are fields whose names have not been explicitly defined in the schema.
Each field must have a type. There are a number of built-in types, and these can be customized.
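As a sketch of how field types and fields might be declared in schema.xml (the field names here are illustrative, not prescriptive):

```xml
<!-- Hypothetical excerpt from schema.xml -->
<types>
  <!-- a simple string type, indexed verbatim without tokenization -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <!-- a tokenized text type (analyzer details omitted here) -->
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"/>
</types>
<fields>
  <field name="id"    type="string"       indexed="true" stored="true" required="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
</fields>
```

Each `<field>` references a `<fieldType>` by name, which is how every field gets its type.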
A field is the smallest unit of data in Solr. A document consists of a set of field key-value pairs.
Generally speaking, fields only have a single value. However, Solr provides the ability to define fields which have any number of values. A Solr response will include all of the values associated with a multivalued field.
Sometimes Solr applications are so dynamic that defining every field in a well-designed schema is not possible (or is simply too much effort). This is where dynamic fields come in. Here, we declare patterns of names – prefixes or suffixes – and declare the type and other attribute values associated with any incoming field name that fits one of those patterns. You can also choose whether to support the wildcard ‘passthrough’ pattern, which matches any field name beyond the explicitly declared fields and the dynamic field prefix or suffix patterns.
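In schema.xml, dynamic field patterns might look like this (suffix conventions are illustrative):

```xml
<!-- any incoming field ending in _s is treated as a string -->
<dynamicField name="*_s"   type="string"       indexed="true" stored="true"/>
<!-- any incoming field ending in _txt is treated as tokenized text -->
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
<!-- optional catch-all 'passthrough' pattern matching any remaining name -->
<dynamicField name="*"     type="string"       indexed="true" stored="true" multiValued="true"/>
```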
A field to be queried must be indexed. It is up to you as the developer to select whether each field is to be indexed or not. Fields can be indexed without being stored. See the Lucene Indexing Guide for more info.
A field must be stored for its value to be retrievable when documents are returned from a query. As developers, we need to decide whether each field is stored or indexed (or both). A field can be stored without being indexed.
Sometimes we want to simply ignore a field – it is neither indexed nor stored. The field may still appear in incoming documents, but its values are discarded.
Solr lets us declare a field that should be populated with the value from one or more other fields. Why? To index the data in a different way or to support the querying of multiple fields using a single, combined field.
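Such a combined field is declared with copyField directives in schema.xml. A minimal sketch (field names are illustrative):

```xml
<!-- a catch-all search field, indexed but not stored -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<!-- populate it from the title and body fields -->
<copyField source="title" dest="text"/>
<copyField source="body"  dest="text"/>
```

Queries can then target the single `text` field instead of querying `title` and `body` separately.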
The heart of our search engine is the index – this is where the data is stored in specialized structures that are optimized for fast and efficient access.
Indexing is the process of taking input data from any of a possible range of sources, transforming it to fit our needs and adding it to Solr to be placed in the index.
Reindexing is similar to indexing, except that stale, old data is discarded and replaced with fresh data. Often this is necessary once we have made changes to the schema that are not compatible with existing data in the index.
Solr always attempts to optimize the rate at which incoming data can be indexed. Part of this process involves buffering data in memory before writing it to the index. While this is very efficient, the downside is that data is not available for queries until it has been written to the index. A commit operation is necessary to write all of the buffered data to the index and make it available for queries. A number of options are provided for Solr commits so that, for example, commits occur automatically – such as every 10 minutes. These options are configurable, and some experimentation might be required to find a suitable commit period.
As of the latest version of Solr (4.4 at the time of writing), Solr has a ‘soft commit’ feature which lets queries see freshly indexed data before it has actually been formally committed to the index. The basic idea is that hard (formal) commits are done every so often, but soft commits are done as frequently as necessary for our application to see the changes.
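In solrconfig.xml, both kinds of commit can be automated; a sketch of one plausible configuration (the intervals here are examples, not recommendations):

```xml
<!-- hard commit: flush buffered data to the index every 10 minutes -->
<autoCommit>
  <maxTime>600000</maxTime>           <!-- milliseconds -->
  <openSearcher>false</openSearcher>  <!-- don't reopen the searcher on hard commit -->
</autoCommit>
<!-- soft commit: make newly indexed documents visible to queries every second -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
```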
Lucene, the underlying search engine, actually creates segments before storing changes to an index. This is really an under-the-hood and transparent operation, and we don’t need to overly concern ourselves with it. Be aware of its existence though.
Segments are reasonably efficient units for storing data, but too many segments, or segments containing a lot of deleted data, can become a problem over time. For this reason, Lucene provides the ability to merge any number of segments into a single, more efficient segment. But keeping large amounts of data in multiple segments can also be more efficient than one monster segment. For this reason, segment merging is mostly an automated process, and the application merely provides parameters to guide that process. One of those parameters is the merge policy. The newest and most interesting merge policy is the tiered merge policy, which groups segments of roughly similar size into tiers and merges segments within a tier, favoring merges that reclaim space from deleted documents. This is the default merge policy for Solr, but the application can tune this policy as well.
Lucene is the core technology that implements the data index for Solr. It is a library of code in the form of Java jar files.
Ultimately, all input data is reduced to what is called a term, the fundamental unit of indexing in Lucene. A term could be a keyword from a block of tokenized text, a numeric field value, a raw character string, a date or time, etc.
Simple data types such as numeric data and raw strings can be directly indexed, but more complex data like natural language text must be heavily processed before it is suitable for indexing. This process is called analysis, and the component of the code library that performs it is called an analyzer. One of Solr’s key strengths is that it provides a large range of basic analysis building blocks that can easily be customized and combined in the Solr schema – without writing a single line of code.
A tokenizer performs the main process of breaking down a long character string into its component words or terms. A filter is a method for fine-tuning the analysis of terms. Together, a tokenizer and some filters, executed in sequence, form a Solr analyzer. There can only be one tokenizer, but any number of filters. Specialized filters called ‘char filters’ execute before the tokenizer and transform the raw input string. The remaining filters, which operate on the tokenized text, are called token filters.
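Such an analysis chain is declared inside a field type in schema.xml. A minimal sketch of one plausible chain (the type name and stopwords file are illustrative):

```xml
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <!-- char filter runs first, on the raw character stream -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- exactly one tokenizer breaks the stream into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- token filters then refine the token stream, in declaration order -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```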
One of the top criteria for measuring a search engine’s performance is its ability to return the most highly relevant results for a query, formally known as relevancy. A number of factors go into computing the relevancy of documents. The evaluation of these factors is called scoring. A score is calculated for each document that matches the specified query terms.
The starting point for relevancy scoring is the frequency of a given term in a document, called the ‘term frequency‘, or “tf”.
The number of documents in which a given term appears is known as the term’s document frequency, ‘df’ or ‘docFreq’ for that term.
Terms which occur in fewer documents are typically scored higher than more common terms, so scoring places a higher value on inverse document frequency (idf).
The main idea behind relevancy scoring is to evaluate a given document against a given set of query terms and to score more highly those documents which have a higher net term frequency across all of the query terms, discounted by the document frequency for those terms. Essentially, multiplying the term frequency (tf) by the inverse document frequency (idf), tf*idf, forms the basis of the relevancy scoring calculation.
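The tf*idf intuition can be sketched with a toy calculation – this is only an illustration of the idea, not Lucene’s actual scoring formula, which adds normalizations and boosts:

```python
import math

def tf_idf(term_freq, doc_freq, num_docs):
    """Toy tf*idf: term frequency damped by how common the term is overall."""
    idf = math.log(num_docs / (1 + doc_freq))  # rarer term -> larger idf
    return term_freq * idf

# "solr" appears 5 times in this document but in only 10 of 1000 documents:
rare = tf_idf(5, 10, 1000)
# "the" appears 20 times here but in 990 of 1000 documents:
common = tf_idf(20, 990, 1000)
assert rare > common  # the rarer term dominates despite its lower tf
```

Even though “the” has four times the term frequency, its near-ubiquity across the collection makes its idf close to zero, so the rarer term wins.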
A Solr application is the combination of client code, back-end code, and the data and metadata managed by Solr on behalf of that application.
A server such as Solr is a network facility that performs services on behalf of applications, which typically run on other machines somewhere on the network. Applications typically communicate with servers using the HTTP protocol and data formats such as XML and JSON.
An instance is a single copy of Solr running on a single machine. Instances might be completely independent, servicing independent applications, or they might form a distributed or cloud cluster servicing the same application or multiple applications. A single instance of Solr can support any number of cores or collections.
A node is simply another term for a Solr instance.
Multicore simply refers to a single Solr instance that is running multiple cores or collections.
To make a Solr application more robust and to handle higher query loads, any number of copies or replicas of the same data and metadata for a Solr collection or core can be made, each running independently and in parallel. Each copy is called a replica and the process of making a copy is called replication.
An application may wish to split or partition an index so that each partition, called a shard, is a separate core managed by a separate Solr instance. There are two specific use cases for sharding. First, Lucene, and hence Solr, has a hard upper limit of roughly 2 billion (2^31-1) documents in a single index, so an application with more documents than that must partition them. The second use case is performance: reducing the number of documents on a single server lets each query execute significantly faster, and in parallel as well. Solr has automatic support for distributing a query to the shards and combining the individual shard results into a single set of query results. Note that shards are distinct from replicas.
A Solr request is an HTTP request to a Solr instance. It may have parameters and data and metadata. There are many types of Solr request, the two most common being query requests and update requests.
A query request is a Solr request to perform a query. This is typically sent to the “/select” request handler.
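A typical query request URL might look like this (host, port, and collection name are illustrative):

```
http://localhost:8983/solr/mycollection/select?q=title:solr&rows=10&fl=id,title&wt=json
```

Here `q` is the query itself, `rows` limits how many documents are returned, `fl` selects which fields to return, and `wt` selects the response format.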
An update request is a Solr request to index or delete documents. This request is typically sent to the “/update” request handler.
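For example, an update request might POST a JSON body like this to the “/update” handler (field names are illustrative):

```
[
  {"id": "doc1", "title": "A first document"},
  {"id": "doc2", "title": "A second document"}
]
```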
A Solr response is the data and metadata that Solr returns to the application upon completion of processing a Solr request.
A Solr request may take any number of optional parameters which are placed in the query portion of the URL for the HTTP request to Solr. Parameters may be placed in any order, but the first is preceded by a question mark and the others are preceded by an ampersand (normal HTTP URL parameter formatting).
Upon completion of a query request, Solr will return a specified number of the top documents that matched the query parameters. This set of documents is referred to as the query results. The query request may select whether to return all or only selected field values for the matched documents.
A Solr request handler is the code within the Solr server that processes a specific category of Solr requests. Each request handler has a name which corresponds to a portion of the HTTP URL for the Solr request, as well as a Java class name to identify the Java code that implements that request handler. A number of request handlers are built into Solr, but developers may develop and add in their own using Solr’s plug-in architecture.
A Solr plug-in is any Java code that can be dynamically configured and loaded by Solr to implement features that are not hardwired into Solr’s inner code Java code. Plugins can include request handlers, query parsers, update processors, codecs, etc.
A Solr update processor is a plug-in that can examine and transform input data at the last stage before Solr adds it to the Solr index. An update processor can filter, cleanse, and remove data, as well as supply missing values, enrich the data from some external source, or even save selected data for later processing.
The execution of a query in Solr involves a number of steps, each of which is performed by a Solr plug-in called a search component. The query component itself is the most basic and common. A highlight component or spellcheck component are also common. The debug component is of course very popular with developers. Solr provides a rich, built-in list of default components. The developer can specify additional components to execute, revise the order of their execution, and even develop custom search components. Individual components can be enabled or disabled either through default parameter settings in the solr config file, or additional parameters in each query request.
Although various parameters may have hardwired default settings within Solr itself, the developer can declare application-specific defaults in the Solr config file, or in some cases in the schema file. The most common case are default request parameters for queries, which are specified by the developer in the request handler settings in the Solr config file.
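In the Solr config file, such defaults are declared on the request handler itself; a sketch of one plausible configuration (the specific defaults are examples):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="rows">10</str>   <!-- return 10 documents unless the request says otherwise -->
    <str name="df">text</str>   <!-- default field to query -->
    <str name="wt">json</str>   <!-- default response format -->
  </lst>
</requestHandler>
```

Any parameter supplied explicitly in a request overrides the corresponding default.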
A cluster is simply a set of machines or servers in fairly close proximity, so that communication between them is very fast. Cloud can sometimes be a synonym for cluster, but more properly it is a set of communicating servers that may or may not be local. SolrCloud, in fact, is intended primarily as support for a cluster of Solr servers. DataStax Enterprise is an example of both cluster and cloud support, the latter because it supports managing servers that are in distinct data centers.
Searching and sorting by distance or containment within a bounding box or other geometric shape enables location-based search. The co-ordinate system may be based on geography, but may be any other system as well, such as time.
Geospatial search implies a geographic coordinate system. Spatial search is typically geospatial, but doesn’t need to be.