Specbee: Get the Most Out of Apache Solr: A Technical Exploration of Search Indexing

Get the Most Out of Apache Solr: A Technical Exploration of Search Indexing Saranya Ashok Kumar 21 Feb, 2023 Subscribe to our Newsletter Now Subscribe Leave this field blank

A search feature enhances the user experience of a website by allowing the user to find what they’re looking for easily and quickly. More so for large websites, e-commerce sites, and sites with dynamic content (news sites, blogs).

Apache Solr is one of the most popular search platforms used by websites of all sizes. It is an open-source search engine based on Java that lets you search through large amounts of data, like articles, products, customer reviews and more. Take a deeper look into Apache Solr in this article.

Check out this article to learn how to configure Apache Solr in Drupal

Why is Apache Solr so popular?

Apache Solr is fast and flexible and allows for full-text search, hit highlighting (highlights the matching search term), faceted search (a more refined search), real-time indexing (allows new content to be indexed immediately), dynamic clustering (organizes search results into groups), database integration, NoSQL features (non-relational database) and rich document handling (to index a wide variety of document formats like PDF, MS Office, Open office).

Some good-to-know facts about Apache Solr:

  • It was initially developed by CNET networks, inc. as a search engine for their websites and articles. Later, it was open-sourced and became a top-level Apache project.
  • Supports multiple programming languages like PHP, Java, Python, and Ruby. It also provides APIs for these languages.
  • Has built-in support for geospatial search, allowing to search content based on its location. Especially useful for sites like real estate websites, travel websites, etc.
  • Supports advanced search features like spell checking, autocomplete, and custom search via APIs and plugins.
  • Uses Lucene for indexing and searching.

What is Lucene

Apache Lucene is an open-source Java search library that lets you easily add search or information retrieval to the application. It is versatile, powerful, accurate, and works on an efficient search algorithm.

Although known for its full-text search capabilities, Lucene can also be used for document classification, data analysis and information retrieval. It also supports many languages other than English like German, French, Spanish, Chinese, Japanese, and more.

What is Indexing?

All search engines begin with indexing. Indexing is the processing of original data into highly efficient cross-reference lookup to facilitate rapid search.

Search engines don't index data directly. The texts are first broken into tokens (atomic elements). Searching is the process of consulting the search index and retrieving the doc matching the query.

Advantages of indexing

  • Fast and accurate information retrieval (collects, parses and stores)
  • Without indexing, the search engine requires more time to scan every document

Indexing flow

Image removed.

First, the document will be analyzed and split into tokens. All those tokens will be indexed to the inverted index. Inverted index is a way in which Solr builds the index.

How inverted indexing works

Lets consider we have 3 documents:

  1. I love chocolate (D 1)
  2. I ordered chocolate cake (D 2)
  3. I prepared big vanilla cake (D 3)

The way it is tokenized is as shown in the 2nd column of the below table.

Image removed.

“Chocolate” is available in D1 and D2
“Cake” is available in D2 and D3
“Big” is available in D3
“Ordered” is available in D2
“Prepared” is available in D3
“Vanilla” is available in D3

You will notice that words like “I”, “love” are not tokenized. These are called Stop words which will not be indexed or searchable by Solr.

So when someone searches for the term “Chocolate Cake”, the engine looks into the index. Instead of looking for the document, it first looks into the index to see which documents do the words “Chocolate” and “Cake” fall under. This makes it easy and faster to fetch the particular document only. This is called inverted indexing.

Storage Schema

Apache Solr uses a document-based storage schema and stores every piece of data as a separate document within a collection. This allows for efficient and flexible storage and retrieval of data.

In Drupal, each node is considered as a document. So when you index your node to Apache Solr, it is considered as a document. Each document can contain multiple fields. Lucene does not have common global schema. Which means you can index any type of field in each document in Apache Solr.

Image removed.

How to Install Apache Solr

  • First, make sure you have Java installed on your system.
  • Next, let’s install Solr from here: https://solr.apache.org/downloads.html
  • Download and extract Solr.
  • Run this command on the Solr folder.

           ◦ bin/solr -e techproducts

             This will create a dummy core for demonstration and it will also start the Solr server.

  • Once the server has started, go to your browser and type “http://localhost:8983/”.
  • Make sure Solr is installed successfully with dummy core.
Image removed.

Directory Structure

Once you have installed Solr, you will see many folders like:

Docs - contains documentation about Solr
Dist - Solr main .jar file
Contrib - contains add-on plugins and specialised features of Solr
Bin - scripts of Solr
Example - contains demonstrate solr capabilities
Server - heart of Solr. Contains Solr web application, logs, Solr core

Config files

To create a core, we need two files mandatory.

  • Schema.xml
  • Solrconfig.xml

Schema.xml

  • It will contain the types of fields you plan to support and how those types should be analyzed.

Solrconfig.xml

  • Contains various settings that controls the behavior of a Solr core like request handler, request dispatcher, query components, update handlers, etc.

Querying in Solr

Now lets see how to query the Solr results in the Solr admin UI.

Query Parameter

  • Local parameters are arguments in a Solr request that are specific to a query parameter.

For example: cat: electronics

Image removed.

 

Query Parameter with operations

  • We can query multiple fields with operation.

For example: cat: electronics id:TWINX2048-3200PRO with q.op AND
[OR]
cat: electronics AND id:TWINX2048-3200PRO

Image removed.

[OR]

Image removed.

 

Filter Query

A filter query helps narrow down the results of a search. A query can be specified by the fq parameter to restrict which documents are returned in the superset, without affecting the score.

Image removed.

 

Sort Parameter

The sort parameter arranges search results in either ascending (asc) or descending (desc) order. Depending on the content, the parameter can be used either numerically or alphabetically.

Image removed.

 

Rows Parameter

The rows parameter allows you to paginate results from a query.

Image removed.

 

Field List Parameter

The fl parameter limits the information included in a query response to a specified list of fields.

Image removed.

 

Default field Parameter

Default field parameter is the default field for query parameter.

Image removed.

 

Highlights Parameter

The highlight feature in Solr enables the inclusion of fragments of documents that match a query.

Image removed.

 

Some of the most common highlight parameters are:

  • Hl.fl - Highlights a list of fields.
  • Hl.simple.pre - Specifies which "tag" should be used before a highlighted word.
  • Hl.simple.post - Specifies which “tag” should be used after a highlighted term.
  • hl.highlightMultiTerm - If it is set to true, Solr will highlight wildcard queries. If false, they won’t be highlighted at all.
Image removed.

 

Facet:

Facets enable users to explore and refine large sets of search results. They’re displayed in a UI as checkboxes, dropdowns or other controls. The two general parameters to control facets are:

  1. Facet parameter

Using the facet parameter, users can generate facets based on the values of one or more fields in their search index. In the search results, the facet parameter can be configured to control how facets are generated and displayed.

      2. Facet.query paramater

When a user includes a facet.query parameter in their Solr query, Solr will generate a list of facet counts that correspond to the number of documents in the index that match each query. Facet.query is useful when you want to generate facets based on complex search criteria that can't be easily represented using a simple field value. 

There are several other facet parameters like the facet.field (to specify the fields that should be used to generate facets), facet.limit (max number of facets to display for each field), facet.mincount (min number of document needed for the facet to be included in the response), facet.sort (specifies the order in which the facet values should display).

Image removed.

 

Image removed.

Final Thoughts

Apache Solr is a highly versatile search engine that comes with many interesting features which can be customized as per your requirements. Drupal works extremely well with Apache Solr. If you’re looking for Drupal experts to configure a powerful search engine for your new project, we would love to take it further!

Author: Saranya Ashok Kumar

Meet Saranya Ashok Kumar, Drupal Specialist, who’s extremely fond of coding and Drupal and likes sharing valuable content through her YouTube channel. Saranya likes tapping her toes to her favorite music and dreams of traveling to the Maldives.

Email Address Subscribe Leave this field blank Drupal Development Drupal Planet Drupal

Leave us a Comment

 

Recent Blogs

Image Image removed.

Get the Most Out of Apache Solr: A Technical Exploration of Search Indexing

Image Image removed.

From Mother to Manager - Shreevidya’s Career Story

Image Image removed.

How to Integrate Google Tag Manager with Drupal 9 - An Easy Step-by-Step Tutorial

Want to extract the maximum out of Drupal? TALK TO US

Featured Case Studies

Image removed.Image removed.

Great Southern Homes, one of the fastest growing home builders in the US, sees greater results with Drupal

Explore
Image removed.Image removed.

A reimagined digital solution for Abaco, a global leader in embedded computing systems for the defense industry

Explore
Image removed.Image removed.

Upgrading the web presence of IEEE Information Theory Society, the most trusted voice for advanced technology

Explore
View all Case Studies
PubDate

Tags