Implementing Search: An Overview of Building a Search Engine
Introduction
Allowing consumers to find the content they see as relevant is a critical feature for many applications, from eCommerce and media companies to financial applications and booking services, to name a few. Yet many developers find search daunting to implement. The typical developer uses search tools infrequently, and the tools themselves demand an understanding of mathematics and statistics; a complex search system may also incorporate the latest trends in machine learning.
This paper aims to break down these barriers, making search an easy and practical feature to implement. What are the typical characteristics of search? What tools exist? How would you create a search engine that is maintainable long-term? Readers will gain a general understanding of the problem space. Topics covered include:
- Search engine optimization
- Search engine tools
- Implementing search
- Features of a search engine
- Best practices
After reading, you should have a basic understanding of the terminology, insights into how to design a search engine, and suggestions for filtering content into the search database. Also provided is an overview of methods for extracting data and building it into standard features. The intent is to give the reader an overall view of implementation as a starting point that may launch further research.
Overview of the search space
Google set the gold standard for what users expect from a search engine. Users can search with logical operators and/or synonyms, match a word or phrase, and use wildcards and grouped terms. The Google team perfected what search looks like, and it is now the most visited website, with 89.3 billion visits a month. Google owns 91.9% of the search engine marketplace, 46% of product searches begin on Google, and 84% of surveyed respondents reported using Google more than three times daily (Mohsin, 2022). An entire industry has been built on search engine optimization (commonly known as SEO), with consulting agencies created to assist companies in the race for a prime spot as a first-page result for a particular term.
For any content-based front end, search is one of the essential features to get right. Research on website customer satisfaction found that "more than 40% say a search box is the most important feature on a website" (Southern, 2019). Search is one of many features that enable content discovery. On a desktop website, content discovery is easier: companies can show users more at once on a larger screen. On the smaller screens of mobile devices, search becomes more important. With a standard search bar being one of the most recognizable UI designs, users have pre-conceived expectations. Worldwide, mobile devices accounted for around 55% of internet traffic in 2021 (Clement, 2022). Beyond mobile, search enables functionality on other devices like voice-command speakers, smartwatches, smartglasses and even kitchen appliances!
Marketers and UX designers alike spend time focusing on the consumer journey, which traces a task from its inception through the steps an individual follows to completion. Search plays a significant role in shortening the time it takes to complete a task and do something productive. Search functionality was initially conceived to find a list of documents containing word matches. Expectations have since evolved: keywords still play a significant role, but the search must provide the right content, to the right user, at the right time (Jones, 2021).
A good search is backed by a variety of features. Front-end concerns center on search relevance, discussed later. Back-end content ingestion systems are commonly powered by tags and synonyms. Modern analytical tools bring insight into user searches, and those insights guide the product team in refining the algorithm. User behavior can also be fed back into the system, providing even more metadata to search against.
Numerous tools are available to develop a custom search engine:
- Standard search engines are built on top of Apache Lucene, a Java library for search and indexing.
- Lucene-based engines include Elasticsearch, Solr and Lucidworks Fusion.
- Companies might surface internal data for analytics with tools like Splunk.
- Other vendors, such as SAP, Oracle and Microsoft SharePoint, provide out-of-box search capabilities.
Search engine optimization
In the early days of the web, a search might provide documents in order of how frequently a search term appeared on the page. With the internet expanding to a global reach, anyone anywhere could publish a website loaded with terms. Difficulties of scale and complexity aside, it is hard to determine which websites provide the most valuable content and should be shown to a user first. In steps Google. Google developed its PageRank algorithm after recognizing the fundamental problem of the internet: how can anyone trust a given website? The algorithm calculates the priority of a website from how many other sites link to it. Thus, results for websites that match the search terms are ranked roughly by how many other websites reference them (Turnbull & Berryman, 2016, p. 4). The algorithm is self-reinforcing: pages with a high PageRank lend credibility to the pages they link to.
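To make the self-reinforcing idea concrete, below is a minimal sketch of the power-iteration approach behind PageRank. The graph, damping factor and iteration count are illustrative choices, not Google's actual implementation.

```python
# Minimal power-iteration sketch of the PageRank idea: each page splits
# its rank among the pages it links to; iterate until values stabilize.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# "home" is linked to by every other page, so it earns the highest rank.
graph = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "press": ["home"],
}
print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))
```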
While much has changed since Google co-founders Brin and Page published their original paper, its insights still form the foundation on which modern search engines operate. In addition to PageRank, the primary attributes for a first-pass search are the anchor text (the words attached to a hyperlink) and the page's title. A webpage that is only an image is still searchable by the text of the links that point to it. Brin and Page also used font information to identify keywords (Brin & Page, 1998). It is essential to use proper HTML structure when defining web pages. Images, especially those that act as links, should have an alt attribute containing a few words. Crawlers naturally treat titles, headers, and bolded or italicized words with higher precedence. Following best practices for HTML structure aids web crawlers and reduces errors.
Content on the website is also a factor. The full text of a web page is stored in the search engine, so use concise language for greater impact. The number of hits on a particular keyword influences results, and word proximity is another important consideration. A clothing retailer that sells primarily to men will want the phrase "men's clothing" rather than just "clothing," as that is what is more likely to be searched.
Google provides a few recommendations:
- Submit a sitemap detailing all the pages on the website. Use the meta description HTML tag to include sentences that describe each page's content. A special file at the root of the website, robots.txt, lists pages that Google should not crawl.
- Search engines require unique URLs per piece of content. Make URLs simple and human readable as they will display in the search results.
- Consider a high-level topic navigation page when several pages have similar content and enable breadcrumb markup to help identify hierarchy (Search Engine Optimization (SEO) Starter Guide, n.d.). Remove duplicated content.
- Optimize the website's hierarchy for web crawlers. A web crawler likes to understand content grouping: similar pages are clustered, and a canonical (main) page is chosen.
Google now prioritizes mobile websites, indexing them first. If a site serves a separate mobile subdomain, the mobile version is the one indexed. The major search engines provide tools that take a URL and analyze the website, producing reports that identify potential problems. Correcting these problems optimizes the site for that particular vendor.
Picking a search engine
Many different implementations of search engines are available on the market, and specific implementations suit particular use cases. Elasticsearch is one of the leaders in the space; the popular ELK stack (Elasticsearch, Logstash and Kibana) provides a robust framework for data ingestion, storage and graphical display. Elasticsearch is a distributed search engine by default, which enables performance at scale. Solr is another widespread implementation, focused on text search. An additional standard tool is Splunk, built for capturing data for graphs, reports, alerts and dashboards. Elasticsearch spans the use cases of these two competitors: it provides text-based search but extends it with aggregations and time-series queries whose results can be displayed in Kibana dashboards. Elasticsearch is also available as a managed service on the big three public cloud providers. Elasticsearch and Solr are based on Apache Lucene, an open-source search library; others, like Splunk, are proprietary. This paper focuses on a Lucene-based approach since that provides complete control over the data.
Popular search engine trends
Image 1 shows search engine popularity trends as of April 2022 (DB-Engines, 2022).
Comparison of popular search engines
Table 1 highlights some key differences between Elasticsearch and the competitor Solr.
| | Solr | Elasticsearch |
| --- | --- | --- |
| Installation and Configuration | Easy to get running with supportive documentation | Easy to get running with supportive documentation; packages are available for various platforms |
| Searching and Indexing | Optimal for text search and big-data-enabled enterprise systems | Useful as both a text search and analytical engine because of aggregation capabilities |
| Scalability and Clustering | Cluster coordination with Apache ZooKeeper and SolrCloud | Better inherent scalability; design optimal for cloud deployments |
| Community | A historically large ecosystem | A thriving ecosystem for the free and open-source ELK stack |
| Documentation | Patchy; out of date | Well-documented |
Note: Adapted from Solr vs. Elasticsearch: Who's the Leading Open Source Search Engine by Asaf Yigal. Copyright 2020 by logz.io.
Table 2 compares several points between Elasticsearch and Splunk.
| | Splunk | Elasticsearch |
| --- | --- | --- |
| Loading Data | Forwarders are pre-configured for a variety of sources | Data is sent through Logstash; fields need defining |
| Visualizations | Add and edit dashboards, which can be unique per user; dashboard configuration with XML; mobile friendly | Line charts, area charts and tables through Kibana |
| Search Capabilities | Proprietary Search Processing Language (SPL); dynamic across fields | Lucene query syntax; fields need defining |
| Community | Customer and support platforms | Clear and extensive documentation |
| Learning Curve | Moderate, with expensive educational courses | Flat, with free material online |
| User Management | Role-based access, user management and auditing | Role-based access, user management and auditing |
| Pricing | High price tag | Open source with paid add-ons |
Note: Adapted from Splunk and the ELK Stack: A Side-by-Side Comparison by Asaf Yigal. Copyright 2017 by devops.com.
Implementing search
Search is about solving the relevancy problem. What is relevant is defined not only by the user's request for information but also by the business need. Search is the sales device of a website: it lets users explore content that benefits both the individual and the company. Search is a unique feature with no one-size-fits-all solution. A retailer might want to turn a profit, promote featured inventory, and sell overstocked items faster, all while delivering results close to the query. A lawyer looking for transcripts of a phone call in the middle of a courtroom has a different need than one exploring common case law in a state to build an argument.
Data pipeline and storage
Search starts its journey with data; an engine cannot find content without it. Significant time will therefore be spent gathering and preparing information. A typical software pattern is Extract-Transform-Load (ETL), but search pipelines differ slightly. There are four steps to preparing data: extraction, enrichment, analysis and indexing (Turnbull & Berryman, 2016, p. 25).
Extraction reads the data from the source. Enrichment cleans, augments and merges the data to enhance it; augmentation might mean performing sentiment analysis, running content through an automated tagging system or using machine learning to classify documents. Analysis breaks documents down into tokens, the normalized units a search runs against. Search tools typically come with pre-built analysis options: an English analyzer would lowercase all letters, remove plurals and possessives, reduce words to their root (running becomes run) and remove common words like "and" and "the." The resulting tokens are indexed, which updates the underlying data store. The heavy lifting comes in defining the enrichment layer, choosing the analysis steps and configuring the index definition. A search tool can only be as good as the back-end systems that support it.
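As a toy illustration of the analysis step, the sketch below applies lowercasing, possessive stripping, stop-word removal and a deliberately crude stemmer; production engines ship far more robust analyzers, such as Lucene's English analyzer.

```python
# Toy English-style analyzer: tokenize, lowercase, strip possessives,
# drop stop words, then apply a crude suffix stemmer.
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def analyze(text):
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [t.removesuffix("'s") for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for token in tokens:
        for suffix in ("ning", "ing", "s"):  # crude stand-in for a real stemmer
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed

print(analyze("The runner's shoes and the running track"))
# ['runner', 'shoe', 'run', 'track']
```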
Data is stored and processed in a term-centric fashion. This means that a document shaped like { firstName: "John", lastName: "Smith" } will naturally lose the context connecting "John" and "Smith"; terms are more important than the set of fields within a document (Turnbull & Berryman, 2016, p. 139). Mappings define an index, much like a table specification in SQL. Mappings contain field types and analyzer definitions, and an index can only specify its mappings before storing any documents. An index is spread across several shards; dividing an index into shards increases performance because smaller chunks of data can be processed in parallel, and multiple shards may live on a single node of a cluster. Optimize performance by balancing the number of nodes and shards against the data size. Stored data is larger than its source: Lucene indexes store the original document, a token lookup table (mapping unique tokens to documents) and a term dictionary (mapping tokens to ordinal numbers).
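A sketch of defining mappings and shard counts up front, assuming the elasticsearch-py 8.x client and a cluster on localhost; the index name and fields are illustrative.

```python
# Sketch: create an index whose mappings and shard count are fixed
# before any documents are stored (elasticsearch-py 8.x assumed).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="products",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
    mappings={
        "properties": {
            "title": {"type": "text", "analyzer": "english"},
            "description": {"type": "text", "analyzer": "english"},
            "price": {"type": "float"},
            "tags": {"type": "keyword"},
        }
    },
)

# Changing a field's type or analyzer later requires re-indexing.
es.index(index="products", document={
    "title": "Men's running shoes",
    "description": "Lightweight shoes for trail running",
    "price": 79.99,
    "tags": ["footwear", "men"],
})
```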
Scoring results to a query
The next step is to define the query structure to get data out. Keep in mind that the analyzer applied to documents is also applied to incoming query terms. Lucene generates a score for each document; a score is an indication of relevancy. The score is computed with the following function:
Lucene Practical Scoring Function
score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]
Note: Copyright 2000-2011 by Apache Software Foundation.
The score of a document d for a query q has the following factors:
- tf(t in d): Term frequency is the number of times term t appears in document d. Documents with a higher term frequency receive a higher score.
- idf(t)²: Inverse document frequency is based on the number of documents in which term t appears. The rarer a term, the greater the weight it contributes to the score.
- t.getBoost(): The search-time boost of a term t in the query q.
- norm(t,d): lengthNorm is computed when a document is stored in an index and represents the number of tokens in a field; shorter fields carry heavier weight. It also incorporates document-boost and field-boost values configured when adding documents to an index, both of which can be specified in the mappings.
- coord(q,d): The coordination factor is based on how many of the query's terms are found in the document. Documents that contain more of the query's terms receive a higher score.
- queryNorm(q): The normalization factor attempts to make scores comparable across different queries, rather than only within a single search.
The Lucene practical scoring function is calculated against each document to produce its score, and it can be altered; disabling queryNorm, for instance, frequently comes up in Lucene discussions (Turnbull & Berryman, 2016, p. 70). An executed query leverages this score calculation. The query is directed at specific fields within the index and uses a combination of MUST, SHOULD and MUST_NOT clauses. Typical queries apply filters to remove certain types of content, and boost values may be applied to particular clauses to promote their importance. Boost values can be additive or multiplicative to a document's score, can be positive or negative, and have floating-point precision. These toggles should be tuned to business needs to provide relevance.
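As a sketch of how those clauses and boosts combine in practice, the query below uses the standard Elasticsearch bool query DSL against the illustrative products index from the earlier sketch.

```python
# Sketch: a bool query mixing MUST / SHOULD / MUST_NOT clauses with a
# boost; the filter clause removes content without affecting scores.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="products",
    query={
        "bool": {
            "must": [
                {"match": {"title": {"query": "running shoes", "boost": 2.0}}}
            ],
            "should": [
                {"match": {"description": "trail"}}   # optional; raises score
            ],
            "must_not": [
                {"term": {"tags": "discontinued"}}    # hard exclusion
            ],
            "filter": [
                {"range": {"price": {"lte": 100}}}    # no score contribution
            ],
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```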
Relevance feedback
A good search knows how to interpret a query and return relevant documents. A great search provides a dialogue between the search results and the user, known as relevance feedback. Relevance feedback comprises several features that help users understand whether their query is getting them what they want. Search-as-you-type is often implemented as a dropdown of search hits based on what has been entered so far. Search completion attempts to complete the phrase the user is typing; completion can be accomplished by querying existing fields or by storing and analyzing past queries. Search suggestions can be used to correct spelling mistakes. Suggestions can either be offered to the user ("did you mean...") or automatically replace the user's query. Since the latter is somewhat risky, it is recommended to test the feature heavily or to replace the query only when it returns no results.
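One hedged sketch of these features in Elasticsearch: a phrase-prefix match approximates search-as-you-type, and the term suggester backs a "did you mean" prompt. Dedicated completion suggesters also exist; index and field names are the illustrative ones from earlier.

```python
# Sketch: search-as-you-type via a phrase-prefix match, and spelling
# suggestions via the term suggester.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

as_you_type = es.search(
    index="products",
    query={"match_phrase_prefix": {"title": "runni"}},
    size=5,
)

did_you_mean = es.search(
    index="products",
    suggest={
        "spelling": {
            "text": "runing shoes",
            "term": {"field": "title"},
        }
    },
    size=0,
)
# One entry per input token; each entry lists candidate corrections.
for option in did_you_mean["suggest"]["spelling"][0]["options"]:
    print("did you mean:", option["text"])
```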
The previous examples provide relevance feedback before the user commits to a search, but what happens when they need to browse first? Faceted browsing allows additional filters, on terms like size, color, price or category, to be applied after an initial query and can be accomplished with aggregations. Breadcrumb navigation further aids the user by visualizing which filters are applied, and allowing the user to choose an alternative results order, such as price, newest or highest rated, overrides the default relevance ranking.
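A sketch of faceted browsing using aggregations: each bucket's key and count become a clickable facet, and a selected facet is fed back as a filter on the next query. Index and field names remain the illustrative ones.

```python
# Sketch: compute facet counts alongside a query; selected facets are
# then applied as filters on follow-up searches.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="products",
    query={"match": {"title": "shoes"}},
    aggs={
        "by_tag": {"terms": {"field": "tags", "size": 10}},
        "price_bands": {
            "range": {
                "field": "price",
                "ranges": [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}],
            }
        },
    },
)
for bucket in response["aggregations"]["by_tag"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])   # e.g. footwear 12
```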
Users skim results to validate that their query provided the correct answers. The purpose of the search feature should drive what is displayed. Typical attributes are a title, an image and a short description. The results might also display text snippets where the search terms are highlighted in the document text.
Making the information easier to process reduces the time a user needs to assess whether the query helped. Consider grouping like content visually; field collapsing can combine similar documents while displaying only the top result (Turnbull & Berryman, 2016, pp. 206-230).
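A sketch of both ideas in Elasticsearch: highlighting returns snippets with matched terms wrapped in em tags, and collapse keeps only the top hit per group. The single-valued "family_id" keyword field assumed here is hypothetical and would need to exist in the mapping.

```python
# Sketch: highlighted snippets plus field collapsing on a hypothetical
# single-valued "family_id" keyword field that groups similar documents.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="products",
    query={"match": {"description": "trail shoes"}},
    highlight={"fields": {"description": {}}},  # matches wrapped in <em> tags
    collapse={"field": "family_id"},            # top hit per product family
)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"], hit["highlight"]["description"])
```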
Of course, not all of these features apply universally. A mobile user's experience will be much different; consider the increasing use of speech-to-text on mobile platforms and how it may produce different styles of queries compared to hands-on-keyboard input (Gudivada, Rao, & Paris, 2015).
Personalization
Advanced search can be personalized to the user. Personalization implies knowledge of the user and how content might relate to their preferences. There are two data analytic pipelines to consider:
- Concept search implies an understanding of how pieces of content relate to one another.
- Personalized search implies awareness of the individual user.
Concept search can precede personalized search and works well when there is little data on user interaction. User data can be gleaned from a profile page or from the user's searches and interactions with the website. It might be possible to learn that a user prefers affordable options on an eCommerce website; be careful when configuring such parameters and use a relative value scale, since a cheap t-shirt and a cheap television sit on separate price scales. Concept search can also augment a personalized search; as Google put it, search for things, not strings.
Collaborative filtering applies data analysis to determine what other items might interest users who engaged with a specific one. Several models exist for performing this calculation, but it can be as simple as counting interaction data. Develop two maps: item-to-item and user-to-item. Cross-reference the user-to-item map to identify recent items a user examined, then look up similar entries in the item-to-item table. During query execution, pass items the user has a high affinity for into a filter; be careful, as the list grows, performance can become problematic. Processing can also occur at index time by adding a "users_who_might_like" array field to documents, though the index size will grow in proportion to the number of users, so monitor it. A mixed solution adds a "related_items" array field to the index and passes the relevant item ids into the query filter.
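A minimal counting sketch of the two maps described above; real systems typically use matrix factorization or other trained models, and the interaction data here is invented for illustration.

```python
# Sketch: build user-to-item and item-to-item maps by counting how often
# two items co-occur in a user's interaction history.
from collections import defaultdict
from itertools import combinations

user_items = {                       # user-to-item interaction history
    "alice": ["shoes", "socks", "hat"],
    "bob": ["shoes", "socks"],
    "carol": ["hat", "scarf"],
}

co_counts = defaultdict(lambda: defaultdict(int))   # item-to-item counts
for items in user_items.values():
    for a, b in combinations(sorted(set(items)), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def related_items(item, top_n=3):
    ranked = sorted(co_counts[item].items(), key=lambda kv: -kv[1])
    return [name for name, _ in ranked[:top_n]]

# Items related to a user's recent views can be passed into a query filter.
print(related_items("shoes"))   # ['socks', 'hat']
```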
The metadata on a document can be enhanced to increase the searchability of tokens by adding synonyms and tags. It is common to start doing this manually, which is time-consuming. If synonyms and tags have a hierarchy, a curator can apply only the most specific element and the system will automatically add related items. Advanced setups define and apply a discrete set of tags through automation (Sevier, 2021). An ingestion system could identify unique and valuable tokens in a document by comparing it to related documents and adding cross-tokens that frequently appear. Collocation analysis can pull out common phrases that appear across many documents and store them in a dedicated field to make queries more impactful.
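A sketch of wiring synonyms into the analysis chain with a custom Elasticsearch analyzer; the synonym list and index name are illustrative.

```python
# Sketch: a custom analyzer that expands synonyms at analysis time, so
# a search for "trainers" also matches documents that say "sneakers".
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="products_v2",
    settings={
        "analysis": {
            "filter": {
                "product_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "sneakers, trainers, running shoes",
                        "tv => television",   # one-way mapping
                    ],
                }
            },
            "analyzer": {
                "english_with_synonyms": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "product_synonyms"],
                }
            },
        }
    },
    mappings={
        "properties": {"title": {"type": "text", "analyzer": "english_with_synonyms"}}
    },
)
```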
Building a search culture
The most challenging question to solve with a search algorithm is determining what is relevant. A greenfield application has the hardest time answering this question, with no historical data to look back on. Concentrate on four areas to start. First, interview stakeholders: developers, business and product owners, and the identified users the application targets. Second, as understanding of the target consumers grows, build out user personas. Third, understand what company problems the algorithm is trying to solve; an excellent search algorithm serves the needs of both the customer and the business. Finally, identify what information is required to judge whether the algorithm is effective and what is currently available.
Once the application has launched, review key performance indicators. Monitor the time users spend searching: leaving too quickly might mean frustration at not finding what they want, but leaving quickly by clicking items more likely indicates success. Determine the click-through rate and how frequently users view the details of an item. If the search is focused on eCommerce, what is the purchase conversion rate? The user retention rate expresses how satisfied users are with the app. Deep paging, how often a user goes past page one, indicates poor relevancy. Pogo-sticking, frequently going from a result back to the search page, shows the user is not finding the correct result. Thrashing, firing off several searches repeatedly, also indicates a problem (Turnbull & Berryman, 2016, pp. 253-254).
Improve the algorithm by pairing the content curator with developers and by testing and learning from user behavior. Have the curator work directly with search engineers; the curator best understands the need and can provide immediate feedback on changes. Automated tests build signals for the search engineer to act on. Assertion lists are explicitly defined search-and-ranking combinations; expecting an exact title match to rank first is a good example. Define a structure for judgment-list tests that grade a particular result and use it to provide an assessment. Behavioral data can reinforce the results. Remember thrashing? The repeated queries a user fires off while thrashing indicate searches similar to one another; store those queries so the algorithm can link similar queries to a particular result set, and combine the identified behavioral data into the tests (Turnbull & Berryman, 2016, pp. 259-276).
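A sketch of an assertion-list test runnable under pytest against a populated index; the queries and expected document ids are illustrative.

```python
# Sketch: assertion-list tests pinning the expected top result for a set
# of important queries; a relevance regression makes the suite fail.
import pytest
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

ASSERTIONS = [
    ("men's running shoes", "sku-1042"),   # exact title match should rank first
    ("lightweight trail shoes", "sku-2210"),
]

@pytest.mark.parametrize("query,expected_top_id", ASSERTIONS)
def test_top_result(query, expected_top_id):
    response = es.search(index="products", query={"match": {"title": query}}, size=1)
    hits = response["hits"]["hits"]
    assert hits and hits[0]["_id"] == expected_top_id
```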
As the data revolution continues, learning to rank has become a standard machine-learning technique with applications to search. Learning to rank trains a system with a data set of partially ordered items; algorithms then use the resulting model to rank new items. Scoring models can be based on vector space or grouped into pointwise (regression), pairwise (binary classification) or listwise (direct computation) methods (Casalegno, 2022). Learning-to-rank plugins are available for Elasticsearch and Apache Solr.
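A minimal pointwise sketch: regress editorial relevance grades onto query-document features, then rank candidates by predicted score. The features and grades are invented for illustration.

```python
# Sketch: pointwise learning to rank as a regression over
# query-document features, using scikit-learn.
from sklearn.linear_model import LinearRegression

# Features per query-document pair: [text score, click rate, freshness].
X_train = [[2.3, 0.12, 0.9], [0.4, 0.02, 0.1], [1.8, 0.30, 0.5], [0.2, 0.01, 0.8]]
y_train = [3, 0, 2, 1]               # editorial relevance grades

model = LinearRegression().fit(X_train, y_train)

candidates = {"doc_a": [1.5, 0.20, 0.4], "doc_b": [2.1, 0.05, 0.7]}
ranked = sorted(candidates, key=lambda d: model.predict([candidates[d]])[0], reverse=True)
print(ranked)                        # best-first ordering of candidates
```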
Future of search
As the search field continues to evolve, new trends are appearing. Data and analytics will continue to expand into all aspects of modern software, and integrating machine learning can help customize search experiences through personalization. Gartner has coined the term insight engine: an insight engine "appl[ies] relevancy methods to describe, discover, organize and analyze data" (Insight Engines Reviews and Ratings, n.d.). Insight engines differ from search engines in that they combine search with machine learning to provide information to users or data to machines (Dilmegani, 2022). They run more complex queries against richer indexes and datasets, gain greater relevancy methods through machine learning, and proactively search databases rather than executing only on demand. Finally, don't confuse search engines with recommendation systems.
Conclusion
Building an enterprise search feature is no small feat. Complex technical solutions power the search bar at the top of the application, leveraging a search engine that translates queries into an ordered result set of relevant documents. Search engines grapple with the complexity of human language, interpreting speech and text into machine-understandable form, a capability known as natural language processing. Developing the correct query takes time and must target both user and business needs, which requires collaboration between the content curator and the search engineer. The engine is only as powerful as the data that feeds it; building a pipeline that extracts data from the source, enriches it, analyzes it into tokens and indexes it into the data store takes time to create. Use key performance metrics to assess the quality of results and leverage behavioral data to improve relevancy. Advanced systems supported by learning-to-rank machine learning models further enhance the search engine. Developing a well-defined search system takes time and energy but provides enormous value to users, saving them time and energy while helping them complete essential tasks.
Key takeaways
- Elasticsearch and Apache Solr are known leaders in the search space.
- Search-specific ETL pipelines follow an extract-enrich-analyze-index model.
- Search engineers must understand the Apache Lucene practical scoring function.
- Relevance feedback provides a conversation between the user and the search.
- Use behavioral-based key performance indicators to assess the quality of results.
- Pair the content curator with the search engineer and build automated tests to ensure quality.
- Machine learning and learning to rank models can improve search results.
References
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Stanford, CA: Elsevier Science B.V.
Casalegno, F. (2022, February 28). Learning to Rank: A Complete Guide to Ranking using Machine Learning. Retrieved from Towards Data Science: https://towardsdatascience.com/learning-to-rank-a-complete-guide-to-ranking-using-machine-learning-4c9688d370d4
Class Similarity. (n.d.). Retrieved from Lucene 3.5.0 Core API: https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/search/Similarity.html
Clement, J. (2022, January). Percentage of mobile device website traffic worldwide from 1st quarter 2015 to 4th quarter 2021. Retrieved from Statista: https://www.statista.com/statistics/277125/share-of-website-traffic-coming-from-mobile-devices/
DB-Engines Ranking - Trend of Search Engines Popularity. (2022, April). Retrieved from DB-Engines: https://db-engines.com/en/ranking_trend/search+engine
Dilmegani, C. (2022, April 4). Insight Engines: How it works, Why it matters & Use Cases [2022]. Retrieved from AI Multiple: https://research.aimultiple.com/insight-engine/
Gudivada, V. N., Rao, D., & Paris, J. (2015). Understanding Search Engine Optimization. IEEE Computer Society.
Insight Engines Reviews and Ratings. (n.d.). Retrieved from Gartner: https://www.gartner.com/reviews/market/insight-engines
Jones, R. (2021, February 28). Why Search – and SEO – Is Important. Retrieved from Search Engine Journal: https://www.searchenginejournal.com/seo-guide/why-is-search-important/
Mohsin, M. (2022, January 2). 10 Google Search Statistics You Need to Know in 2022 [Infographic]. Retrieved from Oberlo: https://www.oberlo.com/blog/google-search-statistics
Search Engine Optimization (SEO) Starter Guide. (n.d.). Retrieved from Google Search Central: https://developers.google.com/search/docs/beginner/seo-starter-guide?hl=en%2F&visit_id=637864107782188247-1628640588&rd=1
Sevier, R. (2021, April 21). Enterprise Tag Standard. Retrieved from Harvard University Enterprise Architecture: https://enterprisearchitecture.harvard.edu/enterprise-tags
Southern, M. G. (2019, January 29). 81% of People Think Less of a Business if its Website is Outdated. Retrieved from Search Engine Journal: https://www.searchenginejournal.com/81-of-people-think-less-of-a-business-if-its-website-is-outdated/290283
Turnbull, D., & Berryman, J. (2016). Relevant Search. Shelter Island: Manning Publications Co.
Yigal, A. (2017, June 27). Splunk and the ELK Stack: A Side-by-Side Comparison. Retrieved from devops.com: https://devops.com/splunk-elk-stack-side-side-comparison/
Yigal, A. (2020, July 19). Solr vs. Elasticsearch: Who's The Leading Open Source Search Engine? Retrieved from logz.io: https://logz.io/blog/solr-vs-elasticsearch/