Over the last few months, I have spent a lot of time thinking about, talking about, and building a search engine. The search engine was first designed for my blog and then I expanded it into IndieWeb Search, a search engine for the IndieWeb community. Working on this project has been interesting but also challenging. Search involves solving many different problems, even if your intention is only to build a search engine for your own website.
I wanted to write a short post outlining some of the things I have learned so far about building a search engine. I have decided to write this post in the form of a list rather than a more in-depth essay as I want to provide a high-level guide to some things you should keep in mind when building a search engine. If any of the points below interest you, let me know and I might write about them more in the future.
Without further ado, here are a few things you should keep in mind if you decide to build a search engine, particularly one that indexes multiple sites that you might not have built yourself:
Not all websites are marked up in the way that you would like. But your job is to build a crawler that understands this and can still do a good job.
You will never be able to predict all the ways in which a site might present itself before you actually start crawling sites.
Crawl different sites while testing your crawler. Testing multiple sites will help you find lots of bugs. You'll also get a feel for how to retrieve the information you need from each page. For instance, I used to just index meta description tags for my personal search engine. When I started building IndieWeb Search, I realised there was so much more to extracting meta descriptions (e.g. some pages don't have them, others only provide og:description).
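As a minimal sketch of that fallback logic using BeautifulSoup (the helper name here is mine, not code from IndieWeb Search):

```python
from bs4 import BeautifulSoup

def get_description(html):
    """Try a standard meta description first, then fall back to
    og:description; return None if neither is present."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "description"})
    if tag is None:
        tag = soup.find("meta", attrs={"property": "og:description"})
    if tag and tag.get("content"):
        return tag["content"].strip()
    return None

# A page that only sets og:description still yields a result.
page = '<html><head><meta property="og:description" content="A post."></head></html>'
print(get_description(page))  # A post.
```

A real crawler would likely have more fallbacks still (for example, the first paragraph of the page), but the shape stays the same: an ordered list of places to look.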
Concurrency will help you index content much faster. I'd recommend starting with a standard program but then implementing concurrency if you want to scale up your search engine.
Don't use multiple threads on the same site without some kind of queries per second rate limit. You don't want to accidentally break someone's site by sending hundreds or thousands of requests within the space of a minute or two.
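One way to combine the two points above, sketched with concurrent.futures and a simple per-domain delay (the structure and names are illustrative, and `len` stands in for a real HTTP fetch such as `requests.get`):

```python
import time
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

MIN_DELAY = 1.0  # at most ~1 request per second per domain

# One lock and last-request timestamp per domain, so threads
# crawling the same site wait their turn instead of hammering it.
domain_locks = defaultdict(threading.Lock)
last_request = defaultdict(float)

def polite_fetch(url, fetch):
    """Enforce a per-domain delay, then call fetch(url)."""
    domain = urlparse(url).netloc
    with domain_locks[domain]:
        wait = MIN_DELAY - (time.monotonic() - last_request[domain])
        if wait > 0:
            time.sleep(wait)
        last_request[domain] = time.monotonic()
    return fetch(url)

urls = ["https://example.com/a", "https://example.com/b"]
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda u: polite_fetch(u, len), urls))
```

With this shape, many workers can crawl different sites in parallel, but requests to the same domain are spaced at least MIN_DELAY apart.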
Don't reinvent the wheel for everything. I rely on libraries like BeautifulSoup and mf2py for HTML parsing and microformats parsing, respectively. These libraries are fast and help me do tasks that would otherwise be very hard to do.
Start small. Search engines can get really complicated, quickly.
Links are useful in ranking if your search engine indexes multiple sites. I noticed a major bump in the quality of search results on IndieWeb Search when I started taking into account the number of incoming links pointing to a web page. (NB: Adding links helped me address "name" queries like "jamesg.blog" or "Aaron Parecki" so that they return links that are more relevant to the site. Before using links as a ranking factor, relevance was all over the place for these sorts of queries, especially if a site used their name in many title tags and/or headings.)
Relevance is a hard problem to solve. Links help build relevance but ultimately you'll need to experiment with ranking weights and systems. I recommend Elasticsearch if you don't want to learn or worry too much about low-level search relevance algorithms. IndieWeb Search uses Elasticsearch. It's fast, robust, and is easily extensible to meet our needs.
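To give a sense of what "not worrying about low-level relevance" looks like in practice, here is the shape of a query you might send to Elasticsearch (the index and field names are made up for illustration, not IndieWeb Search's actual schema):

```python
# Field boosts like "title^2" let you tune ranking declaratively
# instead of writing a scoring algorithm yourself.
query = {
    "query": {
        "multi_match": {
            "query": "webmention",
            "fields": ["title^2", "description", "body"],
        }
    },
    "size": 10,
}

# With the official Python client this would be sent as, e.g.:
#   results = es.search(index="pages", body=query)
```

Elasticsearch handles tokenisation, scoring (BM25 by default), and result ordering for you; your job reduces to choosing fields and weights.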
SQLite is not great for building a search engine that indexes multiple sites. Concurrency is really hard. Performance can be an issue if you index a lot of content. And your database file can get very large, which may make it slow on some cloud application platforms.
Building a search engine that indexes content for your own site is a different problem to indexing multiple sites. Building a search engine just for my blog helped me dip my toe into search. But there were growing pains while porting my search engine over to one that could index multiple sites (e.g. implementing concurrency, using Elasticsearch instead of SQLite, adjusting my crawler to handle edge cases).
Python is a great language to use if you want to build a web crawler. (NB: This is subjective and biased because I generally use Python for scripting. But, there are so many libraries in Python that can help you with search. BeautifulSoup is great for parsing HTML. concurrent.futures is great for concurrency.)
Above are a few notes on designing a search engine. I am by no means an expert on search but I wanted to share some things that have been useful to me in building a search engine for my blog and, eventually, IndieWeb Search. I hope that some of the notes above prove useful to anyone who wants to build a search engine. Search is difficult—really difficult—but there are plenty of resources out there that can help you. I have open-sourced the code for IndieWeb Search so anyone interested can take a peek behind the scenes to see how everything works.