We recently released an update to Sifter that included search. Despite the deceptively simple name, "search" is actually a fairly complex feature, so I wanted to run through everything involved in getting it up and running. This was written both to help people who are completely new to search and to give a behind-the-scenes look to those who already have a deep understanding. As a result, parts of this will be very basic, some parts will be fairly technical, and other parts will even talk about some of the interface decisions.
I'm not planning on going into the decision-making process of choosing Sphinx and Thinking Sphinx for our search implementation, but I will say that it was carefully considered and researched. There are advantages and disadvantages to every option, and everyone has different needs. This combination just happens to be the one that works best for what we wanted to do.
Before I get too carried away talking about the technical bits, it's worth looking at the big picture of how everything fits together. We chose to use Sphinx and Thinking Sphinx for Sifter. I'm going to gloss over some of the finer points in favor of keeping this simple.
Sphinx, by Andrew Aksyonoff, is a full-text search engine. It includes both indexer and searchd which, unsurprisingly, handle the index creation and search requests respectively.
Thinking Sphinx is a Ruby library by Pat Allan that plays the go-between for ActiveRecord and Sphinx. It also includes rake tasks for configuring, indexing, and starting/stopping the search daemon.
There are several elements that come together so that everything works. I'll be focusing exclusively on our specific setup. We're relying on delta indexing, which is optional. Since a setup with delta indexing is more complex than one without it, what's covered here is a superset of what a simpler setup would need.
Adding search to a web application involves several moving parts that require some external coordination.
/config/sphinx.yml). Running the thinking_sphinx:configure rake task will generate the actual Sphinx configuration file for you based on the settings in your models and your Sphinx YAML file.
With all of the moving pieces, Sphinx isn't trivial to set up, but it's really not too bad either. All of these steps are well-documented elsewhere, so instead of reinventing the wheel, I've pulled together the resources I found most helpful at each stage of setting things up for Sifter so that it's easier to get through the process.
Installation was straightforward, and both Sphinx and Thinking Sphinx have fantastic installation instructions.
For configuration, there are extensive details on both Sphinx Configuration and Thinking Sphinx Configuration. The latter explains how to configure Sphinx through the sphinx.yml file.
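To give a feel for the shape of that file, here's a rough sketch of a sphinx.yml keyed by Rails environment, loaded with Ruby's standard YAML library. The keys shown (port, pid_file, searchd_file_path, config_file) are ones I understand Thinking Sphinx to read, but treat the whole thing as an assumption and check the Thinking Sphinx configuration docs for your version. The paths are placeholders rather than Sifter's real values, and the sphinx_settings helper is made up purely for illustration.

```ruby
require "yaml"

# Hypothetical contents of config/sphinx.yml, keyed by Rails environment.
# Paths are placeholders; replace anything in all caps with your own values.
SPHINX_YML = <<~YAML
  development:
    port: 3312
  production:
    port: 3312
    pid_file: /YOUR_PATH_TO/searchd.pid
    searchd_file_path: /YOUR_SHARED_PATH/db/sphinx
    config_file: /YOUR_CURRENT_PATH_TO_RAILS_APP/config/production.sphinx.conf
YAML

# Illustrative helper (not part of Thinking Sphinx): return the settings
# for one environment, or an empty hash if that environment is missing.
def sphinx_settings(yaml_text, environment)
  YAML.load(yaml_text).fetch(environment, {})
end

settings = sphinx_settings(SPHINX_YML, "production")
puts "searchd port: #{settings['port']}"      # prints "searchd port: 3312"
puts "pid file:     #{settings['pid_file']}"  # prints "pid file:     /YOUR_PATH_TO/searchd.pid"
```

Whatever values you settle on, the port and pid file here need to agree with whatever is watching the daemon, since that's how it decides whether searchd is alive.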
The Thinking Sphinx usage guide covers most of the basics on how to set up your models. The most significant aspect of this configuration for us was enabling delta indexing, which takes advantage of Rails callbacks to automatically update the delta index after a relevant record is saved or updated. While we went with standard delta indexing, several additional options for delta indexes are available.
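As a minimal sketch of what that model setup can look like, here's an index definition with standard delta indexing turned on, using the define_index syntax from the Thinking Sphinx usage guide. The Issue model and its columns are hypothetical and only there for illustration; they aren't Sifter's actual schema. This is meant to live inside a Rails app with Thinking Sphinx installed, so it won't run on its own.

```ruby
# app/models/issue.rb (hypothetical model and column names)
class Issue < ActiveRecord::Base
  define_index do
    # Fields: the text that Sphinx indexes for full-text search
    indexes subject
    indexes description

    # Attributes: values used for filtering and sorting results
    has project_id, created_at

    # Enable standard delta indexing for this model
    set_property :delta => true
  end
end

# Delta indexing relies on a boolean "delta" column on the table,
# so a migration along these lines is needed as well.
class AddDeltaToIssues < ActiveRecord::Migration
  def self.up
    add_column :issues, :delta, :boolean, :default => true, :null => false
  end

  def self.down
    remove_column :issues, :delta
  end
end
```

With something like this in place, and after rebuilding the index, a call such as Issue.search "login bug" would go through Sphinx rather than the database.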
From time to time, the search daemon will die, so keeping an eye on it is important. Adding a section to our Monit configuration file for the search daemon was fairly straightforward. I've included the relevant section of our Monit config file below. Make sure to replace anything in all caps with the values for your system.
check process searchd with pidfile /YOUR_PATH_TO/searchd.pid
start program = "/YOUR_PATH_TO/searchd --config /YOUR_CURRENT_PATH_TO_RAILS_APP/config/ENVIRONMENT.sphinx.conf" as uid USER and gid GROUP
stop program = "/YOUR_PATH_TO/searchd --stop --config /YOUR_CURRENT_PATH_TO_RAILS_APP/config/ENVIRONMENT.sphinx.conf" as uid USER and gid GROUP
if failed host localhost port 3312 then restart
if 3 restarts within 5 cycles then timeout
Even if you're using delta indexing, you'll still want to regularly reindex because the larger the delta gets, the slower your saves and updates will become. Cron is the natural choice here. You'll just need to set up a cron task to run the rake thinking_sphinx:index task on a regular basis and you'll be all set. The indexer is incredibly fast, so on a small site you could probably get away with running it every 5 minutes or so without any problems. However, with delta indexing turned on, frequency is less important, so we're running it once per hour.
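For reference, here's a small sketch that builds the kind of crontab line an hourly reindex would need. The reindex_cron_line helper is purely illustrative, and the app path and environment are placeholders. It also assumes rake is on cron's PATH, which often isn't the case, so you may need the full path to rake on your system.

```ruby
# Illustrative helper: build a crontab entry that runs a full reindex
# once per hour, at the given minute past the hour.
def reindex_cron_line(app_path:, environment:, minute: 0)
  "#{minute} * * * * cd #{app_path} && RAILS_ENV=#{environment} rake thinking_sphinx:index"
end

puts reindex_cron_line(app_path: "/YOUR_CURRENT_PATH_TO_RAILS_APP", environment: "production")
# prints "0 * * * * cd /YOUR_CURRENT_PATH_TO_RAILS_APP && RAILS_ENV=production rake thinking_sphinx:index"
```

The printed line is what would go into the crontab of the user that owns the index files.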
While it wouldn't be the end of the world to reindex with every deploy, I wanted to avoid having to rebuild it every time. Instead, we're following in Wade's footsteps and keeping the index in our shared path and then creating the symlink to it on each deploy.
Wade mentions in his post that Thinking Sphinx generates the actual Sphinx config file from your models and sphinx.yml using rake thinking_sphinx:configure. The resulting file is used when the Sphinx search daemon is started, so with each deploy you'll want to rebuild it to ensure that any relevant changes in your models or sphinx.yml make it into the file that Sphinx uses. Just after the symlink and before we restart the Sphinx search daemon, we run rake thinking_sphinx:configure so that the daemon starts up with the new settings.
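Here's a rough sketch of how those deploy steps could be wired up. It assumes Capistrano 2, which this post doesn't actually name, and it assumes the index lives in db/sphinx under the shared path. The task names are hypothetical, the environment is hard-coded to production for brevity, and where the hooks attach is a guess at a sensible order rather than our exact recipe.

```ruby
# config/deploy.rb (Capistrano 2 syntax; task names are hypothetical)
namespace :sphinx do
  desc "Link the shared Sphinx index directory into the new release"
  task :symlink_indexes, :roles => :app do
    run "ln -nfs #{shared_path}/db/sphinx #{release_path}/db/sphinx"
  end

  desc "Regenerate the Sphinx config from the models and sphinx.yml"
  task :configure, :roles => :app do
    run "cd #{release_path} && rake thinking_sphinx:configure RAILS_ENV=production"
  end

  desc "Restart searchd so it picks up the regenerated config"
  task :restart, :roles => :app do
    run "cd #{current_path} && rake thinking_sphinx:stop RAILS_ENV=production"
    run "cd #{current_path} && rake thinking_sphinx:start RAILS_ENV=production"
  end
end

# Order matters: symlink first, then configure, then restart the daemon.
after "deploy:update_code", "sphinx:symlink_indexes"
after "sphinx:symlink_indexes", "sphinx:configure"
after "deploy:restart", "sphinx:restart"
```

Because the index directory is only symlinked, the existing index survives the deploy and nothing has to be rebuilt until the next scheduled reindex.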
I would have never made it this far without the following resources, and of course many thanks go to Andrew Aksyonoff and Pat Allan for providing such wonderful tools to the community.
Comments
I'm a huge fan of your blog here, so I was excited to see this article in my RSS reader. I too set up Sphinx and ThinkingSphinx but was sidelined. I'm not an expert on this, but with Sphinx you need the search daemon to run as another process on your web server, right? In my case I'm hosted through mediatemple and this doesn't appear to be supported.
Perhaps I missed this in a previous post but how did you get around this? What host are you going through that supports this? It's fairly important to the app I'm working on and at the moment I've added logic to disable it in production until we can support it.
Keep up the great articles!
We have a dedicated server for Sifter, so we have complete control of what we run. Assuming you're running on a grid container, I'm not surprised that it wouldn't work on Media Temple. Their setup is pretty simple and boxed in.
If you want to develop an app, I strongly suggest getting dedicated hosting. Another great option if you're not ready for dedicated hosting is Slicehost. The upside is that you have full control of your slice, but you're responsible for handling all of the configuration.