I’m developing a high-volume web application, where part of it is a MySQL database of discussion posts that will need to grow to 20M+ rows, smoothly.
I was originally planning on using MyISAM for the tables (for the built-in fulltext search capabilities), but the thought of the entire table being locked due to a single write operation makes me shutter. Row-level locks make so much more sense (not to mention InnoDB’s other speed advantages when dealing with huge tables). So, for this reason, I’m pretty determined to use InnoDB.
The problem is… InnoDB doesn’t have built-in fulltext search capabilities.
Should I go with a third-party search system? Like Lucene(c++) / Sphinx? Do any of you database ninjas have any suggestions/guidance? LinkedIn’s zoie (based off Lucene) looks like the best option at the moment… having been built around realtime capabilities (which is pretty critical for my application.) I’m a little hesitant to commit yet without some insight…
(FYI: going to be on EC2 with high-memory rigs, using PHP to serve the frontend)
I can vouch for MyISAM fulltext being a bad option - even leaving aside the various problems with MyISAM tables in general, I’ve seen the fulltext stuff go off the rails and start corrupting itself and crashing MySQL regularly.
A dedicated search engine is definitely going to be the most flexible option here - store the post data in MySQL/innodb, and then export the text to your search engine. You can set up a periodic full index build/publish pretty easily, and add real-time index updates if you feel the need and want to spend the time.
Lucene and Sphinx are good options, as is Xapian, which is nice and lightweight. If you go the Lucene route don’t assume that Clucene will better, even if you’d prefer not to wrestle with Java, although I’m not really qualified to discuss the pros and cons of either.
Along with the general phasing out of MyISAM, InnoDB full-text search (FTS) is finally available in MySQL 5.6.4 release.
These indexes are physically represented as entire InnoDB tables, which are acted upon by SQL keywords such as the FULLTEXT clause of the CREATE INDEX statement, the MATCH() … AGAINST syntax in a SELECT statement, and the OPTIMIZE TABLE statement.
While other engines have lots of different features, this one is InnoDB, so it’s native (which means there’s an upgrade path), and that makes it a worthwhile option.
You should spend an hour and go through installation and test-drive of Sphinx and Lucene. See if either meets your needs, with respect to data updates.
One of the things that disappointed me about Sphinx is that it doesn’t support incremental inserts very well. That is, it’s very expensive to reindex after an insert, so expensive that their recommended solution is to split your data into older, unchanging rows and newer, volatile rows. So every search your app does would have to search twice: once on the larger index for old rows and also on the smaller index for recent rows. If that doesn’t integrate with your usage patterns, this Sphinx is not a good solution (at least not in its current implementation).
I’d like to point out another possible solution you could consider: Google Custom Search. If you can apply some SEO to your web application, then outsource the indexing and search function to Google, and embed a Google search textfield into your site. It could be the most economical and scalable way to make your site searchable.