Text Search Administration

DataCaster provides a Lucene-based implementation of Text Search that can be enabled for any table in the database. Apache Lucene is a widely used open-source search library (see http://lucene.apache.org). DataCaster also supports other search implementations, which are added by implementing an internal search interface and configuring the implementation in the search.xml file in the conf/db configuration directory.

The Lucene-based implementation of search provided in DataCore supports adding search capability to one or more columns of a table in the database. Once search is enabled, every insert, update, and delete on the table causes the search columns to be indexed for later text search queries.

Applications can then use the API or SQL to search the table. Many of the search features provided by Lucene and its search query parser can be used to search specific fields.

Implementation Notes for Lucene Integration

The following notes and tips apply to Lucene search in DataCaster.

Lock File

Lucene creates a lock file in the Tomcat/temp directory. If the server is killed abruptly, this file may not be deleted on database shutdown (it should now be deleted on a clean shutdown). Make sure the file is deleted before restarting the server; otherwise search indexing will fail. There is likely to be one lock per index instance, i.e. one per table with search enabled.
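The cleanup step above can be scripted. The following is a minimal sketch, not part of DataCaster: the class name StaleLuceneLockCleaner and the file name pattern lucene-*.lock are assumptions (check the names of the files actually left in your temp directory first), and it must only be run while the server is stopped.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class StaleLuceneLockCleaner {

    // Assumed naming pattern for Lucene lock files; verify it against the
    // files actually present in your temp directory before relying on it.
    static final String LOCK_GLOB = "lucene-*.lock";

    /** Deletes regular files matching LOCK_GLOB in tempDir and returns the deleted paths. */
    public static List<Path> deleteStaleLocks(Path tempDir) throws IOException {
        // Collect the candidates first, then delete, so the directory is not
        // modified while it is being listed.
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(tempDir, LOCK_GLOB)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    candidates.add(p);
                }
            }
        }
        List<Path> deleted = new ArrayList<>();
        for (Path p : candidates) {
            Files.delete(p);
            deleted.add(p);
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        // The directory is passed as an argument, e.g. the Tomcat temp directory.
        Path tempDir = Paths.get(args.length > 0 ? args[0] : "temp");
        for (Path p : deleteStaleLocks(tempDir)) {
            System.out.println("Deleted stale lock: " + p);
        }
    }
}
```

Because there is likely one lock per search-enabled table, the sketch removes every matching file rather than a single named one.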

Performance Issues

With the initial parameters we used, insert performance takes a big hit with Lucene and seems to deteriorate pretty quickly. We are still looking at ways to improve performance, primarily by tuning parameters and invoking optimization at the right times.

One of our big issues at the start was using optimize too frequently. Using a value of 100,000 for the optimize_count seems to work much better. We now see 220 inserts/sec on the Article table (versus 450/sec without text search).

One remaining issue is that optimize takes 30 secs with just 200K articles, so in sync mode the one unlucky insert that triggers it will be quite slow.
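To make the trade-off concrete, the sketch below models a change counter that signals an optimize every optimize_count changes and estimates how often that happens at a steady insert rate. It is an illustration only: the class OptimizeTrigger and its methods are hypothetical and do not show how DataCaster actually counts changes.

```java
import java.util.concurrent.atomic.AtomicLong;

public class OptimizeTrigger {

    private final long optimizeCount;
    private final AtomicLong changes = new AtomicLong();

    public OptimizeTrigger(long optimizeCount) {
        if (optimizeCount <= 0) {
            throw new IllegalArgumentException("optimizeCount must be positive");
        }
        this.optimizeCount = optimizeCount;
    }

    /** Records one change; returns true when this change should trigger an optimize. */
    public boolean recordChange() {
        return changes.incrementAndGet() % optimizeCount == 0;
    }

    /** Rough number of seconds between optimizes at a steady insert rate. */
    public static double secondsBetweenOptimizes(long optimizeCount, double insertsPerSecond) {
        return optimizeCount / insertsPerSecond;
    }

    public static void main(String[] args) {
        OptimizeTrigger trigger = new OptimizeTrigger(100_000);
        long optimizes = 0;
        for (int i = 0; i < 250_000; i++) {
            if (trigger.recordChange()) {
                optimizes++;
            }
        }
        // Changes number 100,000 and 200,000 trigger an optimize.
        System.out.println("optimizes: " + optimizes);
        // With the figures quoted above: 100,000 changes at 220 inserts/sec.
        System.out.printf("seconds between optimizes: %.0f%n",
                secondsBetweenOptimizes(100_000, 220));
    }
}
```

With the figures quoted above, an optimize would be due roughly every 455 seconds of sustained inserting, and in sync mode the insert that crosses the threshold absorbs the full optimize time.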

If the merge factor and/or the max merge docs value is increased, you are liable to see "too many file descriptors" errors, which will seriously affect the database and probably crash it. If you insist on using higher values, you must raise the file descriptor limit with ulimit. See the link below.

A link for some information on Lucene performance:

http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html

Optimize Tool

A tool is provided to optimize a Lucene index: OptimizeLuceneIndex, in the com.applibase.db.tools package. It allows the database to run without optimizing and to perform an optimize only when needed. The same function is also available in WebAdmin.

Note: Running for too long without an optimize causes "too many open files" errors. At least, that seemed to be the case when migrating the database without optimizing. An occasional optimize is therefore needed.

  • Reindexing of a table: Currently text is indexed only when new data comes in for a table. Reindexing a table on demand is not supported; for now, drop search from the table and add it again. When search is added to a table that already contains tuples, all existing tuples are indexed.

  • OptimizeLuceneIndex can be run from WebAdmin or with the command line tool.

Configuring Search Implementation Instances

Text search support in DataCore is designed to be a pluggable module, allowing different implementations to be introduced if required. Each search-enabled table uses its own implementation instance, so multiple types of search implementation may be used in DataCore at the same time. Only one implementation is provided with DataCore, and users may build other implementations if required.

However, most will find the Lucene implementation sufficient for their needs, given that there is some flexibility in using different flavors of the Lucene implementation, e.g. using different Analyzers, stop words, and other parameters in each instance of Lucene used for a table. This is controlled via the search.xml configuration file in the conf/db configuration directory.

DataCore provides an interface to access search, while the work of indexing and searching is delegated to other implementations using the com.applibase.db.search.TextSearch interface and the implementations of that interface. The search.xml conf file specifies the list of configured implementations, each of which needs a TextSearch instance to implement the text search capability. Currently, only one implementation of TextSearch is provided, a Lucene-based implementation, although variants of this can be configured for individual tables using the properties below.

One can create different flavors of the Lucene implementation using the parameters in search.xml. Two required parameters must be provided to identify and find the implementation.

  1. type: The name used to specify the text search implementation.
  2. SearcherClass: The class name of the server text search implementation class to be used, which must have a constructor with no arguments, and support the com.applibase.db.search.TextSearch interface.

By using different type names, the same underlying implementation can provide effectively different implementations, distinguished by the parameter values given to each. For example, we often configure a synchronous implementation with type name lucene-sync and an asynchronous implementation with type name lucene-async, so that both versions can be used at the same time for different tables.
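The no-argument constructor requirement on SearcherClass suggests the class is instantiated by name. The sketch below shows how such loading typically works in Java. It is an assumption about the mechanism, not DataCore's actual code: the nested TextSearch interface is an empty stand-in for com.applibase.db.search.TextSearch (whose methods are not listed here), and SearcherLoader and ExampleSearch are hypothetical names.

```java
public class SearcherLoader {

    /** Empty stand-in for com.applibase.db.search.TextSearch (hypothetical). */
    public interface TextSearch { }

    /** Example implementation with the required public no-argument constructor. */
    public static class ExampleSearch implements TextSearch {
        public ExampleSearch() { }
    }

    /** Loads the named class and creates an instance through its no-argument constructor. */
    public static TextSearch load(String searcherClass) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(searcherClass);
        if (!TextSearch.class.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(searcherClass + " does not implement TextSearch");
        }
        // Throws NoSuchMethodException if the class has no no-argument constructor.
        return (TextSearch) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        TextSearch search = load(ExampleSearch.class.getName());
        System.out.println("Loaded " + search.getClass().getName());
    }
}
```

A class that lacks a no-argument constructor, or that does not implement the interface, fails at load time rather than at the first search.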

The following are the parameters allowed in search.xml for the Lucene implementation.

  1. synchronous : Permitted values are true or false. Determines whether indexing runs synchronously with table updates, or may run asynchronously and be delayed a little in exchange for not holding up table operations.
  2. analyzer : The class name of any valid Lucene Analyzer implementation. Allows plugging in any of the available analyzers or a custom one.
  3. stop_words : A space-separated list of stop words. Used to set the stop words if the analyzer is StandardAnalyzer or StopAnalyzer.
  4. maxFieldLength : Integer value. The maximum number of terms that will be indexed for a single field.
  5. mergeFactor : Integer value. Determines how often segment indices are merged by addDocument().
  6. minMergeDocs : Integer value. Determines the minimum number of docs required before in-memory docs are merged.
  7. maxMergeDocs : Integer value. Determines the largest number of documents ever merged by addDocument() in Lucene.
  8. optimize_count : Integer value. The number of changes that triggers an optimize operation on the index writer.
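As an illustration of the value formats listed above, the following sketch validates a few parameter values after they have been read as strings. It does not reflect how DataCore parses search.xml; the class LuceneParamCheck is hypothetical, and the rule that integer parameters must be positive is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LuceneParamCheck {

    /** synchronous: only the literal values true and false are permitted. */
    public static boolean parseSynchronous(String value) {
        if ("true".equals(value)) {
            return true;
        }
        if ("false".equals(value)) {
            return false;
        }
        throw new IllegalArgumentException("synchronous must be true or false: " + value);
    }

    /** stop_words: splits the space-separated list into individual words. */
    public static List<String> parseStopWords(String value) {
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Integer parameters such as mergeFactor or optimize_count (positivity is assumed). */
    public static int parsePositiveInt(String name, String value) {
        int n = Integer.parseInt(value.trim());
        if (n <= 0) {
            throw new IllegalArgumentException(name + " must be positive: " + value);
        }
        return n;
    }

    public static void main(String[] args) {
        // Example values only; "a an the" is an illustrative stop word list.
        System.out.println(parseSynchronous("true"));
        System.out.println(parseStopWords("a an the"));
        System.out.println(parsePositiveInt("optimize_count", "100000"));
    }
}
```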

Custom implementations determine the properties for each implementation, and may support as many properties as required.

Enabling Text Search for a Table

The first step in getting text search capability is to determine the table and columns that need to be searched. DataCore search is designed to use an independent search implementation instance for each table to be searched. Each such table needs to have text search enabled for the desired columns.

Text search is enabled after a table is created, using either SQL or the API to set up search for the table.

SQL Statements

The syntax for an SQL statement to enable search for a table is

SET TEXTSEARCH 'implementation_type' TABLE COLUMN [BOOST_FACTOR] ( COLUMN [BOOST_FACTOR] )*

where implementation_type is a string that must match one of the implementation type names configured in the search.xml configuration file described earlier. Many of the options for a specific search instance are set up in search.xml rather than in the SQL statement.

With the SQL statement, only the table to be search-enabled, the specific columns in the table to be searched, and an optional boost factor for each column can be specified. The boost factors are integer or floating point literals that determine the weighting of the column used by the search implementation in determining search results.
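As a concrete reading of the syntax above, the sketch below assembles a SET TEXTSEARCH statement from its parts. It is an illustration under stated assumptions: the class SetTextSearchBuilder is hypothetical, the column names title and body are invented for the example, and columns are separated by whitespace only because the grammar above shows no other separator.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SetTextSearchBuilder {

    /**
     * Builds a SET TEXTSEARCH statement. Map order is the column order;
     * a null boost means the optional BOOST_FACTOR is omitted for that column.
     */
    public static String build(String implementationType, String table,
                               Map<String, Double> columnBoosts) {
        if (columnBoosts.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        StringBuilder sql = new StringBuilder("SET TEXTSEARCH '")
                .append(implementationType).append("' ").append(table);
        for (Map.Entry<String, Double> entry : columnBoosts.entrySet()) {
            sql.append(' ').append(entry.getKey());
            if (entry.getValue() != null) {
                sql.append(' ').append(entry.getValue());
            }
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> columns = new LinkedHashMap<>();
        columns.put("title", 2.0);   // hypothetical column, boosted
        columns.put("body", null);   // hypothetical column, no boost factor
        // Prints: SET TEXTSEARCH 'lucene-sync' Article title 2.0 body
        System.out.println(build("lucene-sync", "Article", columns));
    }
}
```

Here lucene-sync must be a type name configured in search.xml, and the boost of 2.0 weights matches in title more heavily than matches in body.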

Once search is enabled for a table, a directory holding the search data is created, and every normal insert, update, and delete on the table updates the text search data to reflect the new data. This is transparent to users, who do not need to be aware that a particular table has search enabled when using it in normal SQL and other operations.

Enabling Search With the API

The Schema interface provides a method to enable text search for a table. It offers the same options as the SQL statement.