Using Text Search For Database Tables
DataCaster provides a Lucene-based implementation of Text Search that can be enabled for any table in the database. Apache Lucene is a widely used Apache search library ( see
http://lucene.apache.org). DataCaster also supports adding other search implementations, by implementing an internal search interface and configuring the implementation in the search.xml file in the conf/db configuration directory.
The Lucene-based implementation of search provided in DataCaster supports adding search capability to one or more columns of a table in the database. Once search is enabled, any insert, update and delete will result in the search columns being indexed for later text search queries.
Applications can then use the API or SQL to search the table. Many of the search features provided by Lucene and it's search query parser can be used to search specific fields.
Overview
Using the built-in text search is a relatively simple process.
1. First enable text search for a Table. This can be done via SQL statements or via the API, specifically the Schema interface. Search must be enabled for specific set of columns of a table.
Add Search Using SQL
SET TEXTSEARCH 'implementation_type' TABLE COLUMN ( COLUMN )*
For example
SET TEXTSEARCH 'lucene-sync' mytable mycolumn
Add Search via the API
Database db = new DatabaseImple(user, passwd);
Schema schema = db.getDefaultSchema();
String columns[] = { "mycolumn" };
float boost_factors[] = null;
schema.addTextSearch("mytable", columns, boost_factors, "lucene-sync");
2. Populate the table with data. As an insert, update or delete is performed on the table, the configured column data is being indexed for text search behind the scenes.
3. Perform search queries: Use the SQL SELECT statement, or the API to perform search queries. The SQL SELECT statements use a new TEXTSEARCH condition to search text. The API search function is provided via the Table interface.
Search Using SQL
SELECT SELECT_LIST FROM TABLE WHERE TEXTSEARCH TABLE [ COLUMN ] [MAX_HITS] [OFFSET] 'query string'
For example
SELECT * FROM mytable WHERE TEXTSEARCH mytable mycolumn 'query string'
The data is returned as with any other select query. We now support mixing the TEXTSEARCH condition with other conditions as well as allow the use of joins in these select statements. For example:
SELECT * FROM mytable, othertable WHERE TEXTSEARCH mytable mycolumn 'query string' AND othertable.c1 = 0;
Search Using the API
Database db = new DatabaseImple(user, passwd);
Table table = db.getTable("mytable");
List tuples = table.search("query string", "mycolumn", 25, 0);
The column string (second argument) is the default column used for the search if the query string does not specify columns. The third argument is the maximum number of hits to be returned from a search. The last argument is the offset to use for the first hit. The last two arguments are useful to provide paging of query results for clients.
The query string supports all the syntax options provided for Lucene search, since it is transparently passed to Lucene. The following provides some details on query string support with Lucene,
http://lucene.apache.org/java/docs/queryparsersyntax.html
Users can also directly use the Lucene APIs to search.
Using Lucene text Search in the Database
Configured Search Implementations
Enabling Text Search for a Table
The first step in getting text search capability is to determine the table and columns that need to be searched. DataCaster search is designed to use an independent search implementation instance for each table to be searched. Each such table needs to have text search enabled for the desired columns.
Text search is enabled after a table is created, using either SQL or the API to set up search for the table.
SQL Statements
The syntax for an SQL statement to enable search for a table is
SET TEXTSEARCH 'implementation_type' TABLE COLUMN [BOOST_FACTOR] ( COLUMN [BOOST_FACTOR] )*
where the implementation_type is a string that must be one of the implementation names configured in the search.xml configuration file described earlier. Many of the options that need to be configured for a specific search instance can be setup using the search.xml configuration file.
With the SQL statement, only the table to be search-enabled, the specific columns in the table to be searched, and an optional boost factor for each column can be specified. The boost factors are integer or floating point literals that determine the weighting of the column used by the search implementation in determining search results.
Once search is enabled for a table, a directory with the search data is created and any normal insert, update and delete on the table will result in the text search data being updated to reflect the new data. This should be transparent to users who do not need to be aware a particular table has search enabled when using it in normal SQL and other operations on the table.
Serching Using SQL Queries
Search With the DataCaster API
The Schema interface provided a method to enable text search for a table It has identical options to the SQL statement when it comes to enabling text search for a table.
Using the Lucene APIs
Applications can directly use the Lucene APIs to search the Lucene indexes created by DataCaster when data is inserted/updated for database tables. The text search indexes do not have the data, only the indexes with the key being the primary key, when it exists for the tebale, or otehrwise the tuple ID for each tuple.
The index directory used for Lucne Text Search is a sub-directory within the schema directory for the table. The name of the sub-directory used for the Lucene index is TABLENAME_TEXT-SEARCH
With the directory path to the Lucnene index, applications can directly use the Lucene APIs to perform more complex search queries than available via the SQL support for Text Search in DataCaster.