|
download source code
Introduction
As users of search engines, we have to learn new query syntax with every new search engine or database.
It would be nice to be able to search our own databases with Google-style syntax so that non-SQL
programmers could search the data. This article will build a .NET class library that converts
simple Google-style syntax into Transact-SQL syntax for full-text searches. Our full-text search
code uses the CONTAINS portion of the WHERE clause to search columns.
The IT instructions for installing and configuring a full-text search on a table is covered in a
different article found here: Perform SQL Queries Using Google-Style Syntax.
This article will focus on the GoldParser/Calitha usage and .NET C# programming required.
The class library built in this article is called GoogleToSQL.dll.
In order to demonstrate the GoldParser/Calitha parsing, the GoogleToSQL.dll will take a simple
Google-style query and transform it into the Transact SQL contains statement. A web site calls
the GoogleToSQL class library, finish creating the SQL statement and run the statement against
a database in order to find rows that match that full-text query.
How to Convert From One Language to Another
Converting from one programming language to another can be a difficult task.
If both languages can be described by Backus-Naur form (BNF), a conversion is
much easier.
GoldParser (www.goldparser.com)
is a free program that can help with that conversion.
The BNF Grammar File
The first step is to write the BNF that defines the Google-style search.
Using GoldParser's Parse Builder, you can write the BNF contained in the grammar file.
I used the grammar as described on the Google web site
(http://code.google.com/apis/base/query-lang-spec.html)
to write the BNF file, simplifying it for this article.
However, if you want a strict Google Syntax, the grammar provided is a good starting point.
Since GoldParser is a compiler, some of the syntax and keywords of the grammar file are
specific to GoldParser and not Google or BNF per se. GoldParser has documentation on the
web site to help understand the GoldParser syntax.

Figure: Google BNF displayed in Gold Parser Builder
Parts of GoldParser's BNF Grammar File
The top of the GoogleToSQL.grm file has information about author and comments.
The next section tells the compiler where to start and what to start with.
For this example, the starting symbol is called '<Query>'.
'<Query>' is not a keyword; its specific name is meaningful to the human reader of the file.
The next section is the Delimiters section.
For this example, it is called {GoogleValid} which used the GoldParser keyword '{All Valid}'.
GoogleValid also removes several characters from that valid set such as the pipe '|' character
which are not valid characters to Google. The next section is the smallest unit to parse that
is made up of {GoogleValid} characters, called a 'Term'. Examples of a term are a word like
'hawaii' or a numeric value such as '1421'.
The next section starting with <Query> is the definition of the BNF tree that will
be followed. The purpose of the BNF tree is to parse text and determine its linguistic
value. Google has symbols that need to be included in the BNF such a negation (-),
wildcards (*), and Boolean values (true or false). These symbols are also part of the
BNF because the parser needs to know that these symbols are different than a term.
Compile the Grammar File
Once you have the file, you can compile the file with the GoldParser and test out different search
queries to understand the parsing of the query. Compile the file by clicking on the 'Continue'
button at the bottom right of the GoldParser's Parse Builder window. The compilation is in 4
steps so you will need to keep clicking the button until it says to save the file. When you
save the file, you have successfully compiled the grammar. If there is an error in the
grammar, you will be shown the error and can fix it.
Testing the Grammar
Let's step through a test query to understand the BNF better. While in the Gold Parser
Builder with the grammar file open and compiled, open a test window.
As an example of a query, let's use 'vacation hawaii'. This query is made up of two 'terms'.
It has no other symbols or delimiters for the BNF to figure out. Enter the test query in the
top window and click on start button on the bottom right. Either step through the parsing
with the 'Step' button, or parse the entire query with the 'Parse All' button. Notice how
the Parse Actions window fills up with different items as the Google BNF is tested against the query.
The test window for parse actions looks like the following figure when all the parsing
is done. Three token types denoted by 'Token Read' actions (in black) are found. Two
token types are the terms 'vacation' and 'hawaii' and the third is the 'eof' symbol
meaning the end of the file or query was found. The other rows in the lower window
show how the BNF grammar was used. Shifting is the process of eliminating a non-essential
character and reducing is parsing the actual token to determine it's parts.
The parse tree also shows this information but in a different form that is more related
to the BNF provided in the grammar file. The words in the '<>' sections are not
GoldParser keywords but are terms I've defined.
Now that you have a compiled grammar file (A raw grammar file (GoogleToSQL.grm) and a
compiled grammar file (GoogleToSQL.cgt) are provided in this article's download.),
you need a c# .NET library that can interpret the compiled grammar file. For this
article, I use the free Calitha engine found at www.calitha.com.
You will need to download the calitha engine in order to use it in the class library.
When using the Calitha class library, you will need to register events to listen for.
These events are the Parse Actions. If you want to listen to the Token Read action,
you need to register to listen for that event.
Now that the BNF grammar file is finished, let's create the database objects in SQL server
in order for the full-text search to work.
Microsoft Full-Text Search
Microsoft has added Full-Text Search to SQL server to help with complex queries. In order for Full-Text Search to work, you need to have Full-Text Search installed from SQL Server. Full-Text Search is an installation option of the Database Engine Install, when you install SQL Server.
Full-Text Search T-SQL
A full-text search in SQL Server uses the standard SELECT statement but the WHERE clause needs to have a CONTAINS clause, for example:
SELECT * FROM TableName
WHERE CONTAINS (ColumnName,'query')
This article's download contains a SQL script to create and delete all objects necessary for a full-text search. The table name is tblFullSearch and the column name is colText so the above Transact-SQL for the sample query is:
SELECT * FROM tblFullSearch
WHERE CONTAINS (colText,'vacation and hawaii')
An alternate form is:
SELECT * FROM tblFullSearch
WHERE
CONTAINS (colText,'vacation')
AND
CONTAINS (colText,'hawaii')
This query is interpreted as find all rows that have both the words 'vacation' and 'hawaii' in them regardless if they are next to each other or have text between them.
Notice that the Google syntax does not need the 'and' but assumes a logical 'and' whereas the Transact SQL needs the logical 'and' stated. Also notice that the logical 'and' for the Transact SQL is inside of the single quotes meaning a literal. All text inside the single quotes is not a literal value but is in itself syntax specific to full-text search. If the query was to be interpreted as a literal value, it would need to appear like:
SELECT * FROM tblFullSearch
WHERE CONTAINS (colText,' "vacation and Hawaii" ')
This query is interpreted as find all rows that have the literal phrase of "vacation and Hawaii". These three words need to be found in this word in order for a row to match.
The use of the double quotes is not used in Transact SQL but is used in the full-text search query to denote a literal phrase to search for.
The GoogleToSQL.dll .NET Class Library
The GoogleToSQL class library will convert the Google-style phrase 'vacation hawaii' to the
T-SQL phrase 'vacation and hawaii'. The library constructor takes a column name in order to
use the column name in the CONVERT clause. Then the .Convert method is called and the CONTAINS
statement is returned. The rest of the SELECT statement will be provided by the calling
application. In Visual Studio, create a class library and add the GoldParserEngine.dll
and the CalithaLib.dll files as references and put them in the bin directory.
In order to use the Calitha engine, add the following code to the top of the file:
using com.calitha.commons;
using com.calitha.goldparser;
The constructor takes the column name for the CONTAINS statement, opens the compiled grammar
file, and starts listening for events:
public ContainsClause(String columnName)
{
// make sure column name is provided
if (columnName.Length == 0)
throw new Exception("ContainsClause c'tor: input param columnName is empty");
_columnName = columnName;
// read web.config for location of compiled grammar file
_fileLocation = System.Configuration.ConfigurationSettings.AppSettings["CompiledGoldParserFileLocation"];
using (FileStream fileStream = new FileStream(_fileLocation, FileMode.Open))
{
// create grammar file reader from Calitha library
CGTReader reader = new CGTReader(fileStream);
_lalrParser = reader.CreateNewParser();
// register 3 events to listen for
_lalrParser.OnTokenRead += new LALRParser.TokenReadHandler(_lalrParser_OnTokenRead);
_lalrParser.OnReduce += new LALRParser.ReduceHandler(_lalrParser_OnReduce);
_lalrParser.OnShift += new LALRParser.ShiftHandler(_lalrParser_OnShift);
}
}
The event this code listens for is the OnTokenRead event. Using the example of 'vacation hawaii',
we will get 3 tokens: vacation, Hawaii, and eof. As we get each token that is not eof, we need
to add the Transact SQL syntax 'AND'.
private void _lalrParser_OnTokenRead(LALRParser parser, TokenReadEventArgs args)
{
// Get Symbol Name
String sCurrentItemName = args.Token.Symbol.Name.ToLower();
// Get Symbol Value
String sCurrentItemValue = args.Token.Text.ToLower();
// if we are not at end of query and 'AND' is required, append it
if ((_andRequired) && (sCurrentItemName != "(eof)"))
_sQuery += " and ";
// parse action items
switch (sCurrentItemName)
{
case "-":
_sQuery += " not ";
break;
case "term":
_sQuery += String.Format("contains (" + _columnName + ",'{0}')", sCurrentItemValue);
_andRequired = true;
break;
case "eof":
break;
}
}
The Test Website
In order to test the class library, you will need a calling application.
For this article, I'll use a web application, which is included in the download.
Create a .NET web site and add the compiled grammar file GoogleToSQL.cgt to the
file system. Change the web.config file to reference that location in the AppSettings
section named 'CompiledGoldParserFileLocation'. Make sure to reference the
GoogleToSql.dll, GoldParserEngine.dll, CalithaLib.dll.
Build a web page that will take a query and pass the query to the GoogleToSQL.dll.
After the query is converted to SQL, you can finish the SQL syntax and execute the query.
The code for this web page can be found in the associated article found here:
Perform SQL Queries Using Google-Style Syntax.
Summary
One way to convert from one language to another is to have the BNF form of the language you are
converting, and using a parsing engine. Extending the BNF and adding to the code is left to the
user to produce the exact syntax needed for your web site.
Required Files
Required 3rd Party Files:
Required Files Provided By Author:
- GoogleToSql.dll
- GoogleToSql.cgt
- GoogleToSql.grm
Web Site Files Provided By Author:
- Default.aspx
- Default.aspx.cs
- Default.aspx.designer.cs
- TestWeb.csproj
- Web.config
- CreateDB.sql
Zip File Containing Files Listed Above: 070719.zip (2 MB)
|