NOTE: This post was originally written in 2011 so it may be dated. I’m resurrecting it due to relative popularity. This post has been copied between several blogging systems (some of which were home-brewed) and some formatting has been lost along the way.
I’ve been fiddling with
Lucene a good bit of late and have been quite
impressed. It’s more than just a “blazing fast” full-text indexing system,
especially when implemented via
Solr. With Solr it becomes
an incredibly scalable, full-featured and extensible search engine platform.
I had always assumed that the Lucene stack wasn’t for me. For the most part
I store my data either in SQL Server or MySQL, both of which have perfectly
adequate full-text search capability. It turns out that I could have saved
myself a few headaches and saved my employer some money by adopting Solr and not writing my own faceting, caching, etc.
Naturally, Lucene/Solr isn’t for everyone. If you just have a few
hundred thousand rows of text that you want to perform some basic searches on
under light load, then you’re probably better off using the full-text search
facility within your RDBMS.
However, if you need to scale out widely, perform faceted searches or use
some advanced/custom search techniques, then it’s probably worth looking into
Solr, even if you’re already deployed under an RDBMS with full-text support.
In this article I’ll outline the *VERY* basics of getting Solr up and running using
SQL Server as a data source. While I’m actually doing this in production
under Linux I’m going to tailor my instructions to Windows here to appeal to the
average SQL Server DBA. I’ll also employ the AdventureWorks sample database
for demonstrative purposes.
Note that you’ll have to have TCP/IP enabled in your SQL Server instance.
Named pipes, VIA and shared memory won’t cut it.
Step 1: Download and install Java
Solr and Lucene are written in Java so
a Java Runtime is a prerequisite. It can be downloaded
here.
After installation, make sure to set the JRE_HOME environment variable to your
Java install directory, e.g. C:\Program Files\Java\jre6
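If you prefer a command prompt to the System control panel, something along these lines should do it (a sketch only; adjust the path to wherever your JRE actually landed, and open a new prompt afterwards so the change is picked up):
setx JRE_HOME "C:\Program Files\Java\jre6"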
Step 2: Download Tomcat
Solr requires a servlet container. I recommend
Tomcat, which can be downloaded
here. Extract it to C:\Tomcat6 (note that I’m going to hang this all right off C:\ to keep the tutorial simple).
Step 3: Download Solr
This whole thing’s about Solr, right? You can pick it up here. Extract
the contents to a temporary location.
Step 4: Move Solr into Tomcat
Copy:
- apache-solr-1.4.0\example\solr to c:\Tomcat6
- apache-solr-1.4.0\dist\apache-solr-1.4.0.war to c:\Tomcat6\webapps\solr.war
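If you’d rather do the copies from a command prompt, they’d look roughly like this, assuming you extracted the Solr archive to C:\temp (that extraction path is just my assumption for illustration):
C:\>xcopy /E /I C:\temp\apache-solr-1.4.0\example\solr C:\Tomcat6\solr
C:\>copy C:\temp\apache-solr-1.4.0\dist\apache-solr-1.4.0.war C:\Tomcat6\webapps\solr.war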
Congratulations! Solr is essentially operational now, or would be upon starting Tomcat. It’d just be devoid of data.
Step 5: Download and install a SQL Server JDBC driver
In order for Java to talk to SQL Server we’ll have to supply a JDBC driver.
There are many available, but I used Microsoft’s, which can be downloaded
here.
Note that there’s also a Unix version available.
Now create a C:\Tomcat6\solr\lib folder. Copy the
file sqljdbc4.jar out of the archive downloaded above into it.
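From a command prompt that boils down to something like the following (the location of sqljdbc4.jar inside the extracted driver package varies by version, so treat that part as a placeholder):
C:\>mkdir C:\Tomcat6\solr\lib
C:\>copy <extracted-driver-folder>\sqljdbc4.jar C:\Tomcat6\solr\lib\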
Step 6: Configure the import
Create a C:\Tomcat6\solr\conf\data-config.xml file and put the
following content in it, modifying it to match the details of your configuration, naturally. This file defines what data we’re going to import
(the SQL statement), how we’re going to get it (the JDBC
driver class) and where from (the connection string and authentication information).
The resultant columns are then mapped to fields in Lucene.
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://localhost\INSTANCENAME;databaseName=AdventureWorks"
user="TESTUSER"
password="TESTUSER"/>
<document name="productreviews">
<entity name="review" query="
SELECT ProductReviewID, ProductID, EmailAddress, Comments
FROM Production.ProductReview">
<field column="ProductReviewID" name="id"/>
<field column="ProductID" name="product_id"/>
<field column="EmailAddress" name="email"/>
<field column="Comments" name="comments"/>
</entity>
</document>
</dataConfig>
Step 7: Tell Solr about our import
Add the following requestHandler to
C:\Tomcat6\solr\conf\solrconfig.xml:
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">C:\Tomcat6\solr\conf\data-config.xml</str>
</lst>
</requestHandler>
This registers the data import we defined above as a request handler, so import
operations can be triggered by visiting the /dataimport URL.
Step 8: Configure schema
Ensure the fields are set up correctly in C:\Tomcat6\solr\conf\schema.xml. There will be plenty of example fields, copy fields, dynamic fields and a default search field in there to start with. Just get rid of them.
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="comments" type="text" indexed="true" stored="true"/>
<field name="email" type="string" indexed="true" stored="true"/>
<field name="product_id" type="int" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>
<copyField source="comments" dest="text"/>
<copyField source="email" dest="text"/>
<defaultSearchField>text</defaultSearchField>
There’s quite a bit of power in Solr schemas that I won’t go into in this article: dynamic fields, copy fields, compression and more. Needless to say it’s worth reading up on, which you can do in the Solr wiki.
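As a quick taste, here’s roughly what a dynamic field and an accompanying copy field look like; the *_txt pattern is purely illustrative and isn’t part of the schema we set up above:
<dynamicField name="*_txt" type="text" indexed="true" stored="true"/>
<copyField source="*_txt" dest="text"/>
This lets any column you map to a name ending in _txt get indexed as text without declaring each field explicitly.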
Step 9: Start Tomcat
OK! We’re finally configured well enough for an import. All we have to do is start up Tomcat. Make sure you’re in Tomcat’s directory as the quick-and-dirty configuration I showed you here requires it in order to find the Solr webapp.
c:\Tomcat6>.\bin\startup.bat
If you’d like to move the Solr webapp elsewhere on the filesystem, remove the requirement for starting in Tomcat’s directory or perform an advanced configuration, please see the Solr with Apache Tomcat article in the Solr wiki. Pay special attention to the section labeled “Installing Solr instances under Tomcat,” where they show you how to create contexts.
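For reference, a minimal context fragment along the lines of what the wiki describes might look like this, saved as C:\Tomcat6\conf\Catalina\localhost\solr.xml (the paths here match this tutorial’s layout; defer to the wiki for the authoritative details):
<Context docBase="C:\Tomcat6\webapps\solr.war" debug="0" crossContext="true">
<Environment name="solr/home" type="java.lang.String" value="C:\Tomcat6\solr" override="true"/>
</Context>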
Step 10: Import
Now visit
http://localhost:8080/solr/dataimport?command=full-import
with a web browser. That’ll trigger the import. Because we’re just importing
a small amount of test data, the process will be nearly instantaneous.
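The DataImportHandler responds to a few other commands as well. For example, you can check on a long-running import or re-run one incrementally:
http://localhost:8080/solr/dataimport?command=status
http://localhost:8080/solr/dataimport?command=delta-import
(delta-import only does something useful if you’ve defined a deltaQuery in data-config.xml, which we haven’t here.)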
Step 11: Observe your results
That’s it! You can verify your work by issuing a RESTful query against Solr like http://localhost:8080/solr/select/?q=heavy&version=2.2&start=0&rows=10&indent=on, which searches the index for all reviews with the word heavy in the comments.
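You can also qualify the query with a field name and trim the response down to the fields you care about. For instance, something like http://localhost:8080/solr/select/?q=comments:heavy&fl=id,product_id&rows=5 should return just the id and product_id of the first five matching reviews.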
Pitfalls
There are a number of reasons a data import could fail, most likely due to a problem with the configuration of data-config.xml. To see for sure what’s going on you’ll have to look in C:\Tomcat6\logs\catalina.*.
If you happen to find that your import is failing due to the system running out of memory, however, there’s an easy, SQL Server-specific fix. Add responseBuffering=adaptive and selectMethod=cursor to the url attribute of the dataSource node in data-config.xml. That stops the JDBC driver from trying to load the entire result set into memory before reads can occur.
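For example, the dataSource url from Step 6 would become:
url="jdbc:sqlserver://localhost\INSTANCENAME;databaseName=AdventureWorks;responseBuffering=adaptive;selectMethod=cursor"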
Next Steps
So we’ve gone from zero to a functioning Solr instance rather quickly there. Not too shabby!
However, we’ve only queried Solr through REST. Libraries like
solrnet are handy for wrapping
objects around the data in .NET. For example:
/* using directives needed to compile the examples below; ServiceLocator
   comes from the CommonServiceLocator assembly that solrnet depends on */
using System;
using System.Collections.Generic;
using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Attributes;
/* review domain object */
public class Review
{
/* attribute decorations tell solrnet how to map
the properties to Solr fields. */
[SolrUniqueKey("id")]
public string Id { get; set; }
[SolrField("product_id")]
public string ProductID { get; set; }
[SolrField("email")]
public string EmailAddress { get; set; }
[SolrField("comments")]
public string Text { get; set; }
}
class Program
{
static void Main(string[] args)
{
/* create a session */
Startup.Init<Review>("http://localhost:8080/solr");
ISolrOperations<Review> solr =
ServiceLocator.Current.GetInstance<ISolrOperations<Review>>();
/* issue a lucene query */
ICollection<Review> results = solr.Query("comments:heavy");
foreach (Review r in results)
{
Console.WriteLine(r.Id);
}
}
}
Resulting in:
2
4
If you’re totally new to Solr it’s worth checking out the wiki. It outlines handy features such as replication, facets and
distribution.