NOTE: This post was originally written in 2010 so it may be dated. I’m resurrecting it due to relative popularity. This post has been copied between several blogging systems (some of which were home-brewed) and some formatting has been lost along the way.
One thing that’s impressed me with Solr is the flexibility of the Data Import Handlers (DIHs). When I was new to Solr there were several times I thought for sure I’d have to write my own extension of DataImportHandler. Every time that’s happened I’ve been wrong. A transformer or something handled my needs. Sometimes it’s wonderful to be wrong! Especially when it means less code I have to write myself!
One of the aspects of DIH’s that provide such great flexibility is transformers like RegexTransformer and TemplateTransformer. In this post, however, I’m going to *quickly* cover the ScriptTransformer wich allows you to employ your own custom JavaScript code in the processing of imports.
Prerequisites
Obviously you’ll need a functional Solr instance. Also, ScriptTransformers require Java 6 due to JavaScript support. I’ll also assume you have an understanding of how dynamicFields work.
Objective
At the office I’ve recently used a ScriptTransformer to build the field names of dynamicFields and I’m going to do the same in this article. The actual use-case I dealt with was very esoteric and honestly a bit proprietary so I’ll substitute an example data scenario here.
Basically I’ll import data about students grades for various courses from different institutions. In the resultant Solr index I’ll provide a dynamicField for every course to provide easy sorting of students by their grades in the courses they took.
Consider the following MySQL schema and data and try to think beyond this sample data. Think about hundreds of schools, thousands of courses and, well, a ton of students.
create table schools (
id int auto_increment primary key,
name varchar(255)
);
insert into schools (name) values ('Pitt');
insert into schools (name) values ('Penn State');
create table students (
id int auto_increment primary key,
first_name varchar(255),
last_name varchar(255),
current_school_id int references schools(id)
);
insert into students (first_name, last_name, current_school_id) values
('John', 'Doe', 1);
insert into students (first_name, last_name, current_school_id) values
('Bill', 'Miller', 1);
insert into students (first_name, last_name, current_school_id) values
('Jane', 'Dow', 2);
insert into students (first_name, last_name, current_school_id) values
('Dennis', 'Itchison', 2);
create table courses (
id int auto_increment primary key,
school_id int references schools(id),
course_number varchar(10),
name varchar(255)
);
insert into courses (school_id, course_number, name) values
(1, 'CS1501', 'Algorithm Implementations');
insert into courses (school_id, course_number, name) values
(1, 'CS1541', 'Introduction to Computer Architecture');
insert into courses (school_id, course_number, name) values
(2, 'CMPSC465', 'Data Structures and Algorithm');
insert into courses (school_id, course_number, name) values
(2, 'CMPSC473', 'Operating Systems');
create table grades (
id int auto_increment primary key,
value FLOAT,
course_id int references courses(id),
student_id int references students (id)
);
insert into grades (value, course_id, student_id) values (4.0, 1, 1);
insert into grades (value, course_id, student_id) values (2.5, 2, 1);
insert into grades (value, course_id, student_id) values (3.0, 3, 1);
insert into grades (value, course_id, student_id) values (3.0, 1, 2);
insert into grades (value, course_id, student_id) values (3.5, 2, 2);
insert into grades (value, course_id, student_id) values (3.5, 3, 3);
insert into grades (value, course_id, student_id) values (2.5, 4, 3);
insert into grades (value, course_id, student_id) values (3.0, 3, 4);
insert into grades (value, course_id, student_id) values (2.0, 4, 4);
Keep in mind that an idea here is that there would be far too many courses to conceivably have a sparse-style column per course if we were denormalizing a list of students. A student can also have taken courses at several of the institutions despite where they’re enrolled now.
Solr Schema
The data above will be transformed into the following Solr schema:
<fields>
<field name= "id " type= "int " indexed= "true " stored= "true " required= "true "/>
<field name= "first_name " type= "string " indexed= "true " stored= "true "/>
<field name= "last_name " type= "string " indexed= "true " stored= "true "/>
<dynamicField name= "grade_* " require= "false " type= "float "/>
</fields>
DIH Configuration
In order to facilitate the transformation of the data into the schema defined above I’ll employ the following DIH configuration:
<dataConfig>
<dataSource type= "JdbcDataSource "
driver= "com.mysql.jdbc.Driver "
url= "jdbc:mysql://localhost/school "
user= "YOUR_USER "
password= "YOUR_PASSWORD "/>
<script>
<![CDATA[
function pivotGrades(row) {
var courseNumber = row.get( "course_number ");
var gradeValue = row.get( "value ");
var fieldName = "grade_ " + courseNumber;
row.put(fieldName, gradeValue);
return row;
}
]]>
</script>
<document name= "studentGrades ">
<entity name= "student " query= "
SELECT id, first_name, last_name
FROM students
">
<field column= "id " name= "id "/>
<field column= "first_name " name= "first_name "/>
<field column= "last_name " name= "last_name "/>
<entity name= "grade " query= "
SELECT course_number, value
FROM grades
INNER JOIN courses
ON grades.course_id = courses.id
WHERE grades.student_id = ${student.id}
" transformer= "script:pivotGrades ">
</entity>
</entity>
</document>
</dataConfig>
See the script tag? That’s where I’ve defined a pivotGrades javascript function to turn the data from grade sub-entity on its side into dynamicFields. In the real world you might expect to see some more intense text manipulation here to warrant the ScriptTransformation I s’pect.
Querying
All the work I’ve done above was done specifically so I can easily and concisely sort students by their grades in specific courses. Here’s the money:
http://localhost:8080/solr/students/select/?q=*:*&version=2.2&sort=grade_CS1541%20desc
Resulting in:
<response>
<lst name= "responseHeader ">
<int name= "status ">0</int>
<int name= "QTime ">1</int>
<lst name= "params ">
<str name= "sort ">grade_CS1541 desc</str>
<str name= "indent ">on</str>
<str name= "q ">*:*</str>
<str name= "version ">2.2</str>
</lst>
</lst>
<result name= "response " numFound= "4 " start= "0 ">
<doc>
<str name= "first_name ">Bill</str>
<float name= "grade_CS1501 ">3.0</float>
<float name= "grade_CS1541 ">3.5</float>
<int name= "id ">2</int>
<str name= "last_name ">Miller</str>
</doc>
<doc>
<str name= "first_name ">John</str>
<float name= "grade_CMPSC465 ">3.0</float>
<float name= "grade_CS1541 ">2.5</float>
<int name= "id ">1</int>
<str name= "last_name ">Doe</str>
</doc>
<doc>
<str name= "first_name ">Jane</str>
<float name= "grade_CMPSC465 ">3.5</float>
<float name= "grade_CMPSC473 ">2.5</float>
<int name= "id ">3</int>
<str name= "last_name ">Dow</str>
</doc>
<doc>
<str name= "first_name ">Dennis</str>
<float name= "grade_CMPSC465 ">3.0</float>
<float name= "grade_CMPSC473 ">2.0</float>
<int name= "id ">4</int>
<str name= "last_name ">Itchison</str>
</doc>
</result>
</response>