As you know, we've recently migrated JavaRanch from a variant of Universal Bulletin Board (Perl)
to JForum (Java). Naturally, I wanted to write an article talking about how we managed it.
Who are we
JavaRanch is a website run by volunteers throughout the world. A number of us worked on the forum
migration project, including developers in Japan, India, Switzerland, Germany, England, and in the
states of New York, Maryland, Texas, Colorado and California in the USA. Many other people made
suggestions and tested.
How do we communicate
Mainly through the "Moderators Only" forum. In addition to plotting world domination, we've been
quite busy in there talking about the new forum software. In fact, the Moderators Only forum has
been on the new software since the end of November; we've been busy making sure the new software
is ready for use. We also communicated some through e-mail for more targeted discussions amongst
the core developers on the project. I find that being in different time zones is helpful.
A surprising number of things get solved "while one is asleep."
That's all well and good, but what happens when we need to communicate synchronously?
For the two data migration/deployment weekends, we used an instant message session.
This allowed the three people involved to do things simultaneously and manage dependencies.
Technologies in JForum
Servlets
Freemarker template language – While this serves the same purpose as a JSP (view)
it uses a separate language that forces the Java code to stay out of the view. If we were
writing software from scratch we probably would have used JSPs/EL since we had to go and
learn Freemarker to use it.
JDBC
Java 5/J2EE 1.4 – This was an interesting one. We backported it to Java 1.4/J2EE 1.3
and then decided to go with the later runtime after all.
Tools we used
Multiple development environments – A few of us ran the software locally on our
own machines along with our own database. We also had a sandbox environment for integration
and testing. And of course we have a production environment.
Subversion (SVN) so we don't lose our work. It was down for a week and let me tell you that
I really missed it! We did operate under the "if it isn't in source control, it doesn't exist"
philosophy, which worked well. A bit of uncommitted work was lost early on, but we were able to
have a stable production base.
Ant build file to minimize number of steps in deployment and avoid the need to upload
loose files. Very early on, we released inconsistent class files (rather than a jar)
which broke the application in our sandbox. Needless to say we now use jars!
JUnit – Some of the new classes/major functionality got thorough unit tests. The rest
of the enhancements were manually tested. This was a strategy for managing risk.
Some of the new code was quite complex and really needed the tests, which caught
problems many, many times. Minor modifications, by contrast, could be manually tested without
a huge effort. Not ideal, but a decent balance for "legacy software" modified by volunteers.
Password protected wiki pages to track internal requests along with what was done
(and in what environment). This helped us prioritize and manage requests from other moderators.
Our long road to release
If you've been around a while, you know that new forum software gets talked about
regularly and nothing appears to happen. The reason is that migrating to new software is
a lot of work! Our major milestones for this project were:
January 2008 (yes, last year) – Our server is straining under load. Temporary solution
is to buy a more recent server. Long term solution is to migrate to software we can actually
change to manage problems.
January 2008 – "This time is going to be different." We are going to pick a software
package this time and go with it.
January/February 2008 – JForum seems the closest to what we want. One of our sheriffs
installed it on our sandbox and proved you can get up and running with JForum very quickly.
Unfortunately, migrating our data (anywhere) and adding features we view as critical are a
lot more work.
February 2008 – Fundraiser to buy a new server since migrating to new software will take
too long to fix the problem.
February 2008 – Let's go with JForum. Initial commit to SVN so we can start making changes to it.
February/March 2008 – JForum seems easy enough to extend. Let's go with it.
March-October 2008 – A lot of work got done on extending JForum and the data migration.
We stopped a few times for holidays, vacations and nice weather. However, we never lost momentum
and that's the important thing.
November 29 2008 (Thanksgiving weekend) – We cut over the private Moderators Only forum to JForum.
This let us beta test by using the software "for real" amongst the moderators. It also represented
the "point of no return" where we were really committing to the project going live no matter what.
This was the first time we couldn't back out without losing data. We also used this
day as a dry run for a full cutover.
January 3, 2009 (weekend after New Year) – Public cutover. We made it.
Development process
The process evolved over time. Some roles:
Project Manager – We realized we needed one before starting. Being a bunch
of developers, this wasn't something anyone was overly enthusiastic about. We did get a
volunteer project manager, although we didn't call it that.
Deployer – We all deployed to our sandbox environment so everyone knew how and
we didn't have a "bottleneck" person. We did have one person do the deployments to production
and a team of three for the two cutover weekends.
Developers – Of course!
Tester – We solicited a rancher that we were thinking of asking to become a bartender
to do some independent beta testing for us. He did a great job and is now a bartender –
thanks Amit!
Legacy data
A side effect of working with volunteers is that the most interesting work gets done faster,
either right after it is thought of or right before it is needed. Whereas the less interesting
work -like migrating users and threads- dragged on for months. Granted, this was hard
and a lot of work. There were more pauses in it than some other work though.
Summary
We did it! We did it! We did it! I'm proud and excited that a project of this size succeeded
with only volunteers on it.
2009 brought a major change to the forums at JavaRanch: we mooooved our old UBB forums to fresh new JForum software. Check out what Jeanne Boyarsky, Ulf Dittmer, Ernest Friedman-Hill, Bear Bibeault and Amit Ghorpade reveal about behind-the-scenes action on moving day and during the whole year-long (!) process.
JavaRanch: Jeanne, of all the people involved in this effort, you're probably the one who's seen the inner workings of JForum most up close and personal. What's your impression? Has anything made you scratch your head and wonder why they did it that way?
Jeanne: I'm happy with JForum. It was a stable base that we could build on.
There's always going to be moments when you look at someone else's
code and wonder "what were they thinking." For me, that came from the
size of some of the methods. For example, the "insert save" method
for posts was 301 lines originally. Not to say that I refactored it
to make things any better. I tried to change "as little as possible."
And that the unit tests that did come out of the box didn't all pass.
The JForum forums say this is known so we deleted the tests
altogether in our copy.
JR: Was there anything in the code that made you say, oh cool, I like the way they did that!
Jeanne: I really like that all the SQL is in one property file with overrides
for database specific ones.
JR: Ooh, we like it when you talk like that... Other cool things about JForum? Or other things that got y'all scratching your heads?
Bear: I really like the way that code blocks are styled, with the line numbers and all. And the fact that you can actually enter "onclick" without resorting to HTML entities.
With regard to head scratching, I wonder why they resorted to Freemarker templates over pure JSP. Nothing against Freemarker (and other templating engines) but with the advent of JSP 2.0, where JSPs are easy to use as pure templates, I just don't think they're necessary anymore.
Ernest: Overall the design is awesome and clean. As far as the choice of Freemarker, I understand that JForum 3, now under development, is moving to JSPs, so they've seen the light there. In any case, Freemarker doesn't let you embed Java scriptlets, so there is still a nice separation between presentation and business logic.
Bear: True, dat.
JR: Speaking of presentation, Ernest, you dove in and did a lot of serious modifications to JForum's look and feel. Are you a touchy feely kind of guy?
Ernest: Hmmm. Maybe. I really did it because I thought that the sense many of us have of the Saloon being a tangible place is strongly tied to its look and feel. I wanted to make sure that when we launched, it would seem like the same place. For those following along at home, you need to know that we used JForum to host our "Moderator's Only" forum for a month or two before we went live with the rest of the Saloon. At that time, it looked a lot more like JForum's own native look and feel. My sense was that the staff didn't feel at ease; it felt like we were all sitting around in the lobby of a strange hotel, rather than our old familiar digs. I was worried that the abrupt change would alienate the Ranchers at large. So I did what I could to make our software look like UBB. Some people thought I was nuts, but I think it has worked out for the best.
As time goes on, I think we'll move away from this look and try to bring things into the 21st century, but for the moment, I am happy that it feels like the same old place.
JR: Bear, you did a bit of sprucing up around the forum too, in lovely chocolate tones. Can you tell us more about those yummy chocolate buttons? As a self-respecting chocolatier, you probably mixed up a quick batch of 70% Ecuadorian, yes? Do you reveal your recipes?
Bear: Whoa, whoa! One question at a time! I have a short attention span.
pause
pause
JR: Bear?
Bear: Oh, yeah. Sorry about that. The chocolate-y buttons actually have a fairly rich history.
Those who have been around the Ranch for some time may remember that way back when the buttons were oval, and aqua, and used a serif font that didn't match anything else on the rest of the site.
At one point, someone mentioned that it'd be nice to replace them with something that matched the rest of the site better. Paul nixed the idea, but before I saw that, I had whipped up some buttons to replace the aqua ones. Naturally, I chose browns to match the Ranch-y colors, and used a sans serif font that didn't clash with the text of the site.
When I saw that Paul wasn't keen on replacing the existing buttons, I almost discarded them. But, I went ahead and posted them in a reply just to show what I had come up with.
When Paul saw the new buttons, he did an about face and said something along the lines of "I want those!". I've always thought of that as a supreme compliment.
When the time came to create buttons for the JForum version, I figured I could use the same theme and volunteered. The new buttons are a cross between the default JForum buttons, and the brown, oval buttons I had created for the "old Ranch".
JR: Were they hard to make?
Bear: Not really, but the process would probably be of little interest to anyone but Photoshop geeks. An alpha channel here, some layer blending there, and liberal use of layers, and there you have them!
Now that the "recipes" are set up, it's pretty easy to generate new buttons as needed.
JR: Ooooh recipes! Do you share?
Bear: When it comes to real recipes, I share pretty liberally. I'm actually not super-big on sweets, so dark chocolate appeals to me more than something sweeter like milk chocolate. Add some bitter orange flavor, or better yet, chiles, and I'm a happy bear.
JR: Chiles?
Bear: I like it hot! Want my recipe for 7-Chile Jelly?
JR: Hmmm, think I prefer my jelly sans chiles, thanks anyway. Ulf, just about a year ago, you actually did the first installation of JForum. Assuming you weren't under the influence of 7-Chile Jelly, what possessed you to do that?
Ulf: A few months earlier I had downloaded and installed JForum for a project related to my day job. So when the JR moderators discussed what to do about the forum software -improve UBB, switch to something else or write our own- JForum was mentioned along with questions like "what can it do", "what's its source code like" etc. Since I knew what was involved in installing it -quite easy really: run a DB script and get a web app running-, I put it on a test server, so that people could play around with it. Not for a moment did I imagine that for two of us -Jeanne and Pauline- this would be the starting point for pushing the migration project forward; it was meant as a useful contribution to a discussion, nothing more.
JR: There have been numerous valiant attempts to update our old UBB forums; this is the only one that really made it through. When did you really, really believe that this time JavaRanch really would move to a new forum? Why?
Ulf: I can't put a precise date on that, so let's see ... the initial install was in January, and then some quick progress was made. I think by June it seemed to have fizzled out. There were also some bad vibes amongst moderators around that time; those certainly dampened my enthusiasm. On top of that, it was summer, so I tried to spend more time away from computers... So I lost track of progress for a while, and was pleasantly surprised to discover that Jeanne had soldiered on, and completed a good deal of the ToDo list items. So when she started talking about what might be involved in actually switching to JForum, I figured I had better get cracking on the items I had signed up for. According to the Subversion history I started committing code in earnest in early November; so I'd say by that time I believed it would happen.
Given the time frame, maybe we should have had a tag line like "JavaRanch on JForum - Change you can believe in" :-)
JR: Yes we can!! Sorry... couldn't help myself...
Jeanne: My article in this newsletter 'The Great Forum Migration Project'
sketches out the timeline for the entire project. There were two
milestones that really jumped out at me. The first was in March 2008.
We did enough proof of concept work to feel comfortable that we could
extend JForum and that it was stable. This was when we got buy-in
from Paul Wheaton - the site owner and trailboss - that this was going
to be THE attempt that we were going to support and that we wouldn't
let scope creep mess things up. We all agreed that the primary goal
was to get JavaRanch up on JForum without losing data and everything
else was secondary. If that meant going without certain features for
a while, so be it. To me, that meant the project would eventually
succeed and it was just a question of when. The second milestone was
November 29, 2008 when we cut over the Moderators Only forum. While we
were ready for some time before, that day represented the "point of no
return." It was the first time we couldn't go back to UBB without
losing any data. The feeling that weekend was awesome because it felt
real!
JR: So people were busy testing an installation of JForum for quite a while. Amit, you were one of the first official testers. In fact, it is rumored that you had very, very close contact with Jeanne during the migration from UBB to JForum. Can you tell us a little about that (without getting either of you in trouble)? ;-)
Amit:
Maybe yes, for two reasons: First Jeanne's patience, where others put me to their ignore list for reasons below, Jeanne showed the courage to reply to my idiocy each and every time. Second I learned somehow that Jeanne will be one of those getting more perks for the new software, so I thought she would recommend my name.
Okay can you replace the second reason with something that safeguards my honesty, like she was my reporting person.
JR: OK, we'll put it on the wish list. What was your favorite bug report?
Amit: Actually I thought all of my bugs would turn out to be nightmares for the implementors, but sadly none did. Still my most favorite bug was the "Service Unavailable bug" which I reported when server maintenance was going on.
JR: Is it true that you reported duplicate bugs to increase your post count?
Amit: That is supposed to be a trade secret, how come you know this? (Note to self : Don't let out your trade secrets, already lost two.)
JR: Forget the bugs, let's talk features! What's your favorite JForum feature?
Amit: Extended bartender's powers! Err, it's the preview button.
Jeanne: Sticky thread for book promos, quick reply, code tags
JR: OK Jeanne, since it's you, we'll give you three favorites. ;-)
Ernest: How easy it is to modify and customize. What else would an inveterate hacker say?
Ulf: My favorite feature is its code base. I really want to give credit to Rafael Steil (JForum's main developer) for this. It's well-organized, has useful abstractions with a well-implemented MVC separation, and many parameters and switches one can tinker with. And it's written in Java, so all of us can contribute: looking at the repository history, 10 moderators have worked on it so far!
Thinking back to UBB (which is written in Perl) where any code modification but the most trivial was a gamble, and not even those moderators who did dare tinker with it (primarily Jim and Ernest) could always predict what might happen as the result of a code change, it's a difference like night and day.
JR: Night and day, indeed. And how does this night time migration project differ from your day job?
Ulf: Not so much, actually. I've worked on web-based Java projects for more than 10 years, so I'm quite familiar with the issues that can arise. Of course, here we're all volunteers working in our spare time, so sometimes people can't contribute as much as they thought they'd be able to, or have to push things back because real life interferes. So any kind of deadline generally doesn't work out well, which is why we have always responded vaguely to any public requests for more features in the forum software, until the very minute we were ready to announce it.
But I enjoy doing the occasional actual programming. In my day job I'm in leadership positions where I don't get to do much of that any more.
Ernest: Unlike most folks, I mostly do desktop software development at work, so working on Web technologies is a nice change of pace for me. Hacking on the Ranch is a great way for me to keep those skills sharp, and the Ranch staff are a great bunch of teachers.
Jeanne: Where to begin. I work for a bank in New York City. There's a lot
more structure and bureaucracy. In some ways, it's nice to be able to
deploy to production without getting five approvals and sign off from
a bunch of people. In other ways, it's dangerous that a change can
get into production without being properly checked. It's also a
different dynamic because almost everyone here at the ranch is a
techie. All the roles had to be filled on this project. I must say
that I really appreciate what our project managers,
operations/deployment/server folks and test group do!
Another aspect was having less confidence that things would go
smoothly. I took January 5th off from work in case there were
problems on our first business day on the new software. And I was
sincerely worried that something big would go wrong. Luckily that
didn't happen. By contrast, I'm never worried about this at my day
job. There's enough process in place to KNOW rather than HOPE that
things work as they should.
JR: New York City? Do they have little moose bags in Central Park, you know the kind you use when you take your moose on a walk?
Jeanne: Unlike the guy with a pet
tiger, I recognize keeping a real moose in a New York City
apartment wouldn't go well. So mine are stuffed. I have a girl
stuffed animal moose on my desk at home and a boy
one at work. (Build a Bear says it's a moose and not a reindeer.)
And my moose are easy to clean up after!
JR: You certainly did a clean job on the migration. Given the long process, you must have faced plenty of decisions. What was the toughest decision you had to make along the way?
Jeanne: When we deviated too much to upgrade when JForum itself does. That
was a big deal because it means that we now own our fork of JForum.
Upgrading again would be another monumental effort and from what
history shows is unlikely to happen. On the bright side, we now have
the skills and expertise to add what we need to our installation.
JR: On January 3rd, 2009 when most people were still nursing leftover holiday hangovers, Ernest, Ulf and Jeanne were busy doing The Big Move. Can you guys give us an idea of what that day was like for you?
Ernest:
We did the migration on a Saturday morning and my kids were home. Despite repeatedly telling them "Shh, Daddy's working!" they wouldn't really leave me alone, so at one point I was hacking Perl code with a preschooler on my lap. I guess he probably could have taken out the whole Ranch with a few spectacularly mistimed keyboard bashes. Didn't happen, though.
I might be the only team member to feel a little sad and wistful about getting rid of UBB. It was ultimately up to me to "throw the switch" that directed all our traffic to the new site, and immediately after I did it, I felt an instant of "What have I done!" panic. And then I couldn't believe we had finally gotten to that point.
It was hard work but a lot of fun. It's always a great experience to tackle a project with smart, talented people who enjoy what they're doing, and that was definitely the case here. There were a couple of speed bumps, but due to some impressive speed-debugging, we got over them very quickly. Jeanne and Ulf really know the code well.
Jeanne: The migration was LONG. We predicted 3 hours and it went 5. I felt
bad that it got to 8pm Ulf's time. It was also more intense than I expected because we were concentrating pretty much the whole time.
It was cool working on a project where everyone is a techie. Ernest and Ulf were awesome to work with!
Ulf: This was by far the longest chat session I've ever been engaged in. It was also a bit different from previous ones since I have never met Jeanne or Ernest, nor talked with them on the phone, so I have no voices to connect with them.
In a way it's amazing how far the internet has come. You're looking at a web page in the browser, looking at the server that serves that page in an SSH session, talk about what you're seeing via chat, and trade emails back and forth about it all at the same time. This all works in near real time nowadays. None of us were within 2000 kilometers of the servers, and in my case it was more like 8000+ km.
It did run longer than we hoped for, but it's been my experience that initial rollouts of sizeable applications generally proceed less smoothly than predicted. Although our test system ran well for a while, it had been set up and tuned over a period of months. It can easily happen that some required step doesn't get recorded for the "live" rollout, or that things work slightly different on the live system than they do on the test server. So I wasn't surprised by that.
The next week was a bit of a scramble, though. The software had a number of bugs that we hadn't encountered during testing - having a large member community sure helps to trigger those quickly! But by now we're at a point where the log files are pretty clean of unhandled exceptions, so we can start thinking more long-term.
Jeanne: Want a behind-the-scenes peek at how much fun we had? Check out these live chats from different points along the way:
While Ulf was troubleshooting a problem during our Moderators Only migration Thanksgiving weekend:
Jeanne: aww. gmail handles smilies cute! [they are animated]
Ernest: Hey, guys, I've been making waffles this whole time. The
waffles are done now. Pull up a chair!
Jeanne: mmm. imaginary waffles!
Ernest: OK, I will be eating wafles. I'll check back in a few minutes...
Kicking off the migration:
Jeanne: It seems like we should say something ranch appropriate before
kicking this off. I'm not a westerner, so I'll go with something NYC
based (Macy's parade) - 3! 2! 1! Let's have a migration!
Ernest: YeeeeeeeHAWWWWWWWW!
Jeanne: well that's western :)
Ulf: I'll let the experts decide :-)
Ernest: Wait, I am going to get my hat...
Ernest: OK, seriously, I am wearing a cowboy hat [see Ernest's profile
for picture]
If you've been reading the forums, you've seen a few references to how migrating our data from
Universal Bulletin Board (UBB) (Perl) to JForum (Java) was the majority of the work.
Ernest Friedman-Hill suggested an article on it. The idea to write the article is his;
the words are mine. (As a side note, I handled the user and forum/thread migration.
Ulf handled migrating the thread "watches." While technically the thread watches are data,
I'm going to be focusing on the rest. A lot of the analysis was done by other moderators on
previous attempts at this project - I don't remember who so I can't give credit where it is due.)
Architecture differences
UBB stores its data in text files, while JForum uses a database. Needless to say this is different!
How big is it?
Some interesting stats:
Item         #
# Threads    423,004
# Posts      1,875,753
# Users      181,490
# Forums     71
Item                       KB in UBB    KB in JForum
Users                      2,448,068    369,864
Threads (and post data)    6,767,764    1,293,640
Database sizes were queried using the tip in article
and do not include indexes. File sizes were counted using du -ks.
Item                       # Lines of Code*    # Classes    # Commits to SVN
Users                      726                 5            14
Threads (and post data)    1,040               4            21
* Doesn't count 4 classes (about 200 lines of code) shared between the two.
Initial analysis
The first thing we did was analyze what data we needed to migrate.
I created a text file listing all the JForum tables
and what we needed to pre-fill them with if anything. The ones involving data migration
were forums, topics (threads), posts/data, private messages and watches.
We decided the private messages looked like more effort to migrate than it was worth and
decided to migrate everything else.
Migrating forums
It looked faster to manually create the forums than to write a migration script.
We used the JForum UI to create the initial forum list. Then I did a database export
to get a CSV (comma separated value) file containing that data. I manually edited it so
the forum IDs and category IDs matched UBB. Then I imported the table back to the database.
This made migrating the threads/posts easier since we could rely on the forum number
staying the same. The forum migration turned out to be the easiest part of this process,
and was done fairly quickly.
Migrating users
Since the users had to exist before any threads could be migrated, they came next.
I created a spreadsheet listing both the UBB
and JForum data. The first sheet maps JForum fields to where we get the data from and
what they should be originally populated with. The other sheet listed the UBB rows
and what they meant. This analysis largely came from the last attempts at this project.
I then turned this analysis into a Java program that looped through all the UBB users
and created JForum ones. The program had some control logic so we could vary things like:
User range to migrate – useful for migrating in subsets
How often to log output to the screen – Smaller groups are useful at the beginning when
migrating fewer users and overkill when migrating more than 100,000.
Whether to add or update the database – We were running this script a number of times,
both to test it and for the phased migration. Since the logic is different for adds and updates,
this flag was a failsafe to make sure we weren't accidentally overwriting users.
How many failures to allow before aborting – A bunch of users migrated first were corrupt,
but we wanted to keep going through this. However, if something went wrong (like using the wrong
add/update flag) all would fail and you certainly don't want more than 100,000 failures.
Whether we are in debug mode – At the beginning, it was useful to see many details to see
where things were going wrong. I added a failsafe that you could only migrate one user
in debug mode so we couldn't accidentally kick off the script with thousands of users.
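The control settings above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual migration program: the class name MigrationRunSketch, the UserMigrator interface and every field name are hypothetical, and the real script may have handled ranges, logging and failures differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the run settings listed above; not the actual migration code.
class MigrationRunSketch {

    // Assumed stand-in for "read one UBB user and write it to JForum".
    interface UserMigrator {
        void migrate(int userId, boolean update) throws Exception;
    }

    private final int firstUser;    // user range to migrate, inclusive
    private final int lastUser;
    private final int logEvery;     // print progress every N users
    private final boolean update;   // false = add new users, true = update existing ones
    private final int maxFailures;  // failures tolerated before aborting
    private final boolean debug;    // verbose output, restricted to one user

    MigrationRunSketch(int firstUser, int lastUser, int logEvery,
                       boolean update, int maxFailures, boolean debug) {
        if (logEvery < 1) {
            throw new IllegalArgumentException("logEvery must be at least 1");
        }
        // Failsafe: debug mode may never be kicked off against a whole range.
        if (debug && firstUser != lastUser) {
            throw new IllegalArgumentException("debug mode migrates exactly one user");
        }
        this.firstUser = firstUser;
        this.lastUser = lastUser;
        this.logEvery = logEvery;
        this.update = update;
        this.maxFailures = maxFailures;
        this.debug = debug;
    }

    // Migrates the range and returns the ids that failed; aborts past the failure limit.
    List<Integer> run(UserMigrator migrator) {
        List<Integer> failed = new ArrayList<Integer>();
        for (int id = firstUser; id <= lastUser; id++) {
            try {
                migrator.migrate(id, update);
            } catch (Exception e) {
                failed.add(id);
                if (debug) {
                    e.printStackTrace();
                }
                if (failed.size() > maxFailures) {
                    throw new IllegalStateException(
                            "Aborting: " + failed.size() + " failures, last at user " + id, e);
                }
            }
            if ((id - firstUser + 1) % logEvery == 0) {
                System.out.println("Processed through user " + id
                        + " (" + failed.size() + " failed)");
            }
        }
        return failed;
    }
}
```

In this sketch a handful of corrupt users are recorded and skipped, while a systematic mistake (such as the wrong add/update flag) trips the failure limit almost immediately instead of producing 100,000 errors.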
Migrating threads
As long as it was, the approach for migrating users worked so I did the same thing for threads.
The spreadsheet was more complicated because the
data in each UBB file went into three tables. It didn't help that the UBB files were in three
different formats over time and we had to be able to migrate all of them. The last tab of the
spreadsheet maps the UBB and JForum tags to HTML. UBB "helpfully" stored the data as raw HTML
so I had to write the regular expressions to convert them back to JForum tags.
Luckily I like regular expressions.
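To give a flavor of that kind of conversion, here is a minimal sketch using java.util.regex. The handful of HTML patterns and the bracketed tags they map to are illustrative assumptions; the real mapping lived in the spreadsheet and covered far more cases, including the three historical UBB formats.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: turn a few raw HTML constructs back into bracketed forum tags.
class HtmlToTagsSketch {

    // One conversion rule: a compiled pattern plus its replacement string.
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            this.replacement = replacement;
        }
    }

    // Illustrative rules only, applied in order. $1 and $2 refer to captured groups.
    private static final Rule[] RULES = {
        new Rule("<b>(.*?)</b>", "[b]$1[/b]"),
        new Rule("<i>(.*?)</i>", "[i]$1[/i]"),
        new Rule("<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>", "[url=$1]$2[/url]"),
        new Rule("<br\\s*/?>", "\n"),
    };

    static String convert(String html) {
        String text = html;
        for (Rule rule : RULES) {
            text = rule.pattern.matcher(text).replaceAll(rule.replacement);
        }
        return text;
    }
}
```

The reluctant quantifier (.*?) keeps each match as short as possible, so two bold runs in one post are converted separately rather than swallowed as a single match.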
Testing it
Naturally, I tested on my local machine as I went to make sure things seemed decent.
I copied a couple data files to my local machine. Once those appeared to work and I thought
I was "done", I tried running the program on our test server against the full mass of data.
Through this, I found and fixed more problems. And during our internal testing we found still more.
The last problem detected was that accent marks weren't making it through in user profiles.
Since it was so close to production, we decided to ask people to change this manually.
I have to say that my favorite issue was the ONE thread in all of UBB that had Chinese characters
in the subject. Normally, the moderators would ask the poster to change it. However, this thread
was in the Moderators Only forum. Since there was only one instance of this, I went looking for
a moderator with a Chinese font installed to move that text to the body of the post.
How long did it take
We ran in four phases on the production server:
Birth of JavaRanch (October 31st) – This got the bulk of data onto the new server.
This took a couple days to run. Luckily it could be run unattended since it wrote errors
to a file, so I left it on overnight.
Moderators Only forum data from November 1st (November 29th) – Any data created or edited between
the mass migration and Moderators Only cutover. This took a few minutes since it was just one forum.
All forums from November 1st (December 20th) – We ran an extra incremental migration
Christmas weekend so migration day would go faster. It turned out that this step only took fifteen
minutes so we didn't save much time.
All forums from December 20th (January 3rd) – Final migration. This step took only a
few minutes to run and a good while to troubleshoot. More on that in the next section.
Any production problems?
We ran into one problem during the production run on January 3rd.
The Word Association
thread has more than 5000 posts in it and was too large for my migration script to handle.
We ran it a couple times against just that thread and kept getting "connection reset" errors
when trying to look up the member id (old posts stored a username.) After trying a couple things,
the "solution" was a simple hack:
static Map CACHE_USER = new HashMap();
CACHE_USER.put(userName, result);
Many of the old posts were made by the same set of people. Looking up their data once saved
enough queries to make the thread migration a success. While this is a quick and dirty cache,
it's all we needed here! If you read Ernest's "status e-mail" from the deployment, this was the
rooster we were trying to save rather than have to put down. It was a relief that we could spare
the thread and not lose a piece of ranch history.
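For readers curious how the two lines above fit into a lookup, here is a minimal sketch of the same idea. The UserLookup interface, the Integer id type and the class name are assumptions made for illustration; the original script's types and database access are not shown in this article.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a quick-and-dirty username -> member id cache.
class UserIdCacheSketch {

    // Assumed stand-in for the database query that was hitting "connection reset".
    interface UserLookup {
        Integer findMemberId(String userName);
    }

    private final Map<String, Integer> cache = new HashMap<String, Integer>();
    private final UserLookup lookup;

    UserIdCacheSketch(UserLookup lookup) {
        this.lookup = lookup;
    }

    // Queries each distinct username at most once, then answers from memory.
    Integer memberIdFor(String userName) {
        if (cache.containsKey(userName)) {
            return cache.get(userName);
        }
        Integer result = lookup.findMemberId(userName);
        cache.put(userName, result);
        return result;
    }
}
```

With a thread of more than 5000 posts written by a fairly small group of regulars, a cache like this would cut the number of queries from one per post to one per distinct poster.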
Ernest Friedman-Hill understands Java, JForum... and Elvish!
by a nameless elf
Time got tight for this edition of the journal, so we outsourced some interviewing work to a, well,
a geek elf. Wouldn't you know it, Ernest Friedman-Hill
understands Java... and Elvish! The elf asked Ernest about JavaRanch's move to JForum...
Elf: Aaye. Cormamin lindua ele lle.
Ernest: A star shall shine upon the hour of our meeting!
Elf: Yallume! Amin nowe UBB n'kelaya.
Ernest: By the sea and stars, I was as surprised as anybody to see UBB
go. Although this was a team effort, some distinguished themselves in
battle more than others. I am eternally grateful to those who did, and
they know who they are.
Elf: Mankoi lle uma tanya? Sut?
Ernest:
In Boulder where the server slept
in air-conditioned peace,
the tortured souls around me wept
At the UBBeast's increase.
In fetid puffs of Perl she'd purr,
In Perl we oft replied,
Enslaved, controlled, betrothed we were,
to this infernal bride.
When e'er we'd boast of slaying her,
she'd chortle through her snout,
"The posts! The members!" she'd intone,
"You'll never get them out!"
"What's more," the wretched slattern swore,
"my feature set's unique!"
"You've slaved for years to give me more
than any man could seek!"
And she was right; no forum e'er
conceived could be her match.
So trapped we were, and never fair
relief we deemed to catch.
Champions are rather made than born,
from life in trying times,
like Eärendil and Aragorn,
immortalized in rhymes.
So heroes fair, with shining hair,
arrived in Louisville,
to wrest our data from her lair
and she herself to kill.
In Boulder where the servers hum
with quiet blinking lights,
a silent vigil we now keep,
through peaceful Java nights.
Elf: Manke tanya tuula?
Ernest: We tried several times, but it was the determination of a small team
of volunteers that finally saw this through. The "UBBeast" still
lives, by the way -- although she's very tiny in her present form.
Her duties are limited to forwarding HTTP requests to their new URLs.
Elf: An lema? Mani nae lle umien? Mankoi lle uma tanya? (Lle ume quel.)
Ernest: When I'm not smiting wyrms at the Ranch, I am a computer scientist at a
U.S. National Laboratory. I work on a range of projects, although most
of my energy these days is focused on "problem set-up environments":
desktop software that prepares simulations to run on the world's
largest and fastest supercomputers. We use Java, and lately, the
Eclipse platform as an application framework.
Elf: Lle naa curucuar. Quel esta.
Ernest: Funny you should say that. Last summer I had a chance to take a lesson
on shooting both compound and recurve bows, and I really enjoyed it. I
was astonished at how well I could shoot after just one lesson.
Sweet water and light laughter till next we meet.
For a window into JavaRanch moderator internal communications,
Ernest wrote this during our January 3rd migration and has agreed to share it here.
Howdy folks,
Quick note to let y'all know how we're doin' with the migration.
The truck's parked out front and the engine's runnin'.
Things are going purty well. A couple cows got out, and we had to rope 'em.
They're all accounted for and waiting in the truck.
All the chickens are rounded up, except for one ornery rooster that
refuses to come down off the barn roof. We're either gonna rope 'im,
or shoot him down if we have to; one way or the other, he'll be dealt
with within the next few minutes.
After that we just have to walk around and pick up the lassos,
and the rain barrels, and the pitchforks, and suchlike. That won't take long.
But it'll be well past 10AM Ranch Time when we've got everything
unpacked at the new spread.
Erik Hatcher, Otis Gospodnetić, and Michael McCandless
This article is taken from Chapter 5 of Lucene in Action, 2nd Edition, by Erik Hatcher,
Otis Gospodnetić, and Michael McCandless. This article addresses using filters in Lucene.
Some projects need more than the basic searching mechanisms. Filters constrain document search space,
regardless of the query. For the book's table of contents, the Author Forum, and other resources,
go to http://manning.com/hatcher3/.
Filtering is a mechanism for narrowing the search space, allowing only a subset of the documents to be considered
as possible hits. Filters can be used to implement search-within-search features to successively search within a
previous set of results or to constrain the document search space for security or external data reasons. A security
filter is a powerful example, allowing users to only see search results of documents they own even if their query
technically matches other documents that are off limits; we provide an example of a security filter in the section
"Security filters".
You can filter any Lucene search, using the overloaded search methods that accept a Filter parameter.
There are several built-in Filter implementations:
PrefixFilter matches only documents containing terms in a specific field with a specific prefix.
RangeFilter matches only documents containing terms within a specified range of terms.
QueryWrapperFilter uses the results of a query as the searchable document space for a new query.
CachingWrapperFilter is a decorator over another filter, caching its results to increase performance
when used again.
Before you get concerned about mentions of caching results, rest assured that it's done with a tiny data structure
(a BitSet) where each bit position represents a document.
Consider, also, the alternative to using a filter: aggregating required clauses in a BooleanQuery. Now let's
discuss each of the built-in filters as well as the BooleanQuery alternative.
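To give a feel for how small such a bit-per-document structure is, here is a plain-Java sketch that uses only java.util.BitSet and no Lucene classes. BitSetFilterSketch, applyFilter, and the document numbers are invented for illustration; this is a rough model of the idea, not Lucene's implementation.

```java
import java.util.BitSet;

/** Models a filter as one bit per document: a set bit means "allowed". */
final class BitSetFilterSketch {
    /** Keeps only the query hits whose bit is also set in the filter. */
    static BitSet applyFilter(BitSet queryHits, BitSet allowedDocs) {
        BitSet result = (BitSet) queryHits.clone(); // leave the inputs untouched
        result.and(allowedDocs);
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 1000000; // hypothetical index size
        BitSet allowed = new BitSet(maxDoc);
        allowed.set(3);
        allowed.set(7);

        BitSet hits = new BitSet(maxDoc);
        hits.set(3);
        hits.set(4);
        hits.set(7);
        hits.set(9);

        System.out.println(applyFilter(hits, allowed)); // prints {3, 7}
        // One bit per document: roughly maxDoc / 8 bytes of storage.
        System.out.println("approx bytes: " + allowed.size() / 8);
    }
}
```

At one bit per document, a filter over a million documents needs on the order of 125 KB, which is why caching filter results is cheap.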
Using RangeFilter
RangeFilter filters on a range of terms in a specific field. This is actually very useful, depending on the original type
of the field. If the field is a date field, then you get a date range filter. If it's an integer field, you can filter by
numeric range. If the field is simply textual, for example last names, then you can filter for all names within a
certain alphabetic range such as M to Q.
Let's start with date filtering. Having a date field, you filter as shown in testDateFilter() in Listing 1. Our
book data indexes the last modified date of each book data file as a modified field, indexed with
Field.Index.NOT_ANALYZED and Field.Store.YES. We test the date RangeFilter by using an all-inclusive query,
which by itself returns all documents.
Listing 1: Using RangeFilter to filter by date range
public class FilterTest extends LiaTestCase {
  private Query allBooks;
  private IndexSearcher searcher;
  private int numAllBooks;

  protected void setUp() throws Exception { // #1
    super.setUp();
    allBooks = new ConstantScoreRangeQuery(
        "pubmonth",
        "190001",
        "200512",
        true, true);
    searcher = new IndexSearcher(bookDirectory);
    TopDocs hits = searcher.search(allBooks, 20);
    numAllBooks = hits.totalHits;
  }

  public void testDateFilter() throws Exception {
    String jan1 = parseDate("2004-01-01");
    String jan31 = parseDate("2004-01-31");
    String dec31 = parseDate("2004-12-31");

    RangeFilter filter = new RangeFilter("modified", jan1, dec31, true, true);
    TopDocs hits = searcher.search(allBooks, filter, 20);
    assertEquals("all modified in 2004", numAllBooks, hits.scoreDocs.length);

    filter = new RangeFilter("modified", jan1, jan31, true, true);
    hits = searcher.search(allBooks, filter, 20);
    assertEquals("none modified in January", 0, hits.scoreDocs.length);
  }
}
#1 setUp() establishes a baseline count of all the books in our index,
allowing for comparisons when we use an all inclusive date filter.
The first parameter to both of the RangeFilter constructors is the name of a date field in the index. In our
sample data this field name is modified; this field is the last modified date of the source data file. The two final
boolean arguments to the constructor for RangeFilter, includeLower and includeUpper, determine whether the
lower and upper terms should be included or excluded from the filter.
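To make the two boolean flags concrete, here is a small plain-Java sketch that does not use Lucene at all. It assumes terms are compared as ordinary strings, which is why dates need a sortable text format. TermRangeSketch, inRange, and the yyyyMMdd sample values are illustrative only; they are not part of the Lucene API or the book's sample index.

```java
/** Models an inclusive/exclusive range check over string terms. */
final class TermRangeSketch {
    /**
     * Returns true when term lies between lower and upper, where each flag
     * decides whether the corresponding end point itself counts as a match.
     */
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int cmpLower = term.compareTo(lower);
        int cmpUpper = term.compareTo(upper);
        boolean aboveLower = includeLower ? cmpLower >= 0 : cmpLower > 0;
        boolean belowUpper = includeUpper ? cmpUpper <= 0 : cmpUpper < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        String jan1 = "20040101";  // assumed sortable date format
        String dec31 = "20041231";
        System.out.println(inRange("20040615", jan1, dec31, true, true));  // true
        System.out.println(inRange("20041231", jan1, dec31, true, true));  // true
        System.out.println(inRange("20041231", jan1, dec31, true, false)); // false
    }
}
```

The only difference between the last two calls is includeUpper, which decides whether a document dated exactly on the upper bound passes the filter.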
Open-ended range filtering
RangeFilter also supports open-ended ranges. To filter on ranges with one end of the range specified
and the other end open, just pass null for whichever end should be open:
filter = new RangeFilter("modified", null, jan31, false, true);
filter = new RangeFilter("modified", jan1, null, true, false);
RangeFilter provides two static convenience methods to achieve the same thing:
filter = RangeFilter.Less("modified", jan31);
filter = RangeFilter.More("modified", jan1);
Using QueryWrapperFilter
More generically useful than RangeFilter is QueryWrapperFilter. QueryWrapperFilter uses the hits of one query to constrain
available documents from a subsequent search. The result is a DocIdSet representing which documents were
matched from the filtering query. Using a QueryWrapperFilter, we restrict the documents searched to a specific
category:
public void testQueryWrapperFilter() throws Exception {
  TermQuery categoryQuery = new TermQuery(new Term("category", "/philosophy/eastern"));
  Filter categoryFilter = new QueryWrapperFilter(categoryQuery);
  TopDocs hits = searcher.search(allBooks, categoryFilter, 20);
  assertEquals("only tao te ching", 1, hits.scoreDocs.length);
}
Here we're searching for all the books (see setUp() in Listing 1) but constraining the search using a filter for a
category which contains a single book. We explain the last assertion of testQueryWrapperFilter() shortly.
QueryWrapperFilter can even replace RangeFilter usage, although it requires a few more lines of code,
isn't nearly as elegant looking and likely has worse performance. The following code demonstrates date filtering
using a QueryWrapperFilter on a RangeQuery:
public void testQueryWrapperFilterWithRangeQuery() throws Exception {
  String jan1 = parseDate("2004-01-01");
  String dec31 = parseDate("2004-12-31");
  Query rangeQuery = new RangeQuery("modified", jan1, dec31, true, true);
  Filter filter = new QueryWrapperFilter(rangeQuery);
  TopDocs hits = searcher.search(allBooks, filter, 20);
  assertEquals("all of 'em", numAllBooks, hits.scoreDocs.length);
}
Security filters
Another example of document filtering constrains documents with security in mind. Our example assumes
documents are associated with an owner, which is known at indexing time. We index two documents; both have
the term info in their keywords field, but each document has a different owner:
import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class SecurityFilterTest extends TestCase {
  private RAMDirectory directory;

  protected void setUp() throws Exception {
    directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
        new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.LIMITED);

    // Elwood
    Document document = new Document();
    document.add(new Field("owner", "elwood",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("keywords", "elwood's sensitive info",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(document);

    // Jake
    document = new Document();
    document.add(new Field("owner", "jake",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("keywords", "jake's sensitive info",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(document);

    writer.close();
  }
}
Using a TermQuery for info in the keywords field results in both documents found, naturally. Suppose, though,
that Jake is using the search feature in our application, and only documents he owns should be searchable by him.
Quite elegantly, we can easily use a QueryWrapperFilter to constrain the search space to only documents he is
the owner of, as shown in Listing 2.
Listing 2: Securing the search space with a filter
public void testSecurityFilter() throws Exception {
  TermQuery query = new TermQuery(new Term("keywords", "info")); //#1
  IndexSearcher searcher = new IndexSearcher(directory);
  TopDocs hits = searcher.search(query, 10); //#2
  assertEquals("Both documents match", 2, hits.totalHits); //#2

  Filter jakeFilter = new QueryWrapperFilter( //#3
      new TermQuery(new Term("owner", "jake"))); //#3
  hits = searcher.search(query, jakeFilter, 10);
  assertEquals(1, hits.totalHits); //#4
  assertEquals("elwood is safe", //#4
      "jake's sensitive info", //#4
      searcher.doc(hits.scoreDocs[0].doc).get("keywords")); //#4
}
#1 This is a general TermQuery for info.
#2 All documents containing info are returned.
#3 Here, the filter constrains document searches to only documents owned by "jake".
#4 Only Jake's document is returned, using the same info TermQuery.
If your security requirements are this straightforward, where documents can be associated with users or roles
during indexing, using a QueryWrapperFilter will work nicely. However, this scenario is oversimplified for most
needs; the ways that documents are associated with roles may be quite a bit more dynamic.
QueryWrapperFilter is useful only when the filtering constraints are present as field information within the index
itself.
A QueryWrapperFilter alternative
You can constrain a query to a subset of documents another way, by combining the constraining query with the
original query as a required clause of a BooleanQuery. There are a couple of important differences, despite the
fact that the same documents are returned from both. If you use CachingWrapperFilter around your
QueryWrapperFilter, you can cache the set of documents allowed, probably speeding up successive searches
using the same filter. In addition, normalized document scores are unlikely to be the same. The score difference
makes sense when you're looking at the scoring formula. The IDF factor may be dramatically different. When
you're using BooleanQuery aggregation, all documents containing the terms are factored into the equation,
whereas a filter reduces the documents under consideration and impacts the inverse document frequency factor.
This test case demonstrates how to "filter" using BooleanQuery aggregation and illustrates the scoring
difference compared to testQueryWrapperFilter:
public void testFilterAlternative() throws Exception {
  TermQuery categoryQuery = new TermQuery(new Term("category", "/philosophy/eastern"));
  BooleanQuery constrainedQuery = new BooleanQuery();
  constrainedQuery.add(allBooks, BooleanClause.Occur.MUST);
  constrainedQuery.add(categoryQuery, BooleanClause.Occur.MUST);
  TopDocs hits = searcher.search(constrainedQuery, 20);
  assertEquals("only tao te ching", 1, hits.scoreDocs.length);
}
The technique of aggregating a query in this manner works well with QueryParser parsed queries,
allowing users to enter free-form queries yet restricting the set of documents searched by an
API-controlled query.
PrefixFilter
PrefixFilter matches documents containing terms that start with a specified prefix. We can use this to search for all
books published in any year in the 1900s, for example by filtering the pubmonth field (see setUp() in Listing 1) on the prefix 19.
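The prefix idea itself can be sketched in plain Java, without any Lucene classes. This sketch assumes each document has a single pubmonth value in the yyyyMM form seen in Listing 1; PrefixFilterSketch, docsWithPrefix, and the sample values are invented for illustration and do not come from the book's index.

```java
import java.util.BitSet;

/** Models a prefix filter: one bit per document whose term starts with the prefix. */
final class PrefixFilterSketch {
    /** The array index plays the role of the document number. */
    static BitSet docsWithPrefix(String[] pubmonths, String prefix) {
        BitSet bits = new BitSet(pubmonths.length);
        for (int docId = 0; docId < pubmonths.length; docId++) {
            if (pubmonths[docId].startsWith(prefix)) {
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        String[] pubmonths = {"198805", "200403", "199910"}; // made-up sample data
        System.out.println(docsWithPrefix(pubmonths, "19")); // prints {0, 2}
    }
}
```

Because the year comes first in the term text, a two-character prefix is enough to select a whole century.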
Caching filter results
The biggest benefit from filters comes when they are cached and reused, using CachingWrapperFilter, which
takes care of caching automatically (internally using a WeakHashMap, so that dereferenced entries get garbage
collected). You can cache any Filter using CachingWrapperFilter. Filters cache by using the IndexReader as the
key, which means searching should also be done with the same instance of IndexReader to benefit from the cache.
If you aren't constructing IndexReader yourself, but rather are creating an IndexSearcher from a directory, you
must use the same instance of IndexSearcher to benefit from the caching. When index changes need to be
reflected in searches, discard IndexSearcher and IndexReader and reinstantiate.
To demonstrate its usage, we return to the date-range filtering example. We want to use RangeFilter, but we'd like
to benefit from caching to improve performance:
public void testCachingWrapper() throws Exception {
  String jan1 = parseDate("2004-01-01");
  String dec31 = parseDate("2004-12-31");
  RangeFilter dateFilter = new RangeFilter("modified", jan1, dec31, true, true);
  CachingWrapperFilter cachingFilter = new CachingWrapperFilter(dateFilter);
  TopDocs hits = searcher.search(allBooks, cachingFilter, 20);
  assertEquals("all of 'em", numAllBooks, hits.totalHits);
}
Successive uses of the same CachingWrapperFilter instance with the same IndexSearcher instance will
bypass using the wrapped filter, instead using the cached results.
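Why the same reader instance matters can be seen in a rough plain-Java model of a per-reader cache. This is a sketch of the idea described above, not Lucene's actual implementation: CachingFilterSketch and BitsSource are hypothetical names, and a plain Object stands in for the IndexReader.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

/** Caches a wrapped filter's bits per reader; entries vanish once a reader is unreferenced. */
final class CachingFilterSketch {
    /** Hypothetical stand-in for a filter that computes its bits for a reader. */
    interface BitsSource {
        BitSet bits(Object reader);
    }

    private final BitsSource wrapped;
    private final Map<Object, BitSet> cache = new WeakHashMap<Object, BitSet>();
    private int computations = 0;

    CachingFilterSketch(BitsSource wrapped) {
        this.wrapped = wrapped;
    }

    /** Returns cached bits for this reader, computing them only on first use. */
    synchronized BitSet bits(Object reader) {
        BitSet cached = cache.get(reader);
        if (cached == null) {
            computations++;
            cached = wrapped.bits(reader);
            cache.put(reader, cached);
        }
        return cached;
    }

    synchronized int computations() {
        return computations;
    }

    public static void main(String[] args) {
        CachingFilterSketch filter = new CachingFilterSketch(new BitsSource() {
            public BitSet bits(Object reader) {
                BitSet b = new BitSet();
                b.set(1); // pretend only document 1 passes the filter
                return b;
            }
        });
        Object reader = new Object(); // stands in for an IndexReader
        filter.bits(reader);
        filter.bits(reader);          // served from the cache
        System.out.println(filter.computations()); // prints 1
        filter.bits(new Object());    // a different reader misses the cache
        System.out.println(filter.computations()); // prints 2
    }
}
```

A new reader is a new key, so the wrapped filter runs again; that is the price of reopening the index to pick up changes.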
Beyond the built-in filters
Lucene isn't restricted to using the built-in filters.
An additional filter found in the Lucene Sandbox, ChainedFilter, allows for complex chaining of filters.
Writing custom filters allows external data to factor into search constraints; however, a bit of detailed Lucene
API know-how may be required to be highly efficient.
And if these filtering options aren't enough, Lucene adds another interesting use of a filter. The FilteredQuery
filters a query, like IndexSearcher's search(Query, Filter, int) can, except it is itself a query; thus it can
be used as a single clause within a BooleanQuery. Using FilteredQuery seems to make sense only when using
custom filters, but that is for another day.