If you've been reading the forums, you've seen a few references to how migrating our data from Universal Bulletin Board (UBB) (Perl) to JForum (Java) was the majority of the work. Ernest Friedman-Hill suggested an article on it. The idea to write the article is his; the words are mine. (As a side note, I handled the user and forum/thread migration. Ulf handled migrating the thread "watches." While technically the thread watches are data, I'm going to be focusing on the rest. A lot of the analysis was done by other moderators on previous attempts at this project. I don't remember who, so I can't give credit where it is due.)
UBB stores its data in text files, while JForum uses a database. Needless to say, these are very different!
How big is it?
Some interesting stats:
|Item||KB in UBB||KB in JForum|
|Threads (and post data)||6,767,764||1,293,640|
|Item||# Lines of Code*||# Classes||# Commits to SVN|
|Threads (and post data)||1,040||4||21|
The first thing we did was analyze what data we needed to migrate. I created a text file listing all the JForum tables and what, if anything, we needed to pre-fill them with. The ones involving data migration were forums, topics (threads), posts/data, private messages and watches. We decided the private messages looked like more effort to migrate than they were worth, so we migrated everything else.
It looked faster to manually create the forums than to write a migration script. We used the JForum UI to create the initial forum list. Then I did a database export to get a CSV (comma separated value) file containing that data. I manually edited it so the forum IDs and category IDs matched UBB. Then I imported the table back into the database. This made migrating the threads/posts easier since we could rely on the forum number staying the same. The forum migration turned out to be the easiest part of this process, and was done fairly quickly.
Since users had to exist before threads could be migrated, they came next. I created a spreadsheet listing both the UBB and JForum data. The first sheet mapped JForum fields to where we get the data from and what they should be originally populated with. The other sheet listed the UBB rows and what they meant. This analysis largely came from the last attempts at this project. I then turned this analysis into a Java program that looped through all the UBB users and created JForum ones. The program had some control logic so we could vary things like:
- User range to migrate – useful for migrating in subsets
- How often to log output to the screen – Smaller groups are useful at the beginning when migrating fewer users and overkill when migrating more than 100,000.
- Whether to add or update the database – We were running this script a number of times, both to test it and for the phased migration. Since the logic is different for adds and updates, this flag was a failsafe to make sure we weren't accidentally overwriting users.
- How many failures to allow before aborting – A number of the users migrated first were corrupt, but we wanted to keep going past them. However, if something went wrong (like using the wrong add/update flag), every user would fail, and you certainly don't want more than 100,000 failures.
- Whether we are in debug mode – At the beginning, it was useful to see many details to see where things were going wrong. I added a failsafe that you could only migrate one user in debug mode so we couldn't accidentally kick off the script with thousands of users.
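The control logic in the list above could look something like the following Java sketch. The class, field, and method names (`MigrationOptions`, `UserMigrationRunner`, `UserMigrator`) are hypothetical and not taken from the actual migration program, and the per-user work is left as a stand-in interface. It is only meant to show how the range, logging interval, add/update flag, failure limit, and debug failsafe fit together.

```java
/** Hypothetical run settings; all names here are illustrative assumptions. */
final class MigrationOptions {
    final int firstUserId;   // start of the user range (inclusive)
    final int lastUserId;    // end of the user range (inclusive)
    final int logEvery;      // print progress after this many users
    final boolean update;    // false = add new users, true = update existing ones
    final int maxFailures;   // number of failures tolerated before aborting
    final boolean debug;     // verbose mode, restricted to a single user

    MigrationOptions(int firstUserId, int lastUserId, int logEvery,
                     boolean update, int maxFailures, boolean debug) {
        if (lastUserId < firstUserId) {
            throw new IllegalArgumentException("empty user range");
        }
        if (logEvery <= 0) {
            throw new IllegalArgumentException("logEvery must be positive");
        }
        // Failsafe: debug mode may only be used with exactly one user.
        if (debug && firstUserId != lastUserId) {
            throw new IllegalArgumentException("debug mode migrates exactly one user");
        }
        this.firstUserId = firstUserId;
        this.lastUserId = lastUserId;
        this.logEvery = logEvery;
        this.update = update;
        this.maxFailures = maxFailures;
        this.debug = debug;
    }
}

final class UserMigrationRunner {
    /** Stand-in for the real per-user work (read the UBB data, write the JForum row). */
    interface UserMigrator {
        void migrate(int userId, boolean update) throws Exception;
    }

    /** Returns the number of failed users; throws once failures exceed the limit. */
    static int run(MigrationOptions options, UserMigrator migrator) {
        int failures = 0;
        int processed = 0;
        for (int id = options.firstUserId; id <= options.lastUserId; id++) {
            try {
                migrator.migrate(id, options.update);
            } catch (Exception e) {
                failures++;
                System.err.println("user " + id + " failed: " + e.getMessage());
                if (failures > options.maxFailures) {
                    throw new IllegalStateException("aborting after " + failures + " failures");
                }
            }
            processed++;
            if (processed % options.logEvery == 0) {
                System.out.println("processed " + processed + " users");
            }
        }
        return failures;
    }
}
```

A run over users 1 to 1000 that adds rows, logs every 100 users, and tolerates 50 failures would then be `UserMigrationRunner.run(new MigrationOptions(1, 1000, 100, false, 50, false), migrator)`.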
As long as it was, the approach for migrating users worked so I did the same thing for threads. The spreadsheet was more complicated because the data in each UBB file went into three tables. It didn't help that the UBB files were in three different formats over time and we had to be able to migrate all of them. The last tab of the spreadsheet maps the UBB and JForum tags to HTML. UBB "helpfully" stored the data as raw HTML so I had to write the regular expressions to convert them back to JForum tags. Luckily I like regular expressions.
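To give a flavor of the regular expression work, here is a minimal sketch of converting stored HTML back to forum tags. It assumes BBCode-style tags such as `[b]` and `[url=...]` on the JForum side and a few simple HTML shapes on the UBB side. The real UBB markup differed across its three file formats, so the patterns and the `HtmlToBbCode` class name are illustrative assumptions rather than the expressions actually used.

```java
import java.util.regex.Pattern;

/** Illustrative converter; the patterns cover only a few assumed HTML shapes. */
final class HtmlToBbCode {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Reluctant quantifiers (.*?) keep each match to the nearest closing tag.
    private static final Pattern BOLD = Pattern.compile("<b>(.*?)</b>", FLAGS);
    private static final Pattern ITALIC = Pattern.compile("<i>(.*?)</i>", FLAGS);
    private static final Pattern LINK =
            Pattern.compile("<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>", FLAGS);
    private static final Pattern LINE_BREAK = Pattern.compile("<br\\s*/?>", FLAGS);

    /** Replaces the HTML shapes above with their tag equivalents. */
    static String convert(String html) {
        String text = html;
        text = BOLD.matcher(text).replaceAll("[b]$1[/b]");
        text = ITALIC.matcher(text).replaceAll("[i]$1[/i]");
        text = LINK.matcher(text).replaceAll("[url=$1]$2[/url]");
        text = LINE_BREAK.matcher(text).replaceAll("\n");
        return text;
    }
}
```

For example, `HtmlToBbCode.convert("<b>hi</b>")` yields `[b]hi[/b]`. Nested or malformed markup would need more care than this sketch takes.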
Naturally, I tested on my local machine as I went to make sure things seemed decent. I copied a couple data files to my local machine. Once those appeared to work and I thought I was "done", I tried running the program on our test server against the full mass of data. Through this, I found and fixed more problems. And during our internal testing we found still more. The last problem detected was that accent marks weren't making it through in user profiles. Since it was so close to production, we decided to ask people to change this manually. I have to say that my favorite issue was the ONE thread in all of UBB that had Chinese characters in the subject. Normally, the moderators would ask the poster to change it. However, this thread was in the Moderators Only forum. Since there was only one instance of this, I went looking for a moderator with a Chinese font installed to move that text to the body of the post.
How long did it take?
We ran in four phases on the production server:
- Birth of JavaRanch (October 31st) – This got the bulk of data onto the new server. This took a couple days to run. Luckily it could be run unattended since it wrote errors to a file, so I left it on overnight.
- Moderators Only forum data from November 1st (November 29th) – Any data created or edited between the mass migration and the Moderators Only cutover. This took a few minutes since it was just one forum.
- All forums from November 1st (December 20th) – We ran an extra incremental migration Christmas weekend so migration day would go faster. It turned out that this step only took fifteen minutes so we didn't save much time.
- All forums from December 20th (January 3rd) – Final migration. This step took only a few minutes to run and a good while to troubleshoot. More on that in the next section.
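One way to pick out the data for an incremental phase like those above is to compare each thread file's last-modified time against the previous phase's cutoff. The sketch below is an assumption about how such a selection could work, not a description of the actual script. The `IncrementalSelector` class and its input map of file names to timestamps are hypothetical.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative selection of files touched since the previous migration phase. */
final class IncrementalSelector {
    /**
     * Returns the names of files whose last-modified time is at or after
     * the cutoff, sorted by name so that reruns process them in a stable order.
     */
    static List<String> changedSince(Map<String, Instant> lastModifiedByFile, Instant cutoff) {
        List<String> changed = new ArrayList<>();
        // TreeMap copy gives name order regardless of the input map type.
        for (Map.Entry<String, Instant> entry : new TreeMap<>(lastModifiedByFile).entrySet()) {
            if (!entry.getValue().isBefore(cutoff)) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

Because edited threads are picked up along with new ones, this kind of phase pairs naturally with the add/update flag: rows that already exist get updated rather than added again.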
Any production problems?
We ran into one problem during the production run on January 3rd. The Word Association thread has more than 5000 posts in it and was too large for my migration script to handle. We ran it a couple times against just that thread and kept getting "connection reset" errors when trying to look up the member id (old posts stored a username.) After trying a couple things, the "solution" was a simple hack:
static Map CACHE_USER = new HashMap();
CACHE_USER.put(userName, result);
Many of the old posts were made by the same set of people. Looking up their data once saved enough queries to make the thread migration a success. While this is a quick and dirty cache, it's all we needed here! If you read Ernest's "status e-mail" from the deployment, this was the rooster we were trying to save rather than have to put down. It was a relief that we could spare the thread and not lose a piece of ranch history.
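For readers who want to see the idea in a fuller form, here is a minimal sketch of the same kind of quick cache. It assumes the lookup maps a username to a numeric member id. The `CachedUserLookup` class and the injected lookup function are hypothetical stand-ins for the real database query, which is not shown in this article.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative username-to-member-id cache wrapped around a slower lookup. */
final class CachedUserLookup {
    private final Map<String, Integer> cache = new HashMap<>();
    private final Function<String, Integer> databaseLookup; // stand-in for the real query

    CachedUserLookup(Function<String, Integer> databaseLookup) {
        this.databaseLookup = databaseLookup;
    }

    /**
     * Returns the member id for a username, querying only on the first request.
     * If the lookup returns null, nothing is cached and the next call queries again.
     */
    Integer memberId(String userName) {
        return cache.computeIfAbsent(userName, databaseLookup);
    }
}
```

In a thread with thousands of posts from a small group of regulars, each distinct username costs one query instead of one query per post. Like the original hack, this is not thread-safe and never evicts entries, which is fine for a one-off migration run.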