Data Abuse Hurts Everyone

Yesterday was spent moving an old client's Lotus Notes-based intranet (I use that term loosely) over to Drupal.  Actually, the entire last two weeks have been focused on this to some extent, but yesterday was spent primarily bringing across a significant part of the old data.

Old data is always a minefield just waiting to blow up in your face, but it can also be a chance for developers to share new and old ideas with each other, and to build solutions that, while they may never use them again, add to their toolboxes in the way of knowledge.  This project was no different, and so we set about pulling the data across.

The Challenge:

Lotus Notes (or at least this version of it) exports data in what it calls a Structured Data format.  This essentially means that it is:

  1. Flat Text
  2. Completely Unusable...

I may be overstating number two a bit (obviously, since we managed to work with it), but by the standards we're used to today... it was unfriendly at best.  This "structured data" looks something like the following:

FirstName: Kris
MiddleInitial: L
LastName: Vanderwater
FullName: Kris L Vanderwater
Etc: etc
Etc: etc

You get the idea...  Now, to make matters worse, while everything was at least labeled in some form or fashion, the fields were not all in the same order from one record to the next.  Furthermore, there were no formal starting or ending marks for records, and we had roughly 130,000 lines of this to parse through.  Luckily, with a bit of looking, we learned that every record did at least end with its own "comment:" field, which made finding the end of records much easier; we could sort out the rest later.

Now, my father is a whiz with vim (vi), and I don't claim to be that savvy there yet, but with a little bit of vim's find-and-replace pattern-matching magic, he managed to take all the fields in the document and turn them into (almost) formal XML tags:

<FirstName>Kris</FirstName>
<MiddleInitial>L</MiddleInitial>
<Etc>etc</Etc>
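
I won't pretend to remember the exact command he used, but a vim substitution along these lines does the trick, assuming every field sits on its own "Name: value" line:

:%s/^\([A-Za-z]\+\):\s*\(.*\)$/<\1>\2<\/\1>/

The first capture group grabs the field name, the second grabs the value, and the replacement wraps the value in opening and closing tags built from the name.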

With just a little extra work, we hunted down all the individual elements that kept this document from validating as real XML, and we then had a nice semi-structured document that defined all of our data.  While we were at it, we used the comment field to add formal tags around each record, and then manually added a single containing tag around the whole document (more of that valid XML stuff).
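
The end result was shaped roughly like this (the tag names here are just illustrative, not necessarily the ones we used):

<records>
  <record>
    <FirstName>Kris</FirstName>
    <MiddleInitial>L</MiddleInitial>
    <LastName>Vanderwater</LastName>
    <comment>...</comment>
  </record>
  <record>
    ...
  </record>
</records>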

All these things led us to one of my personal favorite technologies... XSLT.  If you're not yet familiar with XSLT, stop what you're doing and take a few hours to go learn the basics.  Chances are you'll take a few days instead, because it'll be too much fun to put down... or maybe I'm just twisted like that.  In any case, our ultimate goal is to have all of this content in a Drupal system, where we can leverage all the wonderful tools we know and love against said data.  However, since our end point is Drupal 6, we have a couple of extra steps to take.  Namely, we have to create a Drupal 5 install so that we can make use of the node import module.

If you haven't used node import, it's worth looking at.  It maps CSV (or TSV) documents to CCK fields, and then imports the contents of the CSV file into node after node after node, allowing you to quickly re-purpose old content for new uses.  Unfortunately... we aren't there yet.  As I said before, our document's fields are not in any proper order.  They're all similar, but sometimes fields are switched, and furthermore, there's an awful lot of content for which we have no use (largely metadata that was Notes-specific).  And so we get down and dirty with the XML/XSLT.

XSLT is a technology that effectively lets you map XML tags to... well, any other sort of content you might desire.  Now, there are a number of different ways to make use of this technology.  Most modern browsers recognize XSLT stylesheets (and have for ages) and will do client-side transformations of the XML content, with no need for a server or for heavyweight PHP code to manually transform every element within our 130,000-line document.  The browser, obviously, still has to parse all of this, but as I said, down and dirty, and the browser is pretty snappy about it, so we move forward.
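
Hooking a stylesheet up for the browser takes nothing more than a processing instruction at the top of the XML document (the filename here is just a placeholder):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="transform.xsl"?>
<records>
  ...
</records>

Open that file in the browser, and it fetches transform.xsl and renders the transformed result.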

First we take a look at all the elements we'll no longer be needing... and we transform them to nothing.  In the same step, we use our transformation to re-organize the content we DO want into another XML document.  That document is very similar to the one already in our possession, except there are no longer any unwanted tags, and all the desired tags are in the same order from one record to the next.  This first phase of our transformation process yields our second XML document, which we then transform a second time... outputting our final CSV document.
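
Here's a rough sketch of both stylesheets, again with illustrative field names rather than the real ones:

<?xml version="1.0"?>
<!-- Phase one: rebuild each record with only the fields we want, in a fixed
     order.  Anything not explicitly copied (all that Notes-specific metadata)
     simply vanishes. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/records">
    <records>
      <xsl:for-each select="record">
        <record>
          <FirstName><xsl:value-of select="FirstName"/></FirstName>
          <LastName><xsl:value-of select="LastName"/></LastName>
          <FullName><xsl:value-of select="FullName"/></FullName>
        </record>
      </xsl:for-each>
    </records>
  </xsl:template>
</xsl:stylesheet>

And the second pass, which walks the now-uniform records and spits out plain text:

<?xml version="1.0"?>
<!-- Phase two: text output, one comma-separated line per record.  (Values
     containing commas would need quoting, which is left out of this sketch.) -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/records">
    <xsl:for-each select="record">
      <xsl:value-of select="FirstName"/><xsl:text>,</xsl:text>
      <xsl:value-of select="LastName"/><xsl:text>,</xsl:text>
      <xsl:value-of select="FullName"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>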

With our CSV document in hand, we can quickly import our content into Drupal 5, and then relatively easily upgrade that system to Drupal 6.  Ultimately it was an interesting ride, and I thought others in the community might enjoy a description of it.  If you've had some nasty data abuse to deal with lately, we'd love to hear about it and trade solutions.  As I said before, these sorts of experiences are almost always single use cases, but it's nice to see how other people do it.

As a side note, we're in the process of putting that data together in a couple of different useful ways.  One that became plausible today is importing a copy of the data into each individual's personal contact list in their email client until we finish the final product.  Once again, our transformation to a nice, clean XML document makes it easy to output whatever format might be needed.