May 28th, 2008

Musings on Data

No, not the character from TNG.

Once upon a time, when I worked for a living as virtuous people do, I spent a couple years dealing with huge piles of data describing books. I will refrain from mentioning where the data came from or where I worked at the time; most of my regular readers have a pretty good idea already. The data was not (originally/ever) intended to be viewed by an Ordinary Customer Who Wants to Buy a Book. It was intended for retail (brick and mortar is the classic term of art) bookstore employees answering questions asked by an Ordinary Customer. The data was, not to put too fine a point on it, crap. It was riddled with spelling errors, date of release errors, page number errors, weight amount errors, dimension errors, binding description errors. I could go on, but I'm already bored.

One of my early tasks was to make it possible for a room full of people to read e-mail complaining about these errors and then correct them. The goal was to produce information suitable for viewing by Ordinary Customers Who Want to Buy a Book, so they could buy a book from us. It did not take long (by reasonable metrics; it felt like forever) to produce a database that was actually a lot more up to date, useful and, arguably, less error-ridden than other services that charged money to do something similar (and which shall, similarly, remain nameless). Certainly, we were cheaper, since we showed people our information for free, figuring they'd buy books from us, and didn't charge, say, libraries or researchers or whatever to also use that data for purposes best known to themselves.

Despite passing this (actually kind of surprisingly) low bar of quality, during my entire tenure at that job and for many years thereafter (quite probably continuing till today), people continued to complain bitterly about inaccuracies in that data. Justifiably so. I mean, really, like it's that hard to spell Sarah Ban Breathnach's name, and to realize that the "last name" happens to consist of two words, rather than one. (Oh, don't get me started.)

While I was working on these steaming piles of ... data, I was also watching a steaming pile of ... money grow in the form of stock options. That resulted in my dreams of house/condo lust blossoming into a lot of time spent on the then-very-new online version of NWMLS, where I attempted to figure out what I could afford, where I could afford it, and whether that matched what I wanted and where I wanted it. After visiting numerous open houses without benefit of agent, I eventually acquired an agent who turned out to be useful: she listened to my description of what I wanted and eventually found me the condo I bought, before it was listed on any MLS. The online NWMLS of 1996 and 1997 was a frustrating experience, though one worth continuing with. I _believe_ I was sympathetic about it, but I probably spent a lot of time screaming at it. Which is only fair. I screamed at my job and my data there as well.

Fast forward 10ish years. (OMG -- wait, I've already commented recently about how it's 2008, so everyone, presumably, knows.) I am once again hanging out on online MLS sites (Zillow, a NE Coldwell Banker site, etc.) and related real estate information sites (RealtyTrac). While the data has improved dramatically (the pictures are often really great and quite useful when I've compared them to a walkthrough of a place), it is still flawed. Interestingly, one of the biggest flaws lies in a piece of information that everyone expects you to know a lot about and builds quite a lot of derived data on: square footage.

Back in the day (and I only found this out recently), NWMLS only listed finished square footage. Supposedly. I knew then, and I know now, to be skeptical of square footage claims, but during the boom, square footage inflation got tricky enough that NWMLS now lists three numbers: finished, unfinished and approximate total. You can see where having all these different numbers would tend to muck up a derived number like price/square foot. Out here in NE, R. claims that Massachusetts says your square footage can only include finished space. But R., like everyone else on the planet, has a source confusion problem. While I'm _reasonably_ confident that that was once the rule, I question whether (a) everyone -- or anyone -- adhered to it, much less (b) continues to do so. Lately, the real estate blogs have been all over it, and everyone seems to think this is (a) a problem and (b) fixable.

I question both.

When you sign on a house, you aren't buying square footage, any more than you are buying a "title" when you buy a book. In every important sense, you want the house, not the number; you want the contents, not the words on the cover. (Yes, I've thought of several exceptions, too.) This is a rule of thumb thing, and as a weary wrangler of crappy data, I am to some degree tempted to say, Enough! Use it as a first cut and don't get too wrapped up in the details. As long as you found Simple Abundance, you shouldn't care too much whether it was listed as Breathnach, Sarah Ban or Ban Breathnach, Sarah -- unless you _ARE_ Sarah, in which case I commiserate with you on our collective incompetence and will do what little I can to correct it and apologize profusely when the correction proves inadequately sticky and the problem returns.

Most of the major problems associated with square footage inflation are a result of people building complicated derivative theories and calculations on top of those numbers, so if we could all just knock it off, we'd be happier.

But we won't, will we?

In the meantime, I've been really amused to learn a couple of particular oddities, like the fact that "officially" square footage is calculated by measuring the external foundation and then multiplying by the number of livable floors. Guess what happens if you have a cathedral ceiling or a two-story library which has a balcony for a (very limited) second floor? And Black and Decker makes a sonic measuring device that'll supposedly tell you the "real" room area and volume with a couple of point and clicks.
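
For the arithmetically inclined, here is a rough sketch of how that "official" rule plays out. The house, its dimensions, and its price are all made up, and the function names are mine. I'm assuming the rule really is exterior footprint times livable floors, and I'm pretending the only thing wrong with the result is the hole in the second floor.

```python
# Hypothetical house: every number below is invented for illustration.

def official_sqft(width_ft, depth_ft, livable_floors):
    """Exterior foundation footprint times the number of livable floors."""
    return width_ft * depth_ft * livable_floors


def price_per_sqft(price, sqft):
    """The derived number everyone likes to build theories on."""
    return price / sqft


footprint_width, footprint_depth = 30, 40  # exterior foundation, in feet
floors = 2
open_to_below = 20 * 15  # two-story library: second-floor "area" that is just air
price = 480_000

official = official_sqft(footprint_width, footprint_depth, floors)
# Crude estimate of floor you can stand on; ignores wall thickness entirely.
walkable = official - open_to_below

print(official, walkable)  # 2400 2100
print(round(price_per_sqft(price, official)),
      round(price_per_sqft(price, walkable)))  # 200 229
```

Same house, same price, and the price/square foot moves by about 15% depending on which number you divide by. That's before anyone argues about what counts as finished.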

And in case it wasn't obvious, the theme here is: Problems What Happen When You Take Crappy Data that Is Useful to Experts Helping Ordinary Folk and Show It to the Ordinary Folk