Once upon a time, in the early 1990s, when I was working for DEC, I volunteered to compare, word by word, a scanned-in text of _Can Such Things Be_ against the book (book and text file supplied to me) and correct the text file to match (OCR not being perfect back in the day). This was fun. It was part of the Wiretap project and, for a while, was included in Project Gutenberg's works, but it doesn't seem to be anymore (they have a newer version).

Should you be curious to view the results, they can be found here:


It can also be found elsewhere.

(I had a different last name back then, when I was married to my first husband.)

As a result of this experience, I learned a variety of things, but for the purposes of this post, what matters is that I learned scanned-in books have a lot of things that need to be fixed.

Some time before that, while in college, I did a little work in a robotics lab (yeah, really), partly on text-to-speech (among other things, I added a module so that if the robot ran across a Roman numeral, it would read it correctly, instead of saying something like L-V-I-I-I. Or worse.). In the course of doing that, I learned a little about the then-current state of the art of speech-to-text and boy howdy, was I unimpressed.

Time goes by. You can now say, "Two" to an evil phone menu and sometimes it will recognize it (and if your toddler yells in the background, your selection will be misunderstood and will hopefully generate a "Sorry, didn't quite catch that" rather than being taken as a request for Spanish). I do get that OCR has improved, and that you can do all kinds of massaging to try to "understand" what's on the page via spell-checking and grammar-checking and all that stuff. There is, unfortunately, a big problem with applying spell-checkers and grammar-checkers to out-of-print books, particularly old ones. A little quote from Coleridge should make this clear:

"Water, water, every where
Nor any drop to drink."

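Any reasonable grammar checker, and possibly some spell checkers, is going to "clean that up" in a way that does damage to the text: "every where" is exactly what Coleridge wrote, and a checker tuned to modern usage will want to close it up into one word.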
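To make the damage concrete, here is a toy sketch. The rule list is entirely hypothetical and stands in for whatever a real checker does internally; the point is only the difference between silently rewriting the text and merely flagging a spot for a human to look at.

```python
import re

# Hypothetical "modern usage" rules standing in for a real grammar checker.
MODERNIZING_RULES = [
    (re.compile(r"\bevery where\b"), "everywhere"),
    (re.compile(r"\bto-day\b"), "today"),
]

QUOTE = "Water, water, every where\nNor any drop to drink."


def auto_correct(text):
    """Apply every rule blindly, the way an unattended cleanup pass would."""
    for pattern, replacement in MODERNIZING_RULES:
        text = pattern.sub(replacement, text)
    return text


def flag_only(text):
    """Leave the text alone; report (offset, original, suggestion) tuples."""
    flags = []
    for pattern, replacement in MODERNIZING_RULES:
        for match in pattern.finditer(text):
            flags.append((match.start(), match.group(0), replacement))
    return sorted(flags)


print(auto_correct(QUOTE))  # the poem no longer matches the printed book
print(flag_only(QUOTE))     # the poem is untouched; a person decides
```

Flagging keeps the text faithful, but it puts a person back in the loop, which is precisely the word-by-word labor that scanning whole libraries is supposed to avoid.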

How, then, is it possible to search the text of Google Books? Well, my guess is that this is just Beautiful Magic. They've got a Pretty Darn Good idea of what the text says, and that's what they pattern-match your search against. But they don't ever _show_ you their idea of what the text says. They just map the match back to the scan, and show you the entire scanned page. Hopefully, any errors they made will be really hard to detect as a result. Beautiful Magic.
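Here is a minimal sketch of how that kind of search could work. To be clear, this is my guess rendered as code, not a description of how Google actually does it; the function names and the sample pages are invented. The OCR text is used only to build an index from words to page numbers, and a search hands back page numbers, which is all you need to go fetch the scanned image.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z']+")


def build_index(ocr_pages):
    """Map each word in the (imperfect) OCR text to the pages it appears on."""
    index = defaultdict(set)
    for page_number, text in ocr_pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(page_number)
    return index


def search(index, query):
    """Return the page numbers whose OCR text contains every query word."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    pages = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(pages)


# Invented sample: page 2 has a typical OCR slip ("dr0p" for "drop").
ocr_pages = {
    1: "Water, water, every where",
    2: "Nor any dr0p to drink.",
}
index = build_index(ocr_pages)

print(search(index, "water"))  # pages to display as scanned images
print(search(index, "drop"))   # the OCR slip means this page is never found
```

The reader only ever sees the scan, so the "dr0p" on page 2 never embarrasses anyone. The cost of the error is a missed search hit, and nobody notices a hit they didn't get.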

I assume (and this may be a large assumption) that the version of Google Books available on the iPhone, Android and others is essentially the same as what I see on my laptop: an image, more or less, the scanned-in image of the text. This cannot be done on the Kindle for a couple of reasons. First, I don't think even the new display is up to the task (altho I could be wrong about that). Second, the cost of sending that crap over Sprint's EVDO network would rapidly cause problems for the Sprint/Amazon partnership. There's no charge to the Kindle user to download Amazon's DRMed Mobi/PRC/wtf files because that shit is tiny. You start schlepping images around and everyone sits down to have a little chat about Cost. A chat that may or may not need to happen with all-you-can-eat plans on iPhones. I have no idea how those work. I do know that the EVDO plan for my card, when I had it, and the one for my Centro now both have some Very Interesting small print.

Obviously, it's easy enough now to take any Mobi/PRC/wtf book file (so, basically, any of the Project Gutenberg-like stuff) and stick it on your Kindle and read it (just use the freaking USB cable). Plenty of people already have. But the Google plan to scan entire academic libraries sort of brings the game to a new level. I would love to hear any ideas about how Google Books could become something that might work on the Kindle. Handwaving around the scanned image vs. text problem is amusing, but not tremendously helpful. I might buy a detailed technical explanation of how to massage OCR to get the text good enough -- but it's going to have to include innovation from the last decade-ish (because the technology did not exist around the time I retired). And it has to explain why Google would store and display the scanned image instead of showing you the textual interpretation. If it's good enough for the Kindle, it should be good enough for Google to display.
