February 4th, 2012

"I beard the First Lady say, 'Oh my Cod.'"

Try googling that. Go ahead.

The top result is a peek into the Google Books edition of the HarperCollins reprint of Gail Collins' _excellent_ book about political gossip, _Scorpion Tongues_ (review forthcoming at some point. Probably). I love Gail Collins. You should read her columns and her books. This is actually a great book that is suffering from a very, very, very bad edition.

HarperCollins, however, really ought to be ashamed of itself for perpetrating an ebook of this low quality. That subhead _ought_ to read, "I heard the First Lady say, 'Oh my God.'"

And it's one of a very, very long list of errors which give every evidence of having been introduced by OCRing a p-book, _failing_ to do even minimal QA on it (like spell checking) and sending it out into the world in multiple e-book formats. While spell check wouldn't have found this particular error, it would have found these errors: "discreedy" for "discreetly", "romande" for "romance", "coundess" for "countless". And the process of doing spell check should have made clear the need to improve the rest of their process.
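To make the spell check point concrete, here is a minimal sketch of the kind of check I mean. The word list is a tiny hypothetical stand-in for a real dictionary, and `flag_unknown_words` is a name made up for illustration; I have no idea what tooling HarperCollins or its vendors actually use.

```python
import difflib
import re

# Hypothetical stand-in for a real dictionary; an actual check would
# load a full word list instead of this handful of words.
KNOWN_WORDS = sorted({
    "i", "heard", "beard", "the", "first", "lady", "say", "oh", "my",
    "god", "cod", "discreetly", "romance", "countless",
})

# Words, optionally with one internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def flag_unknown_words(text, known=KNOWN_WORDS):
    """Return (word, suggestions) pairs for words not in the word list."""
    known_set = set(known)
    flagged = []
    for word in WORD_RE.findall(text):
        lower = word.lower()
        if lower not in known_set:
            # Offer the closest known word, if any is reasonably similar.
            suggestions = difflib.get_close_matches(lower, known, n=1, cutoff=0.7)
            flagged.append((word, suggestions))
    return flagged


# OCR damage that produces non-words gets flagged.
print(flag_unknown_words("discreedy romande coundess"))
# OCR damage that produces real words sails straight through.
print(flag_unknown_words("I beard the First Lady say, 'Oh my Cod.'"))
```

The second call comes back empty, which is the limitation noted above: "beard" and "Cod" are real words, so only a human proofreader (or a much smarter tool) would catch them.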

If anyone out there knows what is going on at HarperCollins, specifically, who is doing their conversion and why this stuff is sliding through the process, I'd love to hear the details. I'm currently in search of anyone else who has researched this issue. It makes a mockery of the self-published-books-are-poor-quality argument.

Excavating the backlist conversion process


The article seems like reasonable coverage; however, I'm going to point to some comments.

Doug says, "Backlist conversion are a pain to do correctly, because it starts with an OCR scan. It takes some careful proofing to clean up the scan errors, and even then you’ll end up with some errors left in." Which, I might add, I'm prepared to forgive -- but not hosts of spellchecker catchable errors in addition to leftovers from the scan + proof process.

That supports my hypothesis that they are in fact doing this via OCR off a p-copy [ETA: almost certainly an error on my part. I think they are doing OCR off a printer-ready e-copy] NOT by converting an underlying electronic file. It doesn't completely answer the question of how something escaped the proofing process with as many problems as Collins' book.

A debate ensues over Doug's further assertion about whether it is worth bothering (he doesn't think it is, as a general rule), which is interesting, but not directly relevant to the OCR/process question. It is, however, relevant to the publisher's probable attitude towards customers of their backlist (Not Positive, shall we say?).

ETA: I suppose the business model might be, wait until someone complains, then fix what they complained about, on the theory that if no one is complaining, why waste the time/energy/resources/money on making the rest of it right? I wonder what I think of that. I'm a big fan of worse/cheap is better and 80/20 stuff, but this just feels like contempt for the customer and I _don't_ approve of that.



Detailed discussion of what's involved in turning a p-book into an e-book. The author presumes a level of competence and quality control which makes it reasonable to pick on public domain, scanned-and-proofed-by-volunteers books like _Bleak House_ for substituting "goat" for "gout", and concluding that portion of the diatribe with, "There is no substitute for a good proofreader checking the text against the original." Wow. If _only_ the world were that good. I'm just looking to have the obvious and distracting errors beaten back by, say, an order of magnitude. This author assumes publishers are hiring this work out, and about half the article is devoted to explaining how you can sort available services, why you shouldn't assume you can do it yourself with some tools you found online and similar.

Edited still more: a quick glance around elance suggests there are people around the world eager to bid on small conversion jobs.

And more: One -- but only one -- of the problems is that the underlying digital format for many p-books is a printer-ready digital file, frequently a pdf. This pdf has ligatures in it (makes sense for producing a p-book; e-books not so much unless they understand the representation of the ligature -- which they won't). Amazing hoops must be jumped through to back-convert the ligatures to "ordinary" text. Sometimes. Seems to depend on the ligature in question and the target format.
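For the easy case, where the ligature comes out of the pdf as one of the standard Unicode ligature characters, the back-conversion can be sketched in a few lines. This is only an illustration under that assumption: `expand_ligatures` is a made-up name, and ligatures that live in a custom font with no Unicode mapping (presumably the "amazing hoops" cases) are not handled at all.

```python
# Standard Latin ligature code points from Unicode's Alphabetic
# Presentation Forms block, mapped back to ordinary letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace single-character ligatures with their plain-letter spelling."""
    return text.translate(_LIGATURE_TABLE)


# "first office", with the fi and ffi written as ligature characters.
print(expand_ligatures("\ufb01rst o\ufb03ce"))
```

Python's `unicodedata.normalize("NFKC", text)` would expand these too, but it also rewrites other compatibility characters (superscripts, for example), so a targeted table seems like the safer choice for book text.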

The world will always need compiler engineers who don't feel much like actually working on compilers.

Edited still more: http://publishingperspectives.com/2011/08/error-free-ebooks/

Nice set of quotes from relevant parties. I'm currently trying to figure out if anyone in the process is doing consistently better or worse than anyone else, or if this is Just One of Those Things that is waiting for better tools/automation.

"The worst affected seemed to be even older books, such as Ayn Rand’s Atlas Shrugged, where a reader complained that the number “1” and the letter “I” were often switched."

I have not mentioned this, but in _Scorpion Tongues_, Al Gore appears at least three times as A1 Gore. It's a stunningly comprehensive collection of errors (which is what it would take to get me to complain, because I'm such a booster of e-reading in general, I'll downplay all negative aspects to ensure the transition sticks).
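The 1-for-I (and 1-for-l) swap is exactly the kind of thing a cheap automated pass could surface for a human to review. A rough sketch, assuming the goal is just to list word-like tokens that mix letters with the digits 0 or 1; the function name is made up, and this would also flag legitimate tokens, so it is a review aid and not a fixer.

```python
import re

# Tokens made only of letters and the digits 0/1 that contain at least
# one letter AND at least one of those digits, e.g. "A1" or "c0untless".
SUSPECT_TOKEN = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*[01])[A-Za-z01]+\b")


def find_digit_letter_mixups(text):
    """Return tokens that look like letters and 0/1 digits got swapped."""
    return SUSPECT_TOKEN.findall(text)


print(find_digit_letter_mixups("A1 Gore ran in 2000, and I agree."))
```

Plain numbers like "2000" and plain words are left alone; only the mixed tokens are reported.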

The author compares error free books to cars which drive themselves. I find this sort of ironic in the era of the Google Car, and stand by my assertion that what's really needed to deal with this problem is better tools. If tools are only getting you 90% of the way through the process (and so many of the problems being complained about in online fora involve trying to pipeline tools that don't integrate with each other to deal with some fiddly problem that Should Just Work), then yeah, the output is going to be dog shit. And not healthy, firm, easily picked-up-with-a-plastic-bag dog shit.

Cool Feature!

Every detail page on Amazon for a kindle edition has a Feedback box with the option to report bad formatting. Sweet! It's been there for months if not years and I just never noticed. I'll remember it now, though.

I discovered it via this:


Which is worth reading through -- the comments are good, too. I've been really lucky. This is the first book I've read on the kindle (and I have hundreds) that had anything like this level of problem. I haven't mentioned it, but _Scorpion Tongues_ also suffers from the italics-not-turning-off and bold-not-turning-off problems described by commenters as occurring in other books.

I suppose I shouldn't be surprised. I remember mass market paperback runs where an entire signature was missing/duplicated. Bad enough in a short story collection, but horrifying in a novel. And there were plenty more books that weren't that obviously unreadable, but had missing and duplicated lines of text. I know this isn't new. It's just a real shock to run into it in something that the publisher thinks I should be paying $10.99 for.

Self-Driving Cars

I just blogged about someone who asserted an equivalency between error free books and self-driving cars. I feel like we could probably improve the quality of some ebooks out there now (either by throwing more manual labor at it, a la websites in the mid 1990s, or improved tools, a la websites in the 2000s), but error free is unlikely unless you're using some technical definition of error free (no one can find the remaining errors, unless they are (a) trained professionals or equivalently skilled and motivated amateurs and (b) consuming the book in an effort to find the errors, not to actually read the book. That's a hypothetical technical definition of error free which I might accept as attainable.).

I also feel like we already _have_ a prototype self-driving car (that google thing) and also? Check out this coverage:


Not as snotty as the headline. Here, a bit of a rah rah piece:


And at the NYT, of course, they get into the legal issues:


The third article is referred to by the second article. These are arbitrarily selected examples of very recent coverage: every gadget blog in existence appears to have covered this in the wake of CES and many of them are pointed at by the second article mentioned above.

I'm working on a theory about "legal issues". I think "legal issues" exist for two reasons, the first of which is obvious and offensive to mention (no, not for the lawyers' benefit; that's just a nice side benefit of preserving that portion of the status quo which has not completely petrified in place). The second is to create an uncertain environment in which eager, aggressive entrepreneurs and early adopters can play around and maybe come up with something compelling for the rest of us. It needs to be uncertain, so that if what they come up with is dangerous, scary or too disruptive, we can shut it down and require a do-over. But it needs to exist in some form, or we would all just petrify in place (and the disruption that eventually ensued would be really bad).

Once the entrepreneurs/early adopters have demonstrated that the new thing could work, is desirable, and won't be horribly disruptive, legal issues are resolved somehow and then we're all stuck with quality issues to make it palatable to the mass market. Which is where we are with e-books.

This is my theory, anyway, as to the public policy purpose of "legal issues" as a standard delay in innovation.

Support from the NYT article for this theory:

“Why would you even put money into developing it?” he asked. “I see this as a huge barrier to this technology unless there are some policy ways around it” — though he noted that there were precedents for Congress adopting such policies.

"Legal issues" force people to answer questions like, "Hey yeah it'd be cool, but how are you gonna make money/get people to let you use it IRL/etc."

I do think some of the issues raised in the article are just _begging_ for mockery. If we were all really that worried about tons of metal hurtling towards each other, there are all kinds of things we could do to address that problem (mandatory interlocks on vehicles to ensure the operator at least had access to someone with no or low-alcohol breath and a valid license, speed governors on all cars, lower speed limits, lighter cars, separation of trucks from passenger vehicles, jersey barriers or medians separating high speed traffic everywhere, more frequent (or any) vehicle checks to ensure they are in good working order, etc.) that we don't do because we've decided the current tradeoff is better than those changes.

Super cool that Brad Templeton got a funny into the NYT! The world has become nerdly wonderful.

ETA: A reasonably obvious next step:


Self-driving, electric, oriented towards delivering packages.

Lazy Saturday

R. went up to Mayberry to pick up the generator, which finally came in. Once he got home, I went out to help him remove it from the van (it's under 300 pounds, thus doable between the two of us). Once on the ground, it has wheels and he somehow managed to get it where he wanted it to be. I don't ask questions. I do, however, appreciate the results.

T. and I went grocery shopping. I should know better than to go grocery shopping on Saturday, but worse, this is Super Bowl weekend. Duh. Serious brain fart going to the store today. Oh well.

Where's My Water (the Swampy game on the iPad and other platforms) has a lot of levels. Something over 150 plus some additional puzzle levels. T. seems to be experimenting with trying to play the whole thing through in one day (we have to help with a few levels where the timing is critical, but he can do virtually all of them). Battery is an issue.

R. took A. to the playground and they are now sleeping.

In the meantime, I've (obviously) been wasting time online. In between poking at ebook formatting issues and trying to figure out under what circumstances I would be willing to sit in a vehicle which did not require a human operator and/or purchase one, I went over to the Baen books store and picked up a free copy of Schmitz's Telzey Amberdon. I've done this before, but not bothered to keep copies on each kindle as I upgrade. Today's experience was pretty amazing: Baen is all set up to email it (not any big surprise there) with great directions on how. And then Amazon sent me an email saying they're backing up the content for me this time. Of course, when I buy a book from Baen ebooks, they also remember that I bought it and have it available for re-download when I want that.

This is the way the world should work. It took a few years to get here (and clearly, we have a few years to go in terms of digitizing past content, but that's universally true), but this is a fairly reasonable environment for e-reading. I can shop somewhere else, and it Just Works with library management at my hardware vendor (Amazon) and primary content supplier (ditto). True, it is off in the Personal Documents section, but that I can live with.