How do you raise a software bug with a book publisher?
Recently, I bought an eBook which has a bug. I'd like to explain what the bug is, why it is a problem, and how I'm trying to get it corrected.
Amazon sells eBooks in KF8 format. That is an ePub with some proprietary extras. ePub is a standard based on HTML5. You can read the ePub 3 specification but, basically, it is a .zip of HTML files. If you unzip an eBook, you can read the source code behind it.
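If you'd rather not unzip by hand, here is a minimal Python sketch of the same idea. It assumes you already have the book as a plain .epub file (a Kindle download may need converting first), and the function name `read_html_sources` is my own invention, not part of any ePub tooling.

```python
import io
import zipfile

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def read_html_sources(epub):
    """Map each (X)HTML file name in the archive to its source text.

    `epub` may be a file path or a binary file-like object.
    Assumes the documents are UTF-8 encoded.
    """
    with zipfile.ZipFile(epub) as archive:
        return {
            name: archive.read(name).decode("utf-8")
            for name in archive.namelist()
            if name.lower().endswith(HTML_SUFFIXES)
        }


if __name__ == "__main__":
    # Build a tiny stand-in archive in memory so the sketch runs without a real book.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("OEBPS/chapter1.xhtml", "<p>‘But of course!’</p>")
    demo.seek(0)
    for name, source in read_html_sources(demo).items():
        print(name, source)
```

Point it at a real file with `read_html_sources("book.epub")` and you get a dictionary of file names to raw markup.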
When trying to read a Kindle book on a non-Kindle device, I noticed a bug. Some words were not displaying. I took a look at the underlying source code, and found this:
Sometimes words and letters were wrapped with a pagebreak span like this:
<span id="pg5" epub:type="pagebreak">‘But</span> of course!’
When I tried to read the book using KOReader the word "‘But" didn't appear. Why? Let's take a look at the ePub3 specification concerning page breaks:
pagebreak
* A separator denoting the position before which a break occurs between two contiguous pages in a statically paginated version of the content.
* HTML usage context: phrasing and flow content, where the value of the carrying elements title attribute takes precedence over element content for the purposes of representing the pagebreak value
Here's the problem - eBooks can have page numbers. Despite "Page Numbers in eBooks Considered Harmful" lots of publishers still use them. I guess it is kind of useful if you want to refer to something on a printed page - but eReaders allow you to change font size and line spacing, so the concept of a page is somewhat nebulous.
The way the spec is written means that you can write something like:
<span epub:type="pagebreak" id="page_123_a" title="123">123</span>
You use the id for internal linking and the title attribute for the value.
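To make the "title takes precedence over element content" rule concrete, here is a rough Python sketch that pulls the page value out of each pagebreak element. It is only my reading of the spec text quoted above. The names `PageBreakCollector` and `page_values` are hypothetical, and it assumes a pagebreak element contains no nested tags.

```python
from html.parser import HTMLParser


class PageBreakCollector(HTMLParser):
    """Collect (id, page value) pairs for elements with epub:type="pagebreak"."""

    def __init__(self):
        super().__init__()
        self.pages = []        # finished (id, value) pairs
        self._current = None   # attributes of the pagebreak being read, if any
        self._text = ""        # text seen inside that pagebreak

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "pagebreak" in (attrs.get("epub:type") or "").split():
            self._current = attrs
            self._text = ""

    def handle_data(self, data):
        if self._current is not None:
            self._text += data

    def handle_endtag(self, tag):
        # Simplification: the first closing tag ends the pagebreak.
        if self._current is not None:
            # The title attribute wins; element content is only a fallback.
            value = self._current.get("title") or self._text.strip()
            self.pages.append((self._current.get("id"), value))
            self._current = None


def page_values(html):
    collector = PageBreakCollector()
    collector.feed(html)
    collector.close()
    return collector.pages


if __name__ == "__main__":
    sample = '<p><span epub:type="pagebreak" id="page_123_a" title="123">123</span>Lorem</p>'
    print(page_values(sample))
```

Self-closing spans work too, because Python's `HTMLParser` reports them as a start tag immediately followed by an end tag.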
Because of this, most eReaders do not display the physical page number inside the span. It has no semantic content for the reader, and breaks flow. If they did display it, you might end up reading text like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean vel 9 risus at metus molestie tincidunt. Donec aliquet aliquam lorem, ...
So KOReader deliberately ignores any text which is wrapped with epub:type="pagebreak".
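This is not KOReader's actual code, but a toy Python sketch of a renderer that behaves the same way shows why the word vanished. The names `PageBreakStripper` and `visible_text` are made up for illustration, and it assumes pagebreak elements contain no nested tags.

```python
from html.parser import HTMLParser


class PageBreakStripper(HTMLParser):
    """Keep text for display, except text inside epub:type="pagebreak" elements."""

    def __init__(self):
        super().__init__()
        self.visible = []      # chunks of text a reader would see
        self._skip_tag = None  # tag name of the pagebreak we are inside, if any

    def handle_starttag(self, tag, attrs):
        if "pagebreak" in (dict(attrs).get("epub:type") or "").split():
            self._skip_tag = tag

    def handle_endtag(self, tag):
        if tag == self._skip_tag:
            self._skip_tag = None

    def handle_data(self, data):
        if self._skip_tag is None:
            self.visible.append(data)


def visible_text(html):
    stripper = PageBreakStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.visible)


if __name__ == "__main__":
    line = '<p><span id="pg5" epub:type="pagebreak">‘But</span> of course!’</p>'
    print(repr(visible_text(line)))
```

Feed it the line from the book and the opening word is dropped along with the page marker, leaving only " of course!’".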
If you look at the example ePubs provided by the International Digital Publishing Forum, you'll see that the majority of their spans are self-closing:
<span epub:type="pagebreak" id="p123-a" title="123"/>
Very occasionally, you see something which has just a page number in it:
<span epub:type="pagebreak" title="169" id="Page_169">169</span>
I checked with KOReader and they confirmed that they were following the spec. I agree with them. There's no reason to wrap readable content in a metadata span like that.
I've checked several other books from different publishers. None of them abuse pagebreaks this way. I think Penguin Random House are doing it wrong and I would like to correct them.
Reporting it
I've previously reported buggy ebooks to vendors. But because the Kindle app doesn't exhibit this problem, I thought it was futile going via Amazon. So I thought I'd try going directly to the publisher.
Sadly, Penguin UK's GitHub repo is dead. Their dedicated digital publishing team haven't tweeted in 2 years.
Their contact page has a suggestion for what to do if there is an error in an eBook:
It is best if you return the book to the original bookshop from which it was purchased; they should be happy to exchange it for a perfect copy. If you have any difficulty with this then please return the cover and title page of the book to us.
Hmmm....
I dropped them an email, and got back this very reasonable reply:
Thank you for reaching out and bringing this to our attention. The distinction you’ve made is already a part of our specification; this was an oversight which we’re looking into as a result of your input—it bypassed checks both because it validates and also passes visual checks on all the major platforms we screen for. This isn’t reflective of our entire library and should be limited to specific titles which we’re currently investigating.
That's fair enough. The rendering quirk is specification compliant - but hard to spot because of the Kindle monoculture.
Change the spec, change the world
I've made a suggestion on GitHub that the spec should be clarified. I don't think it's particularly obvious that content in a pagebreak may not be displayed.
Most resources agree that the content of a pagebreak should either be blank, or be the page number.
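That rule is simple enough to check mechanically. Here is a rough Python sketch of a lint that flags any pagebreak whose content is neither blank nor a page number. The "looks like a page number" test (digits or Roman numerals) is my own heuristic rather than anything from the spec, the names are hypothetical, and it assumes pagebreak elements contain no nested tags.

```python
import re
from html.parser import HTMLParser

# Heuristic assumption: page numbers are digits or Roman numerals.
PAGE_NUMBER = re.compile(r"[0-9]+|[ivxlcdm]+", re.IGNORECASE)


class PageBreakLinter(HTMLParser):
    """Record pagebreak elements whose text is neither blank nor a page number."""

    def __init__(self):
        super().__init__()
        self.suspicious = []   # (id, offending text) pairs
        self._current = None   # attributes of the pagebreak being read, if any
        self._text = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "pagebreak" in (attrs.get("epub:type") or "").split():
            self._current = attrs
            self._text = ""

    def handle_data(self, data):
        if self._current is not None:
            self._text += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        text = self._text.strip()
        is_fine = (
            not text
            or text == self._current.get("title")
            or PAGE_NUMBER.fullmatch(text) is not None
        )
        if not is_fine:
            self.suspicious.append((self._current.get("id"), text))
        self._current = None


def suspicious_pagebreaks(html):
    linter = PageBreakLinter()
    linter.feed(html)
    linter.close()
    return linter.suspicious


if __name__ == "__main__":
    print(suspicious_pagebreaks('<span id="pg5" epub:type="pagebreak">‘But</span> of course!’'))
```

The heuristic will let a few real words through (anything spelled only with the letters of Roman numerals), so treat a clean result as a hint rather than proof.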
If you include the page numbers as text content within a span or div, the pages will be more easily accessible to both sighted users and users using assistive technologies. This method has been employed in previous DAISY standards. The potential downside, however, is that mainstream user agents will not provide equivalent functionality to turn off unwanted content, forcing users to hear and view the page numbers.
Digital Accessible Information System (DAISY)
Whose fault is it anyway?
This is a tricky one. I think Penguin have undoubtedly made a mistake with the way they publish ePubs. But, so far, KOReader is the only rendering engine I've found which suppresses the content of pagebreaks by default.
Generally speaking, a user wouldn't want to display page numbers on an eBook. Software could have a user defined toggle to switch them on or off. Luckily, KOReader has a variety of style-sheets for rendering eBooks - so I picked one which displayed pagebreak content.
Software is hard.