Giuda Ballerino @blog Thanks for your work! Reply | Reply to original comment on kolektiva.social 2025-10-19 13:49
A few days ago, someone called PixelMelt published a way for Amazon’s customers to download their purchased books without DRM. Well… sort of.
In their post “How I Reversed Amazon’s Kindle Web Obfuscation Because Their App Sucked” they describe the process of spoofing a web browser, downloading a bunch of JSON files, reconstructing the obfuscated SVGs used to draw individual letters, and running OCR on them to extract text.
But the harder problem was with the OCR. The code was designed to visually centre each extracted glyph. That gives a nice amount of whitespace around the character which makes it easier for OCR to run. The only problem is that some characters are ambiguous when centred:
When I ran the code, lots of full-stops became midpoints, commas became apostrophes, and various other characters went a bit wonky.
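A toy illustration of why centring loses information. This is not the original code: the 5×5 bitmaps and the `centre_glyph` helper are made up for the example, with 1 meaning ink and 0 meaning background. Once each glyph is cropped to its ink and centred, a full stop and a midpoint become the same image.

```python
def centre_glyph(bitmap, size):
    """Crop a glyph to its inked pixels and centre it in a size x size square."""
    ink = [(r, c) for r, row in enumerate(bitmap) for c, px in enumerate(row) if px]
    top = min(r for r, _ in ink)
    left = min(c for _, c in ink)
    height = max(r for r, _ in ink) - top + 1
    width = max(c for _, c in ink) - left + 1
    out = [[0] * size for _ in range(size)]
    row_offset = (size - height) // 2
    col_offset = (size - width) // 2
    for r, c in ink:
        out[r - top + row_offset][c - left + col_offset] = 1
    return out


def dot_at(row, size=5):
    """A blank square with a single inked pixel in the middle column of `row`."""
    bitmap = [[0] * size for _ in range(size)]
    bitmap[row][size // 2] = 1
    return bitmap


full_stop = dot_at(4)  # ink sits on the baseline
midpoint = dot_at(2)   # ink sits half-way up the line

# Different before centring, identical afterwards.
print(full_stop == midpoint)
print(centre_glyph(full_stop, 5) == centre_glyph(midpoint, 5))
```

The vertical position is the only thing that separates the two characters, and centring throws it away.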
That made the output rather hard to read. This was compounded by the way line breaks were treated. Modern eBooks are designed to be reflowable – no matter the size of your screen, lines should only break on a new paragraph. The extracted text instead had a forced line break at the end of every displayed line, rather than at the end of each paragraph.
I decided that OCRing an entire page would yield better results than single characters. I was (mostly) right. Here’s what a typical page looks like after de-obfuscation and reconstruction:
As you can see – the typesetting is good for the body text, but skew-whiff for the title. Bold and italics are preserved. There are no links or images.
As in the original code, I took the SVG path of the character and rendered it as a monochrome PNG. Rather than centring the glyph, I used the height and width provided in the glyphs.json file. That gave me a directory full of individual letters, numbers, punctuation marks, and ligatures. These were named by fontKey (bold, italic, normal, etc).
The page_data_0_4.json file gives the width and height of the page. I created a white PNG with the same dimensions. The individual characters could then be placed on that.
In the page_data_0_4.json each run of text has a fontKey – which allows the correct glyph to be selected. There’s also a fontSize parameter. Most text seems to be (the ludicrously precise) 19.800001. If a font had a different size, I temporarily scaled the glyph in proportion to 19.8.
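The scaling step could look something like this. It is a minimal sketch rather than the code I actually ran: the helper name and the rounding are assumptions, and only the 19.8 base size comes from the data.

```python
BASE_FONT_SIZE = 19.8  # the size most runs use (stored as 19.800001)


def scaled_glyph_size(width, height, font_size, base=BASE_FONT_SIZE):
    """Return the (width, height) in pixels a glyph should be drawn at.

    Hypothetical helper: glyphs are assumed to be rendered at the base size,
    so a run with a different fontSize is scaled by font_size / base.
    """
    scale = font_size / base
    # Never let a glyph collapse to zero pixels after rounding.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(scaled_glyph_size(10, 20, 19.800001))  # body text, effectively unscaled
print(scaled_glyph_size(10, 20, 39.6))       # a run at twice the base size
```

Treating 19.800001 and 19.8 as the same size falls out of the rounding, so there is no need to special-case the odd precision.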
Each glyph has an associated xPosition, along with a transform which gives X and Y offsets. That allows for indenting and other text layouts.
Once every character from that page had been extracted, resized, and placed – the page was saved as a monochrome PNG.
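Here is a rough sketch of the placement logic, using plain lists instead of real images so it runs without any libraries. The data layout is an assumption made for illustration, not the real Kindle schema: `run["transform"]` is taken to be an (x offset, y offset) pair, `run["glyphs"]` a list of dicts with an `id` and an `xPosition`, and the final x coordinate is assumed to be the offset plus the xPosition.

```python
WHITE, BLACK = 1, 0


def blank_page(width, height):
    """A white page as a list of rows, standing in for the white PNG."""
    return [[WHITE] * width for _ in range(height)]


def paste_glyph(page, glyph, x, y):
    """Copy the black pixels of a glyph bitmap on to the page at (x, y).

    Pixels that would fall outside the page are skipped.
    """
    for row_index, row in enumerate(glyph):
        for col_index, pixel in enumerate(row):
            px, py = x + col_index, y + row_index
            if pixel == BLACK and 0 <= py < len(page) and 0 <= px < len(page[0]):
                page[py][px] = BLACK


def place_run(page, glyph_bitmaps, run):
    """Place every glyph of one text run (hypothetical run layout, see above)."""
    x_offset, y_offset = run["transform"]
    for glyph in run["glyphs"]:
        bitmap = glyph_bitmaps[glyph["id"]]
        paste_glyph(page, bitmap, round(x_offset + glyph["xPosition"]), round(y_offset))


# Two tiny made-up glyphs: a single dot and a two-pixel vertical bar.
bitmaps = {"dot": [[BLACK]], "bar": [[BLACK], [BLACK]]}
run = {
    "transform": (1, 1),
    "glyphs": [{"id": "dot", "xPosition": 0}, {"id": "bar", "xPosition": 3}],
}
page = blank_page(6, 3)
place_run(page, bitmaps, run)
for row in page:
    print("".join("#" if pixel == BLACK else "." for pixel in row))
```

A real version would paste PNGs with an image library, but the coordinate arithmetic is the same idea.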
For a more useful HTML style layout, the hOCR output can be used: tesseract page_0022.png output -l eng hocr
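The hOCR file is HTML in which each recognised word carries its bounding box, so the positions can be read back out. The sketch below uses only Python's standard library. It assumes the usual hOCR conventions that Tesseract writes: words are `span` elements with class `ocrx_word`, and the box is the `bbox x0 y0 x1 y1` entry in the `title` attribute. The sample markup is hand-written for the example, not real output from a book.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Pull (x0, y0, x1, y1) out of an hOCR title such as 'bbox 1 2 3 4; x_wconf 95'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Inline tags nested inside a word (e.g. <em>) are simply passed over.
        if tag == "span" and self._in_word:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._in_word = False


sample = (
    "<span class='ocr_line' title='bbox 10 20 300 40'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 10 20 60 40; x_wconf 96'>Hello</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 70 20 130 40; x_wconf 91'><em>world</em></span>"
    "</span>"
)
parser = HocrWords()
parser.feed(sample)
print(parser.words)
```

The left edge of the first word on each line is one possible way to spot an indent, although I haven't tested that on a whole book.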
Images aren’t downloaded. I took a brief look and, while there are links to them in the metadata, they’re downloaded as encrypted blobs. I’m not clever enough to do anything with them.
The OCR can’t pick out semantic meaning. Chapter headings and footnotes are rendered the same way as body text.
This is very far from perfect. It can give you a visually similar layout to a book you have purchased from Amazon. But it won’t be reflowable.
Processing all the JSON files and OCRing all the images is relatively quick. But tweaking and assembling is still fairly manual.
Personally, I’ve just stopped buying books from Amazon. I find that Kobo is often cheaper and their DRM is easy to bypass. But if you have many books trapped in Amazon – or a book is only published there – this is a barely adequate way to liberate it for your personal use.
2 thoughts on “Improving PixelMelt’s Kindle Web Deobfuscator”
Firstly, the downloader was hard-coded to only work with the .com site. That fix was simple – do a search and replace on amazon.com with amazon.co.uk. Easy!
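If you would rather script the search and replace than do it in an editor, something like this works on the source text. The function name is made up, and which files need changing depends on the copy of the downloader you have.

```python
def retarget_store(source, old_domain="amazon.com", new_domain="amazon.co.uk"):
    """Swap the hard-coded store domain in a piece of source code.

    Returns the new text and the number of replacements made, so you can
    check that something actually changed.
    """
    count = source.count(old_domain)
    return source.replace(old_domain, new_domain), count
```

Running it twice is harmless, because amazon.co.uk does not contain amazon.com.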
The characters were then pasted on to the blank page.
Tesseract 5 is a fast, modern, and reasonably accurate OCR engine for Linux.
Running tesseract page_0022.png output -l eng produced a .txt file with all the text extracted.
Or, a PDF with embedded text: tesseract page_0022.png output -l eng pdf
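To run the same commands over a whole directory of pages, a small wrapper helps. This is a sketch: the function names are mine, and it assumes the tesseract binary is on your PATH. The three formats are the ones used in this post (plain text, pdf, and hocr).

```python
import subprocess


def tesseract_command(image, output_base, output_format="txt", lang="eng"):
    """Build the argument list for one Tesseract run.

    Plain text is the default output, so only pdf and hocr need the extra
    config name on the end of the command.
    """
    if output_format not in {"txt", "pdf", "hocr"}:
        raise ValueError(f"unsupported output format: {output_format}")
    command = ["tesseract", image, output_base, "-l", lang]
    if output_format != "txt":
        command.append(output_format)
    return command


def run_ocr(image, output_base, output_format="txt"):
    """Run Tesseract and raise if it exits with an error."""
    subprocess.run(tesseract_command(image, output_base, output_format), check=True)


print(" ".join(tesseract_command("page_0022.png", "output", "pdf")))
```

Passing the arguments as a list avoids any trouble with spaces in file names.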
OCR isn’t infallible. Even with a high resolution image and a clear font, there were some errors.
Layout is flat. The image of the page might have an indent, but the outputted text won’t.
The text will be reasonably accurate. But there will be plenty of mistakes.
You can get an HTML layout with hOCR. But it will be missing formatting and links.
Tanquist @blog @pluralistic I’m about half-way through Cory Doctorow’s Enshittification. Your work belongs with other examples he cites of attempts to un-enshittify our world. From generic ink suppliers bypassing the chip readers in HP printers to phone hacks for gig workers, I admire the work of the resistance! Reply | Reply to original comment on masto.ai 2025-10-19 13:26


