Following my previous post about updating the publishing platform of this blog, I realized that I dug myself into a hole. The new workflow was pretty sweet. To the point where I wrote my blog posts a lot more frequently than before, as you can probably tell.
The problem is that I want to edit and process the blog post inside Google Docs, where I have a great workflow for editing, reviews, collaboration, etc. And then I want to push that same document to the blog. The killer for me is that I want that to be a smooth process, and the end text should fit into the blog. That means, if I want to emphasize something, it should show up in the blog as bold. And if I want to write some code, that should work as well. In fact, the reason that I started this process is that it got so annoying to post code to the blog.
I’m using Google Docs’ export functionality to get the HTML back, and I did some basic cleaning to get it blog-ready instead of being focused on visual fidelity. I was using HTML Agility Pack to do that, and it turned out to be the wrong tool for the job. The issue is that it processed the data as if it were an XML document. I actually have quite a track record with XML, so that wasn’t the issue. The problem is that I wanted to do a series of non-trivial things with the HTML, and there aren’t any off-the-shelf facilities to do that in .NET that I could find.
For example, given how important it is to me to show code snippets properly, I wanted to be able to grab them from the document, figure out what language I’m actually using there, and syntax highlight it properly. I couldn’t find anything like that in .NET; all the libraries I found were for JavaScript.
You know the adage, “Let’s rewrite it in Rust”? I rewrote my entire publishing process in JavaScript. Which then led me to another adventure. How can I do two contrary things? When I’m writing this document, I want to be able to just write the code. When I publish it, I want to see the syntax highlighted code, properly formatted and working.
Google Docs has support for writing code blocks inline (for a small number of languages), which is great for the editing process. However, the HTML that this generates is beyond atrocious. What is even worse, the exported HTML doesn’t align things properly with a fixed-width font, etc. In other words, it is almost there, but not quite.
When analyzing the Google Docs output, I noticed a couple of funny characters in the code output. Here is what it looks like. I believe this is a bug in the export process, probably related to the way code blocks work in Google Docs.
Dear Googlers, if you are reading this, please make a note that this thing now falls under Hyrum's Law. It is observable behavior, and I’m relying on it to do important tasks. Don’t break this in the future.
It turns out these are actually a pair of Unicode characters. More specifically, they are Unicode characters that are marked for private use:
- 0xEC03 - appears to be used to mark the beginning of a code block
- 0xEC02 - appears to be used to mark the end of a code block
Note the “appears”, and my blatant disregard for things like software maintenance discipline and all things proper and good in the world of Computer Science. This is a project where there are no rules, there is one customer, and he can code 🙂.
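To make the marker hunt a bit more concrete, here is a minimal sketch of how one might scan an exported document for private-use characters. The two code points come from the observations above; the helper names (isPrivateUse, findPrivateUseChars) and the sample string are made up for illustration, and the range check covers only the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF).

```javascript
// Marker code points as observed in the export (an observation, not a documented contract).
const CODE_START = 0xEC03;
const CODE_END = 0xEC02;

// True when the code point falls in the BMP Private Use Area (U+E000..U+F8FF).
function isPrivateUse(codePoint) {
  return codePoint >= 0xE000 && codePoint <= 0xF8FF;
}

// Hypothetical helper: list every private-use character in the text with its
// offset, so unexpected markers stand out.
function findPrivateUseChars(text) {
  const hits = [];
  let offset = 0;
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (isPrivateUse(codePoint)) {
      const role =
        codePoint === CODE_START ? "start" :
        codePoint === CODE_END ? "end" : "unknown";
      hits.push({ offset, hex: "U+" + codePoint.toString(16).toUpperCase(), role });
    }
    offset += ch.length; // advance by UTF-16 code units
  }
  return hits;
}

// Made-up sample: one code block wrapped in the two markers.
const sample = "Intro\n\uEC03let x = 1;\uEC02\nOutro";
console.log(findPrivateUseChars(sample));
```

Anything reported with the role "unknown" would be a hint that the export uses more private-use markers than the two I noticed.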
As mentioned earlier, while extracting the Google Doc as HTML and processing it, I encounter those Unicode markers that delineate the code section. This is good, because in terms of HTML itself, what it is doing inside is a… mess. Getting the actual text as it is supposed to be is not easy. So I exported the file again, as text. Those markers show up in the text version as well, which made things a lot easier for me.
With all of this done, allow me to show you some truly horrifying beautiful code:
let blocks = [];
// Find every code block in the text export, delimited by the private-use markers.
for (const match of text.data.matchAll(/\uEC03(.*?)\uEC02/gs)) {
    const code = match[1].trim();
    // Guess the language, then highlight at publish time.
    const lang = flourite(code, { shiki: true, noUnknown: true }).language;
    const formattedCode = Prism.highlight(code, Prism.languages[lang], lang);
    blocks.push("<hr/><pre class='line-numbers language-" + lang + "'>" +
        "<code class='line-numbers language-" + lang + "'>" +
        formattedCode + "</code></pre><hr/>");
}

let codeSegmentIndex = 0;
let inCodeSegment = false;
htmlDoc.findAll().forEach(e => {
    const text = e.getText().trim();
    if (text == "\uEC03") { // start marker: swap in the formatted block
        e.replaceWith(blocks[codeSegmentIndex++]);
        inCodeSegment = true;
    }
    if (inCodeSegment) { // discard everything inside the code segment
        e.extract();
    }
    if (text == "\uEC02") { // end marker: back to normal content
        inCodeSegment = false;
    }
})
That isn’t a lot of code, but it does plenty. We scan through the textual version of the document and find all the code blocks using a regular expression. We then try to figure out what language I’m using and apply code formatting during the publication process (this saves the need to change anything on the blog, which is nice, especially since we have to take into account syndication).
I push the code snippets into an array and then I process the actual HTML document using the DOM and find all the code snippets. I replace the start marker with the actual formatted code and continue to discard all the other elements until I hit the end of the code segment. The rest of the code remains pretty much the same as before.
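The replace-then-discard walk is easier to see without a DOM in the way. Below is a minimal sketch of the same idea over a flat array of strings. This is not the code from my publishing script: spliceCodeBlocks and the sample data are hypothetical, and it assumes each marker sits alone in its own element.

```javascript
const START = "\uEC03"; // assumed start-of-code marker
const END = "\uEC02";   // assumed end-of-code marker

// Hypothetical stand-in for the DOM walk: `elements` is a flat array of
// element texts, `blocks` holds the pre-rendered HTML for each code block.
function spliceCodeBlocks(elements, blocks) {
  const output = [];
  let blockIndex = 0;
  let inCodeSegment = false;
  for (const raw of elements) {
    const text = raw.trim();
    if (text === START) {
      output.push(blocks[blockIndex++]); // start marker becomes the rendered block
      inCodeSegment = true;
    } else if (text === END) {
      inCodeSegment = false; // end marker is dropped, normal flow resumes
    } else if (!inCodeSegment) {
      output.push(raw); // ordinary content passes through untouched
    }
    // anything else sits inside a code segment and is discarded
  }
  return output;
}

const result = spliceCodeBlocks(
  ["<p>Before</p>", START, "let x", "= 1;", END, "<p>After</p>"],
  ["<pre>let x = 1;</pre>"]
);
console.log(result);
```

The raw fragments between the markers are thrown away on purpose, since the formatted block was already built from the text export.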
I was writing this in VS Code, and Copilot suggested the following code for handling images:
htmlDoc.findAll('img').forEach(img => {
    if (img.attrs.hasOwnProperty('src')) {
        let src = img.attrs.src;
        let imgName = src.split('/').pop();
        // Look up the image bytes among the export's entries by file name.
        let imgData = entries.find(e => e.entryName === 'images/' + imgName).getData();
        let imgType = imgName.split('.').pop();
        // Inline the image as a base64 data URI.
        let imgSrc = 'data:image/' + imgType + ';base64,' + imgData.toString('base64');
        img.replaceWith('<img src="' + imgSrc + '" style="float: right"/>');
    }
})
In other words, instead of uploading the images as separate files, I can just encode them into the blog post directly. I like that idea very much because it means that I don’t have to store the images elsewhere.
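For what it's worth, here is a small standalone sketch of the data URI step on its own, assuming Node.js (for Buffer). The toDataUri helper and the extension table are my own illustration rather than part of the suggested code. One difference from the snippet above: it maps "jpg" to "jpeg", since "image/jpg" is not a registered MIME type even though browsers tend to accept it.

```javascript
// Extension to MIME subtype. This table is an assumption and far from exhaustive.
const MIME_SUBTYPES = { jpg: "jpeg", jpeg: "jpeg", png: "png", gif: "gif", svg: "svg+xml" };

// Hypothetical helper: build a base64 data URI from a file name and its bytes.
function toDataUri(fileName, data) {
  const ext = fileName.split(".").pop().toLowerCase();
  const subtype = MIME_SUBTYPES[ext];
  if (!subtype) {
    throw new Error("Unsupported image type: " + ext);
  }
  return "data:image/" + subtype + ";base64," + Buffer.from(data).toString("base64");
}

// The first three bytes of a JPEG file, just to show the shape of the result.
console.log(toDataUri("image1.jpg", Buffer.from([0xff, 0xd8, 0xff])));
// prints "data:image/jpeg;base64,/9j/"
```

Keep in mind that base64 inflates the payload by about a third, so this works best for small images.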
Given that I don’t have any npm packages to abandon, I don’t know if I can call myself a JavaScript developer, but I did put the full code up for people to take a peek and then recoil.