Join Oren Eini, CEO of RavenDB, as he explores the design and implementation of RavenDB’s indexing engine Corax, its impact on indexing and query performance, and how the engine addresses common challenges such as slow data retrieval, high hosting expenses, and sluggish development processes. You’ll also gain valuable insights into the architecture's performance costs and its ability to unlock efficiency in data handling.
This happened a few minutes ago: I got a call from an unknown number. It turned out to be my wife’s work number, and she called to ask me an urgent question, it seems:
“Can you tell me how to compress a PDF file?” she asked.
For the next part, it might be better if I paint you the whole picture. Imagine bullet time, where everything slows down, and I start to analyze the question and my possible answer. The following thoughts run through my mind during that time.
- PDF files are already compressed by default.
- Pretty sure that the file format is already using compression.
- You could strip unneeded elements from the file; removing fonts is one example, I think.
- If there are images, you can probably downscale or re-sample them to reduce their size.
- What about just running this through Zip?
- Where did this question come from?
That took about two seconds in real time. The decision tree for any possible answer here grew exponentially. I had to make a call.
“No, that isn’t easily possible,” I answered.
I got some more details as well.
“This is for uploading a document to the XYZ system, it only accepts up to 4MB files, but this PDF is 5.5MB. I guess I can just scan this document as two separate pages instead of one, right?”
With a workaround found, and a detailed dive into lossless vs. lossy compression and file format choices avoided, I agreed that this was probably the best option and finished my coffee, pondering the ethical dilemma of answering the actual question versus the intended question.
I’m currently playing with a Secret Project (code-named Hugin right now) and for that purpose, I literally ordered all the available Raspberry Pis in Israel. That last statement sounds like a joke, but we checked six to eight places, and our order quantity exceeds the inventory in the country. They are flying the units to us as you read this.
I would love to hear what you think I’m doing, by the way. Please share your thoughts on the matter in the comments.
For Hugin, I’m playing with Pi Zero 2 W, which is about the size of a lighter. They are small, and somewhat underpowered, but really cool. They also run RavenDB surprisingly well, but I’ll touch on that in a later post.
The drawback of the Zero is that it basically has two usable ports: a micro USB (on-the-go) port and a mini-HDMI port. There is another micro USB port for power, but for actually doing stuff with the device, it’s just those two. If you are like me, you have more micro USB power cables than you know what to do with. However, micro USB on-the-go connectors and mini-HDMI cables are far rarer these days.
I want this to be useful and easy, so I started thinking about how I could make it simpler to work with. Then I realized that the Zero model I’m using (2 W) has built-in wifi, and that meant that I could start getting smart. The idea is that we can turn the Zero into an access point, so all you’ll need is to plug it into power (using a micro USB cable you likely already have), wait half a minute, and connect to the machine.
Once I had the idea, I delved deep into figuring out how to make it work. I managed, and the entire process is pretty simple from a user perspective, but making it work was anything but.
For the rest of this post, I will be working with the Raspberry Pi Zero 2 W, using Raspberry Pi OS Lite (Legacy, 32 bits) (Debian Bullseye). I tested this on a range of Pis (I apparently got lots, from Raspberry Pi 3 B to the Raspberry Pi 400), and it worked on everything I tried.
I actually tried quite hard to get it working on the Raspberry Pi OS (the non-legacy, which is Debian Bookworm). However, I couldn’t get it to behave the way I wanted it to. Setting up a wifi hotspot on Bookworm is easy, but getting it to bind DNS and DHCP to a particular device was beyond my capabilities.
From my reading, it doesn’t look like I’m the only one running into issues here.
The basic idea is that upon connecting to a WiFi network, most devices will check for connectivity and display the captive portal page if needed. In this case, we simply serve our application as the captive portal page. Hence, the only thing you need to do is connect to the hotspot, and everything else is handled for you.
This blog post was really helpful figuring things out.
How this works, however, is a whole other matter. I’m assuming that you are running on a clean slate, booting for the first time on the clean image of Raspberry Pi Lite (Bullseye). The first thing to do is to set up the wifi, DNS, and DHCP, like so:
sudo raspi-config nonint do_wifi_country IL
sudo rfkill unblock wifi
sudo apt-get install -y nginx dnsmasq dhcpcd
We first set up the country for wifi, unblock it, and install nginx, dnsmasq, and dhcpcd. Our next step is to update /etc/wpa_supplicant/wpa_supplicant.conf to create the actual hotspot:
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country=IL
network={
ssid="MyHotSpot"
mode=2
key_mgmt=NONE
frequency=2412
}
We define the MyHotSpot network as an open (key_mgmt=NONE) access point (mode=2). We need to plug this into the DHCP configuration in /etc/dhcpcd.conf:
hostname
clientid
persistent
option rapid_commit
option domain_name_servers, domain_name, domain_search, host_name
option classless_static_routes
option interface_mtu
require dhcp_server_identifier
slaac private
env wpa_supplicant_conf=/etc/wpa_supplicant/wpa_supplicant.conf
interface wlan0
static ip_address=10.1.1.1/24
The last part is the most important bit. We pull the wpa_supplicant configuration that we previously defined, apply it to the WiFi device (wlan0), and register a static IP 10.1.1.1 for that interface. Basically, the WiFi interface will use that IP address as the gateway for clients connecting to it. Those clients need to get their own IP addresses, and that is the role of dnsmasq (no idea why it isn’t dhcpcd that does it, it’s literally in the name). Here is the relevant configuration file /etc/dnsmasq.conf:
listen-address=10.1.1.1
no-hosts
log-queries
log-facility=/var/log/dnsmasq.log
dhcp-range=10.1.1.2,10.1.1.254,72h
dhcp-option=option:router,10.1.1.1
dhcp-authoritative
dhcp-option=114,http://awesome.appliance/
dhcp-option=160,http://awesome.appliance/
# Resolve everything to the portal's IP address.
address=/#/10.1.1.1
# Android Internet Connectivity Test Domains
address=/clients1.google.com/127.0.0.1
address=/clients3.google.com/127.0.0.1
address=/connectivitycheck.android.com/127.0.0.1
address=/connectivitycheck.gstatic.com/127.0.0.1
There is a lot going on here. We define the DHCP range from which clients will get their IPs and set the router for this connection. We also define option 114 (and 160, which is a legacy one) to instruct the client that it needs to first visit that URL before connecting to the wider internet.
Finally, we set up DNS in such a way that all DNS entries go to the server, except for a certain set of known domains used by some Android phones to check for an internet connection. We’ll touch on that in a bit.
In short, all of this configuration tells the Zero to create a WiFi hotspot with IP 10.1.1.1, assign connected devices IP addresses in the range 10.1.1.2 .. 10.1.1.254, set the DNS server for those devices to 10.1.1.1, and resolve any DNS query to IP 10.1.1.1. Also, if they care to, there is a specific URL users need to visit to get things started. Essentially, we are trying to guide the user to the right place.
One problem we have, however, is that we didn’t set up anything to respond to HTTP requests. That is why we installed nginx earlier. We configure it using /etc/nginx/sites-available/default:
server {
    listen 80 default_server;
    listen [::]:80 default_server;
    server_name _;

    location / {
        return 302 http://awesome.appliance;
    }
}

server {
    listen *:80;
    server_name awesome.appliance;

    root /var/appliance/web;
    autoindex on;
}
The idea here is simple. Everything before this directs the client to the server: all domains resolve to it, etc. So when a connection comes in, we tell nginx to return a 302 response (redirect) to the portal endpoint we have.
If the client is requesting the http://awesome.appliance address, however, we serve an actual website.
All of this together ends up with an open access point that, upon connection, will direct you to a web page. This is a walled garden, of course, since we assume that the Zero is connected only to the power.
Now that this is solved, you need to figure out what function you want the appliance to actually have.
When we started working on Corax (10 years ago!), we had a pretty simple mission statement for that: “Lucene, but 10 times faster for our use case”. When we actually started implementing this in code (early 2020), we had a few more rules about the direction we wanted to take.
Corax had to be faster than Lucene in all scenarios, and 10 times faster for common indexing and querying scenarios. Corax’s design is meant for online indexing, not batch-oriented indexing like Lucene’s. We favor moving work to indexing time and ensuring that our on-disk data structures can be used with no additional processing time.
Lucene was created at a time when data size was much smaller and disks were far more expensive. It shows in the overall design in many ways, but one of the critical aspects is that the file design for Lucene is compressed, meaning that you need to read the data, decode that into the in-memory data structure, and then process it.
For RavenDB’s use case, that turned out to be a serious problem. In particular, the issue of cold queries, where you query the database for the first time and have to pay the initialization cost, was particularly difficult. Now, cold queries aren’t really that interesting from a benchmark perspective; you have to warm things up in any software (caches are everywhere, from your disk to your CPU). I like to say that even memory has caches (yes, plural: L1, L2, L3) because it is so slow.
With Lucene’s design, however, whenever it runs an indexing batch, it creates a new file, and to start querying after that means that you have a “cold start” for that file. Usually, those files are small, but every now and then Lucene needs to merge several files together and then we have to pay the cold start price for a large amount of data.
The issue is that this sometimes introduces a high latency spike (hitting us in the P999 targets), which is really hard to smooth over. We spent a lot of time and engineering resources ensuring that this doesn’t have a big impact on our users.
One of the design goals for Corax was to ensure that this doesn’t happen. That we are able to get consistent performance from the system without periodic maintenance tasks. That led us to a very different internal design. The persistent data structures that we use are meant to be used as is, without initial processing.
Everything has a cost, and in this case, it means that the size of Corax on disk is typically somewhat larger than Lucene. The big advantage is that the amount of memory being used by Corax tends to be significantly lower. And in today’s world, disks are far cheaper than memory. Corax’s cold start time is orders of magnitude faster than Lucene’s cold start time.
It turns out that there is a huge impact in another scenario as well, completely unexpected. We continuously run performance tests on our system, and we got some ridiculous results when testing query performance using encrypted databases.
When you use encryption at rest, RavenDB ensures that the only time that your data is decrypted is when there is an active transaction using the data. In other words, even in-memory buffers are encrypted. That applies to documents as well as indexes. It does not apply to the in-memory data that Lucene holds in its cache, though. For Corax, however, all of its state is encrypted.
When we ran our benchmark on encrypted database queries, we expected to see either roughly the same performance between Corax and Lucene, or Lucene edging out Corax in this scenario, since it can use its cache without paying decryption costs.
Instead, we got really puzzling results. I tried showing them in bar chart format, but I literally couldn’t make the data fit in a reasonable size. The scenario is testing queries on an encrypted database, using an m5.xlarge instance on AWS. We are hitting the server with 500 queries/second, and testing for the 99.99 percentile performance.
| Indexing Engine | 99.99% percentile (ms) | 99.99% percentile (seconds) |
|-----------------|------------------------|-----------------------------|
| Lucene          | 40,210                 | 40.21                       |
| Corax           | 186                    | 0.18                        |
Take a look at those numbers! Somehow Corax is absolutely smoking Lucene’s lunch. And I was quite surprised about that. I mean, I’m happy, I guess, that the indexing engine we spent so much time on is doing this well, but any time that we see a performance number that we cannot explain we need to figure out what is going on.
Here is the profiler output for this benchmark, using Lucene.
As you can see, the vast majority of the time is spent decrypting pages. And we are decrypting pages belonging to a stream. Those are the Lucene files, stored (encrypted in this case) inside of Voron. The issue is that the access pattern that Lucene is using forces us to touch large parts of the file. It usually reads a very small portion each time, but in various locations. Given that the data is encrypted, we have to decrypt each of those locations.
Corax, on the other hand, lays out its persistent data structures in such a way that we only need to access the specific pages we care about. In terms of the number of pages touched for this particular scenario, Lucene is using a lot more. You’ll usually not notice that, since Voron (our storage engine) is memory mapped and those accesses are cheap. When using encrypted storage, however, we need to decrypt the data first, so it becomes very noticeable.
It’s interesting to note that this also applies to instances where memory pressure is involved. Corax tends to touch a lot less memory and have a smaller working set, while Lucene generates more page faults.
Really interesting results, and I’m both happy and amused that totally different design decisions have led to such a big impact in this scenario. In short, Corax is fast, really fast, and in many more scenarios than we initially thought.
Following my previous post about updating the publishing platform of this blog, I realized that I dug myself into a hole. The new workflow was pretty sweet. To the point where I wrote my blog posts a lot more frequently than before, as you can probably tell.
The problem was that I wanted to edit and process the blog post inside Google Docs, where I have a great workflow for editing, reviews, collaboration, etc. And then I want to push that same document to the blog. The killer for me is that I want that to be a smooth process, and the end text should fit into the blog. That means, if I want to emphasize something, it should be seen in the blog as bold. And if I want to write some code, that should work as well. In fact, the reason that I started this process is that it got so annoying to post code to the blog.
I’m using Google Docs’ export functionality to get the HTML back, and I did some basic cleaning to get it blog-ready instead of focused on visual fidelity. I was using HTML Agility Pack to do that, and it turned out to be the wrong tool for the job. The issue is that it processes the data as if it were an XML document. I actually have a long track record with XML, so that wasn’t the issue. The problem is that I wanted to do a series of non-trivial things with the HTML, and there aren’t any off-the-shelf facilities to do that in .NET that I could find.
For example, given how important it is to me to show code snippets properly, I wanted to be able to grab them from the document, figure out what language I’m actually using there and syntax highlight it properly. There isn’t anything like that in .NET, all the libraries I found were for JavaScript.
You know the adage, “Let’s rewrite it in Rust”? Well, I rewrote my entire publishing process in JavaScript. Which then led me to another adventure. How can I do two contrary things? When I’m writing this document, I want to be able to just write the code. When I publish it, I want to see the syntax highlighted code, properly formatted and working.
Google Docs has support for writing code blocks inline (for some small number of languages), which is great for the editing process. However, the HTML that this generates is beyond atrocious. What is even worse, the exported HTML doesn’t align things properly or use fixed-size fonts, etc. In other words, it is almost there, but not quite.
When analyzing the Google Docs output, I noticed a couple of funny characters in the code output. Here is what it looks like. I believe this is a bug in the export process, probably related to the way code blocks work in Google Docs.
Dear Googlers, if you are reading this, please make a note that this behavior is now subject to Hyrum's Law. It is observable state, and I’m relying on it to do important tasks. Don’t break this in the future.
It turns out these are actually a pair of Unicode characters. More specifically, they are Unicode characters that are marked for private use:
- 0xEC03 - appears to be used to mark the beginning of a code block
- 0xEC02 - appears to be used to mark the end of a code block
Note the “appears”, and my blatant disregard for things like software maintenance discipline and all things proper and good in the world of Computer Science. This is a project where there are no rules, there is one customer, and he can code 🙂.
As mentioned earlier, while extracting the Google Doc as HTML and processing it, I encounter those Unicode markers that delineate the code section. This is good, because in terms of HTML itself, what it is doing inside is a… mess. Getting the actual text as it is supposed to be is not easy. So I exported the file again, as text. Those markers are showing up in the textual edition as well, which made things a lot easier for me.
With all of this done, allow me to show you some truly horrifying beautiful code:
let blocks = [];
// The text export of the document: code blocks are delimited by the
// private-use characters 0xEC03 (start) and 0xEC02 (end).
for (const match of text.data.matchAll(/\uEC03(.*?)\uEC02/gs)) {
    const code = match[1].trim();
    // Guess the language, then syntax highlight it at publication time.
    const lang = flourite(code, { shiki: true, noUnkown: true }).language;
    const formattedCode = Prism.highlight(code, Prism.languages[lang], lang);
    blocks.push("<hr/><pre class='line-numbers language-" + lang + "'>" +
        "<code class='line-numbers language-" + lang + "'>" +
        formattedCode + "</code></pre><hr/>");
}

let codeSegmentIndex = 0;
let inCodeSegment = false;
htmlDoc.findAll().forEach(e => {
    var text = e.getText().trim();
    if (text == "\uEC03") {
        // Start marker: replace it with the formatted code block.
        e.replaceWith(blocks[codeSegmentIndex++]);
        inCodeSegment = true;
    }
    if (inCodeSegment) {
        // Discard the original (mangled) HTML of the code block.
        e.extract();
    }
    if (text == "\uEC02") {
        inCodeSegment = false;
    }
})
That isn’t a lot of code, but it does plenty. We scan through the textual version of the document and find all the code blocks using a regular expression. We then try to figure out what language I’m using and apply code formatting during the publication process (this saves the need to change anything on the blog, which is nice, especially since we have to take into account syndication).
I push the code snippets into an array and then I process the actual HTML document using the DOM and find all the code snippets. I replace the start marker with the actual formatted code and continue to discard all the other elements until I hit the end of the code segment. The rest of the code remains pretty much the same as before.
I was writing this in VS Code and copilot suggested the following code for handling images:
htmlDoc.findAll('img').forEach(img => {
    if (img.attrs.hasOwnProperty('src')) {
        let src = img.attrs.src;
        let imgName = src.split('/').pop();
        // 'entries' are the entries of the zip file exported from Google Docs.
        let imgData = entries.find(e => e.entryName === 'images/' + imgName).getData();
        let imgType = imgName.split('.').pop();
        // Inline the image as a base64 data URI instead of uploading it separately.
        let imgSrc = 'data:image/' + imgType + ';base64,' + imgData.toString('base64');
        img.replaceWith('<img src="' + imgSrc + '" style="float: right"/>');
    }
})
In other words, instead of uploading the images as separate files, I can just encode them into the blog post directly. I like that idea very much because it means that I don’t have to store the images elsewhere.
Given that I don’t have any npm packages to abandon, I don’t know if I can call myself a JavaScript developer, but I did put the full code up for people to take a peek and then recoil.
Fungible is a funny word, mostly because you are most likely familiar with the term from NFT (non-fungible tokens) and other similar scams. At its core, it is the idea that for certain things, the instance doesn’t matter, just the amount.
The classic example is that if I lend you a 50$ bill, and you give me back two 20$ bills and a 10$ bill, you’ve still given me back my money. That is even though you very clearly didn’t. I didn’t get the same physical 50$ paper bill back, I got bills for that same amount. On the other hand, if I give you my dog for the weekend, I would be quite upset if I got back three different dogs, even if the total weight is the same.
This is actually a lot more than I want to know about fungibility, to be honest. But it turns out that if you are running a cloud business or just use the cloud in general, you have to be well-versed in the matter. Because in the cloud, money isn’t fungible. In fact, it doesn’t behave a lot like money at all.
Let’s assume that we are a cloud company called cloud.example.com, offering VPSes to our users. You are in charge of writing the billing code, and it is pretty simple, right? Here is some code that can compute the charges:
function compute_charges(custId, start, end) {
    let total = 0;
    let predicate = instance =>
        (instance.custId === custId && instance.started < end) &&
        (instance.ended > start || instance.ended == null);

    for (let instance of query_instances(predicate)) {
        total += instance.hours_running(start, end) * instance.price_per_hour;
    }
    return total;
}
As you can see, there isn’t much there. We find all the instances that were running in the billing period and then calculate the total hours they ran during that period. Please note, this is a simplified model as we aren’t dealing with stopping & starting instances, etc.
The output of the compute_charges() function is a number, which will presumably be handed over to be charged over a credit card. There are other things that we need to do as well (generate an invoice, have a usage report, etc), but I want to focus on the money issue here.
The simplest model is that at the end of the billing period, we charge the customer (using a credit card, for example) and receive our payment. Everyone is happy and we can go home, hopefully richer.
The challenge arises when we want to offer additional options to the customer. For example, we may be willing to give the customer a discount if they are going to commit to a minimum amount of money they’ll spend each month. We may want to offer them upfront payment options or give monetary incentives to a particular aspect of the business (run on ARM instances instead of X64, for example).
Each time that we make such an offer, we are going to be turning around and (significantly) complicating the way we bill the customer. Let’s talk about something as simple as committing to run an instance for a whole year. No upfront payment, just a commitment to pay for a particular server for a year. In AWS or Azure, that would be Reserved Instances, so you are likely very familiar with the idea.
How is that going to be expressed in code? Probably something like this:
function compute_charges(custId, start, end) {
    let total = 0;
    let predicate = instance => /*..redacted.*/;
    var hrsPerIns = {};

    for (let i of this.instances(predicate)) {
        let hours = i.hours_running(start, end);
        hrsPerIns[i.type] = hours + (hrsPerIns[i.type] || 0);
        total += hours * i.price_per_hour;
    }

    for (let c of this.commitmentsFor(custId, start, end)) {
        let hours = c.committed_time(start, end);
        let hoursUsed = hrsPerIns[c.type] || 0;
        let unusedCommittedHours = Math.max(0, hours - hoursUsed);
        total += unusedCommittedHours * this.instance(c.type).price_per_hour;
    }
    return total;
}
To be clear, the code above is not a good way to handle such a task, but it does show in a pretty succinct way the hidden complexities. In this case, if you didn’t meet your commitment, we’ll charge you for the unused commitment as well.
A more complex system would have to account for discounted rates while using the committed values, for example. And in that case, the priority of applying such rates between different matching commitments.
Other aspects may be giving the user a discount for a particular level of usage. So the first 100GB are priced differently from the rest, applying a free tier and… you get the point, I think. It gets complex.
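To make that concrete, here is a minimal sketch of what tiered pricing might look like in code. The tier boundaries, rates, and names below are made up purely for illustration; they aren’t taken from any real price list:

// Hypothetical storage tiers: first 100GB free, the next 900GB at one rate,
// and anything above that at a cheaper bulk rate.
const storageTiers = [
    { upToGB: 100,      pricePerGB: 0.00 }, // free tier
    { upToGB: 1000,     pricePerGB: 0.10 },
    { upToGB: Infinity, pricePerGB: 0.07 },
];

function tieredCharge(usageGB, tiers) {
    let total = 0;
    let previousLimit = 0;
    for (const tier of tiers) {
        const inThisTier = Math.min(usageGB, tier.upToGB) - previousLimit;
        if (inThisTier <= 0) break;
        total += inThisTier * tier.pricePerGB;
        previousLimit = tier.upToGB;
    }
    return total;
}

// 1,250 GB: 100 free + 900 * 0.10 + 250 * 0.07 = 107.5
console.log(tieredCharge(1250, storageTiers));

Even this toy version hints at the real complexity: every tier, free allowance, and discount is another rule that has to compose correctly with the commitments and everything else.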
Note that at this point, we aren’t even talking about money yet, we are discussing computing the charges. The situation is more interesting when we move to the next stage. On the face of it, this seems pretty simple, all you need to do is charge the credit card, no?
Okay, maybe you need to send an invoice, but that is about it, right?
Well… what happens if the customer made an upfront payment for one of those commitments? Or just accidentally paid twice last month and now has credit on your system.
I’m going to leave aside the whole complexity around payments bouncing (which is a whole other interesting topic) and how to deal with the actual charging. Right now I want to focus on the nature of money itself.
Imagine you have a commitment with a customer for an 8-core / 64 GB VPS server for a whole year. And they paid upfront, getting a nice discount along the way. How would you record that in your system?
The easiest is to create the notion of credit for the user, which you deduct whenever you need to charge them. So we’ll first compute the charges, then deduct the existing credits, and debit the customer if anything remains. This is simple, easy to work with, and wrong.
Remember that discount the user received? They paid for that particular VPS type, and if you now need to charge them for anything else (such as storage charges), that money cannot come from the funds paid for the VPS.
In other words, the money the customer paid is not fungible. It isn’t applicable for any charge, it is colored. It is dedicated to a particular purpose. And managing that turns out to be pretty complex. Mostly because we are trying to fit everything into the debits and credits on the account.
A better model is to avoid using money, in the same way that if you mix inches and centimeters you’ll eventually end up in a bad place on Mars. The solution is to treat each individual charge as its own “currency”.
In other words, when computing the charges, we aren’t trying to find the cost of running a particular instance for the billing period. We are trying to find how many “cost units” we have for that time period.
Instead of getting a single number that we’ll charge the customer, we’ll obtain a detailed set of the charges in question. Not as money, but as cost units. Think about those in a similar way to currency. Note that all the units are multiples of 730 hours (the average number of hours per month).
compute_charges(custId, start, end) => {
    custId: 'customers/3291-B',
    start: '2024-01-01',
    end: '2024-01-31',
    costs: [
        {type: '8Cores-64GB-hours', qty: 2190},
        {type: '4Cores-32GB-hours', qty: 730},
        {type: 'disk-5000-iops',    qty: 2920},
    ],
}
The next step after that is to get your allocated budget for the same billing period, which will look something like this:
compute_budget(custId, start, end) => {
    custId: 'customers/3291-B',
    start: '2024-01-01',
    end: '2024-01-31',
    commitments: [
        {type: '8Cores-64GB-hours', qty: 2190},
        {type: '4Cores-32GB-hours', qty: 1460},
        {type: 'disk-5000-iops',    qty: 730},
    ],
}
In other words, just as we compute the charges based on the actual usage for that billing period, we apply the same approach to the commitments we have. The next stage is to just add all of those together (there is a small code sketch of this netting step right after the list). In this case, we’ll end up with the following:
- 8Cores-64GB-hours ⇒ 0 (we used as much as we committed to)
- 4Cores-32GB-hours ⇒ -730 (we committed to more than we used)
- Disk-5000-iops ⇒ 2190 (remaining use after applying commitment, priced as you go)
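Here is a small sketch of that netting step, using the two outputs shown above. The function name and shapes are mine, just to illustrate the idea of treating each cost unit as its own currency:

// Net the usage against the commitments, per cost unit type ("currency").
// Positive result: usage beyond the commitment (pay as you go).
// Negative result: committed hours that were never used (still billed at the commitment rate).
function netUnits(costs, commitments) {
    const net = {};
    for (const c of costs) {
        net[c.type] = (net[c.type] || 0) + c.qty;
    }
    for (const c of commitments) {
        net[c.type] = (net[c.type] || 0) - c.qty;
    }
    return net;
}

const charges = compute_charges('customers/3291-B', '2024-01-01', '2024-01-31');
const budget = compute_budget('customers/3291-B', '2024-01-01', '2024-01-31');
console.log(netUnits(charges.costs, budget.commitments));
// { '8Cores-64GB-hours': 0, '4Cores-32GB-hours': -730, 'disk-5000-iops': 2190 }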
We aren’t done yet, after commitments, there are other plans that we may need to run. For example, we’ll provide you with some global discounts for VM rental (which doesn’t apply to disks, however). Working at the level of cost units (or colors, or currency, whatever term you like) allows us to apply those things in a very fine-grained manner. More importantly, the end result and all its intermediate steps are very clear. That is quite important when you look at a six-figure bill with hundreds of line items and you want to see whether the billing matches your contract or not.
As you can imagine, given the inherent complexity of the system, being able to clearly “show your work” is quite important. Especially when there is a misunderstanding or questions are being raised (and there will be).
What we have done now is compute the actual charges based on their type, but we need to convert that to real money. There are several steps along this process (a sketch of the conversion follows the list):
- We need to charge all the active commitments. Those may have been pre-paid (in which case there is no current charge), but they may have a (fixed) monthly cost that we need to add to the current invoice.
- We need to perform a “currency conversion” between the units we have and actual money. In the example above, we have a negative number of units (for 4Cores-32GB-hours), as we committed to more hours than we actually used. We are still being charged for this by applying the rate from the commitment.
- On the other hand, when we examine the disk costs, we used more than we committed to. Here we need to make a decision about what price we’ll charge the user. It can be the commitment price or the pay-as-you-go price. So even for the same currency we may have different rules.
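Continuing the example, here is a sketch of what that conversion might look like. The rate table is invented, and charging overages at the pay-as-you-go rate is just one possible policy, as noted above:

// Illustrative rates (dollars per hour / per unit), not real prices.
const rates = {
    '8Cores-64GB-hours': { committed: 0.30, payAsYouGo: 0.45 },
    '4Cores-32GB-hours': { committed: 0.15, payAsYouGo: 0.25 },
    'disk-5000-iops':    { committed: 0.02, payAsYouGo: 0.03 },
};

function unitsToMoney(netUnits, rates) {
    let total = 0;
    for (const [type, qty] of Object.entries(netUnits)) {
        if (qty < 0) {
            // Unused commitment: still charged, at the commitment rate.
            total += -qty * rates[type].committed;
        } else {
            // Usage beyond the commitment: charged here at the pay-as-you-go rate,
            // though a real system might apply the commitment rate instead.
            total += qty * rates[type].payAsYouGo;
        }
    }
    return total;
}

// 0 + 730 * 0.15 + 2190 * 0.03 = 175.2
console.log(unitsToMoney({
    '8Cores-64GB-hours': 0,
    '4Cores-32GB-hours': -730,
    'disk-5000-iops': 2190,
}, rates));

Any fixed monthly fees for commitments that weren’t pre-paid would then be added on top of this before we arrive at the final number.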
After all of this is done, we are now left with a final number. The actual amount of money that we need to charge the customer. This is the point at which we check if the customer has any credit already paid in the system or if we need to make an actual charge. That aspect is complicated by whether you are charging a credit card (same for any other automatic billing option) or issuing an invoice to be paid manually.
For a manual invoice, you now have a whole other process. For example, you may offer discounts for the customer if they pay within 14 days versus the usual 30, or charge a fee for paying within 60 days, etc.
I’m not touching on collections or what to do when you fail to charge the customer. It is shockingly common to encounter payment failures. To the point where we never had a single payment run that didn’t include at least several such cases. The reasons range from deal size too big to (temporary) lack of funds to suspicious-seeming activity. You need to be able to handle that as well. But those are topics for another post.
In this post, my aim was to discuss just the issue of the complexity of money in the cloud business. I find the model of treating the charges as separate “currencies” to be a nice one overall, but I would love to hear about other people’s experiences in this matter.
A not insignificant part of my job is to go over code. Today I want to discuss how we approach code reviews at RavenDB, not from a process perspective but from an operational one. I have been a developer for nearly 25 years now, and I’ve come to realize that when I’m doing a code review I’m actually looking at the code from three separate perspectives.
The first, and most obvious one, is when I’m actually looking for problems in the code - ensuring that I can understand what is going on, confirming the flow makes sense, etc. This involves looking at the code as it is right now.
I’m going to be showing snippets of code reviews here. You are not actually expected to follow the code, only the concepts that we talk about here.
Here is a classic code review comment:
There is some duplicated code that we need to manage. Another comment that I liked is this one, pointing out a potential optimization in the code:
If we define the code using the static keyword, we’ll avoid delegate allocation and save some memory, yay!
It gets more interesting when the code is correct and proper, but may do something weird in some cases, such as in this one:
I really love it when I run into those because they allow me to actually explore the problem thoroughly. Here is an even better example, this isn’t about a problem in the code, but a discussion on its impact.
RavenDB has been around for over 15 years, and being able to go back and look at those conversations in a decade or so is invaluable to understanding what is going on. It also ensures that we can share current knowledge a lot more easily.
Speaking of long-running projects, take a look at the following comment:
Here we need to provide some context to explain. The _caseInsensitive variable here is a concurrent dictionary, and the change is a pretty simple optimization to avoid the annoying KeyValuePair overload. Except… this code is there intentionally, we use it to ensure that the removal operation will only succeed if both the key and the value match. There was an old bug that happened when we removed blindly and the end result was that an updated value was removed.
In this case, we look at the code change from a historical perspective and realize that a modification would reintroduce old (bad) behavior. We added a comment to explain that in detail in the code (and there already was a test to catch it if this happens again).
By far, the most important and critical part of doing code reviews, in my opinion, is not focusing on what is or what was, but on what will be. In other words, when I’m looking at a piece of code, I’m considering not only what it is doing right now, but also what we’ll be doing with it in the future.
Here is a simple example of what I mean, showing a change to a perfectly fine piece of code:
The problem is that the if statement will call InitializeCmd(), but we previously called it using a different condition. We are essentially testing for the same thing using two different methods, and while currently we end up with the same situation, in the future we need to be aware that this may change.
I believe one of the major shifts in my thinking about code reviews came about because I mostly work on RavenDB, and we have kept the project running over a long period of time. Focusing on making sure that we have a sustainable and maintainable code base over the long haul is important. Especially because you need to experience those benefits over time to really appreciate looking at codebase changes from a historical perspective.
RavenDB will be participating in the DevWeek hackathon in February. The hackathon is now live, and we are offering prizes worth 4,000 USD for the top two winners.
The hackathon is open to both attendees of the DevWeek conference and the general public. The challenge we put forth is building a sharing platform in a community. I’m excited to see what kind of solutions will be submitted.
I will also be personally attending the DevWeek conference and would be very happy to meet you in person. Happy hacking!
I've been writing this blog since 2004. That means I have been doing this for twenty years, which is frankly unbelievable to me. The actual date is sometime in April, so I’ll probably do a summary post then about that.
What I want to talk about today is a different aspect. The mechanism and processes I use to write blog posts. A large part of the reason I write blog posts is that it helps me understand and organize my own thoughts. And in order to do that effectively, I have found that I need very little friction in the blogging process.
About a decade ago, Google Reader was shut down, and I’m still very bitter about that. It effectively killed a significant portion of the blogging audience and made the ergonomics of reading blogs a lot harder. That also led people to use walled gardens to communicate with others, instead of the decentralized network and feed aggregators. A side effect of that decision is that blogging tools have stopped being a viable thing people spend time or money on.
At the time, I was using Windows Live Writer, which was a high-quality editor and had a rich plugin system. Microsoft discontinued it at some point, it became an open-source project, and even that died. The website is no longer functional and even in terms of the GitHub project, the last commit was 5 years ago.
I’m still using Open Live Writer to write the majority of my blog posts, but given there are no longer any plugins, even something as simple as embedding code in my posts has become an… annoyance. That kills the ergonomics of blogging for me.
Not a problem, this is Open Source, and I can do that myself. Except… I really don’t have the time to spend on something ancillary like that. I would happily pay (a reasonable amount) for a blogging client, but I’m going to assume that I’m not part of a large enough group that there is a market for this.
Taking the code snippets example, I can go into the code, figure out what is going on there, and add a “code snippet” feature. I estimate that would take several days. Alternatively, I can place the code as a GitHub gist and embed it in the page. It is annoying, but far quicker than going to the trouble of figuring that out.
Another issue that bugs me (pun intended) is a problem with copy/paste of images, where taking screenshots using the Snipping Tool doesn’t paste into Writer. I need to first paste them into Paint, then into Writer. In this case, I assume that Writer doesn’t recognize the clipboard format or something similar.
Finally, it turns out that I’m not writing blog posts in the same manner as I used to. It got to the point where I asked people to review my posts before making them public. It turns out that no matter how many times it is corrected, my brain seems unable to discern when to write “whether” or “whatever”, for example. At this point I gave up updating that piece of software 🙂. Even the use of emojis doesn’t work properly (Open Live Writer mostly predates a lot of them and breaks the HTML in a weird fashion 🤷).
In other words, there are several problems in my current workflow, and it has finally reached the point where I need to do something about it. The last requirement, by the way, is the most onerous. Consider the workflow of getting the following fixes to a blog post:
- and we run => and we ran
- we spend => we spent
Where is my collaborative editing and the ability to suggest changes with good UX? Improving the ergonomics for the blog has just expanded in scope massively. Now it is a full-fledged publishing platform with modern sensibilities. It’s 2024; features like proper spelling and grammar corrections should absolutely be there, no? And what about AI integration? It turns out that predictive text makes the writing process more efficient. Here is what this may look like:
At this stage, this isn’t just a few minor fixes. I should mention that for the past decade and a half or so, I stopped considering myself as someone who can do UI in any meaningful manner. I find that the <table/> tag, which used to be my old reliable method, is not recommended now, for some reason.
This… kind of sucks. I want to upgrade my process by a couple of decades, but I don’t want to pay the price for that. If only there was an easier way to do that.
I started using Google Docs to edit my blog posts, then pasting them into Live Writer or directly to the blog (using a Rich Text Box with an editor from… a decade ago). I had to check the source code for this, by the way. The entire experience is decidedly Developer UX. Then I had a thought, I already have a pretty good process of writing the blog posts in Google Docs, right? It handles rich text editing and management much better than the editor in the blog. There are also options for things like proper workflows. For example, someone can go over my drafts and make comments or suggestions.
The only thing that I need is to put both of those together. I have to admit that I spent quite some time just trying to figure out how to get the document from Google Docs using code. The authentication hurdles are… significant to someone who isn’t aware of how it all plugs together. Once I got that done, I got my publishing platform with modern features. Here is what the end result looks like:
public class PublishingPlatform
{
    private readonly DocsService GoogleDocs;
    private readonly DriveService GoogleDrive;
    private readonly Client _blogClient;

    public PublishingPlatform(string googleConfigPath, string blogUser, string blogPassword)
    {
        var blogInfo = new MetaWeblogClient.BlogConnectionInfo(
            "https://ayende.com/blog",
            "https://ayende.com/blog/Services/MetaWeblogAPI.ashx",
            "ayende.com", blogUser, blogPassword);
        _blogClient = new MetaWeblogClient.Client(blogInfo);

        var initializer = new BaseClientService.Initializer
        {
            HttpClientInitializer = GoogleWebAuthorizationBroker.AuthorizeAsync(
                GoogleClientSecrets.FromFile(googleConfigPath).Secrets,
                new[] { DocsService.Scope.Documents, DriveService.Scope.DriveReadonly },
                "user", CancellationToken.None,
                new FileDataStore("blog.ayende.com")
            ).Result
        };

        GoogleDocs = new DocsService(initializer);
        GoogleDrive = new DriveService(initializer);
    }

    public void Publish(string documentId)
    {
        using var file = GoogleDrive.Files.Export(documentId, "application/zip").ExecuteAsStream();
        var zip = new ZipArchive(file, ZipArchiveMode.Read);

        var doc = GoogleDocs.Documents.Get(documentId).Execute();
        var title = doc.Title;

        var htmlFile = zip.Entries.First(e => Path.GetExtension(e.Name).ToLower() == ".html");
        using var stream = htmlFile.Open();
        var htmlDoc = new HtmlDocument();
        htmlDoc.Load(stream);
        var body = htmlDoc.DocumentNode.SelectSingleNode("//body");

        var (postId, tags) = ReadPostIdAndTags(body);

        UpdateLinks(body);
        StripCodeHeader(body);
        UploadImages(zip, body, GenerateSlug(title));

        string post = GetPostContents(htmlDoc, body);

        if (postId != null)
        {
            _blogClient.EditPost(postId, title, post, tags, true);
            return;
        }

        postId = _blogClient.NewPost(title, post, tags, true, null);

        var update = new BatchUpdateDocumentRequest();
        update.Requests = [new Request
        {
            InsertText = new InsertTextRequest
            {
                Text = $"PostId: {postId}\r\n",
                Location = new Location
                {
                    Index = 1,
                }
            },
        }];

        GoogleDocs.Documents.BatchUpdate(update, documentId).Execute();
    }

    private void StripCodeHeader(HtmlNode body)
    {
        // strip the private-use marker characters that Google Docs emits around code blocks
        foreach (var remove in body.SelectNodes("//span[text()='\uEC03']").ToArray())
        {
            remove.Remove();
        }
        foreach (var remove in body.SelectNodes("//span[text()='\uEC02']").ToArray())
        {
            remove.Remove();
        }
    }

    private static string GetPostContents(HtmlDocument htmlDoc, HtmlNode body)
    {
        // we use the @scope element to ensure that the document style doesn't "leak" outside
        var style = htmlDoc.DocumentNode.SelectSingleNode("//head/style[@type='text/css']").InnerText;
        var post = "<style>@scope {" + style + "}</style> " + body.InnerHtml;
        return post;
    }

    private static void UpdateLinks(HtmlNode body)
    {
        // Google Docs put a redirect like: https://www.google.com/url?q=ACTUAL_URL
        foreach (var link in body.SelectNodes("//a[@href]").ToArray())
        {
            var href = new Uri(link.Attributes["href"].Value);
            var url = HttpUtility.ParseQueryString(href.Query)["q"];
            if (url != null)
            {
                link.Attributes["href"].Value = url;
            }
        }
    }

    private static (string? postId, List<string> tags) ReadPostIdAndTags(HtmlNode body)
    {
        string? postId = null;
        var tags = new List<string>();
        foreach (var span in body.SelectNodes("//span"))
        {
            var text = span.InnerText.Trim();
            const string TagsPrefix = "Tags:";
            const string PostIdPrefix = "PostId:";
            if (text.StartsWith(TagsPrefix, StringComparison.OrdinalIgnoreCase))
            {
                tags.AddRange(text.Substring(TagsPrefix.Length).Split(","));
                RemoveElement(span);
            }
            else if (text.StartsWith(PostIdPrefix, StringComparison.OrdinalIgnoreCase))
            {
                postId = text.Substring(PostIdPrefix.Length).Trim();
                RemoveElement(span);
            }
        }
        // after we removed post id & tags, trim the empty lines
        while (body.FirstChild.InnerText.Trim() is " " or "")
        {
            body.RemoveChild(body.FirstChild);
        }
        return (postId, tags);
    }

    private static void RemoveElement(HtmlNode element)
    {
        do
        {
            var parent = element.ParentNode;
            parent.RemoveChild(element);
            element = parent;
        } while (element?.ChildNodes?.Count == 0);
    }

    private void UploadImages(ZipArchive zip, HtmlNode body, string slug)
    {
        var mapping = new Dictionary<string, string>();
        foreach (var image in zip.Entries.Where(x => Path.GetDirectoryName(x.FullName) == "images"))
        {
            var type = Path.GetExtension(image.Name).ToLower() switch
            {
                ".png" => "image/png",
                ".jpg" or ".jpeg" => "image/jpg",
                _ => "application/octet-stream"
            };
            using var contents = image.Open();
            var ms = new MemoryStream();
            contents.CopyTo(ms);
            var bytes = ms.ToArray();
            var result = _blogClient.NewMediaObject(slug + "/" + Path.GetFileName(image.Name), type, bytes);
            mapping[image.FullName] = new UriBuilder { Path = result.URL }.Uri.AbsolutePath;
        }
        foreach (var img in body.SelectNodes("//img[@src]").ToArray())
        {
            if (mapping.TryGetValue(img.Attributes["src"].Value, out var path))
            {
                img.Attributes["src"].Value = path;
            }
        }
    }

    private static string GenerateSlug(string title)
    {
        var slug = title.Replace(" ", "");
        foreach (var ch in Path.GetInvalidFileNameChars())
        {
            slug = slug.Replace(ch, '-');
        }
        return slug;
    }
}
You’ll probably not appreciate this, but the fact that I can just push code like that into the document and get it with proper formatting easily is a major lifestyle improvement from my point of view.
The code works with the document in two ways. First, in the Document DOM (which is quite complex), it extracts the title of the blog post, and afterward it updates the document with the blog post ID. But the core of this code is to extract the document as a zip file, grab everything from there, and push that to the blog. I do some editing of the HTML to get everything set up properly, mostly fixing the links and uploading the images. There is also some stuff happening with CSS scopes that I frankly don’t understand. I think I got it right, which is fine for now.
This cost me a couple of evenings, and it was fun. Nothing earth-shattering, I’ll admit. But it’s the first time in a while that I actually wrote a piece of code that was immediately useful. My blogging queue is rather full, and I hope that with this new process it will be easier to push the ideas out of my head and to the blog.
And with that, it is now 01:26 AM, and I’m going to call it a night 🙂.
And as a final thought, I had just made several changes to the post after publication, and it went smoothly. I think that I like it.
I spoke with Jaime recently in the Modern .NET Podcast:
In this episode of The Modern .NET Show podcast, Oren Eini, a seasoned developer with over 20 years of experience in the .NET field, discussed the evolution of the .NET framework and the complexities that come with it. Eini highlighted the rapid pace of change in the language, from the introduction of generics at version 2.0 to switch expressions and pattern matching in the latest versions. While these new features allow for more concise code, Eini acknowledged that they also increase the scope and complexity of learning C# from scratch.
Would love to hear your feedback.