time to read 13 min | 2479 words

RavenDB has a hidden feature, enabled by default and not something that you usually need to be aware of. It has built-in support for caching. Consider the following code:


async Task<Dictionary<string, int>> HowMuchWorkToDo(string userId)
{
    using var session = _documentStore.OpenAsyncSession();
    var results = await session.Query<Item>()
        .GroupBy(x => new { x.Status, x.AssignedTo })
        .Where(g => g.Key.AssignedTo == userId && g.Key.Status != "Closed")
        .Select(g => new
        {
            Status = g.Key.Status,
            Count = g.Count()
        })
        .ToListAsync();

    return results.ToDictionary(x => x.Status, x => x.Count);
}

What happens if I call it twice with the same user? The first time, RavenDB will send the query to the server, where it will be evaluated and executed. The server will also send an ETag header with the response. The client will remember the response and its ETag in its own memory.

The next time this is called on the same user, the client will again send a request to the server. This time, however, it will also inform the server that it has a previous response to this query, with the specified ETag. The server, when realizing the client has a cached response, will do a (very cheap) check to see if the cached response matches the current state of the server. If so, it can inform the client (using 304 Not Modified) that it can use its cache.
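Schematically, the second round trip looks something like this (illustrative headers only, not necessarily RavenDB's exact wire format):

GET /databases/tasks/queries?query=... HTTP/1.1
If-None-Match: "A:1234-AbCdEf"

HTTP/1.1 304 Not Modified

The If-None-Match header carries the ETag of the cached response, and the 304 status tells the client that its cached body is still valid.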

In this way, we benefit twice:

  • First, on the server side, we avoid the need to compute the actual query.
  • Second, on the network side, we aren’t sending a full response back, just a very small notification to use the cached version.

You’ll note, however, that there is still an issue. We have to go to the server to check. That means that we still pay the network costs. So far, this feature is completely transparent to the user. It works behind the scenes to optimize server query costs and network bandwidth costs.

We have a full-blown article on caching in RavenDB if you care to know more details instead of just “it makes things work faster for me”.

Aggressive Caching in RavenDB

The next stage is to involve the user. Enter the AggressiveCache() feature (see the full documentation here), which lets the user opt into more aggressive behavior: when the client has the value in its cache, it skips going to the server entirely and serves the request directly from the cache.

What about cache invalidation? Instead of having the client check on each request if things have changed, we invert the process. The client asks the server to notify it when things change, and until it gets notice from the server, it can serve responses completely from the local cache.
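In code, opting in looks roughly like this (a sketch; the five-minute duration is an arbitrary value for illustration):

using (documentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
{
    using var session = documentStore.OpenAsyncSession();
    // Repeated loads and queries inside this scope can be answered
    // straight from the local cache, with no round trip to the server.
    var item = await session.LoadAsync<Item>("items/1-A");
}

The duration bounds how long a cached response may be used before the client re-validates with the server, while server change notifications handle invalidation in the meantime.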

I really love this feature. That was the Good part; now let’s talk about the other pieces:

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

The bad part of caching is that this introduces more complexity to the system. Consider a system with two clients that are using the same database. An update from one of them may show up at different times in each. Cache invalidation will not happen instantly, and it is possible to get into situations where the server fails to notify the client about the update, meaning that we didn’t clear the cache.

We have a good set of solutions around all of those, I think. But it is important to understand that the problem space itself is a problem.

In particular, let’s talk about dealing with the following query:


var emps = await session.Query<Employee>()
    .Include(x => x.Department)
    .Where(x => x.Location.City == "London")
    .ToListAsync();

When an employee is changed on the server, it will send a notice to the client, which can evict the item from the cache, right? But what about when a department is changed?

For that matter, what happens if a new employee is added to London? How do we detect that we need to refresh this query?

There are solutions to those problems, but they are super complicated and have various failure modes that often require more computing power than actually running the query. For that reason, RavenDB uses a much simpler model. If the server notifies us about any change, we’ll mark the entire cache as suspect.

The next request will have to go to the server (again with an ETag, etc.) to verify that the response hasn’t changed. Note that if the specific query results haven’t changed, we’ll get a 304 Not Modified from the server, and the client will use the cached response.

Conservatively aggressive approach

In other words, even when using aggressive caching, RavenDB still has to go to the server sometimes. What is the impact of this approach when you have a system under load?

We’ll still use aggressive caching, but you’ll see brief periods where we aren’t checking with the server (usually we are able to serve from the cache for about a second or so), followed by queries to the server to check for any changes.

In most cases, this is what you want. We still benefit from the cache while reducing the number of remote calls by about 50%, and we don’t have to worry about missing updates. The downside is that, as application developers, we may know that a particular query and a particular document are independent of one another, so we’d want to keep the query cached until we get notice that a document it actually depends on has changed.

The default aggressive caching in RavenDB will not be of major help here, I’m afraid. But there are a few things you can do.

You can use Aggressive Caching in the NoTracking mode. In that mode, the client will not ask the server for notifications on changes, and will cache the responses in memory until they expire (clock expiration or size expiration only).
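That looks roughly like this (a sketch; I’m assuming the AggressiveCacheMode.DoNotTrackChanges enum value here, check the documentation for the exact names):

// No change-notification subscription is created in this mode; entries
// leave the cache only via time or size expiration.
using (documentStore.AggressivelyCacheFor(
           TimeSpan.FromMinutes(5),
           AggressiveCacheMode.DoNotTrackChanges))
{
    using var session = documentStore.OpenAsyncSession();
    var item = await session.LoadAsync<Item>("items/1-A");
}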

There is also a feature suggestion that calls for updating the aggressive cache in the background; I would love to hear more feedback on this proposal.

Another option is to build this one level above RavenDB, while still using its capabilities. Since we have a scenario where we know that we want to cache a specific set of documents and refresh the cache only when those documents are updated, let’s write it.

Here is the code:


public class RecordCache<T>
{
    // A bounded, thread-safe LRU cache (e.g. ConcurrentLru from the
    // BitFaster.Caching package).
    private ConcurrentLru<string, T> _items =
        new(256, StringComparer.OrdinalIgnoreCase);
    private readonly IDocumentStore _documentStore;

    public RecordCache(IDocumentStore documentStore)
    {
        // Guard: we hand out live instances, so only immutable records
        // (all fields init-only) may be cached.
        const BindingFlags Flags = BindingFlags.Instance |
            BindingFlags.NonPublic | BindingFlags.Public;
        var violation = typeof(T).GetFields(Flags)
            .FirstOrDefault(f => f.IsInitOnly is false);
        if (violation != null)
        {
            throw new InvalidOperationException(
                "You should cache *only* immutable records, but got: " +
                typeof(T).FullName + " with " + violation.Name +
                " which is not read only!");
        }

        var changes = documentStore.Changes();
        // If the changes connection drops, we may have missed updates,
        // so we discard the entire cache.
        changes.ConnectionStatusChanged += (_, args) =>
        {
            _items = new(256, StringComparer.OrdinalIgnoreCase);
        };
        // Evict a document whenever the server notifies us that it changed.
        changes.ForDocumentsInCollection<T>()
            .Subscribe(e =>
            {
                _items.TryRemove(e.Id, out _);
            });
        _documentStore = documentStore;
    }

    public ValueTask<T> Get(string id)
    {
        if (_items.TryGetValue(id, out var result))
        {
            return ValueTask.FromResult(result);
        }
        return new ValueTask<T>(GetFromServer(id));
    }

    private async Task<T> GetFromServer(string id)
    {
        using var session = _documentStore.OpenAsyncSession();
        var item = await session.LoadAsync<T>(id);
        _items.Set(id, item);
        return item;
    }
}

There are a few things to note about this code. We are holding live instances, so we ensure that the values we keep are immutable records. Otherwise, we may hand the same instance to two threads which can be… fun.

Note that document IDs in RavenDB are case insensitive, so we pass the right string comparer.

Finally, the magic happens in the constructor. We register for two important events. Whenever the connection status of the Changes() connection is modified, we clear the cache. This handles any lost-update scenarios that may have occurred while we were disconnected.

In practice, the subscription to changes on that particular collection is what ensures that, after a server notification, we evict the document from the cache, so the next request loads a fresh version.
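Usage is straightforward. Here is a hypothetical example (the Config record and the document ID are illustrative):

// An immutable record: it passes the constructor check and is safe to
// share between threads.
public record Config(string Theme, int PageSize);

var cache = new RecordCache<Config>(documentStore);

// First call loads from the server; subsequent calls are served from
// memory until the server notifies us that the document changed.
var config = await cache.Get("configs/global");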

Caching + Distributed Systems = 🤯🤯🤯

I’m afraid this isn’t an easy topic once you dive into the specifics and constraints we operate under. As I mentioned, I would love your feedback on the background cache refresh feature, or maybe you have better insight into other ways to address the topic.

time to read 4 min | 618 words

We recently got a support request from a user in which they had the following issue:


We have an index that is using way too much disk space. We don’t need to search the entire dataset, just the most recent documents. Can we do something like this?


from d in docs.Events
where d.CreationDate >= DateTime.UtcNow.AddMonths(-3)
select new { d.CreationDate, d.Content };

The idea is that only documents from the past 3 months would be indexed, while older documents would be purged from the index but still retained.

The real problem is that this is a full-text search index, and the data size required to support full-text search across the entire dataset is larger than the cost of simply storing the documents (which can be easily compressed).

This is a great example of an XY problem. The request was to allow access to the current date during the indexing process, so the index could filter out old documents. However, that is something we explicitly prevent. The problem is that the current date isn’t meaningful when we talk about indexing: the time at which a document happens to be indexed has no association with the data itself, so it is not a sound basis for filtering.

The date of a document and the time it was indexed are completely unrelated. I might update a document (and thus re-index it) whose CreationDate is far in the past. That would filter it out from the index. However, if we didn’t update the document, it would be retained indefinitely, since the filtering occurs only at indexing time.

Going back to the XY problem, what is the user trying to solve? They don’t want to index all data, but they do want to retain it forever. So how can we achieve this with RavenDB?

Data Archiving in RavenDB

One of the things we aim to do with RavenDB is ensure that we have a good fit for most common scenarios, and archiving is certainly one of them. In RavenDB 6.0 we added explicit support for Data Archiving.

When you save a document, all you need to do is add a metadata element: @archive-at and you are set. For example, take a look at the following document:


{
    "Name": "Wilman Kal",
    "Phone": "90-224 8888",
    "@metadata": {
        "@archive-at": "2024-11-01T12:00:00.000Z",
        "@collection": "Companies"
    }
}

This document is set to be archived on Nov 1st, 2024. What does that mean?

From that day on, RavenDB will automatically mark it as an archived document, meaning it will be stored in a compressed format and excluded from indexing by default.

In fact, this exact scenario is detailed in the documentation.
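For completeness, here is a sketch of scheduling archiving from the C# client (the document ID is illustrative):

using var session = documentStore.OpenAsyncSession();
var company = await session.LoadAsync<Company>("companies/1-A");

// Schedule archiving by setting the @archive-at metadata value.
session.Advanced.GetMetadataFor(company)["@archive-at"] =
    "2024-11-01T12:00:00.000Z";

await session.SaveChangesAsync();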

You can decide (on a per-index basis) whether to include archived documents in the index. This gives you a very high level of flexibility without requiring much manual effort.
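As a sketch, an index that opts into archived documents might look like the following. I’m assuming the ArchivedDataProcessingBehavior property and enum names here; consult the documentation for the exact API:

public class Events_ByContent : AbstractIndexCreationTask<Event>
{
    public Events_ByContent()
    {
        Map = events => from e in events
                        select new { e.CreationDate, e.Content };

        // Assumption: a per-index setting controls whether archived
        // documents are indexed (the default excludes them).
        ArchivedDataProcessingBehavior =
            ArchivedDataProcessingBehavior.IncludeArchived;
    }
}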

In short, for this scenario, you can simply tell RavenDB when to archive the document and let RavenDB handle the rest. RavenDB will do the right thing for you.

time to read 4 min | 683 words

Reading code is a Skill (with a capital letter, yes) that is really important for developers. You cannot be a good developer without it.

Today I want to talk about one aspect of this. The ability to go into an unfamiliar codebase and extract one piece of information out. The idea is that we don’t need to understand the entire system, grok the architecture, etc. I want to understand one thing about it and get away as soon as I can.

For example, you know that project Xyz is doing some operation, and you want to figure out how this is done. So you need to look at the code and figure that out, then you can go your merry way.

Today, I’m interested in understanding how the LMDB project writes data to the disk on Windows. This is because LMDB is based around a memory-mapped model, and Windows doesn’t keep the data between file I/O and mmap I/O coherent.

LMDB is an embedded database engine (similar to Voron, and in fact, Voron is based on some ideas from LMDB) written in C. If you are interested in it, I wrote 11 posts going through every line of code in the project.

So I’m familiar with the project, but the last time I read the code was over a decade ago. From what I recall, the code is dense. There are about 11.5K lines of code in a single file, implementing the entire thing.

I’m using the code from here.

The first thing to do is find the relevant section in the code. I started by searching for the WriteFile() function, the Win32 API to write. The first occurrence of a call to this method is in the mdb_page_flush function.

I look at this code, and… there isn’t really anything there. It is fairly obvious and straightforward code (to be clear, that is a compliment). I was expecting to see a trick there. I couldn’t find it.

That meant either the code had a gaping hole and potential data corruption (highly unlikely) or I was missing something. That led me to a long trip of trying to distinguish between documented guarantees and actual behavior.

The documentation for MapViewOfFile is pretty clear:

A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function.

I have had my own run-ins with this behavior, which were super confusing. This means that I had experimental evidence saying that this is broken. But it didn’t make sense: there was no code in LMDB to handle it, and this is pretty easy to trigger.

It turns out that while the documentation is pretty broad about not guaranteeing the behavior, the actual issue only occurs if you are working with remote files or using unbuffered I/O.

If you are working with local files and buffered I/O (which is 99.99% of the cases), then you can rely on this behavior. I found some vague references to this, but that wasn’t enough. There is this post that is really interesting, though.
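To make the claim concrete, here is a minimal C# sketch of the scenario in question: write through a regular (buffered) file handle, then read the same bytes back through a memory-mapped view of the same file. On a local volume, this prints the updated data:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

var path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[16]);

using var file = new FileStream(path, FileMode.Open,
    FileAccess.ReadWrite, FileShare.ReadWrite);
using var mmap = MemoryMappedFile.CreateFromFile(file, null, 16,
    MemoryMappedFileAccess.ReadWrite, HandleInheritability.None,
    leaveOpen: true);
using var view = mmap.CreateViewAccessor(0, 16);

// Write through the buffered file handle (WriteFile under the covers)...
file.Write(Encoding.ASCII.GetBytes("hello"), 0, 5);
file.Flush();

// ...and read the same bytes back through the mapped view.
var buf = new byte[5];
view.ReadArray(0, buf, 0, 5);
Console.WriteLine(Encoding.ASCII.GetString(buf)); // prints "hello"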

I pinged Howard Chu, the author of LMDB, for clarification, and he was quick enough to assure me that yes, my understanding was (now) correct. On Windows, you can mix memory map operations with file I/O and get the right results.

The documentation appears to be a holdover from Windows 9x, with the NT line always being able to ensure coherency for local files. This is a guess about the history of documentation, to be honest. Not something that I can verify.

I had the wrong information in my head for over a decade. I did not expect this result when I started this post, I was sure I would be discussing navigating complex codebases. I’m going to stand in the corner and feel upset about this for a while now.

time to read 3 min | 487 words

Corax is the new indexing and querying engine in RavenDB, which recently came out with RavenDB 6.0. Our focus when building Corax was on one thing: performance. I did a full talk explaining how it works from the inside out, available here, as well as a couple of podcasts.

Now that RavenDB 6.0 has been out for a while, we’ve had the chance to complete a host of small Corax features, mostly finishing tasks that didn’t make the cut for the big 6.0 release.

All these features are available in the 6.0.102 release, which went live in late April 2024.

The most important new feature for Corax is query plan visualization.

Let’s run the following query in the RavenDB Studio on the sample data set:


from index 'Orders/ByShipment/Location'
where spatial.within(ShipmentLocation, 
                  spatial.circle( 10, 49.255, 4.154, 'miles')
      )
and (Employee = 'employees/5-A' or Company = 'companies/85-A')
order by Company, score()
include timings()

Note that we are using the include timings() feature. If you configure this index to use Corax, issuing the above query will also give us the full query plan. In this case, you can see it here:

You can see exactly how the query engine has processed your query and the pipeline it has gone through.

We have incorporated many additional features into Corax, including phrase queries, scoring based on spatial results, and more complex sorting pipelines. For the most part, these are small features, but they fulfill specific needs and enable a wider range of scenarios for Corax.

More than six months after Corax went live with 6.0, I can say that it has been a successful feature. It performs its primary job well, being a faster and more efficient querying engine. And the best part is that it isn’t even something that you need to be aware of.

Corax has been the default indexing engine for the Development and Community editions of RavenDB for over 3 months now, and almost no one has noticed.

It’s a strange metric, I know, for a feature to be successful when no one is even aware of its existence, but that is a common theme for RavenDB. The whole point behind RavenDB is to provide a database that works, allowing you to forget about it.

time to read 22 min | 4283 words

Our task today is to request (and obtain approval for) a vacation. But before we can make that request, we need to handle the challenge of building the vacation requesting system. Along the way, I want to focus a little bit on how to deal with some of the technical issues that may arise, such as concurrency.

In most organizations, the actual details of managing employee vacations are a mess of a truly complicated series of internal policies, labor laws, and individual contracts. For this post, I’m going to ignore all of that in favor of a much simplified workflow.

An employee may Request a Vacation, which will need to be approved by their manager. For the purpose of discussion, we’ll ignore all other aspects and set out to figure out how we can create a backend for this system.

I’m going to use a relational database as the backend for now, using the following schema. Note that this is obviously a highly simplified model, ignoring many real-world requirements. But this is sufficient to talk about the actual issue.
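Roughly, the schema looks like this (a sketch; the exact column types and constraints are my guesses, inferred from the queries that follow):

CREATE TABLE Employees (
    id      SERIAL PRIMARY KEY,
    name    TEXT NOT NULL,
    manager INT REFERENCES Employees(id)
);

CREATE TABLE VacationRequests (
    id       SERIAL PRIMARY KEY,
    empId    INT NOT NULL REFERENCES Employees(id),
    approver INT NOT NULL REFERENCES Employees(id),
    reason   TEXT,
    status   TEXT NOT NULL DEFAULT 'Pending'
);

CREATE TABLE VacationRequestDates (
    id        SERIAL PRIMARY KEY,
    vacReqId  INT NOT NULL REFERENCES VacationRequests(id),
    date      DATE NOT NULL,
    mandatory BOOLEAN,
    notes     TEXT
);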

After looking at the table structure, let’s look at the code (again, ignoring data validation, error handling, and other rather important concerns).


app.post('/api/vacations/request', async (req, res) => {
    const { employeeId, dates, reason } = req.body;

    await pgsql.query(`BEGIN TRANSACTION;`);
    const managerId = (await pgsql.query(
      `SELECT manager FROM Employees WHERE id = $1;`,
      [employeeId])).rows[0].manager;
    const vacReqId = (await pgsql.query(
      `INSERT INTO VacationRequests (empId,approver,reason,status)
       VALUES ($1,$2,$3,'Pending') RETURNING id;`,
       [employeeId, managerId, reason])).rows[0].id;

    for (const d of dates) {
        await pgsql.query(
          `INSERT INTO VacationRequestDates
           (vacReqId, date, mandatory, notes)
           VALUES ($1, $2, $3, $4);`,
          [vacReqId, d.date, d.mandatory, d.notes]);
    }

    await pgsql.query(`COMMIT;`);

    res.status(201).json({ requestId: vacReqId });
});

We create a new transaction, find out who the employee’s manager is, and register a new VacationRequest for the employee with all the dates for that vacation. Pretty simple and easy, right? Let’s look at the other side of this: approving a request.

Here is how a manager is able to get the vacation dates that they need to approve for their employees.


app.get('/api/vacations/approval', async (req, res) => {
  const { whoAmI } = req.query;

  const vacations = (await pgsql.query(
    `SELECT VRD.id, VR.empId, VR.reason, VRD.date, E.name,
           VRD.mandatory, VRD.notes
    FROM VacationRequests VR
    JOIN VacationRequestDates VRD ON VR.id = VRD.vacReqId
    JOIN Employees E ON VR.empId = E.id
    WHERE VR.approver = $1 AND VR.status = 'Pending'`,
    [whoAmI])).rows;

  res.status(200).json({ vacations });
});

As you can see, most of the code here consists of the SQL query itself. We join the three tables to find the dates that still require approval.

I’ll stop here for a second and let you look at the two previous pieces of code for a bit. I have to say, even though I’m writing this code specifically to point out the problems, I had to force myself not to delete it. There was mental pressure behind my eyes as I wrote those lines.

The issue isn’t a problem with a lack of error handling or security. I’m explicitly ignoring that for this sort of demo code. The actual problem that bugs me so much is modeling and behavior.

Let’s look at the output of the previous snippet, returning the vacation dates that we still need to approve.

id   | empId | name | reason   | date
-----|-------|------|----------|-----------
8483 | 391   | John | birthday | 2024-08-01
8484 | 321   | Jane | dentist  | 2024-08-02
8484 | 391   | John | birthday | 2024-08-02

We have three separate entries that we need to approve, but notice that even though two of those vacation dates belong to the same employee (and are part of the same vacation request), they can be approved separately. In fact, it is likely that the manager will decide to approve John for the 1st of August and Jane for the 2nd, denying John’s second vacation day. However, that isn’t how it works. Since the actual approval is for the entire vacation request, approving one row in the table would approve all the related dates.

When examining the model at the row level, it doesn’t really work. The fact that the data is spread over multiple tables in the database is an implementation detail, an artifact of the impedance mismatch between the document model and the relational model.

Let’s try and see if we can structure the query in a way that would make better sense from our perspective. Here is the new query (the rest of the code remains the same as the previous snippet).


SELECT VR.id, VR.empId, E.name, VR.reason,
    (
        SELECT json_agg(VRD)
        FROM VacationRequestDates VRD
        WHERE VR.id = VRD.vacReqId
    ) AS dates
FROM VacationRequests VR
JOIN Employees E ON VR.empId = E.id
WHERE VR.approver = $1 AND VR.status = 'Pending'

This is a little bit more complicated, and the output it gives is quite different. If we show the data in the same way as before, it is much easier to see that there is a single vacation request and that those dates are tied together.

id   | empId | name | reason   | status  | date
-----|-------|------|----------|---------|---------------------------
8483 | 391   | John | birthday | Pending | 2024-08-01 and 2024-08-02
8484 | 321   | Jane | dentist  | Pending | 2024-08-02

We are going to ignore the scenario of partial approval because it doesn’t matter for the topic I’m trying to cover. Let’s discuss two other important features that we need to handle: how do we allow an employee to edit a vacation request, and how does the manager actually approve one?

Let’s consider editing a vacation request by the employee. On the face of it, it’s pretty simple. We show the vacation request to the employee and add the following endpoint to handle the update.


app.post('/api/vacation-request/date', async (req, res) => {
  const { id, date, mandatory, notes, vacReqId } = req.body;

  if (typeof id === 'number') {
    await pgsql.query(
      `UPDATE VacationRequestDates
      SET date = $1, mandatory = $2, notes = $3
      WHERE id = $4`,
      [date, mandatory, notes, id]);
  }
  else {
    await pgsql.query(
      `INSERT INTO VacationRequestDates (date, mandatory, notes, vacReqId)
      VALUES ($1, $2, $3, $4)`,
      [date, mandatory, notes, vacReqId]);
  }

  res.status(200).end();
});

app.delete('/api/vacation-request/date', async (req, res) => {
  const { id } = req.query;

  await pgsql.query(
    `DELETE FROM VacationRequestDates WHERE id = $1`,
    [id]);

  res.status(200).end();
});

Again, this sort of code is like nails on a chalkboard inside my head. I’ll explain why in just a bit. For now, you can see that we actually need to handle three separate scenarios: editing an existing request date, adding a new one, or deleting one. I’m not showing the code for updating the actual vacation request (such as the reason for the request), since that is pretty similar to the above snippet.

The reason this approach bugs me so much is that it violates transaction boundaries within the solution. Let’s assume that I want to take Thursday off instead of Wednesday, and add Friday as well. How would that be executed using the current API?

I would need to send a request to update the date on one row in VacationRequestDates and another to add a new one. Each one of those operations would be its own independent transaction. That means that either one can fail. While I wanted to have both Thursday and Friday off, only the request for Friday may succeed, and the edit from Wednesday to Thursday might not.

It also means that the approver may see a partial state of things, leading to an interesting problem and eventually an exploitable loophole in the system. Consider the scenario of the approver looking at vacation requests and approving them. I can arrange things so that while they are viewing the request, the employee will add additional dates. When the approver approves the request, they’ll also approve the additional dates, unknowingly.

Let’s solve the problem with the transactional updates on the vacation request and see where that takes us:


app.post('/api/vacation-request/update', async (req, res) => {
  const { vacReqId, datesUpdates } = req.body;
  await pgsql.query(`BEGIN TRANSACTION;`);

  for (const { op, id, date, mandatory, notes } of datesUpdates) {
    if (op === 'delete') {
      await pgsql.query(`DELETE FROM VacationRequestDates
        WHERE id = $1;`,
        [id]);
    }
    else if (op === 'insert') {
      await pgsql.query(`INSERT INTO VacationRequestDates
        (vacReqId, date, mandatory, notes)
        VALUES ($1, $2, $3, $4);`,
        [vacReqId, date, mandatory, notes]);
    }
    else {
      await pgsql.query(`UPDATE VacationRequestDates
        SET date = $1, mandatory = $2, notes = $3
        WHERE id = $4;`,
        [date, mandatory, notes, id]);
    }
  }

  await pgsql.query(`COMMIT;`);
  res.status(200).end();
});

That is… a lot of code to go through. Note that I looked into Sequelize as well, to see what kind of code an OR/M would produce; it wasn’t meaningfully simpler.

There is a hidden bug in the code above, but you probably won’t notice it no matter how long you look, because the issue is the code that isn’t there. The API above assumes that the caller sends us all the dates for the vacation request, but it is easy to get into a situation where we edit the same vacation request from both the phone and the laptop, and end up sending partial information.

In other words, the vacation request in the database has four dates, but I just updated three of them. The fourth is still part of my vacation request, but since I didn’t explicitly refer to it, the code above ignores it. The end result is probably an inconsistent state.

In other words, in trying to reduce the impedance mismatch between my database and the way the user works with the data, I leaned too far toward exposing the database to the callers. The fact that the underlying database stores the data in multiple tables has leaked into the way I model my user interface and the wire API. That leads to a significant amount of complexity.

Let’s go back to the drawing board. Instead of trying to model the data as a set of rows that would be visually represented as a single unit, we need to actually think about a vacation request as a single unit.

Take a look at this image, showing a vacation request form. That is how the business conceptualizes the problem: as a single cohesive unit encompassing all the necessary data for submitting and approving a vacation request.

Note that for real systems, we’ll require a lot more data, including details such as the actual vacation days taken, how they should be recorded against the employee’s leave allowance, etc.

The important aspect here is that instead of working with individual rows, we need to raise the level of abstraction and work with the entity as a whole. In modeling terms, this means that we won’t work with rows but with a Root Aggregate (to use DDD terminology).
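To give a feel for that direction, a vacation request modeled as a single unit might look something like this (an illustrative sketch, not the final model):

{
    "EmpId": 391,
    "Approver": 9341,
    "Reason": "birthday",
    "Status": "Pending",
    "Dates": [
        { "Date": "2024-08-01", "Mandatory": true,  "Notes": null },
        { "Date": "2024-08-02", "Mandatory": false, "Notes": null }
    ]
}

All the data that is submitted, edited, and approved together lives in one place, so the transaction boundary matches the business operation.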

But I already have all of this code written, so let’s see how far I can push things before I even hit my own limits. Let’s look at the code that is required to approve a vacation request. Here is the first draft I wrote to do so.


app.post('/api/vacation-request/approve', async (req, res) => {
  const { vacReqId, approver, status } = req.body;

  const result = await pgsql.query(`UPDATE VacationRequests
   SET status = $1 WHERE id = $2 and approver = $3;`,
    [status, vacReqId, approver]);

  if (result.rowCount == 0) {
    return res.status(400)
      .send({ error: 'No record found or wrong approver' });
  }

  res.status(200).end();
});

Using the query from earlier, I get the vacation requests that I need to approve:

id   | empId | name | reason   | status  | date
-----|-------|------|----------|---------|---------------------------
8483 | 391   | John | birthday | Pending | 2024-08-01 and 2024-08-02

And then I actually approve it using:


POST /api/vacation-request/approve
{"vacReqId": 8483, "approver": 9341, "status": "Approved"}

What is the problem now? Well, what happens if the employee modifies the vacation request between the two requests? The approver may end up approving the wrong details. How do we fix that?

You may think that you can use locking on the approve operation, but the approval is just a single statement, so there is nothing for a lock to protect. And given that the read and the approval are two separate requests, with distinct database transactions between them, holding a lock across them isn’t even possible.

What we need to implement here is called Offline Optimistic Concurrency. In other words, we need to ensure that the version the manager approved is the same as the one that is currently in the database.

In order to do that, we need to modify our schema and add a version column to the VacationRequests table, as you can see in the image.

Now, any time that I make any modification on the VacationRequest, I must also increment the value of the Version field and check that it matches my expected value.

Here is an example of how this looks when the employee is adding a new date to the vacation request. I shortened the code we previously looked at for updating a vacation request, so you can more clearly see the changes required to detect concurrent modifications between requests.


app.post('/api/vacation-request/insert-date', async (req, res) => {
  const { vacReqId, version, date, mandatory, notes } = req.body;
  await pgsql.query(`BEGIN TRANSACTION;`);

  const result = await pgsql.query(`UPDATE VacationRequests
   SET version = version + 1
    WHERE id = $1 and version = $2;`,
    [vacReqId, version]);

  if (result.rowCount == 0) {
    await pgsql.query(`ROLLBACK;`);
    return res.status(400)
      .send({ error: 'No record found or wrong version' });
  }

  await pgsql.query(`INSERT INTO VacationRequestDates
        (vacReqId, date, mandatory, notes)
        VALUES ($1, $2, $3, $4);`,
    [vacReqId, date, mandatory, notes]);

  await pgsql.query(`COMMIT;`);
  res.status(200).end();
});

And on the other side, approving the request is now:


app.post('/api/vacation-request/approve', async (req, res) => {
  const { vacReqId, approver, version, status } = req.body;

  const result = await pgsql.query(`UPDATE VacationRequests
   SET status = $1, version = version + 1
   WHERE id = $2 and approver = $3 and version = $4;`,
    [status, vacReqId, approver, version]);

  if (result.rowCount == 0) {
    return res.status(400)
      .send({
         error: 'No record found or wrong approver or version'
       });
  }

  res.status(200).end();
});

We need to send the version to the client when it reads the request, and the client needs to send that version back when approving, so we can verify that nothing has changed in the meantime.

I have to say, given that I set out to do something pretty simple, I’m actually shocked at how complex this all turned out to be. The solution above also requires cooperation from all parties involved. If I ever write code that modifies the vacation requests or manages them manually (for maintenance purposes, debugging, etc.), I need to remember to include the version updates as well.

When I started writing this blog post, I intended to also show you how you can model the same situation differently. But I think that this is quite long enough already, and I’ll complete the proper modeling concerns in the next post.

time to read 1 min | 103 words

A couple of months ago I had the joy of giving an internal lecture to our developer group about Voron, RavenDB’s dedicated storage engine. In the lecture, I’m going over the design and implementation of our storage engine.

If you ever had an interest in how RavenDB’s transactional and high-performance storage works, this is the lecture for you. Note that it is aimed at our developers, so we go deep.

You can find the slides here and here is the full video.

time to read 2 min | 270 words

RavenDB is typically accessed directly by your application, using an X.509 certificate for authentication. The same applies when you are connecting to RavenDB as a user.

Many organizations require that user authentication rely not on a single factor (such as a password or a certificate) but on multiple ones. RavenDB now supports the ability to define Two Factor Authentication for access.

Here is how this looks in the RavenDB Studio:

You are able to generate a certificate as well as register the Authenticator code in your device.

When using the associated certificate, you’ll not be able to access RavenDB. Instead, you’ll get an error message saying that you need to complete the Two Factor Authentication process. Here is what that looks like:

Once you complete the two factor authentication process, you can select how long we’ll allow access with the given certificate, and whether to allow access just from the current browser window (because you are accessing it directly) or from any client (if you want to access RavenDB from another device or via code).

Once the session duration expires, you’ll need to provide the authentication code again, of course.

This feature is meant specifically for certificates that are used by people directly. It is not meant for APIs or programmatic access. Those should either have a manual step to allow the certificate or utilize a secrets manager that can have additional steps and validations based on your actual requirements.

You can read more about this feature in the feature announcement.

time to read 1 min | 101 words

When Oren Eini originally developed RavenDB, he used the Lucene library to implement indexing. Eventually, his team encountered limitations with this strategy, so they created the Corax search engine, which improved query execution time significantly. Oren discusses the challenges involved in creating this engine and the approaches they took to overcome these challenges.

Part 1:

Part 2:

time to read 4 min | 792 words

RavenDB can run on the Raspberry Pi; in fact, that is an important use case for us, since our users deploy RavenDB as part of Internet of Things systems. We wanted to showcase RavenDB’s performance and decided that instead of scaling up and showing you how well RavenDB handles ridiculous loads, we’ll go the other way around. We’ll go small, and let you directly experience how efficient RavenDB is.

You can look at the demo unit directly on this page.

We decided to dial it down yet further, and run RavenDB on the Raspberry Pi Zero.

This tiny computer is about the size of a cigarette lighter and is small enough to comfortably fit on your keychain. Most Raspberry Pis are impressive machines given their cost, more than powerful enough to power real applications.

Here is what this actually looks like, with me as a reference for size 🙂.

However, just installing RavenDB on the Zero isn't much of a challenge or particularly interesting, to be honest. We wanted to do something that would be both fun and useful. One of the features we want users to explore is the ability to run RavenDB in appliance mode. The question is, what sort of an appliance will we build?

A key part of our thinking was that we wanted to show something that works with realistic data sizes. We wanted to have an actual use case for this, beyond just showing a toy. One of the things that I always find maddening about being disconnected is that I feel like half my brain has been cut away.

We set out to fix that. The project: create a knowledge system inside the Pi Zero that would be truly Plug & Play. That turned out to be quite a challenge, but I think we met it in a very nice manner.

We went to archive.org and got some of the Stack Exchange data sets, picking the ones most relevant for DevOps scenarios: raspberrypi.stackexchange.com, unix.stackexchange.com, serverfault.com, and superuser.com.

I find it deliciously recursive that we can use the Raspberry Pi Zero to store the dataset about the Raspberry Pi itself. We loaded all those datasets into the Zero, for a total of about 7.5 GB, and over 4.2 million documents were stored there.

Note that this is using RavenDB’s document compression, which reduced the total size by over 50% over the original dataset size.

Next was the time to actually make this accessible. Just working with RavenDB directly to query the data is cool, for sure, but we wanted to be useful.

So we built a portal to access the data. Here is what it looks like when you enter it for the first time:

We offer full search capabilities and complete offline access to all those data sets. Perfect when you are stuck in the middle of nowhere and urgently need to remember that awk syntax or how to configure networking on a stubborn device.

Another aspect that we had to consider is how this can actually be used. The Raspberry Pi Zero is a tiny device, and working with it can be annoying. It needs micro-USB power but has no Ethernet or standard USB ports. For display, it uses a mini-HDMI port. That means you can safely assume that you’re likely to have a power cable for it, but not much else.

We want to provide a good solution, so what do we do? The Raspberry Pi Zero we use does have a Wi-Fi chip, so we took things further and set it up as an access point with a captive portal.

You can read exactly how we configured that in this post.

In other words, the expected deployment model is to plug this into power, wait 30 seconds for the machine to boot, and then connect to the “Hugin” wireless network. You will then land directly into the application, able to deep dive into the questions of your choice.

We have been giving away those appliances at the DevWeek conference, and we got a really good reaction from users. Beyond the coolness factor, the fact that we can run a high-performance system on top of a… challenging hardware platform (512MB RAM, 1GHz CPU, SD card for disk) and still provide sub-100ms response times is quite amazing.

You can view the project page here, the entire thing is Open Source, and you can explore how we are able to do that on GitHub.

time to read 1 min | 183 words


I recently talked about how RavenDB is now using ZStd as the default compression algorithm for backups. That led to both a reduction in the amount of storage consumed by backups and a significant reduction in the time it takes to run them.

We have been exploring where else we can get those benefits and the changes were recently released in RavenDB 6.0.2.

RavenDB now supports ZStd for HTTP compression, which you can control using the DocumentConventions.HttpCompressionAlgorithm.
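For the C# client, configuring it might look like this (a sketch; I’m assuming the enum value is named Zstd, check the client documentation for the exact name):

var store = new DocumentStore
{
    Urls = new[] { "https://your-server:443" }, // illustrative URL
    Database = "Northwind",
    Conventions =
    {
        // Use ZStd for HTTP compression between client and server.
        HttpCompressionAlgorithm = HttpCompressionAlgorithm.Zstd
    }
};
store.Initialize();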

You can find all the gory details about the performance impact in the release announcement here.

The really nice thing is that you can expect to see about a 50% reduction in the amount of bandwidth being used at comparable or better timings. That is especially true if you are using bulk inserts, where the benefit is most noticeable.

If you are running on the cloud, that matters a lot, since a reduction in bandwidth to and from the database translates directly into dollars being saved.
