time to read 3 min | 504 words

For an article I am writing, I wanted to compare a RavenDB model to a relational model, and I stumbled upon the following Northwind.NET project.

I plugged in the Entity Framework Profiler and set out to watch what was going on. To be truthful, I expected it to be bad, but I honestly did not expect what I got. Here is a question, how many queries does it take to render the following screen?

[screenshot: the rendered screen]

The answer, believe it or not, is 17:

[screenshot: Entity Framework Profiler showing the 17 queries]

You might have noticed that most of the queries look quite similar, and indeed, they are. We are talking about 16(!) identical queries:

SELECT [Extent1].[ID]           AS [ID],
       [Extent1].[Name]         AS [Name],
       [Extent1].[Description]  AS [Description],
       [Extent1].[Picture]      AS [Picture],
       [Extent1].[RowTimeStamp] AS [RowTimeStamp]
FROM   [dbo].[Category] AS [Extent1]

Looking at the stack trace for one of those queries led me to:

[screenshot: the stack trace for one of the queries]

And to this piece of code:

[screenshot: the code in question]

You might note that dynamic is used there; for what reason, I cannot even guess. Just to check, I added a ToArray() to the result of GetEntitySet, and the number of queries dropped from 17 to 2, which is more reasonable. The problem was that we passed an IQueryable to the data binding engine, which ended up evaluating the query multiple times.
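To make the difference concrete, here is a minimal sketch of the fix; GetEntitySet<Category>() and categoriesGrid are hypothetical stand-ins for the project's actual members:

// A minimal sketch, not the project's actual code; GetEntitySet<Category>() and
// categoriesGrid are hypothetical stand-ins for the real members.

// Binding the live IQueryable lets the data binding engine enumerate it repeatedly,
// and every enumeration is another round trip to the database:
categoriesGrid.ItemsSource = GetEntitySet<Category>();

// Materializing the results once means the query executes exactly once:
categoriesGrid.ItemsSource = GetEntitySet<Category>().ToArray();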

And EF Prof actually warns about that, too:

[screenshot: the EF Prof alert about the query being executed multiple times]

At any rate, I am afraid that this project suffers from similar issues all around; it is actually too bad to serve as the bad example that I intended it to be.

time to read 2 min | 400 words

In theory, there is no difference between theory and real life.

In my previous blog post, I discussed my belief that the best value you get from learning comes from learning the very basics of how our machines operate, from memory management in operating systems to the details of how network protocols like TCP/IP work.

Some of that has got to be theoretical study, actually reading about how those things work, but theory isn’t enough. I don’t care if you know the TCP spec by heart; if you haven’t actually built a real system with it and experienced the pain points, it isn’t really meaningful. The best way to learn, at least from my own experience, is to actually do something.

Because that teaches you several very interesting things:

  • What are the differences between the spec and what is actually implemented?
  • How to resolve common (and not so common) problems?

The latter is probably the most important thing. I think that I learned most of what I know about HTTP in the process of building an RSS feed reader. I learned a lot about TCP from implementing a proxy system, and I did a lot of learning from a series of failed projects regarding distributed programming in general.

I learned a lot about file systems and how to work with file based storage from Practical File System Design and from building Rhino Queues and Rhino DHT. In retrospect, I did a lot of very different projects in various areas and technologies.

The best way that I know to get better is to do, to fail, and to learn from what didn’t work. I don’t know of any shortcuts, although I am familiar with plenty of ways of making the road much longer (and usually not very pleasant).

In short, if you want to get better, pick something that you don’t know how to do, and then do it. You might fail, you likely will, but you’ll learn a lot from failing.

I keep drawing a blank when people ask me to suggest options for things to try building, so I thought that I would ask the readers of this blog. What sort of things do you think would be useful to build? Things that would push most people out of their comfort zone and make them learn the fundamentals of how things work.

time to read 3 min | 599 words

One of the questions that I routinely get asked is “how do you learn”. And the answer that I keep giving is that I accidentally started learning things from the basic building blocks. I still count a C/C++ course that I took over a decade ago as one of the chief reasons why I have a good grounding in how computers actually operate. During that course, we had to do everything from building parts of the C standard library on our own to constructing much of the foundation of C++ features in plain C.

That gave me enough understanding of how things are actually implemented to be able to grasp how things behave elsewhere. Digging deep into the implementation is almost never a wasted effort. And if you can’t peel away the layers of abstraction, you can’t really say that you know what you are doing.

For example, I count myself ignorant in all matters WCF, but I have full confidence that I can build a system using it. Not because I understand WCF itself, but because I understand the arena in which it plays. I don’t need to really understand how a certain technology works if I already know the rules it has to play by.

Picking on WCF again, if you don’t know firewalls and routers, you can’t really build a WCF system, regardless of how good your memory is about the myriad ways of configuring WCF to do your will. If you can’t use WireShark to figure out why the system is slow to respond to requests, it doesn’t matter if you can compose a WCF envelope message literally on the back of a real world envelope. If you don’t grok the Fallacies of Distributed Computing, you shouldn’t be trying to build a real system where WCF is used, regardless of whatever certificate you have from Microsoft.

The interesting bit is that for most of what we do, the rules are fairly consistent. We all have to play in Turing’s sandbox, after all.

What this means is that learning the details of IP and TCP will be worth it over and over again. Understanding things like memory fetch latencies will be relevant in five years, and in ten. Knowing what actually goes on in the system, even if it is at a somewhat abstracted level, is important. That is what makes you the master of the system, instead of its slave.

Some of the things that I especially value (this is off the top of my head and isn’t a closed list) are:

  • TCP / UDP – how they actually work.
  • HTTP – and its implications (for example, state management).
  • The Fallacies of Distributed Computing.
  • Disk based storage – working with it efficiently, how file systems work.
  • Memory management in the OS and in your environment.

Obviously, this is a very short list, and again, it isn’t comprehensive. It is just meant to give you some indication of the things that I have found to be useful over and over and over again.

That kind of knowledge isn’t something that is replaced often, and it will help you understand how anyone else has to interact with the same constraints. In fact, it often allows you to accurately guess how they solved a certain problem, because you are aware of the same alternatives that the other side had to choose from.

In short, if you seek to be a better developer, dig deep and learn the real basic building blocks for our profession.

In my next post, I’ll discuss strategies for doing that.

time to read 2 min | 269 words

TLDR;

Replication topologies make my head hurt.

One of our customers had an interesting requirement, several months ago:

[diagram: a chain topology – node #1 replicating through node #2 to node #3]

Basically, he wanted to write a document at node #1, and have it replicate, through node #2, to node #3. That was an easy enough change, and we did that. But then we got another issue from a different customer, who had the following topology:

image

And that client’s problem was that when making a write to node #1, it would be replicated to nodes #2 – #4, each of which would then try to update the other two with the new replication information (skipping node #1, because it is the source). That would cause… issues, because they already had that document in place.

In order to resolve that, I added a configuration option which controls whether the node that we replicate to should receive only documents that were modified on the current node, or whether it should also receive documents that were replicated to us from other nodes.

It is a relatively small change, code wise. Of course, documenting this, and all of the options that follow, is going to be a much bigger task, because now you have to make a distinction between replicating nodes, gateway nodes, etc.
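For illustration only, here is a rough sketch of the decision that the new option controls; the type and property names are invented for this post and are not the actual RavenDB configuration API:

class ReplicationDestinationConfig
{
    // false: send only documents that were modified on the current node
    // true: also forward documents that were replicated to us from other nodes
    public bool IncludeReplicatedDocuments;
}

static bool ShouldSendToDestination(string documentSourceNodeId, string currentNodeId, ReplicationDestinationConfig destination)
{
    if (documentSourceNodeId == currentNodeId)
        return true; // locally modified documents are always replicated

    // documents that arrived here via replication are only forwarded when the
    // destination is configured to accept them
    return destination.IncludeReplicatedDocuments;
}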

time to read 1 min | 54 words

Recently I had a spate of posts showing up and then returning 404. The actual reason behind that was that the server and the database disagreed with one another with regard to the timezone of the post (one was using UTC, the other local time). Sorry for the trouble, and this shouldn’t happen any longer.
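For the curious, the underlying rule is simply to compare times in the same timezone; a minimal sketch of the idea, not the blog engine’s actual code:

// Illustration only, not the blog engine's actual code; it assumes the publish
// date is stored in UTC. Comparing a UTC timestamp against DateTime.Now (local
// time) is exactly the sort of mismatch that makes a post appear and then 404.
static bool IsPublished(DateTime publishedAtUtc)
{
    return publishedAtUtc <= DateTime.UtcNow;
}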

time to read 5 min | 884 words

This StackOverflow question indicates that it is half a bug and half a feature, but it sure as hell looks like a bug to me.

Let us assume that we have a couple of endpoints in our application, called http://localhost:8080/secure and http://localhost:8080/public. As you can imagine, the secure endpoint is… well, secure, and requires authentication. The public endpoint does not.

We want to optimize the number of requests we make, so we specify PreAuthenticate = true. And that is where all hell breaks loose.

The problem is that it appears that when issuing a request with an entity body (in other words, PUT / POST) with PreAuthenticate = true, the .NET framework will send a PUT / POST request with an empty body to the server, presumably to get the 401 authentication information. At that point, if the endpoint that it happened to reach is public, the request will be accepted as a standard request and processing will be attempted. The problem here is that it has an empty body, so it has a very strong likelihood of failing.

This error cost me a day and a half or so. Here is the full repro:

static void Main()
{
    new Thread(Server)
    {
        IsBackground = true
    }.Start();

    Thread.Sleep(500); // let the server start

    bool secure = false;
    while (true)
    {
        secure = !secure;
        Console.Write("Sending: ");
        var str = new string('a', 621);
        var req = WebRequest.Create(secure ? "http://localhost:8080/secure" : "http://localhost:8080/public");
        req.Method = "PUT";

        var byteCount = Encoding.UTF8.GetByteCount(str);
        req.UseDefaultCredentials = true;
        req.Credentials = CredentialCache.DefaultCredentials;
        req.PreAuthenticate = true;
        req.ContentLength = byteCount;

        using(var stream = req.GetRequestStream())
        {
            var bytes = Encoding.UTF8.GetBytes(str);
            stream.Write(bytes, 0, bytes.Length);
            stream.Flush();
        }

        req.GetResponse().Close();

    }

}

And the server code:

public static void Server()
{
    var listener = new HttpListener();
    listener.Prefixes.Add("http://+:8080/");
    listener.AuthenticationSchemes = AuthenticationSchemes.IntegratedWindowsAuthentication | AuthenticationSchemes.Anonymous;
    listener.AuthenticationSchemeSelectorDelegate = request =>
    {

        return request.RawUrl.Contains("public") ? AuthenticationSchemes.Anonymous : AuthenticationSchemes.IntegratedWindowsAuthentication;
    };

    listener.Start();

    while (true)
    {
        var context = listener.GetContext();
        Console.WriteLine(context.User != null ? context.User.Identity.Name : "Anonymous");
        using(var reader = new StreamReader(context.Request.InputStream))
        {
            var readToEnd = reader.ReadToEnd();
            if(string.IsNullOrEmpty(readToEnd))
            {
                Console.WriteLine("WTF?!");
                Environment.Exit(1);
            }
        }

        context.Response.StatusCode = 200;
        context.Response.Close();
    }
}

If PreAuthenticate is set to false, everything works, but then we have twice as many requests. The annoying thing is that nothing bad would happen when trying to authenticate against a public endpoint, if it would only send the bloody entity body along as well.

This is quite annoying.

time to read 3 min | 537 words

The reason that I said that this is very stupid code?

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

Because all of this lovely code can be replaced with a simple:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    req.ContentLength = Encoding.UTF8.GetByteCount(data);

    using (var dataStream = req.GetRequestStream())
    using(var writer = new StreamWriter(dataStream, Encoding.UTF8))
    {
        writer.Write(data);
        writer.Flush();
    }
}

And that is so much better.

time to read 3 min | 549 words

We had the following code:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteArray = Encoding.UTF8.GetBytes(data);

    req.ContentLength = byteArray.Length;

    using (var dataStream = req.GetRequestStream())
    {
        dataStream.Write(byteArray, 0, byteArray.Length);
        dataStream.Flush();
    }
}

And that is a problem, because it holds the data in memory twice: once for the string, once for the byte array. I changed that to this:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

And I was quite proud of myself.

Then I realized that I was stupid. Why?
