
ReachableCode

The ostream of consciousness of a <random> developer

C++26 Reflections adventures & compile time UML

Posted on 31 July 2025 (updated 3 August 2025) by gianluca

The first thing I do every time I need to learn a new codebase is to start drawing the UML diagram of its classes, and usually give up soon after I start. Doing it manually is certainly useful, but now, with reflection, I figured it would be fun to try to generate it instead.

With C++26 reflection[1], the general consensus is that the magnitude of the language change is comparable to what happened with C++11. After my little experiment with it, I would cautiously agree. So how does one go about creating a (Plant)UML diagram at compile time? With something like this.

With the spoilers out of the way let’s dig in the details.

P2996[1] introduces a couple of operators, namely the lift operator ^^ and the splice operator [: :]. The first one “lifts” a type or variable into a “meta” space, and the splice (which imho should have been called the “grounding operator”) does the opposite.

The first thing to understand is that regardless of what we apply the lift operator (^^) to, it produces a std::meta::info. This means that some info objects will be reflections of types, and some will be reflections of values. This creates confusion in my head, as at times one needs to check which kind of info something is, but there are good reasons for this, and they are well explained in section 2.2 of the paper. With that in mind let’s start coding.

int main() {
  MyClass s;
  std::string_view dot_graph_uml = make_class_graph<MyClass>();

  std::cout << dot_graph_uml << std::endl;
}

(edit: thanks u/katzdm-cpp for the suggestion to use std::string_view) You’ll notice right away the seemingly odd std::string_view return type; let’s get back to that later. Not much happens in main() besides us calling a function template that is just a wrapper for what we actually care about:

template<typename U>
consteval std::string_view make_class_graph() { 
  std::string graph = "@startuml \nskinparam linetype ortho \n";

  std::vector<std::meta::info> already_drawn;
  graph += make_class_graph_impl(^^U, already_drawn);
  graph += "@enduml";

  return std::define_static_string(graph);
}

Here we see the first interesting thing, something called std::define_static_string[2]. What this does is take a compile-time std::string and create a string literal from it, which we can return from a consteval function. If we were to try to return the std::string itself we would be greeted with the compiler error

<source>:116:24: note: pointer to subobject of heap-allocated object is not a constant expression
/opt/compiler-explorer/clang-bb-p2996-trunk-20250703/bin/../include/c++/v1/__memory/allocator.h:117:13: note: heap allocation performed here
  117 |     return {allocate(__n), __n};

Which makes sense: we cannot create an object on the heap at compile time and expect that object to exist on the heap at runtime too. This is why we end up with a big fat string literal that holds our final UML diagram, which we can do whatever we want with in main(). Looking at the assembly you’ll see the end result:

 .asciz  "@startuml \nskinparam linetype ortho \ntogether {\n class \"MyClass\"\n  \"MyClass\"*--\"unsigned int\"\n  \"MyClass\"*--\"unsigned int\"\n  \"MyClass\"*--\"unsigned int\"\n  \"MyClass\"*--\"Nested\"\n  \"MyClass\"-up-|>\"MyBase\"\n}\ntogether {\n class \"Nested\"\n  \"Nested\"*--\"int\"\n  \"Nested\"*--\"int\"\n  \"Nested\"*--\"MyClass\"\n}\ntogether {\n class \"MyBase\"\n  \"MyBase\"*--\"vector<int, allocator<int>>\"\n}\ntogether {\n class \"vector<int, allocator<int>>\"\n}\n@enduml"

Now that we have a magic function that “defines a static string” for us, let’s dig into the actual reflection bits. The first thing to note is what we pass to make_class_graph_impl, which shows the use of the lift operator, ^^U. This creates a std::meta::info object that in this case is a reflection of a type. The Impl function itself, as you may have guessed, is meant to be recursive and takes a second argument that we’ll explain later:

consteval std::string make_class_graph_impl(std::meta::info head, std::vector<std::meta::info>& already_drawn)
{
  [...]

  constexpr auto ctx = std::meta::access_context::current();
  std::string uml_diagram;
    
  uml_diagram += "together {\n class " + add_quotes(display_string_of(head)) + "\n";
  
  const std::string indent = "  ";

  // members
  for (std::meta::info field_info : std::meta::nonstatic_data_members_of(head, ctx))
  {
    uml_diagram += indent +  add_quotes(display_string_of(head)) + composition_arrow + add_quotes(display_string_of(remove_ptr_cv_type_of(field_info))) + "\n";
  }
[...]

First let’s talk about the new “context” object: the main paper[1] describes it as “a class that represents a namespace, class, or function from which queries pertaining to access rules may be performed[…]“, and it was introduced in [3] as a means to resolve the “encapsulation” issue. This context comes in three flavours:

  • std::meta::access_context::current(): for accessing the public stuff in the current scope
  • std::meta::access_context::unprivileged(): for accessing public stuff in the global scope
  • std::meta::access_context::unchecked(): for accessing anything in the global scope

After that, the interesting things are std::meta::nonstatic_data_members_of(head, ctx) and std::meta::display_string_of(head). These are pretty self-explanatory, and they do make “metaprogramming” as easy as “programming”!

It gets a bit trickier later, when we have to recurse:

uml_diagram += make_class_graph_impl(remove_ptr_cv_type_of(field_info), already_drawn);

Besides the very convenient fact that we can keep track of what we have already iterated over simply by using a std::vector<std::meta::info>, we define a custom function with the terrible name of remove_ptr_cv_type_of.

consteval auto remove_ptr_cv_type_of(std::meta::info r) -> std::meta::info {
  return decay(remove_pointer(is_type(r) ? r : type_of(r)));
}

This does what it says on the box: in our UML we don’t want to distinguish between const/volatile/pointers/whathaveyou. The important and interesting bit, however, is std::meta::is_type(..). This function tells us which kind of info object we have: since info objects can be reflections of anything, we need to test whether we have a reflection of a type or of a value, and apply std::meta::type_of only when necessary. This is a bit of a pain, but imho it’s a small price to pay for reflection.

That is basically it; we just need to try it out by plotting our PlantUML result, which in this case gets printed out at runtime.

References

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p2996r12.html

[2] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3491r3.html

[3] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3547r0.html

Posted in c++26 | Tagged reflections

A Doubly-MMapped Contiguous Shared-memory Lock-free Queue

Posted on 22 November 2022 by gianluca

I realise the title is a bit of a mouthful. It was either that or “No-copy IPC lock-free queue for variable length messages”. Naming is hard.

DISCLAIMER: It’s not uncommon that one needs to send messages to another process as fast as possible, but in 99% of the cases the best answer is boost::interprocess[1]. Even in the last 1%, in production one should obviously be very careful before attempting to re-invent the wheel. However, as the spirit of this blog is to learn and show how the sausage is made, we shall do just that 😉

Here is the code for the people that just want to get to the point.

Intro

It’s easy(ish) to create a ring-based lockfree queue that is Single Producer-Single Consumer[2], but it gets tricky if one also needs variable size messages. The problem with variable size messages is that the end of the ring and the beginning of the ring could hold two halves of a single message, so we’d have to copy it into a contiguous buffer before reading it. This is usually okay, but for a (very) small amount of users this is not really an option: needless calls to memcpy should be avoided and messages should be read straight off the queue’s buffer.

There is one way (that I know of) to have a contiguous address space that loops around, and it boils down to mmapping the ring region twice and making sure the two mappings are next to one another. This is what this project accomplishes.

Implementation

To abstract the calls to mmap and friends we use a small RAII class called mapped_memory. This will take care of unmapping things for us when we are done using them. The double mapping magic happens in a factory function that builds our reader and writer queues:

  static Derived queue_factory(const std::string& queue_filename,
                               const std::string& control_block_filename) {
    const size_t size = std::filesystem::file_size(queue_filename);

    // Let's reserve the space before we mmap the buffer twice.
    mapped_memory double_mapping(2 * size);

    // Now we do the mapping of the same file twice in the contiguous region
    // we reserved. This will invalidate the previous mapping btw.
    mapped_memory first_mapping(
      queue_filename, 
      size,
      double_mapping.get_address());
    mapped_memory second_mapping(
      queue_filename,
      size, 
      first_mapping.get_address() + first_mapping.get_length());

    double_mapping.release();
    [...]

Ignoring for the moment the return type Derived, the first step is creating an anonymous mapping which is twice the size of the ring buffer; this ensures that we can actually map two regions next to each other. Right after that we create the two mappings, passing the same filename and size, but crucially we pass the address of the end of the first mapping as the address of the second.

One notable detail of the above function is that it takes a file path to a “control block”: a file containing the following data structure:

#include <atomic> // std::atomic
#include <new>    // std::hardware_destructive_interference_size

using std::hardware_destructive_interference_size;

struct control_block {
  alignas(hardware_destructive_interference_size) std::atomic<uint64_t> version;
  std::atomic<uint64_t> next_read_offset;
  alignas(hardware_destructive_interference_size) std::atomic<uint64_t> next_write_offset;
};

We note that we use a feature of C++17: std::hardware_destructive_interference_size[3]. This helps us split the two important members of the header, the reader and writer pointers, into two different cache lines. This is a common technique in lock-free data structures, as you don’t want your reader thread/process to invalidate your writer thread/process’s cache line while moving its pointer, and vice versa.

As mentioned, we use our queue_factory function to construct the two “queue” classes: queue_reader and queue_writer. The first one is:

class queue_reader : public detail::queue_base<queue_reader> {
public:
  const_view get_buffer(size_t bytes_to_read);
  void pop(size_t bytes_to_pop);
[...]

The reader application has to call get_buffer to attempt reading N bytes, and the return value could be an empty buffer if N bytes are not available (i.e. the writer is being slow). The returned const_view is a non-owning buffer akin to asio::const_buffer (which could be a drop-in replacement in production), containing a pointer and a size member. Its interface is

struct const_view {
  explicit operator bool() const { return _ptr; }
  size_t size() const { return _len; }
  const char* data() const { return _ptr; }
[...]

The buffer overloads operator bool, so it’s easy to check whether the call to get_buffer was successful.

Notably, the user is just handed a pointer into the internal buffer, so no copy is done. Once the user has read the message, they have to call pop to advance the internal reader pointer; this notifies the writer that some space has just been freed up.

Similarly the writer class is:

class queue_writer : public detail::queue_base<queue_writer> {
public:
  mutable_view get_buffer(size_t bytes_to_write);
  void push(size_t bytes_to_push);
[...]

The only other noteworthy detail is the conspicuous CRTP[4]: that’s just there as a trick to inherit the factory function we described above, which can then use the Derived template parameter as its return type.

Examples

There are two apps in the repo: a reader and a writer. Starting with the writer, after a bit of boilerplate, we can find

queue_writer writer = queue_writer::queue_factory(queue_filepath, control_block_file);
const std::string message("message");

const size_t total_message_size = sizeof(message_header) + message.size();
queue_writer::mutable_view buffer = writer.get_buffer(total_message_size);
if (buffer) {
  const uint64_t tsc = __rdtsc();
  message_header header{1, static_cast<uint32_t>(message.size()), tsc};

  std::memcpy(buffer.data(), &header, sizeof(header));
  std::memcpy(buffer.data() + sizeof(header), message.data(), message.size());

  writer.push(total_message_size);
[...]

In the first line we set up the writer and hard-code a message to send. In a real scenario the message could be anything of any length, and obviously not the same every time.

The following lines mention a header. This is a customizable struct that is used by the reader to make sense of the incoming message. In our example the header is:

#pragma pack(push, 1)
struct message_header
{
  uint32_t version;
  uint32_t size;
  uint64_t timestamp;
};
#pragma pack(pop)

We notice right away that the struct is packed, even though the members are carefully placed so as not to leave any padding anyway. What we actually want to achieve with packing is changing its alignment[6] to 1. This is because we need to read it straight off the buffer via reinterpret_cast, and doing so is already technically UB (until we get std::bless/std::start_lifetime_as at least[7][8]), but having it keep its natural alignment could additionally lead to crashes on misaligned reads.

After that we just call get_buffer and memcpy into the buffer, followed by a push of the written bytes.

The reader is symmetric as expected:

const queue_reader::const_view header_buffer = reader.get_buffer(sizeof(message_header));
if (header_buffer) {
  const message_header* const header =
    reinterpret_cast<const message_header*>(header_buffer.data());
  const queue_reader::const_view message_buffer = reader.get_buffer(header->size);

  std::string_view read_message(message_buffer.data(), message_buffer.data() + header->size);
  Logger::Info("Read:", read_message, ", timestamp", header->timestamp, ", delta:", delta_tsc);
  reader.pop(header_buffer.size() + message_buffer.size());
[...]

In the above you can see that first we reinterpret_cast the header; then we know how long the incoming message is, so we can call get_buffer again.

Performance Analysis

In the reader’s and writer’s code you can find a timer and a vector that stores the deltas of TSC[9] between sending and receiving for a million messages. We can run the reader/writer with

taskset -c 6 ./apps/reader --num_messages 1000000
taskset -c 5 ./apps/writer

We are greeted with the output:

Receiving 1000000 messages took: 59490ms                                                                                                                  
INFO: Average: 110.163                                                                                                                                    
INFO: st. dev: 1534.65                                                                                                                                    
INFO: 99th percentile: 238  

The total time it took is irrelevant since we are artificially stalling the writer every message to simulate “sparse events”. Feel free to modify the code of the writer to send messages in a loop and measure throughput.

Then we can use an included gnuplot script (see the “plot” directory) to produce the following histograms of TSC deltas:

The TSC deltas in the graph are in “cycles”, and we notice that the distribution is somewhat discrete. This kind of spectrum reflects the different kinds of “cache misses” our application is experiencing. Regardless, we achieved an impressive speed: considering that my test machine has an invariant TSC of about 1696 MHz, we are looking at latencies hovering around 65 nanoseconds here!

Lots more could be done to investigate the performance of this queue, and perhaps we will in a follow-up post. To name a few things we should definitely do:

  • Run with “perf stat”. Perf can tell us quickly how many cache misses we have, how many branch prediction misses and if the CPU is stalled waiting for instructions or data (here almost certainly the latter)
  • Check the generated Assembly. This could be fun to see on a x86 platform how the code gets optimised on the hot-path
  • Plug the code in a more realistic application, and measure performance again. There is only so much information that micro-benchmarking can give you, you need to profile in a real environment.

That’s it for the moment. As usual, if you have any comments and/or corrections and/or suggestion please do let me know 😉

References

[1] https://www.boost.org/doc/libs/1_63_0/doc/html/interprocess.html

[2] https://github.com/facebook/folly/blob/main/folly/ProducerConsumerQueue.h

[3] https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size

[4] https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern

[5] https://www.oreilly.com/library/view/97-things-every/9780596809515/ch55.html

[6] https://en.wikipedia.org/wiki/Data_structure_alignment

[7] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0593r4.html#166x-implicit-object-creation-ptrbless

[8] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2590r0.pdf

[9] https://en.wikipedia.org/wiki/Time_Stamp_Counter

Posted in c++20, IPC | Tagged c++20, HPC, LowLatency, Performance

Can Neural Networks learn Physics?

Posted on 29 October 2022 (updated 1 November 2022) by gianluca

There are plenty of times when my brain doesn’t realise that my foot is on a collision course with a piece of furniture, but most of the time even I can tell if I am going to hit someone walking in the opposite direction. How does that work, I wonder? Certainly our brains do not start computing trajectories using Newtonian Mechanics. So how do we “know”?

There are plenty of papers that implement neural networks that beautifully imitate physical simulations[0][1][2]. I wanted to see if I could write a minimal and simple (i.e. dumber) example where we could see this and understand what actually is going on under the hood. Here is the code.

In this small experiment, I wrote a tiny neural network that basically “learns” (i.e. does a fit of) the Newtonian parabolic motion of a projectile given a velocity and an initial position.

For our purposes, the neural network could have been a straight – and somewhat boring – collection of perceptrons and it would obviously have worked, at least on a finite range of inputs[3]. I thought I’d spice things up adding a layer that computes all possible second order couplings of the previous layer:

where o is the downstream neuron’s value and h is the value of the hidden layer’s neuron. For reference, the usual perceptron’s activation function would be:

with the hyperbolic tangent tanh optionally replaced by the more popular ReLU[4].

Long story short, the neural network is customisable with the two types of layers, which I called (naming is hard) multiplicative layer and standard layer. It’s instantiated simply as

    Network net(3);
    net.add_layer<MultiplicativeLayer>(3);
    net.add_layer<StandardLayer>(2);

The constructor takes the number of inputs; then one can add as many layers as one wants, with the only constraint that the last one added must have a number of neurons corresponding to the number of outputs. Once the Network is defined, one can just use it via

    const std::vector<float> outputs = net.feed_forward({vel.x, vel.y, time});

The output vector will have the size of the last added layer. The error function (aka the loss function) is computed as

    const float err = net.get_cur_network_error({pos.x * scale, pos.y * scale});

that evaluates the usual sum of squares

Finally, we back-propagate against the values we wanted to get via

    net.back_propagate({target_pos.x, target_pos.y});

Internally this is going to invoke functions on each layer, computing gradients and updating the weights.

That’s it. After the training, we plot the outputs of the network in a range that is outside the training values, and the plotted line perfectly overlaps with the expected values:

In perfect agreement with the plot of the curve computed with the actual classical trajectory printed above in blue.

The full code[5] is in a convenient CMake + Conan structure. No need to call Conan explicitly: it’s sufficient to have Conan installed and just use CMake as usual to build (see README.md in the repo).

Please feel free to comment and suggest improvements!

Cheers

References

  • [0] https://arxiv.org/abs/2002.09405
  • [1] https://medium.com/stanford-cs224w/simulating-complex-physics-with-graph-networks-step-by-step-177354cb9b05
  • [2] https://ai.googleblog.com/2021/06/learning-accurate-physics-simulator-via.html
  • [3] https://en.wikipedia.org/wiki/Universal_approximation_theorem
  • [4] https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
  • [5] https://github.com/gianlucadelfino/reachablecode/tree/d6fa94cea4aea75dd750f28d5c0bde76ed9396c5/neural_network_physics
Posted in AI | Tagged conan, NeuralNetworks, Physics

UDP Video Streaming with OpenCV and ASIO

Posted on 1 December 2020 (updated 30 October 2022) by gianluca

One of the things that always bothered me about video calling apps is that what is shown to you in the bottom right corner never reflects the quality of the video your interlocutor is actually receiving. In fact, compression is the name of the game in video calling apps, as understandably they try to minimize bandwidth requirements. However, many of us have access to fast broadband, which is barely used by these apps, and the quality of the video could (and should) be better.

The answer is obviously to get a proper streaming application, or use an off-the-shelf streaming library, but not for this developer 😉 We are going to write our own streaming application. The idea is to use Asio over UDP to transmit the frames and OpenCV to acquire them.

First things first, here is a link to the repo, which can be built both on Windows and Linux: codename cUDP (read like see-UDP)

Sender

We begin by getting the image from our webcam and compressing it into a more manageable size. The amount of compression is up to us, but we need to keep in mind that compressing the image takes longer the more we want to compress it. If we want to achieve a somewhat steady 30fps we need to keep the processing time of each frame under 33ms.

Unlike in common video calling apps, here we have full control over the compression level, which can be adjusted dynamically with the WASD keys: w/s to increase/decrease the compression level, a/d to increase/decrease the number of frames per second pushed. Furthermore, the displayed image on the sender side is exactly what the receiver is going to get (assuming they can keep up with the throughput).

    VideoWindow win(1, "cUDP");
    std::vector<int> compression_params;
    compression_params.push_back(cv::IMWRITE_JPEG_QUALITY);

    // Compression 50%
    compression_params.push_back(50);

The VideoWindow class is my own wrapper that just helps deal with the set up and clean up in a RAII fashion. The first parameter is the ID of the camera which can easily be exposed via console parameter, but out of laziness is hardcoded here.

To handle the networking we use ASIO, which will hopefully become standard in C++23. The UDP protocol does not have a connection, so it’s sufficient to set up a socket and create an endpoint object to use later in send_to():

    ::asio::io_context ioContext;
    ::asio::ip::udp::resolver resolver(ioContext);
    ::asio::ip::udp::socket sender_socket(ioContext);

    sender_socket.open(::asio::ip::udp::v4());
    const int recv_port = 39009;
    ::asio::ip::udp::endpoint recv_endpoint(
        ::asio::ip::address::from_string(recv_address_), recv_port);

Then we come to the main loop of the Sender side. We grab the frame, split it into 1-MTU chunks and send each part separately. The parts might not get there or might get there out of order, so we’ll see how we deal with that on the Receiver side:

    while (true)
    {
      cv::Mat frame = win.getFrame();
      [...]
      
      cv::imencode(".jpg", frame, buffer, compression_params);
      [...]
      
      const int16_t parts_num =
          (buffer.size() + InputBuffer::writable_size() - 1) /
          InputBuffer::writable_size();
      Logger::Debug("Frame id", frame_id, "split in parts", parts_num);
      for (int16_t part_id = 0; part_id < parts_num; ++part_id)
      {
         [...]
         sender_socket.send_to(input_buffer.buffer(), recv_endpoint, 0, err);

Many details are omitted in the above snippets. For instance, we use a custom InputBuffer class that helps create the frame chunks, prepending a header to them, and we also have some logic to dynamically change the compression rate and the FPS using the WASD keys.

The header class is nothing surprising, and boils down to the following:

 struct Header
 {
    int32_t frame_id{};
    int32_t part_begin{};
    int16_t part_id{};
    int16_t total_parts{};
    int32_t part_size{};
 };

As mentioned, the more interesting logic is on the receiver side, so let’s get to it.

Receiver

In the receiver there are 2 important classes: FrameStitcher and FrameManager. The former is essentially a buffer that understands how to read the Header and place the arrived chunk in the right spot: the chunks can indeed arrive out of order and we use the “part_begin” member to offset the memcpy into the internal buffer.

struct FrameStitcher
{
  void add(const InputBuffer::Header& h_, ::asio::const_buffer part_)
  {
    memcpy(_image_buffer.data() + h_.part_begin, part_.data(), part_.size());
    _parts_num--;
  }

  bool is_complete() const { return _parts_num == 0; }

  cv::Mat decoded() const
  {
    assert(is_complete());
    return cv::imdecode(_image_buffer, cv::IMREAD_UNCHANGED);
  }

  private:
  int _parts_num;
  std::vector<uchar> _image_buffer;
};

The FrameManager is slightly more complicated because it has to deal with old frames and remove them. This is not so interesting, so we’ll skip the details. Suffice it to say that the FrameManager holds a map of frames, keyed by their ids, and frameManager.add(..) gets called for each frame chunk we receive.

There is another class worth mentioning on the receiver side: the LockFreeSPSC, amicably (and technically improperly) called the “disruptor” in the code. This class is basically a simplified version of folly’s ProducerConsumerQueue, and it’s used here to decouple the thread that receives the data from the network from the thread that processes it. It could possibly have worked with only one thread, but it’s always a good idea to have a single thread dedicated to “spinning” over the socket to receive as fast as possible while pushing the data down into a lockfree queue, while another (or multiple other) thread(s) just poll on the queue.

Conclusions

Our UDP streamer works okay, but it’s a little wasteful: certainly compressing video instead of single frames would have achieved much better performance. However, this is simple and viable, so it’s good for me.

This project is far from over though: next up is audio and then the fun part begins. We’ll look into playing with facial recognition using dlib and I have some other ideas that are too ambitious to announce at the moment 😉

Posted in asio, opencv | Tagged asio, C++, networking, opencv

OpenCV and Book Title Recognition

Posted on 29 June 2020 (updated 30 October 2022) by gianluca

During this time of uncertainty, the only thing we can count on is that what should be straightforward usually isn’t. I had some experience already with OpenCV and thought it would be trivial to bang together a small program to help catalogue some old books on the shelf. My naivety was quickly exposed once the presumed “afternoon project” turned out to take several weekends’ worth of work. Nonetheless, “the longer the journey, the further you get”, so I can say I did learn quite a bit from this project.

Given the amount of code added, the topic would deserve more than one post, but I will try to summarise here and point to the other posts that helped me attack similar problems.

The final bounding boxes to feed the text recognition library

TLDR; The code is all on GitHub; it requires OpenCV 4, which unfortunately must be built with the optional dnn module. We also require the pre-trained EAST text-detection model.

Intro

As mentioned, there are already a number of exhaustive blog posts on the internet that deal with this problem; nevertheless the C++ code in a lot of them leaves a bit to be desired (e.g. it looks like Java), so as part of this project I started my own opencv_utils library, which I am sure I’ll need again. Here are a couple of introductory posts that I found useful for the initial setup of the EAST model and tesseract.

In the following I will assume that tesseract is already installed and the EAST model pre-trained weights are downloaded and ready to be used.

The New Components

I have added a couple of convenience classes for this project, starting with a simple VideoWindow manager to handle the (optional) webcam feed acquisition. In the pushed code, however, I commented that out, and I just have one saved screenshot that loops for testing purposes.

Another (not so related) component that will surely be reused is a Logger class, which is small but not trivial; let’s take a look.

The Logger itself is a Meyers singleton, no surprises there:

    static Logger& Instance()
    {
        static Logger l;
        return l;
    }

The Logging functions however are more interesting, for instance:

    template <typename First, typename... Args>
    static void Debug(First& first, Args&&... args)
    {
        if (Level::DEBUG >= Instance().level)
        {
            std::cout &lt;&lt; "DEBUG: ";
            DebugHelper(first, args...);
        }
    }

This is just a wrapper that accepts any number of arguments and just checks for the logging set level, then calls the following private helper function:

    template <typename First, typename... Args>
    static void DebugHelper(First& first, Args&&... args)
    {
        std::cout << first;

        if constexpr (sizeof...(Args) > 0)
        {
            std::cout &lt;&lt; " ";
            Logger::DebugHelper(args...);
        }
        else
        {
            std::cout << std::endl;
        }
    }

Here we use C++17’s constexpr if to recursively call ourselves and iterate through all the available parameters. The function can then be passed any number of “loggable” parameters, for instance:

Logger::Info("My Value", 3, "Other Value", 4);

Another useful utility is void displayMat(cv::Mat mat, const std::string& windowName_). This small (and expensive) debug function stores a static map of Window objects, and it’s called to generate all the useful previews of the different stages (input, computed blob, NN output, any pre-processing before calling tesseract).

Finally I have added a number of utility functions to deal with tesseract itself, whose API is very C-like (and not in a good way).

The last noteworthy utility is a function I wrote that uses HoughLines to straighten the text images. Nothing groundbreaking, but useful to have.

The Main Logic

The app simply does the following:

  • Creates (or imports) the input cv::Mat, either via local image or via webcam.
  • Runs the image through the EAST text detection model. This helps to feed tesseract only a small amount of information. This is the output:
  • Takes all the found words, and tries to put together “Titles” by looking at the words that are on the same line. We get:
  • Calls tesseract to check for possible text
  • Adds the new titles to a list, rejecting the ones it already found and filtering the ones tesseract was not confident about.

Overall it works, but the text recognition is still quite fuzzy, possibly because the input is rather difficult to parse: most books have "attention grabbing" fonts and colours, and on top of that some books' spines are very worn out.

Conclusion

The project was ambitious and I can claim a partial success. There is much that can be done to improve it, mostly at the cost of performance, by pushing different pre-processed versions of the input through the pipeline. That would, however, not be very entertaining, so that is it for now 😉

Posted in opencv

Monte Carlo Stochastic modelling on OpenCL

Posted on 29 March 2020, updated 30 October 2022, by gianluca

This post will just be a quick example of how to use OpenCL to generate some simulated stock prices’ “trajectories”. Using CUDA could have been possible as well, and indeed the code could be ported with minimal changes, but then it would have been able to run only on NVidia GPUs while OpenCL runs even on CPUs.

A consideration, before we get into the code, is that this is a simplistic model which is not very useful at all. This post, in fact, is not about how to properly model stocks, but rather just a quick self-contained reference to get up and running with the topic. In this example we will compute 10k different possible trajectories for the price of some stock and generate a histogram with the results.

Without further ado, here is the link to the code.

Right off the bat we should say that we are using the C++ bindings for OpenCL; you can find them in the repo in the 3rdParties directory. The CMake file will look for them automatically if you clone the entire repository. Other than that, the README will get you up and running.

Starting with the boilerplate code, at the top there are a couple of macros that we need to change if we want to use our GPU instead of the CPU like I do here:

#define PLATFORM 0
#define DEVICE 0

Besides that, the rest of the main.cpp is fairly uninteresting until lines 56-60, where I hardcoded some values that will drive the price increase (or decrease). These values are the common parameters, described better than I could in many other places [1] [2] [3].

The main thing we need to know is the following iterative formula for one time step of a geometric Brownian motion:

share_val(t + dt) = share_val(t) * (1 + rate * dt + variance * epsilon * sqrt(dt))

which in the kernel becomes

share_val += share_val * (dt * rate + variance * epsilon * sqrt(dt));

So we are going to need a Gaussian random variable epsilon. In C++ we would just use std::normal_distribution, but in OpenCL we don't have anything like that. To create random variables in a kernel we'll use a library called RandomCL, which you can check out into the "../3rdParties" directory (see README). To turn the uniform random variable into a Gaussian one we will employ the good ol' Box-Muller transform, which generates 2 numbers at a time:

        // Compute gauss random number using box muller
        const float rand1 = msws_float(state);
        const float rand2 = msws_float(state);

        const float rand1_log = -2 * log(rand1);

        float cos_rand2;
        const float sin_rand2 = sincos(rand2 * two_pi, &cos_rand2);

        const float epsilon1 = sqrt(rand1_log) * cos_rand2;
        const float epsilon2 = sqrt(rand1_log) * sin_rand2;

That’s it. Run it and plot it, here is what you get with the current hard-coded parameters:

References

[1] https://en.wikipedia.org/wiki/Geometric_Brownian_motion

[2] http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/sfehtmlnode27.html

[3] https://beta.vu.nl/nl/Images/werkstuk-dmouj_tcm235-91341.pdf

Posted in OpenCL

Enumerate() with Structure-Bindings and C++20 Coroutines

Posted on 20 January 2020 by gianluca

TLDR: godbolt links: Clang Gcc

Intro

If there is one thing that is easy to do with C-style loops and not as easy with modern C++, it is keeping the loop variable around while we are accessing the elements of a container. The C way, and arguably the best way, is the usual

for (size_t i = 0; i < vec.size(); ++i)
{
    // use i and vec[i];
}

On the other hand, since the C++11 standard rolled out, we have all been trying to replace the above with the more modern-looking range-for approach:

for (auto&& elem : vec)
{
    // use elem
}

While this certainly improved our code bases, it left us wanting a way to access the index value in the few cases where we happen to need it.

We should mention the other “standard” way of iterating a container, which is using iterators of course

for (auto iter = v.cbegin(); iter != v.cend(); ++iter)
{
    // use *iter
}

If the iterator is random access we could in principle add the following in the loop to recover the index:

const size_t idx = std::distance(v.cbegin(), iter);

Now that certainly works, but it's far from modern-looking and it's very verbose. What if we could have something like Python's enumerate that we could use in a range-for loop? Indeed something like that has already been proposed, within a set of utility functions to be implemented using ranges[1]! In this post, however, we will look at a possible implementation using coroutines instead.

Implementation

DISCLAIMER
The following code is experimental and I couldn’t wholeheartedly recommend it as something to be used in production. This certainly falls under the category:
“Your Developers were so preoccupied with whether or not they could, they didn’t stop to think if they should”.

Coroutines are almost here, as they will be part of the C++20 standard. Posts like these [2][3] inspired me to learn more about the topic, and greatly helped me avoid common coroutine pitfalls I would have easily fallen into. If you haven't read them already, I would suggest reading both posts first before continuing.

So, let’s write a small coroutine that would allow us to write:

for (const auto& [idx, item_ptr] : Enumerate(v))
{
   // use idx and item_ptr
}

That is a range-for loop with a (size_t) index and the element of the container. Notice that the element is a pointer, whereas in principle I would have liked a reference. The reason for that is mostly laziness: using a pointer automatically gives us a default-constructible type to be stored in the generator (as the "promise_type"), and it trivially guarantees no copies of the container's elements will accidentally be made. That said, it is surely possible that the same could be accomplished without sacrificing reference semantics, maybe using some kind of reference wrapper, but let's roll with the pointer for now.

We will employ a trick we learned in [2] to take care of the case where temporaries are passed into the coroutine, which would otherwise lead to dangling references. The technique boils down to writing an "impl" for the actual coroutine and using "partial" forwarding to deduce value template parameters instead of references when we get passed a temporary.

template<typename T>
using range_value_t = std::remove_reference_t<decltype(std::declval<T>().operator[](size_t{}))>;

template<typename Container>
generator<std::pair<size_t, range_value_t<Container>*>> EnumerateImpl(Container v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        co_yield std::make_pair(i, &v[i]);
    }
}

template<typename Container>
auto Enumerate(Container&& v)
{
  // Notice the explicit template below. see [2]
  return EnumerateImpl<Container>(std::forward<Container>(v));
}

And here is how we use it

size_t usage(const std::vector<size_t>& v)
{
  size_t acc{};
  for (const auto& [idx, item_ptr] : Enumerate(v))
  {
    acc += idx;
    acc += *item_ptr;
  }
  return acc;
}

Let’s break down what happens here:

  • The structured binding in the last snippet is initialized by the coroutine's yielded value, which is a pair with the index and (a pointer to) the value of the container.
  • The coroutine resumes its execution on every loop iteration and uses a C-style loop internally to iterate exactly as we would have done without coroutines.

Performance

Here is where it gets a bit more complicated, and why I wouldn't necessarily encourage anyone to use this in their code. The C-style loop is perhaps one of the simplest and most common things compilers can optimize, and even when it doesn't get optimized it maps directly to simple assembly. The coroutine variant instead has a clear overhead. The Godbolt output demonstrates that even in the optimized case there are quite a lot of set-up and clean-up instructions before and after the loop.

I would expect this to be overall demonstrably slower, but a decent benchmark is always the last word on this kind of matter. We will have another post for this type of analysis, as this one is already pretty long. For the time being I think it’s safe to assume that you shouldn’t use any of this in production code.

Conclusions

It was an interesting experiment, but more work has to be done to optimize this enough before it can be recommended to be used. If you have any ideas on how to improve it, be sure to let me know in a comment 🙂

References

[1] http://open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1894r0.pdf
[2] https://toby-allsopp.github.io/2017/04/22/coroutines-reference-params.html
[3] https://quuxplusone.github.io/blog/2019/07/10/ways-to-get-dangling-references-with-coroutines/
[4] https://en.cppreference.com/w/cpp/language/coroutines

Posted in c++20Tagged c++20, Coroutines

“Interfaces” with C++20 Concepts

Posted on 9 June 2019 by gianluca

Sometimes it happens that we just want to define an interface, and some of those times we think we should not use inheritance for performance-related reasons. Most of those times we are very wrong, and we should just inherit from a base class with (possibly pure) virtual functions. However, in this post I will explain an alternative for the minority of cases where we actually cannot use inheritance but still want to define an interface. We will do so by employing one of the latest features of the C++ standard: Concepts!

The argument against dynamic polymorphism revolves around the v-table (short for "virtual table"), which is how C++ actually implements virtual member functions. Virtual functions seem to work by magic, but this magic turns out to be just an extra pointer to a hidden struct containing function pointers; this struct is the "v-table" and every derived class has a different one. Hence the "problem": an invocation of a virtual method is not just one jump to the code of the function, but 2 jumps, because first we need to dereference the pointer to the v-table! (A complete explanation of how the v-table works is beyond the scope of this post, and there are a lot of nice explanations around.)

The arguments in favour of dynamic polymorphism are many, first and foremost that it is the standard and, most importantly, the simplest way to achieve the result. The emphasis on "standard" cannot be overstated. The best code is code that is easy to read: people understand commonly used patterns and hate it when they need to read comments to make heads or tails of what's going on (even worse when there is no comment at all!). A second argument, and one that is often overlooked, is that performance does not necessarily have to suffer: there are circumstances where we get dynamic polymorphism for free! This happens when the compiler manages to "devirtualize" the calls, and more generally when the branch predictor of our CPU correctly guesses which function implementation we are going to use, preloading the right instructions and hence avoiding the cost of the extra jump.

That being said, IF you need extreme control over the code and IF you are trying to squeeze every last drop of performance out of your application (2 big IFs), then you probably already know the current common way of declaring interfaces without virtual functions: CRTP. The "Curiously recurring template pattern" is a clever way to achieve our goal, but it's not the only way, and perhaps not the nicest way either. In this post we are going to do the same as can be done with CRTP, but we'll do so (ab)using C++20 Concepts. Here is how.

Let’s first define our “interface”, which will be (you guessed it) a Concept:

template <typename T>
concept Shape = requires(const T& t)
{
    { t.area() } -> std::convertible_to<float>;
};

The above declares that our type should have a const member function called "area" whose result is convertible to float (note the std::convertible_to constraint from &lt;concepts>; the bare "-> float" form from an earlier draft of the feature is not valid C++20). This is usually employed as a constraint for a template parameter in functions or classes, e.g.

template <Shape S>
void foo(S){}

Which does not compile if the object passed to foo() does not satisfy the Shape concept. For instance if we declare a Rectangle struct that almost satisfies the constraint:

struct Rectangle
{
    float area(); // Forgot the const!
};

int main()
{
    Rectangle r;
    foo(r);
}

we get:

.../src/main.cpp:48:10: error: cannot call function ‘void foo(S) [with S = Rectangle]’
   48 |     foo(r);
      |          ^
.../src/main.cpp:10:6: note:   constraints not satisfied

The question is: is there a way to declaratively state that Rectangle is a Shape, perhaps directly in its header? Can we do so by employing the same concept “technology” that we used for foo()? Yes and no (and yes).

What I personally would like to be able to do is something like the following:

struct Rectangle requires Shape
{
    float area() const;
};

This way the first line of the declaration of the Rectangle type clearly states that this class satisfies (in a way "inherits" from) Shape. The syntax gods however did not allow us to be this straightforward, as the raison d'être of the concepts feature is not to declare types but to help constrain which types should be allowed in template substitutions.

Enter static_assert:

struct Rectangle
{
    float area() const;
};
static_assert(Shape<Rectangle>);

This does exactly what we'd expect: the clever designers of the concepts feature allowed us to use static_asserts to achieve our goal, even if we need to get to the bottom of the declaration to see it in action.

But what if Rectangle was a class template? Then we could not write:

template<typename T>
struct Rectangle
{
    float area() const;
    T base;
    T height;
};
static_assert(Shape<Rectangle<...>>); // We need to provide a template parameter here!

Or better, we could write it like that, but we would also need to come up with a valid template parameter, which in general could be any complex type that might or might not be known by the developer at the time of writing the class (the header might be in a library and it might be a completely unknown user-type). So, instead of coming up with some random template parameter we could use the following trick:

template <typename T>
struct Rectangle
{
    Rectangle() {
        // decltype(*this) is a reference type, hence the remove_reference_t
        static_assert(Shape<std::remove_reference_t<decltype(*this)>>);
    }
    float area() const;
    T base;
    T height;
};

The static_assert is moved to the constructor; then we can just use decltype to obtain the instantiated type and do the check! It goes without saying that this also works in the previous case where we didn't have a class template 😉

So there you have it, for the small small price of declaring a constructor (which most of the times you have anyway), you can make sure that your class satisfies any concept (i.e. “interface”) of your choosing, and all without employing CRTP.

We could take it one step further, and make it look more like inheritance by moving closer to CRTP territory. It’s enough to declare a base class and move the constraint to the base class constructor, then we can inherit from that like we would for normal CRTP:

template <typename T>
struct ShapeBase
{
    ShapeBase() { static_assert(Shape<T>); }
};

template <typename T>
struct Circle : ShapeBase<Circle<T>>
{
    float area() const;
    T radius;
};

Is it better than just adding the static_assert directly in the “Derived” constructor? You choose 😛

All the code is in the repo. For the lazy here is the link with the working example in godbolt.

As usual, suggestions/comments/corrections are welcome 🙂

Cheers!

EDIT:

Following the discussion on reddit, here is the modified (and better!) version without the decltype

template <typename T>
struct Square
{
    Square()
    {
        static_assert(Shape<Square<T>>);
    }
    float area() const;
    T edge;
};

thanks to u/eacousineau for giving me the idea.


EDIT2:

As pointed out by /u/anonymous2729 et al. we could improve still further by making the base class trivially default constructible, a property we lost once we added a user-defined constructor. To be more specific, a class is trivially default constructible if the following conditions hold:

  • Its default constructor is implicitly declared or explicitly defaulted.
  • It has no virtual functions and no virtual base classes.
  • All its direct base classes have trivial default constructors.
  • The classes of all its non-static data members have trivial default constructors.

So we can write:

template<class T>
struct ShapeBase {
    ShapeBase() requires(Shape<T>) = default;
};

template<class T>
struct Circle : ShapeBase<Circle<T>> {
    float area() const;
    int radius;
};

And the same can be done for the version that does not employ a base class. Here:

template<class T>
struct Circle{
    Circle() requires(Shape<Circle<T>>) = default;
    float area() const;
    int radius;
};

Thank you all for the ideas!

Posted in C++ abuseTagged c++20, concepts, inheritance

Catching Allocations

Posted on 1 June 2019 by gianluca

(A quick and dirty approach)

For our second post I thought I’d show a quick hack that I have recently used to find unwanted allocations. We know that a call to malloc can take an indefinite amount of time, and therefore it should be shunned from the “hot path”. With just a few lines of code we can quickly verify that our hot path is indeed free of such pesky calls.

There are many better and more professional ways to accomplish the same goal, like using Intel's VTune or carefully defined malloc hooks, but this is probably the fastest way if you just want a quick check and don't plan to keep this code around. Without further ado, let's bring in the snippet:

void* operator new(size_t s_)
{
    void* ptr = malloc(s_);
    std::cout << "Allocated " << s_ << " bytes.\n";

    return ptr;
}

That’s it. This will be called any time new would have been called, and it can do whatever you want (just mind accessing globals if new is called by multiple threads). Technically we should also overload operator new[], but the purpose of this post is not to be thorough and professional 😛

Let’s see it in action! You can check out the full code here, which is just a few lines allocating a vector of ints, with or without a call to "reserve".

int main()
{
    std::vector<int> vec;

    // Uncomment the following line to see the difference in output
    // vec.reserve(1'000'000);

    for (int i = 0; i < 1'000'000; ++i)
    {
        vec.push_back(i);
    }

    return 0;
}

The output without the call to reserve is:

Allocated 4 bytes.
Allocated 8 bytes.
Allocated 16 bytes.
Allocated 32 bytes.
Allocated 64 bytes.
Allocated 128 bytes.
Allocated 256 bytes.
Allocated 512 bytes.
Allocated 1024 bytes.
Allocated 2048 bytes.
Allocated 4096 bytes.
Allocated 8192 bytes.
Allocated 16384 bytes.
Allocated 32768 bytes.
Allocated 65536 bytes.
Allocated 131072 bytes.
Allocated 262144 bytes.
Allocated 524288 bytes.
Allocated 1048576 bytes.
Allocated 2097152 bytes.
Allocated 4194304 bytes.

The output with the reserve instead:

Allocated 4000000 bytes.

Much better 🙂

We could go on talking about page faults, but I’ll leave that for another day!

PS I have improved the CMakeLists.txt file in the repo, including warnings and address and undefined behaviour sanitizers (which are enabled in "debug" mode). We'll keep using this CMake file and perhaps dedicate a post to it in the future.

Posted in Hacks, quick hacksTagged cmake, Performance

Qt Hello World

Posted on 27 May 2019, updated 30 October 2022, by gianluca

As much as I prefer focusing on the low level stuff, having a button to push can be very satisfying. In the spirit of keeping this blog short and to the point I would like to get right away to the minimum project setup required to accomplish just that: a Button!

To the impatient, here is the link to the code 😉

This post will also serve to start building a "standard" CMake project structure, with a CMake file which we'll use across all the projects. As a matter of fact, setting up Qt to work with CMake is easy only after you've done it for the first time, and I hope this post will somewhat alleviate the initial pain.

Here are all the files we need:

.
├── CMakeLists.txt
├── include
│   └── MainWindow.h
└── src
    └── main.cpp

Let’s start examining the CMakeLists.txt, which is basically the only interesting thing here:

cmake_minimum_required(VERSION 3.0)

project(qtHelloWorld)

find_package(Qt5 REQUIRED COMPONENTS Widgets)

qt5_wrap_cpp(MAINWINDOW_MOC include/MainWindow.h)
add_executable(qtHelloWorld src/main.cpp ${MAINWINDOW_MOC})

target_include_directories(qtHelloWorld PUBLIC include)
target_link_libraries(qtHelloWorld Qt5::Widgets)

The action happens in line 7-8: qt5_wrap_cpp. This function is crucial to avoid linker errors that happen whenever we have a class that uses “moc” macros. These are essentially decorators for various classes we need to compose our UI. In this project we have only one: MainWindow

class MainWindow : public QMainWindow
{
    Q_OBJECT
public:
    explicit MainWindow(QWidget* parent_) : QMainWindow(parent_), _button(this)
    {
        this->resize(400, 300);
        _button.setText("Push Here");
        _button.setGeometry(QRect(QPoint(100, 100), QSize(200, 50)));

        connect(&_button, SIGNAL(released()), this, SLOT(handleButton()));
    }

private slots:
    void handleButton()
    {
        _button.setText("Pressed!");
    }

private:
    QPushButton _button;
};

This does nothing but declare a button and have a simple handler for it.

That’s basically it. Feel free to copy these 3 files and import them in your IDE using the CMakeLists.txt, or just run cmake manually.

Posted in quick hacksTagged cmake, QT

about

Gianluca Delfino is a software developer with a PhD in Applied Mathematics and a creative aptitude. He enjoys getting his hands dirty with C++ and needs a place to organize his ramblings. Hence this devblog.
