Profiting with Elixir, Erlang and C

Taking distributed systems to the ‘Rinha de Backend’

On March 10th, the second edition of the "Rinha de Backend" ("Backend Fight"), a very fun hackathon, came to an end. I participated in this edition, and here I share the perceptions and insights I picked up along the way.

In this article, we will discuss some technical approaches and decisions, as well as some observations on competitors’ behavior.

A few notes before you start reading:

  • I’ll be talking about a banking API, and it involves database concurrency. To avoid confusion, every time I refer to a transaction I mean a database transaction; when I refer to an operation I mean a bank transaction, such as a credit or debit;
    – Sometimes I mention seeing other people’s results. I saw those on Twitter, where people were discussing them;
  • I’ll refer to the event as ‘Rinha’. It can be translated as "cockfight", but since it’s the name of the hackathon I’ll keep it as-is.

The challenge

The Rinha is a hackathon like any other: a challenge is proposed, you implement it, and you compete with other developers.

This time the challenge involved concurrency control. The idea was to implement a web application that handles banking operations. Your API would run as two instances behind a load balancer.

Each customer of this bank had a lower limit on their balance. Therefore, debit operations could only be performed while keeping the balance within that limit.

Furthermore, resources were limited: everything had to run with 1.5 CPUs and a maximum of 550MB of memory. That includes both API instances, the load balancer, the database, and whatever else you need.

The API should have endpoints for credit, debit, and statement.

What we have here is a classic concurrency problem: two calls on the same account, a debit and a credit, may both be accepted but must execute consistently. And that’s the challenge. I’ll explain the criteria for winning the competition at the end, but the most important thing here is the journey.

[Diagram: the load test drives traffic through a load balancer pointing to the two API instances, both connecting to a single database.]

As you can see from the diagram, the load test is performed against the application. That’s important to note, as we’ll see in the criteria later on.

Erlang (and Elixir) to the rescue

My feeling about Rinha is that it’s an invitation to experimentation, so it’s a great opportunity to try things you always wanted to but couldn’t.

I have been studying Erlang/Elixir for a while. I have dedicated myself to contributing to open-source projects in this ecosystem, like Swoosh and Cowboy.

I didn’t want to use a mega framework like Phoenix to do something so simple, something that would only live for a competition. I intended to do things manually, so I could feel and see things happening, like a potter watching a vase take shape.

My choice was based on the Erlang ecosystem. I used:

  • Cowboy as the web server;
  • Mnesia as the database;
  • NGINX as the load balancer (not Erlang, bear with me).

I wanted to do something simple: endpoints, database transactions, and that’s it. I started with the basic implementation guide, which is:

  • make it work;
  • make it right;
  • make it fast.

On top of that, I followed Joe Armstrong’s primer on developing distributed applications (Programming Erlang, section 14.2), which is:

  • run the program regularly on a single node;
  • run it on two different nodes on the same machine (sketched below);
  • run it on two different machines.
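
For step two, a common way to do it with an Elixir Mix project is to start two named nodes and connect them (node and host names here are illustrative):

```
# Terminal 1: start a node with a short name
iex --sname a -S mix

# Terminal 2: start a second node on the same machine
iex --sname b -S mix

# In node b's IEx shell, connect to node a (the hostname varies per machine):
Node.connect(:"a@my-hostname")
# => true
Node.list()
# => [:"a@my-hostname"]
```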

My goal was a two-node Erlang cluster, with each node running a server capable of handling requests.

Requests received by the endpoints are directed to Mnesia in a transaction. Since Mnesia is a distributed database, carrying out the operations inside a transaction already guarantees part of the debit banking operation.
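
As a minimal sketch of that idea (assuming an illustrative `:clients` table holding the balance and limit, not my exact schema), a debit inside an Mnesia transaction looks roughly like this:

```elixir
# Hypothetical sketch: table name and record layout are illustrative.
defmodule Bank.Debit do
  def debit(client_id, amount) do
    :mnesia.transaction(fn ->
      # Read and write happen atomically inside the transaction,
      # even across the two nodes of the cluster.
      [{:clients, ^client_id, balance, limit}] = :mnesia.read({:clients, client_id})

      new_balance = balance - amount

      if new_balance < -limit do
        # Aborting rolls everything back; the caller gets {:aborted, reason}.
        :mnesia.abort(:insufficient_limit)
      else
        :mnesia.write({:clients, client_id, new_balance, limit})
        new_balance
      end
    end)
  end
end
```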

What matters

When solving a problem, we should start by solving what matters. This is usually the hardest part.

Beyond the validations on customer IDs and operation types, the real problem was preventing the customer’s balance from going below its threshold.

The rules tell us that credit operations can be accepted and processed indefinitely. The restrictions are on the debit operation: it’s necessary to check whether the operation would push the balance below the allowed limit; if so, the application should not process it.

Problem modeling

Based on the problem constraints, I could see that eventual consistency could work. Thus, an approach that seemed valid to me was an event-sourcing-like implementation.

All valid operations are saved as they arrive, and an aggregation table holds the current balance and the list of the last ten operations.
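
In rough Elixir terms (with table and field names invented for illustration), the write path appends the raw operation and refreshes the aggregate in the same transaction:

```elixir
# Hypothetical sketch of the event-sourcing-like write path.
defmodule Bank.Ledger do
  def record(client_id, op_type, amount, description) do
    :mnesia.transaction(fn ->
      # Append the raw operation to the log table.
      :mnesia.write({:operations, make_ref(), client_id, op_type, amount,
                     description, System.os_time(:millisecond)})

      # Refresh the aggregate: current balance plus the last ten operations.
      [{:aggregates, ^client_id, balance, last_ops}] =
        :mnesia.read({:aggregates, client_id})

      delta = if op_type == :credit, do: amount, else: -amount
      recent = Enum.take([{op_type, amount, description} | last_ops], 10)

      :mnesia.write({:aggregates, client_id, balance + delta, recent})
    end)
  end
end
```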

This solution worked nicely, and in the load test provided by the organization to validate submissions it passed all validations.

But I wasn’t that happy with the performance. My implementation was being beaten by stacks like Node.js. It didn’t make sense. I knew something had to be done, and I was pretty sure it was my fault.

Multitenancy

My implementation was being obliterated by JavaScript. I thought it was because of contention on the log table, since every client’s operations were directed to that single table.

Separating that seemed reasonable to me. What if I had multiple log tables, one per client? I did just that.

The diagram above shows that: each request is directed to a specific table based on the client ID, so log writes can now happen concurrently.
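
Creating the per-client tables at startup can look roughly like this (a sketch; the naming scheme, attributes, and fixed client range are assumptions for illustration):

```elixir
# Hypothetical sketch: one Mnesia log table per client, replicated
# in RAM on both nodes of the cluster.
defmodule Bank.Schema do
  @client_ids 1..5

  def create_log_tables(nodes) do
    for client_id <- @client_ids do
      table = :"operations_#{client_id}"

      {:atomic, :ok} =
        :mnesia.create_table(table,
          attributes: [:id, :type, :amount, :description, :inserted_at],
          ram_copies: nodes
        )
    end
  end
end
```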

It didn’t improve that much though. p99 went from ~100ms to ~90ms.

NOTE: the multitenancy also applied to the aggregate table, so statements could be served concurrently. That’s not in the diagram, and I would soon move on from the aggregate-table idea anyway.

Docker on Mac – a tragedy

Even though the multitenancy didn’t improve the performance that much, I was pretty sure I would go with that strategy.

I kept trying to change things to get better performance. I tried tweaking Mnesia, Cowboy, and NGINX settings. Nothing worked. Some changes even made the performance worse.

I was using Docker on macOS. I knew its limitations there, since Docker relies on Linux kernel features that macOS doesn’t provide, forcing containers to run inside a virtual machine. Also, in the last edition, which was more about handling as many requests as possible, people mentioned a Docker bottleneck in networking.

I checked it. I executed the application on another machine – with fewer CPU cores and less RAM – and it performed way better. The p99 went from 90ms to 4ms!

I knew Docker would slow it down. I just didn’t expect it would be that much.

Profiling and C code (blazingly fast)

Honestly, with the overall p99 of 4ms, I was already quite satisfied. It came close to implementations I had seen using Rust and Go.

However, something still bothered me. The execution time for validations lagged behind other implementations: my validation p99 was around 107ms, with a maximum time reaching 110ms.

To be clear, this wasn’t a horrible time, but I wanted to push it further, so I decided to polish the bits, taking the opportunity to learn a little more about profiling in Erlang/Elixir.

I discovered that Erlang has a vast set of profiling tools, but not many good examples. These tools are integrated with Elixir through Mix, yet there are no examples of how to wire them into a project either. The available examples profile code passed directly to the tool, like a simple implementation of a Fibonacci module, instead of a running project.

I did some simple profiling using eprof, which provides time information for each function used in the program. No call graph is produced, but eprof has considerably less impact on the program it profiles than a tracing profiler like fprof.
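
For reference, a minimal way to run it from Elixir (the module and function being profiled are illustrative names, not my actual code):

```elixir
# :eprof ships with Erlang/OTP's tools application and is callable from Elixir.
:eprof.start()

# profile/1 runs the fun under profiling and stops measuring when it returns.
:eprof.profile(fn -> MyApp.Validator.validate(payload) end)

# Print the time spent per function.
:eprof.analyze()
:eprof.stop()
```

Elixir also exposes eprof through Mix as `mix profile.eprof`, though, as noted, the examples lean on one-off expressions rather than running projects.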

I wanted to see which of the functions in my code were slowest when validating operations. I noticed that a considerable amount of time was being spent parsing JSON. So I thought: how can I improve this?

I was already using the most popular JSON lib in the Elixir ecosystem, the one included in Phoenix. For a moment I even considered taking a parser written in C and porting it to Erlang as a NIF. But luckily I found jiffy, which is already a NIF (blazingly fast).

I then replaced the jason lib with jiffy. It improved the validation performance, taking the p99 validation time from 108ms to 83ms, with a maximum time of 85ms: a reduction of around 23%. A reasonable improvement, I’d say.
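
The swap itself is tiny. A sketch of the before and after (variable names are mine; `:return_maps` makes jiffy decode JSON objects into Elixir maps, the way Jason does):

```elixir
# Before: Jason, a pure-Elixir parser.
{:ok, params} = Jason.decode(body)

# After: jiffy, a C NIF.
params = :jiffy.decode(body, [:return_maps])
```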

I think it could be improved a bit more. But I ran out of time. I had to submit it.

I submitted my implementation with a little bit of Erlang in my Elixir (through the :mnesia calls) and a little bit of C in my Erlang.

The solution from an Erlang/Elixir perspective

We went through the solution I submitted from a high-level perspective. Now let me show you the solution from an Elixir/Erlang perspective, in terms of processes and how they communicate with each other.

We have a cluster of two nodes. When the application boots, it starts a global GenServer, hence a process, for each client. This GenServer holds in its state the client’s balance, list of latest operations, and minimum limit. So yes, it’s all in memory.

HTTP requests are received by Cowboy, validated, and then delegated to the respective GenServer. During startup, the application also starts distributed Mnesia and creates the log tables.
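
A minimal sketch of such a per-client process (state fields and names invented for illustration), registered globally so either node can reach it:

```elixir
# Hypothetical sketch of the per-client global GenServer.
defmodule Bank.Client do
  use GenServer

  def start_link({client_id, limit}) do
    GenServer.start_link(__MODULE__, {client_id, limit},
      name: {:global, {:client, client_id}}
    )
  end

  # Either node can call this: :global resolves the process cluster-wide.
  def debit(client_id, amount) do
    GenServer.call({:global, {:client, client_id}}, {:debit, amount})
  end

  @impl true
  def init({client_id, limit}) do
    {:ok, %{client_id: client_id, limit: limit, balance: 0, last_ops: []}}
  end

  @impl true
  def handle_call({:debit, amount}, _from, state) do
    new_balance = state.balance - amount

    if new_balance < -state.limit do
      {:reply, {:error, :insufficient_limit}, state}
    else
      {:reply, {:ok, new_balance}, %{state | balance: new_balance}}
    end
  end
end
```

Funneling each client’s operations through a single process is what makes the concurrency control work: the process mailbox serializes the messages, so operations are handled one at a time.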

The "They are smarter than me" pitfall

Since people were sharing their results on Twitter, it was easy to slip into thinking you couldn’t achieve the same thing. It happened to me for a bit when I realized my implementation was being destroyed by JavaScript runtimes.

The point is: you are seeing only the results. You don’t know which machine that implementation ran on, nor how it was built. You also don’t know how much time that person dedicated to the implementation.

Just make sure you did your best. Of course, you can take some inspiration: in this context, say, setting a result you saw on Twitter as a target and pursuing it. But make sure you are not just comparing yourself to someone else and putting yourself down.

Rinha results

I talked a lot about the challenge and the implementation process. That was deliberate, since the journey is what matters here.

But I know you may have some questions now. What happened? Did I win? Did I really profit? Let’s get to it.

The submission window closed and the final execution came. The code ran on the following specification.

Docker
Docker version 25.0.3, build 4debf41

Gatling 3.10.3
openjdk 21.0.1 2023-10-17
OpenJDK Runtime Environment (build 21.0.1+12-Ubuntu-222.04)
OpenJDK 64-Bit Server VM (build 21.0.1+12-Ubuntu-222.04, mixed mode, sharing)

CPU
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  2
    Socket(s):           1
    Stepping:            6
    CPU max MHz:         2800.0000
    CPU min MHz:         800.0000
    BogoMIPS:            5586.87
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   96 KiB (2 instances)
  L1i:                   64 KiB (2 instances)
  L2:                    2.5 MiB (2 instances)
  L3:                    48 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:         
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Vulnerable
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Memory (15Gi)
               total        used        free      shared  buff/cache   available
Mem:            15Gi       1.0Gi       9.4Gi       3.0Mi       5.2Gi        14Gi
Swap:             0B          0B          0B

Operating System (Ubuntu)
Linux rinha 6.2.0-1019-azure #19~22.04.1-Ubuntu SMP Wed Jan 10 22:57:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Analyzing this edition of the hackathon, I wondered what would be a reasonable way to rank the submissions. I couldn’t think of one. If everybody fulfills the concurrency challenge, how do you differentiate them?

I even posted something like that on Twitter.

The real criteria are the friends we made along the way

It turns out that I was kind of right.

Criteria

The results of the "competition" were based on a fictional SLA:

Rinha de Backend® Inc. will pay an amount of USD 100,000.00 to each API supplier, discounting the fines for possible SLA compensations.

It was related to latency and consistency. The SLA is defined as:

  • 98% of responses must be under 250ms;
    the penalty is calculated as (98 – success_percentage) * USD 1,000.00;
  • balances must be consistent;
    each balance inconsistency is penalized USD 803.01.
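
To make the math concrete with hypothetical numbers: a submission where only 95% of responses met the 250ms target, and which produced two balance inconsistencies, would be fined (98 − 95) × 1,000.00 + 2 × 803.01 = USD 4,606.02, taking home USD 95,393.98.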

Although the criteria did not allow a ranking, I earned my 100K USD as you can see in the result below:

Final notes

You can find my submitted implementation on my GitHub. Keep in mind that this is experimental and hackathon-focused. Although some strategies described here are useful in the real world, like event sourcing and multitenancy, the way I applied them was not: it was entirely based on a static number of clients, for example.

Finally, I did some work to squeeze out performance improvements and gained a few milliseconds. In the real world, this may not be needed.

So, take the concepts mentioned here, understand them, and have them in your knowledge portfolio. The day when you will need to apply some of those strategies may be closer than you think.

Rinha is a different kind of hackathon. People help each other instead of competing. I believe there will be more Rinha editions. Give it a try and learn something new.
