When designing the Datapath.io software architecture, we were quick to adopt distributed microservices. Their goal is to measure a small set of key Internet performance indicators (KPIs). This came after deciding which programming language to use: to collect the desired measurement results, we use Java.
Distributed Microservices Needs
To choose a protocol for our distributed microservices, we first had to define our requirements. The first requirement is a tool that can communicate with stateless microservices distributed globally across the open Internet.
The second requirement is a mechanism that maintains a stable connection between the peers. This is crucial because the Internet constantly changes its underlying topology and infrastructure: firewalls might change, Border Gateway Protocol (BGP) sessions can fail, and subnets may become unavailable.
AMQP RabbitMQ can withstand these failure scenarios.
The Datapath.io AMQP RabbitMQ Use Case
Currently, we ping the Internet, represented by one host per network prefix, every 30 minutes from over 80 locations around the world. Each result is then transferred to a central storage.
Finally, we looked at AMQP implementations with popular Java clients, which led us to RabbitMQ. It is an implementation written in Erlang with a focus on simplicity and high throughput, and it supports sharding for horizontal scalability. Google used RabbitMQ in a scenario with a message throughput of about 10 million per second. This is far beyond our current usage, but an impressive use case.
How We Use RabbitMQ
We use the RabbitMQ 3.5.6 server process as a moderator. It distributes our ping requests every 30 minutes and receives the results. RabbitMQ sends messages to different queues that address distinct hosts around the world; as a result, any host in a queue knows what it should work on.
The server is compiled with the Erlang HiPE libraries on Ubuntu 14.04 LTS and runs on a 16-core, 64 GB memory host with the default configuration.
When a Ping Receiver/Requester starts, it authenticates against the RabbitMQ host and then waits for a message before starting to work.
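A minimal sketch of such a worker, using the RabbitMQ Java client of that era. The broker host, credentials, and queue name below are placeholders for illustration, not Datapath.io's actual configuration:

```java
import com.rabbitmq.client.*;

import java.io.IOException;

// Hypothetical Ping Requester worker: connect, authenticate, then block
// until the moderator delivers a ping request to work on.
public class PingWorker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq.example.com"); // placeholder broker host
        factory.setUsername("worker");           // placeholder credentials
        factory.setPassword("secret");

        Connection connection = factory.newConnection();
        final Channel channel = connection.createChannel();

        // Durable queue, so pending ping requests survive a broker restart.
        channel.queueDeclare("ping.requests", true, false, false, null);

        // Fetch one message at a time; acknowledge only after the work is done.
        channel.basicQos(1);
        channel.basicConsume("ping.requests", false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body)
                    throws IOException {
                String request = new String(body, "UTF-8");
                // ... run the ping measurement for the requested prefixes and
                //     publish the result to the ping-sink queue here ...
                channel.basicAck(envelope.getDeliveryTag(), false);
            }
        });
    }
}
```

Running this requires a reachable RabbitMQ broker and the `amqp-client` library on the classpath.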
The results of a ping process are sent to another queue, called the ping sink. The ping sink receives all the messages, transforms them into our data format, and stores them on the HDFS cluster. This is how Datapath.io utilizes Big Data.
What if the ping sink process crashes?
RabbitMQ is also capable of persistent queues: messages are kept in line until someone asks for them. It is also possible to add a Time-To-Live (TTL) to the messages, which makes them removable after a period of time. Think of a measuring station that loses its connection because of a network failure; without a TTL, its pending messages would flood the RabbitMQ server.
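As a sketch, a per-queue message TTL is set through the queue's declaration arguments. The 30-minute value below simply mirrors our measurement interval and is an illustrative choice; the resulting map would be passed as the last parameter of `channel.queueDeclare`:

```java
import java.util.HashMap;
import java.util.Map;

// Builds the declaration arguments for a queue whose messages expire
// automatically instead of piling up while a station is unreachable.
public class TtlQueueArgs {
    public static Map<String, Object> ttlArguments() {
        Map<String, Object> args = new HashMap<>();
        // RabbitMQ's per-queue message TTL, in milliseconds: messages older
        // than 30 minutes are dropped rather than delivered stale.
        args.put("x-message-ttl", 30 * 60 * 1000);
        return args;
    }

    public static void main(String[] args) {
        System.out.println(ttlArguments());
    }
}
```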
The biggest task was to improve the scalability of the message format we used to communicate. This is the classic producer-and-consumer problem.
If a producer, like the Ping Requester, produces messages faster than the consumer can process them, you have a severe problem: RabbitMQ begins to “throttle” the queue. Keep in mind that the work done in our consumer was already optimized. The obvious solution is as follows:
Scale the host of the consumer (horizontally or vertically). For Datapath.io, neither was an option: the side effects of horizontal scaling (creating more consumers) were too complex, and the cost of vertical scaling was too high.
So we chose a different solution. We started to buffer the messages on the sender side and built message packages of ten thousand. This avoided the per-message overhead of AMQP/RabbitMQ and reduced throttling to zero.
We now send 70 messages every 30 minutes to the ping requesters to cover all 635K network prefixes, instead of 635K single messages. That it is 70 instead of the expected 63 is due to the clearing algorithm of the buffer.
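Sender-side buffering can be sketched roughly like this. The class name and the tiny batch size are illustrative only; in production the package size is ten thousand:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of sender-side buffering: instead of publishing one AMQP
// message per network prefix, prefixes are collected and flushed as one
// package once the buffer is full.
public class MessageBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> packages = new ArrayList<>();

    public MessageBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    public void add(String prefix) {
        buffer.add(prefix);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Emit whatever is buffered as one package; in production this would be
    // a single channel.basicPublish call with the serialized package.
    public void flush() {
        if (!buffer.isEmpty()) {
            packages.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public List<List<String>> packages() {
        return packages;
    }
}
```

With 25 prefixes and a batch size of 10, `add` flushes two full packages and a final `flush` emits the remaining 5, giving 3 packages in total: the trailing partial package is why the real message count comes out slightly above the naive division.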
Also keep the serialization format in mind: pick one with little overhead, like a compact data format. The overhead of Java or String serialization may be too great for your scenario.
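To see that overhead for yourself, compare Java object serialization of a payload against its raw UTF-8 bytes. The payload string is an arbitrary example, not our actual message format:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;

// Compares the size of a Java-serialized String with the size of the same
// payload written as plain UTF-8 bytes.
public class SerializationOverhead {
    public static int javaSerializedSize(String payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(payload);
            }
            return bytes.size();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static int rawUtf8Size(String payload) {
        return payload.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String payload = "ping 203.0.113.0/24";
        System.out.println("java: " + javaSerializedSize(payload)
                + " bytes, raw: " + rawUtf8Size(payload) + " bytes");
    }
}
```

The object stream adds a fixed header plus type information to every payload; for small, frequent messages that fixed cost dominates.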
The throttling feature may seem to be a handicap at first, but it can actually be helpful: it assists in finding bottlenecks in the message flow of your microservice application.