Everyone is talking about Big Data!
But what is it?
You will find hundreds of definitions of this term, and even more scenarios for how to use it.
Yes, Datapath.io does Big Data as well. Datapath.io scans, statistically speaking, the whole Internet. This is big data network optimization. From this point of view, the Internet is a collection of network prefixes and IP address ranges.
Currently, the Internet is divided into 626,243 prefixes, i.e. IP ranges or networks. We scan these networks every half hour.
On top of this, we do full scans from different vantage points, or measuring stations, as we like to say. We need different measuring stations to give every potential customer the best view from their hosting site to the Internet.
Currently, we are measuring from more than 55 measuring stations. Altogether (prefixes × measuring stations × 2 per hour) we come up to 68,750,000 data rows per hour. This is an optimistic upper bound: not every prefix is going to answer the ping request.
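In round numbers, that is 625,000 prefixes × 55 measuring stations × 2 measurements per hour = 68,750,000 rows per hour.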
Why We Need Big Data
So, what are we measuring?
We send standard pings to specific hosts in the target prefixes.
What do we store for our statistics?
When the target host replies to a ping, we store the target prefix and the round-trip time.
Don’t you need more information for statistics in a time-series context?
How about an identifier for the measuring station used to ping the target prefix?
And how about a timestamp, one of the essential keys when working with time-series data? Don’t you need that?
Of course we need them!
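As a toy illustration, a minimal sketch of one such measurement might look like this. The addresses and names are example values, not our production prober:

import java.net.InetAddress;

// Sketch: probe one representative host of a target prefix and collect
// the four fields discussed above. Example values, not production code.
public class PingProbe {
    public static void main(String[] args) throws Exception {
        String stationIp = "198.51.100.7";    // identifier of this measuring station (example)
        String targetHost = "192.0.2.1";      // representative host in the target prefix (example)
        String targetPrefix = "192.0.2.0/24"; // the prefix this host stands for (example)

        InetAddress address = InetAddress.getByName(targetHost);
        long start = System.nanoTime();
        // isReachable() uses an ICMP echo request if permitted,
        // otherwise a TCP connection to port 7 (echo).
        boolean reachable = address.isReachable(2000); // 2 second timeout
        long rttMillis = (System.nanoTime() - start) / 1_000_000;
        long timestamp = System.currentTimeMillis(); // when the ping finished

        if (reachable) {
            // These four fields are what ends up in one data row.
            System.out.printf("%s rtt=%dms station=%s ts=%d%n",
                    targetPrefix, rttMillis, stationIp, timestamp);
        }
    }
}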
Coming back to the calculation above, we should multiply the number of rows by the length all fields of a row would occupy on an HDD. A prefix is a 4-byte integer plus a prefix length of at least 1 byte.
Then we need the round-trip time as a floating-point value, say 2 bytes for the integer part and 2 bytes for the fractional part. Then we need an identifier for the measuring station used to ping the target prefix; this is another 4 bytes for the source IP of the measuring station. Finally, a timestamp for when the ping finished: 8 bytes for the epoch. All together, 21 bytes.
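To make those 21 bytes concrete, here is a minimal sketch of such a record layout. The class and field names are illustrative assumptions, not the actual dataformat library:

import java.nio.ByteBuffer;

// Sketch of the 21-byte record described above:
// 4 bytes prefix + 1 byte prefix length + 2 + 2 bytes RTT
// + 4 bytes station IP + 8 bytes epoch timestamp = 21 bytes.
// Class and field names are illustrative, not the actual library.
public class MeasurementRecord {
    static final int RECORD_SIZE = 21;

    static byte[] encode(int prefix, byte prefixLength,
                         short rttInt, short rttFrac,
                         int stationIp, long epoch) {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_SIZE);
        buf.putInt(prefix);       // target prefix as a 4-byte integer
        buf.put(prefixLength);    // prefix length, e.g. 24 for a /24
        buf.putShort(rttInt);     // round-trip time, integer part
        buf.putShort(rttFrac);    // round-trip time, fractional part
        buf.putInt(stationIp);    // source IP of the measuring station
        buf.putLong(epoch);       // when the ping finished
        return buf.array();
    }
}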
If you multiply that by the number of rows (68,750,000 × 21 bytes = 1,443,750,000 bytes), you get roughly 1.44 gigabytes of information per hour.
And then there is the most important requirement: HOW do we store all that?
Our Big Data Solution
We looked for the easiest solution: a tool that could already store a massive amount of data and that was already well tested. That is how we chose the NoSQL database Cassandra. We wanted to use up-to-date technology, and we also wanted to do fancy calculations with MapReduce.
After a short time of happiness, we noticed how limited we were when querying data. We also struggled with the initial barrier to using and administering Cassandra in our DevOps strategy.
For a startup, it is crucial to have all the needed knowledge in-house. But sometimes it is impossible to hire expensive experts for technologies like Cassandra.
Another major point is that with a database like Cassandra, we end up storing redundant data. Why would you store the primary key (e.g. the source IP of the measuring station) with every new row? That doesn’t seem efficient.
Finally, we ended up building our own “storage system”. We don’t need random or on-demand access to the data. Thus, we used an in-house library to code our own file format, putting the data in flat files.
The only change to data access was integrating the primary key into the directory path of the file. This is how we eliminated the redundant storage. To divide the time series by timestamp, we added the timestamp to the filename itself, like a “partition”.
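For illustration, a sketch of how the key can move into the path. The base directory and naming scheme here are assumptions, not our actual layout:

import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: the primary key (the station's source IP) becomes a directory,
// and the timestamp becomes the filename, so neither is repeated per row.
// The base directory and naming scheme are illustrative assumptions.
public class FlatFileLayout {
    static Path fileFor(String stationIp, long epochSeconds) {
        return Paths.get("/data", stationIp, epochSeconds + ".dat");
    }

    public static void main(String[] args) {
        // e.g. /data/198.51.100.7/1461600000.dat
        System.out.println(fileFor("198.51.100.7", 1461600000L));
    }
}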
That solution is currently in production and fits perfectly with our Apache Spark jobs, which is what we use to do our calculations over Big Data. It reduced the amount of redundantly stored data to zero. We have complete control over the files and can use simple mechanisms for backups and storage, such as the Hadoop file system (HDFS), which we use as the backend for the storage system.
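As a rough sketch of how such flat files can feed a Spark job, assuming the 21-byte records and the example HDFS layout from above (this is not our actual job code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

// Sketch: read the binary flat files from HDFS and count the
// measurements per file. The path and the 21-byte record size are
// the illustrative assumptions from above.
public class MeasurementCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("measurement-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.binaryFiles("hdfs:///data/*/*.dat")
              .mapValues(PortableDataStream::toArray)
              .mapValues(bytes -> bytes.length / 21) // records per file
              .collect()
              .forEach(p -> System.out.println(p._1() + ": " + p._2() + " rows"));
        }
    }
}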
In conclusion, you can take a look at the “dataformat” library we use to store the data, which serializes POJOs to a binary flat file on disk. It is available on GitHub.
To learn more about Datapath.io projects, come see us on GitHub.