When we decided to write our own time series database, we had high expectations and requirements for the tool, including:
- fast read/write access
- byte format on disk to save disk usage
- monitorable (logging and passive measurements)
- REST interface for CRUD operations on data
- no sharding
Creation of time series database
First off, we created the REST interface, which was written with Spark (not to be confused with the Apache Spark project) and looks and feels almost like a modern Node.js/Express app. To manipulate data in bulk, we added bulk requests, which reduce the HTTP overhead and save network time.
Going one step deeper, the persistence layer manages access to the database files that are mapped to the keys. The magic of this tool is provided by the keys, because, as well as describing the unique key for a particular time series, they also decide (in addition to an external database schema file) how the database files will be fragmented.
Fragmenting the files and mapping them in memory was one of the bigger challenges when developing this tool. If the fragmentation is too high, you end up with too many files, which is a problem for some filesystems. We tested other filesystems such as XFS, but the write performance did not scale linearly, so we ended up using EXT4. On the other hand, if the fragmentation is too low, the files become really big, which is inflexible for backup scenarios and results in a file header that is too big to be written. We will come back to the header later.
When it actually comes to writing files and data to disk, you have to decide how to manage the “pointer” to your file. At first, we defined our own byte format by using the “dataformat” library mentioned in one of my last blog posts:
This would be just a single (unsigned) value stored in 7 bytes on disk.
Then you could use the putNumber method to write a single value to the entity at a given “offset”. To do so, you need to calculate the offset, which is:
offset(int) = (valueTimestampInSeconds - databaseStartTimestampInSeconds) / databaseIntervalInSeconds
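The offset formula can be sketched in Java roughly as follows. This is only an illustration: the class name `SlotOffset` and the argument checks are assumptions and not part of the original tool. It relies on integer division, so every timestamp that falls within the same interval maps to the same slot.

```java
final class SlotOffset {
    private SlotOffset() {}

    /**
     * Slot index of a value: (value - start) / interval, all in seconds.
     * Integer division rounds down, so timestamps inside one interval share a slot.
     */
    static int offset(long valueTimestampInSeconds,
                      long databaseStartTimestampInSeconds,
                      long databaseIntervalInSeconds) {
        if (databaseIntervalInSeconds <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (valueTimestampInSeconds < databaseStartTimestampInSeconds) {
            throw new IllegalArgumentException("value is older than the database start");
        }
        long slots = (valueTimestampInSeconds - databaseStartTimestampInSeconds)
                / databaseIntervalInSeconds;
        return Math.toIntExact(slots); // fails loudly instead of silently overflowing
    }

    public static void main(String[] args) {
        // 900 seconds after the start with a 300 second interval: slot 3
        System.out.println(offset(1_500_000_900L, 1_500_000_000L, 300L));
    }
}
```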
After you have written some values, you need to write them to disk, which is obviously the bottleneck in this scenario. It is better to do a seek operation on an open file handle (in Java NIO, this is SeekableByteChannel.position) than to rewrite the whole file. Since the boom of solid-state disks, the cost of a seek operation has collapsed, which makes it more like a lookup in a Map.
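A minimal sketch of such a seek-and-write with Java NIO could look like this. It does not use the “dataformat” library; the class `SlotFile`, its helper methods, and the big-endian byte order are assumptions made purely for illustration. Only the 7-byte value width and the use of `SeekableByteChannel.position` come from the text.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SlotFile {
    static final int VALUE_BYTES = 7; // one unsigned value occupies 7 bytes on disk

    private SlotFile() {}

    /** Seeks to the slot and overwrites only its 7 bytes (big-endian is an assumption). */
    static void writeValue(SeekableByteChannel channel, int offset, long value) throws IOException {
        if (value < 0 || value >= (1L << (8 * VALUE_BYTES))) {
            throw new IllegalArgumentException("value does not fit into 7 unsigned bytes");
        }
        ByteBuffer buf = ByteBuffer.allocate(VALUE_BYTES);
        for (int i = VALUE_BYTES - 1; i >= 0; i--) {
            buf.put((byte) (value >>> (8 * i)));
        }
        buf.flip();
        channel.position((long) offset * VALUE_BYTES); // the seek
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
    }

    /** Seeks to the slot and reads its 7 bytes back into a long. */
    static long readValue(SeekableByteChannel channel, int offset) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(VALUE_BYTES);
        channel.position((long) offset * VALUE_BYTES);
        while (buf.hasRemaining()) {
            if (channel.read(buf) < 0) {
                throw new EOFException("slot " + offset + " is beyond the end of the file");
            }
        }
        buf.flip();
        long value = 0;
        while (buf.hasRemaining()) {
            value = (value << 8) | (buf.get() & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("tsdb", ".bin");
        try (SeekableByteChannel ch = Files.newByteChannel(
                file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            writeValue(ch, 3, 42L);
            System.out.println(readValue(ch, 3));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Keeping the channel open between writes is the point of the exercise: each update touches only the bytes of one slot instead of the whole file.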
We used this method for storing traffic-per-customer consumption. That use case requires only a few keys, which is not comparable to storing the latency measurements. We tried to reapply the method to this more intensive use case but ended up with millions of files on disk, which was inefficient.
Aggregating the keys
A better solution was to aggregate “keyspaces” (multiple keys) in a logical archive on disk so that all write/read operations in a keyspace work on the same logical archive, and so the number of files required for a single scenario is reduced from millions to one.
As an example of an unaggregated key-form consider the following key and files:
Timeseriesx – Key: e.g.: rtt:aws:eu-central-1:cogent:188.8.131.52/16:median
So for each “fragment” representing a time period, a separate file is created for every time series in that time period.
Aggregating the keys for a time period in one file looks like this:
TimeseriesArchive x – Key: e.g.: rtt:aws:eu-central-1:cogent – Time: Start = Date(n), End = Date(m)
That means that a key is separated into a subpart and the main part. All values for a given main part of a key and a time period will be stored in the same file. The subpart of the key is used to determine where in the file the time series will be.
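As a sketch of that separation, the following splits a colon-delimited key into its main part and subpart. The class `ArchiveKey` and the `mainSegments` parameter are hypothetical; how the real tool decides where the main part ends (presumably via the database schema file) is not shown in this post.

```java
final class ArchiveKey {
    final String mainPart; // selects the archive file
    final String subPart;  // selects the position inside that file

    private ArchiveKey(String mainPart, String subPart) {
        this.mainPart = mainPart;
        this.subPart = subPart;
    }

    /** Splits after the first mainSegments colon-separated segments (assumed rule). */
    static ArchiveKey split(String key, int mainSegments) {
        if (mainSegments < 1) {
            throw new IllegalArgumentException("mainSegments must be at least 1");
        }
        int idx = -1;
        for (int i = 0; i < mainSegments; i++) {
            idx = key.indexOf(':', idx + 1);
            if (idx < 0) {
                throw new IllegalArgumentException("key has too few segments: " + key);
            }
        }
        return new ArchiveKey(key.substring(0, idx), key.substring(idx + 1));
    }

    public static void main(String[] args) {
        ArchiveKey k = split("rtt:aws:eu-central-1:cogent:188.8.131.52/16:median", 4);
        System.out.println(k.mainPart); // rtt:aws:eu-central-1:cogent
        System.out.println(k.subPart);  // 188.8.131.52/16:median
    }
}
```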
In the case of measuring latency for all the network prefixes from a given transit link (“cogent” in this example), the number of files created is reduced from 650K to 1.
Now our header contains a simple Map<String,Integer> which denotes the offset in the file where the subkey is. As the number of values per block is given at creation time (there is one value per block in the above example), the block (and the file itself) will always have the same size.
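To illustrate how such a header might be used, here is a rough sketch that assigns each subkey a fixed-size block and resolves the byte position of a single value. The class `ArchiveHeader`, the sample subkeys, and the choice to count offsets from the start of the data section (after the header) are all assumptions; the real on-disk layout is not described in this post.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ArchiveHeader {
    static final int VALUE_BYTES = 7; // width of one value on disk

    private ArchiveHeader() {}

    /** Gives every subkey one block of valuesPerBlock values, laid out back to back. */
    static Map<String, Integer> buildOffsets(List<String> subKeys, int valuesPerBlock) {
        if (valuesPerBlock < 1) {
            throw new IllegalArgumentException("valuesPerBlock must be at least 1");
        }
        int blockBytes = Math.multiplyExact(valuesPerBlock, VALUE_BYTES);
        Map<String, Integer> offsets = new LinkedHashMap<>();
        int next = 0;
        for (String subKey : subKeys) {
            if (offsets.putIfAbsent(subKey, next) != null) {
                throw new IllegalArgumentException("duplicate subkey: " + subKey);
            }
            next = Math.addExact(next, blockBytes); // every block has the same size
        }
        return offsets;
    }

    /** Byte position of one value, relative to the start of the data section. */
    static long valuePosition(Map<String, Integer> offsets, String subKey, int slot) {
        Integer blockStart = offsets.get(subKey);
        if (blockStart == null) {
            throw new IllegalArgumentException("unknown subkey: " + subKey);
        }
        return blockStart + (long) slot * VALUE_BYTES;
    }

    public static void main(String[] args) {
        // Hypothetical subkeys, one value per block as in the example above.
        List<String> subKeys = Arrays.asList("10.0.0.0/16:median", "10.1.0.0/16:median");
        Map<String, Integer> header = buildOffsets(subKeys, 1);
        System.out.println(header);
        System.out.println(valuePosition(header, "10.1.0.0/16:median", 0));
    }
}
```

Because the block size is fixed at creation time, the file size follows directly from the number of subkeys, and a lookup costs one map access plus one multiplication.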
Come back for the third part of this blog series.