BigData, lightning-fast / The road from SQL to BigData
BigData is one of the big software topics that everyone “gets”, but only a few can actually nail down to bare details. Possibly because, from its first mentioning back in the 1990s, BigData kept evolving along three main vectors:
- In terms of volume: it moved from 1 Tb in the early-1990s to 100s of Petabytes as of 2020;
- In terms of structure: from 100% structured data until 2008-2010, to (mostly) unstructured;
- In terms of velocity: real-time generation and processing for (at least) the last several years.
All these have a significant impact not only on software technologies but also on a business’s bottom line (e.g. one needs to keep an eye on the cost of data storage) and top line (e.g. maximize the value of BigData).
And since there are no two identical companies and requests sets, the solutions are almost always custom-made. Unique assessments and decisions need to be made on a case-by-case basis:
- Can the company keep using any of its currently deployed technologies?
- If not, what are the alternatives and how can one select from the many solutions on the market?
- How can you develop product specifications for a custom(/customised) software solution?
- With whom and how can you develop the software product?
To clear things up, let’s look into a specific usage case that came with not one, but two sets of very specific, and quite conflicting sets of requests involving BigData.
For a recent project involving large amounts of data, we have run into a couple of challenges:
- On one side, we needed real-time storage of information received in waves. This required a parent-child structure to support peaks beyond 10k records/second.
- On the other side, we had to search/retrieve a large quantity of information within a third-party app, store the data, then provide a quick mechanism to search for different criteria. A tree structure should be able to store billions of nodes while allowing for updates.
These are conflicting needs, and there’s definitely no solution on the market, that can accommodate both. That’s why we took some steps back; assessed the needs and created the theoretical frameworks; then adopted, expended, and implemented two different solutions.
The SQL approachSince we have a strong SQL background, our first approach was to store normalized data. We considered the storage space to be the most important factor of the ecosystem and expected the DB-engine to take care of everything else. We, therefore, started to define the first, second, third normal forms for stored data. Everything worked fine until we crossed into the millions-of-rows territory – at which point, the server started to struggle.
Size does matter / Dead body in the closetUsing a relational structure for such large quantities of data resulted (naturally) in slow responses for even the simplest queries. We, therefore, needed to define separate/parallel data structures, until we got stuck. Using index files just increased the DB size without adding more speed. Moreover, maintaining the indexes added latency to the insert event. After some time, collected data becomes irrelevant/useless, so we needed to dispose of it at regular intervals, in order to free-up the database. Unfortunately, doing that in a relational world and in all “additional” tables was way too complicated to execute and maintain.
Big Data first bloodTherefore, we started looking into several NoSQL, BigData solutions: some expensive, others with limitations on free versions. We needed continuity, stability, maturity (e.g. the solution should have previous versions/iterations), ideally with a backing community. Our first trial was with Mongo (due to its large community and the MySQL migration mechanisms), then moved to Crate.io. In the end, both offered limited solutions to our needs and were quite expensive for unlimited size. Then the question of BigData experts came up, and we were equally stuck: there are too many of them, on too many niches, each offering a solution that would change the initial request.
Killing flies with rockets / Simpler is safer (and quicker)Each solution’s footprint was too big for our needs and too difficult to maintain compared to what we actually needed (e.g. rapid storage of data). Therefore, we had to look into simpler and safer solutions. Since all the data are collected by our application, we could then transform it until it fits any structure. It was at this point in time when we took the biggest step back to build the conceptual framework, before choosing a solution: What are the needs (must-haves vs. nice-to-haves)? And what can we do internally, to fit any solution we choose?
Not the same for all? / Pigs don’t fly / Separate paths
Very quickly, we agreed that a single DB version will not solve our application’s main needs. Also, trying to use BD in an unappropriated way will reduce performance. We avoided any data coupling (between SQL and Bigdata) and tried to think Not SQL. For example, it does not matter if we store the same name a million times, it will just occupy more space but will speed up the search.
A database will never work well when used in a different way than it was designed to work. One can try workarounds, only with lost time and suboptimal results. A better approach is to model your thinking and/or business around dedicated tools, than doing workarounds.
Therefore, we have split the big job into two separate main tasks (and two separate databases), each with its own solution:
- One with rapid insertion, without taking a lot of space, but with ease of data deletion.
- The other with rapid search; anything else good to have later on (e.g. Elasticsearch allows for non-homogenous databases, i.e. data structure can be modified on the fly).
After a brief debating, our final choices where:
- Elasticsearch: capacity for tens of billions of entries, with high read-write volumes (5k-10k/sec). If needed, it is easily scalable by installing more instances.
- ClickHouse: could absolutely cover our needs for a maximum of 500 million entries, plus it offers compressed database info, and limited SQL syntax is possible.
ClickHouseClickHouse [link] is a Hybrid between BigData and SQL databases:
- each insert stores the data into a separate and unique file on disk
- then the engine returns to merge the data into proper partitions and delete the file;
- the merge is done internally, and new inserts do not interfere;
- deleting is not easy, but our particular usage case doesn’t need it.
Elasticsearch [link] is a framework and tool suite, best known for its BigData stocking and analysis solution. Their offering also includes various tools to ingest data, predict, visualize etc. We picked Elasticsearch because it solves our needs for quick search and storage.
- It saves non-heterogenous data on a continuous basis, with the option to control the speed.
- Option to retrieve the search result from a flat table in batches (i.e. first set of data are delivered as quickly as possible, then the engine searches for the next one).
- Possibility to define separate indexes (database equivalent) and perform global searches based on index name patterns. Each index will be searched independently on a separate thread.
Splitting the needs into two parts was the first step, but then the implementation involved several other learning points and “jumps” forward.
Each DB is accessed through an HTTP client using proper Java connectors. We will not add to the many “how to’s” already available online, but provide some hints on how we advanced through the project:
- Be aware of timeouts. Both engines have their own agenda, and the responses do not come as we like. Sometimes, when a timeout (error) occurs, we recommend waiting a bit before retrying.
- Search cookies or scroll IDs: we are in an HTTP world, where the results are limited and provided in batches. Always test whether there’s more to be retrieved.
- Allocate enough memory to the DB-engines; otherwise, you can hit a bottleneck that slows down the whole system.
- Avoid updates and deletes – these are sometimes done in an insert/delete way and will add extra-load. It is therefore better to save the already processed data instead of saving, then processing.
- Always check the version compatibility and control the updates. Architectural changes of the DB provider could significantly impact your application. (e.g. Elasticsearch Client -> High level rest Client)
- Use SSDs – otherwise, the classic SATA will slow you down with frequent data partitioning and manipulation.
Our applications are provided in Linux virtual machines and in the beginning, we let the update mechanism automatically download the latest version of each tool. We eventually stopped some of them in order to keep the versioning under control. This way, we avoided a lot of conflicts and incompatibilities.
Tools / Show me the moneyAs usual, these DBs come with minimal tools – so we looked into third parties (and enjoyed!). The main criteria were:
- either free or close to free;
- easy to use;
- no security liabilities.
- for ClickHouse, instead of the command-line client from the Linux console, we used a web client (link)
- for Elasticsearch we used a Chrome plugin (Elasticsearch Head) – it has a nice feature that allows defining queries, then translates them into proper JSON requests.
For our usage cases, we found Kibana and Grafana to be precisely what we needed:
- can be linked to the database;
- friendly interface;
- allow reporting/visualisation.
Due to our very specific needs and the close-knit community, we eventually got to contribute to the Grafana/Clickhouse community for two plugins (vertamedia-clickhouse-datasource and briangann-datatable-panel).
The future is SQL-ish
After going through the whole SQL-Elasticsearch-ClickHouse circle, here are some things that we would love to see happening:
- With a small effort, both DB solutions can get to support SQL users. For example, Clickhouse allows SQL interrogation between max two tables; while Elasticsearch has a tool that can run SQL commands (people rather understand SQL than BigData syntax) but does not allow any join (yet).
- Back logic: Elasticsearch has its own language (painless) that could be used just to calculate values in run time. Something like stored procedures or functions would also be nice; otherwise, data transformation could only be done by duplication. Considering the usual sizes (500Gb – 1Tb), it will double the footprint.