Three solid real-time Big Data alternatives: Spark, Storm and DataTorrent RTS

The arrival of tools for the real-time analysis of Big Data has brought with it many advantages for companies that need to deal with the constant mass entry of data and extract real value from this flow of information.
BBVAOpen4U | 07 Aug. 2015

Data, data, data. Value, value, value. And if possible, in real time. The concept of real-time business intelligence has been on the market for some time, but until very recently only a limited number of companies used it. Today, Hadoop's stability makes it the most commonly used platform for analyzing large volumes of data, but when streaming calculations are needed, solutions such as Spark, Storm or DataTorrent RTS are a great choice.

Real-time analysis used to have little market penetration, for two main reasons: first, there were no real-time business intelligence tools; second, the solutions that did exist were geared only to batch data analysis and were expensive. Spark, Storm and DataTorrent RTS provide a solution to these two problems.

1. Apache Spark

Apache Spark is undoubtedly the great new star of Big Data analytics. It is an open-source platform for processing data in real time, and applications for it can be written in four languages: Scala, the language in which the platform itself is written; Python; R; and Java. The idea of Spark is to handle constant data input at speeds far above those offered by Hadoop MapReduce.

Some of its key features are:

- Speed of calculation in memory and on disk: the Spark project claims processing up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk.

- Execution on many types of platforms: Spark can run on Hadoop, Apache Mesos and EC2, in standalone cluster mode or in the cloud. In addition, Spark can access numerous data sources such as HDFS, Cassandra, HBase or S3, Amazon's object storage service.

- It incorporates a package of very useful tools for developers: the MLlib library for implementing machine learning solutions and GraphX, Spark's API for graph computation.

- It has other interesting tools: Spark Streaming, which processes live data streams across the cluster, and Spark SQL, which makes it easier to query the data using SQL.
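Spark Streaming handles a live stream by cutting it into small batches and processing each batch in turn. The following is a minimal sketch of that micro-batch idea in plain Python, not Spark's API: the function names are hypothetical, and batches are cut by item count rather than by time interval so that the example stays deterministic.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly endless) iterator into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_word_counts(lines, batch_size=2):
    """Yield the cumulative word counts after each micro-batch."""
    totals = Counter()  # state kept in memory between batches
    for batch in micro_batches(lines, batch_size):
        for line in batch:
            totals.update(line.lower().split())
        yield dict(totals)


if __name__ == "__main__":
    incoming = ["spark storm", "spark", "storm storm", "datatorrent"]
    for snapshot in running_word_counts(incoming):
        print(snapshot)
```

Keeping the running totals in memory between batches is what makes each new batch cheap to process: only the new lines are read, never the whole history.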

2. Apache Storm

Apache Storm is an open-source distributed real-time computation system. It allows the simple and reliable processing of large volumes of data for real-time analytics (for example, for the continuous study of information from social networks), distributed RPC and ETL processes.

While Hadoop carries out batch data processing, Storm does it in real time. In Hadoop the data are entered in a file system (HDFS) and then distributed through the nodes to be processed. When the task is complete, the information returns from the nodes to HDFS to be used. In Storm there is no job with a beginning and an end: the system is based on the construction of Big Data topologies, in which data are transformed and analyzed continuously as they arrive.
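The difference from a batch job can be sketched with plain Python generators. In Storm's vocabulary, a topology connects spouts (data sources) to bolts (processing steps). The sketch below only imitates that shape and is not Storm's API: the function names are hypothetical, and everything runs in a single process instead of being distributed across a cluster.

```python
def sentence_spout(sentences):
    """Source of the topology: emits one tuple (here, a sentence) at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(sentences):
    """First bolt: splits each sentence into words as soon as it arrives."""
    for sentence in sentences:
        for word in sentence.split():
            yield word.lower()


def count_bolt(words):
    """Second bolt: keeps a running count and emits an update per word."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def build_topology(source):
    """Wire spout -> split -> count; nothing runs until results are consumed."""
    return count_bolt(split_bolt(sentence_spout(source)))


if __name__ == "__main__":
    for word, count in build_topology(["Storm streams data", "data never stops"]):
        print(word, count)
```

Because each stage pulls one item at a time, the same pipeline works on an endless source: results appear as data arrive, and there is no final step that waits for the input to finish.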

That is why Storm is something more than a system of Big Data analytics: it is a system for Complex Event Processing (CEP). This type of solution allows companies to respond to the arrival of sudden and continuous data (information collected in real time by sensors, millions of comments generated on social networks such as Twitter, WhatsApp and Facebook, bank transfers…).
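As an illustration of the kind of rule an event processing system evaluates, here is a minimal sketch that flags an account when it produces several events within a short sliding window, for instance a burst of bank transfers. It is plain Python rather than Storm code, and the function name, threshold and window length are assumptions chosen for the example.

```python
from collections import deque


def detect_bursts(events, threshold=3, window_seconds=10):
    """Yield (account, timestamp) whenever an account reaches `threshold`
    events within the last `window_seconds`. Events are (timestamp, account)
    pairs and are assumed to arrive in timestamp order."""
    recent = {}  # account -> deque of timestamps still inside the window
    for timestamp, account in events:
        window = recent.setdefault(account, deque())
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window[0] <= timestamp - window_seconds:
            window.popleft()
        if len(window) >= threshold:
            yield account, timestamp


if __name__ == "__main__":
    transfers = [(0, "A"), (2, "B"), (4, "A"), (7, "A"), (30, "A"), (31, "B")]
    for account, when in detect_bursts(transfers):
        print(f"burst on account {account} at t={when}")
```

The rule is evaluated on every incoming event, so an alert can be raised the moment the pattern completes instead of after a nightly batch run.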

It is also of particular interest for developers for a number of reasons:

- It can be used with various programming languages. Storm has been developed in Clojure, a dialect of Lisp that runs on the Java Virtual Machine (JVM). Its great strength is that it offers compatibility with components and applications written in various languages such as Java, C#, Python, Scala, Perl and PHP.

- It is scalable.

- It is fault-tolerant.

- It is easy to install and operate.

3. DataTorrent RTS

DataTorrent RTS is an open-source solution for the batch or real-time processing and analysis of big data. It is an all-in-one tool that aims to go beyond not only what can be done in the Hadoop MapReduce environment, but also what Spark and Storm already offer in terms of performance. According to the company, the platform is capable of processing billions of events per second and of recovering from node outages with no data loss and no human intervention.

Some of its key features include:

- Guaranteed event processing.

- High in-memory performance.

- It is scalable.

- Fault-tolerance at platform level.

- Easy to execute.

- Applications programmed in Java.
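One common way to combine guaranteed processing with fault tolerance is to checkpoint an operator's state periodically and, after a failure, restore the last checkpoint and replay the events that followed it. The sketch below shows that general idea in plain Python. It is not DataTorrent's API or its actual recovery mechanism: the class and function names are hypothetical, the crash is simulated, and the input is assumed to be a replayable log.

```python
import copy


class CountingOperator:
    """Keeps a running total per key; state can be checkpointed and restored."""

    def __init__(self):
        self.totals = {}
        self.next_offset = 0  # position of the next event to read

    def process(self, event):
        key, amount = event
        self.totals[key] = self.totals.get(key, 0) + amount
        self.next_offset += 1

    def checkpoint(self):
        return copy.deepcopy({"totals": self.totals, "next_offset": self.next_offset})

    @classmethod
    def restore(cls, snapshot):
        operator = cls()
        operator.totals = dict(snapshot["totals"])
        operator.next_offset = snapshot["next_offset"]
        return operator


def run(log, checkpoint_every=2, crash_at=None):
    """Process a replayable log, optionally simulating one crash at offset `crash_at`."""
    operator = CountingOperator()
    snapshot = operator.checkpoint()
    crashed = False
    while operator.next_offset < len(log):
        if crash_at is not None and not crashed and operator.next_offset == crash_at:
            crashed = True
            # State since the last checkpoint is lost; resume from the snapshot
            # and re-read the log from the offset stored in it.
            operator = CountingOperator.restore(snapshot)
            continue
        operator.process(log[operator.next_offset])
        if operator.next_offset % checkpoint_every == 0:
            snapshot = operator.checkpoint()
    return operator.totals


if __name__ == "__main__":
    events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
    print(run(events))
    print(run(events, crash_at=3))
```

Storing the read offset inside the checkpoint is the key design choice here: state and position are restored together, so replayed events are counted exactly once in the final totals.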

This Big Data solution provides mechanisms for ingesting data from many different sources, either directly from external databases or through integration with native corporate applications. DataTorrent RTS provides technical teams with a set of pre-built connectors for SQL and NoSQL databases, Apache Sqoop, Apache Kafka, Apache Flume and social networks such as Twitter… Anything that generates data.

At the end of the day, these Big Data tools allow companies to discover where their real business opportunities lie, cutting study and analysis times and reducing costs. It is a battle fought with real-time and predictive models to become more competitive and win the game against rivals.