Apache Spark 2.0: what's new in the latest version


This open source distributed computing platform brings some interesting improvements for the community: it is reported to be up to 10 times faster than Spark 1.6, and it comes with a unified API for building multi-purpose applications that work with real-time data as well as with batch jobs and interactive queries.

22 Jul. 2016

This is not the first time we have written about Apache Spark at BBVAOpen4U. We explained its main features when it was still an emerging technology, and we reported on its success when it elbowed its way into the market for Big Data applications as the fastest, most powerful, most scalable and most sustainable option. The launch of version 2.0 confirms its warm reception among developers and the enormous possibilities it offers companies that use big data to gain a competitive advantage.

Apache Spark is an open source distributed computing platform with a very active community. It is faster and cheaper to implement and maintain than its predecessors in the Hadoop ecosystem, such as MapReduce. It is unified, it has an interactive console that is very convenient for developers, and it offers a powerful API for working with data. It is a strong option because it gives data engineers and data scientists a single tool that covers machine learning, graph computation, data streaming and interactive query processing, with the scalability needed to keep up with demand.

Apache Spark 2.0 comes with some interesting features that make it an even more powerful tool for working with Big Data:

Apache Spark 2.0, faster than Spark 1.x and MapReduce

When Apache Spark arrived, speed stood out as one of its essential advantages, because Spark works in memory rather than on disk. Caching the data makes working with it faster and more efficient. This applies not only to the original data but also to the results of later transformations. When the system needs the data again, it does not have to go back to disk; it simply reads it from the in-memory cache. Apache Spark is estimated to be up to 100 times faster than Apache Hadoop MapReduce in memory and up to 10 times faster on disk.
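The caching idea can be illustrated with a small pure-Python sketch. It does not use Spark's API: `LazyDataset`, `read_from_disk` and the `reads` counter are hypothetical names invented for this example, and the "disk" is simulated. It only shows why keeping a transformed dataset in memory avoids going back to the source.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the rows
        self._use_cache = False    # set by cache()
        self._cached = None        # in-memory copy, filled on first use

    def map(self, fn):
        # Transformations are lazy: nothing runs until collect() is called.
        return LazyDataset(lambda: [fn(row) for row in self.collect()])

    def cache(self):
        # Ask for the rows to be kept in memory once they have been computed.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached    # served from memory
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows


reads = {"count": 0}

def read_from_disk():
    # Simulates an expensive disk read and counts how often it happens.
    reads["count"] += 1
    return [1, 2, 3, 4]


source = LazyDataset(read_from_disk)
doubled = source.map(lambda x: x * 2).cache()

first = doubled.collect()     # computes from the source and stores the result
second = doubled.collect()    # reuses the in-memory copy, no second read
```

Without the call to `cache()`, every `collect()` would re-run the whole chain and hit the simulated disk again.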

Apache Spark 2.0 doesn't stop there. The new version increases data processing speed even further, again by relying on caching (this time the CPU's own cache) and on code generation at run time. The new version is estimated to be between 5 and 10 times faster than the Spark 1.x releases.

The APIs are unified in a single API

The great added value of Apache Spark is real-time processing and analysis of big data. What the developer community began to ask of the Spark maintainers was a leap forward: combining real-time data processing with other types of analysis, namely batch processing on the one hand and interactive querying on the other. Spark's second major version therefore has an API that lets developers build applications that combine real-time, interactive and batch components.

To work with this integrated Apache Spark 2.0 API, the development team must set up data storage with ETL (extract, transform, load) functions. Developers can then run web analytics through interactive queries within a session or, for example, apply machine learning to build models by training on historical sample data and then updating them with more recent information.

The DataFrame API and the Dataset API are unified in a single library, so developers have fewer concepts to learn. The unification applies to Java and Scala. The typed Dataset API is not available in Python or R, because these languages lack the compile-time type safety it relies on; there, the DataFrame remains the main programming interface.

Structured data streaming

The unified API includes a new high-level Structured Streaming API built on top of the Spark SQL engine. This is the feature that lets the same interactive queries run in batch mode (on static data in a database) or in streaming mode (querying the data in real time as it flows from the source to the database, for example to prepare reports or monitor specific information).
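The principle of writing a query once and running it on both static and streaming data can be sketched in plain Python. This is a conceptual illustration, not Spark's Structured Streaming API: `count_by_key`, the `page` field and the sample events are assumptions made up for the example.

```python
from collections import Counter


def count_by_key(events, state=None):
    """Add the events to a running count per page and return the updated state."""
    state = Counter() if state is None else state
    state.update(event["page"] for event in events)
    return state


events = [
    {"page": "home"}, {"page": "docs"}, {"page": "home"},
    {"page": "blog"}, {"page": "home"}, {"page": "docs"},
]

# Batch mode: the whole static table is processed in one pass.
batch_result = count_by_key(events)

# Streaming mode: the same query runs on each micro-batch as it arrives,
# carrying the running state forward instead of re-reading old data.
stream_state = None
for start in range(0, len(events), 2):
    micro_batch = events[start:start + 2]
    stream_state = count_by_key(micro_batch, stream_state)
```

Both paths end with the same counts, which is the property that lets one query definition serve both modes.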

The idea is that developers can program "continuous applications": applications that need a streaming engine but must also integrate it with batch processing, interactive querying, external storage systems and changes in business logic. In short, Apache Spark 2.0 makes it easier to program multi-purpose applications without having to use several different programming models. Juggling several models is a particular drawback when working with third-party systems or providers, such as MySQL or Amazon S3 (Simple Storage Service).

Spark as a compiler

The Spark project's maintainers have always been concerned with increasing its speed, even though it is already a tremendously fast technology. The reason is demand from the community itself, expressed in the periodic surveys held to improve the project. Spark 2.0 is reported to be up to 10 times faster than its predecessor, Spark 1.6, because its developers have stripped out non-essential work.

According to its maintainers, most of the CPU cycles in a data engine are spent on useless work, such as making virtual function calls or reading and writing intermediate data to the CPU cache or memory. Optimizing execution to avoid these wasted CPU cycles is a big step forward.

Spark 2.0 is based on the second generation of the Tungsten engine, which borrows the principles behind modern compilers and MPP (massively parallel processing) databases. How does it do this? It keeps intermediate data in CPU registers and eliminates virtual function calls altogether.
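The contrast between a classic operator-per-row engine and a query compiled into one loop can be sketched in pure Python. This is only a structural illustration under stated assumptions: the `Scan`, `Filter` and `Count` classes and `generate_fused_count` are hypothetical names, Spark itself generates JVM code rather than Python, and a Python local variable is not a CPU register. The sketch shows the shape of the idea: one specialised loop generated at run time instead of a chain of per-row method calls.

```python
class Scan:
    """Leaf operator: hands out one row per call."""
    def __init__(self, rows):
        self._it = iter(rows)

    def next_row(self):
        return next(self._it, None)   # None signals end of input


class Filter:
    """Pulls rows from its child until one passes the predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next_row(self):
        # One dynamically dispatched call per row, per operator.
        while True:
            row = self.child.next_row()
            if row is None or self.predicate(row):
                return row


class Count:
    """Root operator: counts the rows its child produces."""
    def __init__(self, child):
        self.child = child

    def run(self):
        total = 0
        while self.child.next_row() is not None:
            total += 1
        return total


def generate_fused_count(wanted):
    # Build the source of a single loop specialised to this query,
    # then compile it at run time.
    source = (
        "def fused_count(rows):\n"
        "    total = 0\n"
        "    for item_id in rows:\n"
        f"        if item_id == {wanted!r}:\n"
        "            total += 1\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["fused_count"]


rows = [1000, 7, 1000, 42, 1000]

# Operator-at-a-time execution: three objects calling each other per row.
volcano_result = Count(Filter(Scan(rows), lambda item_id: item_id == 1000)).run()

# Generated code: the whole query collapses into one tight loop.
fused_result = generate_fused_count(1000)(rows)
```

Both versions return the same count; the generated loop simply gets there without the per-row calls between operators.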
