Pros and cons of Python and R for data science

Python and R are the two most widely used languages for data science, that is, for mining and visualizing complex data. R is a powerful language; Python is versatile and easy to learn. But programmers are not unanimous in their praise.

BBVAOpen4U | 09 Aug. 2016

Data science has become one of the fastest-growing fields in companies. Certain uses of data, such as visualizing complex information to draw conclusions and improve business decision-making, offer a wealth of professional opportunities. In this field the programming languages Python and R are ahead of the pack. If you have statistical information and you want to understand it, these languages can help you.

R is an open-source programming language developed in the early 1990s by Ross Ihaka and Robert Gentleman, and released as free software in 1995, to improve on the visualization and data analysis features of an earlier language, S. R is an evolution of S. Today many of the users and professionals who work with R come from the world of statistics or mathematics, applied to sectors such as healthcare. Other professionals are slowly beginning to see the advantages of R for understanding complex information and improving decision-making.

Python has its origins in 1991, when Guido van Rossum created it as an agile, simple programming language with a gentle learning curve. This is a great advantage for growing the use of the language internationally. It was not originally aimed at professionals from the world of statistics, but its features have significantly broadened Python's field of use: it is now used to analyze and chart big data. More and more companies are adding Python programmers to both their back-end and front-end teams.

Python is a highly intuitive general-purpose language: any developer who takes a little time to learn it can then create innovative products that make good use of its great flexibility. This makes it a fun and versatile language for programmers.

The greatest benefits of Python and R

●      A large number of repositories on GitHub:

GitHub is one of the world's largest collaborative development platforms. Most programmers use it to reuse open-source projects in their own initiatives. This makes it an effective thermometer for measuring the community behind each programming language and the muscle that keeps each language evolving.

GitHub hosts over 43,000 R repositories and more than 91,000 Python repositories. Both trail the most active languages: over 230,000 repositories in JavaScript, 196,000 in Ruby and 162,000 in Java. These are followed by other well-known languages such as HTML, PHP, CSS, C++ and C#.

GitHub is not the only reference that developers and companies use to compare the value of programming languages when embarking on projects. Once a year, Dice conducts a survey that rates programming languages, technology profiles, their salaries and more. In the most recent study, the ‘2015-2016 Dice Tech Salary Survey’, Python, and especially R, are favorably placed in the salary rankings: 126,249 dollars a year for professionals with knowledge of R, and 109,782 dollars for Python.

●      R and Python packages for data science

Both R and Python have several packages focused on data visualization. In the case of R, one of the most commonly used is ggplot2, a library for drawing bar, point, line, area, map and scale charts. ggplot2 depends on other packages that need to be downloaded and installed, such as itertools, iterators, reshape, proto, plyr, RColorBrewer, digest and colorspace. Another package that R programmers use to visualize big data is rgl, which enables the creation of 3D graphics in real time. Most R packages can be found at RDocumentation.org, which lists more than 11,000 packages, with over 54,000 versions and 24,000 collaborators.

There are also several packages available for representing data in Python. matplotlib is one of the most widely used in data science; it handles all kinds of graphics (bar charts, scatter charts, line charts, maps with Basemap and 3D plots with mplot3d…) with very little code. Seaborn, another Python library built on matplotlib, lets scientists create explanatory graphs from highly complex data. Python's packages are also gradually catching up with the visualization resources available to R users.
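To give a sense of what "very little code" means, here is a minimal sketch of a matplotlib bar chart. It assumes matplotlib is installed; the figures are the approximate repository counts quoted earlier in this article, and the function name `plot_repo_counts` and the output file name are hypothetical choices made for this illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, so the script also runs without a display
import matplotlib.pyplot as plt

# Approximate GitHub repository counts quoted in this article (lower bounds).
repo_counts = {
    "JavaScript": 230_000,
    "Ruby": 196_000,
    "Java": 162_000,
    "Python": 91_000,
    "R": 43_000,
}


def plot_repo_counts(counts, path="repo_counts.png"):
    """Draw a bar chart of repository counts, save it to `path`,
    and return the number of bars drawn."""
    fig, ax = plt.subplots(figsize=(6, 4))
    bars = ax.bar(list(counts.keys()), list(counts.values()), color="steelblue")
    ax.set_ylabel("Repositories")
    ax.set_title("GitHub repositories by language")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)  # release the figure's memory
    return len(bars)


if __name__ == "__main__":
    plot_repo_counts(repo_counts)
```

Swapping `ax.bar` for `ax.scatter` or `ax.plot` would produce a scatter or line chart from the same data with no other changes.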

Python and R also have their drawbacks

●      R is a slow programming language: this tends to be one of the recurrent complaints from developers when asked about the drawbacks of programming with R. There's always someone who says the language is slow. Although almost everyone acknowledges this, some programmers attribute the lack of speed to the fact that many of the packages used to add features are not developed in R, but in other languages such as Fortran and C++. And this takes its toll.

There is also a general consensus that Python is faster, in particular because the language has more tools available for speeding it up.
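The article does not say which tools it has in mind. One possible reading is that Python lets you hand hot loops to routines implemented in C. The sketch below assumes the standard CPython interpreter and uses only the standard library: it times a hand-written loop against the built-in `sum`. The helper name `loop_sum` is hypothetical, and the timings will vary by machine, so none are quoted here.

```python
import timeit


def loop_sum(values):
    """Add up the values with an explicit Python-level loop."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(200_000))

# Both approaches must agree before the timings mean anything.
assert loop_sum(values) == sum(values)

# Time five runs of each approach.
loop_time = timeit.timeit(lambda: loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)

print(f"explicit loop: {loop_time:.4f} s")
print(f"built-in sum:  {builtin_time:.4f} s")
```

Libraries such as NumPy apply the same idea to whole arrays, which is a large part of why Python is considered practical for numerical work.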

●      R is an erratic tool for machine learning projects: some developers prefer Python, and particularly its scikit-learn library, a simple and effective tool for data mining and data analysis that makes it easy to reuse code between projects and has all the benefits of being an open-source library (BSD license).
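As an illustration of the code reuse mentioned above, here is a minimal sketch that assumes scikit-learn is installed. The helper names `build_model` and `evaluate` are hypothetical, and the bundled iris dataset is only a stand-in for real project data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_model():
    """Return a reusable pipeline: feature scaling followed by a classifier."""
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


def evaluate(model, X, y, seed=0):
    """Fit the model on a training split and return accuracy on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Example data: the iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)
accuracy = evaluate(build_model(), X, y)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Because every scikit-learn estimator exposes the same `fit` and `score` methods, `build_model` and `evaluate` can be carried over to another project unchanged; only the data loading differs.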

●      Python doesn't have good documentation: some programmers complain about the lack of good documentation for Python, particularly compared to other programming languages like PHP and Java. It has other drawbacks as well; this analysis by datafull.co is a good summary.