For several years now, many Data Scientists have turned to command line “language” tools such as R and Python to deal with Big Data. But can you really carry out a whole Data Science project armed with these two technologies alone?
The evolution of Data Science tools
Looking back on the evolution of what is known today as Data Science (a hybrid field covering statistics, machine learning and artificial intelligence), we can see a constant alternation between “command line” tools and GUI (graphical user interface) tools.
The introduction of a new technology generally makes existing, sophisticated GUI tools obsolete. This leaves only command line technology, with relatively basic software in terms of human-machine interaction, until vendors develop a new interface compatible with the new technology, which can sometimes take years.
Big Data has been no exception: it caused difficulties for much of the existing Data Mining software on the market, eventually forcing Data Scientists to fall back on basic command line tools to address the challenge it posed.
The advent of open source
The latest GUI/command line alternation coincided with the massive rise of open source in the Data Science field, and it is no accident that all of the above-mentioned technologies are open source. We owe this phenomenon mainly to the world of academia, since universities increasingly prioritise open source for reasons of cost or vendor independence.
It is also undeniable that open source use was boosted by the online course and MOOC phenomenon which, driven by new specialised players such as Coursera, Udemy and Udacity, has taken Data Science by storm in recent years.
Many of today’s Data Scientists are in fact statisticians and “data miners” who have upskilled thanks to MOOCs and a healthy dose of personal motivation.
In short, open source, whether through universities or MOOCs, has become a fixture of Data Science in recent years.
The unstoppable rise of R and Python
The above brings us to the article’s key question: “When it comes to Data Science, can you do everything with R and Python?”
R and Python are not new languages. R is itself a derivative of the S language, developed by John Chambers at Bell Laboratories in 1975. Python, for its part, was created in the late 1980s by Guido van Rossum at the National Research Institute for Mathematics and Computer Science in Amsterdam.
Many Data Scientists who trained mainly in one of these two environments ask us whether an entire Data Science project can be done in R or Python, and our customers often ask the same question.
R or Python?
Put like this, the question may seem strange, especially since Data Scientists have generally chosen their camp: either R or Python, and very rarely both.
However, it must be made clear right away that using R or Python as a bare programming language, without any external libraries, will not take you very far in Data Science. Both languages rely on external libraries for most of the necessary algorithms. There are many of these essential libraries in both languages but they are, of course, not interchangeable from one language to the other. Moreover, some of them are tied to your Big Data work environment, such as Spark’s MLlib.
The aim of this article is not to launch an R versus Python debate, although we will keep in mind that, in very general terms and with exceptions, R seems a bit more suited to data mining and statistics, whilst Python seems to be better at standardising algorithms or models, especially in Big Data environments.
Whichever language you choose, you will generally work in an IDE (integrated development environment). An IDE contains at least one interpreter, which is essential for running the language, and often an integrated, specialised text editor with syntax highlighting.
Of course, nothing prevents you from working with a traditional text editor, or even several complementary tools, but the current trend is towards enriching these IDEs with the add-ons required for Data Science. Simple examples are the Anaconda distribution for Python and RStudio for the R language.
IDE, the new basic Data Science tool
This trend of enhancing IDEs with features that make the language easier to use has not escaped the attention of the sector’s major software vendors, some of whom have already jumped on the bandwagon. Indeed, the most recent versions of Data Science software that we have been able to test seem to be nothing more than souped-up IDEs.
This is particularly apparent in the latest releases from world-renowned vendors in the field of Data Science: SAS Viya, IBM Watson DSX, KNIME, Microsoft Azure and Dataiku DSS, to name but a few.
A huge increase in productivity
These software programmes are clearly superior to a basic IDE in that they include not only a user interface but also, most importantly, a workflow system. A workflow is a sophisticated, editable graphical representation of a sequence of algorithmic processes called nodes. Each node represents a basic algorithm or process with input and output parameters, and nodes are connected to one another by flows, drawn as arrows.
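The node-and-flow idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the vendors listed implement their workflows: the names `Node` and `run_workflow` are hypothetical, each node produces a single output, and the sketch does not detect cycles.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    """One basic process (hypothetical class, for illustration only)."""
    name: str
    func: Callable[..., Any]
    # Names of the upstream nodes whose outputs feed this node (the "arrows").
    inputs: List[str] = field(default_factory=list)


def run_workflow(nodes: List[Node]) -> Dict[str, Any]:
    """Run every node after its upstream nodes and return all outputs by name."""
    by_name = {node.name: node for node in nodes}
    results: Dict[str, Any] = {}

    def evaluate(name: str) -> Any:
        # Compute a node once; upstream nodes are evaluated first.
        if name not in results:
            node = by_name[name]
            args = [evaluate(upstream) for upstream in node.inputs]
            results[name] = node.func(*args)
        return results[name]

    for node in nodes:
        evaluate(node.name)
    return results


# A three-node workflow: load -> clean -> average
workflow = [
    Node("load", lambda: [4.0, None, 6.0, 8.0]),
    Node("clean", lambda rows: [r for r in rows if r is not None], inputs=["load"]),
    Node("average", lambda rows: sum(rows) / len(rows), inputs=["clean"]),
]
outputs = run_workflow(workflow)
print(outputs["average"])  # 6.0
```

Because every node's output is kept in `outputs`, each intermediate result can be inspected after the run, which is part of what makes a graphical workflow readable.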
In addition to being highly readable, the workflow is extremely useful in Data Science because it does what a language alone does not: it saves the state of your dataset between processing stages. Generally, the workflow also provides a ready-to-use set of processes that can save a lot of time, particularly during the data preparation and recoding phases.
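To make the idea of saving the dataset's state between stages concrete, here is a minimal sketch that has to be written by hand in plain Python, whereas a workflow tool would provide it out of the box. The function name `run_with_checkpoints` and the JSON file layout are assumptions made for this example. The sketch also does not check whether the input data has changed since a checkpoint was written.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(data, stages, checkpoint_dir):
    """Apply each (name, func) stage in order, saving the result after each one.

    If a checkpoint file already exists for a stage, its content is loaded
    instead of recomputing. Returns the final data and the names of the
    stages that were actually executed.
    """
    checkpoint_dir = Path(checkpoint_dir)
    executed = []
    for index, (name, func) in enumerate(stages):
        path = checkpoint_dir / f"{index:02d}_{name}.json"
        if path.exists():
            data = json.loads(path.read_text())
        else:
            data = func(data)
            path.write_text(json.dumps(data))
            executed.append(name)
    return data, executed


stages = [
    ("drop_missing", lambda rows: [r for r in rows if r is not None]),
    ("rescale", lambda rows: [r / 10 for r in rows]),
]

with tempfile.TemporaryDirectory() as tmp:
    raw = [10, None, 20, 30]
    first_result, first_run = run_with_checkpoints(raw, stages, tmp)
    # The second call finds both checkpoints on disk and recomputes nothing.
    second_result, second_run = run_with_checkpoints(raw, stages, tmp)

print(first_result, first_run)
print(second_result, second_run)
```

The second call reuses the saved intermediate states, which is the behaviour a workflow gives you between processing stages without any extra code.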
The productivity gain brought by workflows is huge and pays off quickly, especially compared to writing code sequentially at the command line. Moreover, whenever a non-native feature is required, it can always be programmed in R or Python. The feature then becomes a node in your workflow and, if it has been properly designed, it can be reused later, a bit like a macro. The workflow thus perfectly complements the language.
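As a rough illustration of a custom feature written once and reused like a macro, the sketch below builds a configurable processing step in Python and plugs it into two different pipelines. The helper names `make_fill_missing_node` and `chain` are hypothetical, and real workflow tools each have their own way of registering a scripted node.

```python
from typing import Callable, List, Optional

Rows = List[Optional[float]]


def make_fill_missing_node(fill_value: float) -> Callable[[Rows], List[float]]:
    """Build a reusable step that replaces missing values with fill_value."""
    def node(rows: Rows) -> List[float]:
        return [fill_value if r is None else r for r in rows]
    return node


def chain(*steps: Callable) -> Callable:
    """Connect steps so that each one receives the output of the previous one."""
    def pipeline(rows):
        for step in steps:
            rows = step(rows)
        return rows
    return pipeline


# The same custom step, configured differently and reused in two pipelines.
fill_with_zero = make_fill_missing_node(0.0)
fill_with_minus_one = make_fill_missing_node(-1.0)

doubled = chain(fill_with_zero, lambda rows: [r * 2 for r in rows])
flagged = chain(fill_with_minus_one)

print(doubled([1.0, None, 3.0]))  # [2.0, 0.0, 6.0]
print(flagged([None, 5.0]))       # [-1.0, 5.0]
```

Keeping the configuration (`fill_value`) separate from the logic is one design choice that makes such a step easy to reuse across projects.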
The best of both worlds
“New generation” Data Science software is now natively language-based (R or Python and, on rare occasions, both) and designed as a complete IDE-type development environment.
The best of these tools usually include, in addition to full language support, the notion of a workflow, which is extremely useful in Data Science: the best of both worlds, if you will.
To conclude, I would say that even though R and Python are very good starting points in Data Science, they should always be complemented with IDE-type software that can implement a graphical workflow if you want productivity gains in your projects.
These new IDEs (some being released even as you read this article) represent the ideal modern work environment for Data Science and will rapidly become indispensable, especially if your objective is to put algorithms and models into production in Big Data environments with minimal obstacles and wasted time.