
Understanding AI and Data Science

Can a whole Data Science project be done using R or Python?

4 March 2021. Updated on 15 May 2023.

Didier Gaultier

For several years now, many Data Scientists have turned to language-based, command line tools such as R and Python to deal with Big Data. But can you really carry out an entire Data Science project armed with these two technologies alone?

The evolution of Data Science tools

Looking back on the evolution of what is now known as Data Science (a hybrid field covering statistics, machine learning and artificial intelligence), we can see a constant alternation between command line tools and GUI (graphical user interface) tools.

The introduction of a new technology generally makes existing, sophisticated GUI tools obsolete. This leaves only command line tools, with relatively basic human-machine interaction, until vendors develop a new interface compatible with the new technology, which can sometimes take years.

Big Data has been no exception. It caused difficulties for much of the existing Data Mining software on the market, eventually forcing Data Scientists to fall back on basic command line tools to address the challenge it posed.

Spark, Scala, R, Python, Jupyter and others: all of these technologies, currently in high demand in Data Science, are essentially command line tools, driven by code rather than by a graphical interface.

The advent of open source

The latest swing from GUI to command line coincided with the massive rise of open source in the Data Science field, and it is no accident that all of the above-mentioned technologies are open source. We owe this phenomenon notably to academia, since universities increasingly prioritise open source for reasons of cost or vendor independence.

Open source adoption was also undeniably boosted by online courses and MOOCs which, driven by new specialised players such as Coursera, Udemy and Udacity, have taken Data Science by storm in recent years.

Many of today’s Data Scientists are in fact statisticians and “data miners” who have upskilled thanks to MOOCs and a healthy dose of personal motivation.

In short, open source, whether through universities or MOOCs, has become a fixture of Data Science in recent years.

The unstoppable rise of R and Python

The above brings us to the article’s key question: “When it comes to Data Science, can you do everything with R and Python?”

R and Python are not new languages. R derives from the S language, developed by John Chambers and colleagues at Bell Labs from the mid-1970s. Python, for its part, was created in the late 1980s by Guido van Rossum at the National Research Institute for Mathematics and Computer Science (CWI) in Amsterdam.

Many Data Scientists, having trained mainly in one of these two environments, often ask us whether an entire Data Science project can be done using R or Python, and our customers often end up asking the same question.

R or Python?

Put like this, it may seem a strange question to ask, especially since Data Scientists have generally chosen their camp: either R or Python, very rarely both.

However, it must be made clear right away that using R or Python as a bare programming language, without any external libraries, will not take you very far in Data Science. Both languages rely on external libraries for most of the necessary algorithms. Each language has a large number of these essential libraries but they are, of course, not interchangeable from one language to the other. Moreover, some of them are tied to your Big Data work environment, such as Spark’s MLlib.
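To make the point about libraries concrete, here is a minimal sketch of what even a very simple algorithm, a one-variable linear regression, looks like when written in core Python only. The function name fit_line and the sample data are hypothetical, chosen purely for illustration; a dedicated library would normally provide this, along with diagnostics and many more models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept, in plain Python."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Every additional algorithm (a decision tree, a clustering method, a neural network) would need the same hand-written treatment, which is why external libraries are unavoidable in practice.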

The aim of this article is not to launch an R versus Python debate, although it is worth keeping in mind that, in very general terms and with exceptions, R seems a bit better suited to data mining and statistics, whilst Python seems better at standardising algorithms or models, especially in Big Data environments.

No, the point here is to remind you that, to work with these languages (and their associated libraries) in practice, you need a tool known as an IDE (Integrated Development Environment).

An IDE provides, at a minimum, access to an interpreter, which is essential for running the language, and usually an integrated specialised text editor with syntax highlighting.

Of course, nothing prevents you from working with a traditional text editor, or even with several complementary tools, but the current trend is towards enriching these IDEs with the add-ons required for Data Science. Simple examples are Anaconda for Python (strictly speaking a distribution that bundles an interpreter, libraries and tools such as Jupyter and Spyder) and RStudio for R.

IDE, the new basic Data Science tool

This trend towards enhancing IDEs with features that make the languages easier to use has not escaped the attention of the sector’s major software vendors, some of whom have already jumped on the bandwagon. Indeed, the most recent versions of the Data Science software we have been able to test look very much like ramped-up IDEs.

This is particularly apparent in the latest releases from world-renowned Data Science vendors: SAS Viya, IBM Watson DSX, KNIME, Microsoft Azure and Dataiku DSS, to name but a few.

A huge increase in productivity

These software products are clearly superior to a basic IDE in that they include a user interface and, most importantly, a workflow system. A workflow is a sophisticated, editable graphical representation of a sequence of algorithmic processes called nodes. Each node represents a basic algorithm or process with input and output parameters, and nodes are connected to one another by flows, represented graphically by arrows.
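The node-and-flow idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how any of the products named above is implemented: the Node and Workflow classes, their methods and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    """One step of the workflow: a named process with inputs and one output."""
    name: str
    func: Callable[..., Any]
    inputs: List[str] = field(default_factory=list)  # names of upstream nodes


class Workflow:
    """Hypothetical minimal workflow: nodes connected by named flows."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, name: str, func: Callable[..., Any], inputs=()) -> "Workflow":
        if name in self.nodes:
            raise ValueError(f"node {name!r} already exists")
        # Every input must already be defined, so nodes are stored in a valid
        # execution order and the graph cannot contain a cycle.
        for dep in inputs:
            if dep not in self.nodes:
                raise KeyError(f"unknown input node {dep!r}")
        self.nodes[name] = Node(name, func, list(inputs))
        return self

    def run(self) -> Dict[str, Any]:
        results: Dict[str, Any] = {}
        # Dicts keep insertion order, so upstream nodes always run first.
        for node in self.nodes.values():
            args = [results[dep] for dep in node.inputs]
            results[node.name] = node.func(*args)
        return results


# Usage: load -> clean -> mean, each arrow being a flow between two nodes
wf = Workflow()
wf.add("load", lambda: [3, None, 5, 8])
wf.add("clean", lambda rows: [r for r in rows if r is not None], inputs=["load"])
wf.add("mean", lambda rows: sum(rows) / len(rows), inputs=["clean"])
results = wf.run()
print(results["mean"])
```

Because run() keeps the output of every node, each intermediate result can be inspected afterwards, which is part of what makes the graphical version so readable.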

In addition to being highly readable, the workflow is extremely useful in Data Science because it does something a language does not do on its own: it saves the state of your dataset between processing stages. The workflow generally also provides a ready-to-use set of processes that can save a lot of time, particularly during the data preparation and recoding phases.
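In a plain script, keeping the state of a dataset between stages has to be coded by hand. The sketch below shows one possible way of doing so with Python's standard pickle module; the run_stage helper, the stage name and the sample data are hypothetical, and the example says nothing about how a given workflow tool stores its intermediate results.

```python
import pickle
import tempfile
from pathlib import Path


def run_stage(name, func, data, cache_dir):
    """Run one processing stage, or reload its saved output if it exists."""
    path = Path(cache_dir) / f"{name}.pkl"
    if path.exists():
        # The stage already ran: reuse the saved dataset instead of recomputing
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = func(data)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how many times the stage is really executed


def drop_missing(rows):
    calls.append("drop_missing")
    return [r for r in rows if r is not None]


with tempfile.TemporaryDirectory() as cache_dir:
    first = run_stage("drop_missing", drop_missing, [3, None, 5], cache_dir)
    # Second call finds the checkpoint on disk and skips the computation
    second = run_stage("drop_missing", drop_missing, [3, None, 5], cache_dir)

print(first, second, len(calls))
```

Note that this naive cache is keyed on the stage name only, so it would return stale data if the input changed. Handling that properly is exactly the kind of bookkeeping a workflow tool takes off your hands.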

The productivity gain brought by workflows is huge, and the payoff is quick, especially compared with writing linear scripts at the command line. Moreover, whenever a non-native feature is required, it can always be programmed in R or Python. The feature then becomes a node in your workflow and, if it has been properly designed, it can be reused later, a bit like a macro. The workflow thus perfectly complements the language.
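As a rough illustration of a custom feature becoming a reusable node, the following sketch registers a hand-written Python function under a name so that it can be called again on other datasets. The NODE_LIBRARY registry, the node decorator and the min_max_scale function are hypothetical names; real workflow tools each have their own mechanism for wrapping scripts as nodes.

```python
from typing import Callable, Dict

# Hypothetical registry standing in for a workflow tool's library of nodes
NODE_LIBRARY: Dict[str, Callable] = {}


def node(name: str):
    """Decorator that registers a function as a reusable, named node."""
    def register(func: Callable) -> Callable:
        if name in NODE_LIBRARY:
            raise ValueError(f"node {name!r} is already registered")
        NODE_LIBRARY[name] = func
        return func
    return register


@node("min_max_scale")
def min_max_scale(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# The same node is reused on two unrelated (made-up) columns
scaled_ages = NODE_LIBRARY["min_max_scale"]([20, 30, 40])
scaled_prices = NODE_LIBRARY["min_max_scale"]([5.0, 10.0])
print(scaled_ages, scaled_prices)
```

The design point is the one made above: the function takes its data as input and returns its result, without relying on global state, which is what allows it to be reused like a macro.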

The best of both worlds

“New generation” Data Science software is now natively language-based (R or Python and, on rare occasions, both) and designed as a complete IDE-type development environment.

The best of these products usually include, in addition to full language support, the notion of workflow, which is extremely useful in Data Science: the best of both worlds, if you will.

To conclude, I would say that even though R and Python are very good starting points in Data Science, they should always be complemented with IDE-type software that can implement a graphical workflow if you want your projects to deliver productivity gains.

These new IDEs (some being released even as you read this article) represent the ideal modern work environment for Data Science and will rapidly become indispensable, especially if your objective is to put algorithms and models into production in Big Data environments with minimal obstacles and wasted time.

Business & Decision

Data Scientist and Director of the Data Science & Customer Intelligence offerings at Business & Decision France. He also teaches Data Mining & Statistics applied to Marketing at EPF School and ESCP-Europe.


