Statistics versus Machine Learning: should they really be opposed?

29 April 2021, updated on 6 April 2022

Didier Gaultier

This “seemingly” old debate deserves to be revisited with a fresh perspective. Data Science (like Big Data) is a constantly evolving field that now has proven applications, notably in customer knowledge and marketing…

Statistics and machine learning in the era of Data Science and customer knowledge

Even though the field of application is fairly recent, the basic methods used in Data Science are for the most part some forty years old. As a reminder, the two main branches involved are statistics on the one hand and machine learning on the other, to which I would add a third branch consisting of what could be called “business ontologies”, i.e. “structured sets of terms and concepts representing business know-how or a field of application” (Wikipedia). These ontologies help break down business know-how into two main areas:

One grouping the data dictionary and concepts specific to the business

Another one focused on capitalising on the processes and operating procedures specific to this business

Some people are absorbed by the “versus” debate, comparing the statistical and machine learning approaches in terms of efficiency, ROI and cost in the context of predictive applications (predictive marketing, digital marketing, customer knowledge, etc.).

What sparked the debate?

This debate is by no means a new one, in the sense that the two “schools” sprang from two different intellectual traditions. “Machine learning”, sometimes also referred to as “artificial intelligence”, is based on the premise that the ever-growing computational power of computers can help model specific phenomena. Statistics, for its part, is a specialised branch of mathematics that exists, at least theoretically, independently of computers.

The origins of statistics can be traced back to the reign of Louis XIV, who wanted a record of the various trades existing at the time in France (the term “statistics” actually contains the root of the word “état”, meaning state: it is the science of the state). It later split into several schools, namely the French, Anglo-Saxon and Russian schools.

Today, after a remarkable evolution, all three schools have more or less started to converge, and all three benefit from the exponential growth in computing power predicted by Moore’s famous law, which allows their methods to be applied in the form of increasingly powerful programmed algorithms.

Without being chauvinistic, the French school (sometimes called “statistique à la française”, or French-style statistics) remains among the most advanced in the world, at least in the academic sphere.

A little theory

The fact that we can use algorithms to predict phenomena, such as the behaviour of a customer group, remains mind-boggling and quite mysterious for many. But it is not as mysterious as it seems. In fact, all you need is a set of variables that characterise a given phenomenon across a number of actual observations, together with a variable that records the outcome of each observation as a logical, categorical or numeric value. The objective is then to establish a link (or a model) between the output variable (the variable to be predicted) and the input variables (the predictive variables).
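To make the idea of a “link” between input and output variables concrete, here is a minimal sketch in Python. It assumes a single numeric input variable and a numeric output variable, and uses ordinary least squares, which is only one of many possible modelling choices. The data, the variable names (visits, spend) and the helper functions are hypothetical and serve purely as an illustration.

```python
# Hypothetical observations: one value per customer.
# Input variable: number of visits last month. Output variable: amount spent.
visits = [1, 2, 3, 4, 5, 6]
spend = [12.0, 19.0, 31.0, 42.0, 48.0, 61.0]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x_new):
    """Apply the fitted link to a new value of the input variable."""
    intercept, slope = model
    return intercept + slope * x_new


model = fit_line(visits, spend)
print("intercept, slope:", model)
print("predicted spend for 7 visits:", predict(model, 7))
```

The fitted pair (intercept, slope) is the model: it summarises how the output variable moves with the input variable in the observed cases, and it can then be applied to a case whose outcome is not yet known.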

Put as simply as possible, statistics and/or machine learning can be applied only if the value of the variable to be predicted is known for a limited number of observations or cases, called the “learning sample”. Analysing how well the resulting model fits the observed data lets us assess its accuracy on this learning sample.

The next step is to validate the resulting predictive model on a separate “test” sample. This checks the robustness (reliability) of the model built from the learning sample.
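The learning sample and test sample steps can be sketched as follows. This is an illustration under stated assumptions, not a recommended method: the observations are synthetic, the 70/30 split is an arbitrary choice, and the “model” is a deliberately simple threshold rule on one hypothetical input variable (score).

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic observations (hypothetical): one input variable and a known 0/1 outcome.
observations = []
for _ in range(200):
    score = rng.uniform(0, 10)
    responded = 1 if score + rng.gauss(0, 1) > 5 else 0
    observations.append((score, responded))

# Split into a learning sample (70%) and a test sample (30%).
rng.shuffle(observations)
cut = int(0.7 * len(observations))
learning_sample = observations[:cut]
test_sample = observations[cut:]


def accuracy(threshold, sample):
    """Share of observations where the rule 'score > threshold' matches the outcome."""
    hits = sum(1 for score, outcome in sample if (score > threshold) == bool(outcome))
    return hits / len(sample)


def fit_threshold(sample):
    """Pick the candidate threshold with the best accuracy on the given sample."""
    candidates = [score for score, _ in sample]
    return max(candidates, key=lambda t: accuracy(t, sample))


# The model is fitted on the learning sample only.
threshold = fit_threshold(learning_sample)
learning_accuracy = accuracy(threshold, learning_sample)
test_accuracy = accuracy(threshold, test_sample)

print(f"threshold: {threshold:.2f}")
print(f"accuracy on learning sample: {learning_accuracy:.2f}")
print(f"accuracy on test sample: {test_accuracy:.2f}")
```

A large gap between the two accuracies would suggest that the model has adapted too closely to the learning sample and is not robust.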

Naturally, this requires data of fairly high quality, an IT infrastructure that can support data processing, a software tool (statistics and/or machine learning focused) and, of course, a key stakeholder, generally known as a “Data Scientist”, who will follow an approach such as CRISP-DM (Cross Industry Standard Process for Data Mining) that provides a logical framework for the project.

The explanation

The siren song often heard on the market suggests that machine learning solutions can do the job almost single-handedly, without a specialist to configure them, and with much better results than the approach described above.

The fact is that, today, there are almost as many machine learning methods as statistical methods available. However, experience consistently shows that the best results are obtained when both approaches are combined. The debate opposing the two approaches is thus a relatively empty one. In fact, statistics and machine learning are complementary.

This becomes clear once we recognise that any predictive approach (predicting a future state based on a present state) requires a prior explanatory step (explaining a present state using a past state), which in turn requires a prior descriptive step (describing the relationships and correlations between the various variables), and maybe even the implementation of an ontology for the business concerned.
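As an illustration of the descriptive step, the sketch below computes pairwise Pearson correlations between a few variables. Pearson correlation is only one possible descriptive measure, chosen here for brevity. The variable names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical customer variables, one value per customer.
data = {
    "visits": [1, 2, 3, 4, 5, 6],
    "basket": [12.0, 19.0, 31.0, 42.0, 48.0, 61.0],
    "days_since_last_purchase": [30, 25, 18, 12, 9, 3],
}

# Describe every pair of variables before attempting any prediction.
names = list(data)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(data[first], data[second])
        print(f"{first} vs {second}: r = {r:.2f}")
```

A coefficient close to 1 or -1 signals a strong linear relationship that deserves a business explanation before it is fed into a predictive model.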

Today, statistics (whether combined with ontology or not) can definitely give true descriptive or explanatory “business” meaning to data.

So, to oppose or not to oppose?

Can we, in absolute terms, dispense with these descriptive and explanatory steps (i.e. the statistics, maybe even ontology, part) and directly apply machine learning to data to predict a phenomenon?

Even if it is possible in theory and from an IT perspective, I would not recommend it. Indeed, the ease of use of these methods, often touted by the champions of “All Machine Learning”, could make you think that non-statisticians are perfectly capable of using them. This is not the case.

The robustness and accuracy of a purely “machine learning” model do not ensure that it makes sense from a business point of view (only statistics does that).

And even if the initial results of these automatic methods were flawless from a business point of view, a non-specialist user would not necessarily be able to assess the model’s degradation over time due to the arrival of new customer populations or the need to integrate new observations.
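One way a specialist might watch for this kind of degradation is to compare the distribution of an input variable in the learning sample with its distribution among newly arrived customers. The sketch below uses the Population Stability Index (PSI) for this purpose. PSI is an assumption of this example rather than a technique named in this article, and the ages, bin edges and function name are hypothetical.

```python
from math import log


def psi(reference, recent, edges):
    """Population Stability Index between two samples of one variable."""

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            bucket = sum(1 for e in edges if v >= e)  # index of the bin holding v
            counts[bucket] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, new = shares(reference), shares(recent)
    return sum((n - r) * log(n / r) for r, n in zip(ref, new))


edges = [30, 40, 50]  # hypothetical age bins: <30, 30-39, 40-49, 50+

# Hypothetical customer ages.
learning_ages = [22, 25, 31, 34, 38, 41, 45, 47, 52, 58]
similar_ages = [23, 27, 29, 33, 36, 42, 44, 48, 51, 55]   # same kind of population
shifted_ages = [19, 20, 21, 22, 23, 24, 26, 28, 33, 41]   # new, much younger population

print("PSI, similar population:", round(psi(learning_ages, similar_ages, edges), 3))
print("PSI, shifted population:", round(psi(learning_ages, shifted_ages, edges), 3))
```

The index is zero when the two distributions are identical and grows as they diverge, so a sharp rise is a signal that the model may need to be re-examined or rebuilt on more recent observations.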

Machine learning groups a set of key algorithms that generate good results in terms of campaign targeting, customisation, etc. But these results are better, more reliable and more accurate if machine learning builds on intermediate statistical results, such as typologies and propensity scores, that have been obtained in a professional manner.

In summary, machine learning and statistics are not, so to speak, rival methods but well and truly complementary ones. The best marketing and customer knowledge (CRM) results are often obtained when the two are combined…

Business & Decision

Data Scientist and Director of the Data Science & Customer Intelligence offerings at Business & Decision France. He also teaches Data Mining & Statistics applied to Marketing at EPF School and ESCP-Europe.



