To all those young people wishing to embark on a career in Data Science, my advice was to begin with a Data Engineering job rather than going straight into a Data Scientist role… Today, I would like to walk you through the apprenticeships and training programmes that will help you become a Data Engineer.
Data Engineer: which training programmes to choose?
A Data Engineer has a thorough understanding of Big Data ecosystems such as Spark and Hadoop, as well as the programming skills to work within them. Data Engineers typically perform the following tasks:
- Operationalize Big Data infrastructure
- Handle data ingestion into, and extraction from, the infrastructure
- Take care of data preparation and first-level recoding
- Program, automate and optimize algorithms in the target infrastructure
“A Data Engineer is first and foremost an IT engineer”
A Data Engineer is first and foremost an IT engineer. As such, formal IT training from universities and engineering schools, specializing in Big Data and, of course, Data Engineering, is a perfect fit for the position.
Ideally, this training should include the most advanced courses available in the Python and Scala languages, and should also build high-level proficiency in SQL and its modern Big Data dialects such as Hive, Impala or Spark SQL.
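The SQL fluency in question transfers almost unchanged between engines: an aggregate query like the one below reads the same in Hive, Impala or Spark SQL. As a minimal, self-contained illustration, here it runs against Python's built-in SQLite rather than a Big Data engine (the table and data are invented for the example):

```python
import sqlite3

# An in-memory database stands in for a real warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 250.0), ("south", 80.0)],
)

# Standard SQL: group, aggregate, order — the bread and butter
# of Data Engineering work on any SQL-on-Big-Data engine.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
```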
Technical education programmes (we are not talking about “soft skills” here, which will be addressed in a later article) should focus on major areas covering, at the very least, Big Data, the Cloud, DevOps methods and, of course, Artificial Intelligence.
Regarding Big Data, the imperatives are naturally Spark and Hadoop. Hadoop comes with a whole ecosystem of technologies such as ZooKeeper, Hive, NiFi, Oozie and Kafka. A significant number of these technologies are Java-based, so a sound knowledge of Java helps you better control the environment, but you do not have to be a Java EE developer to become a Data Engineer (let alone a Data Scientist).
Spark is now a must
Spark, on the other hand, is absolutely essential. You can approach it in two ways: through Python by means of PySpark, or through the Scala language. Both options are valid but, needless to say, mastering both would be ideal.
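Whichever language you choose, what you are really learning is Spark's functional programming model. The classic word count below sketches that model in plain Python on an ordinary list, mimicking locally what PySpark's `flatMap` and `reduceByKey` do over distributed data (the input lines are invented for the example):

```python
from collections import Counter
from functools import reduce

lines = ["spark makes big data simple", "big data needs spark"]

# "flatMap": split each line into individual words.
words = [w for line in lines for w in line.split()]

# "reduceByKey": merge per-word counts pairwise, the way Spark
# combines partial results across partitions.
counts = reduce(lambda acc, w: acc + Counter([w]), words, Counter())
```

In actual PySpark the same logic would run on an RDD or DataFrame spread across a cluster; the local version is only a way to internalize the map/reduce style of thinking.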
Regarding the Cloud, since there is no shortage of providers, choices must be made. What matters is mastering the Spark and Hadoop infrastructures in the target cloud(s) you have chosen to study. Indeed, each cloud has its own technical specificities, in particular the Artificial Intelligence APIs its vendor provides, which you would be wise to get to know.
Carrying out an AI project
This brings us to Artificial Intelligence (AI), a field in which Python, of course, reigns supreme. Beware, however: in addition to Python itself, you will need expert knowledge of quite a few Python libraries in order to see a whole project through. We can only mention the most important ones here, such as NumPy, Pandas, Matplotlib, scikit-learn, MLlib, etc. You should also have a good command of version control systems such as Git, GitHub and GitLab, and of notebooks such as Jupyter and Zeppelin.
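To give a flavour of the numerical work these libraries industrialize, here is the core of a least-squares linear fit written in plain Python with the standard library only. This is a deliberately minimal sketch with made-up data points; with scikit-learn the same fit would be a call to `LinearRegression().fit(X, y)`:

```python
from statistics import mean

# Toy data: x values and noisy y observations (invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

# Ordinary least squares for a single feature:
# slope = cov(x, y) / var(x), intercept from the means.
x_bar, y_bar = mean(xs), mean(ys)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    / sum((x - x_bar) ** 2 for x in xs)
)
intercept = y_bar - slope * x_bar
```

Libraries such as NumPy and scikit-learn do exactly this kind of arithmetic, but vectorized, numerically robust and generalized to many features, which is why fluency in them, rather than hand-rolled code, is what projects actually require.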
It is always possible to execute an AI project 100% in hand-written Python, but no customer (internal or external) will want to buy such a solution, because it will be too expensive and too difficult to maintain.
Therefore, you must also be able to work with the Artificial Intelligence platforms available on the market. There are a number of them, and the objective of this article is not to draw up an exhaustive list or to compare them. In this regard, I would refer you, for example, to Gartner's 2019 Magic Quadrant for Data Science and Machine Learning platforms.
Technical and project management training
Besides the technical education programmes outlined above, you should also consider training in project management methods, notably DevOps practices and the inescapable CRISP-DM method. The Scrum method must also be understood, but beware of “poor combinations” of methods in Data Science: for instance, CRISP-DM must always take priority over Scrum. To support DevOps, Docker and Kubernetes are particularly interesting technologies.
This is why Business & Decision has launched the Data School (École de la Data) in France to train Data Engineers and, ultimately, Data Scientists and Data Analysts. The idea behind this project is to ensure that the talented young people who join our ranks are fully operational after attending Business & Decision's Data School. The course lasts three months and complements what is taught in engineering schools and universities.