[TUTORIAL] First steps with Zeppelin

6 April 2017, updated on 4 May 2023

Zeppelin is the ideal companion for any Spark installation. It is a notebook that lets you perform interactive analytics in a web browser. You can execute Spark code and view the results as tables or graphs. To find out more, follow the guide!

Installing Zeppelin

If, like me, you have installed a stand-alone instance of Spark without Hadoop, I recommend that you build Zeppelin from source. For that, you first need to install Maven.

To start the Zeppelin build for Spark 1.5.2, execute this command:

mvn clean package -Pspark-1.5
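
Before launching the build, it can help to confirm that Maven and a JDK are actually on the PATH. The sketch below is one possible way to do that. The helper names require_cmd and build_zeppelin are hypothetical, and the -DskipTests flag is a standard Maven option added here as an assumption to shorten the build; it is not part of the command above.

```shell
#!/bin/sh
# require_cmd is a hypothetical helper, not part of Zeppelin or Maven.
# It succeeds if the named command is found on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing prerequisite: $1" >&2
    return 1
}

# build_zeppelin runs the build command from this tutorial, but only
# once both prerequisites have been found.
build_zeppelin() {
    require_cmd java && require_cmd mvn || return 1
    # -DskipTests (assumption: added here) skips the unit tests.
    mvn clean package -Pspark-1.5 -DskipTests
}

# Usage, from the root of the Zeppelin source tree:
#   build_zeppelin
```

If either tool is missing, the function stops with a message instead of failing partway through the Maven run.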

Then you need to configure Zeppelin with the connection parameters for your Spark instance. Do this by editing the files zeppelin-env.sh and zeppelin-site.xml, which are located in Zeppelin’s conf directory. For me, this gives:

Fragment of my zeppelin-env.sh

export MASTER=spark://spark.bd:7077
export SPARK_HOME=/root/spark_

Fragment of my zeppelin-site.xml

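<property>
  <name>zeppelin.server.addr</name>
  <value>0.0.0.0</value>
  <description>Server address</description>
</property>

<property>
  <name>zeppelin.server.port</name>
  <value>8090</value>
  <description>Server port.</description>
</property>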

Leave 0.0.0.0 as the server address, then specify your preferred port number.
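
A fresh build normally ships these two files only as templates (zeppelin-env.sh.template and zeppelin-site.xml.template), so they may have to be created before they can be edited. The following is a minimal sketch of that step, assuming the templates sit in the conf directory; the helper name init_zeppelin_conf is hypothetical.

```shell
#!/bin/sh
# init_zeppelin_conf is a hypothetical helper. For each of the two files
# edited above, it copies <name>.template to <name> inside the given conf
# directory, unless <name> already exists.
init_zeppelin_conf() {
    conf_dir=$1
    for name in zeppelin-env.sh zeppelin-site.xml; do
        if [ -e "$conf_dir/$name" ]; then
            echo "keeping existing $name"
        elif [ -e "$conf_dir/$name.template" ]; then
            cp "$conf_dir/$name.template" "$conf_dir/$name"
            echo "created $name from template"
        else
            echo "no template found for $name" >&2
            return 1
        fi
    done
}

# Usage (path taken from this tutorial):
#   init_zeppelin_conf /root/incubator-zeppelin/conf
```

Existing files are never overwritten, so the helper can be run again safely after the configuration has been edited.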

Zeppelin can then be started. First, launch Spark in cluster mode:

start-master.sh
start-slave.sh spark://spark.bd:7077 -m 2G

The slave must be launched with enough memory to execute Zeppelin. Otherwise you will be able to access your notebooks but you will not be able to execute Spark applications.

Then run Zeppelin:

cd /root/incubator-zeppelin
./bin/zeppelin-daemon.sh start

You can check that Spark and Zeppelin have started up correctly by going to the Spark monitoring page (in my case, http://localhost:8080).
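
The same check can be scripted from a terminal. The sketch below uses curl to test whether each web interface answers. It assumes both are served over plain HTTP (the default when SSL is not configured) and uses port 8080 for Spark and port 8090 for Zeppelin, as set in this tutorial. The helper names ui_is_up and report are hypothetical.

```shell
#!/bin/sh
# ui_is_up is a hypothetical helper: it succeeds if an HTTP request to the
# given URL completes within 5 seconds with a non-error status.
ui_is_up() {
    curl --silent --fail --output /dev/null --max-time 5 "$1"
}

# report prints one status line for the given URL and never aborts.
report() {
    if ui_is_up "$1"; then
        echo "UP   $1"
    else
        echo "DOWN $1"
    fi
}

# Usage, with the ports used in this tutorial (assumption: plain HTTP):
#   report http://localhost:8080   # Spark master monitoring page
#   report http://localhost:8090   # Zeppelin, port set in zeppelin-site.xml
```

A DOWN line for Zeppelin right after start-up does not necessarily mean a failure: the daemon can take a few seconds before it accepts connections.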

Presentation of Zeppelin

If all goes well, you will be able to access the Zeppelin home page at http://localhost:8090.

You can open an existing notebook or create a new one.

Spark application

The primary value of Zeppelin is being able to write Spark code. Note that the main Spark objects (for example the SparkContext, exposed as sc) are created for you automatically, so there is nothing to import or initialise.

You can therefore write your code directly in a notebook window. Once you have entered the code, just click on the triangle at top right to execute it. The output will be displayed beneath the code.

Zeppelin offers the benefits of spark-shell (direct execution of the code without compilation) while showing all the lines of our application (so that they are easy to edit).

It is also not limited to the basic functions of Spark. For example, you can run machine learning algorithms by importing the required libraries at the beginning of the script.

Graph view

If our application stores its data in DataFrames and then registers them as tables, we can execute SQL code against them and display the results as tables or graphs.

Of course, the options are limited: the aim of Zeppelin is not to compete with the big names of the sector like Qlik or Tableau. The benefit is that a single tool offers the ability to process data using Spark on a powerful cluster and to view the results.

Shell commands

Zeppelin also allows shell commands to be executed via the same interface. No need to open a terminal and connect to our Spark cluster. You can find the path to our files immediately or view the first few lines.
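
As an illustration, here is a minimal sketch of what such a paragraph could contain. The sample file cities.csv and its contents are hypothetical and are created only so that the lines can be run anywhere.

```shell
# In a Zeppelin notebook, the first line of the paragraph would be %sh
# to select the shell interpreter; it is left as a comment here so that
# the same lines also run in an ordinary terminal.

# Hypothetical sample data, created only to make the sketch self-contained.
data_dir=$(mktemp -d)
printf 'id,city\n1,Paris\n2,Lyon\n3,Nantes\n' > "$data_dir/cities.csv"

# Find the path to the files...
ls "$data_dir"

# ...and view the first few lines of one of them.
head -n 2 "$data_dir/cities.csv"
```

Run as is, this lists the single file name, then prints the header line and the first record. On a real cluster you would point ls and head at your own data directory instead.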

Settings

It is possible to configure the various windows of our notebook by clicking on the small cogwheel at top right.

You can then enter a title or reduce the width of our window.

Here is an example of a two-column report with the code displayed and the application output masked:

Other functionalities

A notebook can be exported. It includes the code and the views if they have been generated but not the data: the file generated is thus very small.

Exporting lets you store the notebook locally or send it by email to someone who can import it into their own platform.

On an enterprise platform it is also possible to share a notebook by altering permissions, but I have not tested this.

Another interesting function: cloning, which lets you duplicate a notebook in order to make changes without risk.

A promising tool

Zeppelin is a promising tool. It answers a real need for an integrated tool for all datalab-type work using Spark. In incubation up to May this year, it has recently been lauded by the Apache community. As evidence of its success, it is already offered in the Hortonworks distribution. Find out more about it now by taking a look at the Apache project page.

Business & Decision
