Introduction#

Target group#

The users of Jupyter notebooks are diverse, ranging from data scientists and data engineers to analysts and system engineers. Their skills and workflows are very different. However, one of the great strengths of Jupyter notebooks is that they allow these different experts to work closely together in cross-functional teams.

Data scientists

explore data with different parameters and summarise the results.

Data engineers

check the quality of the code and make it more robust, efficient and scalable.

Data analysts

use the code provided by data engineers to systematically analyse the data.

System engineers

provide the research platform based on the JupyterHub on which the other roles can perform their work.

In this tutorial we address system engineers who want to build and run a platform based on Jupyter notebooks. We then explain how this platform can be used effectively by data scientists, data engineers and analysts.

Why Jupyter?#

How can these diverse tasks be simplified? You will hardly find a single tool that covers all of them; often several tools are required even for individual tasks. Therefore, on a more abstract level, we are looking for general patterns for tools and languages with which data can be analysed and visualised, and with which a project can be documented and presented. This is exactly what Project Jupyter aims to provide.

The Jupyter project started in 2014 with the aim of creating a consistent set of open source tools for scientific research, reproducible workflows, computational narratives and data analysis. In 2017, Jupyter received the ACM Software System Award, a prestigious distinction that it shares with, among others, Unix and the Web.

To understand why Jupyter notebooks are so successful, let’s take a closer look at the core functions:

Jupyter Notebook Format

Jupyter notebooks are an open, JSON-based document format that stores a complete record of the user's sessions together with the embedded code; a minimal example of the format follows this list.

Interactive Computing Protocol

The notebook communicates with the computing kernel via the Interactive Computing Protocol, an open network protocol that exchanges JSON messages over ZeroMQ and WebSockets.

Kernels

Kernels are processes that execute interactive code in a specific programming language and return the output to the user. A sketch of a kernel being driven over this protocol also follows this list.
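
To make the notebook format concrete, here is a minimal sketch of a valid notebook document, built and written with Python's standard library. The file name and cell contents are illustrative assumptions, not part of any specification:

```python
import json

# A minimal notebook document in nbformat 4: a list of cells plus
# notebook-level metadata and format version fields.
minimal_notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,  # not yet executed
            "metadata": {},
            "outputs": [],            # filled in by the kernel on execution
            "source": ["print('Hello, Jupyter!')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# A notebook is plain JSON on disk, so the standard library suffices.
with open("minimal.ipynb", "w", encoding="utf-8") as f:
    json.dump(minimal_notebook, f, indent=1)
```

Because the format is plain JSON, notebooks can be inspected, diffed and generated by ordinary tools outside of Jupyter itself.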
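The protocol and the kernels can likewise be sketched in a few lines with the jupyter_client library, which the Jupyter project provides for exactly this purpose. The example assumes a local installation with the python3 kernelspec available:

```python
from jupyter_client import KernelManager

# Start a kernel as a separate process (assumes the "python3"
# kernelspec is installed, which IPython provides).
km = KernelManager(kernel_name="python3")
km.start_kernel()

# The client exchanges JSON messages with the kernel over ZeroMQ,
# following the Interactive Computing Protocol.
kc = km.client()
kc.start_channels()
kc.wait_for_ready(timeout=30)

# Send an execute_request and read the reply from the shell channel.
kc.execute("1 + 1")
reply = kc.get_shell_msg(timeout=5)
print(reply["content"]["status"])  # "ok" if execution succeeded

# Shut everything down again.
kc.stop_channels()
km.shutdown_kernel()
```

The same request/reply cycle underlies every notebook session; the notebook server merely bridges it to the browser via WebSockets.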

Jupyter infrastructure#

A platform for the above-mentioned use cases requires an extensive infrastructure that allows not only the provision of kernels and the parameterisation, scheduling and parallelisation of notebooks, but also the uniform provisioning of resources.
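
As a hedged illustration of notebook parameterisation, papermill is one widely used tool for injecting parameters into a notebook and executing it headlessly; the tool choice, file names and parameter names here are our assumptions, not a fixed part of the platform:

```python
import papermill as pm

# Execute a notebook with injected parameters; papermill writes the
# fully executed result, including all outputs, to a new notebook.
# "analysis.ipynb" and the parameter names are hypothetical examples.
pm.execute_notebook(
    "analysis.ipynb",
    "analysis-2024-01.ipynb",
    parameters={"start_date": "2024-01-01", "sample_size": 10000},
)
```

Combined with a scheduler, this covers the scheduling mentioned above; parallelisation then amounts to running several such executions side by side.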

This tutorial provides a platform that enables fast, flexible and comprehensive data analysis beyond Jupyter notebooks. For the moment, however, we do not cover how it can be extended with streaming pipelines and domain-driven data stores.

You can also create and run the examples from this Jupyter tutorial locally.

Workspace#

Setting up the workspace includes installing and configuring IPython and Jupyter Notebook together with nbextensions and ipywidgets; a typical installation sequence is sketched below.
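
As a minimal sketch, assuming a plain pip-based environment (the setup chapters of this tutorial describe the exact procedure), the installation can look like this:

```console
$ pip install ipython jupyter ipywidgets
$ pip install jupyter_contrib_nbextensions
$ jupyter contrib nbextension install --user
```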