15 Jan 2022
When working with Jupyter notebooks, I often struggle with the combinatorial explosion that inevitably happens as you explore your problem space.
You start prototyping a new model. After a while you end up with $n$ different preprocessing setups and $m$ models with $k$ hyperparameter settings, producing $nmk$ pipelines in total, each evaluated with $l$ metrics on both train and test sets and visualized with $p$ plots. Congratulations! Your notebook is now a complete mess.
I usually deal with the above scenario by putting all the code into functions, classes, and modules that can be shared across multiple notebooks. This approach limits code repetition and prevents my notebooks from getting too long and complex. However, the downside is that you lose a lot of the interactivity. Recently I have been thinking about applying the notion of inheritance to Jupyter notebooks and kernels as another possible solution.
Before going into the details, we need to establish some basic terminology. A notebook is a .ipynb file. A kernel is a process connected to a Python interpreter that evaluates notebook contents. It is possible to connect multiple notebooks to the same kernel.
Let’s have a running notebook base.ipynb with a cell x = "Hey!" that has been executed, and another notebook new.ipynb with the following code:
import jupyter_inheritance
jupyter_inheritance.inherit_from("base.ipynb")
print(x)
When executed, we want the cell above to output Hey! even though the variable x has not been defined in the new.ipynb notebook. There are multiple ways we can achieve this.
The easiest solution is to just copy the contents of the base.ipynb notebook and execute it in new.ipynb. But what if base.ipynb contained a database query that takes forever to run? We would have to wait for the query to finish each time we inherit from base.ipynb. Furthermore, there is no guarantee that the query keeps returning the same results, as the data in the database can change over time.
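For reference, this naive copy-and-re-run approach boils down to something like the sketch below, run inside new.ipynb. It re-executes every code cell of base.ipynb, slow query included, which is exactly the problem:
import nbformat
from IPython import get_ipython

# Re-run every code cell of base.ipynb in the current kernel
nb = nbformat.read("base.ipynb", as_version=4)
shell = get_ipython()
for cell in nb.cells:
    if cell.cell_type == "code":
        shell.run_cell(cell.source)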
We can also just connect new.ipynb to the same kernel as base.ipynb. This has one downside, though. If we decided to redefine x to x = "Bye!" in new.ipynb, the change would propagate back to base.ipynb and x = "Hey!" would be lost.
Serializing the state (all the Python objects existing in the kernel memory) of the base.ipynb kernel, dumping it into a file, and loading it in new.ipynb deals with all the mentioned issues. The code is not executed from scratch and both notebooks use separate kernels. This is the solution we will try to implement.
I use the term “inheritance” instead of “copy” because I think the user experience feels similar to class inheritance in object-oriented programming. It allows us to create an empty notebook, get everything from a parent for free, and build on top of it.
To make the inheritance work, we need to solve three problems: finding the kernel that runs base.ipynb, sending messages to that kernel, and copying its state into the kernel of new.ipynb.
One of the resources exposed by the Jupyter server API is /api/sessions. The response looks something like this:
[
  {
    "path": "base.ipynb",
    "kernel": {
      "id": "3fa85f64",
      "connections": 1
    }
  }
]
In this example, the notebook base.ipynb is using kernel 3fa85f64 to execute its code. Finding the correct id is just a matter of filtering the response using the path to the notebook we want to inherit from.
To get the host of the Jupyter server to which we can send the API request, we can use the list_running_servers function from jupyter_server.serverapp.
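Putting the two together, the lookup could look something like the following sketch. Here find_kernel_id is a hypothetical helper (not part of any published API), and it assumes requests is installed and that list_running_servers reports a token for each server:
import requests
from jupyter_server.serverapp import list_running_servers

def find_kernel_id(notebook_path):
    # Ask every running Jupyter server for its sessions and match by notebook path
    for server in list_running_servers():
        response = requests.get(
            f"{server['url']}api/sessions",
            params={"token": server["token"]},
        )
        for session in response.json():
            if session["path"] == notebook_path:
                return session["kernel"]["id"]
    raise ValueError(f"No running kernel found for {notebook_path}")

kernel_id = find_kernel_id("base.ipynb")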
There is a useful package called jupyter_client that can send messages to Jupyter kernels using ZeroMQ transport. To send a message, we first need the connection details of the receiving kernel. The details can be found in a file jupyter-runtime-dir/kernel-{kernel_id}.json; every running kernel has one. When we have the connection file, we can do this:
from jupyter_client import find_connection_file
from jupyter_client.blocking import BlockingKernelClient

# The connection details live in jupyter-runtime-dir/kernel-{kernel_id}.json
connection_file = find_connection_file(kernel_id)
client = BlockingKernelClient(connection_file=connection_file)
client.load_connection_file()
client.start_channels()
client.execute("y = 'Hello from the other side!'")
If successful, the kernel specified by the connection file now has a defined variable y containing the string "Hello from the other side!". The variable is accessible from any notebook connected to the kernel. The code above can be executed in any Python or Jupyter environment that can reach the target kernel. In our case, we will be running it in new.ipynb.
The naive approach of just serializing everything in __main__.__dict__ with pickle will not get us very far. One of the problems is that during deserialization, we will not be able to unpickle functions or classes that were defined directly in base.ipynb because their definitions will not be available in new.ipynb.
Fortunately, a really neat package called dill can serialize entire module type objects with all the required definitions. The package even offers functions dump_session and load_session to (de)serialize the __main__ module directly.
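As a quick illustration of the round trip (a sketch for a plain Python session; session.pkl is just an example file name):
import dill

x = "Hey!"
dill.dump_session("session.pkl")   # serialize everything defined in __main__

del x                              # pretend we are in a fresh interpreter
dill.load_session("session.pkl")   # restore the serialized __main__
print(x)                           # prints: Hey!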
We can now combine dill and jupyter_client to send a message from new.ipynb to base.ipynb telling the base kernel to serialize its state to a file. When all is done, we just deserialize the file in new.ipynb, effectively copying the state of one kernel to another.
import dill

# Example path; any file that both kernels can read and write will do
storage_file_path = "/tmp/kernel_state.pkl"

# Ask the base.ipynb kernel to dump its entire state to the file
code = f"""
import dill
dill.dump_session("{storage_file_path}")
"""
client.execute(code)
client.get_shell_msg()  # wait for the dump to finish
# Load the serialized state into this (new.ipynb) kernel
dill.load_session(storage_file_path)
The line client.get_shell_msg() blocks code execution in new.ipynb until base.ipynb is done with the serialization. This prevents new.ipynb from immediately going to dill.load_session and loading an incomplete file.
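If you would rather not risk blocking forever (for example when the parent kernel is stuck), get_shell_msg forwards its arguments to the blocking channel and can take a timeout in seconds, and the reply can be checked for errors. A small sketch, assuming a 600-second limit is acceptable:
# Wait at most 10 minutes for base.ipynb to finish dumping its state
reply = client.get_shell_msg(timeout=600)
if reply["content"]["status"] != "ok":
    raise RuntimeError(f"State dump failed: {reply['content']}")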
That is pretty much it! I have published the complete implementation as a Python package. It is a bit more complex than the snippets in this post, but all the important parts work the same. You can check it out on GitHub and try it yourself.