15 Jan 2022
When working with Jupyter notebooks, I often struggle with the combinatorial explosion that inevitably happens as you explore your problem space.
You start prototyping a new model. After a while you end up with $n$ different preprocessing setups, $m$ models with $k$ hyperparameter settings producing $nmk$ pipelines in total, each evaluated with $l$ metrics using both test and train sets and visualized with $p$ plots. Congratulations! Your notebook is now a complete mess.
I usually deal with the above scenario by putting all the code into functions, classes, and modules that can be shared across multiple notebooks. This approach limits code repetition and prevents my notebooks from getting too long and complex. However, the downside is that you lose a lot of the interactivity. Recently I have been thinking about applying the notion of inheritance to Jupyter notebooks and kernels as another possible solution.
Before going into the details, we need to establish some basic terminology. A notebook is an
.ipynb file. A kernel is the process running the Python interpreter that evaluates the notebook's
contents. It is possible to connect multiple notebooks to the same kernel.
Let's have a running notebook base.ipynb in which a cell x = "Hey!" has already been executed,
and another notebook new.ipynb with the following code:

import jupyter_inheritance
jupyter_inheritance.inherit_from("base.ipynb")
print(x)

When executed, we want the cell above to output Hey! even though the variable x has not
been defined in the new.ipynb notebook. There are multiple ways we can achieve this.
The easiest solution is to just copy the contents of the base.ipynb notebook and execute them in new.ipynb.
But what if base.ipynb contained a database query that takes forever to run? We would have to wait
for the query to finish every time we inherit from base.ipynb. Furthermore, there is no guarantee
that the query keeps returning the same results, as the data in the database can change over time.
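Just to illustrate, such a copy-and-re-run approach could be sketched with nbformat roughly like this (this is not what we will end up doing, and it ignores details like IPython magics):

```python
# Naive "copy and re-run" sketch: read base.ipynb and execute all of its
# code cells in the current kernel.
import nbformat

nb = nbformat.read("base.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        exec(cell.source)  # re-runs everything, including the slow database query
```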
We can also just connect new.ipynb to the same kernel as base.ipynb. This has one downside, though:
if we decided to redefine x to x = "Bye!" in new.ipynb, the change would propagate back to
base.ipynb and the original x = "Hey!" would be lost.
Serializing the state (all the Python objects existing in the kernel memory) of the base.ipynb kernel,
dumping it into a file, and loading it in new.ipynb deals with all the issues mentioned above: the code is
not executed from scratch, and both notebooks use separate kernels. This is the solution we will try
to implement.
I use the term “inheritance” instead of “copy” because I think the user experience feels similar to class inheritance in object-oriented programming: it allows us to create an empty notebook, get everything from a parent for free, and build on top of it.
To make the inheritance work, we need to solve three problems: finding the kernel that base.ipynb is connected to, sending that kernel a message telling it to serialize its state, and loading the serialized state into the new.ipynb kernel.
One of the resources exposed by the Jupyter server API is /api/sessions. The response
looks something like this:
[
  {
    "path": "base.ipynb",
    "kernel": {
      "id": "3fa85f64",
      "connections": 1
    }
  }
]

In this example, the notebook base.ipynb is using the kernel 3fa85f64 to execute its code.
Finding the correct kernel id is just a matter of filtering the response by the path of the
notebook we want to inherit from.
To get the host of the Jupyter server to which we can send the API request, we can use the
list_running_servers function from jupyter_server.serverapp.
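Putting the two together, the lookup could be sketched roughly like this (the find_kernel_id helper is my own name for it, and authentication details may differ in your setup):

```python
# Sketch: find the id of the kernel running a given notebook by asking
# the local Jupyter server's /api/sessions endpoint.
import requests
from jupyter_server.serverapp import list_running_servers

def find_kernel_id(notebook_path: str) -> str:
    for server in list_running_servers():
        # Pass the server's token along so the API call is authorized.
        response = requests.get(
            f"{server['url']}api/sessions",
            headers={"Authorization": f"token {server['token']}"},
        )
        for session in response.json():
            if session["path"] == notebook_path:
                return session["kernel"]["id"]
    raise RuntimeError(f"No running kernel found for {notebook_path}")

kernel_id = find_kernel_id("base.ipynb")
```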
There is a useful package called jupyter_client that can send messages to Jupyter kernels
over the ZeroMQ transport. To send a message, we first need the connection details of the receiving
kernel. The details can be found in a file jupyter-runtime-dir/kernel-{kernel_id}.json;
every running kernel has one. When we have the connection file, we can do this:
from jupyter_client import BlockingKernelClient

client = BlockingKernelClient(connection_file=connection_file)
client.load_connection_file()
client.start_channels()
client.execute("y = 'Hello from the other side!'")

If successful, the kernel specified by the connection file now has a variable y defined,
containing the string "Hello from the other side!". The variable is accessible from any
notebook connected to that kernel. The code above can be executed in any Python or Jupyter
environment that can reach the target kernel; in our case, we will be running it in new.ipynb.
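To get the connection_file path used above, one option is to combine the kernel id from the /api/sessions lookup with Jupyter's runtime directory; roughly:

```python
# Sketch: build the path to the kernel's connection file from its id.
from pathlib import Path
from jupyter_core.paths import jupyter_runtime_dir

kernel_id = "3fa85f64"  # example value; normally taken from the /api/sessions lookup
connection_file = str(Path(jupyter_runtime_dir()) / f"kernel-{kernel_id}.json")
```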
The naive approach of just serializing everything in __main__.__dict__ with pickle
will not get us very far. One of the problems is that during deserialization, we will
not be able to unpickle functions or classes that were defined directly in base.ipynb,
because their definitions will not be available in new.ipynb.
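To make the problem concrete, here is a tiny illustration (greet is just a made-up example):

```python
import pickle

def greet():  # defined directly in the notebook, i.e. in the __main__ module
    return "Hey!"

payload = pickle.dumps(greet)  # works, but stores only the reference "__main__.greet"
# Running pickle.loads(payload) in a different kernel raises AttributeError,
# because that kernel's __main__ module has no attribute named greet.
```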
Fortunately, a really neat package called dill can serialize entire module-type objects
with all the required definitions. The package even offers the functions dump_session and
load_session to (de)serialize the __main__ module directly.
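On their own, the session functions can be used roughly like this (the file name is just an example):

```python
import dill

x = "Hey!"

def greet():
    return x

# Serialize everything defined in __main__ (x, greet, imports, ...) to a file.
dill.dump_session("session.pkl")

# Later, in a fresh interpreter:
#   import dill
#   dill.load_session("session.pkl")
#   greet()  # -> "Hey!"
```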
We can now combine dill and jupyter_client to send a message from new.ipynb to
base.ipynb, telling the base kernel to serialize its state to a file. When that is done,
we just deserialize the file in new.ipynb, effectively copying the state of one kernel
to another.
code = f"""
import dill
dill.dump_session("{storage_file_path}")
"""
client.execute(code)
client.get_shell_msg()
dill.load_session(storage_file_path)The line client.get_shell_msg() blocks new.ipynb code execution until base.ipynb is
done with the serialization. This prevents new.ipynb going immidiately to dill.load_session
and loading an incomplete file.
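If we wanted to be extra careful, we could also inspect the reply and make sure the serialization actually succeeded before loading the file; something like this, replacing the bare get_shell_msg() call above:

```python
# Check the execute_reply status before loading the dumped state.
reply = client.get_shell_msg()
if reply["content"]["status"] != "ok":
    raise RuntimeError("base.ipynb failed to serialize its kernel state")
dill.load_session(storage_file_path)
```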
That is pretty much it! I have published the complete implementation as a Python package. It is a bit more complex than the snippets in this post, but all the important parts work the same. You can check it out on GitHub and try it yourself.