Implementing a Gateway Provisioner

A gateway provisioner implementation is necessary if you want to interact with a resource manager that is not currently supported or to extend existing behavior. Resource managers that have drawn interest include Slurm Workload Manager and Apache Mesos. Ultimately, you need access to the resource manager’s API and the ability to apply “tags” or “labels” so that you can discover where the kernel is running within the managed cluster. Once you have that information, the remaining work is implementing the appropriate methods to control the kernel’s lifecycle.

General Approach

Please refer to the Gateway Provisioners section in the System Architecture pages for descriptions and the structure of existing gateway provisioners. The general process for implementing a gateway provisioner is as follows.

  1. Identify and understand how to decorate your “job” within the resource manager. For example,

    1. In Hadoop YARN, this is done by using the kernel’s ID as the application name by setting the --name parameter to ${KERNEL_ID}.

    2. In Kubernetes, we apply the kernel’s ID to the kernel-id label on the pod.

  2. Today, all invocations of kernels into resource managers use a shell or Python script mechanism configured in the argv stanza of the kernelspec. If you take this approach, you need to modify the launch script so that it integrates with your resource manager.

  3. Determine how to interact with the resource manager’s API to discover the kernel and determine on which host it’s running. This interaction should occur immediately following receipt of the kernel’s connection information in its response from the kernel launcher. This extra step, performed within confirm_remote_startup(), is necessary to get the appropriate host name as reflected in the resource manager’s API.

  4. Determine how to monitor the “job” using the resource manager API. This will become part of the poll() implementation to determine if the kernel is still running. This should be as quick as possible since it occurs every 3 seconds. If this is an expensive call, you may need to make adjustments, such as skipping the call on some polling cycles.

  5. Determine how to terminate “jobs” using the resource manager API. This will become part of the kernel’s termination sequence, although it is probably only necessary as a last resort, when the message-based shutdown does not work.
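Step 4 suggests skipping an expensive status call on some polling cycles. One way to do that is to cache the last known status and only query the resource manager once a minimum interval has elapsed. The following is a minimal sketch of that idea, not code from this package: the class name `ThrottledStatusCheck`, the `query_is_running` callable, and the 9-second interval are all assumptions made for illustration. It follows the common convention of returning `None` while the job is running.

```python
import time
from typing import Callable, Optional


class ThrottledStatusCheck:
    """Reuse the last known job status between expensive resource manager queries."""

    def __init__(
        self,
        query_is_running: Callable[[], bool],  # hypothetical call into the resource manager API
        min_interval: float = 9.0,  # assumed value: query on roughly every third 3-second poll
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._query_is_running = query_is_running
        self._min_interval = min_interval
        self._clock = clock
        self._last_query_time: Optional[float] = None
        self._running = True

    def poll(self) -> Optional[int]:
        """Return None while the job is running, otherwise a placeholder exit status of 0."""
        now = self._clock()
        due = (
            self._last_query_time is None
            or now - self._last_query_time >= self._min_interval
        )
        if due:
            # Only hit the resource manager when the cached answer is stale.
            self._running = self._query_is_running()
            self._last_query_time = now
        return None if self._running else 0
```

The trade-off is detection latency: with this approach a terminated job can go unnoticed for up to `min_interval` seconds, so the interval should stay small relative to how quickly you need to react.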

Tip

Because kernel IDs are globally unique, they are ideal identifiers for discovering where in the cluster the kernel is running, and they are the recommended “keys”.
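To illustrate using the kernel ID as the lookup key, here is a minimal sketch against an in-memory stand-in for a resource manager. `FakeResourceManager`, `Job`, `discover_host()`, and `terminate_kernel_job()` are hypothetical names invented for this example; a real implementation would call your resource manager’s own API instead. Only the `kernel-id` label name is taken from the Kubernetes example above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: str
    labels: Dict[str, str]
    host: str
    state: str = "RUNNING"


@dataclass
class FakeResourceManager:
    """Hypothetical in-memory stand-in for a resource manager API client."""

    jobs: List[Job] = field(default_factory=list)

    def submit(self, labels: Dict[str, str], host: str) -> Job:
        job = Job(job_id=f"job-{len(self.jobs) + 1}", labels=dict(labels), host=host)
        self.jobs.append(job)
        return job

    def find_by_label(self, key: str, value: str) -> List[Job]:
        return [job for job in self.jobs if job.labels.get(key) == value]

    def cancel(self, job_id: str) -> None:
        for job in self.jobs:
            if job.job_id == job_id:
                job.state = "CANCELLED"


def discover_host(rm: FakeResourceManager, kernel_id: str) -> Optional[str]:
    """Return the host of the running job labeled with kernel_id, or None if not found."""
    for job in rm.find_by_label("kernel-id", kernel_id):
        if job.state == "RUNNING":
            return job.host
    return None


def terminate_kernel_job(rm: FakeResourceManager, kernel_id: str) -> bool:
    """Cancel every running job labeled with kernel_id; report whether any was cancelled."""
    running = [
        job for job in rm.find_by_label("kernel-id", kernel_id) if job.state == "RUNNING"
    ]
    for job in running:
        rm.cancel(job.job_id)
    return bool(running)


if __name__ == "__main__":
    rm = FakeResourceManager()
    kernel_id = str(uuid.uuid4())
    rm.submit({"kernel-id": kernel_id}, host="node-7")
    print(discover_host(rm, kernel_id))
```

Because the same label drives both discovery and termination, the provisioner never needs to persist the resource manager’s own job identifier.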

You will likely need to provide implementations for launch_process(), poll(), wait(), send_signal(), and kill(), although, depending on where your provisioner resides in the class hierarchy, some implementations may be reused.

For example, if your provisioner is going to service remote kernels, you should consider deriving your implementation from the RemoteProvisionerBase class. If this is the case, then you’ll need to implement confirm_remote_startup().
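The shape of such a subclass might look like the sketch below. To keep it runnable without the package installed, it uses a stand-in base class, `RemoteBaseSketch`, which is not the real RemoteProvisionerBase and does not reproduce its signatures. The `lookup_host` callable, the `assigned_host` attribute, and the retry parameters are likewise assumptions for illustration. The sketch covers only the host-discovery part of startup confirmation.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Callable, Optional


class RemoteBaseSketch(ABC):
    """Stand-in for illustration only; this is not the package's RemoteProvisionerBase."""

    def __init__(self, kernel_id: str) -> None:
        self.kernel_id = kernel_id
        self.assigned_host: Optional[str] = None

    @abstractmethod
    async def confirm_remote_startup(self) -> None:
        """Wait until the kernel's host is known, or raise on timeout."""


class PollingRemoteProvisioner(RemoteBaseSketch):
    """Discovers the kernel's host by repeatedly asking a resource manager."""

    def __init__(
        self,
        kernel_id: str,
        lookup_host: Callable[[str], Optional[str]],  # hypothetical resource manager query
        attempts: int = 10,
        delay: float = 0.5,
    ) -> None:
        super().__init__(kernel_id)
        self._lookup_host = lookup_host
        self._attempts = attempts
        self._delay = delay

    async def confirm_remote_startup(self) -> None:
        for _ in range(self._attempts):
            host = self._lookup_host(self.kernel_id)
            if host:
                # Record the host name as reported by the resource manager.
                self.assigned_host = host
                return
            await asyncio.sleep(self._delay)
        raise TimeoutError(
            f"Kernel {self.kernel_id} was not found after {self._attempts} attempts"
        )
```

Bounding the number of attempts keeps a job that never starts from blocking the launch indefinitely.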

Likewise, if your gateway provisioner is based on containers, you should consider deriving your implementation from the ContainerProvisionerBase class. If this is the case, then you’ll need to implement get_container_status() and terminate_container_resources() rather than confirm_remote_startup().
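The sketch below illustrates why a container-oriented subclass can get by with only those two hooks: a base class can express polling and termination in terms of them. `ContainerBaseSketch` is a stand-in written for this example, not the real ContainerProvisionerBase, and the method signatures, the status strings, and the dictionary of containers are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ContainerBaseSketch(ABC):
    """Stand-in for illustration only; this is not the package's ContainerProvisionerBase."""

    # Assumed status values that mean the container has not finished.
    ACTIVE_STATES = ("pending", "running")

    def __init__(self, kernel_id: str) -> None:
        self.kernel_id = kernel_id

    @abstractmethod
    def get_container_status(self) -> Optional[str]:
        """Return the container's status, or None if it cannot be found."""

    @abstractmethod
    def terminate_container_resources(self) -> bool:
        """Remove the container and related resources; return True if anything was removed."""

    def poll(self) -> Optional[int]:
        # None means "still running"; 0 is a placeholder exit status.
        return None if self.get_container_status() in self.ACTIVE_STATES else 0

    def kill(self) -> bool:
        return self.terminate_container_resources()


class InMemoryContainerProvisioner(ContainerBaseSketch):
    """Toy subclass backed by a dict mapping kernel IDs to container statuses."""

    def __init__(self, kernel_id: str, containers: Dict[str, str]) -> None:
        super().__init__(kernel_id)
        self._containers = containers

    def get_container_status(self) -> Optional[str]:
        return self._containers.get(self.kernel_id)

    def terminate_container_resources(self) -> bool:
        return self._containers.pop(self.kernel_id, None) is not None
```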

Once the gateway provisioner has been implemented, construct an appropriate kernel specification that references your gateway provisioner and iterate until you are satisfied with how your remote kernels behave.
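As a rough illustration, a kernel specification selects its provisioner through the `kernel_provisioner` entry in the `metadata` stanza of `kernel.json`, where `provisioner_name` is the name under which the provisioner is registered. The helper functions below are hypothetical, as are the provisioner name, the launch script path, and the argv contents shown in the usage; substitute the values for your own provisioner.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


def build_kernel_spec(
    provisioner_name: str,
    display_name: str,
    argv: List[str],
    config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble the contents of a kernel.json that references a provisioner."""
    return {
        "argv": argv,
        "display_name": display_name,
        "language": "python",
        "metadata": {
            "kernel_provisioner": {
                "provisioner_name": provisioner_name,
                "config": config or {},
            }
        },
    }


def write_kernel_spec(spec: Dict[str, Any], kernel_dir: Path) -> Path:
    """Write the spec to <kernel_dir>/kernel.json and return the file's path."""
    kernel_dir.mkdir(parents=True, exist_ok=True)
    path = kernel_dir / "kernel.json"
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    spec = build_kernel_spec(
        provisioner_name="my-rm-provisioner",  # hypothetical registered name
        display_name="Python (my resource manager)",
        # Hypothetical launch script and arguments; adapt to your resource manager.
        argv=["/path/to/launch_kernel.sh", "--kernel-id", "{kernel_id}"],
    )
    print(write_kernel_spec(spec, Path("my_rm_python")))
```

Generating the file from code makes it easier to iterate on the `config` values while you test how the remote kernels behave.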

If you intend to contribute your gateway provisioner into this package, you can extend the CLI tooling to create applicable kernel specifications and launch scripts.