GPU Node Install

Add a GPU host to your NUSAPOD fleet by running the join command generated by your control plane. This guide covers everything to check before and after running that command.

Prerequisites

NVIDIA driver: the join script installs the NVIDIA container toolkit but expects the kernel driver to already be present. Verify before proceeding:

bash

nvidia-smi

If nvidia-smi fails, install the driver first (Ubuntu: sudo ubuntu-drivers install; for other distros follow NVIDIA's guide for your kernel version). A reboot is usually required after driver installation.
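If you provision hosts with a script, the driver check above can be wrapped in a small preflight helper. This is a minimal sketch; `check_driver` is a hypothetical name and is not part of the join script. It only distinguishes "command missing" from "command present but failing".

```bash
# Preflight sketch: confirm the NVIDIA kernel driver responds before joining.
# check_driver is a hypothetical helper, not something the join script provides.
check_driver() {
  local cmd="${1:-nvidia-smi}"   # command to probe; defaults to nvidia-smi

  # Exit code 1: the tool is not installed (driver package likely missing).
  if ! command -v "$cmd" >/dev/null 2>&1; then
    echo "driver check: $cmd not found on PATH" >&2
    return 1
  fi

  # Exit code 2: the tool exists but cannot talk to the driver
  # (for example, a reboot is still pending after installation).
  if ! "$cmd" >/dev/null 2>&1; then
    echo "driver check: $cmd is installed but exited with an error" >&2
    return 2
  fi

  echo "driver check: ok"
}
```

Call `check_driver || exit 1` at the top of a provisioning script so the join is never attempted on a host without a working driver.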

Network: the node dials out — no inbound ports need to be opened on the GPU host itself. Required outbound access:

Port 443 to the control plane, get.k3s.io, and the NVIDIA toolkit repo.
Port 6443 to the control plane (K3s API).
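The outbound requirements above can be probed before you run the join command. The sketch below opens a plain TCP connection to each target using bash's `/dev/tcp` feature and the coreutils `timeout` command. `can_connect` and `check_outbound` are hypothetical helper names, and the 5-second timeout is an arbitrary choice. A raw TCP probe does not go through an HTTP proxy, so a failure on a proxied network is not conclusive.

```bash
# can_connect HOST PORT: succeed if a TCP connection opens within 5 seconds.
# Runs the probe in a child bash so /dev/tcp is available regardless of the calling shell.
can_connect() {
  local host="$1" port="$2"
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# check_outbound HOST:PORT...: probe every target, print one line per target,
# and return non-zero if any of them is unreachable.
check_outbound() {
  local failed=0 target
  for target in "$@"; do
    if can_connect "${target%:*}" "${target##*:}"; then
      echo "ok   $target"
    else
      echo "FAIL $target"
      failed=1
    fi
  done
  return "$failed"
}
```

Example, assuming `CONTROL_PLANE` holds your control-plane hostname: `check_outbound "$CONTROL_PLANE:443" "$CONTROL_PLANE:6443" get.k3s.io:443`. Add the NVIDIA toolkit repo host for your distribution as a further `host:443` argument.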

Access: run the join command as root (sudo).

Clock: keep the node NTP-synced — the enrollment token is single-use and time-limited (default 1 hour). If it expires, generate a fresh join command from the console (Fleet → Connect node).
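One way to confirm clock synchronisation, assuming a systemd-based host where `timedatectl` is available (hosts that run chrony or another NTP client without systemd integration need a different check). `clock_synced` is a hypothetical helper name.

```bash
# clock_synced: succeed when systemd reports the system clock as NTP-synchronised.
# Assumes timedatectl is present; prints nothing and only sets the exit status.
clock_synced() {
  [ "$(timedatectl show --property=NTPSynchronized --value 2>/dev/null)" = "yes" ]
}
```

For example: `clock_synced || echo "clock is not NTP-synced; fix this before requesting a token" >&2`.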

Join the cluster

Copy the join command from the console (Fleet → Connect node) and run it on the GPU host:

bash

curl -sfL https://<control-plane>/install.sh | bash -s -- --token <token> --server https://<control-plane>

The token is single-use and expires after about 1 hour by default. If it expires before you run the command, generate a new join command from the console; the old token cannot be reused.

The script:

Verifies the NVIDIA driver (nvidia-smi).
Redeems the enrollment token for the K3s join token.
Installs the NVIDIA container toolkit and sets nvidia as the default containerd runtime (see below).
Installs and joins the K3s agent, labelling the node gpu=on.

nvidia as the default containerd runtime

Warning

K3s registers an nvidia runtime when the toolkit is present, but leaves the default as runc. NUSAPOD pods carry no runtimeClassName, so they use the default runtime. Under runc a pod runs but cannot see the GPU (VRAM limits won't be enforced, nvidia-smi inside the pod fails).

The join script sets nvidia as the default automatically and rolls the change back if it would break the runtime, so the node still joins safely. Verify after the script completes:

bash

sudo grep default_runtime_name /var/lib/rancher/k3s/agent/etc/containerd/config.toml

Expected output: default_runtime_name = "nvidia". If it still shows runc, the default has to be set manually: see the troubleshooting section in the control-plane setup guide, or contact support.
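For automated checks, the grep above can be turned into a pass/fail test. This is a sketch under the assumption that the setting appears on a single line in the form `default_runtime_name = "..."`; `default_runtime` and `runtime_is_nvidia` are hypothetical helper names. The document's own command reads the file with sudo, so run these from a root shell.

```bash
# default_runtime FILE: print the value of the first default_runtime_name entry,
# or nothing if the file has no such entry.
default_runtime() {
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" |
    head -n 1
}

# runtime_is_nvidia [FILE]: succeed only when the default runtime is nvidia.
# FILE defaults to the K3s agent containerd config path used in this guide.
runtime_is_nvidia() {
  local config="${1:-/var/lib/rancher/k3s/agent/etc/containerd/config.toml}"
  [ "$(default_runtime "$config")" = "nvidia" ]
}
```

For example: `runtime_is_nvidia || echo "default runtime is not nvidia" >&2`.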

Verify

Within about one minute of the script completing, the node should appear as online in the console under Fleet. The platform registers the GPU(s) automatically. You can also check from the control plane:

bash

kubectl get nodes

The new node should show a STATUS of Ready. GPU inventory and telemetry (utilisation, memory, temperature, power) populate in the console within about 1 minute of the GPU device plugin starting on the node.
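To script this check from the control plane, you can wait for the Ready condition and then confirm the gpu=on label that the join script applies. This is a sketch: `wait_gpu_node` is a hypothetical helper name, and the 120-second default timeout is an assumption chosen to leave headroom over the one-minute estimate above.

```bash
# wait_gpu_node NODE [TIMEOUT]: block until NODE reports Ready, then confirm
# that it carries the gpu=on label. Returns non-zero if either step fails.
wait_gpu_node() {
  local node="$1" timeout="${2:-120s}"

  # Step 1: wait for the Ready condition (fails when the timeout elapses).
  kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout" >/dev/null || return 1

  # Step 2: list nodes labelled gpu=on as "node/<name>" and look for an exact match.
  kubectl get nodes -l gpu=on -o name | grep -qx "node/$node"
}
```

For example, `wait_gpu_node gpu-host-1 && echo "node is Ready and labelled"`, where `gpu-host-1` is a placeholder for your node's name as shown by `kubectl get nodes`.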
