Kubes and GPUs
Graphics cards are once again available. The AI hype is on for sure, but it’s not as much of a craze as the mining fad of yesteryear. Finally, graphics-based compute is semi-affordable again, especially if you look to the second-hand market. Having dipped my toes into the world of self-hosting large language models, I have come to realize that hosting them on CPUs alone is not going to give me a good insight into what this tech is capable of. A lot of models and APIs are either optimized for or exclusively available on GPU platforms. The one that tipped the scale for me is refact.ai. I really wanted to give the self-hosted version a spin, and so I went on to make my wallet slimmer.
The Hardware
What GPU you should buy to get the most out of it depends on the models that you want to run on it. There are smaller coding-focused LLMs that can run on a GTX 1060 6GB or an 8GB GTX 1070, which can be sniped on the second-hand market for under 100 euros. AMD has historically sold cards with more VRAM, which would make their second-hand units the best bargain; sadly, their ROCm software platform for running AI compute tasks still seems to be in its early days, and few, if any, consumer cards are supported at the moment.
Some of the available models and their estimated VRAM requirements:
| Model | VRAM requirement |
|---|---|
| Refact/1.6B | ~4 GB |
| starcoder/15b/base | ~9 GB |
| wizardcoder/15b | ~9 GB |
| codellama/7b | ~14.3 GB |
I was actually about to pick up a second-hand GTX 1070 with 8GB of VRAM for 90 euros when I found a good deal on a refurbished RTX 3060 12GB and ordered it immediately for 270 euros delivered. In terms of performance per euro the 1070 was the much better choice, however the 3060 has three things going for it:
- lower power consumption (lol nope)
- smaller physical footprint (I am trying to cram this into a mITX case)
- extra 4GB of VRAM (more VRAM more better models)
Fun fact: now that I already have the card and have started writing this article, I realized that the 3060 actually has a higher peak power draw and similar idle numbers to the 1070… Oh well… According to TechPowerUp anyway: 1070: 13W idle, 148W peak; 3060: 13W idle, 179W peak (https://www.techpowerup.com/review/evga-geforce-rtx-3060-xc/36.html).
A lot of models that I want to try out exceed the 8GB mark once loaded into VRAM, notably starcoder. I was also curious to figure out what it would take to set up CUDA time sharing on a single GPU, where multiple smaller models are loaded into VRAM simultaneously. It is a shame that AMD is not an option, as they have a number of 16GB VRAM cards at a similar price to this one (the RX 6800, for example), while for Nvidia we would have to jump up to an RTX 4060 Ti 16GB, which retails at around 500 euros.
The Platform
You know what you’re getting yourself into coming here, right? It’s all about Kubernetes xD One of the Proxmox nodes in my server rack has a spare x16 slot that will happily house our refurb 3060, and we can pass it through to one of the Kubernetes node VMs. To do this we need to follow a few steps:
Prepare Proxmox
- Drain the node to be shut down:
kubectl drain k8s12 --delete-emptydir-data --ignore-daemonsets
- Edit the hardware definition of the VM (add the new PCI device); a qm CLI sketch of these Proxmox-side steps follows the list
- Shut down the node so that it may pick up the hardware change
- (optional) Make a snapshot/backup of the VM, in case we screw something up later
- Boot it up again
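If you prefer to do the Proxmox side from a shell rather than the web UI, the same steps look roughly like this using the qm tool. This is just a sketch: the VM ID (112) and the GPU’s PCI address (01:00) are placeholders for my setup, and pcie=1 assumes the VM uses the q35 machine type.

```bash
# Find the GPU's PCI address on the Proxmox host
lspci -nn | grep -i nvidia

# (optional) Snapshot the VM before touching its hardware
qm snapshot 112 pre-gpu-passthrough

# Attach the GPU to the VM (placeholder VM ID and PCI address)
qm set 112 --hostpci0 01:00,pcie=1

# Power-cycle the VM so it picks up the new hardware
qm shutdown 112
qm start 112
```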
Voila, the GPU should now be available inside the VM. We can check that by running:
~ ❯ lspci octopusx@k8s12
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:03.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
00:10.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] (rev a1)
00:11.0 Audio device: NVIDIA Corporation Device 228e (rev a1)
00:12.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:1e.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
00:1f.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
- Uncordon the K8S node to re-enable scheduling on it
kubectl uncordon k8s12
Fun fact
As much as the above steps are all that I technically had to do, had my Proxmox server been correctly configured from the get-go, I was getting weird behavior at the start. The graphics card was passed into the VM the first time it booted, but the driver (which I installed next) would tell me there was no GPU available, even after rebooting the VM. I started messing around, thinking it was maybe some weird property of PCIe passthrough on Nvidia hardware, and went on to blacklist the nvidia kernel modules/drivers on the Proxmox host to make sure the card doesn’t get initialized by the host, which in turn prevented the VM I was passing it through to from booting altogether.

In the end I jumped into the BIOS of the machine to make sure all of the PCIe-related settings were in order and… of course I had left the bifurcation setting on the x16 slot in x8/x8 mode from when I was using multiple NVMe drives in that slot. Changing that back and reverting all of the trial-and-error driver shenanigans on the host got the card to work, boot, and be available inside the VM every time.
Install Nvidia Drivers
Now off to install all of that dirty proprietary Nvidia stuff. We only just gave them our money, so we’ve got nothing to lose by going all the way and also giving them our souls.
Nvidia has a pretty solid guide with versions for various popular server Linux flavors, which amounts to installing the CUDA toolkit and drivers.
- Install the CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3
- Install Nvidia drivers
sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-drivers-545
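A reboot is usually needed so that the freshly installed kernel modules get loaded; once the VM is back up, running nvidia-smi directly in the VM is a quick sanity check that the driver sees the passed-through card, before any containers get involved:

```bash
sudo reboot

# After the VM comes back up, the driver should list the RTX 3060
nvidia-smi
```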
- Install the Nvidia container runtime
The last step we need to prepare our Ubuntu VM to run GPU-accelerated tasks inside OCI containers is to install the Nvidia container runtime. This is an OCI-compatible container runtime that can run GPU-accelerated containers, and it plugs into containerd, which implements the Kubernetes CRI interface that K3S uses to run containers. Once it is installed, K3S will automatically detect it when its service is restarted. You can find the setup details (which are very simple) on the following documentation page: https://docs.k3s.io/advanced#configuring-containerd.
Since we are running K3S on Ubuntu, we are using containerd and installing everything from an apt repo. Therefore we need to follow Nvidia’s instructions for enabling their container runtime for containerd, as found here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd. In my case that would be:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& \
sudo apt update
sudo apt install -y nvidia-container-toolkit
Once this is done you can run an Nvidia CUDA container to see that the runtime is functioning:
~ ❯ sudo ctr image pull docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04
docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:0b165c469e8b0a620ce6b22373ead52502ef06d3088ba35a6edb78582b5274f6: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:1f0416abdf40fca3a4ce4e42093584664b4ac0dddd012571453c94e5c7a35937: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:f5861cf44882f74454e1a5915647321b5b344af925503515bcd8bb0728d84551: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:aece8493d3972efa43bfd4ee3cdba659c0f787f8f59c82fb3e48c87cbb22a12e: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:4b46bd5b7766c7415446d5b866f709b45dbdb60b04fec7677343d8232ca2e427: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:f038740a8e0591897bdd6280d24e1f89690cf80c28e77c14072f84eca205c7e6: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:b2dca9da14b18f8a959e7deed652de532f88f979d8d1279511b987e294699f3c: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:998220aec9f5708b331daeb126f9d66fe39bbcc333224ca9ba6efabc9b563b7d: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 7.0 s total: 83.6 M (11.9 MiB/s)
unpacking linux/amd64 sha256:0b165c469e8b0a620ce6b22373ead52502ef06d3088ba35a6edb78582b5274f6...
done: 1.00494491s
~ ❯ sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 12.2.2-base-ubuntu22.04 nvidia-smi
Thu Oct 26 07:30:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:00:10.0 Off | N/A |
| 0% 48C P2 46W / 170W | 11651MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
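To get K3S to regenerate its containerd configuration and pick up the new runtime, restart its service. On an agent-only node like mine that service is k3s-agent; on a server node it would be k3s instead:

```bash
# k3s-agent on worker/agent nodes, k3s on server nodes
sudo systemctl restart k3s-agent
```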
And also, once you’ve restarted your K3S agent service, check that it also picked up the change:
~ ❯ sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml octopusx@k8s12
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
The Deployment
In order for any container to take advantage of the nvidia containerd runtime plugin, we need to create a special RuntimeClass object, which can then be referenced inside our Kubernetes deployment YAML.
It looks like this:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
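Save that manifest to a file (the name runtime-class.yaml below is just my choice) and apply it, then confirm the cluster knows about the new runtime class:

```bash
kubectl apply -f runtime-class.yaml
kubectl get runtimeclass nvidia
```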
Since (at least in my case) there is a single GPU attached to one specific K8S worker VM, we need to somehow force the workloads that require access to said GPU to be scheduled on that specific node. To do this we create a node label that we can then reference in the node affinity specification of a pod:
kubectl label nodes k8s12 gpu=3060
We can check that it worked by invoking the describe command on the K8S node object:
~/ ❯ kubectl describe node k8s12 ○ teleport-k3s
Name:               k8s12
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    gpu=3060
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=k8s12
                    kubernetes.io/os=linux
                    kubernetes.io/role=worker
                    node.kubernetes.io/instance-type=k3s
You will notice that I gave the label an actual value rather than a simple true/false flag. This is to make the system a little more future-proof on my side: I can match on the label merely existing, if I want to bind a deployment onto any node with a GPU, or on a specific value, if I want to assign a workload to a particular GPU model (a sketch of the latter follows below). Neat.
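For illustration, this is roughly what the match expression would look like when pinning a workload to this specific card rather than to any GPU-equipped node; it is a fragment of a pod spec, not a manifest used verbatim later in this post:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu
              operator: In
              values:
                - "3060"
```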
To make sure that our assignment mechanism works, we can spawn a test GPU benchmark container. If we did everything right, this should return a benchmark score for our GPU:
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ["nbody", "-gpu", "-benchmark"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu
                operator: Exists
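Assuming the manifest above is saved as nbody-benchmark.yaml (the file name is arbitrary), running the test and reading the result looks like this:

```bash
kubectl apply -f nbody-benchmark.yaml

# The pod should land on k8s12 thanks to the node affinity rule
kubectl get pod nbody-gpu-benchmark -o wide

# Once it completes, the benchmark score is in the logs
kubectl logs nbody-gpu-benchmark
```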
RefactAI Chart
Now that we have checked that everything necessary to deploy Refact is in place, we can deploy it. Here is example output of the helm template . command for the Helm chart that I created:
# Source: refact/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: refact
  labels:
    helm.sh/chart: refact-0.1.0
    app.kubernetes.io/name: refact
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "latest"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 8008
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: refact
    app.kubernetes.io/instance: release-name
---
# Source: refact/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: refact
  labels:
    helm.sh/chart: refact-0.1.0
    app.kubernetes.io/name: refact
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "latest"
    app.kubernetes.io/managed-by: Helm
spec:
  revisionHistoryLimit: 3
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: refact
      app.kubernetes.io/instance: release-name
  template:
    metadata:
      labels:
        app.kubernetes.io/name: refact
        app.kubernetes.io/instance: release-name
    spec:
      hostNetwork: true
      runtimeClassName: nvidia # IMPORTANT!
      dnsPolicy: ClusterFirstWithHostNet
      enableServiceLinks: true
      containers:
        - name: refact
          image: "smallcloud/refact_self_hosting:latest"
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: false
          env:
            - name: "TZ"
              value: "UTC+02:00"
          ports:
            - name: http
              containerPort: 8008
              protocol: TCP
          volumeMounts:
            - name: data
              mountPath: /perm_storage
      affinity: # This is how we tell it to only spawn on the node with our GPU
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu
                    operator: Exists
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: refact
---
# Source: refact/templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: refact
  labels:
    helm.sh/chart: refact-0.1.0
    app.kubernetes.io/name: refact
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "latest"
    app.kubernetes.io/managed-by: Helm
  annotations:
    kubernetes.io/ingress.class: # insert your own ingress class to use
    # insert any other annotations you need for your ingress controller
spec:
  tls:
    - hosts:
        - refact.example.com
      secretName: "wildcard-cert"
  rules:
    - host: refact.example.com
      http:
        paths:
          - path: "/"
            pathType: Prefix
            backend:
              service:
                name: refact
                port:
                  number: 8008
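To actually install the chart rather than just render it, a Helm invocation along these lines does the job; the release name, namespace and values file are assumptions on my part, so adjust them to your own chart layout:

```bash
# Run from the chart directory; release name and namespace are placeholders
helm upgrade --install refact . \
  --namespace refact \
  --create-namespace \
  --values values.yaml
```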
For me, an extra step here is to add refact.example.com to my local DNS to expose this ingress entry (I just add it to my Pi-hole server). If all goes well, you should be able to go to refact.example.com (assuming you have DNS set up for it) and see a screen much like this:
Here I have loaded two small models: one that provides code completion and one that provides chat functionality. Also, I have no idea why I am getting the VRAM warning message. It has been there since the beginning, no matter what model I select, and since everything is working for me I am going to assume it’s a UI bug…
VSCode Plugin
Let’s finally go harness the power of the AI in a practical manner. First, we install the refact.ai plugin:
Then we go to its config page to enter our local instance URL:
And off you go! Refact should now automatically start generating suggestions in your VSCode editor window as you type:
In my case, it will also allow you to chat with the llama7b model:
The End
All in all, I am quite happy with this setup. It was very straightforward and pretty much just worked. I may even sign up for their paid version just to support the project, well done refact.ai! I haven’t used this in practice much, so I don’t feel comfortable giving a verdict on how good the small models that fit into my 3060’s memory are. What I can say is that they execute extremely fast on the RTX 3060 12GB: code generation is very quick and the llama7b model’s chat is pretty much instant. It is much more responsive than my previous attempt at a CPU-hosted model, and downright impressive for such a relatively small investment in hardware.

Something I really wish I could do is expose the llama7b model hosted by Refact to other apps via an OpenAI-compatible API endpoint. Another would be hosting a single, larger model that handles both code completion and chat. That would make me 120% happy… For the moment, I have browsed Refact’s GitHub issue tracker as well as their Discord server (https://www.smallcloud.ai/discord), which seems to be active, and will see if I can find out more about the project’s roadmap and whether the few missing functionalities will be added over time. Hope you enjoyed this tutorial and will come back again when the next entry rolls in.
2023-10-30 11:00