This post covers running the openai/gpt-oss-20b model locally on an NVIDIA L4 GPU. The model can also run on a consumer RTX-series GPU with ~16 GB of VRAM. I divided it into two parts: running vLLM manually and running it in a container, both on the Ubuntu 24.04 LTS operating system.

Preparation

Installing drivers and dependencies

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb && rm -rf cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt install -y \
     linux-headers-$(uname -r) \
     libnvidia-compute-580 nvidia-dkms-580-open \
     datacenter-gpu-manager-4-cuda-all \
     datacenter-gpu-manager-exporter \
     cuda-toolkit nvtop build-essential

Reboot the host so that the newly installed GPU driver is loaded.

Driver Validation

nvidia-smi
nvidia-smi -L
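
The same check can be scripted. Below is a minimal sketch that reads each GPU's name and total memory through nvidia-smi's `--query-gpu` option and compares it to the ~16 GB mentioned in the introduction. The function names and the `MIN_VRAM_MIB` threshold are my own assumptions; the threshold is set slightly below 16 GiB because cards usually report a bit less than their nominal capacity.

```python
import subprocess

# Assumption: hypothetical threshold, a little under 16 GiB, in MiB.
MIN_VRAM_MIB = 15_000


def parse_gpu_memory(csv_text):
    """Parse lines of the form '<name>, <total MiB>' into (name, MiB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the name would not break parsing.
        name, total_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(total_mib)))
    return gpus


def query_gpus():
    """Ask nvidia-smi for every GPU's name and total memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_memory(out)


def has_enough_vram(gpus, min_mib=MIN_VRAM_MIB):
    """True if at least one GPU reports min_mib or more of total memory."""
    return any(total >= min_mib for _, total in gpus)
```

On the GPU host, `has_enough_vram(query_gpus())` gives a quick yes/no answer before any model weights are downloaded.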


Running vLLM Manually

Installing Git LFS and cloning the openai/gpt-oss-20b repository

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install -y git-lfs && git lfs install
git clone https://huggingface.co/openai/gpt-oss-20b /models/gpt-oss-20b
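
If Git LFS was not active during the clone, the weight files are left as small text pointers instead of the real tensors, and the model will not load. The sketch below looks for such pointers by checking whether a `.safetensors` file begins with the Git LFS pointer header. The function names are hypothetical, and the path in the usage note is the clone target used above.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file starts with the Git LFS pointer header."""
    with Path(path).open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unresolved_weights(model_dir):
    """Return the .safetensors files under model_dir that are still LFS pointers."""
    return sorted(
        str(p) for p in Path(model_dir).rglob("*.safetensors")
        if is_lfs_pointer(p)
    )
```

An empty list from `unresolved_weights("/models/gpt-oss-20b")` suggests the weights were downloaded; otherwise running `git lfs pull` inside the repository should fetch them.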

Installing uv for the Python virtual environment

curl -LsSf https://astral.sh/uv/install.sh | sh
cd /models && uv venv
source .venv/bin/activate
uv pip install vllm==0.10.2 --torch-backend=auto

Start serving the model with vLLM using the following command (the API_TOKEN environment variable must already be set):

export VLLM_LOGGING_LEVEL=ERROR
python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 8000 \
--api-key $API_TOKEN \
--served-model-name openai/gpt-oss-20b \
--model /models/gpt-oss-20b \
--gpu-memory-utilization 0.90 \
--chat-template-content-format openai \
--tool-call-parser openai \
--enable-auto-tool-choice \
--trust-remote-code \
--async-scheduling \
--disable-log-requests

You can monitor GPU utilization in real time using the nvtop command.

When the model has loaded successfully, the log shows the message “Application startup complete.” and the model is ready to use.
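
Instead of watching the log, a script can wait for the server to come up. This is a minimal sketch that polls vLLM's `/health` endpoint until it answers with HTTP 200. As far as I know that endpoint sits outside `/v1` and does not need the API key, but treat that as an assumption. The function names and the default retry values are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(url, attempts=60, delay=5.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe succeeds; return False if all attempts fail."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For the setup in this post the call would be `wait_until_ready("http://127.0.0.1:8000/health")`. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.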

Running vLLM with Container

Installing NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo \
gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.0-1
sudo apt update && sudo apt-get install -y \
     nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
     nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
     libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
     libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

Installing Docker and setting up the NVIDIA runtime

curl -fsSL https://get.docker.com | bash

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker.service
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi

If the GPU has been detected at the container level, proceed to the next step.

Create the docker-compose.yaml configuration file

services:
  vllm:
    image: vllm/vllm-openai:v0.12.0
    container_name: vllm-gpt-oss-20b
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      API_TOKEN: ${API_TOKEN}
      VLLM_LOGGING_LEVEL: ERROR
    volumes:
      - /models:/models
    command: >
      --host 0.0.0.0
      --port 8000
      --api-key ${API_TOKEN}
      --served-model-name openai/gpt-oss-20b
      --model /models/gpt-oss-20b
      --gpu-memory-utilization 0.90
      --chat-template-content-format openai
      --tool-call-parser openai
      --enable-auto-tool-choice
      --trust-remote-code
      --async-scheduling
      --disable-log-requests
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host

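The ipc: host setting lets the container use the host's shared memory. vLLM relies on PyTorch, which uses shared memory to exchange data between processes, so the small default shared-memory size of a Docker container (64 MB) can otherwise get in the way when running large models.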

Run it with the command docker compose up -d, then monitor the log with the command docker compose logs -f.


Using The Model

Check that the model is being served by calling the GET /v1/models endpoint, as in the following example:

curl -X GET http://127.0.0.1:8000/v1/models \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_TOKEN" | jq -r
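
The same check can be done from Python with only the standard library. This is a minimal sketch, assuming the server answers in the usual OpenAI list format (a `data` array of objects with an `id` field). The function names are hypothetical.

```python
import json
import os
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"


def model_ids(payload):
    """Extract the model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url=BASE_URL, token=""):
    """GET /v1/models with a bearer token and return the served model ids."""
    # Fall back to the same API_TOKEN environment variable the curl example uses.
    token = token or os.environ["API_TOKEN"]
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return model_ids(json.load(resp))
```

With the server from this post running, `list_models()` should include the name passed to `--served-model-name`.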


Then send a test prompt with a POST request to the /v1/chat/completions endpoint, as shown in the following example:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_TOKEN" \
-d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      {
        "role": "user",
        "content": "what is AI ?"
      }
    ],
    "temperature": 0.3,
    "max_tokens": 256,
    "stream": false,
    "reasoning_effort": "high"
}' | jq -r
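
For use from an application, the request above translates to a few lines of standard-library Python. This is a sketch with hypothetical function names that mirrors the curl payload. I treat a null `content` as an empty string, on the assumption that the token budget can run out before any answer text is produced when reasoning effort is high.

```python
import json
import urllib.request


def build_chat_request(base_url, token, prompt, model="openai/gpt-oss-20b"):
    """Build the POST request for /v1/chat/completions (same body as the curl call)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 256,
        "stream": False,
        "reasoning_effort": "high",
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract_reply(response):
    """Return the assistant text of the first choice ('' if content is null)."""
    return response["choices"][0]["message"]["content"] or ""


def ask(base_url, token, prompt):
    """Send the prompt and return the model's answer as a string."""
    req = build_chat_request(base_url, token, prompt)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

A call such as `ask("http://127.0.0.1:8000/v1", token, "what is AI ?")` then returns only the answer text rather than the full JSON document.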


Benchmark

Thanks to Yoosu-L for the easy-to-use LLM benchmark tool.

curl -LO https://github.com/Yoosu-L/llmapibenchmark/releases/download/v1.0.7/llmapibenchmark_linux_amd64.tar.gz
tar xzvf llmapibenchmark_linux_amd64.tar.gz

We’ll test with concurrency levels up to 1,024 by running the following command:

./llmapibenchmark_linux_amd64 \
--base-url http://127.0.0.1:8000/v1 \
--api-key $API_TOKEN \
--model openai/gpt-oss-20b \
--concurrency 1,2,4,8,16,32,64,128,256,512,1024 \
--max-tokens 512 \
--num-words 512 \
--prompt "what is AI ?" \
--format yaml | tee benchmark-results-gpt-oss-20b-nvidia-l4.yaml


I tested openai/gpt-oss-20b under different levels of concurrency to understand how it behaves as load increases. As concurrency goes up, the overall token generation throughput clearly increases, showing strong scaling behavior. Prompt throughput also improves with higher concurrency, especially in the mid range.

While the fastest responses (min TTFT) stay consistently low, the slowest responses (max TTFT) increase noticeably as concurrency grows. This means most requests are still served quickly, but a small portion start to experience longer waits when the system is heavily loaded. Importantly, all requests succeed, even at the highest concurrency levels.
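
To put a number on "strong scaling", the aggregate throughput at each concurrency level can be compared with the lowest level. The helper below is a sketch of that arithmetic and is not part of the benchmark tool. It expects (concurrency, tokens per second) pairs copied by hand from the results file, and the function and field names are my own.

```python
def scaling_table(rows):
    """Summarize scaling from (concurrency, aggregate tokens/s) pairs.

    speedup         = throughput relative to the lowest concurrency level
    efficiency      = speedup divided by the increase in concurrency
                      (1.0 would be perfectly linear scaling)
    per_request_tps = aggregate throughput shared among concurrent requests
    """
    rows = sorted(rows)
    if not rows:
        return []
    base_conc, base_tps = rows[0]
    table = []
    for conc, tps in rows:
        speedup = tps / base_tps
        table.append({
            "concurrency": conc,
            "speedup": speedup,
            "efficiency": speedup / (conc / base_conc),
            "per_request_tps": tps / conc,
        })
    return table
```

An efficiency that falls while speedup keeps rising matches the pattern described above: total throughput grows, but each individual request gets a smaller share and some requests wait longer.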

Benchmark data can be found here
