This guide runs the openai/gpt-oss-20b model locally on an NVIDIA L4 GPU. The model can also run on a consumer RTX-series GPU with ~16 GB of VRAM. I divided the guide into two parts: running vLLM manually and running it in a container, both on Ubuntu 24.04 LTS.
Preparation
Installing drivers and dependencies
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb && rm -rf cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt install -y \
linux-headers-$(uname -r) \
libnvidia-compute-580 nvidia-dkms-580-open \
datacenter-gpu-manager-4-cuda-all \
datacenter-gpu-manager-exporter \
cuda-toolkit nvtop build-essential
We need a host reboot to apply the GPU driver.
Driver Validation
nvidia-smi
nvidia-smi -L
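Since the model needs roughly 16 GB of VRAM, it can help to check the reported memory before going further. The following is a minimal sketch, not part of the original setup: the helper name has_enough_vram and the 15000 MiB threshold are assumptions chosen for illustration (cards sold as 16 GB usually report slightly less than 16384 MiB), not an official requirement.

```shell
# Hypothetical threshold in MiB, derived from the "~16 GB" figure above.
MIN_VRAM_MIB=15000

# Hypothetical helper: succeeds when the given memory size (MiB) is large enough.
has_enough_vram() {
  [ "$1" -ge "$MIN_VRAM_MIB" ]
}

# Query the driver only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits |
  while IFS=, read -r gpu_name gpu_mib; do
    gpu_mib="$(printf '%s' "$gpu_mib" | tr -d ' ')"  # drop the space after the comma
    if has_enough_vram "$gpu_mib"; then
      echo "OK: $gpu_name ($gpu_mib MiB)"
    else
      echo "Too small: $gpu_name ($gpu_mib MiB)"
    fi
  done
fi
```

On a machine without the driver the block prints nothing, which is itself a hint that the reboot or the driver installation is still pending.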

Running vLLM Manually
Installing Git LFS and cloning the openai/gpt-oss-20b repository
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install -y git-lfs && git lfs install
git clone https://huggingface.co/openai/gpt-oss-20b /models/gpt-oss-20b
Installing uv for the Python virtual environment
curl -LsSf https://astral.sh/uv/install.sh | sh
cd /models && uv venv
source .venv/bin/activate
uv pip install vllm==0.10.2 --torch-backend=auto
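Start serving the model with vLLM using the following command. The API_TOKEN variable must already be set in your shell; clients will send it as the bearer token.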
export VLLM_LOGGING_LEVEL=ERROR
python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 8000 \
--api-key $API_TOKEN \
--served-model-name openai/gpt-oss-20b \
--model /models/gpt-oss-20b \
--gpu-memory-utilization 0.90 \
--chat-template-content-format openai \
--tool-call-parser openai \
--enable-auto-tool-choice \
--trust-remote-code \
--async-scheduling \
--disable-log-requests
You can monitor GPU utilization in real time using the nvtop command.
When the model has loaded successfully, the log shows the message “application startup complete” and the model is ready to use.
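Instead of watching the log, startup can also be detected by polling the server. The sketch below is an illustration under assumptions: wait_until is a hypothetical helper, the 60 attempts and 5-second delay are arbitrary, and the usage line assumes the /health endpoint of vLLM's OpenAI-compatible server on the host and port used above.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# $1: maximum attempts, $2: seconds to sleep between attempts, rest: the command.
wait_until() {
  wu_attempts="$1"
  wu_delay="$2"
  shift 2
  wu_i=0
  while [ "$wu_i" -lt "$wu_attempts" ]; do
    "$@" && return 0
    wu_i=$((wu_i + 1))
    # Sleep only if another attempt follows.
    [ "$wu_i" -lt "$wu_attempts" ] && sleep "$wu_delay"
  done
  return 1
}

# Usage (assumed endpoint, up to about five minutes of waiting):
# wait_until 60 5 curl -fsS -o /dev/null http://127.0.0.1:8000/health \
#   && echo "vLLM is ready"
```

Keeping the probe as a plain command argument means the same helper works for the container setup later in this guide.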

Running vLLM with Container
Installing the NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo \
gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.0-1
sudo apt-get update && sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
Installing Docker and setting up the NVIDIA runtime
curl -fsSL https://get.docker.com | bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker.service
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
If the GPU has been detected at the container level, proceed to the next step.
Create the docker-compose.yaml configuration file
services:
  vllm:
    image: vllm/vllm-openai:v0.12.0
    container_name: vllm-gpt-oss-20b
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      API_TOKEN: ${API_TOKEN}
      VLLM_LOGGING_LEVEL: ERROR
    volumes:
      - /models:/models
    command: >
      --host 0.0.0.0
      --port 8000
      --api-key ${API_TOKEN}
      --served-model-name openai/gpt-oss-20b
      --model /models/gpt-oss-20b
      --gpu-memory-utilization 0.90
      --chat-template-content-format openai
      --tool-call-parser openai
      --enable-auto-tool-choice
      --trust-remote-code
      --async-scheduling
      --disable-log-requests
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
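Setting ipc: host lets the container use the host's shared memory. PyTorch, which vLLM builds on, uses shared memory to pass data between processes, and a container's small default shared-memory segment can become a bottleneck when running large models.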
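The compose file reads ${API_TOKEN} from the environment, and Docker Compose also picks it up from a .env file placed next to docker-compose.yaml. The sketch below shows one way to create such a file with a random token. The helper names generate_token and write_env_file are hypothetical, and the 32-byte token length is an arbitrary choice, not something vLLM requires.

```shell
# Hypothetical helper: print a random 64-character hex token (32 random bytes).
generate_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: create the given .env file unless it already exists,
# so an existing token is never overwritten.
write_env_file() {
  env_path="$1"
  if [ -f "$env_path" ]; then
    echo "$env_path already exists, leaving it untouched" >&2
    return 0
  fi
  # umask 077 keeps the file readable by the current user only.
  ( umask 077; printf 'API_TOKEN=%s\n' "$(generate_token)" > "$env_path" )
}

# Usage, from the directory that holds docker-compose.yaml:
# write_env_file .env
# set -a; . ./.env; set +a   # also export API_TOKEN for the curl examples
```

Loading the same file into the shell keeps the token used by the server and the token sent by clients in sync.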
Run it with the command docker compose up -d, then monitor the log with the command docker compose logs -f.

Using The Model
Check the model with the GET API endpoint /v1/models as in the following example
curl -X GET http://127.0.0.1:8000/v1/models \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_TOKEN" | jq -r
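When the request above fails, the HTTP status code usually narrows down the cause. The following sketch maps a few codes to hints. The helper explain_status is hypothetical, and the interpretation of 401 assumes the server rejects a wrong or missing token with that status; curl itself reports 000 when it gets no HTTP response at all.

```shell
# Hypothetical helper: turn an HTTP status code from /v1/models into a hint.
explain_status() {
  case "$1" in
    200) echo "ok: server is up and the token was accepted" ;;
    401) echo "unauthorized: check API_TOKEN" ;;
    000) echo "no response: server not running or still loading" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Usage (assumes the endpoint and token from this guide):
# code="$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $API_TOKEN" http://127.0.0.1:8000/v1/models)"
# explain_status "$code"
```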

Then perform a prompt test with POST to the API endpoint /v1/chat/completions as shown in the following example
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_TOKEN" \
-d '{
"model": "openai/gpt-oss-20b",
"messages": [
{
"role": "user",
"content": "what is AI ?"
}
],
"temperature": 0.3,
"max_tokens": 256,
"stream": false,
"reasoning_effort": "high"
}' | jq -r

Benchmark
Thanks to Yoosu-L for the easy-to-use LLM benchmark tool.
curl -LO https://github.com/Yoosu-L/llmapibenchmark/releases/download/v1.0.7/llmapibenchmark_linux_amd64.tar.gz
tar xzvf llmapibenchmark_linux_amd64.tar.gz
We’ll test concurrency levels up to 1,024 by running the following command:
./llmapibenchmark_linux_amd64 \
--base-url http://127.0.0.1:8000/v1 \
--api-key $API_TOKEN \
--model openai/gpt-oss-20b \
--concurrency 1,2,4,8,16,32,64,128,256,512,1024 \
--max-tokens 512 \
--num-words 512 \
--prompt "what is AI ?" \
--format yaml | tee benchmark-results-gpt-oss-20b-nvidia-l4.yaml
Results:

I tested openai/gpt-oss-20b under different levels of concurrency to understand how it behaves as load increases. As concurrency goes up, overall token generation throughput clearly increases, showing strong scaling behavior. Prompt throughput also improves with higher concurrency, especially in the mid range.
While the fastest responses (min TTFT) stay consistently low, the slowest responses (max TTFT) increase noticeably as concurrency grows. This means most requests are still served quickly, but a small portion start to experience longer waits when the system is heavily loaded. Importantly, all requests succeed, even at the highest concurrency levels.
Benchmark data can be found here
References
- https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html
- https://cookbook.openai.com/articles/gpt-oss/run-vllm
- https://docs.astral.sh/uv/getting-started/installation
- https://docs.redhat.com/en/documentation/red_hat_ai_inference_server
- https://github.com/Yoosu-L/llmapibenchmark