Run a vLLM Server on HF Jobs in One Command

Hugging Face Jobs now allows users to launch private, OpenAI-compatible LLM endpoints with a single command, offering a quick and cost-effective solution for testing and evaluation. This pay-per-second service eliminates the need for managing servers and simplifies the deployment of large language models. Users can easily query the deployed models from various environments, enabling rapid experimentation and development. Learn how to deploy and interact with vLLM servers on Hugging Face infrastructure. The service is ideal for quick tests and batch generation, providing a secure and scalable option for LLM deployment.

Hugging Face Jobs offers a streamlined solution for deploying private, OpenAI-compatible Large Language Model (LLM) endpoints. With a single command, users can launch a vLLM server on Hugging Face infrastructure, eliminating the complexities of server provisioning and Kubernetes. This pay-per-second service is ideal for rapid testing, evaluations, and batch generation, providing a cost-effective alternative to managed production services. Users only need a payment method, huggingface_hub version 1.20.0 or higher, and to be locally logged in via `hf auth login`.

The deployment process involves using `hf jobs run`, which acts as a 'docker run' for Hugging Face infrastructure. Users can specify the desired GPU flavor, expose vLLM’s port, and set a timeout for automatic shutdown. For instance, a command like `hf jobs run --flavor a10g-large --expose 8000 --timeout 2h vllm/vllm-openai:latest vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000` will initiate the server. The command then provides a URL to access the deployed endpoint. Keep track of the job ID, as it is crucial for querying and managing the server.

Once the server is live and logs confirm 'Application startup complete,' users can query the vLLM server using the OpenAI API. Requests require an HF token as a bearer token for authentication, as the endpoint is gated and not public. This ensures secure access, scoped to the user or organization. Examples include using `curl` commands or integrating with the OpenAI Python client, by pointing the `base_url` to the exposed HF Jobs URL and using `get_token()` for `api_key`. For larger models, users can scale their deployment by selecting beefier `--flavor` options and configuring `--tensor-parallel-size` to shard the model across multiple GPUs, ensuring optimal performance and resource utilization. Additionally, parameters like `--max-model-len` and `--max-num-seqs` can be adjusted to prevent out-of-memory errors with demanding models. When done, `hf jobs cancel <job_id>` promptly stops the server to avoid unnecessary billing.

Run a vLLM Server on HF Jobs in One Command

Related articles

Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel

The emergence of the web data infrastructure layer for AI

Postman Passport: Secure API access for the Agentic Era