Run a vLLM Server on HF Jobs in One Command
Hugging Face Jobs now allows users to launch private, OpenAI-compatible LLM endpoints with a single command, offering a quick and cost-effective solution for testing and evaluation. This pay-per-second service eliminates the need for managing servers and simplifies the deployment of large language models. Users can easily query the deployed models from various environments, enabling rapid experimentation and development. Learn how to deploy and interact with vLLM servers on Hugging Face infrastructure. The service is ideal for quick tests and batch generation, providing a secure and scalable option for LLM deployment.
Hugging Face Jobs offers a streamlined solution for deploying private, OpenAI-compatible Large Language Model (LLM) endpoints. With a single command, users can launch a vLLM server on Hugging Face infrastructure, eliminating the complexities of server provisioning and Kubernetes. This pay-per-second service is ideal for rapid testing, evaluations, and batch generation, providing a cost-effective alternative to managed production services. Users only need a payment method, huggingface_hub version 1.20.0 or higher, and to be locally logged in via `hf auth login`.
The deployment process involves using `hf jobs run`, which acts as a 'docker run' for Hugging Face infrastructure. Users can specify the desired GPU flavor, expose vLLM’s port, and set a timeout for automatic shutdown. For instance, a command like `hf jobs run --flavor a10g-large --expose 8000 --timeout 2h vllm/vllm-openai:latest vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000` will initiate the server. The command then provides a URL to access the deployed endpoint. Keep track of the job ID, as it is crucial for querying and managing the server.
Once the server is live and logs confirm 'Application startup complete,' users can query the vLLM server using the OpenAI API. Requests require an HF token as a bearer token for authentication, as the endpoint is gated and not public. This ensures secure access, scoped to the user or organization. Examples include using `curl` commands or integrating with the OpenAI Python client, by pointing the `base_url` to the exposed HF Jobs URL and using `get_token()` for `api_key`. For larger models, users can scale their deployment by selecting beefier `--flavor` options and configuring `--tensor-parallel-size` to shard the model across multiple GPUs, ensuring optimal performance and resource utilization. Additionally, parameters like `--max-model-len` and `--max-num-seqs` can be adjusted to prevent out-of-memory errors with demanding models. When done, `hf jobs cancel <job_id>` promptly stops the server to avoid unnecessary billing.
Related articles
Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel
NVIDIA NeMo has introduced AutoModel for efficient fine-tuning of transformer models, significantly reducing the time and resources required. This new capability streamlines the process for various natural language processing tasks.
Tools & PlatformsThe emergence of the web data infrastructure layer for AI
AI needs real-time web data to overcome limitations of static training. A new web data infrastructure layer is emerging to provide fresh, relevant information at scale, enabling AI models to navigate the dynamic digital landscape and improve performance. This infrastructure can help reduce AI hallucinations and ensure models deliver current, trustworthy outputs.
Postman Passport: Secure API access for the Agentic Era
Postman is introducing "Postman Passport" to secure API access for humans, machines, and AI agents, addressing the exploding risk of API key leakage in uncontrolled environments. It inverts the secret sharing model into an access control model and shifts secret resolution to a proxy layer within a VPC, preventing secrets from ever reaching consumers directly.
