OpenAI’s New Open Models Accelerated Locally on NVIDIA GeForce RTX and RTX PRO GPUs

In a significant advancement for the AI landscape, NVIDIA has collaborated with OpenAI to optimize the new open-source gpt-oss models for NVIDIA GPUs, delivering fast, intelligent inference from the cloud to the PC. The release of the gpt-oss-20b and gpt-oss-120b models puts this technology in the hands of millions of users, empowering AI developers and enthusiasts to run the optimized models on NVIDIA RTX AI PCs and workstations. Through popular frameworks such as Ollama, llama.cpp, and Microsoft AI Foundry Local, users can reach up to 256 tokens per second on the NVIDIA GeForce RTX 5090 GPU.

“OpenAI showed the world what could be built on NVIDIA AI — and now they’re advancing innovation in open-source software,” remarked Jensen Huang, founder and CEO of NVIDIA. This statement underscores the potential of the gpt-oss models, which allow developers to build upon a state-of-the-art open-source foundation, reinforcing the United States’ leadership in AI technology, all supported by the world’s largest AI compute infrastructure.

Open for All

The gpt-oss-20b and gpt-oss-120b models are flexible, open-weight reasoning models with adjustable reasoning-effort levels, built on a popular mixture-of-experts (MoE) architecture. Trained on NVIDIA H100 GPUs, they excel at tasks such as instruction following and tool use. Developers eager to explore these models can find step-by-step instructions on the NVIDIA Technical Blog.

Notably, these models support context lengths of up to 131,072 tokens, among the longest available for local inference. This enables sophisticated reasoning across applications such as web search, coding assistance, document comprehension, and in-depth research. They are also the first MXFP4 models supported on NVIDIA RTX, a 4-bit precision format that preserves quality while requiring fewer resources than higher-precision alternatives.
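To make the MXFP4 format concrete: it stores weights as 4-bit floating-point (FP4 E2M1) codes, with each block of 32 elements sharing a single power-of-two scale. The sketch below is illustrative only; the block size and code table follow the public OCP Microscaling (MX) specification, not any NVIDIA kernel implementation.

```python
# Illustrative decoder for one MXFP4 block (per the OCP Microscaling spec):
# 32 FP4 (E2M1) elements share one power-of-two scale.

# The 16 values representable by FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                   -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def decode_mxfp4_block(codes, scale_exp):
    """Decode a block of 4-bit codes sharing one scale exponent.

    codes: integers in [0, 15], up to 32 per block
    scale_exp: unbiased exponent; the shared block scale is 2**scale_exp
    """
    scale = 2.0 ** scale_exp
    return [FP4_E2M1_VALUES[c] * scale for c in codes]

# Each weight costs 4 bits plus a small share of the per-block scale:
block = decode_mxfp4_block([0b0001, 0b0011, 0b1111], scale_exp=-2)
print(block)  # [0.125, 0.375, -1.5]
```

Because only the 4-bit codes and one scale per block are stored, memory use drops to roughly a quarter of FP16, which is what makes a 20B-parameter model practical on a 16GB consumer GPU.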

Run the OpenAI Models on NVIDIA RTX With Ollama

For those looking to experiment with these models on RTX AI PCs with at least 24GB of VRAM, the new Ollama app provides a user-friendly way to get started. Popular among AI developers and enthusiasts for its ease of use, Ollama offers out-of-the-box support for OpenAI’s open-weight models in its updated user interface. Fully optimized for RTX, Ollama lets users experience the power of personal AI without complex configuration.

Once installed, users can engage in quick conversations with the models by simply selecting one from a dropdown menu and sending a message. Ollama’s latest features also include support for PDF and text files within chats, multimodal capabilities for applicable models, and customizable context lengths for managing large documents or conversations.

Testing OpenAI’s open models in Ollama is easy.

Developers can further harness Ollama via a command line interface or the app’s software development kit (SDK) to enhance their applications and workflows.
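Under the hood, the Ollama CLI and SDKs talk to a local HTTP server, so an application can reach the model with nothing but the standard library. A minimal sketch, assuming a running Ollama server on its default port and the `gpt-oss:20b` model tag from Ollama’s public library; the `/api/generate` route and request fields follow Ollama’s documented REST API:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model, prompt):
    """Build the JSON request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def generate(model, prompt):
    """Send a prompt to the locally running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires `ollama pull gpt-oss:20b` and a running Ollama server.
    print(generate("gpt-oss:20b", "Explain mixture-of-experts in one sentence."))
```

Setting `"stream": False` returns one complete JSON object; omitting it yields a stream of newline-delimited chunks, which suits interactive UIs.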

Other Ways to Use the New OpenAI Models on RTX

Beyond Ollama, enthusiasts and developers can explore the gpt-oss models on RTX AI PCs through various applications and frameworks, all powered by RTX and requiring GPUs with at least 16GB of VRAM. NVIDIA remains committed to collaborating with the open-source community, optimizing performance on RTX GPUs through contributions to projects like llama.cpp and the GGML tensor library. Recent enhancements include the implementation of CUDA Graphs to minimize overhead and the introduction of algorithms that reduce CPU demands. Interested developers can visit the llama.cpp GitHub repository to get started.
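For developers who want to try llama.cpp directly, a typical CUDA build-and-run sequence looks like the following. This is a sketch based on llama.cpp’s public build documentation; the GGUF filename is illustrative, and you should substitute the converted checkpoint you actually downloaded.

```shell
# Build llama.cpp with CUDA acceleration (flags per the llama.cpp build docs).
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release

# Run the 20B model; -ngl 99 offloads all layers to the RTX GPU.
# The GGUF filename below is a placeholder for your downloaded checkpoint.
./llama.cpp/build/bin/llama-cli -m gpt-oss-20b.gguf -ngl 99 -p "Hello"
```

The `-ngl` (number of GPU layers) flag controls how much of the model runs on the GPU; setting it high enough to cover every layer is what unlocks full RTX acceleration.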

Overall performance of the gpt-oss-20b model on various RTX AI PCs.

Windows developers can also access OpenAI’s new models through Microsoft AI Foundry Local, currently in public preview. This on-device AI inferencing solution integrates into workflows via the command line, an SDK, or application programming interfaces. Built on ONNX Runtime with CUDA optimizations, Foundry Local will add support for NVIDIA TensorRT for RTX in the near future. Getting started is straightforward: install Foundry Local and run the command “foundry model run gpt-oss-20b” in a terminal.
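The two steps above fit in a terminal session like the following; the winget package identifier is taken from Microsoft’s Foundry Local documentation and should be verified against the current docs.

```shell
# Install Foundry Local on Windows (package ID per Microsoft's docs),
# then download the model on first use and chat with it in the terminal.
winget install Microsoft.FoundryLocal
foundry model run gpt-oss-20b
```

The first `foundry model run` invocation downloads the model; subsequent runs start an interactive chat immediately.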

The launch of these open-source models heralds a new era of AI innovation, inviting developers and enthusiasts alike to incorporate advanced reasoning capabilities into their AI-accelerated Windows applications.

Each week, the RTX AI Garage blog series showcases community-driven AI innovations and content for those interested in exploring NVIDIA NIM microservices and AI Blueprints, as well as building AI agents, creative workflows, productivity apps, and more on AI PCs and workstations.

Stay connected with NVIDIA AI PC on Facebook, Instagram, TikTok, and X, and keep informed by subscribing to the RTX AI PC newsletter. Join NVIDIA’s Discord server to engage with community developers and AI enthusiasts in discussions about the possibilities with RTX AI.

Follow NVIDIA Workstation on LinkedIn and X.

See notice regarding software product information.
