Available today: DeepSeek R1 7B & 14B distilled models for Copilot+ PCs via Azure AI Foundry – further expanding AI on the edge

At Microsoft, we are witnessing the dawn of a new era in artificial intelligence, one that seamlessly integrates capabilities from the cloud to the edge. Our ambition is clear: to transform Windows into the premier platform for AI innovation, where intelligence is not confined to the cloud but is intricately embedded within the system, silicon, and hardware at the edge.

Building on our recent announcement regarding the introduction of NPU-optimized versions of the DeepSeek-R1 1.5B distilled model for Copilot+ PCs, we are excited to announce the rollout of the DeepSeek R1 7B and 14B distilled models via Azure AI Foundry. This significant advancement underscores our dedication to providing state-of-the-art AI capabilities that are not only fast and efficient but also tailored for practical applications, empowering developers, businesses, and creators to explore new horizons.

The availability of these models begins with Copilot+ PCs powered by Qualcomm Snapdragon X, followed by Intel Core Ultra 200V and AMD Ryzen processors. The ability to execute 7B and 14B parameter reasoning models on Neural Processing Units (NPUs) marks a pivotal step in making artificial intelligence more accessible and democratized. Researchers, developers, and enthusiasts can now harness the power of large-scale reasoning models directly on their Copilot+ PCs, whose NPUs are capable of more than 40 trillion operations per second (TOPS).

NPUs are purpose-built to run AI models locally on-device with exceptional efficiency

NPUs integrated into Copilot+ PCs are specifically designed to run AI models with remarkable efficiency, striking a balance between speed and power consumption. They facilitate sustained AI computing with minimal impact on battery life, thermal performance, and resource usage. This optimization allows CPUs and GPUs to focus on other tasks, enabling reasoning models to operate longer and deliver superior results, all while maintaining smooth PC performance.

The significance of efficient inferencing has grown, particularly in light of a new scaling law for language models, which suggests that extended chain-of-thought reasoning during inference can enhance response quality across various tasks. By leveraging additional computational power rather than merely increasing parameters or training data, we can achieve better outcomes. The DeepSeek distilled models exemplify how even smaller pretrained models can excel in reasoning capabilities, especially when paired with the NPUs on Copilot+ PCs, unlocking exciting opportunities for innovation.
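
To make the inference-time scaling idea concrete, here is a minimal, self-contained Python sketch of one such technique, self-consistency sampling: draw several independent reasoning chains and majority-vote the final answers, trading extra inference compute for higher answer quality. The `sample_answer` stub is purely hypothetical and stands in for a real model call.

```python
import random
from collections import Counter

def sample_answer(question: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for one sampled chain-of-thought run.
    A real implementation would call a language model and parse the
    final answer out of its reasoning trace."""
    # Toy behavior: the "model" answers correctly 60% of the time.
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

def self_consistency(question: str, num_samples: int = 9) -> str:
    """Spend more inference compute (num_samples model calls) and
    majority-vote the final answers; accuracy typically rises with
    the sampling budget."""
    answers = [sample_answer(question) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```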

Reasoning capabilities emerge only in models above a certain scale, and effective multi-step reasoning requires generating a large number of tokens. While NPU hardware helps reduce inference cost, it is equally crucial to keep the memory footprint of these models manageable on consumer PCs, many of which ship with just 16GB of RAM.
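
As a rough back-of-the-envelope illustration (our arithmetic, not official figures), weight storage alone shows why low-bit quantization matters at these scales:

```python
# Rough weight-memory estimate: parameters * bits_per_weight / 8 bytes.
# Ignores activations, KV cache, and quantization metadata (scales/zero points).
GIB = 1024 ** 3

for name, params in [("1.5B", 1.5e9), ("7B", 7e9), ("14B", 14e9)]:
    fp16 = params * 16 / 8 / GIB
    int4 = params * 4 / 8 / GIB
    print(f"{name}: fp16 ~{fp16:.1f} GiB, int4 ~{int4:.1f} GiB")

# 14B: fp16 ~26.1 GiB vs. int4 ~6.5 GiB -- only the int4 variant
# fits comfortably alongside the OS and apps on a 16GB machine.
```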

Pushing the boundaries of what’s possible on Windows

Our research investments have propelled us to extend the limits of what can be achieved on Windows, both at the system and model levels, leading to innovations such as Phi Silica. Through our work on Phi Silica, we have developed a scalable platform for low-bit inference on NPUs, enabling powerful performance with minimal memory and bandwidth requirements. Coupled with the data privacy afforded by local computing, this advancement empowers application developers to explore advanced scenarios like Retrieval Augmented Generation (RAG) and model fine-tuning.
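
As a sketch of what a local RAG loop can look like (our toy illustration, not a Windows API), the following retrieves the best-matching snippet with a bag-of-words similarity before assembling a prompt for the local model; a real application would substitute an on-device embedding model:

```python
import re
import numpy as np

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

docs = [
    "NPUs run sustained AI workloads at low power.",
    "Copilot+ PCs ship with more than 40 TOPS of NPU compute.",
    "The AI Toolkit ships DeepSeek models in ONNX QDQ format.",
]

def bow_vector(text: str, vocab: dict[str, int]) -> np.ndarray:
    """Toy bag-of-words embedding; a real app would use an
    on-device embedding model here instead."""
    v = np.zeros(len(vocab))
    for tok in tokenize(text):
        if tok in vocab:
            v[vocab[tok]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

vocab = {tok: i for i, tok in enumerate({t for d in docs for t in tokenize(d)})}
doc_vecs = np.stack([bow_vector(d, vocab) for d in docs])

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = doc_vecs @ bow_vector(query, vocab)  # cosine similarity
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "How many TOPS does the NPU in a Copilot+ PC deliver?"
context = "\n".join(retrieve(question))
prompt = f"Use the context to answer.\nContext: {context}\nQuestion: {question}"
print(prompt)  # this prompt would then be sent to the local model
```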

We have implemented techniques such as QuaRot and sliding-window processing for fast time to first token, alongside numerous other optimizations, to enable the release of DeepSeek 1.5B. Using Aqua, our internal automatic quantization tool, we quantized all DeepSeek model variants to int4 weights with QuaRot, preserving most of the accuracy. This toolchain enabled us to efficiently fold all of these optimizations into an ONNX QDQ model with low-precision weights.
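
Aqua itself is internal, but producing a QDQ-format ONNX model is possible with onnxruntime's public quantization API. The sketch below uses int8 (the broadly supported public path) rather than the int4 scheme described above, and the model paths and input name/shape are assumptions:

```python
# Sketch only: produces a model with explicit Quantize/DeQuantize (QDQ)
# node pairs using onnxruntime's public static quantization API.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, quantize_static,
)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few random batches for calibration; real calibration
    should use representative inputs. Input name and shape here are
    assumptions, not taken from the actual DeepSeek model."""
    def __init__(self, num_batches: int = 8):
        self._batches = iter(
            {"input_ids": np.random.randint(0, 32000, (1, 128), dtype=np.int64)}
            for _ in range(num_batches)
        )

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    "model_fp32.onnx",             # hypothetical input path
    "model_int8_qdq.onnx",         # hypothetical output path
    RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,  # explicit Q/DQ ops around quantized tensors
    per_channel=True,              # one scale per output channel, as in the post
)
```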

Similar to the 1.5B model, the 7B and 14B variants utilize 4-bit block-wise quantization for embeddings and the language model head, executing memory-intensive operations on the CPU. The compute-heavy transformer block, responsible for context processing and token iteration, employs int4 per-channel quantization for weights alongside int16 activations. We are already observing approximately 8 tokens per second on the 14B model (with the 1.5B model achieving close to 40 tokens per second), and further optimizations are on the horizon as we adopt more advanced techniques. With these enhancements, these agile language models are capable of deeper and more extended reasoning.
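
For readers curious what 4-bit block-wise weight quantization looks like mechanically, here is a small numpy sketch (our illustration, not the actual NPU kernel). Per-channel quantization is the same idea with one scale per output channel, i.e. the block spans a whole row:

```python
import numpy as np

def quantize_int4_blockwise(w: np.ndarray, block: int = 32):
    """Symmetric 4-bit block-wise quantization: each group of `block`
    consecutive weights shares one float scale; values map to [-8, 7].
    Assumes w.size is divisible by `block`."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int4_blockwise(w)
w_hat = dequantize(q, s, w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```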

This steadfast journey towards innovation has enabled us to optimize larger variants of DeepSeek models (7B and 14B) more rapidly and will continue to facilitate the introduction of new models that run efficiently on Windows.

Get started today

Developers can access all distilled variants (1.5B, 7B, and 14B) of DeepSeek models and run them on Copilot+ PCs by downloading the AI Toolkit VS Code extension. The DeepSeek model optimized in the ONNX QDQ format is available in the AI Toolkit’s model catalog, directly sourced from Azure AI Foundry. Users can download it locally with a simple click of the “Download” button. Once downloaded, experimenting with the model is straightforward: open the Playground, load the “deepseek_r1_1_5” model, and start sending prompts.
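
Beyond the Playground, the downloaded ONNX model can also be driven programmatically. Below is a minimal sketch using the onnxruntime-genai Python package; the model path is a placeholder for wherever the AI Toolkit stored the download, and the exact generator calls vary somewhat between onnxruntime-genai releases:

```python
import onnxruntime_genai as og

# Placeholder path: point this at the folder the AI Toolkit downloaded.
model = og.Model("./models/deepseek_r1_1_5")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Why is the sky blue? Think step by step."))

# Stream decoded tokens as they are produced.
stream = tokenizer.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```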

Run models across Copilot+ PCs and Azure

Copilot+ PCs provide local compute capabilities that extend the functionalities enabled by Azure, offering developers enhanced flexibility to train and fine-tune small language models on-device while leveraging the cloud for larger, more intensive workloads. In addition to the ONNX model optimized for Copilot+ PCs, users can also explore the cloud-hosted source model in Azure AI Foundry by clicking the “Try in Playground” button under “DeepSeek R1.” The AI Toolkit integrates seamlessly into the developer workflow, allowing for model experimentation and preparation for deployment. Through this playground, developers can effortlessly test the DeepSeek models available in Azure AI Foundry for local deployment. This grants access to the most comprehensive set of DeepSeek models available, bridging the gap from cloud to client.
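
For the cloud-hosted model, a few lines with the azure-ai-inference Python SDK are enough to send a prompt; the endpoint, key, and deployment name below are placeholders for values from your own Azure AI Foundry project:

```python
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholders: substitute your project's endpoint, key, and deployment name.
client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential("<your-key>"),
)

response = client.complete(
    model="DeepSeek-R1",  # deployment name; confirm in your Foundry project
    messages=[UserMessage(content="Why is the sky blue? Think step by step.")],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```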

Copilot+ PCs combine efficient computing with the vast computational resources offered by Microsoft’s Azure services. With reasoning capabilities spanning both the cloud and edge, we are entering a new paradigm of continuous computing that creates substantial value for our customers. The future of AI compute is indeed bright, and we eagerly anticipate the innovative contributions from our developer community as they harness these rich capabilities. Your feedback is invaluable as we continue this journey together.
