Model Optimization

Winsage
December 6, 2024
Microsoft's Applied Sciences team has developed Phi Silica, a small language model (SLM) designed for on-device use on Windows 11 Copilot+ PCs with Snapdragon X Series NPUs, delivering gains in power efficiency, inference speed, and memory efficiency. The model supports multiple languages and offers a 4K-token context length. Microsoft announced that developers will have access to the Phi Silica API starting January 2025. Copilot+ PCs can perform more than 40 trillion operations per second and can achieve further performance gains when connected to the cloud.

Phi Silica is built on a Cyber-EO-compliant derivative of Phi-3.5-mini, and its architecture comprises a tokenizer, detokenizer, embedding model, transformer block, and language model head. Context processing consumes only 4.8 mWh of energy on the NPU, a 56% improvement in power consumption compared with running on the CPU. The model uses 4-bit weight quantization for efficiency, achieves a rapid time to first token, and maintains high accuracy across languages.

The 4-bit quantization was achieved with QuaRot, a technique for low-precision inference, with minimal accuracy loss. Weight sharing and memory-mapped embeddings were employed to optimize memory usage, yielding a roughly 60% reduction in memory consumption, while a sliding window for context processing and a dynamic KV cache were introduced to extend the context length. The model has undergone safety alignment and is subject to Responsible AI assessments and content moderation measures.
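To make the 4-bit weight quantization concrete, here is a minimal sketch of symmetric per-group int4 quantization. This is an illustration of the general technique, not Phi Silica's actual implementation (which uses QuaRot rotations on top of low-bit quantization); the function names and group size are assumptions.

```python
# Illustrative symmetric 4-bit weight quantization (NOT Phi Silica's code).
import numpy as np

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Quantize a 1-D weight vector to signed 4-bit integers per group.

    Each group of `group_size` weights shares one FP16 scale, so storage
    drops from 16 bits per weight to 4 bits plus a small scale overhead.
    """
    w = weights.reshape(-1, group_size)
    # Symmetric int4 range is [-8, 7]; use max-abs scaling per group.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP32 weights from int4 values and scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
err = float(np.abs(w - w_hat).max())  # worst-case rounding error
```

The per-group error is bounded by half the group's scale, which is why low-bit quantization of well-conditioned weights loses little accuracy; QuaRot's contribution is rotating the weights first so that outliers do not inflate those scales.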
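The memory-mapped embeddings mentioned above can be sketched as follows: the embedding table lives in a file and the OS pages in only the rows that lookups actually touch, instead of copying the whole table into process memory. File name, vocabulary size, and dimensions here are illustrative, not Phi Silica's actual values.

```python
# Illustrative memory-mapped embedding lookup; sizes are assumptions.
import numpy as np
import os
import tempfile

vocab_size, dim = 32_000, 256
path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")

# Write the table once (in practice it ships with the model files).
table = np.random.default_rng(1).normal(size=(vocab_size, dim)).astype(np.float16)
table.tofile(path)

# Map it read-only: lookups touch only the pages for the requested rows.
emb = np.memmap(path, dtype=np.float16, mode="r", shape=(vocab_size, dim))
token_ids = [17, 4242, 31_999]
vectors = np.asarray(emb[token_ids])  # copies just these three rows
```

Because the mapping is read-only, multiple processes sharing the same model file also share the same physical pages, which is one way weight sharing reduces aggregate memory use.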
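The sliding-window idea for bounding cache growth can be sketched with a fixed-capacity KV cache that evicts the oldest entries once full, keeping memory constant as the processed context grows. This is a simplified illustration of the general mechanism, not Phi Silica's dynamic KV cache; the window size and entry types are assumptions.

```python
# Illustrative sliding-window KV cache; window size is an assumption.
from collections import deque

class SlidingKVCache:
    """Keeps keys/values for only the most recent `window` positions."""

    def __init__(self, window: int):
        self.window = window
        self.keys = deque(maxlen=window)    # oldest entries drop automatically
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = SlidingKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
# After 10 appends, only the 4 most recent positions remain.
```

A production cache would store tensors per layer and head rather than strings, but the eviction discipline is the same.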
Winsage
November 19, 2024
Generative AI-powered laptops and PCs are advancing gaming, content creation, productivity, and software development, with more than 600 Windows applications and games running AI on over 100 million GeForce RTX AI PCs worldwide. At the Microsoft Ignite event, NVIDIA and Microsoft introduced tools for Windows developers to build and optimize AI applications on RTX AI PCs, improving workflows for AI agents and digital humans.

NVIDIA's interactive digital human, James, combines NVIDIA NIM microservices, NVIDIA ACE, and ElevenLabs technologies for immersive interactions. NVIDIA ACE adds visual perception, allowing digital entities to give context-aware responses. NVIDIA's multimodal small language models process both text and imagery and are optimized for rapid response times. The upcoming NVIDIA Nemovision-4B-Instruct model runs on RTX GPUs while maintaining accuracy, enabling digital humans to interpret visual imagery and respond with relevant information. NVIDIA will also launch the Mistral NeMo Minitron 128k Instruct family, offering large-context small language models in several parameter sizes for efficient digital-human interactions. These models can process extensive documents without segmentation, improving efficiency on low-power devices.

NVIDIA also announced updates to the TensorRT Model Optimizer for Windows, addressing model-deployment challenges caused by limited memory and compute resources. The updates streamline models for ONNX Runtime deployment across GPU execution providers. TensorRT Model Optimizer includes advanced quantization algorithms that significantly reduce memory footprint and improve throughput on RTX GPUs, achieving up to a 2.6x reduction in memory footprint compared with FP16 models.
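A back-of-the-envelope calculation shows where a roughly 2.6x memory reduction versus FP16 can come from: most weights shrink from 2 bytes (FP16) to half a byte (INT4), while some sensitive layers and the per-group scales stay in FP16. The parameter count, FP16 fraction, and group size below are assumptions for illustration, not figures from the TensorRT Model Optimizer announcement.

```python
# Illustrative memory-footprint arithmetic; all inputs are assumptions.
params = 8e9                      # hypothetical 8B-parameter model
fp16_bytes = params * 2.0         # baseline: 2 bytes per FP16 weight

kept_fp16 = 0.15                  # assume ~15% of weights remain FP16
scale_overhead = 2.0 / 32         # one FP16 scale per 32-weight group
int4_bytes = params * (
    kept_fp16 * 2.0                              # layers left in FP16
    + (1 - kept_fp16) * (0.5 + scale_overhead)   # INT4 weights + scales
)

reduction = fp16_bytes / int4_bytes  # ≈ 2.6x under these assumptions
```

With these assumed inputs the reduction lands near the quoted "up to 2.6x"; the actual figure depends on which layers the quantizer leaves in higher precision and the group size it uses for scales.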