NVIDIA DGX-Ready Managed Services

Penguin Solutions™ is an expert in managing large NVIDIA DGX clusters.

Penguin Solutions™ has 25 years of experience managing HPC clusters and 6 years of experience managing very large NVIDIA DGX clusters. That experience has allowed us to develop unmatched capabilities in running large AI factories. For example, we are helping Meta manage its Research SuperCluster (RSC), with over 2,000 NVIDIA DGX systems, 16,000 NVIDIA A100 Tensor Core GPUs, 500 PB of storage, and 40,000 NVIDIA InfiniBand networking links.

AI infrastructure differs from traditional IT systems in its processors, platforms, and networks, and it demands precise operations. These differences can create gaps in your team’s ability to hit the schedule, performance, and uptime targets that you need to win.

Penguin continues to provide exceptional uptime and availability for Meta’s large NVIDIA DGX cluster.

“Working in partnership with our implementation partner, Penguin Computing, we improved our overall cluster management. By the time we completed the second phase of building RSC, availability stayed above 95 percent on a consistent basis. This was no small feat given that we added a 10K GPU cluster while concurrently running multiple research projects.” (Meta website, May 18, 2023).

Penguin Solutions is a certified NVIDIA DGX-Ready Managed Services partner.

Penguin has designed large NVIDIA DGX clusters with high-speed NVIDIA InfiniBand networking and optimized storage. We have relationships and expertise with most storage vendors, allowing us to provide bespoke solutions for every customer. Our designs are field-proven, scalable, and future-proof, and they de-risk our customers’ investments.

Design
Build
Deploy
Manage

Design

Tailored

System, network and storage designs that de-risk deployments and enhance stability and productivity

Proven

24 years in HPC
Over six years building AI Factories
Deployment of over 50,000 GPUs

Innovative

Hybrid Cloud architectures to accelerate system availability and enable burst and disaster recovery
Sustainable computing with immersion cooling

Build

Tailored

Provision AI and HPC server clusters at scale
Sophisticated factory capabilities to rack, cable, and validate full AI clusters in Penguin’s environment

Proven

Expert cluster integration
Validated software stack process eliminating compatibility issues

Innovative

In-factory burn-in testing and performance validation drive smooth deployments and rapid user access

Penguin is an optimal integration partner, offering greater buying power, accurate and rapid installation, comprehensive supply chain management, and predictable project orchestration for even the most complex solutions. Our solutions undergo full in-factory integration and burn-in testing so that systems are stable and robust from initial deployment.

Thanks to our NVIDIA DGX DevOps and Services teams, Penguin provides:

Monitoring
Detection and remediation of issues with all core components (nodes, network, storage, overall cluster, DNS servers, overall compute availability)
Automated ticketing and integration with customer dashboards
Automation scripting (Ansible, Python)
Operational playbooks
Regular improvements to bandwidth and compute targets, node deployment, and provisioning
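
As a rough illustration of the detection and automated-ticketing items above, the following Python sketch runs one health check per core component and produces a ticket for each component that fails. It is not Penguin's tooling: the names Ticket, run_health_checks, and compute_availability, the component keys, and the stubbed checks are all hypothetical, and a real deployment would replace the stubs with actual probes and forward tickets to a ticketing system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Ticket:
    """Hypothetical ticket record; a real system would post this to a ticketing API."""
    component: str
    detail: str


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> List[Ticket]:
    """Run every check and return one ticket per component that fails or raises."""
    tickets: List[Ticket] = []
    for component, check in checks.items():
        detail = "health check reported failure"
        try:
            healthy = check()
        except Exception as exc:  # a crashing probe is treated as a failure
            healthy = False
            detail = f"health check raised {type(exc).__name__}: {exc}"
        if not healthy:
            tickets.append(Ticket(component, detail))
    return tickets


def compute_availability(healthy_nodes: int, total_nodes: int) -> float:
    """Fraction of compute nodes currently healthy (an assumed, simplified definition)."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return healthy_nodes / total_nodes


if __name__ == "__main__":
    # Stub checks standing in for real probes of each core component.
    example_checks = {
        "nodes": lambda: True,
        "network": lambda: False,
        "storage": lambda: True,
        "dns": lambda: True,
    }
    for ticket in run_health_checks(example_checks):
        print(f"TICKET [{ticket.component}] {ticket.detail}")
    print(f"compute availability: {compute_availability(95, 100):.0%}")
```

Keeping each check as a plain callable means the same loop can cover nodes, network, storage, and DNS without changes, and remediation steps could be attached per component in the same way.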

Expert Factory Cluster Integration

Validated Software Stack

Penguin Cluster Provisioning

In-Factory Performance Validation

Deploy

Tailored

On-site integration and validation by Penguin’s expert services team

Proven

System-level testing and project management expertise accelerate system availability and performance
700 racks built, delivered, and live within 8 months

Innovative

Penguin-supplied monitoring software continuously validates system health and maintains cluster availability

White Glove Services

Measurable Success

Production Readiness Reporting

Complete DevOps-based Monitoring

Manage

Tailored

On-site spares inventory and services personnel for maximum system availability

Proven

Expert support team and Service Level Agreement (SLA) management

Innovative

Cloud-first deployment model to drive immediate value
Cloud consumption cost management reporting and guardrails

With Penguin Managed Services, customers enjoy enhanced system availability levels and stable system operation for long-running AI workloads. The result is improved return on investment in AI technology.