上海智能算力科技有限公司
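New high-speed GPU interconnect design cuts cost and boosts efficiency for large model training: Peking University, StepFun, and Lightelligence propose a next-generation high-bandwidth domain architecture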
量子位· 2025-05-19 04:37
Core Viewpoint - The article discusses the limitations of existing High-Bandwidth Domain (HBD) architectures for large model training and introduces InfiniteHBD, a new architecture designed to address them [1][3][4].

Group 1: Limitations of Existing HBD Architectures
- Current HBD architectures face fundamental limitations in scalability, cost, and fault tolerance: switch-centric designs are expensive and hard to scale, GPU-centric designs suffer from fault propagation, and hybrid designs such as TPUv4 remain less than ideal in cost and fault tolerance [3][10][19].
- Existing architectures fall into three types (switch-centric, GPU-centric, and hybrid), each with its own limitations in scalability, interconnect cost, fault explosion radius, and fragmentation [7][22].

Group 2: Introduction of InfiniteHBD
- InfiniteHBD is proposed as a solution. It embeds Optical Circuit Switching (OCS) technology in optical-electrical conversion modules to achieve low-cost scalability and node-level fault isolation [4][29].
- The cost of InfiniteHBD is only 31% of that of NVL-72, with near-zero GPU wastage, and it improves Model FLOPs Utilization (MFU) by up to 3.37 times compared to traditional architectures [4][48][63].

Group 3: Key Innovations of InfiniteHBD
- InfiniteHBD incorporates three key innovations: OCS-based optical-electrical conversion modules (OCSTrx), a reconfigurable K-Hop Ring topology, and an HBD-DCN orchestration algorithm [30][32][44].
- The OCSTrx allows for dynamic point-to-multipoint connections and low resource fragmentation, enhancing scalability and cost-effectiveness [29][35].

Group 4: Performance Evaluation
- The performance evaluation shows that InfiniteHBD can meet the dual demands of computational efficiency and communication performance for large-scale training of language models [65].
- The orchestration algorithm optimizes communication efficiency, significantly reducing cross-Top of Rack (ToR) traffic and demonstrating resilience against node failures [68][70].

Group 5: Cost and Energy Efficiency
- InfiniteHBD shows significant advantages in interconnect cost and energy consumption: its interconnect cost is 31% of NVL-72's and its energy consumption is 75% of NVL-72's, a low energy level comparable to TPUv4 [74].
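
To make the idea of a K-Hop Ring with node-level fault isolation more concrete, here is a minimal Python sketch. It is not the InfiniteHBD implementation, and the summary above does not specify how the topology is built. The sketch assumes that each node on a ring of `n` nodes can reach any node up to `k` positions away, so a failed node can be skipped as long as the gap between consecutive healthy nodes stays within `k` hops. The function name `healthy_ring` and its parameters are hypothetical.

```python
def healthy_ring(n, k, failed):
    """Return the healthy nodes in ring order, or None if the ring cannot be closed.

    Assumption (not from the source): node i has links to nodes up to k
    positions away on the ring, so a run of failed nodes can be bypassed
    only if the next healthy node is at most k hops from the previous one.
    """
    healthy = [i for i in range(n) if i not in failed]
    # Pair every healthy node with its clockwise healthy successor (cyclically).
    for a, b in zip(healthy, healthy[1:] + healthy[:1]):
        gap = (b - a) % n  # clockwise hop distance from a to b
        if gap > k:
            return None  # too many consecutive failures to bypass
    return healthy


if __name__ == "__main__":
    # One failed node is bypassed with a 2-hop link.
    print(healthy_ring(8, 2, {3}))
    # Two adjacent failures would need a 3-hop link, which k=2 does not provide.
    print(healthy_ring(8, 2, {3, 4}))
```

Under this assumption, a ring with `k` hops tolerates up to `k - 1` consecutive failed nodes without affecting the remaining healthy nodes, which is one way to read "node-level fault isolation".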
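Shanghai's intelligent computing resource coordination and scheduling platform goes live; Bank of Jiangsu launches "Computing Power Loan" product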
Guo Ji Jin Rong Bao· 2025-03-30 03:09
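The platform has so far reached preliminary cooperation agreements with Shanghai's three major telecom operators and several cloud computing power companies, and is exploring access to high-quality computing power from the western regions paired with Shanghai for cooperation.

Sun Yue, general manager of 上海智能算力科技有限公司, said that the Shanghai Intelligent Computing Power Resource Coordination and Scheduling Service Platform centers on bringing idle computing power across society under management and making it tradable. Through a flexibly scheduled cloud service platform built on a "one cloud, multiple chips; one cloud, multiple pools" model, it provides cost-effective, inclusive computing power services to vertical industry applications, open-source communities, AI researchers, individual developers, and others. The first batch of providers connected to the platform includes Shanghai INESA, Shanghai Telecom, Shanghai Mobile, Shanghai Unicom, SenseTime, and 碳和科技, covering internet companies' computing power, cloud computing service providers, and telecom operators.

The event also presented the 2024 Shanghai Computing Power Network High-Quality Development Benchmark Application Cases, the Shanghai Intelligent Computing Center Comprehensive Energy Efficiency Excellence Award, and the Green Technology Individual Achievement Award. The SenseTime Lingang Intelligent Computing Center won first prize among the benchmark application cases.

The 2025 "智算申城" (Intelligent Computing Shanghai) Summit Forum was recently held in Shanghai, where the Shanghai Intelligent Computing Power Resource Coordination and Scheduling Service Platform went live.

Zhang Ying, director of the Shanghai Municipal Commission of Economy and Informatization, said in an address that Shanghai is seizing the opportunity of the shift toward intelligent technology, treating artificial intelligence as one of its three key leading industries, driving the digital and intelligent transformation of manufacturing, and fostering a competitive intelligent computing industry ecosystem. Looking ahead, Shanghai will make full use of its comprehensive advantages as a megacity and accelerate the building of a more internationally influential "Shanghai highland" for artificial intelligence.

At the event, the Shanghai Intelligent Computing Power Resource Coordination and Scheduling Service Platform ...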