DeepSeek open-sources a large-model memory module in a new paper co-signed by Liang Wenfeng, previewing the next generation of sparse models
36Kr · 2026-01-13 07:14
Core Insights
- DeepSeek has introduced a new paradigm called "conditional memory" to give Transformer models the knowledge-retrieval capability they previously lacked [1][4][31]
- The Engram module significantly improves model efficiency, letting simpler tasks complete in fewer layers and freeing resources for more complex reasoning [4][21]

Group 1: Conditional Memory and Engram Module
- The paper presents conditional memory as an essential modeling primitive for the next generation of sparse models [1][4]
- Engram lets the model perform tasks that previously required six layers of attention in just one or two layers, optimizing resource allocation [4][21]
- The Engram design incorporates a large lookup vocabulary for static knowledge, enabling O(1) retrieval speed [4][6]

Group 2: Performance and Efficiency
- The optimal allocation of sparse parameters between MoE (Mixture of Experts) experts and Engram memory was found to be around 20% to 25% for Engram, reducing model validation loss [17][21]
- In experiments, the Engram-27B model outperformed the MoE-27B model on various knowledge-intensive tasks, with notable improvements in general reasoning, coding, and mathematics [21][22]
- The Engram-40B model further increased memory parameters and showed sustained performance improvements, indicating that memory capacity had not yet saturated [25][31]

Group 3: Hardware Optimization
- The Engram module allows large parameter tables to be offloaded to CPU memory while keeping inference latency low and throughput high [29][30]
- The design principle of "hardware-aware efficiency" decouples storage from computation, making massive parameter tables usable without significant performance cost [31]
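The O(1) retrieval that these summaries describe can be illustrated with a toy hashed N-gram table: each N-gram of token ids is hashed to a row of a fixed-size embedding table, so lookup cost is independent of how much knowledge the table stores. The table size, embedding width, and hash scheme below are illustrative assumptions, not DeepSeek's actual implementation:

```python
import numpy as np

def ngram_hash(tokens, table_size):
    # Fold an N-gram of token ids into one bucket index.
    # Python's tuple hash is used only for brevity here.
    return hash(tuple(tokens)) % table_size

class HashedNgramMemory:
    """Toy static-pattern memory: N-gram of token ids -> embedding row."""
    def __init__(self, table_size=1 << 16, dim=64, n=2, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((table_size, dim)).astype(np.float32)
        self.n = n

    def lookup(self, token_ids):
        # One hash plus one row read per position: the cost does not
        # depend on table size, i.e. O(1) per retrieved embedding.
        rows = []
        for i in range(self.n - 1, len(token_ids)):
            ngram = token_ids[i - self.n + 1 : i + 1]
            rows.append(self.table[ngram_hash(ngram, len(self.table))])
        return np.stack(rows)

mem = HashedNgramMemory()
emb = mem.lookup([5, 17, 9, 42])  # 4 tokens -> 3 bigram embeddings
print(emb.shape)  # (3, 64)
```

Because retrieval is a deterministic address computation rather than attention over context, the table can grow without adding FLOPs, which is the property the offloading results above rely on.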
New DeepSeek paper co-signed by Liang Wenfeng takes aim at large models' "memory" shortcoming
Bei Ke Cai Jing· 2026-01-13 04:41
Core Insights
- The paper published by DeepSeek addresses the memory limitations of current large language models and introduces the concept of "conditional memory" [2]
- DeepSeek proposes a module named Engram, which breaks language modeling down into two branches: "static pattern retrieval" for quick access to deterministic knowledge and "dynamic combinatorial reasoning" for complex logical operations [2]
- The paper argues that conditional memory is an essential modeling primitive for the next generation of sparse models, with speculation that DeepSeek's next model may be released before the Spring Festival [3]

Group 1
- The paper, titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models", was co-authored by Peking University and DeepSeek [1]
- The introduction of "conditional memory" aims to enhance the memory capabilities of large language models [2]
- The Engram module improves language-modeling efficiency by separating tasks into static and dynamic components [2]

Group 2
- The paper emphasizes the importance of conditional memory for future sparse-model development [3]
- There is speculation that DeepSeek's next-generation model will be released around the Spring Festival, potentially replicating the success of previous launches [3]
Is the DeepSeek V4 roadmap emerging? Major paper co-signed by Liang Wenfeng focuses on a conditional-memory module for large models
Jin Rong Jie· 2026-01-13 04:38
Core Insights
- DeepSeek has released a significant research paper on a conditional-memory module for large models, positioning it as a core modeling primitive for the next generation of sparse large models [1][4]
- The upcoming flagship model V4 is expected to be unveiled around the Spring Festival, and these results may outline its core research roadmap [1][4]

Summary by Sections

Research Findings
- The paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" was co-authored by DeepSeek and Peking University, with DeepSeek founder Liang Wenfeng among the authors [4]
- The core insight of the paper is that large models handle two distinct types of tasks: deep dynamic computation for combinatorial reasoning, and static knowledge retrieval [4]
- Existing Transformer architectures lack a native knowledge-retrieval mechanism, so they waste computation simulating the retrieval process [4]

Proposed Solutions
- To address these inefficiencies, DeepSeek proposes conditional memory as a supplementary dimension of sparsity, implemented through a module called Engram [5]
- The team discovered a "U-shaped scaling law": a mixed allocation of sparse capacity between MoE experts and Engram memory significantly outperforms pure-MoE baseline models [5]
- Engram optimizes the balance between neural computation (MoE) and static memory, improving efficiency and performance across domains including general reasoning, coding, and mathematics [5]

Future Developments
- DeepSeek plans to release the next-generation flagship model V4 in February, with preliminary internal tests showing programming capabilities that surpass existing top models [6]
- V4 is anticipated to be an industry focal point, following the success of the V3 model released at the end of 2024, which outperformed OpenAI's GPT-5 and Google's Gemini 3.0 Pro in several benchmark tests [6]
Liang Wenfeng co-signs DeepSeek's latest paper
Di Yi Cai Jing Zi Xun· 2026-01-13 03:41
Core Insights
- DeepSeek has released a new paper on the conditional-memory module for large models, suggesting it will be a core modeling primitive in the next generation of sparse large models [2][5][7]

Group 1: Research and Development
- The new paper, co-authored with Peking University, is titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" [5]
- The research identifies two distinct tasks within large models: deep dynamic computation for combinatorial reasoning and static knowledge retrieval, highlighting inefficiencies in the current Transformer architecture [5][6]
- DeepSeek introduces conditional memory as a supplementary sparse dimension to optimize the balance between neural computation (MoE) and static memory (Engram) [6][7]

Group 2: Performance and Implications
- The team discovered a U-shaped scaling law indicating that mixed sparse-capacity allocation between MoE experts and Engram memory significantly outperforms pure-MoE baseline models [6]
- The memory module not only aids knowledge retrieval but also brings significant improvements in general reasoning, coding, and mathematical tasks [6][7]
- The paper essentially proposes a "division of labor" optimization for large models, letting specialized modules handle specific tasks more efficiently [6][7]

Group 3: Future Developments
- Industry speculation suggests that the proposed conditional memory may be part of the technical architecture of DeepSeek's upcoming flagship model, DeepSeek V4, expected to be released around February [7]
- Initial tests indicate that V4 may surpass other leading models in programming capabilities, with the previous V3 model having already outperformed OpenAI's GPT-5 and Google's Gemini 3.0 Pro in various benchmarks [7]
New DeepSeek paper! Next-generation large models achieve "memory separation". Is V4 near?
Di Yi Cai Jing Zi Xun· 2026-01-13 03:32
Core Insights
- DeepSeek has released a new paper focusing on the conditional-memory module of large models, suggesting it will be a core modeling primitive in the next generation of sparse large models [1][4]

Group 1: Research Findings
- The new paper, co-authored with Peking University, is titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" and highlights the need for a native knowledge-retrieval mechanism in existing Transformer architectures [4]
- The research identifies two distinct tasks in large models: deep dynamic computation for combinatorial reasoning and static knowledge retrieval, noting that current models simulate the retrieval process inefficiently [4][5]
- DeepSeek introduces conditional memory as a supplementary dimension of sparsity, optimizing the trade-off between Mixture of Experts (MoE) and static memory (Engram) [4][6]

Group 2: Performance Improvements
- The team discovered a U-shaped scaling law, showing that mixed sparse-capacity allocation between MoE experts and Engram memory significantly outperforms pure-MoE baseline models [5]
- The memory module not only aids knowledge retrieval but also yields notable improvements in general reasoning, coding, and mathematical tasks [5][6]
- The paper essentially proposes a "division of labor" optimization for large models, letting specialized modules handle specific tasks and thereby improving efficiency and resource allocation [6]

Group 3: Future Developments
- Industry speculation suggests that the proposed conditional memory may be integral to the architecture of DeepSeek's upcoming flagship model, DeepSeek V4, expected to be released around February [6]
- Initial tests indicate that V4 may surpass other leading models in programming capabilities, with the previous model, V3, having already outperformed OpenAI's GPT-5 and Google's Gemini 3.0 Pro in various benchmarks [6]
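For a rough sense of what the reported 20%-25% optimum means, one can split a fixed sparse-parameter budget at that ratio. Treating the headline 27B figure as the sparse budget is purely an illustrative assumption (the real models divide parameters differently); the split itself is simple arithmetic:

```python
def split_budget(sparse_params_b, engram_frac):
    """Split a fixed sparse-parameter budget (in billions) between
    MoE experts and Engram memory at a given Engram fraction."""
    assert 0.0 <= engram_frac <= 1.0
    engram = sparse_params_b * engram_frac
    moe = sparse_params_b - engram
    return moe, engram

# Reported optimum: roughly 20%-25% of the sparse budget goes to Engram.
for frac in (0.20, 0.25):
    moe_b, engram_b = split_budget(27.0, frac)
    print(f"Engram at {frac:.0%}: MoE {moe_b:.2f}B, Engram {engram_b:.2f}B")
```

The U-shape means both extremes lose: all-MoE wastes compute re-deriving static facts, while all-memory starves the reasoning capacity.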
DeepSeek-V4 release imminent, with upgrades in both compute efficiency and performance! Low-fee Huaxia Cloud Computing ETF and Huaxia ChiNext AI ETF attract capital inflows
Xin Lang Cai Jing· 2026-01-13 03:32
Group 1
- The three major indices turned negative again, with the technology sector adjusting alongside the market. The Huaxia Communication ETF (515050) saw its decline widen to 2.39%, with mixed performance among its holdings [1]
- The low-fee Huaxia ChiNext AI ETF (159381) dropped 1.64%, with trading volume quickly surpassing 300 million yuan, indicating active trading [1]
- The low-fee Huaxia Cloud Computing ETF (516630) fell 0.64%, but has seen continuous net inflows of over 130 million yuan over the past three trading days, suggesting accelerating capital allocation [1]

Group 2
- Ping An Securities noted that global AI computing platform capabilities are continuously improving, with major chip manufacturers such as NVIDIA and AMD showcasing advances in AI computing chips at CES 2026 [2]
- NVIDIA announced full production of the NVIDIA Rubin platform, with Rubin-based products expected to be available through partners in the second half of 2026 [2]
- AMD unveiled the "Helios" platform and previewed the next-generation MI500 series GPU, indicating a significant enhancement of global AI computing infrastructure [2]

Group 3
- The Huaxia Cloud Computing ETF (516630) tracks the cloud computing index (930851) and has the lowest fee rate, focusing on domestic AI software and hardware computing, with a combined 83.7% weight in computer software, cloud services, and computer equipment [3]
- The Huaxia ChiNext AI ETF (159381) invests in AI-focused companies on the ChiNext board, with half of its weight in AI hardware computing and the other half in AI software applications, offering high elasticity and representativeness [3]
- The Huaxia Communication ETF (515050) tracks the CSI 5G communication theme index, focusing deeply on the NVIDIA, Apple, and Huawei supply chains, with its top five holdings including Zhongji Xuchuang, Xinyi Sheng, Lixun Precision, Industrial Fulian, and Zhaoyi Innovation [3]
DeepSeek releases a new paper co-signed by Liang Wenfeng
Zheng Quan Shi Bao· 2026-01-13 03:02
Core Insights
- DeepSeek released a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" on the evening of the 12th [1]
- The paper was co-authored by Peking University and DeepSeek, with Liang Wenfeng listed as a co-author [1]
- The paper introduces the concept of conditional memory, which significantly enhances model performance in knowledge retrieval, reasoning, coding, and mathematical tasks under equal parameter and compute budgets [1]
- DeepSeek has also open-sourced the related memory module, named Engram [1]

Company and Industry Summary
- The collaboration between DeepSeek and Peking University highlights the growing trend of partnerships between academia and industry in advancing AI technologies [1]
- The introduction of scalable lookup structures in large language models represents a significant innovation in the field, potentially improving the efficiency and effectiveness of AI applications [1]
- Open-sourcing the Engram memory module may encourage further research and development in conditional memory systems, fostering a more collaborative environment in AI [1]
Were DeepSeek and 7 other major products all accidents?! The projects that changed the world were never "treated as a priority" at first
Sou Hu Cai Jing· 2026-01-13 01:47
Core Insights
- Many groundbreaking products initially started as side projects that were not considered significant at their inception [1][2][3][5][6]
- Side projects are defined as non-core, non-KPI-driven initiatives that are not part of a company's strategic plan [1]
- The success of side projects can be attributed to their freedom from the constraints typically placed on mainline projects, allowing greater innovation and flexibility [2][3][6]

Group 1: Examples of Successful Side Projects
- DeepSeek, a side project of the quantitative fund High-Flyer, emerged from internal technical research and has become a significant tool in quantitative trading [2]
- Qwen, developed by Alibaba, began as a side project with more autonomy and faster iteration, ultimately being integrated into the company's main offerings [3]
- Claude Code, initially a simple experimental project by an engineer, evolved into a key product for Anthropic, demonstrating how side projects can gain traction unexpectedly [5]

Group 2: Impact of AI on Project Development
- The integration of AI into software engineering has lowered the cost of experimentation, enabling individuals to validate ideas more quickly and easily [7][8]
- Side projects often begin by addressing specific problems and mature through real-world usage, which enhances their relevance [8]
- The shift toward AI-driven development suggests that early signals of future trends may increasingly emerge from projects that were initially overlooked [10]

Group 3: Strategic Considerations
- While AI enhances execution efficiency, it does not necessarily improve the accuracy of strategic judgment, a potential limitation of mainline projects [10]
- The evolving landscape indicates that side projects may play a crucial role in validating directions before scaling up to mainline initiatives [10]
New paper co-signed by Liang Wenfeng: first look at the DeepSeek V4 architecture? Taking aim at a fatal flaw of the Transformer
36Kr · 2026-01-13 01:24
Core Insights
- DeepSeek's new paper addresses the memory limitations of Transformer models by proposing a complementary "conditional memory" sparsity axis through the Engram module, which enables efficient knowledge retrieval with near-O(1) complexity [1][6][11]

Group 1: Memory and Model Architecture
- While MoE (Mixture of Experts) has become the mainstream architecture for large models, it still fundamentally relies on Transformers, which lack a native knowledge-retrieval mechanism and therefore compute inefficiently [9][11]
- Engram is designed to offload static, repetitive patterns in language modeling to a scalable lookup module, letting the Transformer backbone focus on tasks that require combination and reasoning [11][15]
- The authors categorize language-modeling tasks into two types: those requiring combination and reasoning, and those resembling pattern retrieval, arguing that the latter needs a dedicated mechanism [12][13]

Group 2: Engram Architecture and Functionality
- Engram is conceptualized as a modernized version of the classic hashed N-gram, functioning as a scalable lookup module integrated within the Transformer architecture [18][20]
- The architecture uses a two-stage process for handling input sequences, retrieval followed by fusion, which improves the model's efficiency on static patterns [20][21]
- A context-aware gating mechanism lets the model dynamically weight the retrieved embeddings, improving overall expressiveness and suppressing noise from hash collisions [25][27]

Group 3: Performance and Scaling
- The paper presents a U-shaped scaling law indicating that an optimal resource allocation between MoE and Engram enhances model performance, suggesting that a balance between dynamic computation and static memory is crucial [3][33]
- Experimental results show that Engram, scaled to 27 billion parameters, outperforms the MoE baseline under equivalent parameter and FLOPs budgets, demonstrating its effectiveness across various benchmarks [5][38]
- Engram's architecture improves not only knowledge retrieval but also reasoning, mathematics, and coding capabilities, a significant leap in performance across multiple tasks [39][48]

Group 4: Future Implications
- The findings suggest a paradigm shift in model architecture toward a dual-axis approach of computation plus memory, with potential integration into future iterations of large language models, such as V4 [46][50]
- Integrating Engram could lead to substantial improvements in model efficiency and capability, paving the way for more advanced applications in natural language processing [51][52]
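The context-aware gating described above can be sketched as a learned, per-position scalar gate that decides how much of a retrieved embedding to mix into the hidden state. The sigmoid gate, the projection shapes, and the additive fusion below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(hidden, retrieved, w_gate, b_gate):
    """Mix a retrieved memory embedding into the hidden state.

    The gate is conditioned on the current hidden state, so the model
    can suppress retrievals that do not fit the context (e.g. noise
    from hash collisions): gate -> 0 ignores the lookup, gate -> 1
    passes the retrieved pattern through.
    """
    gate = sigmoid(hidden @ w_gate + b_gate)       # one scalar per position
    return hidden + gate[..., None] * retrieved

rng = np.random.default_rng(0)
h = rng.standard_normal((3, 64)).astype(np.float32)  # hidden states
r = rng.standard_normal((3, 64)).astype(np.float32)  # retrieved embeddings
w = (0.1 * rng.standard_normal(64)).astype(np.float32)

# With the gate biased far negative (nearly closed), fusion is a no-op,
# which is how a collision-corrupted lookup can be filtered out.
out = gated_fuse(h, r, w, b_gate=-20.0)
print(np.allclose(out, h, atol=1e-4))  # True
```

Making the gate a function of the context, rather than a fixed mixing weight, is what turns a static table into "conditional" memory: the same lookup result can be used or discarded depending on the surrounding sequence.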
Just released: an open-source "memory" module co-signed by Liang Wenfeng adds detail to DeepSeek V4
36Kr · 2026-01-13 00:42
Core Insights
- DeepSeek has released a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," in collaboration with Peking University, introducing a new module called Engram to enhance the efficiency of large language models [1][3]

Group 1: Research Overview
- The current approach to sparsity in large language models relies primarily on Mixture of Experts (MoE) for conditional computation, but existing Transformer architectures lack a native knowledge-retrieval mechanism [3][8]
- DeepSeek proposes conditional memory as a complementary dimension to MoE, introducing the Engram module to enable efficient knowledge retrieval with O(1) time complexity [8][9]

Group 2: Engram Module Implementation
- The Engram module has been implemented and released on GitHub, enabling community engagement and further development [4][5]
- Engram separates static memory storage from dynamic computation within the Transformer architecture, improving overall model performance [10][12]

Group 3: Performance Metrics
- Engram shows significant improvements across benchmarks, including a +3.4% increase in MMLU accuracy and a +4.0% increase in CMMLU accuracy, as well as notable gains on general reasoning tasks [9][28]
- The architecture improves long-context retrieval, with Multi-Query NIAH accuracy rising from 84.2 to 97.0 [9]

Group 4: Experimental Results
- DeepSeek trained four models under the same training conditions: Dense-4B (4.1 billion parameters), MoE-27B (26.7 billion), Engram-27B (26.7 billion), and Engram-40B (39.5 billion) [25][27]
- The sparse architectures (MoE-27B, Engram-27B/40B) outperformed the dense model (Dense-4B) across all benchmarks, demonstrating superior scalability [28][30]

Group 5: Memory and Computation Decoupling
- Engram's deterministic retrieval mechanism decouples parameter storage from computational resources, enabling efficient scaling without increasing computational cost [15][17]
- The architecture supports a multi-level cache hierarchy, optimizing memory access and reducing latency [18]

Group 6: U-Shaped Scaling Law
- DeepSeek identified a U-shaped scaling law for the allocation between MoE and Engram, suggesting that a balanced distribution of sparse parameters leads to improved performance [19][24]
- The optimal allocation gives Engram around 20%-25% of the sparse parameter budget, confirming the structural complementarity of the two modules [23][24]
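The storage/compute decoupling and multi-level cache hierarchy described above can be mimicked in a few lines of software: a large table kept in slower host memory, fronted by a small fast cache that holds recently used rows. The sizes, the LRU policy, and the two-level layout are illustrative assumptions, not DeepSeek's actual inference stack:

```python
from collections import OrderedDict
import numpy as np

class CachedTable:
    """A large embedding table in 'slow' host memory fronted by a small
    LRU cache, mimicking a multi-level hierarchy that hides lookup latency."""
    def __init__(self, rows=100_000, dim=32, cache_rows=256, seed=0):
        rng = np.random.default_rng(seed)
        self.slow = rng.standard_normal((rows, dim)).astype(np.float32)  # host side
        self.cache = OrderedDict()                                       # fast side
        self.cache_rows = cache_rows
        self.hits = self.misses = 0

    def get(self, idx):
        if idx in self.cache:                  # fast path: row already resident
            self.cache.move_to_end(idx)
            self.hits += 1
            return self.cache[idx]
        self.misses += 1                       # slow path: fetch from host table
        row = self.slow[idx]
        self.cache[idx] = row
        if len(self.cache) > self.cache_rows:  # evict least-recently-used row
            self.cache.popitem(last=False)
        return row

table = CachedTable()
for idx in [7, 7, 99_000, 7]:                  # repeated ids hit the cache
    table.get(idx)
print(table.hits, table.misses)  # 2 2
```

Because Engram's lookup addresses are deterministic functions of the input tokens, the rows needed for a batch are knowable in advance, which is what lets a real system prefetch from host memory instead of stalling on it.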