机器之心
ICLR 2026 | Beihang open-sources Code2Bench: dual-scaling dynamic evaluation means code LLMs can no longer coast on memorized benchmarks
机器之心· 2026-02-21 04:06
In the race to measure the code-generation ability of large language models (LLMs), an increasingly serious question is surfacing: when models post near-saturated scores on classic benchmarks such as HumanEval and MBPP, are we evaluating genuine generalization and reasoning, or merely testing how well they have memorized their training corpora?

Existing code benchmarks face two core challenges: the risk of data contamination, and insufficient testing rigor. The former can reduce evaluation to an "open-book exam"; the latter often produces an "Illusion of Correctness", where generated code passes a handful of example tests yet falls apart on the complex edge cases of the real world.

To break this illusion of high scores, a research team from Beihang University proposes a new philosophy of benchmark construction, Dual Scaling, and builds an end-to-end automated framework, Code2Bench, on top of it. The work aims to establish a more dynamic, more rigorous, and more diagnostic paradigm for evaluating code LLMs. The paper has been accepted at ICLR 2026.

Paper title: Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction

What kind of benchma…
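The "Illusion of Correctness" the teaser describes can be made concrete: code that passes a benchmark's few example tests while failing under broader, scaled-up checking. The function and checks below are purely illustrative, not part of Code2Bench's actual pipeline.

```python
# Illustration of the "Illusion of Correctness": a buggy implementation
# that satisfies a benchmark's two example tests but fails on any input
# where the required first-occurrence order differs from sorted order.
def dedup(xs):
    """Intended: remove duplicates, preserving first-occurrence order.
    Bug: sorting destroys the required order."""
    return sorted(set(xs))

# The handful of example tests a weak benchmark might ship -- both pass:
assert dedup([1, 2, 2, 3]) == [1, 2, 3]
assert dedup([]) == []

def reference_dedup(xs):
    """Correct order-preserving deduplication, used as an oracle."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# Scaling up the checking (here: exhaustively over small inputs)
# exposes the bug that the two examples missed.
import itertools
failures = [p for p in itertools.permutations([3, 1, 2])
            if dedup(list(p)) != reference_dedup(list(p))]
```

Of the six orderings of `[3, 1, 2]`, only the already-sorted one survives, so rigor-scaled checking flags five failures that the example tests never saw.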
Done competing on video, now competing on "making people"? Pika launches AI Selves, a digital double you "raise" yourself
机器之心· 2026-02-21 04:06
While most AI vendors are busy building yet more AI tools, a company best known for AI video has started building "a second you". Pika recently launched its AI Selves product, which it claims can generate "an AI version of you".

According to the official introduction, a Pika AI Self is an AI double that you "conceive, nurture, and let go", becoming a living extension of you. It has a rich, multi-faceted personality and persistent memory; you can even specify details like a peanut allergy. Everything is up to you! It can post photos for you in group chats, build a video game for your pet fish, or call your mom while you are busy with something else… "The possibilities are as endless as the stars."

User Aakash Gupta commented: "Pika's AI Selves may be the most ambitious category leap in AI this year, yet almost nobody is talking about it: why is an AI video company the first to pull this off?"

Data show that nearly every large tech company is racing to build autonomous AI agents; the market is expanding at a 46% compound annual growth rate and is projected to reach $52 billion by 2030. Yet almost every agent on the market is text-based: text in, text out, task done, workflow automated. Some netizens countered, "Isn't this just Black Mirror…
With nothing but an "action silhouette", BridgeV2W bridges video generation and robot world models, teaching robots to "rehearse the future"
机器之心· 2026-02-21 02:57
How does a robot "imagine" the future? Picture a cup of coffee in front of you. As you reach for it, before your hand ever touches the cup, your brain has already played out the whole sequence: how your arm will move, what the cup will feel like, what the table will look like once the cup is lifted… This ability to imagine and predict future scenes is a core cognitive foundation of how humans manipulate the world.

So can robots be given the same "rehearsal" ability, simulating the consequences of an action in their "mind" before executing it? That is exactly what embodied world models aim to do: let robots "see" the future before they act. In recent years, this direction has made remarkable progress by drawing on the strong visual priors of large-scale video generation models such as Sora and Wan.

Yet an awkward problem has remained unresolved: the world of a video generation model is woven from pixels, while a robot's language is joint angles and pose coordinates; the two describe the same physical world in entirely different "representation languages".

To solve this, embodied-AI company 中科第五纪, together with a team from the Institute of Automation, Chinese Academy of Sciences, introduces BridgeV2W. Through one elegantly simple design, the Embodiment Mask, an "action silhouette" rendered from robot actions, it seamlessly maps actions from coordinate space into pixel space, genuinely bridging pretrained video generation models and world models and letting robots reliably "rehearse the future". …
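The core idea of rendering actions into pixel space can be sketched in a few lines: project the robot's 3D keypoints through the camera and rasterize them into a binary silhouette. The keypoint choice, disk rasterization, and camera parameters below are illustrative assumptions; BridgeV2W's actual renderer is not described in this teaser.

```python
import numpy as np

def render_embodiment_mask(points_3d, K, h, w, radius=4):
    """Project 3D robot keypoints (camera frame, z > 0) into the image
    and rasterize each as a filled disk, yielding a binary "action
    silhouette" mask of shape (h, w)."""
    mask = np.zeros((h, w), dtype=np.uint8)
    uv = (K @ points_3d.T).T           # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]        # divide by depth
    ys, xs = np.mgrid[0:h, 0:w]
    for u, v in uv:
        if 0 <= u < w and 0 <= v < h:
            disk = (xs - u) ** 2 + (ys - v) ** 2 <= radius ** 2
            mask[disk] = 1
    return mask

# Example: a straight-line end-effector motion, 1 m in front of a
# 64x64 camera with focal length 100 px and principal point (32, 32).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0,   0.0,  1.0]])
traj = np.stack([np.linspace(-0.1, 0.1, 8),   # x sweeps left to right
                 np.zeros(8),                  # y fixed
                 np.full(8, 1.0)], axis=1)     # constant depth
mask = render_embodiment_mask(traj, K, 64, 64)
```

The resulting mask can then condition a video model in the same pixel space it was pretrained on, which is the bridge the article describes.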
The App Store model is obsolete; the future belongs to improvisation! Karpathy's radical take gets roasted
机器之心· 2026-02-21 02:57
Editor: Du Wei

Years ago, Apple kicked off the golden age of the mobile internet with "There's an app for that". Ever since, rows of app icons have ruled our digital lives. Now, with the rapid rise of LLMs and agents, that is changing.

Just yesterday, AI luminary Karpathy spoke from first-hand experience and advanced a radical view: future applications should not be "downloaded" but "improvised".

Using his own cardio training as an example: rather than searching an app store for a heart-rate management tool, he had an AI reverse-engineer his treadmill's cloud API and build him a bespoke, deeply personal eight-week experiment dashboard.

The signal is clear: the essence of software is shifting from a ready-made commodity to an ephemeral service. Which raises the question: if apps can be built on demand, do we still need a bloated app store?

[Dashboard screenshot: "RHR 50 -> 45 Experiment", 8-week zone 2 + HIIT plan · Feb 18 - Apr 14, 2026] …
Google reclaims the throne: Gemini 3.1 Pro is here! Yao Shunyu: even better models are coming
机器之心· 2026-02-19 23:43
Core Insights
- Google has launched Gemini 3.1 Pro, an upgraded version of its AI model, to tackle complex challenges in science, research, and engineering [1][4][15]
- The new model demonstrates significant improvements in reasoning capabilities, achieving a verified score of 77.1% on the ARC-AGI-2 benchmark, more than double the performance of its predecessor, Gemini 3 Pro [5][6]

Performance Metrics
- Gemini 3.1 Pro outperforms other models across benchmarks, including:
  - 44.4% on Humanity's Last Exam (academic reasoning) [6]
  - 94.3% on GPQA Diamond (scientific knowledge) [8]
  - 68.5% on Terminal-Bench 2.0 (coding tasks) [6]
  - 80.6% on SWE-Bench Verified (agentic coding) [8]
- In multi-modal understanding, the model reached 92.6% on the MMMLU test [8]

Applications and Features
- Gemini 3.1 Pro can visualize complex topics, organize scattered data, and turn creative projects into reality [12][20]
- Notable applications include:
  1. Generating animated SVG images from text prompts [21]
  2. Integrating complex systems, such as a real-time aviation dashboard [22]
  3. Creating interactive designs, like a 3D simulation of a flock of birds [23]
  4. Transforming literary themes into practical code for modern web design [24]

Deployment and Pricing
- The model is being integrated into a range of consumer and developer products, with a phased rollout starting now [15][26]
- Pricing structure:
  - Developer access through Google AI Studio and the Gemini API, with costs based on token usage [17]
  - Enterprise access via Vertex AI and Gemini Enterprise [17]
  - Consumer access through the Gemini app and NotebookLM [17]

Future Plans
- Google plans to further enhance Gemini 3.1 Pro for autonomous workflows and will soon open it up for broader public use [26]
从AlphaGo到DeepSeek R1,推理的未来将走向何方?
机器之心· 2026-02-19 23:43
Core Insights
- The article discusses the transformative impact of AI, particularly reasoning models that have evolved from basic language models into systems capable of systematic thinking and causal reasoning [1][4]

Group 1: Evolution of AI Models
- Since the introduction of ChatGPT in 2022, AI has shifted from mere statistical language imitation to understanding and manipulating logic [1]
- Eric Jang emphasizes that the real change lies in models beginning to think systematically, which could restructure productivity, organizational forms, and power structures in society [1][4]

Group 2: Capabilities of Modern AI
- Modern programming agents, such as Claude Code, have become proficient in coding and reasoning, allowing users to automate coding tasks and generate hypotheses and conclusions [5][8]
- AI's ability to run experiments and optimize parameters has evolved to the point where it can modify its own code and reflect on experimental results [8][9]

Group 3: Reasoning in AI
- Reasoning can be categorized into deductive and inductive reasoning: the former relies on strict logical rules, the latter on probabilistic judgments [19][20]
- The limitations of traditional reasoning systems highlight the need for AI to handle the complexity and uncertainty of the real world, which neural networks can approximate through end-to-end probabilistic modeling [20][21]

Group 4: Future of AI Reasoning
- The article suggests that the future of reasoning in AI will involve powerful base models that use reinforcement learning and rule-based rewards to enhance reasoning capabilities [38][39]
- There is room for further simplification and optimization of reasoning processes, which could yield significant advances in AI's ability to handle complex tasks [39][40]

Group 5: Implications for Research and Development
- The automation of research processes is expected to become standard, significantly increasing productivity in many fields, including non-AI domains [43]
- Demand for reasoning compute is anticipated to grow astronomically, much as air conditioning transformed productivity in warmer regions [44]
ICLR 2026 | A new "Turing test": when VLA models enter the biology lab
机器之心· 2026-02-19 23:43
Recently, AutoBio, joint work from Luo Ping's team at HKU MMLab and Mu Yao's team at Shanghai Jiao Tong University, was accepted at ICLR 2026 with peer-review scores of 8-8-6-6. AutoBio is a robotic simulation system and benchmark platform for digitized biology laboratories. With this work, we try to systematically answer a key question: are today's mainstream Vision-Language-Action (VLA) models already capable of executing experimental protocols in a real biology lab?

Paper title: AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory
Paper link: https://openreview.net/forum?id=UUE6HEtjhu
Code: https://github.com/autobio-bench/AutoBio
https://huggingface.co/autobio-bench

I. Research background: why biology labs pose a key challenge
Existing VLA research and benchmarks are mostly confined to household settings (tidying a dining table, folding clothes) and lack coverage of professional …
Training rewards too sparse? CUHK and Meituan give agents a "process score"
机器之心· 2026-02-19 23:43
Core Insights
- The article discusses the limitations of traditional reward systems for training agents, which often consider only the final outcome and neglect the complexity of the reasoning process in multi-step tasks [2][3][6]
- The Reagent framework addresses this by providing detailed feedback on the entire reasoning process, rather than just the final answer, thus improving agent training [5][10][12]

Group 1: Problem Identification
- Agents require long-horizon, granular feedback, but most existing systems only provide coarse-grained rewards based on final outcomes [3]
- The traditional approach fails to distinguish a nearly successful attempt from a completely misguided one, discarding valuable learning signal [2][6]

Group 2: Solution Development
- The authors developed a reasoning reward model (Agent-RRM) that evaluates the entire trajectory of an agent's reasoning process, producing scores and critiques [10][11]
- The model outputs an internal analysis, a critique for the agent, and an overall score, enabling a more nuanced view of the agent's performance [10][11]

Group 3: Implementation of the Reagent Framework
- The Reagent framework integrates textual critiques and scoring into training, allowing agents to learn from their reasoning [13][15]
- Three levels of integration are proposed:
  1. Adding critiques without modifying the model (Reagent-C) [15]
  2. Incorporating process scores as additional rewards (Reagent-R) [16]
  3. Training on both initial and revised responses (Reagent-U), highlighted as the most effective [17][18]

Group 4: Experimental Results
- Reagent-U yielded significant performance gains across tasks, with average scores reaching 43.7% on the GAIA benchmark, comparable to much larger models [28][30]
- With process scores, agents became more willing to pursue correct reasoning paths even when the final answer was wrong [27][28]

Group 5: Conclusion
- The Reagent framework successfully incorporates detailed feedback into agent training, showing that even smaller models can be competitive on complex tasks when given comprehensive reasoning evaluations [30][31]
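The Reagent-R idea of adding process scores as extra rewards can be sketched as follows. The reward-model stub, the 0.5 weight, and the trajectory format are illustrative assumptions for exposition, not the paper's actual Agent-RRM or its formulation.

```python
# Sketch: blend a sparse 0/1 outcome reward with a dense process score
# from a reward model, so sound reasoning earns partial credit even
# when the final answer is wrong.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str
    action: str
    observation: str

def stub_process_reward_model(trajectory):
    """Stand-in for a trajectory-level reward model, scoring in [0, 1].
    This toy version rewards steps whose thoughts cite their observation."""
    if not trajectory:
        return 0.0
    grounded = sum(1 for s in trajectory if s.observation in s.thought)
    return grounded / len(trajectory)

def blended_reward(trajectory, outcome_correct, lam=0.5):
    """Outcome reward plus a weighted process score."""
    outcome = 1.0 if outcome_correct else 0.0
    return outcome + lam * stub_process_reward_model(trajectory)

traj = [Step("the page says 42", "search", "42"),
        Step("guessing now", "answer", "41")]
# Wrong final answer, but one of two steps is grounded:
# reward = 0.0 + 0.5 * 0.5 = 0.25 rather than a flat 0.
```

This is exactly the property the summary highlights: a partially sound trajectory is no longer indistinguishable from a completely misguided one.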
Unitree's four and a half minutes at the Spring Festival Gala: the kung fu dream of the world's top humanoid-robot maker
机器之心· 2026-02-19 12:07
Editor: Zenan

This was already Unitree's third appearance at the Spring Festival Gala, yet we were more stunned than ever. At this year's CCTV Gala, the world-leading Unitree Robotics once again turned the stage into a showcase for new technology: a troupe of energetic humanoid robots performed a martial-arts routine, "Wu BOT", with barely a single camera cut.

The lineup included the phenomenally popular G1 and the just-released H2, which wove through formation changes and executed martial-arts moves at a full run. This highly dynamic, highly coordinated, fully autonomous swarm-control technology was a world first, and online discussion erupted before the program had even ended.

Amid the AI wave, the Year-of-the-Horse Gala became an arena for robotics and AI companies, with more than a billion viewers worldwide as judges. Unitree, the most closely watched of them, ended the contest the moment it took the stage.

From the humanoid robots' yangge dance and handkerchief tosses at the 2025 Gala to 2026's martial-arts showcase, the style changed completely in a single year, projecting a striking "liveliness"; netizens joked that after years of practicing the robot dance, it turns out real robots don't stutter at all. For industry insiders, though, the sight triggers a flood of technical detail.

The 24 G1 robots performed fully autonomously: they used onboard 3D lidar for scanning and localization, and invoked motion-control algorithms to execute the martial-arts sequences precisely, …
Helping AI agents "remember" failures: Microsoft proposes the Re-TRAC framework; 4B hits SOTA, 30B beats 358B
机器之心· 2026-02-19 12:07
Imagine asking an AI assistant to explore a complex question with a search tool. Its first exploration heads the wrong way, yet on the second and third attempts it repeats exactly the same mistaken path. You might be able to pick a barely acceptable answer out of the accumulated results, but that is both inefficient and reliant on human intervention. This is the predicament most deep-search agents face today: unable to "remember" prior exploration, they start from scratch every time, wasting effort on redundant searches.

Most existing deep-search agents are built on the ReAct framework and reason linearly: think → call tool → observe → think again. This design works well on simple tasks, but on deep-search tasks requiring many rounds of exploration it tends to fall into local optima, repeated exploration, and inefficient search.

A research team from Southeast University, Microsoft Research Asia, and other institutions proposes a new solution, Re-TRAC (REcursive TRAjectory Compression): a framework that lets an agent "remember" each exploration, pass experience across trajectories, and search progressively more intelligently.

Turning exploration into a progressive learning process

Why does ReAct fail? The core problem is its linear design. Each exploration trajectory is independent, so the model cannot look back at the state of earlier attempts. In long-context settings, plans made early on …
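The loop described above can be sketched in a few lines: after each attempt, the full trajectory is compressed into a short experience note that seeds the next attempt. The summarizer, the rollout, and the toy success criterion are illustrative stand-ins, not Re-TRAC's actual prompts or models.

```python
# Sketch of recursive trajectory compression: accumulate dead ends to
# avoid and leads that paid off, and feed them into the next rollout.
def compress(trajectory, prior):
    """Fold a finished trajectory into the running experience note."""
    avoid = set(prior.get("avoid", []))
    promising = list(prior.get("promising", []))
    for action, ok in trajectory:
        if ok:
            promising.append(action)
        else:
            avoid.add(action)
    return {"avoid": sorted(avoid), "promising": promising}

def explore(goal, experience):
    """Stand-in rollout: try candidate searches, skipping known dead
    ends, and record (action, success) pairs."""
    trajectory = []
    for action in ["query A", "query B", "query C"]:
        if action in experience.get("avoid", []):
            continue                        # remembered failure: skip it
        trajectory.append((action, action == goal))
    return trajectory

experience = {}
for _ in range(3):
    traj = explore("query C", experience)
    experience = compress(traj, experience)  # recursive compression step

# Later attempts no longer repeat the first attempt's dead ends.
```

Contrast this with plain ReAct, where each of the three rollouts would re-run all three queries from scratch.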