Analysis of the Impact of BabyVision Benchmark Release on the Competitive Landscape of Domestic Multimodal Large Models and xbench's Layout Strategy
On January 12, 2026, Sequoia China xbench, together with the UniPat AI team and researchers from multiple large model companies and universities, officially released BabyVision, a multimodal understanding benchmark[1]. The release stems from a key industry insight: large models have made rapid progress in language and text reasoning (writing papers, solving hard problems, and repeatedly placing well in top academic competitions), yet when a problem cannot be ‘put into words’, serious doubts remain about whether models can truly ‘see’ the world[1].
BabyVision’s design concept centers on quantifying the atomic visual capabilities that ‘humans intuitively grasp and that form the foundation of intelligence’. The benchmark divides visual capability into four categories with a total of 22 subtasks[1][2]:
| Capability Category | Core Capability Description | Number of Subtasks |
|---|---|---|
| Fine Discrimination | Extraction and differentiation of fine-grained visual information | 9 |
| Visual Tracking | Recognition of motion trajectories and connectivity | 4 |
| Spatial Perception | Understanding of 3D structures and their relationships | 5 |
| Visual Pattern Recognition | Recognition of logical and geometric patterns | 4 |
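For reference, the taxonomy above can be encoded as a simple data structure. The Python sketch below is illustrative only: the dictionary layout and names are assumptions rather than the official BabyVision schema, with the subtask counts taken from the table.

```python
# Illustrative sketch of the BabyVision capability taxonomy as reported above.
# The dictionary structure and names are assumptions, not the official schema.
BABYVISION_TAXONOMY = {
    "Fine Discrimination": 9,         # fine-grained visual information
    "Visual Tracking": 4,             # motion trajectories and connectivity
    "Spatial Perception": 5,          # 3D structures and their relationships
    "Visual Pattern Recognition": 4,  # logical and geometric patterns
}

# Four categories, 22 subtasks in total, as described in the benchmark release.
assert sum(BABYVISION_TAXONOMY.values()) == 22
```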
What distinguishes this evaluation system is that it minimizes language dependence: the question statements are kept simple, but the answers must be derived from the visual information itself, so the benchmark effectively tests a model’s real visual capability[2].
The BabyVision evaluation results reveal significant deficiencies in the visual foundation capabilities of current multimodal large models, with overall performance falling roughly within the range of a three-year-old child[1][2]. The specific results are as follows:
- Gemini 3-Pro-Preview: 49.7% (top performer)
- GPT-5.2: 34.8%
- Doubao-1.8: 30.2%[2]
- Qwen3VL-235B-Thinking: 22.2%
- Most models cluster in the 12%-19% range[2]
More notably, in the BabyVision-Mini experiment, the research team gave the same set of 20 vision-centric tasks to children of different ages (3, 6, 10, and 12 years old) and to top multimodal models. The results show that most models scored significantly lower than the average three-year-old. Gemini 3-Pro-Preview is the only model that consistently exceeds the three-year-old baseline, yet it still trails six-year-olds by about 20 percentage points[1].
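To make the comparison concrete, the sketch below checks the headline model scores against child baselines in the spirit of BabyVision-Mini. The child baseline percentages are placeholders invented for illustration (the article does not publish them), and the model scores are the full-benchmark numbers rather than the Mini scores, so this is only an illustrative reading.

```python
# Reported BabyVision scores (%) for selected models [1][2].
model_scores = {
    "Gemini 3-Pro-Preview": 49.7,
    "GPT-5.2": 34.8,
    "Doubao-1.8": 30.2,
    "Qwen3VL-235B-Thinking": 22.2,
}

# Placeholder child baselines (%): NOT published figures, chosen only to
# illustrate the kind of comparison made in the BabyVision-Mini experiment.
child_baselines = {"3-year-old": 45.0, "6-year-old": 70.0}

for model, score in model_scores.items():
    above_3yo = score > child_baselines["3-year-old"]
    gap_to_6yo = child_baselines["6-year-old"] - score
    print(f"{model}: {score:.1f}% | above 3-year-old baseline: {above_3yo} | "
          f"gap to 6-year-old: {gap_to_6yo:.1f} pts")
```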
The release of the BabyVision benchmark has far-reaching implications for the competitive landscape of domestic multimodal large models. First, it pinpoints the core shortcoming of current models, the ‘unspeakable’ problem[2]: visual details cannot be losslessly compressed into tokens, and once a model takes the language shortcut of paraphrasing before reasoning, key information is lost in compression. This finding reveals systematic defects across four categories of basic visual capability: non-verbal detail discrimination, visual tracking, spatial imagination, and graphic pattern induction[1][2].
Traditional multimodal evaluations tend to focus on a model’s ability to ‘speak and write’, whereas BabyVision shifts the focus of competition to real visual understanding. For domestic multimodal large model manufacturers, this means that pure language reasoning is no longer enough to build a competitive barrier; shoring up visual foundation capabilities will become the key to the next stage of competition.
The benchmark’s core design concept directly serves the needs of embodied AI as it moves into the real world[2]. Because the real world does not run on language prompts, visual foundation capability has become a prerequisite for embodied AI. BabyVision breaks ‘seeing the world’ down into 22 measurable, diagnosable, and iterable atomic capabilities, providing evaluation standards and a development roadmap for technological breakthroughs in embodied AI[1].
BabyVision also proposes a new direction for generative answering: ‘let the model draw’[1][2]. BabyVision-Gen re-annotates 280 questions from the benchmark that are suitable for generative answering, requiring models to output their problem-solving process as images or video. The experiments show 96% consistency between automatic and human evaluation, and the draw-to-answer approach is closer to how humans operate in tasks such as tracking and fine discrimination. This offers the industry a new research direction: translating visual reasoning into visual operations may be an effective way to compensate for the visual shortcomings of multimodal large models[2].
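The reported 96% consistency is an agreement rate between the automatic grader and human graders. As a minimal illustration, the Python sketch below computes such an agreement rate over per-question correct/incorrect verdicts; the data layout and the exact agreement definition are assumptions, since the source does not detail BabyVision-Gen’s grading pipeline.

```python
from typing import Sequence

def agreement_rate(auto_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of questions where the automatic grader and the human grader
    give the same correct/incorrect verdict. This simple definition is an
    assumption; the source does not specify BabyVision-Gen's exact metric."""
    assert len(auto_labels) == len(human_labels)
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)

# Toy example with made-up verdicts for 5 of the 280 re-annotated questions.
print(agreement_rate([True, True, False, True, False],
                     [True, True, False, False, False]))  # -> 0.8
```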
xbench adopts a dual-track evaluation system, dividing the evaluation benchmarks into two tracks: AGI Tracking and Profession Aligned[1][2]:
| Track | Positioning | Objective |
|---|---|---|
| AGI Tracking | Tracking the evolution process of AGI | Evaluating the upper limit of capabilities and technical boundaries of AI systems |
| Profession Aligned | Quantifying the economic and practical value of models in the real world | Evaluating the utility value of AI systems in real scenarios |
BabyVision belongs to the multimodal benchmark series under the AGI Tracking track of xbench’s dual-track evaluation. Looking ahead to 2026, xbench judges that a new round of breakthroughs will emerge in world models and visual multimodality, and releasing BabyVision at the start of the year positions it to anticipate and participate in that wave[1].
xbench adopts the Evergreen Evaluation mechanism[3]:
- Continuous update and maintenance: xbench-ScienceQA and xbench-DeepSearch refresh their rankings monthly and update the benchmark content at least once a quarter to keep it current
- Black/white box anti-contamination mechanism: while open-sourcing each benchmark, xbench maintains an internal closed-source black-box version; if a model’s ranking differs significantly between the open-source and closed-source versions, the relevant rankings and scores are removed from the leaderboard to prevent leaderboard gaming and benchmark contamination[3] (a rank-consistency sketch follows below)
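One way to picture the black-box check is as a rank-consistency comparison between the open-source and closed-source versions of a benchmark. The Python sketch below flags models whose ranks shift sharply between the two versions; the threshold and data layout are illustrative assumptions, not xbench’s published procedure.

```python
def ranking(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank (1 = best) under a given score table."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def flag_suspicious(open_scores: dict[str, float],
                    closed_scores: dict[str, float],
                    max_rank_shift: int = 1) -> list[str]:
    """Flag models whose rank shifts by more than `max_rank_shift` positions
    between the open-source and closed-source (black-box) runs.
    The threshold is an illustrative assumption, not xbench's actual rule."""
    open_rank, closed_rank = ranking(open_scores), ranking(closed_scores)
    return [m for m in open_scores
            if abs(open_rank[m] - closed_rank[m]) > max_rank_shift]

# Toy scores: model_d looks much stronger on the public set than on the black box.
open_scores = {"model_a": 62.0, "model_b": 55.0, "model_c": 48.0, "model_d": 60.0}
closed_scores = {"model_a": 61.0, "model_b": 54.0, "model_c": 47.0, "model_d": 35.0}
print(flag_suspicious(open_scores, closed_scores))  # -> ['model_d']
```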
xbench has carried out precise layout for different application scenarios[3]:
| Scenario | Benchmark | Core Assessed Capabilities | Target Users |
|---|---|---|---|
| Foundation Models | xbench-ScienceQA | Professional knowledge and reasoning capabilities | Large model researchers |
| AI Agent | xbench-DeepSearch | End-to-end deep search capabilities including planning, searching, reasoning, and summarization | AI Agent developers |
| Multimodal | BabyVision | Real visual understanding capabilities | Multimodal model developers |
| Vertical Fields | Profession-Recruitment/Marketing | Domain-specific Agent capabilities | Vertical industry developers |
xbench has open-sourced its two core benchmarks, xbench-ScienceQA and xbench-DeepSearch, to attract large enterprises, startups, and researchers at home and abroad to take part in building the ecosystem[3]. It has also opened channels for test submission and question co-creation via team@xbench.org. This strategy relies on community effort to drive the continuous evolution of the benchmarks, ensuring that the evaluation system remains cutting-edge and effective.
xbench’s layout strategy has a clear sense of complementary positioning[3]:
- Addressing the fact that classic subject benchmarks are nearly saturated, with scores close to full marks, and can no longer measure model progress: Launch xbench-ScienceQA, focusing on high-difficulty, high-discrimination evaluation in STEM disciplines
- Addressing the issue that there are few graduate-level benchmarks and they are difficult to update: Continuously maintain and dynamically update evaluation content
- Addressing the current situation that the industry lacks high-quality benchmarks for AI Agent deep search capabilities: Launch xbench-DeepSearch, filling the gap in Agent deep search evaluation in the Chinese context
The release of the BabyVision benchmark marks a new stage in the evaluation of domestic multimodal large models. From the perspective of competitive landscape, this benchmark will prompt major model manufacturers to re-examine their own visual capability shortcomings, shifting R&D resources from pure language reasoning capability to real visual understanding. From the perspective of technological development, the proposal of the ‘unspeakable’ problem and the new direction of ‘let the model draw’ provide a clear optimization roadmap for the industry.
Through a systematic layout of a dual-track evaluation system, an evergreen evaluation mechanism, and an open-source co-creation ecosystem, Sequoia China xbench is building a comprehensive evaluation system covering foundation models, AI Agents, multimodal understanding, and vertical fields. This system not only serves current model capability evaluation but also lays the groundwork for the upcoming new round of breakthroughs in world models and visual multimodality. As the AI industry’s attention to embodied AI and world models continues to rise in 2026, xbench’s evaluation layout will play an even more critical guiding role in industry development.
[1] Sina Finance - “Multimodal Large Models Lose to Three-Year-Old Kids? xbench x UniPat Jointly Release New Benchmark BabyVision” (https://finance.sina.com.cn/chanjing/gsnews/2026-01-12/doc-inhfzcnv8496112.shtml)
[2] UniPat AI Blog - “BabyVision Benchmark Release” (https://unipat.ai/blog/BabyVision)
[3] Sequoia China Official Website - “xbench Benchmarks Officially Open Sourced” (https://www.hongshan.com/article/xbench评测集正式开源/)