
FORGE: Fine-grained Multimodal
Evaluation for Manufacturing Scenarios

A comprehensive benchmark combining 2D images and 3D point clouds with fine-grained domain semantics to evaluate 18 state-of-the-art MLLMs on real-world manufacturing quality inspection tasks.

Xiangru Jian1,*   Hao Xu2,*,†   Wei Pang4,*   Xinjian Zhao4   Chengyu Tao5   Qixin Zhang6   Xikun Zhang7,†   Chao Zhang1
Guanzhi Deng8   Alex Xue1   Juan Du9   Tianshu Yu4   Garth Tarr2   Qiuzhuang Sun3   Dacheng Tao6

1University of Waterloo, Canada   2University of Sydney, Australia   3SMU, Singapore   4CUHK, Shenzhen, China
5Hunan University, China   6NTU, Singapore   7RMIT University, Australia   8City University of Hong Kong, China   9HKUST (Guangzhou), China
*Equal contribution   †Corresponding author

Paper · Code · Dataset · Leaderboard
18 MLLMs Evaluated · 3 Manufacturing Tasks · 3,320 Evaluation Cases · 14 Workpiece Categories · 90 Distinct Model Numbers

Overview

Existing VLM benchmarks test coarse-grained perception ("What is this?"). FORGE demands fine-grained, model-number-level understanding for real manufacturing quality control.

FORGE vs Previous Benchmarks

Why FORGE?

Previous benchmarks ask "Is this a screw?" FORGE asks "Is this an M8 screw, and does it match the M10 specification required for this assembly?"

We provide the first comprehensive evaluation combining 2D images and 3D point clouds with domain-specific manufacturing knowledge, covering 14 workpiece categories and 90 distinct model specifications.

Our evaluation of 18 frontier and open-source MLLMs reveals that visual grounding is not the bottleneck — instead, insufficient domain-specific knowledge is the primary limitation.

FORGE Pipeline

From physical workpieces to standardized evaluation: FORGE bridges the gap between industrial reality and MLLM reasoning.


Figure 1. The FORGE pipeline: (1) Raw manufacturing data from the physical world, (2) Standardization with fine-grained domain knowledge injection, (3) Task-oriented scenarios with both 3D point clouds and 2D rendered views, (4) MLLM cognition evaluation revealing macro-perception vs. micro-reasoning gaps.
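
The three-view cases feed the MLLMs 2D renders of the underlying 3D scans. As a rough illustration of the idea only, the sketch below produces front/side/top orthographic projections from a raw point cloud; the actual FORGE rendering pipeline (viewpoints, coloring, resolution) is not specified here, so treat the function and its defaults as assumptions.

```python
# Illustrative sketch: orthographic "three-view" renders from a point cloud.
# Not the official FORGE rendering code; parameters are placeholder choices.
import numpy as np
import matplotlib.pyplot as plt

def three_view(points: np.ndarray, out_prefix: str = "workpiece") -> None:
    """Save front / side / top projections of an (N, 3) point cloud as PNGs."""
    views = {
        "front": points[:, [0, 2]],  # x-z plane (viewed along y)
        "side":  points[:, [1, 2]],  # y-z plane (viewed along x)
        "top":   points[:, [0, 1]],  # x-y plane (viewed along z)
    }
    for name, xy in views.items():
        fig, ax = plt.subplots(figsize=(3, 3))
        ax.scatter(xy[:, 0], xy[:, 1], s=0.5)
        ax.set_aspect("equal")
        ax.axis("off")
        fig.savefig(f"{out_prefix}_{name}.png", dpi=200, bbox_inches="tight")
        plt.close(fig)

# Toy example: a noisy cylinder standing along the z-axis.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 5000)
z = rng.uniform(0.0, 1.0, 5000)
pts = np.stack([np.cos(theta), np.sin(theta), z], axis=1)
three_view(pts)
```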

Manufacturing Tasks

Three core quality inspection tasks spanning workpiece verification, surface defect inspection, and assembly verification.

 Task 1: WorkVeri

Workpiece Verification — Given an assembly, identify which part has the wrong model number. Requires fine-grained discrimination between similar-looking parts (e.g., M8 vs M10 bolts).

451 image + 496 three-view cases

 Task 2: SurfInsp

Surface Inspection — Classify manufacturing defects: crack, cut, deformation, or dent. Two-part answer: first determine whether the surface is normal, then identify the defect type.

830 three-view cases, 4 defect types

 Task 3: AssyVeri

Assembly Verification — Identify extra, mismatched, or missing parts in an assembly. Tests compositional understanding of multi-component systems.

857 image + 309 three-view + 377 missing-part cases

Task Examples

Figure 2. Examples for each task showing inputs (2D images / 3D three-view renders), dialogue format, and expected outputs.
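
To make the input format concrete, a hypothetical WorkVeri evaluation record might look like the sketch below. The field names, option wording, and exact-match scoring are illustrative assumptions, not the released FORGE schema; see the dataset release for the real format.

```python
# Hypothetical FORGE-style evaluation record (field names are illustrative,
# not the official schema). Each case pairs visual inputs with a
# multiple-choice question and a ground-truth label.
example_case = {
    "task": "WorkVeri",                      # WorkVeri | SurfInsp | AssyVeri
    "modality": "three_view",                # "image" or "three_view"
    "inputs": [
        "renders/assembly_0412_front.png",   # three orthographic renders
        "renders/assembly_0412_side.png",    # of the same 3D scan
        "renders/assembly_0412_top.png",
    ],
    "question": (
        "One part in this assembly does not match its specified model "
        "number. Which part is it?"
    ),
    "options": {"A": "M8 hex bolt", "B": "M10 hex bolt",
                "C": "flat washer", "D": "spring washer"},
    "answer": "B",
}

def is_correct(prediction: str, case: dict) -> bool:
    """Exact-match scoring on the option letter (assumed scoring rule)."""
    return prediction.strip().upper() == case["answer"]
```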

Evaluation Results

Comprehensive evaluation of 18 MLLMs across all tasks under zero-shot (Z), reference (R), and few-shot in-context learning (F) settings.
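
As a rough illustration of how the three settings differ, the sketch below assembles one prompt per setting. It assumes a generic chat-style message format and reuses the hypothetical record fields from the sketch above; the paper's exact prompt wording may differ.

```python
# Sketch of the Z / R / F evaluation settings (prompt wording and message
# structure are assumptions, not the paper's exact prompting code).
def build_messages(case, setting="Z", reference_images=(), demos=()):
    """Build a chat-style message list for one case.

    setting: "Z" zero-shot, "R" with normal reference images prepended,
             "F" with few-shot in-context demonstrations.
    """
    def turn(role, images, text):
        parts = [{"type": "image", "path": p} for p in images]
        parts.append({"type": "text", "text": text})
        return {"role": role, "content": parts}

    messages = []
    if setting == "F":
        for demo in demos:  # labeled examples shown before the query
            messages.append(turn("user", demo["inputs"], demo["question"]))
            messages.append({"role": "assistant",
                             "content": [{"type": "text", "text": demo["answer"]}]})

    images = list(case["inputs"])
    if setting == "R":
        # defect-free / correct-spec reference views come first
        images = list(reference_images) + images

    question = case["question"] + " Answer with the option letter only."
    messages.append(turn("user", images, question))
    return messages
```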

Model Performance Radar

Figure 3. Radar chart comparing open-source and closed-source model performance across all tasks and modalities.

Task 1: Workpiece Verification (Accuracy %)

| Model | Image (Z) | Image (R) | Image (F) | Three-View (Z) | Three-View (R) | Three-View (F) |
|---|---|---|---|---|---|---|
| Closed-Source Models | | | | | | |
| GPT-5 | 74.7 | 64.2 | 85.2 | 60.7 | 29.8 | 56.5 |
| GPT-5.2 | 70.9 | 51.9 | 79.2 | 61.3 | 28.2 | 61.7 |
| GPT-5 Mini | 73.6 | 72.7 | 76.8 | 37.9 | 33.9 | 30.6 |
| O3 | 75.2 | 64.1 | 76.3 | 57.9 | 28.8 | 56.9 |
| Gemini-3-Flash | 72.2 | 76.3 | 82.3 | 69.6 | 65.1 | 67.3 |
| Gemini-2.5-Flash | 55.8 | 51.7 | 54.5 | 56.0 | 43.5 | 53.6 |
| Claude-Opus-4.5 | 59.4 | 56.1 | 61.0 | 39.1 | 27.8 | 55.0 |
| Seed-1.6 | 67.0 | 49.7 | 70.3 | 54.4 | 26.8 | 46.2 |
| Kimi-K2.5 | 60.1 | 56.7 | 62.7 | 19.4 | 10.1 | 44.0 |
| Open-Source Models | | | | | | |
| Qwen3-VL-235B | 64.1 | 57.6 | 66.3 | 51.4 | 32.9 | 34.3 |
| Qwen3-VL-8B | 35.3 | 23.9 | 25.1 | 41.3 | 26.8 | 26.2 |
| InternVL3-78B | 32.6 | 53.9 | 65.3 | 32.3 | 24.8 | 30.0 |
| Llama-4-MAV | 37.3 | 39.0 | 48.8 | 36.3 | 20.4 | 35.7 |
| GLM-4.6V | 25.3 | 50.3 | 49.8 | 46.4 | 43.8 | 44.4 |
| Gemma-3-27B | 25.9 | 30.4 | 34.6 | 27.6 | 23.6 | 28.8 |
| Mistral-3-Large | 25.7 | 25.4 | 31.0 | 33.9 | 20.8 | 37.3 |
| Mistral-3-14B | 32.5 | 20.2 | 24.2 | 32.5 | 25.0 | 28.4 |
| Mistral-3-8B | 29.9 | 24.2 | 32.0 | 30.8 | 18.1 | 28.6 |

Task 2: Surface Inspection (Accuracy %)

| Model | Zero-Shot | Reference | Few-Shot |
|---|---|---|---|
| Closed-Source Models | | | |
| Gemini-3-Flash | 18.5 | 29.6 | 47.1 |
| Claude-Opus-4.5 | 8.7 | 7.7 | 44.3 |
| Seed-1.6 | 22.6 | 36.2 | 42.3 |
| O3 | 21.1 | 36.2 | 40.0 |
| GPT-5 | 22.0 | 35.7 | 38.3 |
| Gemini-2.5-Flash | 17.2 | 26.4 | 38.1 |
| GPT-5 Mini | 17.0 | 33.4 | 36.2 |
| GPT-5.2 | 16.6 | 21.9 | 31.7 |
| Kimi-K2.5 | 13.2 | 16.8 | 30.1 |
| Open-Source Models | | | |
| Llama-4-MAV | 27.0 | 24.1 | 39.1 |
| Mistral-3-8B | 24.3 | 27.1 | 38.9 |
| Gemma-3-27B | 21.7 | 23.9 | 33.2 |
| Mistral-3-14B | 28.3 | 27.7 | 33.2 |
| Qwen3-VL-235B | 19.2 | 18.7 | 32.2 |
| GLM-4.6V | 23.5 | 23.8 | 38.4 |
| Mistral-3-Large | 19.8 | 19.8 | 26.7 |
| InternVL3-78B | 19.2 | 21.6 | 25.7 |
| Qwen3-VL-8B | 19.4 | 21.3 | 25.8 |

Task 3: Assembly Verification (Accuracy %)

| Model | Image (Z) | Image (R) | Image (F) | Three-View (Z) | Three-View (R) | Three-View (F) |
|---|---|---|---|---|---|---|
| Closed-Source Models | | | | | | |
| Gemini-3-Flash | 58.1 | 70.4 | 71.4 | 47.2 | 46.3 | 47.6 |
| GPT-5.2 | 43.2 | 53.5 | 63.2 | 54.0 | 57.0 | 53.7 |
| Claude-Opus-4.5 | 52.1 | 56.4 | 62.9 | 42.1 | 50.8 | 45.3 |
| O3 | 48.2 | 59.8 | 61.0 | 49.8 | 34.0 | 51.8 |
| GPT-5 | 50.1 | 49.8 | 60.5 | 53.7 | 30.7 | 51.8 |
| GPT-5 Mini | 43.6 | 51.1 | 52.2 | 49.5 | 29.4 | 45.3 |
| Seed-1.6 | 39.7 | 42.5 | 47.9 | 41.4 | 26.5 | 34.3 |
| Gemini-2.5-Flash | 39.6 | 32.2 | 46.7 | 45.0 | 40.1 | 44.0 |
| Kimi-K2.5 | 18.8 | 11.7 | 25.8 | 11.7 | 6.5 | 22.0 |
| Open-Source Models | | | | | | |
| Qwen3-VL-235B | 36.9 | 40.2 | 50.2 | 41.1 | 28.8 | 39.5 |
| GLM-4.6V | 27.9 | 37.2 | 45.3 | 38.8 | 36.6 | 37.9 |
| InternVL3-78B | 33.1 | 36.1 | 42.4 | 34.3 | 22.0 | 28.8 |
| Llama-4-MAV | 32.1 | 25.8 | 36.3 | 38.2 | 30.1 | 37.5 |
| Qwen3-VL-8B | 31.9 | 28.4 | 30.3 | 39.2 | 33.0 | 38.8 |
| Gemma-3-27B | 28.3 | 31.7 | 32.9 | 27.2 | 32.0 | 32.0 |
| Mistral-3-8B | 29.6 | 26.8 | 30.3 | 24.6 | 18.4 | 29.4 |
| Mistral-3-14B | 29.7 | 32.0 | 28.8 | 28.2 | 19.1 | 25.9 |
| Mistral-3-Large | 29.1 | 23.7 | 28.5 | 26.2 | 26.9 | 25.9 |

Grounding Ablation (Accuracy %)

Testing spatial grounding ability independently of domain knowledge. High grounding accuracy confirms that visual perception is not the bottleneck.

| Model | Single-Image C→L | Single-Image L→C | Cross-Image L→L | Cross-Image C→C | Avg |
|---|---|---|---|---|---|
| Gemini-3-Flash | 98.2 | 99.6 | 88.7 | 79.9 | 91.6 |
| Qwen3-VL-235B | 85.4 | 98.8 | 80.3 | 65.7 | 82.6 |
| GPT-5.2 | 74.6 | 97.6 | 85.6 | 75.4 | 83.3 |
| Seed-1.6 | 42.0 | 99.2 | 79.3 | 71.2 | 72.9 |
| Mistral-3-8B | 66.0 | 70.6 | 62.0 | 33.9 | 58.1 |
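
The Avg column appears to be the unweighted mean of the four grounding sub-scores; the short check below reproduces it, up to rounding, from the values transcribed above.

```python
# Sanity check: Avg ≈ mean of the four grounding settings (values from the table).
reported = {
    "Gemini-3-Flash": ([98.2, 99.6, 88.7, 79.9], 91.6),
    "Qwen3-VL-235B":  ([85.4, 98.8, 80.3, 65.7], 82.6),
    "GPT-5.2":        ([74.6, 97.6, 85.6, 75.4], 83.3),
    "Seed-1.6":       ([42.0, 99.2, 79.3, 71.2], 72.9),
    "Mistral-3-8B":   ([66.0, 70.6, 62.0, 33.9], 58.1),
}
for model, (scores, avg) in reported.items():
    assert abs(sum(scores) / len(scores) - avg) <= 0.06  # agrees within rounding
```
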
Model Comparison

Figure 4. Model-number-level tasks (darker bars) are significantly harder than coarse workpiece-level tasks (lighter bars), revealing the domain knowledge gap.

Data Examples

Sample images from each task showing the diversity of industrial components, defect types, and evaluation modalities.

Error Analysis

Analyzing where and why MLLMs fail on manufacturing tasks.


Figure 5. Representative error cases from Gemini-2.5-Flash. Left: the model incorrectly identifies a flat washer material mismatch (predicting E instead of A). Right: the model provides extensive reasoning about worn parts but selects the wrong component (predicting B instead of D).

Key Findings

Insights from evaluating 18 state-of-the-art MLLMs on manufacturing quality inspection.

🔎

Visual Grounding is Not the Bottleneck

Frontier models achieve 86–98% accuracy on spatial grounding ablations, confirming they can locate and match parts across images. The limitation lies elsewhere.

📚

Domain Knowledge is the Primary Gap

MLLMs lack fine-grained manufacturing knowledge (model numbers, specifications). Model-number tasks are 2–3x harder than coarse workpiece identification.

📈

ICL Consistently Helps

In-context learning with labeled examples improves performance across nearly all models, with gains of 10–20% on average. Best result: GPT-5 reaches 85.2% on Task 1 with ICL.

⚠️

References Can Hurt

Counter-intuitively, normal reference images sometimes degrade performance on three-view tasks, suggesting models struggle with multi-image comparative reasoning.

🛠

SFT Shows 90.8% Relative Gain

A 3B-parameter model fine-tuned on domain data matches a 235B model, demonstrating that domain-specific training dramatically closes the gap.

📡

Task 2 is the Hardest

Surface defect classification peaks at only 47.1% accuracy (Gemini-3-Flash with ICL), suggesting fine-grained defect discrimination remains an open challenge.

SFT Results

Figure 6. Domain-specific supervised fine-tuning on a 3B model achieves +25.6% absolute improvement on held-out manufacturing scenarios, matching the 235B model's performance.

Citation

If you find FORGE useful, please cite our paper:

@inproceedings{jian2026forge,
  title     = {FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios},
  author    = {Jian, Xiangru and Xu, Hao and Pang, Wei and Zhao, Xinjian and Tao, Chengyu and Zhang, Qixin and Zhang, Xikun and Zhang, Chao and Deng, Guanzhi and Xue, Alex and Du, Juan and Yu, Tianshu and Tarr, Garth and Sun, Qiuzhuang and Tao, Dacheng},
  booktitle = {arXiv},
  year      = {2026}
}