MLPerf Inference Benchmarks
Overview
The MLPerf Inference Benchmarks that are valid as of the MLPerf Inference v4.0 round are listed below, categorized by task. Under each model you can find details such as the dataset used, reference accuracy, and server latency constraints.
Image Classification
ResNet50-v1.5
- Dataset: ImageNet-2012 (224x224) Validation
- Dataset Size: 50,000
- QSL Size: 1,024
- Number of Parameters: 25.6 million
- FLOPs: 3.8 billion
- Reference Model Accuracy: 76.46% ACC
- Server Scenario Latency Constraint: 15ms
- Equal Issue mode: False
- High accuracy variant: No
- Submission Category: Datacenter, Edge
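In the Server scenario, the latency constraint is checked against a tail percentile of per-query latencies rather than the mean (MLPerf uses the 99th percentile for most benchmarks). A minimal sketch of that check, with made-up latency numbers for illustration:

```python
def meets_server_constraint(latencies_ms, constraint_ms, percentile=0.99):
    """Return True if the given tail percentile of latencies is within the constraint."""
    ordered = sorted(latencies_ms)
    # Index of the percentile sample, clamped to the last element.
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] <= constraint_ms

# 99 fast queries and one 16 ms outlier: the 99th-percentile sample is the outlier,
# so this trace would fail the ResNet50-v1.5 constraint of 15 ms.
samples = [10.0] * 99 + [16.0]
print(meets_server_constraint(samples, 15.0))  # False
```

This is only a sketch of the percentile rule; in a real submission, LoadGen measures the latencies and performs this validation itself.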
Text to Image
Stable Diffusion
- Dataset: Subset of Coco2014
- Dataset Size: 5,000
- QSL Size: 5,000
- Number of Parameters: 3.5 billion
- FLOPs: 1.28 - 2.4 trillion
- Reference Model Accuracy (fp32): CLIP: 31.74981837, FID: 23.48046692
- Required Accuracy (Closed Division):
- CLIP: 31.68631873 ≤ CLIP ≤ 31.81331801 (within 0.2% of the reference model CLIP score)
- FID: 23.01085758 ≤ FID ≤ 23.95007626 (within 2% of the reference model FID score)
- Equal Issue mode: False
- High accuracy variant: No
- Submission Category: Datacenter, Edge
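The closed-division bounds above follow directly from the reference scores: the CLIP window is ±0.2% of the reference CLIP score and the FID window is ±2% of the reference FID score. A short sketch deriving them from the values in this section:

```python
CLIP_REF = 31.74981837
FID_REF = 23.48046692

# CLIP must stay within 0.2% of the reference; FID within 2%.
clip_lo, clip_hi = CLIP_REF * (1 - 0.002), CLIP_REF * (1 + 0.002)
fid_lo, fid_hi = FID_REF * (1 - 0.02), FID_REF * (1 + 0.02)

print(f"CLIP: {clip_lo:.8f} <= CLIP <= {clip_hi:.8f}")
print(f"FID:  {fid_lo:.8f} <= FID <= {fid_hi:.8f}")
```

Note that FID is bounded on both sides even though lower FID is normally better: the closed division requires matching the reference model's behavior, not beating it.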
Object Detection
Retinanet
- Dataset: OpenImages
- Dataset Size: 24,781
- QSL Size: 64
- Number of Parameters: TBD
- Reference Model Accuracy (fp32): 0.3755 mAP
- Server Scenario Latency Constraint: 100ms
- Equal Issue mode: False
- High accuracy variant: No
- Submission Category: Datacenter, Edge
Medical Image Segmentation
3d-unet
- Dataset: KiTS2019
- Dataset Size: 42
- QSL Size: 42
- Number of Parameters: 32.5 million
- FLOPs: 100-300 billion
- Reference Model Accuracy (fp32): 0.86330 Mean DICE Score
- Server Scenario: Not Applicable
- Equal Issue mode: True
- High accuracy variant: Yes
- Submission Category: Datacenter, Edge
Language Tasks
Question Answering
Bert-Large
- Dataset: Squad v1.1 (384 Sequence Length)
- Dataset Size: 10,833
- QSL Size: 10,833
- Number of Parameters: 340 million
- FLOPs: ~128 billion
- Reference Model Accuracy (fp32): F1 Score = 90.874%
- Server Scenario Latency Constraint: 130ms
- Equal Issue mode: False
- High accuracy variant: Yes
- Submission Category: Datacenter, Edge
LLAMA2-70B
- Dataset: OpenORCA (GPT-4 split, max_seq_len=1024)
- Dataset Size: 24,576
- QSL Size: 24,576
- Number of Parameters: 70 billion
- FLOPs: ~500 trillion
- Reference Model Accuracy (fp32):
- Rouge1: 44.4312
- Rouge2: 22.0352
- RougeL: 28.6162
- Tokens_per_sample: 294.45
- Server Scenario Latency Constraint:
- TTFT: 2000ms
- TPOT: 200ms
- Equal Issue mode: True
- High accuracy variant: Yes
- Submission Category: Datacenter
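For the LLM benchmarks, the server constraint is expressed as time-to-first-token (TTFT) and time-per-output-token (TPOT) rather than a single end-to-end latency. A hedged sketch of checking one query's token-arrival trace against those limits (the trace format and function name are illustrative assumptions, not the LoadGen API):

```python
def check_llm_latency(token_times_ms, ttft_limit_ms=2000.0, tpot_limit_ms=200.0):
    """token_times_ms: arrival time of each generated token, measured from query issue."""
    ttft = token_times_ms[0]
    # TPOT here is the mean inter-token gap after the first token.
    if len(token_times_ms) > 1:
        tpot = (token_times_ms[-1] - token_times_ms[0]) / (len(token_times_ms) - 1)
    else:
        tpot = 0.0
    return ttft <= ttft_limit_ms and tpot <= tpot_limit_ms

# First token at 1500 ms, then one token every 150 ms: within 2000 ms TTFT / 200 ms TPOT.
trace = [1500.0 + 150.0 * i for i in range(10)]
print(check_llm_latency(trace))  # True
```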
Text Summarization
GPT-J
- Dataset: CNN Daily Mail v3.0.0
- Dataset Size: 13,368
- QSL Size: 13,368
- Number of Parameters: 6 billion
- FLOPs: ~148 billion
- Reference Model Accuracy (fp32):
- Rouge1: 42.9865
- Rouge2: 20.1235
- RougeL: 29.9881
- Gen_len: 4,016,878
- Server Scenario Latency Constraint: 20s
- Equal Issue mode: True
- High accuracy variant: Yes
- Submission Category: Datacenter, Edge
Mixed Tasks (Question Answering, Math, and Code Generation)
Mixtral-8x7B
- Datasets:
- OpenORCA (5k samples of GPT-4 split, max_seq_len=2048)
- GSM8K (5k samples of the validation split, max_seq_len=2048)
- MBXP (5k samples of the validation split, max_seq_len=2048)
- Dataset Size: 15,000
- QSL Size: 15,000
- Number of Parameters: 47 billion
- Reference Model Accuracy (fp16):
- OpenORCA
- Rouge1: 45.4911
- Rouge2: 23.2829
- RougeL: 30.3615
- Tokens_per_sample: 294.45
- GSM8K Accuracy: 73.78%
- MBXP Accuracy: 60.12%
- Server Scenario Latency Constraint:
- TTFT: 2000ms
- TPOT: 200ms
- Equal Issue mode: True
- High accuracy variant: No
- Submission Category: Datacenter
Recommendation
DLRM_v2
- Dataset: Synthetic Multihot Criteo
- Dataset Size: 204,800
- QSL Size: 204,800
- Number of Parameters: ~23 billion
- Reference Model Accuracy: AUC = 80.31%
- Server Scenario Latency Constraint: 60ms
- Equal Issue mode: False
- High accuracy variant: Yes
- Submission Category: Datacenter
Submission Categories
- Datacenter Category: All the current inference benchmarks are applicable to the datacenter category.
- Edge Category: All benchmarks except DLRMv2, LLAMA2-70B, and Mixtral-8x7B are applicable to the edge category.
High Accuracy Variants
- Benchmarks: bert, llama2-70b, gpt-j, dlrm_v2, and 3d-unet have a normal accuracy variant as well as a high accuracy variant.
- Requirement: The high accuracy variant must achieve at least 99.9% of the reference model accuracy, compared to the default 99% accuracy requirement.
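The 99% / 99.9% requirement can be made concrete by computing both targets from the reference accuracies listed above. A small sketch for the metrics where higher is better (the label strings are just for display):

```python
# Reference accuracies taken from the benchmark sections above.
REFERENCE = {
    "bert (F1 %)": 90.874,
    "gpt-j (Rouge1)": 42.9865,
    "llama2-70b (Rouge1)": 44.4312,
    "dlrm_v2 (AUC %)": 80.31,
    "3d-unet (Mean DICE)": 0.86330,
}

for name, ref in REFERENCE.items():
    normal, high = ref * 0.99, ref * 0.999  # 99% and 99.9% of reference
    print(f"{name}: normal >= {normal:.5f}, high accuracy >= {high:.5f}")
```

For example, the bert high accuracy target works out to an F1 score of about 90.783%, versus roughly 89.965% for the normal variant.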