
Model Kernel Coverage

This document tracks which kernels are supported in FlashInfer-Bench for each model.
  • βœ… Definition JSON exists and workload has been collected
  • 🟑 Definition JSON exists but workload has not yet been collected
  • ❌ Definition is referenced in models.ts but the file does not exist (missing)
  • β€” Module exists in the architecture but no definition is mapped (unmapped)

Summary

| Model | Architecture | Coverage |
| --- | --- | --- |
| DeepSeek V3/R1 | MLA + Dense/MoE | 🟡 Partial |
| DeepSeek V3.2 | DSA + Dense/MoE | ✅ Fully covered |
| Llama 3.1 8B | GQA + Dense | ✅ Fully covered |
| Llama 3.1/3.3 70B | GQA + Dense | 🟡 Partial |
| Llama 3.2 3B | GQA + Dense | 🟡 Partial |
| Mistral 7B v0.3 | GQA + Dense | 🟡 Partial |
| Mistral Nemo 12B | GQA + Dense | 🟡 Partial |
| Mixtral 8x7B | GQA + MoE | 🟡 Partial |
| Mixtral 8x22B | GQA + MoE | 🟡 Partial |
| Qwen2.5 7B | GQA + Dense | 🟡 Partial |
| Qwen2.5 72B | GQA + Dense | 🟡 Partial |
| Qwen3 8B | GQA + Dense | 🟡 Partial |
| Qwen3 30B A3B | GQA + MoE | 🟡 Partial |
| Qwen3 32B | GQA + Dense | 🟡 Partial |
| Qwen3 235B A22B | GQA + MoE | 🟡 Partial |
| Qwen3 Next 80B A3B | GDN + GQA + MoE | 🟡 Partial |
| Kimi K2 | MLA + MoE | 🟡 Partial |
| Phi-4 14B | GQA + Dense | 🟡 Partial |
| Llama 3.1 405B | GQA + Dense | 🟡 Partial |
| Llama 4 Scout 17B-16E | GQA + MoE | 🟡 Partial |
| Llama 4 Maverick 17B-128E | GQA + MoE | 🟡 Partial |
| Mistral Small 3.1 24B | GQA + Dense | 🟡 Partial |
| GLM-4.6 | GQA + Dense | ❌ Not covered |
| MiniMax-Text-01 | Lightning Attn + MoE | ❌ Not covered |
| MiniMax M2 | GQA + MoE | 🟡 Partial |
| Gemma 3 27B | GQA + Dense | 🟡 Partial |
| Qwen3 14B | GQA + Dense | 🟡 Partial |
| NemotronH 47B | GQA + Mamba2 Hybrid | ❌ Not covered |

DeepSeek V3 / R1

Architecture: 61 decoder layers, MLA attention, hybrid Dense+MoE FFN
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| gemm_n256_k7168 | gemm | ✅ |
| mla_ragged_prefill_causal_h16_qk192_vo128 | mla_ragged | ✅ |
| mla_paged_prefill_causal_h16_ckv512_kpe64_ps1 | mla_paged | ✅ |
| mla_paged_prefill_causal_h16_ckv512_kpe64_ps64 | mla_paged | ✅ |
| mla_paged_decode_h16_ckv512_kpe64_ps1 | mla_paged | ✅ |
| mla_paged_decode_h16_ckv512_kpe64_ps64 | mla_paged | ✅ |
| moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048 | moe | ✅ |
| top_k_sampling_from_probs_v129280 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v129280 | sampling | ✅ |
| top_p_sampling_from_probs_v129280 | sampling | ✅ |
Coverage: 13 / 14 definitions present. Missing: MLA ragged prefill definition.

DeepSeek V3.2

Architecture: 61 decoder layers, DSA (DeepSeek Sparse Attention) replacing dense MLA, hybrid Dense+MoE FFN. Standard serving configuration: TP=8. DSA introduces a learned TopK indexer that selects a sparse subset of KV pages before running attention, reducing computation for long contexts while preserving accuracy.
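
The sketch below illustrates the two-stage flow described above for a single decode step. It is illustrative only: the function and argument names are hypothetical, dense PyTorch ops stand in for the real kernels, and the real indexer is a learned FP8 module rather than the dot-product score used here.

```python
import torch

# Illustrative sketch only (not the FlashInfer-Bench API) of the two-stage DSA flow.
# Dimensions follow the definition names: ckv=512 (compressed latent KV),
# kpe=64 (RoPE part), topk=2048 selected positions.
def dsa_decode_step(q_nope, q_pe, ckv_cache, kpe_cache, topk=2048):
    # q_nope: [heads, 512], q_pe: [heads, 64]
    # ckv_cache: [seq, 512], kpe_cache: [seq, 64]
    k = min(topk, ckv_cache.shape[0])

    # Stage 1 -- TopK indexer: score every cached position and keep the best k.
    # (The real indexer is a small learned FP8 module; a dot product stands in.)
    index_scores = q_pe.mean(dim=0) @ kpe_cache.T          # [seq]
    keep = index_scores.topk(k).indices                    # [k]

    # Stage 2 -- MLA-style attention restricted to the selected positions only.
    ckv_sel, kpe_sel = ckv_cache[keep], kpe_cache[keep]
    logits = q_nope @ ckv_sel.T + q_pe @ kpe_sel.T         # [heads, k]
    probs = torch.softmax(logits * (512 + 64) ** -0.5, dim=-1)  # illustrative scale
    return probs @ ckv_sel                                 # [heads, 512]
```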
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| dsa_topk_indexer_fp8_h64_d128_topk2048_ps64 | dsa_paged | ✅ |
| dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps1 | dsa_paged | ✅ |
| dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64 | dsa_paged | ✅ |
| moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048 | moe | ✅ |
Coverage: 8 / 8 definitions present. Fully covered.

Llama 3.1 8B

Architecture: 32 decoder layers, GQA attention, dense MLP
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n28672_k4096 | gemm | ✅ |
| gemm_n4096_k14336 | gemm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 14 / 14 definitions present. Fully covered.

Qwen3 30B A3B

Architecture: 32 decoder layers, GQA attention, MoE FFN (30 MoE + 2 dense layers)
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h128 | rmsnorm | ✅ |
| rmsnorm_h2048 | rmsnorm | ✅ |
| fused_add_rmsnorm_h2048 | rmsnorm | ✅ |
| gemm_n128_k2048 | gemm | ✅ |
| gemm_n2048_k4096 | gemm | ✅ |
| gemm_n5120_k2048 | gemm | ✅ |
| gqa_paged_prefill_causal_h32_kv4_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv4_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv4_d128 | gqa_ragged | ✅ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
| MoE gate / topk / experts | moe | — |
| moe_fp8_block_scale_renorm_topk8_e128_h2048_i768 | moe EP=1 | 🟡 |
| trtllm_fp4_block_scale_moe_topk8_e128_h2048_i768 | moe (TRT-LLM FP4) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk8_e128_h2048_i768 | moe (TRT-LLM FP4 routed) | 🟡 |
| trtllm_fp8_per_tensor_scale_moe_topk8_e128_h2048_i768 | moe (TRT-LLM FP8) | 🟡 |
Coverage: 14 / 14 referenced definitions present. MoE kernels added (not yet mapped in models.ts).

Qwen3 Next 80B A3B

Architecture: 48 layers total (36 GDN linear-attention layers + 12 GQA standard-attention layers); all layers use MoE FFN. Standard serving configuration: TP=2 or TP=4.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h2048 | rmsnorm | ✅ |
| fused_add_rmsnorm_h2048 | rmsnorm | ✅ |
| gdn_prefill_qk16_v32_d128_k_last | gdn TP=1 | 🟡 |
| gdn_prefill_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_prefill_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gdn_decode_qk16_v32_d128_k_last | gdn TP=1 | 🟡 |
| gdn_decode_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_decode_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gdn_mtp_qk16_v32_d128_k_last | gdn TP=1 | 🟡 |
| gdn_mtp_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_mtp_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gqa_paged_prefill_causal_h8_kv1_d256_ps1 | gqa_paged TP=2 | ❌ |
| gqa_paged_decode_h8_kv1_d256_ps1 | gqa_paged TP=2 | ❌ |
| gqa_ragged_prefill_causal_h8_kv1_d256 | gqa_ragged TP=2 | ✅ |
| MoE gate / topk / experts (GDN layers) | moe | — |
| MoE gate / topk / experts (GQA layers) | moe | — |
| moe_fp8_block_scale_renorm_topk10_e128_h2048_i512 | moe EP=1 | 🟡 |
| trtllm_fp4_block_scale_moe_topk10_e128_h2048_i512 | moe (TRT-LLM FP4, EP=4) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk10_e128_h2048_i512 | moe (TRT-LLM FP4 routed, EP=4) | 🟡 |
Coverage: 10 / 14 referenced definitions present. MoE definition added (shared across GDN and GQA layers). Missing GDN definitions: TP=1 prefill and decode (qk16_v32). Missing GQA: h=8, kv=1, d=256 (TP=2 of original h=16, kv=2, d=256).

Llama 3.1 / 3.3 70B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Llama 3.1 70B and 3.3 70B share identical architecture dimensions; only training data and context window differ.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h8192 | rmsnorm | ❌ |
| fused_add_rmsnorm_h8192 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps1 | gqa_paged TP=4 | ✅ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv2_d128_ps1 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv2_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_ragged_prefill_causal_h16_kv2_d128 | gqa_ragged TP=4 | ✅ |
| gemm_n10240_k8192 | gemm | 🟡 |
| gemm_n8192_k8192 | gemm | 🟡 |
| gemm_n57344_k8192 | gemm | 🟡 |
| gemm_n8192_k28672 | gemm | 🟡 |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 12 / 14 definitions present. Missing: rmsnorm h8192 definitions. GQA kernels shared with Qwen3-32B (same h=16, kv=2, d=128 at TP=4); GEMM definitions exist but workloads have not yet been collected.

Llama 3.2 3B

Architecture: 28 decoder layers, GQA attention, dense MLP.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h3072 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h3072 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h24_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h24_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h24_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h24_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h24_kv8_d128 | gqa_ragged | ✅ |
| gemm_n5120_k3072 | gemm | 🟡 |
| gemm_n3072_k3072 | gemm | 🟡 |
| gemm_n16384_k3072 | gemm | 🟡 |
| gemm_n3072_k8192 | gemm | 🟡 |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 14 / 14 definitions present. GQA ragged prefill kernel added. Workloads not yet collected for rmsnorm h3072 and the GEMM kernels for hidden=3072.

Mistral 7B v0.3

Architecture: 32 decoder layers, GQA attention, dense MLP. Shares identical hidden, attention, and MLP dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, intermediate=14336).
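
As a quick sanity check (illustrative only, not project code), the GEMM definition shapes follow directly from these config values, which is why the Llama 3.1 8B GEMM definitions apply verbatim:

```python
# Back-of-the-envelope check (illustrative, not project code): the GEMM definition
# shapes are pure functions of the config values above, so identical configs imply
# identical GEMM definitions.
def dense_gemm_shapes(hidden, q_heads, kv_heads, head_dim, intermediate):
    return {
        "qkv_proj":  (q_heads * head_dim + 2 * kv_heads * head_dim, hidden),  # (n, k)
        "o_proj":    (hidden, q_heads * head_dim),
        "gate_up":   (2 * intermediate, hidden),
        "down_proj": (hidden, intermediate),
    }

# Llama 3.1 8B and Mistral 7B v0.3: hidden=4096, 32 q / 8 kv heads, head_dim=128, intermediate=14336
print(dense_gemm_shapes(4096, 32, 8, 128, 14336))
# {'qkv_proj': (6144, 4096), 'o_proj': (4096, 4096), 'gate_up': (28672, 4096), 'down_proj': (4096, 14336)}
```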
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n28672_k4096 | gemm | ✅ |
| gemm_n4096_k14336 | gemm | ✅ |
| top_k_sampling_from_probs_v32000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32000 | sampling | ❌ |
| top_p_sampling_from_probs_v32000 | sampling | ❌ |
Coverage: 11 / 14 definitions present. Missing: sampling definitions for vocab_size=32000.

Mistral Nemo 12B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=1 (from sgl-cookbook).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k5120 | gemm | ❌ |
| gemm_n5120_k4096 | gemm | ❌ |
| gemm_n28672_k5120 | gemm | ❌ |
| gemm_n5120_k14336 | gemm | ❌ |
| top_k_sampling_from_probs_v131072 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v131072 | sampling | ❌ |
| top_p_sampling_from_probs_v131072 | sampling | ❌ |
Coverage: 7 / 14 definitions present. GQA defs are shared with Llama 3.1 8B; rmsnorm h5120 is shared with Qwen3 14B. Missing: all GEMM definitions (hidden=5120 input dims) and sampling v131072.

Mixtral 8x7B

Architecture: 32 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). Shares attention and normalization dimensions with Llama 3.1 8B / Mistral 7B.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| MoE experts (top-2, 8 experts, inter=14336) | moe | — |
| top_k_sampling_from_probs_v32000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32000 | sampling | ❌ |
| top_p_sampling_from_probs_v32000 | sampling | ❌ |
Coverage: 9 / 12 referenced definitions present. MoE uses standard top-2 routing (not DeepSeek FP8 block-scale), so the existing MoE definition does not apply (unmapped). Missing: sampling v32000.

Mixtral 8x22B

Architecture: 56 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). All dimensions are new (hidden=6144, 48q/8kv heads). Standard serving configuration: TP=2 (from sgl-cookbook), giving 24q/4kv heads per GPU.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h6144 | rmsnorm | ❌ |
| fused_add_rmsnorm_h6144 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h24_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h48_kv8_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_prefill_causal_h48_kv8_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h48_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h48_kv8_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h48_kv8_d128 | gqa_ragged | ❌ |
| gemm_n8192_k6144 | gemm | ❌ |
| gemm_n6144_k6144 | gemm | ❌ |
| MoE experts (top-2, 8 experts, inter=16384) | moe | — |
| top_k_sampling_from_probs_v32768 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32768 | sampling | ❌ |
| top_p_sampling_from_probs_v32768 | sampling | ❌ |
Coverage: 3 / 13 referenced definitions present. TP=2 prefill + decode definitions added. Missing: rmsnorm, remaining GQA variants, GEMM, and sampling definitions.

Mixtral 8x22B at TP=2

At tensor parallelism TP=2, attention head counts are halved (48 → 24 q-heads, 8 → 4 kv-heads).
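
A minimal illustration of that arithmetic (assumed helper, not project code):

```python
# Simple illustration (assumed helper, not project code) of the head-count
# arithmetic: heads are split evenly across TP ranks, so the h48_kv8 shapes
# become h24_kv4 at TP=2, matching the definitions below.
def attention_heads_at_tp(q_heads: int, kv_heads: int, tp: int) -> tuple[int, int]:
    assert q_heads % tp == 0 and kv_heads % tp == 0, "heads must divide evenly across ranks"
    return q_heads // tp, kv_heads // tp

print(attention_heads_at_tp(48, 8, tp=2))  # (24, 4) -> gqa_*_h24_kv4_d128_* definitions
```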
| Definition | Op Type | Status |
| --- | --- | --- |
| gqa_paged_prefill_causal_h24_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h24_kv4_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h24_kv4_d128_ps64 | gqa_paged | ✅ |
Coverage: 3 / 3 TP=2 attention definitions present (prefill ps1 + ps64, decode ps64).

Qwen2.5 7B

Architecture: 28 decoder layers, GQA attention, dense MLP.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h3584 | rmsnorm | ❌ |
| fused_add_rmsnorm_h3584 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h28_kv4_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_prefill_causal_h28_kv4_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h28_kv4_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_decode_h28_kv4_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h28_kv4_d128 | gqa_ragged | ❌ |
| gemm_n4608_k3584 | gemm | 🟡 |
| gemm_n3584_k3584 | gemm | 🟡 |
| gemm_n37888_k3584 | gemm | 🟡 |
| gemm_n3584_k18944 | gemm | 🟡 |
| top_k_sampling_from_probs_v152064 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v152064 | sampling | 🟡 |
| top_p_sampling_from_probs_v152064 | sampling | 🟡 |
Coverage: 9 / 14 definitions present. Missing: all rmsnorm, GQA, and GEMM definitions for hidden=3584.

Qwen2.5 72B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=8 (from sgl-cookbook).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h8192 | rmsnorm | ❌ |
| fused_add_rmsnorm_h8192 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h8_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_prefill_causal_h8_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h8_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h8_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_ragged_prefill_causal_h8_kv1_d128 | gqa_ragged TP=8 | ❌ |
| gemm_n10240_k8192 | gemm | ❌ |
| gemm_n8192_k8192 | gemm | ❌ |
| gemm_n59392_k8192 | gemm | ❌ |
| gemm_n8192_k29696 | gemm | ❌ |
| top_k_sampling_from_probs_v152064 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v152064 | sampling | 🟡 |
| top_p_sampling_from_probs_v152064 | sampling | 🟡 |
Coverage: 3 / 14 definitions present. Missing: rmsnorm h8192, all GQA definitions (h8_kv1_d128 at TP=8), all GEMM definitions for hidden=8192.

Qwen3 8B

Architecture: 36 decoder layers, GQA attention, dense MLP. Shares hidden size and attention dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, head_dim=128), but uses a larger MLP intermediate size (22016 vs 14336).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n44032_k4096 | gemm | ❌ |
| gemm_n4096_k22016 | gemm | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 12 / 14 definitions present. Missing: gate_up GEMM (gemm_n44032_k4096, intermediate=22016 × 2) and down GEMM (gemm_n4096_k22016). All normalization, attention, and non-MLP GEMM kernels are shared with Llama 3.1 8B.

Qwen3 32B

Architecture: 64 decoder layers, GQA attention, dense MLP. hidden=5120, 64 query heads, 8 KV heads, head_dim=128, intermediate=25600. Standard serving configuration: TP=4.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | ✅ |
| fused_add_rmsnorm_h5120 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps1 | gqa_paged TP=4 | ✅ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv2_d128_ps1 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv2_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_ragged_prefill_causal_h16_kv2_d128 | gqa_ragged TP=4 | ✅ |
| gemm_n10240_k5120 | gemm (QKV) | ❌ |
| gemm_n5120_k8192 | gemm (o_proj) | ❌ |
| gemm_n51200_k5120 | gemm (gate_up) | ❌ |
| gemm_n5120_k25600 | gemm (down) | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 10 / 14 definitions present. RMSNorm shared with Qwen3 14B (same hidden=5120). GQA kernels shared with Llama 3.1/3.3 70B (same h=16, kv=2, d=128 at TP=4). Missing: all GEMM definitions.

Qwen3 235B A22B

Architecture: 94 decoder layers, GQA attention, sparse MoE FFN (128 experts, top-8 routing). Uses head_dim=128 (hidden=4096, 64 query heads). Standard serving configuration: TP=8, EP=2 (FP8 variant from sgl-cookbook). With 4 KV heads, effective per-device TP for attention is TP=4 (kv=1 per device).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h16_kv1_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_prefill_causal_h16_kv1_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv1_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h16_kv1_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_ragged_prefill_causal_h16_kv1_d128 | gqa_ragged TP=4 | ❌ |
| gemm_n4608_k4096 | gemm | ❌ |
| gemm_n4096_k4096 | gemm | ✅ |
| moe_fp8_block_scale_renorm_topk8_e128_h4096_i1536 | moe EP=1 | 🟡 |
| moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e64_h4096_i1536 | moe EP=2 | ❌ |
| trtllm_fp4_block_scale_moe_topk8_e64_h4096_i1536 | moe (TRT-LLM FP4, EP=2) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk8_e64_h4096_i1536 | moe (TRT-LLM FP4 routed, EP=2) | 🟡 |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 8 / 14 referenced definitions present. MoE EP=1 + TRT-LLM FP4 EP=2 definitions added. Missing: most GQA defs (h=16, kv=1, d=128 at TP=4; only the ps64 prefill variant exists) and the QKV GEMM. The o_proj GEMM and rmsnorm are shared with other h=4096 models.

Kimi K2

Architecture: 61 decoder layers, MLA attention (same structure as DeepSeek V3), sparse MoE FFN (384 total experts, top-8 routing). Standard serving configuration: TP=8, EP=4 (from sgl-cookbook). Kimi K2 uses DeepSeek V3-style MLA with the same kv_lora_rank=512 and qk_rope_head_dim=64, but has 64 attention heads (vs 128 in DeepSeek V3). With TP=8 this gives h=8, requiring separate MLA definitions from DeepSeek V3's h=16.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| mla_paged_prefill_causal_h8_ckv512_kpe64_ps1 | mla_paged TP=8 | 🟡 |
| mla_paged_prefill_causal_h8_ckv512_kpe64_ps64 | mla_paged TP=8 | ❌ |
| mla_paged_decode_h8_ckv512_kpe64_ps1 | mla_paged TP=8 | 🟡 |
| mla_paged_decode_h8_ckv512_kpe64_ps64 | mla_paged TP=8 | ❌ |
| mla_ragged_prefill_causal_h8_qk192_vo128 | mla_ragged | 🟡 |
| moe_fp8_block_scale_ds_routing_topk8_ng1_kg1_e384_h7168_i2048 | moe EP=1 | 🟡 |
| moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e96_h7168_i2048 | moe EP=4 | ❌ |
| moe_fp8_block_scale_ds_routing_topk8_ng1_kg1_e48_h7168_i2048 | moe EP=8 | 🟡 |
| top_k_sampling_from_probs_v160000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v160000 | sampling | ❌ |
| top_p_sampling_from_probs_v160000 | sampling | ❌ |
Coverage: 6 / 15 definitions present. RMSNorm definitions are shared with DeepSeek V3 (same hidden=7168 and sub-module dims). MoE EP=1 and EP=8 definitions added. All MLA defs require new h=8 variants; MoE EP=4 variant (e=96) and sampling (v=160000) still missing.

Phi-4 14B

Architecture: 40 decoder layers, GQA attention (unusual 10 KV heads), dense MLP. All dimensions are new for this project.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h40_kv10_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h40_kv10_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h40_kv10_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_decode_h40_kv10_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h40_kv10_d128 | gqa_ragged | ❌ |
| gemm_n7680_k5120 | gemm | ❌ |
| gemm_n5120_k5120 | gemm | 🟡 |
| gemm_n35840_k5120 | gemm | ❌ |
| gemm_n5120_k17920 | gemm | ❌ |
| top_k_sampling_from_probs_v100352 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v100352 | sampling | ❌ |
| top_p_sampling_from_probs_v100352 | sampling | ❌ |
Coverage: 4 / 14 definitions present. rmsnorm h5120 is shared with Qwen3 14B; gemm_n5120_k5120 (the o_proj shape) is shared since 40 q-heads * 128 = 5120 = hidden; gqa_paged_prefill_causal_h40_kv10_d128_ps1 has workloads collected (20/20 PASSED). Missing: the remaining GQA defs (unusual 10 KV-head config), most GEMMs, and sampling v100352.

Llama 3.1 405B

Architecture: 126 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Uses the same Llama architecture as Llama 3.1 8B / 3.3 70B but at significantly larger scale (hidden=16384).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h16384 | rmsnorm | ❌ |
| fused_add_rmsnorm_h16384 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h32_kv2_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_prefill_causal_h32_kv2_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h32_kv2_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h32_kv2_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_ragged_prefill_causal_h32_kv2_d128 | gqa_ragged TP=4 | ❌ |
| gemm_n18432_k16384 | gemm | ❌ |
| gemm_n16384_k16384 | gemm | ❌ |
| gemm_n106496_k16384 | gemm | ❌ |
| gemm_n16384_k53248 | gemm | ❌ |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 3 / 14 definitions present. Sampling definitions are shared with Llama 3.1 8B (same vocab). Missing: rmsnorm h16384 and all GQA/GEMM definitions for this scale (TP=4 gives h=128/4=32 q-heads and kv=8/4=2; the h32_kv2 configuration does not exist in current definitions).

Llama 4 Scout 17B-16E

Architecture: 48 decoder layers, interleaved GQA attention (NoPE global + RoPE local in 1:3 ratio), sparse MoE FFN (16 total experts, top-1 routing). Standard serving configuration: TP=8 (from sgl-cookbook). Multimodal (vision+text).
Note: Exact config.json values (hidden_size, intermediate_size) are pending verification from HuggingFace. Parameters below are estimates from the public model spec (17B activated parameters, 16 experts).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h5_kv1_d128_ps1 | gqa_paged TP=8 | ✅ |
| gqa_paged_prefill_causal_h5_kv1_d128_ps64 | gqa_paged TP=8 | 🟡 |
| gqa_paged_decode_h5_kv1_d128_ps1 | gqa_paged TP=8 | ✅ |
| gqa_paged_decode_h5_kv1_d128_ps64 | gqa_paged TP=8 | 🟡 |
| gqa_ragged_prefill_causal_h5_kv1_d128 | gqa_ragged TP=8 | 🟡 |
| MoE experts (top-1, 16 experts, standard routing) | moe | — |
| trtllm_fp4_block_scale_moe_topk1_e16_h5120_i8192 | moe (TRT-LLM FP4, Llama4 routing) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk1_e16_h5120_i8192 | moe (TRT-LLM FP4 routed, Llama4 routing) | 🟡 |
| trtllm_fp8_per_tensor_scale_moe_topk1_e16_h5120_i8192 | moe (TRT-LLM FP8) | 🟡 |
| top_k_sampling_from_probs_v202048 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v202048 | sampling | 🟡 |
| top_p_sampling_from_probs_v202048 | sampling | 🟡 |
Coverage: 8 / 13 definitions present. rmsnorm h5120 shared with Qwen3 14B. TRT-LLM FP4 + FP8 MoE kernels added (top-1, 16 experts, Llama4 routing). ps1 GQA workloads collected; ps64 GQA and sampling v202048 definitions exist but workloads are pending.

Llama 4 Maverick 17B-128E

Architecture: Same base architecture as Llama 4 Scout but with 128 total experts (vs 16). Standard serving configuration: TP=8 (from sgl-cookbook). hidden_size=5120, 40 q-heads, 8 kv-heads, head_dim=128, intermediate_size=8192.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h5_kv1_d128_ps1 | gqa_paged TP=8 | ✅ |
| gqa_paged_prefill_causal_h5_kv1_d128_ps64 | gqa_paged TP=8 | 🟡 |
| gqa_paged_decode_h5_kv1_d128_ps1 | gqa_paged TP=8 | ✅ |
| gqa_paged_decode_h5_kv1_d128_ps64 | gqa_paged TP=8 | 🟡 |
| gqa_ragged_prefill_causal_h5_kv1_d128 | gqa_ragged TP=8 | 🟡 |
| MoE experts (top-1, 128 experts, standard routing) | moe | — |
| trtllm_fp4_block_scale_moe_topk1_e128_h5120_i8192 | moe (TRT-LLM FP4, Llama4 routing) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk1_e128_h5120_i8192 | moe (TRT-LLM FP4 routed, Llama4 routing) | 🟡 |
| trtllm_fp8_per_tensor_scale_moe_topk1_e128_h5120_i8192 | moe (TRT-LLM FP8) | 🟡 |
| top_k_sampling_from_probs_v202048 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v202048 | sampling | 🟡 |
| top_p_sampling_from_probs_v202048 | sampling | 🟡 |
Coverage: 10 / 13 definitions present. rmsnorm h5120 shared with Qwen3 14B. TRT-LLM FP4 + FP8 MoE kernels added (top-1, 128 experts, Llama4 routing). Same base dimensions as Llama 4 Scout; MoE expert count differs. ps1 GQA workloads collected; ps64 definitions present but workloads pending.

Mistral Small 3.1 24B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook). Shares the same attention configuration as Mistral Nemo 12B (hidden=5120 with explicit head_dim=128 giving 32 effective query heads).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k5120 | gemm | ❌ |
| gemm_n5120_k4096 | gemm | ❌ |
| gemm_n28672_k5120 | gemm | ❌ |
| gemm_n5120_k14336 | gemm | ❌ |
| top_k_sampling_from_probs_v131072 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v131072 | sampling | ❌ |
| top_p_sampling_from_probs_v131072 | sampling | ❌ |
Coverage: 7 / 14 definitions present. GQA kernels are shared with Mistral Nemo 12B and Llama 3.1 8B; rmsnorm h5120 is shared with Qwen3 14B. Missing: GEMM defs with k=5120 input dim (Mistral-specific intermediate sizes) and sampling v131072.

GLM-4.6

Architecture: Dense transformer with Dual Chunk Attention (DCA), a variant of full attention with rotary embeddings. Served on Together AI and Fireworks; sgl-cookbook shows TP=8, EP=8 (a high-throughput configuration), suggesting a very large MoE variant.
Note: Exact architecture parameters for GLM-4.6 require verification from the HuggingFace config.json (zai-org/GLM-4.6). The params below are based on the SGLang glm4.py defaults and may not reflect the actual model dimensions.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ❌ (if hidden=4096) |
| fused_add_rmsnorm_h4096 | rmsnorm | ❌ |
| GQA or custom DCA attention | attention | — |
| MoE FFN (if applicable) | moe | — |
| Sampling (vocab TBD) | sampling | — |
Coverage: 0 / ? definitions present. Architecture requires research. DCA attention may use standard GQA kernels at the computation level (FlashInfer's paged/ragged wrappers) or require custom handling. Run /track-models --model-name glm46 --hf-repo-id zai-org/GLM-4.6 to fetch the exact config and update this section.

MiniMax-Text-01

Architecture: Hybrid linear + softmax attention with MoE FFN. Uses a 7:1 ratio of Lightning Attention (linear) to standard Softmax Attention layers per 8-layer block, plus sparse MoE (32 experts, top-2 routing). Total parameters: ~456B with ~45.9B activated. 80 decoder layers, 64 attention heads, head_dim=128, hidden_size=6144. Lightning Attention is a novel linear attention variant that does not use the standard softmax attention mechanism. It is not currently supported by FlashInfer and requires a new op type.
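
The schematic below shows why a linear-attention layer cannot be expressed with the existing softmax-attention op types: at decode time it carries a fixed-size per-head state instead of reading a growing KV cache. This is only the generic linear-attention recurrence; the published Lightning Attention adds blockwise tiling and decay terms not shown here, and the function name is hypothetical.

```python
import torch

# Schematic of decode-time linear attention (generic recurrence only; Lightning
# Attention adds blockwise tiling and decay terms). Each head carries a fixed-size
# [d, d] state that is updated in place, so no softmax-attention op type applies.
def linear_attention_decode_step(q, k, v, state):
    # q, k, v: [heads, head_dim]; state: [heads, head_dim, head_dim]
    state = state + torch.einsum("hd,he->hde", k, v)  # S_t = S_{t-1} + k_t v_t^T
    out = torch.einsum("hd,hde->he", q, state)        # o_t = q_t S_t
    return out, state
```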
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h6144 | rmsnorm | ❌ |
| fused_add_rmsnorm_h6144 | rmsnorm | ❌ |
| Lightning Attention layers (7/8 of all layers) | lightning_attn | ❌ (op type not supported) |
| Softmax Attention layers (1/8 of all layers) | gqa_paged | ❌ |
| MoE experts (top-2, 32 experts) | moe | — |
| Sampling (vocab TBD) | sampling | — |
Coverage: 0 / ? definitions present. The primary blocker is Lightning Attention, a linear attention variant not yet in FlashInfer. The softmax attention layers (GQA-style) also require new definitions for this model's specific dimensions. To add support, a new lightning_attn op type would first need to be defined.

MiniMax M2

Architecture: 62 decoder layers, GQA attention (6:1 ratio, 48 q-heads / 8 kv-heads, head_dim=128, hidden_size=3072), MoE FFN with sigmoid routing (256 experts, top-8, FP8 block-scale quantization). Total parameters: ~230B with ~10B activated. Note: MiniMax M2 is a separate model from MiniMax-Text-01 (which uses Lightning Attention). M2 uses standard GQA attention.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h3072 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h3072 | rmsnorm | 🟡 |
| rope_with_cos_sin_cache_neox_style_d128_rd64 | rope | 🟡 |
| gqa_paged_prefill_causal_h6_kv1_d128_ps1 | gqa_paged | 🟡 |
| gqa_paged_prefill_causal_h6_kv1_d128_ps64 | gqa_paged | 🟡 |
| gqa_paged_decode_h6_kv1_d128_ps1 | gqa_paged | 🟡 |
| gqa_paged_decode_h6_kv1_d128_ps64 | gqa_paged | 🟡 |
| gqa_ragged_prefill_causal_h6_kv1_d128 | gqa_ragged | 🟡 |
| gemm_n8192_k3072 | gemm (fused qkv_proj) | 🟡 |
| gemm_n3072_k6144 | gemm (o_proj) | 🟡 |
| gemm_n256_k3072 | gemm (MoE gate) | 🟡 |
| MoE gate / topk / experts | moe | — |
| top_k_sampling_from_probs_v200064 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v200064 | sampling | 🟡 |
| top_p_sampling_from_probs_v200064 | sampling | 🟡 |
Coverage: 14 / 15 definitions present. Workloads not yet collected.

Gemma 3 27B

Architecture: 62 decoder layers, GQA attention (2:1 ratio, 32 q-heads / 16 kv-heads, explicit head_dim=128 decoupled from hidden_size=5376), dense MLP with GeGLU activation. Note: hidden_size=5376 is non-standard; head_dim is explicitly 128 (not 5376/32=168). This is a multimodal model (vision+text) but the language backbone uses standard transformer attention.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5376 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5376 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h32_kv16_d128_ps1 | gqa_paged | 🟡 |
| gqa_paged_prefill_causal_h32_kv16_d128_ps64 | gqa_paged | 🟡 |
| gqa_paged_decode_h32_kv16_d128_ps1 | gqa_paged | 🟡 |
| gqa_paged_decode_h32_kv16_d128_ps64 | gqa_paged | 🟡 |
| gqa_ragged_prefill_causal_h32_kv16_d128 | gqa_ragged | ✅ |
| gemm_n4096_k5376 | gemm (q_proj) | 🟡 |
| gemm_n2048_k5376 | gemm (k/v proj) | 🟡 |
| gemm_n5376_k4096 | gemm (o_proj) | 🟡 |
| gemm_n21504_k5376 | gemm (gate/up proj) | 🟡 |
| gemm_n5376_k21504 | gemm (down proj) | 🟡 |
| top_k_sampling_from_probs_v262208 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v262208 | sampling | 🟡 |
| top_p_sampling_from_probs_v262208 | sampling | 🟡 |
Coverage: 15 / 15 definitions present. All dimensions are unique to this model: hidden=5376, intermediate=21504, vocab=262208. GQA ratio is 2:1 (vs 4:1 for Llama/Qwen), so kv_heads=16 (not 8). Workloads not yet collected.

Qwen3 14B

Architecture: 40 decoder layers, GQA attention (5:1 ratio, 40 q-heads / 8 kv-heads, head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook), giving 20 q-heads and 4 kv-heads per device.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h20_kv4_d128_ps1 | gqa_paged TP=2 | 🟡 |
| gqa_paged_prefill_causal_h20_kv4_d128_ps64 | gqa_paged TP=2 | 🟡 |
| gqa_paged_decode_h20_kv4_d128_ps1 | gqa_paged TP=2 | 🟡 |
| gqa_paged_decode_h20_kv4_d128_ps64 | gqa_paged TP=2 | 🟡 |
| gqa_ragged_prefill_causal_h20_kv4_d128 | gqa_ragged TP=2 | ✅ |
| gemm_n7168_k5120 | gemm (qkv_proj combined) | 🟡 |
| gemm_n5120_k5120 | gemm (o_proj) | 🟡 |
| gemm_n34816_k5120 | gemm (gate_up combined) | 🟡 |
| gemm_n5120_k17408 | gemm (down proj) | 🟡 |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 14 / 14 definitions present. The rmsnorm_h5120 definition is also shared with Mistral Nemo 12B, Mistral Small 3.1 24B, Phi-4 14B, and Llama 4 Scout/Maverick. Non-sampling workloads not yet collected.

NemotronH 47B

Architecture: 52 decoder layers total, a hybrid of standard GQA (Transformer) and Mamba2 SSM layers. Uses 20 GQA attention layers and 32 Mamba2 layers in an interleaved pattern. Standard serving configuration: TP=8 (from sgl-cookbook). Mamba2 SSM (Structured State Space Model) is a linear recurrent architecture that does not use softmax attention. It maintains a fixed-size state matrix updated at each step, analogous to a hidden state in RNNs. Mamba2 is not currently supported as an op type in FlashInfer-Bench and requires defining a new mamba_ssu (Selective State-space Unit) operation type before this model can be tracked.
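
For intuition, the per-token recurrence looks roughly like the sketch below: a generic selective-SSM step, not Mamba2's actual chunked/fused formulation, with hypothetical function and argument names.

```python
import torch

# Schematic of a selective state-space decode step (generic SSM recurrence, not
# Mamba2's chunked/fused kernels), to show why these layers need a new op type:
# each step updates a fixed-size state with input-dependent (selective) parameters
# instead of reading a KV cache.
def ssm_decode_step(x_t, state, A, B_t, C_t):
    # x_t: [channels], state: [channels, state_dim]
    # A: [channels, state_dim] (decay), B_t, C_t: [state_dim] (selective, per token)
    state = torch.exp(A) * state + x_t[:, None] * B_t[None, :]  # h_t = A_bar * h_{t-1} + B_t * x_t
    y_t = state @ C_t                                           # y_t = C_t . h_t
    return y_t, state
```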
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h{hidden} | rmsnorm | ❌ (dims TBD) |
| GQA attention layers (20 layers, TP=8) | gqa_paged | ❌ |
| Mamba2 SSM layers (32 layers) | mamba_ssu | ❌ (op type not supported) |
| MLP / MoE FFN | gemm / moe | ❌ |
| Sampling | sampling | ❌ |
Coverage: 0 / ? definitions present. The primary blocker is the Mamba2 SSM op type, a selective state-space operation not yet defined in FlashInfer-Bench. This is analogous to MiniMax-Text-01's Lightning Attention blocker. To add support, a new mamba_ssu op type schema would first need to be defined. Once that exists, the GQA attention layers could reuse existing definitions if dimensions match.