
Model Kernel Coverage

This document tracks which kernels are supported in FlashInfer-Bench for each model.
  • βœ… Definition JSON exists and workload has been collected
  • 🟑 Definition JSON exists but workload has not yet been collected
  • ❌ Definition is referenced in models.ts but the file does not exist (missing)
  • β€” Module exists in the architecture but no definition is mapped (unmapped)

Summary

| Model | Architecture | Coverage |
| --- | --- | --- |
| DeepSeek V3/R1 | MLA + Dense/MoE | 🟡 Partial |
| DeepSeek V3.2 | DSA + Dense/MoE | ✅ Fully covered |
| Llama 3.1 8B | GQA + Dense | ✅ Fully covered |
| Llama 3.1/3.3 70B | GQA + Dense | 🟡 Partial |
| Llama 3.2 3B | GQA + Dense | 🟡 Partial |
| Mistral 7B v0.3 | GQA + Dense | 🟡 Partial |
| Mistral Nemo 12B | GQA + Dense | 🟡 Partial |
| Mixtral 8x7B | GQA + MoE | 🟡 Partial |
| Mixtral 8x22B | GQA + MoE | 🟡 Partial |
| Qwen2.5 7B | GQA + Dense | 🟡 Partial |
| Qwen2.5 72B | GQA + Dense | 🟡 Partial |
| Qwen3 8B | GQA + Dense | 🟡 Partial |
| Qwen3 30B A3B | GQA + MoE | 🟡 Partial |
| Qwen3 32B | GQA + Dense | 🟡 Partial |
| Qwen3 235B A22B | GQA + MoE | 🟡 Partial |
| Qwen3 Next 80B A3B | GDN + GQA + MoE | 🟡 Partial |
| Kimi K2 | MLA + MoE | 🟡 Partial |
| Phi-4 14B | GQA + Dense | 🟡 Partial |
| Llama 3.1 405B | GQA + Dense | 🟡 Partial |
| Llama 4 Scout 17B-16E | GQA + MoE | 🟡 Partial |
| Llama 4 Maverick 17B-128E | GQA + MoE | 🟡 Partial |
| Mistral Small 3.1 24B | GQA + Dense | 🟡 Partial |
| GLM-4.6 | GQA + Dense | ❌ Not covered |
| MiniMax-Text-01 | Lightning Attn + MoE | ❌ Not covered |
| MiniMax M2 | GQA + MoE | 🟡 Partial |
| Gemma 3 27B | GQA + Dense | 🟡 Partial |
| Qwen3 14B | GQA + Dense | 🟡 Partial |
| NemotronH 47B | GQA + Mamba2 Hybrid | ❌ Not covered |

DeepSeek V3 / R1

Architecture: 61 decoder layers, MLA attention, hybrid Dense+MoE FFN
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| gemm_n256_k7168 | gemm | ✅ |
| mla_ragged_prefill_causal_h16_qk192_vo128 | mla_ragged | ✅ |
| mla_paged_prefill_causal_h16_ckv512_kpe64_ps1 | mla_paged | ✅ |
| mla_paged_prefill_causal_h16_ckv512_kpe64_ps64 | mla_paged | ✅ |
| mla_paged_decode_h16_ckv512_kpe64_ps1 | mla_paged | ✅ |
| mla_paged_decode_h16_ckv512_kpe64_ps64 | mla_paged | ✅ |
| moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048 | moe | ✅ |
| top_k_sampling_from_probs_v129280 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v129280 | sampling | ✅ |
| top_p_sampling_from_probs_v129280 | sampling | ✅ |
Coverage: 13 / 14 definitions present. Missing: MLA ragged prefill definition.

DeepSeek V3.2

Architecture: 61 decoder layers, DSA (DeepSeek Sparse Attention) replacing dense MLA, hybrid Dense+MoE FFN. Standard serving configuration: TP=8. DSA introduces a learned TopK indexer that selects a sparse subset of KV pages before running attention, reducing computation for long contexts while preserving accuracy.
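
The sketch below illustrates the two-stage flow described above for a single decode step. It is illustrative only: the function and argument names are hypothetical, dense PyTorch ops stand in for the real kernels, and the real indexer is a learned FP8 module rather than the dot-product score used here.

```python
import torch

# Illustrative sketch only (not the FlashInfer-Bench API) of the two-stage DSA flow.
# Dimensions follow the definition names: ckv=512 (compressed latent KV),
# kpe=64 (RoPE part), topk=2048 selected positions.
def dsa_decode_step(q_nope, q_pe, ckv_cache, kpe_cache, topk=2048):
    # q_nope: [heads, 512], q_pe: [heads, 64]
    # ckv_cache: [seq, 512], kpe_cache: [seq, 64]
    k = min(topk, ckv_cache.shape[0])

    # Stage 1 -- TopK indexer: score every cached position and keep the best k.
    # (The real indexer is a small learned FP8 module; a dot product stands in.)
    index_scores = q_pe.mean(dim=0) @ kpe_cache.T          # [seq]
    keep = index_scores.topk(k).indices                    # [k]

    # Stage 2 -- MLA-style attention restricted to the selected positions only.
    ckv_sel, kpe_sel = ckv_cache[keep], kpe_cache[keep]
    logits = q_nope @ ckv_sel.T + q_pe @ kpe_sel.T         # [heads, k]
    probs = torch.softmax(logits * (512 + 64) ** -0.5, dim=-1)  # illustrative scale
    return probs @ ckv_sel                                 # [heads, 512]
```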
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| dsa_topk_indexer_fp8_h64_d128_topk2048_ps64 | dsa_paged | ✅ |
| dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps1 | dsa_paged | ✅ |
| dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64 | dsa_paged | ✅ |
| moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048 | moe | ✅ |
Coverage: 8 / 8 definitions present. Fully covered.

Llama 3.1 8B

Architecture: 32 decoder layers, GQA attention, dense MLP
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n28672_k4096 | gemm | ✅ |
| gemm_n4096_k14336 | gemm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 14 / 14 definitions present. Fully covered.

Qwen3 30B A3B

Architecture: 32 decoder layers, GQA attention, MoE FFN (30 MoE + 2 dense layers)
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h128 | rmsnorm | ✅ |
| rmsnorm_h2048 | rmsnorm | ✅ |
| fused_add_rmsnorm_h2048 | rmsnorm | ✅ |
| gemm_n128_k2048 | gemm | ✅ |
| gemm_n2048_k4096 | gemm | ✅ |
| gemm_n5120_k2048 | gemm | ✅ |
| gqa_paged_prefill_causal_h32_kv4_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv4_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv4_d128 | gqa_ragged | ✅ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
| MoE gate / topk / experts | moe | — |
| moe_fp8_block_scale_renorm_topk8_e128_h2048_i768 | moe EP=1 | 🟡 |
| trtllm_fp4_block_scale_moe_topk8_e128_h2048_i768 | moe (TRT-LLM FP4) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk8_e128_h2048_i768 | moe (TRT-LLM FP4 routed) | 🟡 |
| trtllm_fp8_per_tensor_scale_moe_topk8_e128_h2048_i768 | moe (TRT-LLM FP8) | 🟡 |
Coverage: 14 / 14 referenced definitions present. MoE kernels added (not yet mapped in models.ts).

Qwen3 Next 80B A3B

Architecture: 48 layers total (36 GDN linear-attention layers + 12 GQA standard-attention layers); all layers use MoE FFN. Standard serving configuration: TP=2 or TP=4.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h2048 | rmsnorm | ✅ |
| fused_add_rmsnorm_h2048 | rmsnorm | ✅ |
| gdn_prefill_qk16_v32_d128_k_last | gdn TP=1 | 🟡 |
| gdn_prefill_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_prefill_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gdn_decode_qk16_v32_d128_k_last | gdn TP=1 | 🟡 |
| gdn_decode_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_decode_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gdn_mtp_qk16_v32_d128_k_last | gdn TP=1 | 🟡 |
| gdn_mtp_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_mtp_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gqa_paged_prefill_causal_h8_kv1_d256_ps1 | gqa_paged TP=2 | ❌ |
| gqa_paged_decode_h8_kv1_d256_ps1 | gqa_paged TP=2 | ❌ |
| gqa_ragged_prefill_causal_h8_kv1_d256 | gqa_ragged TP=2 | ✅ |
| MoE gate / topk / experts (GDN layers) | moe | — |
| MoE gate / topk / experts (GQA layers) | moe | — |
| moe_fp8_block_scale_renorm_topk10_e128_h2048_i512 | moe EP=1 | 🟡 |
| trtllm_fp4_block_scale_moe_topk10_e128_h2048_i512 | moe (TRT-LLM FP4, EP=4) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk10_e128_h2048_i512 | moe (TRT-LLM FP4 routed, EP=4) | 🟡 |
Coverage: 10 / 14 referenced definitions present. MoE definition added (shared across GDN and GQA layers). Missing GDN definitions: TP=1 prefill and decode (qk16_v32). Missing GQA: h=8, kv=1, d=256 (TP=2 of original h=16, kv=2, d=256).

Llama 3.1 / 3.3 70B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Llama 3.1 70B and 3.3 70B share identical architecture dimensions; only training data and context window differ.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h8192 | rmsnorm | ❌ |
| fused_add_rmsnorm_h8192 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps1 | gqa_paged TP=4 | ✅ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv2_d128_ps1 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv2_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_ragged_prefill_causal_h16_kv2_d128 | gqa_ragged TP=4 | ✅ |
| gemm_n10240_k8192 | gemm | 🟡 |
| gemm_n8192_k8192 | gemm | 🟡 |
| gemm_n57344_k8192 | gemm | 🟡 |
| gemm_n8192_k28672 | gemm | 🟡 |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 12 / 14 definitions present. Missing: rmsnorm h8192 definitions. GQA kernels shared with Qwen3-32B (same h=16, kv=2, d=128 at TP=4); GEMM definitions exist but workloads have not yet been collected.

Llama 3.2 3B

Architecture: 28 decoder layers, GQA attention, dense MLP.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h3072 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h3072 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h24_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h24_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h24_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h24_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h24_kv8_d128 | gqa_ragged | ✅ |
| gemm_n5120_k3072 | gemm | 🟡 |
| gemm_n3072_k3072 | gemm | 🟡 |
| gemm_n16384_k3072 | gemm | 🟡 |
| gemm_n3072_k8192 | gemm | 🟡 |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 14 / 14 definitions present. GQA ragged prefill kernel added. Workloads not yet collected for rmsnorm h3072 and the GEMM kernels for hidden=3072.

Mistral 7B v0.3

Architecture: 32 decoder layers, GQA attention, dense MLP. Shares identical hidden, attention, and MLP dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, intermediate=14336).
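
As a quick sanity check (illustrative only, not project code), the GEMM definition shapes follow directly from these config values, which is why the Llama 3.1 8B GEMM definitions apply verbatim:

```python
# Back-of-the-envelope check (illustrative, not project code): the GEMM definition
# shapes are pure functions of the config values above, so identical configs imply
# identical GEMM definitions.
def dense_gemm_shapes(hidden, q_heads, kv_heads, head_dim, intermediate):
    return {
        "qkv_proj":  (q_heads * head_dim + 2 * kv_heads * head_dim, hidden),  # (n, k)
        "o_proj":    (hidden, q_heads * head_dim),
        "gate_up":   (2 * intermediate, hidden),
        "down_proj": (hidden, intermediate),
    }

# Llama 3.1 8B and Mistral 7B v0.3: hidden=4096, 32 q / 8 kv heads, head_dim=128, intermediate=14336
print(dense_gemm_shapes(4096, 32, 8, 128, 14336))
# {'qkv_proj': (6144, 4096), 'o_proj': (4096, 4096), 'gate_up': (28672, 4096), 'down_proj': (4096, 14336)}
```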
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n28672_k4096 | gemm | ✅ |
| gemm_n4096_k14336 | gemm | ✅ |
| top_k_sampling_from_probs_v32000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32000 | sampling | ❌ |
| top_p_sampling_from_probs_v32000 | sampling | ❌ |
Coverage: 11 / 14 definitions present. Missing: sampling definitions for vocab_size=32000.

Mistral Nemo 12B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=1 (from sgl-cookbook).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k5120 | gemm | ❌ |
| gemm_n5120_k4096 | gemm | ❌ |
| gemm_n28672_k5120 | gemm | ❌ |
| gemm_n5120_k14336 | gemm | ❌ |
| top_k_sampling_from_probs_v131072 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v131072 | sampling | ❌ |
| top_p_sampling_from_probs_v131072 | sampling | ❌ |
Coverage: 7 / 14 definitions present. GQA defs are shared with Llama 3.1 8B; rmsnorm h5120 is shared with Qwen3 14B. Missing: all GEMM definitions (hidden=5120 input dims) and sampling v131072.

Mixtral 8x7B

Architecture: 32 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). Shares attention and normalization dimensions with Llama 3.1 8B / Mistral 7B.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| MoE experts (top-2, 8 experts, inter=14336) | moe | — |
| top_k_sampling_from_probs_v32000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32000 | sampling | ❌ |
| top_p_sampling_from_probs_v32000 | sampling | ❌ |
Coverage: 9 / 12 referenced definitions present. MoE uses standard top-2 routing (not DeepSeek FP8 block-scale), so the existing MoE definition does not apply (unmapped). Missing: sampling v32000.

Mixtral 8x22B

Architecture: 56 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). All dimensions are new (hidden=6144, 48q/8kv heads). Standard serving configuration: TP=2 (from sgl-cookbook), giving 24q/4kv heads per GPU.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h6144 | rmsnorm | ❌ |
| fused_add_rmsnorm_h6144 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h24_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h48_kv8_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_prefill_causal_h48_kv8_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h48_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h48_kv8_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h48_kv8_d128 | gqa_ragged | ❌ |
| gemm_n8192_k6144 | gemm | ❌ |
| gemm_n6144_k6144 | gemm | ❌ |
| MoE experts (top-2, 8 experts, inter=16384) | moe | — |
| top_k_sampling_from_probs_v32768 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32768 | sampling | ❌ |
| top_p_sampling_from_probs_v32768 | sampling | ❌ |
Coverage: 3 / 13 referenced definitions present. TP=2 prefill + decode definitions added. Missing: rmsnorm, remaining GQA variants, GEMM, and sampling definitions.

Mixtral 8x22B at TP=2

At tensor parallelism TP=2, attention head counts are halved (48 → 24 q-heads, 8 → 4 kv-heads).
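
A minimal illustration of that arithmetic (assumed helper, not project code):

```python
# Simple illustration (assumed helper, not project code) of the head-count
# arithmetic: heads are split evenly across TP ranks, so the h48_kv8 shapes
# become h24_kv4 at TP=2, matching the definitions below.
def attention_heads_at_tp(q_heads: int, kv_heads: int, tp: int) -> tuple[int, int]:
    assert q_heads % tp == 0 and kv_heads % tp == 0, "heads must divide evenly across ranks"
    return q_heads // tp, kv_heads // tp

print(attention_heads_at_tp(48, 8, tp=2))  # (24, 4) -> gqa_*_h24_kv4_d128_* definitions
```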
| Definition | Op Type | Status |
| --- | --- | --- |
| gqa_paged_prefill_causal_h24_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h24_kv4_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h24_kv4_d128_ps64 | gqa_paged | ✅ |
Coverage: 3 / 3 TP=2 attention definitions present (prefill ps1 + ps64, decode ps64).

Qwen2.5 7B

Architecture: 28 decoder layers, GQA attention, dense MLP.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h3584 | rmsnorm | ❌ |
| fused_add_rmsnorm_h3584 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h28_kv4_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_prefill_causal_h28_kv4_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h28_kv4_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_decode_h28_kv4_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h28_kv4_d128 | gqa_ragged | ❌ |
| gemm_n4608_k3584 | gemm | 🟡 |
| gemm_n3584_k3584 | gemm | 🟡 |
| gemm_n37888_k3584 | gemm | 🟡 |
| gemm_n3584_k18944 | gemm | 🟡 |
| top_k_sampling_from_probs_v152064 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v152064 | sampling | 🟡 |
| top_p_sampling_from_probs_v152064 | sampling | 🟡 |
Coverage: 9 / 14 definitions present. Missing: all rmsnorm, GQA, and GEMM definitions for hidden=3584.

Qwen2.5 72B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=8 (from sgl-cookbook).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h8192 | rmsnorm | ❌ |
| fused_add_rmsnorm_h8192 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h8_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_prefill_causal_h8_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h8_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h8_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_ragged_prefill_causal_h8_kv1_d128 | gqa_ragged TP=8 | ❌ |
| gemm_n10240_k8192 | gemm | ❌ |
| gemm_n8192_k8192 | gemm | ❌ |
| gemm_n59392_k8192 | gemm | ❌ |
| gemm_n8192_k29696 | gemm | ❌ |
| top_k_sampling_from_probs_v152064 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v152064 | sampling | 🟡 |
| top_p_sampling_from_probs_v152064 | sampling | 🟡 |
Coverage: 3 / 14 definitions present. Missing: rmsnorm h8192, all GQA definitions (h8_kv1_d128 at TP=8), all GEMM definitions for hidden=8192.

Qwen3 8B

Architecture: 36 decoder layers, GQA attention, dense MLP. Shares hidden size and attention dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, head_dim=128), but uses a larger MLP intermediate size (22016 vs 14336).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n44032_k4096 | gemm | ❌ |
| gemm_n4096_k22016 | gemm | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 12 / 14 definitions present. Missing: gate_up GEMM (gemm_n44032_k4096, intermediate=22016 × 2) and down GEMM (gemm_n4096_k22016). All normalization, attention, and non-MLP GEMM kernels are shared with Llama 3.1 8B.

Qwen3 32B

Architecture: 64 decoder layers, GQA attention, dense MLP. hidden=5120, 64 query heads, 8 KV heads, head_dim=128, intermediate=25600. Standard serving configuration: TP=4.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | ✅ |
| fused_add_rmsnorm_h5120 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps1 | gqa_paged TP=4 | ✅ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv2_d128_ps1 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv2_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_ragged_prefill_causal_h16_kv2_d128 | gqa_ragged TP=4 | ✅ |
| gemm_n10240_k5120 | gemm (QKV) | ❌ |
| gemm_n5120_k8192 | gemm (o_proj) | ❌ |
| gemm_n51200_k5120 | gemm (gate_up) | ❌ |
| gemm_n5120_k25600 | gemm (down) | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 10 / 14 definitions present. RMSNorm shared with Qwen3 14B (same hidden=5120). GQA kernels shared with Llama 3.1/3.3 70B (same h=16, kv=2, d=128 at TP=4). Missing: all GEMM definitions.

Qwen3 235B A22B

Architecture: 94 decoder layers, GQA attention, sparse MoE FFN (128 experts, top-8 routing). Uses head_dim=128 (hidden=4096, 64 query heads). Standard serving configuration: TP=8, EP=2 (FP8 variant from sgl-cookbook). With 4 KV heads, effective per-device TP for attention is TP=4 (kv=1 per device).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h16_kv1_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_prefill_causal_h16_kv1_d128_ps64 | gqa_paged TP=4 | ✅ |
| gqa_paged_decode_h16_kv1_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h16_kv1_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_ragged_prefill_causal_h16_kv1_d128 | gqa_ragged TP=4 | ❌ |
| gemm_n4608_k4096 | gemm | ❌ |
| gemm_n4096_k4096 | gemm | ✅ |
| moe_fp8_block_scale_renorm_topk8_e128_h4096_i1536 | moe EP=1 | 🟡 |
| moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e64_h4096_i1536 | moe EP=2 | ❌ |
| trtllm_fp4_block_scale_moe_topk8_e64_h4096_i1536 | moe (TRT-LLM FP4, EP=2) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk8_e64_h4096_i1536 | moe (TRT-LLM FP4 routed, EP=2) | 🟡 |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 8 / 14 referenced definitions present. MoE EP=1 + TRT-LLM FP4 EP=2 definitions added. Missing: most GQA defs (h=16, kv=1, d=128 at TP=4; only the ps64 prefill variant exists) and the QKV GEMM. The o_proj GEMM and rmsnorm are shared with other h=4096 models.

Kimi K2

Architecture: 61 decoder layers, MLA attention (same structure as DeepSeek V3), sparse MoE FFN (384 total experts, top-8 routing). Standard serving configuration: TP=8, EP=4 (from sgl-cookbook). Kimi K2 uses DeepSeek V3-style MLA with the same kv_lora_rank=512 and qk_rope_head_dim=64, but has 64 attention heads (vs 128 in DeepSeek V3). With TP=8 this gives h=8, requiring separate MLA definitions from DeepSeek V3's h=16.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| mla_paged_prefill_causal_h8_ckv512_kpe64_ps1 | mla_paged TP=8 | 🟡 |
| mla_paged_prefill_causal_h8_ckv512_kpe64_ps64 | mla_paged TP=8 | ❌ |
| mla_paged_decode_h8_ckv512_kpe64_ps1 | mla_paged TP=8 | 🟡 |
| mla_paged_decode_h8_ckv512_kpe64_ps64 | mla_paged TP=8 | ❌ |
| mla_ragged_prefill_causal_h8_qk192_vo128 | mla_ragged | 🟡 |
| moe_fp8_block_scale_ds_routing_topk8_ng1_kg1_e384_h7168_i2048 | moe EP=1 | 🟡 |
| moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e96_h7168_i2048 | moe EP=4 | ❌ |
| moe_fp8_block_scale_ds_routing_topk8_ng1_kg1_e48_h7168_i2048 | moe EP=8 | 🟡 |
| top_k_sampling_from_probs_v160000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v160000 | sampling | ❌ |
| top_p_sampling_from_probs_v160000 | sampling | ❌ |
Coverage: 6 / 15 definitions present. RMSNorm definitions are shared with DeepSeek V3 (same hidden=7168 and sub-module dims). MoE EP=1 and EP=8 definitions added. All MLA defs require new h=8 variants; MoE EP=4 variant (e=96) and sampling (v=160000) still missing.

Phi-4 14B

Architecture: 40 decoder layers, GQA attention (unusual 10 KV heads), dense MLP. All dimensions are new for this project.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h40_kv10_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h40_kv10_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h40_kv10_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_decode_h40_kv10_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h40_kv10_d128 | gqa_ragged | ❌ |
| gemm_n7680_k5120 | gemm | ❌ |
| gemm_n5120_k5120 | gemm | 🟡 |
| gemm_n35840_k5120 | gemm | ❌ |
| gemm_n5120_k17920 | gemm | ❌ |
| top_k_sampling_from_probs_v100352 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v100352 | sampling | ❌ |
| top_p_sampling_from_probs_v100352 | sampling | ❌ |
Coverage: 4 / 14 definitions present. rmsnorm h5120 is shared with Qwen3 14B; gemm_n5120_k5120 (the o_proj shape) is shared since 40 q-heads * 128 = 5120 = hidden; gqa_paged_prefill_causal_h40_kv10_d128_ps1 has workloads collected (20/20 PASSED). Missing: the remaining GQA defs (unusual 10 KV-head config), most GEMMs, and sampling v100352.

Llama 3.1 405B

Architecture: 126 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Uses the same Llama architecture as Llama 3.1 8B / 3.3 70B but at significantly larger scale (hidden=16384).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h16384 | rmsnorm | ❌ |
| fused_add_rmsnorm_h16384 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h32_kv2_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_prefill_causal_h32_kv2_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h32_kv2_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h32_kv2_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_ragged_prefill_causal_h32_kv2_d128 | gqa_ragged TP=4 | ❌ |
| gemm_n18432_k16384 | gemm | ❌ |
| gemm_n16384_k16384 | gemm | ❌ |
| gemm_n106496_k16384 | gemm | ❌ |
| gemm_n16384_k53248 | gemm | ❌ |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 3 / 14 definitions present. Sampling definitions are shared with Llama 3.1 8B (same vocab). Missing: rmsnorm h16384 and all GQA/GEMM definitions for this scale (TP=4 gives h=128/4=32 q-heads and kv=8/4=2; the h32_kv2 configuration does not exist in current definitions).

Llama 4 Scout 17B-16E

Architecture: 48 decoder layers, interleaved GQA attention (NoPE global + RoPE local in 1:3 ratio), sparse MoE FFN (16 total experts, top-1 routing). Standard serving configuration: TP=8 (from sgl-cookbook). Multimodal (vision+text).
Note: Exact config.json values (hidden_size, intermediate_size) are pending verification from HuggingFace. Parameters below are estimates from the public model spec (17B activated parameters, 16 experts).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h5_kv1_d128_ps1 | gqa_paged TP=8 | ✅ |
| gqa_paged_prefill_causal_h5_kv1_d128_ps64 | gqa_paged TP=8 | 🟡 |
| gqa_paged_decode_h5_kv1_d128_ps1 | gqa_paged TP=8 | ✅ |
| gqa_paged_decode_h5_kv1_d128_ps64 | gqa_paged TP=8 | 🟡 |
| gqa_ragged_prefill_causal_h5_kv1_d128 | gqa_ragged TP=8 | 🟡 |
| MoE experts (top-1, 16 experts, standard routing) | moe | — |
| trtllm_fp4_block_scale_moe_topk1_e16_h5120_i8192 | moe (TRT-LLM FP4, Llama4 routing) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk1_e16_h5120_i8192 | moe (TRT-LLM FP4 routed, Llama4 routing) | 🟡 |
| trtllm_fp8_per_tensor_scale_moe_topk1_e16_h5120_i8192 | moe (TRT-LLM FP8) | 🟡 |
| top_k_sampling_from_probs_v202048 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v202048 | sampling | 🟡 |
| top_p_sampling_from_probs_v202048 | sampling | 🟡 |
Coverage: 8 / 13 definitions present. rmsnorm h5120 shared with Qwen3 14B. TRT-LLM FP4 + FP8 MoE kernels added (top-1, 16 experts, Llama4 routing). ps1 GQA workloads collected; ps64 GQA and sampling v202048 definitions exist but workloads are pending.

Llama 4 Maverick 17B-128E

Architecture: Same base architecture as Llama 4 Scout but with 128 total experts (vs 16). Standard serving configuration: TP=8 (from sgl-cookbook). hidden_size=5120, 40 q-heads, 8 kv-heads, head_dim=128, intermediate_size=8192.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h5_kv1_d128_ps1 | gqa_paged TP=8 | ✅ |
| gqa_paged_prefill_causal_h5_kv1_d128_ps64 | gqa_paged TP=8 | 🟡 |
| gqa_paged_decode_h5_kv1_d128_ps1 | gqa_paged TP=8 | ✅ |
| gqa_paged_decode_h5_kv1_d128_ps64 | gqa_paged TP=8 | 🟡 |
| gqa_ragged_prefill_causal_h5_kv1_d128 | gqa_ragged TP=8 | 🟡 |
| MoE experts (top-1, 128 experts, standard routing) | moe | — |
| trtllm_fp4_block_scale_moe_topk1_e128_h5120_i8192 | moe (TRT-LLM FP4, Llama4 routing) | 🟡 |
| trtllm_fp4_block_scale_routed_moe_topk1_e128_h5120_i8192 | moe (TRT-LLM FP4 routed, Llama4 routing) | 🟡 |
| trtllm_fp8_per_tensor_scale_moe_topk1_e128_h5120_i8192 | moe (TRT-LLM FP8) | 🟡 |
| top_k_sampling_from_probs_v202048 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v202048 | sampling | 🟡 |
| top_p_sampling_from_probs_v202048 | sampling | 🟡 |
Coverage: 10 / 13 definitions present. rmsnorm h5120 shared with Qwen3 14B. TRT-LLM FP4 + FP8 MoE kernels added (top-1, 128 experts, Llama4 routing). Same base dimensions as Llama 4 Scout; MoE expert count differs. ps1 GQA workloads collected; ps64 definitions present but workloads pending.

Mistral Small 3.1 24B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook). Shares the same attention configuration as Mistral Nemo 12B (hidden=5120 with explicit head_dim=128 giving 32 effective query heads).
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k5120 | gemm | ❌ |
| gemm_n5120_k4096 | gemm | ❌ |
| gemm_n28672_k5120 | gemm | ❌ |
| gemm_n5120_k14336 | gemm | ❌ |
| top_k_sampling_from_probs_v131072 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v131072 | sampling | ❌ |
| top_p_sampling_from_probs_v131072 | sampling | ❌ |
Coverage: 7 / 14 definitions present. GQA kernels are shared with Mistral Nemo 12B and Llama 3.1 8B; rmsnorm h5120 is shared with Qwen3 14B. Missing: GEMM defs with k=5120 input dim (Mistral-specific intermediate sizes) and sampling v131072.

GLM-4.6

Architecture: Dense transformer with Dual Chunk Attention (DCA), a variant of full attention with rotary embeddings. Served on Together AI and Fireworks; sgl-cookbook shows TP=8, EP=8 (a high-throughput configuration), suggesting a very large MoE variant.
Note: Exact architecture parameters for GLM-4.6 require verification from the HuggingFace config.json (zai-org/GLM-4.6). The params below are based on the SGLang glm4.py defaults and may not reflect the actual model dimensions.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h4096 | rmsnorm | ❌ (if hidden=4096) |
| fused_add_rmsnorm_h4096 | rmsnorm | ❌ |
| GQA or custom DCA attention | attention | — |
| MoE FFN (if applicable) | moe | — |
| Sampling (vocab TBD) | sampling | — |
Coverage: 0 / ? definitions present. Architecture requires research. DCA attention may use standard GQA kernels at the computation level (FlashInfer's paged/ragged wrappers) or require custom handling. Run /track-models --model-name glm46 --hf-repo-id zai-org/GLM-4.6 to fetch the exact config and update this section.

MiniMax-Text-01

Architecture: Hybrid linear + softmax attention with MoE FFN. Uses a 7:1 ratio of Lightning Attention (linear) to standard Softmax Attention layers per 8-layer block, plus sparse MoE (32 experts, top-2 routing). Total parameters: ~456B with ~45.9B activated. 80 decoder layers, 64 attention heads, head_dim=128, hidden_size=6144. Lightning Attention is a novel linear attention variant that does not use the standard softmax attention mechanism. It is not currently supported by FlashInfer and requires a new op type.
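
The schematic below shows why a linear-attention layer cannot be expressed with the existing softmax-attention op types: at decode time it carries a fixed-size per-head state instead of reading a growing KV cache. This is only the generic linear-attention recurrence; the published Lightning Attention adds blockwise tiling and decay terms not shown here, and the function name is hypothetical.

```python
import torch

# Schematic of decode-time linear attention (generic recurrence only; Lightning
# Attention adds blockwise tiling and decay terms). Each head carries a fixed-size
# [d, d] state that is updated in place, so no softmax-attention op type applies.
def linear_attention_decode_step(q, k, v, state):
    # q, k, v: [heads, head_dim]; state: [heads, head_dim, head_dim]
    state = state + torch.einsum("hd,he->hde", k, v)  # S_t = S_{t-1} + k_t v_t^T
    out = torch.einsum("hd,hde->he", q, state)        # o_t = q_t S_t
    return out, state
```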
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h6144 | rmsnorm | ❌ |
| fused_add_rmsnorm_h6144 | rmsnorm | ❌ |
| Lightning Attention layers (7/8 of all layers) | lightning_attn | ❌ (op type not supported) |
| Softmax Attention layers (1/8 of all layers) | gqa_paged | ❌ |
| MoE experts (top-2, 32 experts) | moe | — |
| Sampling (vocab TBD) | sampling | — |
Coverage: 0 / ? definitions present. The primary blocker is Lightning Attention, a linear attention variant not yet in FlashInfer. The softmax attention layers (GQA-style) also require new definitions for this model's specific dimensions. To add support, a new lightning_attn op type would first need to be defined.

MiniMax M2

Architecture: 62 decoder layers, GQA attention (6:1 ratio, 48 q-heads / 8 kv-heads, head_dim=128, hidden_size=3072), MoE FFN with sigmoid routing (256 experts, top-8, FP8 block-scale quantization). Total parameters: ~230B with ~10B activated. Note: MiniMax M2 is a separate model from MiniMax-Text-01 (which uses Lightning Attention). M2 uses standard GQA attention.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h3072 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h3072 | rmsnorm | 🟡 |
| rope_with_cos_sin_cache_neox_style_d128_rd64 | rope | 🟡 |
| gqa_paged_prefill_causal_h6_kv1_d128_ps1 | gqa_paged | 🟡 |
| gqa_paged_prefill_causal_h6_kv1_d128_ps64 | gqa_paged | 🟡 |
| gqa_paged_decode_h6_kv1_d128_ps1 | gqa_paged | 🟡 |
| gqa_paged_decode_h6_kv1_d128_ps64 | gqa_paged | 🟡 |
| gqa_ragged_prefill_causal_h6_kv1_d128 | gqa_ragged | 🟡 |
| gemm_n8192_k3072 | gemm (fused qkv_proj) | 🟡 |
| gemm_n3072_k6144 | gemm (o_proj) | 🟡 |
| gemm_n256_k3072 | gemm (MoE gate) | 🟡 |
| MoE gate / topk / experts | moe | — |
| top_k_sampling_from_probs_v200064 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v200064 | sampling | 🟡 |
| top_p_sampling_from_probs_v200064 | sampling | 🟡 |
Coverage: 14 / 15 definitions present. Workloads not yet collected.

Gemma 3 27B

Architecture: 62 decoder layers, GQA attention (2:1 ratio, 32 q-heads / 16 kv-heads, explicit head_dim=128 decoupled from hidden_size=5376), dense MLP with GeGLU activation. Note: hidden_size=5376 is non-standard; head_dim is explicitly 128 (not 5376/32=168). This is a multimodal model (vision+text) but the language backbone uses standard transformer attention.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5376 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5376 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h32_kv16_d128_ps1 | gqa_paged | 🟡 |
| gqa_paged_prefill_causal_h32_kv16_d128_ps64 | gqa_paged | 🟡 |
| gqa_paged_decode_h32_kv16_d128_ps1 | gqa_paged | 🟡 |
| gqa_paged_decode_h32_kv16_d128_ps64 | gqa_paged | 🟡 |
| gqa_ragged_prefill_causal_h32_kv16_d128 | gqa_ragged | ✅ |
| gemm_n4096_k5376 | gemm (q_proj) | 🟡 |
| gemm_n2048_k5376 | gemm (k/v proj) | 🟡 |
| gemm_n5376_k4096 | gemm (o_proj) | 🟡 |
| gemm_n21504_k5376 | gemm (gate/up proj) | 🟡 |
| gemm_n5376_k21504 | gemm (down proj) | 🟡 |
| top_k_sampling_from_probs_v262208 | sampling | 🟡 |
| top_k_top_p_sampling_from_probs_v262208 | sampling | 🟡 |
| top_p_sampling_from_probs_v262208 | sampling | 🟡 |
Coverage: 15 / 15 definitions present. All dimensions are unique to this model: hidden=5376, intermediate=21504, vocab=262208. GQA ratio is 2:1 (vs 4:1 for Llama/Qwen), so kv_heads=16 (not 8). Workloads not yet collected.

Qwen3 14B

Architecture: 40 decoder layers, GQA attention (5:1 ratio, 40 q-heads / 8 kv-heads, head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook), giving 20 q-heads and 4 kv-heads per device.
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h5120 | rmsnorm | 🟡 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟡 |
| gqa_paged_prefill_causal_h20_kv4_d128_ps1 | gqa_paged TP=2 | 🟡 |
| gqa_paged_prefill_causal_h20_kv4_d128_ps64 | gqa_paged TP=2 | 🟡 |
| gqa_paged_decode_h20_kv4_d128_ps1 | gqa_paged TP=2 | 🟡 |
| gqa_paged_decode_h20_kv4_d128_ps64 | gqa_paged TP=2 | 🟡 |
| gqa_ragged_prefill_causal_h20_kv4_d128 | gqa_ragged TP=2 | ✅ |
| gemm_n7168_k5120 | gemm (qkv_proj combined) | 🟡 |
| gemm_n5120_k5120 | gemm (o_proj) | 🟡 |
| gemm_n34816_k5120 | gemm (gate_up combined) | 🟡 |
| gemm_n5120_k17408 | gemm (down proj) | 🟡 |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 14 / 14 definitions present. The rmsnorm_h5120 definition is also shared with Mistral Nemo 12B, Mistral Small 3.1 24B, Phi-4 14B, and Llama 4 Scout/Maverick. Non-sampling workloads not yet collected.

NemotronH 47B

Architecture: 52 decoder layers total, a hybrid of standard GQA (Transformer) and Mamba2 SSM layers. Uses 20 GQA attention layers and 32 Mamba2 layers in an interleaved pattern. Standard serving configuration: TP=8 (from sgl-cookbook). Mamba2 SSM (Structured State Space Model) is a linear recurrent architecture that does not use softmax attention. It maintains a fixed-size state matrix updated at each step, analogous to a hidden state in RNNs. Mamba2 is not currently supported as an op type in FlashInfer-Bench and requires defining a new mamba_ssu (Selective State-space Unit) operation type before this model can be tracked.
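
For intuition, the per-token recurrence looks roughly like the sketch below: a generic selective-SSM step, not Mamba2's actual chunked/fused formulation, with hypothetical function and argument names.

```python
import torch

# Schematic of a selective state-space decode step (generic SSM recurrence, not
# Mamba2's chunked/fused kernels), to show why these layers need a new op type:
# each step updates a fixed-size state with input-dependent (selective) parameters
# instead of reading a KV cache.
def ssm_decode_step(x_t, state, A, B_t, C_t):
    # x_t: [channels], state: [channels, state_dim]
    # A: [channels, state_dim] (decay), B_t, C_t: [state_dim] (selective, per token)
    state = torch.exp(A) * state + x_t[:, None] * B_t[None, :]  # h_t = A_bar * h_{t-1} + B_t * x_t
    y_t = state @ C_t                                           # y_t = C_t . h_t
    return y_t, state
```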
| Definition | Op Type | Status |
| --- | --- | --- |
| rmsnorm_h{hidden} | rmsnorm | ❌ (dims TBD) |
| GQA attention layers (20 layers, TP=8) | gqa_paged | ❌ |
| Mamba2 SSM layers (32 layers) | mamba_ssu | ❌ (op type not supported) |
| MLP / MoE FFN | gemm / moe | ❌ |
| Sampling | sampling | ❌ |
Coverage: 0 / ? definitions present. The primary blocker is the Mamba2 SSM op type, a selective state-space operation not yet defined in FlashInfer-Bench. This is analogous to MiniMax-Text-01's Lightning Attention blocker. To add support, a new mamba_ssu op type schema would first need to be defined. Once that exists, the GQA attention layers could reuse existing definitions if dimensions match.