vllm.benchmarks.iterations ¶
Batch iteration benchmark for precise prefill/decode phase measurement.
On the server side, run:

vllm serve <your_model>

On the client side, run:

# Prefill benchmark: measure prefill of 8K new tokens (no existing context)
vllm bench iterations \
--endpoints 127.0.0.1:8000 \
--input-len 8192 \
--batch-size 1 \
--mode prefill \
--profile \
--model <your_model>
# Prefill benchmark: measure prefill of 2K new tokens against 4K existing context
vllm bench iterations \
--endpoints 127.0.0.1:8000 \
--context-len 4096 \
--input-len 2048 \
--batch-size 1 \
--mode prefill \
--profile \
--model <your_model>
# Decode benchmark: warmup with 8K context, measure 128 decode iterations
vllm bench iterations \
--endpoints 127.0.0.1:8000 \
--context-len 8192 \
--batch-size 64 \
--mode decode \
--iterations 128 \
--profile \
--model <your_model>
This benchmark uses sleep(level=0) to pause scheduling, queues requests, then resumes scheduling to measure precise batch execution times.
Prefix Cache Warmup
Before each benchmark run, the client sends warmup requests with context_len tokens to populate the prefix cache. The benchmark requests share this prefix, so the server can skip prefilling the context portion (prefix cache hit).
Modes
prefill: First warms up prefix cache with context_len tokens. Then measures prefill of input_len NEW tokens against existing context. Total prompt = context_len + input_len tokens. context_len=0 is valid (clean prefill of new input only).
decode: First warms up prefix cache with context_len tokens. Then measures decode throughput for --iterations output tokens. The benchmark prompt matches the warmup (full prefix cache hit), so we measure ONLY decode latency, not prefill. context_len > 0 is REQUIRED (cannot decode without context).
Batch Size Semantics
--batch-size specifies the batch size PER DP domain, matching the standalone benchmark (fbcode) semantics. The client automatically queries the server's DP configuration and multiplies to get the global batch size.
Example: With DP=8 and --batch-size 64, the client sends 64*8=512 total requests distributed round-robin across all DP ranks.
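The batch-size expansion described above can be sketched as follows (`expand_batch` is a hypothetical helper for illustration, not part of the benchmark code):

```python
def expand_batch(batch_size_per_dp: int, dp_size: int) -> list[int]:
    """Return the DP rank assigned to each request in the global batch.

    The per-domain batch size is multiplied by the DP size, and requests
    are distributed round-robin across ranks.
    """
    global_batch = batch_size_per_dp * dp_size
    return [i % dp_size for i in range(global_batch)]

ranks = expand_batch(64, 8)
print(len(ranks))       # 512 total requests (64 per rank x 8 ranks)
print(ranks.count(0))   # 64 requests land on rank 0
```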
NOTE: For accurate prefill benchmarks, do NOT use --enable-chunked-prefill on the server. Chunked prefill breaks long prefills into multiple steps, which interferes with measuring true prefill performance.
BenchmarkConfig dataclass ¶
Configuration for the iterations benchmark.
Source code in vllm/benchmarks/iterations.py
EndpointRotator ¶
Round-robin endpoint selection (matches disagg_benchmarks pattern).
Source code in vllm/benchmarks/iterations.py
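A minimal round-robin sketch, assuming a `next()`-style interface (the real class in vllm/benchmarks/iterations.py may differ):

```python
from itertools import cycle

class EndpointRotator:
    """Illustrative sketch: hand out endpoints in round-robin order."""

    def __init__(self, endpoints: list[str]) -> None:
        self._cycle = cycle(endpoints)

    def next(self) -> str:
        # Each call returns the next endpoint, wrapping around at the end.
        return next(self._cycle)

rotator = EndpointRotator(["127.0.0.1:8000", "127.0.0.1:8001"])
print([rotator.next() for _ in range(4)])
```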
_normalize ¶
IterationResult dataclass ¶
Result of a single benchmark iteration.
Source code in vllm/benchmarks/iterations.py
ServerConfig dataclass ¶
add_cli_args ¶
add_cli_args(parser: ArgumentParser) -> None
Add CLI arguments for the iterations benchmark.
Source code in vllm/benchmarks/iterations.py
build_prompts ¶
Build context and benchmark prompts for a parameter combination.
Returns (context_prompt, benchmark_prompt) where:

- context_prompt: Used for prefix cache warmup (None if context_len <= 0)
- benchmark_prompt: Used for the actual benchmark run
For prefill mode
context_prompt = context_len tokens ("hello " repeated)
benchmark_prompt = context_len + input_len tokens

The first context_len tokens match context_prompt (prefix cache hit). We measure prefill of the remaining input_len new tokens.
For decode mode
context_prompt = context_len tokens
benchmark_prompt = same as context_prompt (full prefix cache hit)

We measure only decode iterations (no prefill work).
Source code in vllm/benchmarks/iterations.py
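The prompt construction described above can be sketched like this (assuming, for illustration only, that each repetition of "hello " maps to one token):

```python
def build_prompts(context_len: int, input_len: int, mode: str):
    """Sketch of the context/benchmark prompt split described above."""
    # Warmup prompt: only needed when there is existing context to cache.
    context_prompt = "hello " * context_len if context_len > 0 else None
    if mode == "decode":
        # Decode: benchmark prompt equals the warmup prompt, so the
        # server gets a full prefix cache hit and does no prefill work.
        benchmark_prompt = "hello " * context_len
    else:
        # Prefill: the first context_len tokens hit the cache; the
        # remaining input_len tokens are the new prefill being measured.
        benchmark_prompt = "hello " * (context_len + input_len)
    return context_prompt, benchmark_prompt
```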
call_debug_endpoint async ¶
call_debug_endpoint(
session: ClientSession,
rotator: EndpointRotator,
path: str,
params: dict | None = None,
) -> bool
Call debug endpoint on ALL endpoints (for sleep/wake_up/profile).
Returns True if all calls succeeded, False if any failed.
Source code in vllm/benchmarks/iterations.py
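The fan-out semantics can be sketched as follows (`call_one` is a hypothetical stand-in for the per-endpoint HTTP call; the real function uses an aiohttp session):

```python
import asyncio

async def call_debug_endpoint(call_one, endpoints, path, params=None) -> bool:
    """Sketch: call the debug path on every endpoint concurrently.

    Returns True only if every call succeeded, mirroring the
    all-or-nothing semantics described above.
    """
    results = await asyncio.gather(
        *(call_one(ep, path, params) for ep in endpoints),
        return_exceptions=True,  # a failed endpoint must not hide the others
    )
    return all(r is True for r in results)
```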
count_tokens ¶
Extract token counts from completion response.
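Assuming an OpenAI-style completions response with a usage field (field names are an assumption, not taken from the source), the extraction might look like:

```python
def count_tokens(response: dict) -> tuple[int, int]:
    """Sketch: read prompt/completion token counts from the 'usage'
    field of an OpenAI-style completion response, defaulting to 0."""
    usage = response.get("usage", {})
    return usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0)

print(count_tokens({"usage": {"prompt_tokens": 8192, "completion_tokens": 1}}))
```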
fetch_server_config async ¶
fetch_server_config(
session: ClientSession, rotator: EndpointRotator
) -> ServerConfig
Fetch server parallelism config from first endpoint.
Source code in vllm/benchmarks/iterations.py
fetch_traces async ¶
fetch_traces(
session: ClientSession,
rotator: EndpointRotator,
prefix: str,
output_dir: str,
) -> list[str]
Download trace files from all endpoints.
Source code in vllm/benchmarks/iterations.py
parse_comma_list ¶
print_results_summary ¶
print_results_summary(
results: list[IterationResult],
server_config: ServerConfig | None = None,
) -> None
Print a summary of benchmark results.
Source code in vllm/benchmarks/iterations.py
run_benchmark async ¶
run_benchmark(
config: BenchmarkConfig,
) -> tuple[list[IterationResult], ServerConfig]
Main benchmark loop with parameter sweeping.
Source code in vllm/benchmarks/iterations.py
run_compilation_warmup async ¶
run_compilation_warmup(
session: ClientSession,
rotator: EndpointRotator,
model: str,
) -> None
Send a warmup request to trigger runtime compilation.
Source code in vllm/benchmarks/iterations.py
run_prefix_cache_warmup async ¶
run_prefix_cache_warmup(
session: ClientSession,
rotator: EndpointRotator,
model: str,
context_prompt: str | None,
batch_size: int,
) -> None
Populate prefix cache with context tokens before benchmarking.
Sends batch_size requests with context_prompt to populate the prefix cache. The benchmark requests will share this prefix.
Source code in vllm/benchmarks/iterations.py
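A sketch of the warmup fan-out, assuming `post_completion` stands in for the HTTP completion call (the real function works through an aiohttp session and the endpoint rotator):

```python
import asyncio

async def run_prefix_cache_warmup(post_completion, context_prompt, batch_size):
    """Sketch: send batch_size identical requests so the context prefix
    lands in the server's prefix cache before the benchmark runs."""
    if context_prompt is None:
        return  # context_len <= 0: nothing to warm up
    await asyncio.gather(*(
        post_completion(prompt=context_prompt, max_tokens=1)
        for _ in range(batch_size)
    ))
```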
run_single_iteration async ¶
run_single_iteration(
session: ClientSession,
config: BenchmarkConfig,
rotator: EndpointRotator,
benchmark_prompt: str,
batch_size: int,
) -> tuple[float, int, int]
Run one iteration: sleep → queue requests → wake → measure.
Source code in vllm/benchmarks/iterations.py
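The sleep → queue requests → wake → measure flow can be sketched as follows (all callables are hypothetical stand-ins for the real HTTP calls; in practice the queued requests block on the server until scheduling resumes):

```python
import asyncio
import time

async def run_single_iteration(send_request, sleep_server, wake_server,
                               batch_size: int) -> float:
    """Illustrative flow only: pause scheduling, queue a batch,
    resume scheduling, and time the batch execution."""
    await sleep_server()                                # pause scheduling
    tasks = [asyncio.create_task(send_request())        # queue the batch
             for _ in range(batch_size)]
    await asyncio.sleep(0)                              # let requests enqueue
    start = time.perf_counter()
    await wake_server()                                 # resume scheduling
    await asyncio.gather(*tasks)                        # wait for the batch
    return time.perf_counter() - start                  # batch execution time
```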
write_results_json ¶
write_results_json(
results: list[IterationResult], output_path: str
) -> None
Write results to JSON file.