vLLM EngineCore
1. DPEngineCoreProc/EngineCoreProc/EngineCore
Execution flow (1) walked through the main steps of the DPEngineCoreProc object's init method. After run_engine_core() successfully constructs the DPEngineCoreProc object, it calls its run_busy_loop() method, which starts the engine's main loop and begins receiving and processing inference requests over ZMQ.
Below is the code of DPEngineCoreProc's run_busy_loop() method:
1 | # vllm/vllm/v1/engine/core.py > DPEngineCoreProc |
The core function invoked here is _process_engine_step() of the __EngineCore__ class:
1 | # /vllm/vllm/v1/engine/core.py > EngineCore |
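For orientation, here is a minimal, hypothetical sketch of the busy-loop shape described above: a background receiver feeds requests into a queue, and the loop drains the queue, steps the engine, and sends outputs back. The helpers recv_requests, step_engine, and send_outputs are placeholders, not vLLM APIs; the real logic lives in run_busy_loop() and _process_engine_step().

```python
# Hypothetical sketch of a "busy loop" engine process; not vLLM's actual code.
import queue
import threading

def run_busy_loop(recv_requests, step_engine, send_outputs):
    """recv_requests yields incoming requests (in vLLM: a ZMQ input thread),
    step_engine runs one scheduling + model step, send_outputs ships results."""
    inbox: queue.Queue = queue.Queue()

    def receiver():
        for req in recv_requests():
            inbox.put(req)

    threading.Thread(target=receiver, daemon=True).start()

    pending = []
    while True:
        # Drain whatever arrived since the last step without blocking.
        try:
            while True:
                pending.append(inbox.get_nowait())
        except queue.Empty:
            pass
        # One engine step: schedule, run the model, collect finished outputs.
        outputs = step_engine(pending)
        pending.clear()
        if outputs:
            send_outputs(outputs)
```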
2. MultiprocExecutor
In the EngineCore object, self.model_executor is initialized as a __MultiprocExecutor__.
(The logic that decides which executor class to use lives in Executor.get_class() in /vllm/vllm/v1/executor/abstract.py.)
1 | # /vllm/vllm/v1/executor/multiproc_executor.py |
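As a rough illustration of the kind of dispatch Executor.get_class() performs, consider the following simplified sketch. The stub class names are placeholders for vLLM's real executor classes, and the real selection logic in abstract.py has more cases; only the field parallel_config.distributed_executor_backend is taken from vLLM.

```python
# Hypothetical sketch of executor-class selection; not the real Executor.get_class().
from types import SimpleNamespace

class MultiprocExecutorStub: ...
class RayExecutorStub: ...
class UniprocExecutorStub: ...

def get_executor_class(parallel_config) -> type:
    backend = getattr(parallel_config, "distributed_executor_backend", None)
    if backend == "ray":
        return RayExecutorStub       # Ray-based distributed executor
    if backend == "mp":
        return MultiprocExecutorStub # the class discussed in this section
    return UniprocExecutorStub       # single-process fallback

print(get_executor_class(SimpleNamespace(distributed_executor_backend="mp")))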
3. The MessageQueue class
1 | # vllm/vllm/distributed/device_communicators/shm_broadcast.py |
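The MessageQueue in shm_broadcast.py lets one writer broadcast work to many local readers through a shared-memory ring buffer (with a ZMQ fallback for remote readers). The following is a drastically simplified stand-in for that idea using only the standard library (one slot, no ring buffer, no synchronization), just to illustrate the mechanism; it is not vLLM's implementation.

```python
# Toy illustration of passing a message through shared memory.
import pickle
import struct
from multiprocessing import shared_memory

def write_msg(shm: shared_memory.SharedMemory, obj) -> None:
    payload = pickle.dumps(obj)
    shm.buf[:4] = struct.pack("<I", len(payload))      # length prefix
    shm.buf[4:4 + len(payload)] = payload

def read_msg(shm: shared_memory.SharedMemory):
    (length,) = struct.unpack("<I", bytes(shm.buf[:4]))
    return pickle.loads(bytes(shm.buf[4:4 + length]))

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=1 << 16)
    try:
        write_msg(shm, {"method": "execute_model", "args": [1, 2, 3]})
        print(read_msg(shm))  # a reader process would attach by name instead
    finally:
        shm.close()
        shm.unlink()
```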
4. The WorkerProc class
In the loop inside MultiprocExecutor._init_executor(), each call to make_worker_process creates a new WorkerProc.
1 | # vllm/vllm/v1/executor/multiproc_executor.py |
In make_worker_process(...), the child process is created via proc = context.Process(target=WorkerProc.worker_main, kwargs=process_kwargs, ...). Once the child process starts, it executes WorkerProc.worker_main, which calls WorkerProc.__init__() to complete the child-side initialization.
1 | # /vllm/vllm/v1/executor/multiproc_executor.py > WorkerProc |
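The parent/child split around make_worker_process follows the standard multiprocessing pattern: the parent spawns the child with Process(target=..., kwargs=...) and waits for a readiness signal. A generic sketch of that pattern is shown below; the function and pipe names are illustrative, not vLLM's.

```python
# Generic spawn-a-worker pattern, illustrating the flow described above.
import multiprocessing as mp

def worker_main(rank: int, ready_writer) -> None:
    # Child-process side: do per-worker initialization, then signal the parent.
    state = {"rank": rank}                       # stands in for WorkerProc.__init__ work
    ready_writer.send(("READY", state["rank"]))
    ready_writer.close()
    # ... a real worker would now enter its request-handling loop ...

def make_worker_process(rank: int) -> mp.Process:
    ctx = mp.get_context("spawn")
    ready_reader, ready_writer = ctx.Pipe(duplex=False)
    proc = ctx.Process(
        target=worker_main,
        kwargs={"rank": rank, "ready_writer": ready_writer},
        daemon=True,
    )
    proc.start()
    print(ready_reader.recv())                   # block until the child reports READY
    return proc

if __name__ == "__main__":
    p = make_worker_process(rank=0)
    p.join()
```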
The WorkerWrapperBase class
In WorkerProc's static method worker_main, the WorkerProc object is constructed; during its __init__, the line wrapper = WorkerWrapperBase(vllm_config=vllm_config, rpc_rank=rank) instantiates the __wrapper__ object.
It then calls wrapper.init_worker(all_kwargs) to build the real worker.
1 | # /vllm/vllm/v1/worker/worker_base.py |
The worker class is obtained with resolve_obj_by_qualname: worker_class = resolve_obj_by_qualname(self.vllm_config.parallel_config.worker_cls), where the worker_cls passed in is a str.
1 | # /vllm/vllm/utils/__init__.py |
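Conceptually, resolve_obj_by_qualname just turns a dotted string into the Python object it names. A minimal equivalent, written from the behavior described here rather than copied from vllm/utils:

```python
# Minimal equivalent of resolving an object from a dotted qualified name.
import importlib

def resolve_obj_by_qualname(qualname: str):
    """'pkg.module.ClassName' -> the ClassName object from pkg.module."""
    module_name, _, obj_name = qualname.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, obj_name)

# e.g. resolve_obj_by_qualname("vllm.v1.worker.gpu_worker.Worker") would import
# vllm.v1.worker.gpu_worker and return its Worker class.
print(resolve_obj_by_qualname("collections.OrderedDict"))
```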
Here self.vllm_config.parallel_config.worker_cls comes from the vllm_config built for LLMEngine: in LLMEngine.from_engine_args, EngineArgs.create_engine_config() constructs the VllmConfig, which is then passed all the way down to WorkerProc and injected into WorkerWrapperBase.init_worker().
- On CUDA/ROCm, worker_cls = vllm.v1.worker.gpu_worker.Worker (set in /vllm/vllm/platforms/cuda.py).
So the worker class obtained here is gpu_worker.Worker.
1 | # in vllm/vllm/platforms/cuda.py |
Worker (the gpu_worker version)
1 | # in vllm/vllm/v1/worker/gpu_worker.py |
GPUModelRunner
1 | # vllm/vllm/v1/worker/gpu_model_runner.py |
The get_model_loader used here to load the model is as follows:
1 | # vllm/vllm/model_executor/model_loader/__init__.py |
The subsequent loading call self.model = model_loader.load_model(vllm_config=self.vllm_config, model_config=self.model_config) corresponds to the load_model function below. It loads the model and returns an nn.Module object.
1 | # vllm/vllm/model_executor/model_loader/base_loader.py > BaseModelLoader |
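At a high level, a loader's load_model does three things: instantiate the model architecture, stream the weights into it, and return it on the target device in eval mode. A generic sketch of that flow (illustrative only; BaseModelLoader's real signature and helpers live in base_loader.py):

```python
# Generic shape of a "load_model" flow; illustrative, not vLLM's BaseModelLoader.
import torch.nn as nn

def load_model(build_architecture, load_weights, device: str = "cuda") -> nn.Module:
    model = build_architecture()   # instantiate the nn.Module architecture
    load_weights(model)            # fill in parameters (e.g. from safetensors shards)
    return model.to(device).eval() # move to the GPU and switch to inference mode
```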
After loading, GPUModelRunner.load_model chooses how the model will be compiled and executed:
1 | # vllm/vllm/v1/worker/gpu_model_runner.py |
- Dynamo compilation: enables operator fusion and graph-level optimization, and suits relatively stable execution graphs; the first compilation incurs overhead, and the benefit may be limited for highly dynamic shapes.
- Full CUDA Graph: drastically reduces per-step CPU overhead and jitter, but requires fairly stable input/memory addresses; vLLM tries to satisfy this through its own scheduling.
- UBatch + optional CUDA Graph: keeps good controllability and performance under more dynamic concurrency/batch patterns, suited to setups with DBO enabled.
For backend = self.vllm_config.compilation_config.init_backend(self.vllm_config), the backend is initialized:
1 | # vllm/vllm/config/compilation.py > CompilationConfig |
Qwen uses __Dynamo compilation__, so this step is executed and a backend object is initialized.
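The "backend" handed to Dynamo is simply a callable that receives the captured torch.fx.GraphModule plus example inputs and returns something callable; VllmBackend plays that role in vLLM. A minimal custom backend using only public PyTorch APIs, to show what the object returned by init_backend is plugged into:

```python
# Minimal custom Dynamo backend: receives the traced FX graph and returns a
# callable. vLLM's VllmBackend fills this role, with far more logic inside.
import torch

def my_backend(gm: torch.fx.GraphModule, example_inputs):
    print(f"compiling a graph with {len(gm.graph.nodes)} nodes")
    return gm.forward  # run the captured graph as-is (no extra optimization)

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
compiled = torch.compile(model, fullgraph=True, backend=my_backend)
print(compiled(torch.randn(2, 8)).shape)
```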
VllmBackend
1 | class VllmBackend: |
1 | def enable_trace_function_call_for_thread(vllm_config: VllmConfig) -> None: |
- Pre-inference (warmup/capture) is initiated from gpu_worker.py and proceeds mainly along two paths:
  - memory and compilation warmup (profile_run and _dummy_run);
  - CUDA Graph capture (capture_model → _capture_cudagraphs → _dummy_run triggers the capture).
Where first_run_finished starts being used
The variable of that name is actually used in PiecewiseBackend:
- Initialization: self.first_run_finished = False (vllm/vllm/compilation/piecewise_backend.py:59).
- First call: the first execution of __call__ sets it to True, calls check_for_ending_compilation(), and then returns the compiled runnable for the "general shape" (compiled_graph_for_general_shape) (piecewise_backend.py:89-97).
- Subsequent calls: static-shape compilation is selected or triggered according to runtime_shape, and check_for_ending_compilation() is called again once all required shapes have been compiled (piecewise_backend.py:101-120).
- Summary: first_run_finished "starts running" on the first execution of PiecewiseBackend.__call__; in CUDAGraphWrapper this flag is currently unused (see the sketch after this list).
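A minimal sketch of the "compile the general shape on the first call, then compile/select per concrete shape afterwards" control flow that the bullets above describe; this illustrates the idea only and is not the real piecewise_backend.py code.

```python
# Illustrative compile-on-demand backend with a `first_run_finished` flag.
from typing import Callable

class CompileOnDemand:
    def __init__(self, fn: Callable, compile_fn: Callable, static_shapes: set[int]):
        self.fn = fn
        self.compile_fn = compile_fn            # e.g. a torch.compile wrapper
        self.static_shapes = static_shapes      # shapes that get their own compilation
        self.first_run_finished = False
        self.general = None                     # compiled for the "general" (symbolic) shape
        self.per_shape: dict[int, Callable] = {}

    def __call__(self, x):
        if not self.first_run_finished:
            self.first_run_finished = True
            self.general = self.compile_fn(self.fn)       # general-shape compilation
            return self.general(x)
        n = len(x)
        if n in self.static_shapes and n not in self.per_shape:
            self.per_shape[n] = self.compile_fn(self.fn)  # static-shape compilation
        return self.per_shape.get(n, self.general)(x)

backend = CompileOnDemand(fn=lambda xs: [v * 2 for v in xs],
                          compile_fn=lambda f: f,         # identity "compiler" for the demo
                          static_shapes={1, 2, 4, 8})
print(backend([1, 2, 3]), backend([1, 2]))
```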
Overall pre-inference/warmup flow (starting from gpu_worker.py)
Device and model loading
- Worker.init_device sets up the local GPU device (see gpu_worker.py).
- Worker.load_model → GPUModelRunner.load_model loads the model and wraps it according to the configuration:
  - With CompilationLevel.DYNAMO_AS_IS, it calls self.model.compile(fullgraph=True, backend=...) directly (gpu_model_runner.py:2866-2876).
  - For other compilation levels, depending on cudagraph_mode and whether DBO is enabled, the model is wrapped in CUDAGraphWrapper (FULL) or UBatchWrapper (gpu_model_runner.py:2897-2916).
Memory probing and the first warmup
- Worker.determine_available_memory calls GPUModelRunner.profile_run() to run one complete dummy forward pass and measure the peak GPU memory usage (gpu_worker.py:236-310).

```python
# vllm/vllm/v1/worker/gpu_worker.py > Worker
def determine_available_memory(self) -> int:
    """Profiles the peak memory usage of the model to determine how much
    memory can be used for KV cache without OOMs.
    The engine will first conduct a profiling of the existing memory usage.
    Then, it calculates the free memory that can be used for KV cache in
    bytes.
    Tip:
        You may limit the usage of GPU memory
        by adjusting the `gpu_memory_utilization` parameter.
    """
    self.model_runner.profile_run()
    msg = (
        f"Initial free memory {GiB(self.init_snapshot.free_memory):.2f} "
        f"GiB, reserved {GiB(kv_cache_memory_bytes):.2f} GiB memory for ...")
```

- GPUModelRunner.profile_run internally executes _dummy_run(self.max_num_tokens, is_profile=True):
```python
# vllm/vllm/v1/worker/gpu_model_runner.py > GPUModelRunner
def profile_run(self) -> None:
    # Profile with multimodal encoder & encoder cache.
    # Add `is_profile` here to pre-allocate communication buffers
    hidden_states, last_hidden_states = self._dummy_run(
        self.max_num_tokens, is_profile=True
    )
```

- The forward context is injected via set_forward_context(..., cudagraph_runtime_mode=CUDAGraphMode.NONE, ...), which ensures that no CUDA Graph capture is triggered (gpu_model_runner.py:3664-3668 and forward_context.py:268-306).
```python
def _dummy_run(
    self,
    num_tokens: int,
    cudagraph_runtime_mode: Optional[CUDAGraphMode] = None,
    force_attention: bool = False,
    uniform_decode: bool = False,
    allow_microbatching: bool = True,
    skip_eplb: bool = False,
    is_profile: bool = False,
    create_mixed_batch: bool = False,
    remove_lora: bool = True,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Run a dummy forward pass to warm up/profile run or capture the
    CUDA graph for the model.
    """
    # ...
    with (
        self.maybe_randomize_inputs(input_ids),
        set_forward_context(
            attn_metadata,
            self.vllm_config,
            num_tokens=num_tokens_after_padding,
            num_tokens_across_dp=num_tokens_across_dp,
            cudagraph_runtime_mode=cudagraph_runtime_mode,
            batch_descriptor=batch_descriptor,
            ubatch_slices=ubatch_slices,
        ),
    ):
        outputs = self.model(
            input_ids=input_ids,
            positions=positions,
            intermediate_tensors=intermediate_tensors,
            inputs_embeds=inputs_embeds,
            **model_kwargs,
        )
    # ...
```

- If this is the last PP rank, the pooler or sampler is then warmed up (gpu_model_runner.py:3669-3687).
Allocating the KV Cache
- Worker.initialize_cache allocates the KV cache according to the probing result (gpu_worker.py:162-180).

```python
def initialize_cache(self, num_gpu_blocks: int, num_cpu_blocks: int) -> None:
    self.cache_config.num_gpu_blocks = num_gpu_blocks
    self.cache_config.num_cpu_blocks = num_cpu_blocks
```
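The num_gpu_blocks value passed in here is essentially the memory left over after profiling divided by the size of one KV cache block. A back-of-the-envelope version of that arithmetic (illustrative only; vLLM's real accounting also handles quantized KV dtypes, MLA layouts, etc.):

```python
# Rough KV-cache sizing arithmetic (illustrative, not vLLM's exact accounting).
def num_kv_cache_blocks(
    available_bytes: int,
    block_size: int = 16,        # tokens per block
    num_layers: int = 32,
    num_kv_heads: int = 8,
    head_size: int = 128,
    dtype_bytes: int = 2,        # fp16/bf16
) -> int:
    # 2x for the K and V tensors of every layer.
    bytes_per_block = 2 * num_layers * block_size * num_kv_heads * head_size * dtype_bytes
    return available_bytes // bytes_per_block

# e.g. 20 GiB left after profiling, with the example dimensions above:
print(num_kv_cache_blocks(20 * 1024**3))   # -> 10240 blocks (2 MiB per block)
```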
Compilation and CUDA Graph warmup/capture
- Worker.compile_or_warm_up_model is responsible for all warmup and capture before real inference (gpu_worker.py:338-460):

```python
def compile_or_warm_up_model(self) -> None:
    # warm up sizes that are not in cudagraph capture sizes,
    # but users still want to compile for better performance,
    # e.g. for the max-num-batched token size in chunked prefill.
```

- Warming up compile sizes: for the sizes in compile_sizes that do not take part in CUDA Graph capture, call self.model_runner._dummy_run(size, skip_eplb=True, remove_lora=False) one by one to trigger compilation and kernel warmup (gpu_worker.py:351-357).
```python
warmup_sizes = self.vllm_config.compilation_config.compile_sizes.copy()
if not self.model_config.enforce_eager:
    warmup_sizes = [
        x
        for x in warmup_sizes
        if x not in self.vllm_config.compilation_config.cudagraph_capture_sizes
    ]
# We skip EPLB here since we don't want to record dummy metrics
for size in sorted(warmup_sizes, reverse=True):
    logger.info("Compile and warming up model for size %d", size)
    self.model_runner._dummy_run(size, skip_eplb=True, remove_lora=False)
self.model_runner.maybe_remove_all_loras(self.model_runner.lora_config)
```

- Kernel warmup: kernel_warmup(self) (gpu_worker.py:360).
```python
# vllm/vllm/model_executor/warmup/kernel_warmup.py
def kernel_warmup(worker: "Worker"):
    # Deep GEMM warmup
    do_deep_gemm_warmup = (
        envs.VLLM_USE_DEEP_GEMM
        and is_deep_gemm_supported()
        and not envs.VLLM_SKIP_DEEP_GEMM_WARMUP
    )
    if do_deep_gemm_warmup:
        model = worker.get_model()
        max_tokens = worker.scheduler_config.max_num_batched_tokens
        deep_gemm_warmup(model, max_tokens)

    # FlashInfer autotune for Hopper (SM 9.0) and Blackwell (SM 10.0) GPUs
    if has_flashinfer() and current_platform.has_device_capability(90):
        flashinfer_autotune(worker.model_runner)

    # FlashInfer attention warmup
    # Only warmup if the model has FlashInfer attention groups
    # and is not a pooling model
    def _is_flashinfer_backend(backend):
        try:
            return backend.get_name() == "FLASHINFER"
        except NotImplementedError:
            return False

    if not worker.model_runner.is_pooling_model and all(
        _is_flashinfer_backend(group.backend)
        for groups in worker.model_runner.attn_groups
        for group in groups
    ):
        logger.info("Warming up FlashInfer attention.")
        # Warmup with mixed batch containing both prefill and decode tokens
        # This is to warm up both prefill and decode attention kernels
        worker.model_runner._dummy_run(
            num_tokens=16,
            skip_eplb=True,
            is_profile=True,
            force_attention=True,
            create_mixed_batch=True,
        )
```

- CUDA Graph capture: if eager mode is not enforced, call self.model_runner.capture_model() (gpu_worker.py:363).

```python
cuda_graph_memory_bytes = 0
if not self.model_config.enforce_eager:
    cuda_graph_memory_bytes = self.model_runner.capture_model()
```

GPUModelRunner.capture_model will:

- Initialize and resolve the available CUDA Graph modes and keys (initialize_cudagraph_capture; gpu_model_runner.py:3928-3995).
```python
# vllm/vllm/v1/worker/gpu_model_runner.py
def capture_model(self) -> int:
    # Initialize and resolve the available CUDA Graph modes and keys
    if self.compilation_config.cudagraph_mode == CUDAGraphMode.NONE:
        logger.warning(
            "Skipping CUDA graph capture. To turn on CUDA graph capture, "
            "ensure `cudagraph_mode` was not manually set to `NONE`"
        )
        return 0
    else:
        self.initialize_cudagraph_capture()
```

- Open the capture context and run _capture_cudagraphs(...) over the configured batch sizes (gpu_model_runner.py:3700-3775).
```python
set_cudagraph_capturing_enabled(True)  # enable the capture context
with freeze_gc(), graph_capture(device=self.device):
    start_free_gpu_memory = torch.cuda.mem_get_info()[0]
    cudagraph_mode = self.compilation_config.cudagraph_mode
    assert cudagraph_mode is not None
    if cudagraph_mode.mixed_mode() != CUDAGraphMode.NONE:
        cudagraph_runtime_mode = cudagraph_mode.mixed_mode()
        compilation_cases = list(reversed(self.cudagraph_batch_sizes))
        self._capture_cudagraphs(  # run the capture
            compilation_cases,
            cudagraph_runtime_mode=cudagraph_runtime_mode,
            uniform_decode=False,
        )
```

_capture_cudagraphs does the following for each num_tokens:

- First runs several "non-capturing" warmups: _dummy_run(..., cudagraph_runtime_mode=NONE, force_attention=(runtime mode == FULL)) (gpu_model_runner.py:3799-3813).
```python
# vllm/vllm/v1/worker/gpu_model_runner.py
def _capture_cudagraphs(
    self,
    compilation_cases: list[int],
    cudagraph_runtime_mode: CUDAGraphMode,
    uniform_decode: bool,
):
    # Only rank 0 should print progress bar during capture
    if is_global_first_rank():
        compilation_cases = tqdm(
            compilation_cases,
            disable=not self.load_config.use_tqdm_on_load,
            desc="Capturing CUDA graphs ({}, {})".format(
                "decode" if uniform_decode else "mixed prefill-decode",
                cudagraph_runtime_mode.name,
            ),
        )
    # We skip EPLB here since we don't want to record dummy metrics
    for num_tokens in compilation_cases:
        for _ in range(self.compilation_config.cudagraph_num_of_warmups):
            # Use CUDAGraphRuntimeStyle.NONE (default) for warmup.
            # But be careful, warm up with `NONE` is orthogonal to
            # if we want to warm up attention or not. This is
            # different from the case where `FULL` implies capture
            # attention while `PIECEWISE` implies no attention.
            force_attention = cudagraph_runtime_mode == CUDAGraphMode.FULL
            self._dummy_run(
                num_tokens,
                cudagraph_runtime_mode=CUDAGraphMode.NONE,
                force_attention=force_attention,
                uniform_decode=uniform_decode,
                allow_microbatching=allow_microbatching,
                skip_eplb=True,
                remove_lora=False,
            )
        # The capture run
        self._dummy_run(
            num_tokens,
            cudagraph_runtime_mode=cudagraph_runtime_mode,  # PIECEWISE
            uniform_decode=uniform_decode,
            allow_microbatching=allow_microbatching,
            skip_eplb=True,
            remove_lora=False,
        )
    self.maybe_remove_all_loras(self.lora_config)
```

- Then performs one "capture run": _dummy_run(..., cudagraph_runtime_mode=FULL/PIECEWISE, ...) (gpu_model_runner.py:3814-3823).
Inside that "capture run":

- _dummy_run sets the forward_context's cudagraph_runtime_mode and batch_descriptor (coming from CudagraphDispatcher.dispatch) (gpu_model_runner.py:3476-3497, forward_context.py:268-306, cudagraph_dispatcher.py:89-133).
- self.model(...) actually enters CUDAGraphWrapper.__call__ (or UBatchWrapper). When entry.cudagraph is None, the real capture is triggered: a torch.cuda.CUDAGraph is created, one model forward is executed inside torch.cuda.graph(...) with a weak reference to the outputs kept, and the graph is attached to the cache entry for that BatchDescriptor (cuda_graph.py:112-166).
- Sampler/pooler warmup: on the last PP rank, run _dummy_run(..., skip_eplb=True) once more, followed by _dummy_sampler_run or _dummy_pooler_run to pre-allocate logits and sampling buffers and reduce fragmentation (gpu_worker.py:420-452).
- Reset the random seed so that warmup does not affect real inference (gpu_worker.py:457-460).
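Putting the steps above together, the startup-time call order on the worker side is roughly the following condensed restatement (pseudocode-style, using the method names from gpu_worker.py and gpu_model_runner.py; the block-count arguments are left symbolic):

```python
# Condensed restatement of the warmup order described above.
def warm_up_worker(worker, num_gpu_blocks: int, num_cpu_blocks: int) -> None:
    worker.init_device()                   # pick/set the local GPU
    worker.load_model()                    # load + wrap the model (compile / CUDAGraphWrapper)
    worker.determine_available_memory()    # profile_run -> _dummy_run(is_profile=True)
    worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
    worker.compile_or_warm_up_model()      # compile warmups, kernel_warmup, capture_model
```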
The _dummy_run function
_dummy_run itself does not call torch.cuda.graph directly; it builds the ForwardContext for the given shape and mode and then calls self.model(...). The actual CUDA Graph creation and replay happen in the model's wrapper (CUDAGraphWrapper or UBatchWrapper), and the first capture only happens when cudagraph_runtime_mode is FULL/PIECEWISE and that shape has not yet been captured; afterwards the same shape goes through replay().
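To make the division of labor concrete, here is a minimal, hypothetical wrapper that captures a torch.cuda.CUDAGraph the first time it sees a given token count and replays it afterwards. The real CUDAGraphWrapper additionally keys entries on the BatchDescriptor, checks cudagraph_runtime_mode, and relies on the warmup passes described above; everything here except the public PyTorch APIs is illustrative.

```python
# Hypothetical capture-on-first-call / replay-afterwards wrapper (requires CUDA).
import torch

class GraphPerShapeWrapper:
    def __init__(self, model: torch.nn.Module):
        self.model = model
        self.entries: dict[int, tuple] = {}   # num_tokens -> (graph, static_in, static_out)

    def _capture(self, static_in: torch.Tensor):
        # Warm up on a side stream before capture (standard CUDA Graph recipe);
        # in vLLM this corresponds to the non-capturing _dummy_run warmups.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(2):
                self.model(static_in)
        torch.cuda.current_stream().wait_stream(s)
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_out = self.model(static_in)
        return graph, static_out

    def __call__(self, input_ids: torch.Tensor) -> torch.Tensor:
        num_tokens = input_ids.shape[0]
        if num_tokens not in self.entries:     # first time this shape is seen: capture
            static_in = input_ids.clone()
            graph, static_out = self._capture(static_in)
            self.entries[num_tokens] = (graph, static_in, static_out)
            return static_out
        graph, static_in, static_out = self.entries[num_tokens]
        static_in.copy_(input_ids)             # refresh inputs in the captured buffers
        graph.replay()                         # same shape again: just replay
        return static_out
```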
Trigger points for capture and replay
The actual torch.cuda.CUDAGraph() capture/replay does not happen inside _dummy_run but in the wrapper around self.model(...):

- Without DBO, self.model is typically replaced by CUDAGraphWrapper (runtime_mode = FULL or PIECEWISE).
- In DBO micro-batching scenarios, self.model may be a UBatchWrapper (which can capture a combined graph of the two ubatches).
When the cudagraph_runtime_mode set in set_forward_context is FULL or PIECEWISE and the batch_descriptor is a capturable shape:

- First forward pass: the wrapper detects that this shape has not been captured yet, enters the torch.cuda.graph(...) scope, runs one forward pass, and caches the graph.
- Subsequent passes with the same shape: the wrapper simply calls replay().
Capture is not triggered in the following cases:

- cudagraph_runtime_mode=NONE (e.g. is_profile=True or a forced warmup).
- The wrapper mode does not match the runtime mode.
- The dispatcher returns NONE (the shape is not in the key set, or the configuration does not support it).
- attn_metadata indicates that KV scales need to be computed, etc., forcing NONE (the regular path falls back this way).
Common variants and their effects
- uniform_decode=True: tends to capture the pure-decode optimized path; it also affects the BatchDescriptor and graph-key matching.
- create_mixed_batch=True: used to capture the mixed prefill+decode path; the prefill sequences are kept short to speed things up.
- allow_microbatching=True and the threshold is met: may go through UBatchWrapper and capture a micro-batched combined graph for that num_tokens.
- force_attention=True: even if the runtime mode is not FULL, attention metadata is built to warm up the backend (not necessarily capturing a graph).
That is the complete _dummy_run flow: parsing the arguments, constructing the batch, injecting the context, and finally triggering the model wrapper to perform CUDA Graph capture/replay.
1 | # vllm/vllm/v1/worker/gpu_model_runner.py > GPUModelRunner |
CUDA Graph capture trigger points
- CUDAGraphWrapper.__call__ only runs the capture flow when _dummy_run enters the model with cudagraph_runtime_mode != NONE and the cache entry for the corresponding BatchDescriptor has not been captured yet (cuda_graph.py:96-167).
- The switch between warmup (no capture) and capture is controlled explicitly by the cudagraph_runtime_mode argument along capture_model → _capture_cudagraphs → _dummy_run.
Additional tips
- To confirm when capture happens, set the log level to DEBUG; CUDAGraphWrapper prints "Capturing a cudagraph on (…, batch_descriptor=…)" on the first capture (cuda_graph.py:104-112).
- CudagraphDispatcher only provides valid batch_descriptor values for the batch sizes allowed by the configuration; other sizes are dispatched as CUDAGraphMode.NONE and do not trigger capture (cudagraph_dispatcher.py:89-133).
- To locate the batch size of the first actual "capture run", or to see which sizes have already been captured, print self.model_runner.cudagraph_batch_sizes and the keys of CUDAGraphWrapper.concrete_cudagraph_entries in the running process and compare them with the configured cudagraph_capture_sizes.