Add option to memory map .ORT model loads#28164
Draft: Kevin-Taha wants to merge 4 commits into main
…ap benchmarking:
- Clean up benchmark_mmap_ort.py: remove unused code; simplify the multi-process approach to use native perf_test processes instead of Python wrappers
- Add a --hold_ms_after_session_creation flag to onnxruntime_perf_test to keep sessions alive for multi-process memory measurement
- Print a SESSION_READY marker while holding so the benchmark script knows when to sample

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
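The hold-and-sample handshake above can be sketched as follows. This is a hypothetical illustration, not the PR's actual script: the child process here is a stdlib stand-in for onnxruntime_perf_test launched with --hold_ms_after_session_creation, and it simply prints the marker and blocks.

```python
import subprocess
import sys

# Hypothetical sketch of how a benchmark script could wait for the
# SESSION_READY marker before sampling memory. The child command is a
# stand-in for onnxruntime_perf_test run with
# --hold_ms_after_session_creation; it prints the marker then blocks on
# stdin to simulate holding the session alive.
child = subprocess.Popen(
    [sys.executable, "-c", "print('SESSION_READY', flush=True); input()"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

ready = False
for line in child.stdout:
    if "SESSION_READY" in line:
        ready = True  # the session is alive and holding; safe to sample now
        break

# ... sample per-process memory here while the session is held ...

child.stdin.write("\n")  # release the hold
child.stdin.close()
child.wait()
```

Reading line-by-line until the marker appears avoids racing the child: memory is only sampled once the session is known to exist.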
Addressing issue #25524 (MS internal: 60577894)
Today, the closest callers can get to loading models from a shared resource is to map the model themselves and pass it via use_ort_model_bytes_directly, which also leaves the caller responsible for keeping the mapping valid. These changes introduce use_memory_mapped_ort_model, a session option that uses memory-mapped I/O to load ORT format models directly inside OnnxRuntime; in this case the mapping is owned by the InferenceSession. The implementation is small and minimal, reusing ORT's existing platform-agnostic memory mapping helpers, and if we later choose to make this the default behavior it could mean automatic memory savings for multi-process usage.
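For intuition, here is a minimal stdlib sketch of the kind of read-only mapping the option enables, using Python's mmap in place of ORT's platform-agnostic helpers; the file is a throwaway stand-in for an .ort model, not a real one.

```python
import mmap
import os
import tempfile

# Create a fake "model" file to map (stand-in for an .ort model).
fd, path = tempfile.mkstemp(suffix=".ort")
os.write(fd, b"ORTM" + b"\x00" * 4092)  # fake model bytes
os.close(fd)

with open(path, "rb") as f:
    # Read-only mapping: the pages are backed by the file, so processes
    # that map the same model share physical pages instead of each
    # holding its own heap copy of the bytes.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:4]  # slicing reads straight from the mapped pages
    mm.close()

os.remove(path)
```

The multi-process savings come from this page sharing, which is why owning the mapping inside the InferenceSession (rather than asking callers to do it) is attractive.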
Note about memory implications & sharing model bytes:
In practice, use_memory_mapped_ort_model alone does not reduce long-running memory usage, because ORT ultimately copies the model bytes from the mapped pages into Tensors. Using it together with session.use_ort_model_bytes_for_initializers ensures that initializers point directly at the flatbuffer bytes and avoids that extra copy; this would be the expected usage for multi-process sharing of a single model. That raises questions about what the default behavior should be; the changes in this PR are conservative and retain all existing defaults for now.
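The combined usage described above might look like the following sketch. The initializer key matches ORT's existing "session.*" config-entry convention; the exact mmap key name comes from this PR and is an assumption here, and enable_shared_model is a hypothetical helper.

```python
# Assumed config-entry key names (the mmap key is introduced by this PR).
MMAP_KEY = "session.use_memory_mapped_ort_model"
INITIALIZER_KEY = "session.use_ort_model_bytes_for_initializers"


def enable_shared_model(session_options):
    """Enable mmap loading and point initializers at the mapped bytes,
    skipping the per-process copy into Tensors."""
    session_options.add_session_config_entry(MMAP_KEY, "1")
    session_options.add_session_config_entry(INITIALIZER_KEY, "1")
    return session_options
```

With the onnxruntime Python bindings this would be applied as enable_shared_model(ort.SessionOptions()) before constructing the InferenceSession for the .ort model.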
Changes
- …on initializer gating to note the mmap case
Benchmark Examples
Note that the benchmark was largely written by GHCP and may not be perfect, but I've validated some of its results.
Single-Proc
Here is a sample result from a single-process benchmark using resnet50 (converted to ORT format). Note that these measure peaks during construction and not end-states, and the measurements may be imperfect.
python tools/python/benchmark_mmap_ort.py --perf-test build\Windows\Release\Release\onnxruntime_perf_test.exe --model resnet50.ort --iterations 15

Multi-Proc
The multi-proc benchmark shows that total memory savings for shared models can only be obtained alongside use_ort_model_bytes_for_initializers.