### Describe the issue
`GatherOpBuilder::IsOpSupportedImpl` in `onnxruntime/core/providers/coreml/builders/impl/gather_op_builder.cc` explicitly rejects `Gather` nodes with rank-0 (scalar) `indices`:
```cpp
// Don't allow scalar 'indices' input.
// We convert scalar inputs to tensors with shape [1] before providing them to CoreML.
// This modification changes the shape of the Gather output.
if (indices_shape.empty()) {
  LOGS(logger, VERBOSE) << "Gather does not support scalar 'indices'";
  return false;
}
```
The comment acknowledges the workaround (reshape scalar → `[1]`), but concludes it isn't applied because it would change the output shape. The fixup, however, is straightforward: reshape `indices` to `[1]`, run the gather, then squeeze the extra axis on the output — everything remains internal to the builder.
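The intended fixup is easy to sanity-check with numpy. This is a minimal sketch of the reshape/gather/squeeze equivalence, not the builder code:

```python
import numpy as np

data = np.random.rand(1, 16, 512).astype(np.float32)  # [1, N_layers, D]
idx = np.array(3, dtype=np.int64)                     # rank-0 (scalar) indices
axis = 1

# Reference: ONNX Gather with scalar indices drops the gather axis -> [1, 512]
ref = np.take(data, idx, axis=axis)

# Workaround: promote indices to shape [1], gather (keeps the axis as size 1),
# then squeeze the gather axis to restore the original output rank.
out = np.take(data, idx.reshape(1), axis=axis)  # [1, 1, 512]
out = np.squeeze(out, axis=axis)                # [1, 512]

assert ref.shape == out.shape == (1, 512)
assert np.array_equal(ref, out)
```

The caller-visible output is bit-identical either way, which is why the fixup can stay entirely inside the builder.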
**Real-world impact**
Any StyleGAN/StyleGAN2-derived model exported from PyTorch slices per-layer style codes using scalar-index Gathers. The resulting pattern is:
- `data`: `[1, N_layers, D]` (float32)
- `indices`: `[]` (int64 scalar constant)
- `axis`: `1`
- `output`: `[1, D]`
GFPGAN (1024×1024 variant) contains 16 such Gathers. Because they're interleaved with the rest of the generator, they split the CoreML subgraph in two and force 16 CPU nodes. Rewriting the Gathers at the ONNX level (to 1D-index Gather + Squeeze — exactly what the builder could do internally) collapses the graph back to a single CoreML partition and measurably speeds things up.
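Schematically, the load-time rewrite is a single pass over the graph. The sketch below uses plain dicts standing in for onnx `NodeProto` objects, and a `ranks` lookup standing in for initializer shape inspection; none of these names are ONNX Runtime API:

```python
def rewrite_scalar_gathers(nodes, ranks):
    """Split each Gather whose 'indices' initializer is rank-0 into a
    1-D Gather plus a Squeeze on the gather axis.

    `nodes` is a list of dicts {"op", "inputs", "outputs", "axis"};
    `ranks` maps initializer name -> rank. Both are simplified stand-ins
    for the real onnx protobuf structures. The scalar initializer itself
    is assumed to be re-shaped from [] to [1] elsewhere in the pass.
    """
    rewritten = []
    for node in nodes:
        scalar_gather = (node["op"] == "Gather"
                         and ranks.get(node["inputs"][1]) == 0)
        if not scalar_gather:
            rewritten.append(node)
            continue
        mid = node["outputs"][0] + "_gathered"  # intermediate tensor name
        # 1-D Gather keeps the gather axis as a size-1 dimension...
        rewritten.append({"op": "Gather", "inputs": node["inputs"],
                          "outputs": [mid], "axis": node["axis"]})
        # ...and Squeeze on that same axis restores the scalar-Gather shape.
        rewritten.append({"op": "Squeeze", "inputs": [mid],
                          "outputs": node["outputs"], "axis": node["axis"]})
    return rewritten
```

For the GFPGAN pattern this turns each `Gather(axis=1)` with scalar indices into `Gather(axis=1)` → `Squeeze(axis=1)`, which the CoreML builder accepts.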
Measured on an Apple M-series Mac, ONNX Runtime 1.24.4 (onnxruntime-silicon), CoreML EP with `MLProgram` + `MLComputeUnits: ALL` + `AllowLowPrecisionAccumulationOnGPU: 1`, 512² input:
| variant | CoreML partitions | CPU nodes | inference |
|---|---|---|---|
| baseline | 2 | 16 | 86.8 ms |
| scalar Gather → 1D Gather + Squeeze (ONNX-level rewrite) | 1 | 0 | 80.6 ms |
The absolute speedup is ~7%. The real value is architectural: the entire generator now runs on ANE without boundary crossings.
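For completeness, the provider configuration behind the numbers above as a Python fragment (option names spelled as in this report; how it is passed to `ort.InferenceSession` is shown in the repro below):

```python
# CoreML EP options used for the benchmark runs.
coreml_options = {
    "ModelFormat": "MLProgram",
    "MLComputeUnits": "ALL",
    "AllowLowPrecisionAccumulationOnGPU": "1",
}
providers = [
    ("CoreMLExecutionProvider", coreml_options),
    "CPUExecutionProvider",  # fallback for unsupported nodes
]
```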
### To reproduce
Minimal model that triggers the rejection:
```python
import onnx, onnxruntime as ort
from onnx import helper, TensorProto

def make(scalar_idx: bool):
    X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 16, 512])
    if scalar_idx:
        Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 512])
        inits = [helper.make_tensor("idx", TensorProto.INT64, [], [3])]
        nodes = [helper.make_node("Gather", ["X", "idx"], ["Y"], axis=1)]
    else:
        Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 512])
        inits = [
            helper.make_tensor("idx", TensorProto.INT64, [1], [3]),
            helper.make_tensor("ax", TensorProto.INT64, [1], [1]),
        ]
        nodes = [
            helper.make_node("Gather", ["X", "idx"], ["G"], axis=1),
            helper.make_node("Squeeze", ["G", "ax"], ["Y"]),
        ]
    g = helper.make_graph(nodes, "t", [X], [Y], inits)
    return helper.make_model(g, opset_imports=[helper.make_opsetid("", 13)])

for scalar in (True, False):
    path = f"/tmp/gather_{scalar}.onnx"
    onnx.save(make(scalar), path)
    so = ort.SessionOptions()
    so.log_severity_level = 1
    sess = ort.InferenceSession(path, sess_options=so, providers=[
        ("CoreMLExecutionProvider", {"ModelFormat": "MLProgram", "MLComputeUnits": "ALL"}),
        "CPUExecutionProvider",
    ])
    print(f"scalar_idx={scalar} -> {sess.get_providers()}")
```
**Expected:** both variants place the Gather on `CoreMLExecutionProvider`.
**Actual:** the scalar-indices variant logs `Gather does not support scalar 'indices'` and the op lands on `CPUExecutionProvider`.
### Suggested fix
Inside `GatherOpBuilder::AddToModelBuilderImpl`, when `indices_shape.empty()`:

- Insert a `Reshape`/`ExpandDims` in the emitted CoreML subgraph to promote `indices` from `[]` to `[1]`.
- Emit the CoreML `gather` op.
- Insert a `Squeeze` on the gather axis to restore the original output rank.
This matches exactly what the existing code comment describes, keeps the caller-visible Gather semantics unchanged, and allows the check at line ~91 to be removed or relaxed.
### Urgency

Not urgent. StyleGAN-family models are a common real-world case, and the speedup is meaningful (~7%) but not blocking — the workaround is a one-pass ONNX rewrite at load time.

### Platform

Mac

### OS Version

macOS 15.3 (Darwin 25.3.0)

### ONNX Runtime Installation

Released Package (onnxruntime-silicon==1.24.4)

### ONNX Runtime Version or Commit ID

1.24.4

### ONNX Runtime API

Python

### Architecture

ARM64

### Execution Provider

CoreML

### Is this a quantized model?

No