
[Performance][CoreML] GatherOpBuilder rejects rank-0 (scalar) indices, forcing CPU fallback for StyleGAN-family models #28180

@maxwbuckley

Description


Describe the issue

GatherOpBuilder::IsOpSupportedImpl in onnxruntime/core/providers/coreml/builders/impl/gather_op_builder.cc explicitly rejects Gather nodes with rank-0 (scalar) indices:

// Don't allow scalar 'indices' input.
// We convert scalar inputs to tensors with shape [1] before providing them to CoreML.
// This modification changes the shape of the Gather output.
if (indices_shape.empty()) {
    LOGS(logger, VERBOSE) << "Gather does not support scalar 'indices'";
    return false;
}

The comment acknowledges the workaround (reshape scalar → [1]), but concludes it isn't applied because it would change the output shape. The fixup, however, is straightforward: reshape indices to [1], run the gather, then squeeze the extra axis on the output — everything remains internal to the builder.
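The equivalence behind that fixup can be checked in a few lines of NumPy, whose `take` mirrors single-axis ONNX Gather semantics (a sketch for illustration, not the EP code path):

```python
import numpy as np

x = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)

# Scalar-index gather: the gathered axis disappears -> shape (2, 3)
direct = np.take(x, 2, axis=1)

# Workaround: promote the index to shape [1], gather (axis is kept
# with size 1), then squeeze that axis back out.
promoted = np.take(x, [2], axis=1)       # shape (2, 1, 3)
squeezed = np.squeeze(promoted, axis=1)  # shape (2, 3)

assert direct.shape == squeezed.shape == (2, 3)
assert np.array_equal(direct, squeezed)
```

Both paths produce the same values and the same rank, which is why the reshape–gather–squeeze sequence can stay entirely internal to the builder.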

Real-world impact

Any StyleGAN/StyleGAN2-derived model exported from PyTorch slices per-layer style codes using scalar-index Gathers. The resulting pattern is:

  • data: [1, N_layers, D] (float32)
  • indices: [] (int64 scalar constant)
  • axis: 1
  • output: [1, D]

GFPGAN (1024×1024 variant) contains 16 such Gathers. Because they're interleaved with the rest of the generator, they split the CoreML subgraph in two and force 16 CPU nodes. Rewriting the Gathers at the ONNX level (to 1D-index Gather + Squeeze — exactly what the builder could do internally) collapses the graph back to a single CoreML partition and measurably speeds things up.

Measured on an Apple M-series Mac, ONNX Runtime 1.24.4 (onnxruntime-silicon), CoreML EP with MLProgram + MLComputeUnits: ALL + AllowLowPrecisionAccumulationOnGPU: 1, 512² input:

variant                                                   CoreML partitions  CPU nodes  inference
baseline                                                  2                  16         86.8 ms
scalar Gather → 1D Gather + Squeeze (ONNX-level rewrite)  1                  0          80.6 ms

The absolute speedup is ~7%. The real value is architectural: the entire generator now runs on ANE without boundary crossings.

To reproduce

Minimal model that triggers the rejection:

import onnx, onnxruntime as ort
from onnx import helper, TensorProto

def make(scalar_idx: bool):
    X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 16, 512])
    if scalar_idx:
        Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 512])
        inits = [helper.make_tensor("idx", TensorProto.INT64, [], [3])]
        nodes = [helper.make_node("Gather", ["X", "idx"], ["Y"], axis=1)]
    else:
        Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 512])
        inits = [
            helper.make_tensor("idx", TensorProto.INT64, [1], [3]),
            helper.make_tensor("ax", TensorProto.INT64, [1], [1]),
        ]
        nodes = [
            helper.make_node("Gather",  ["X", "idx"], ["G"], axis=1),
            helper.make_node("Squeeze", ["G", "ax"], ["Y"]),
        ]
    g = helper.make_graph(nodes, "t", [X], [Y], inits)
    return helper.make_model(g, opset_imports=[helper.make_opsetid("", 13)])

for scalar in (True, False):
    path = f"/tmp/gather_{scalar}.onnx"
    onnx.save(make(scalar), path)
    so = ort.SessionOptions()
    so.log_severity_level = 0  # 0 = VERBOSE; the EP's capability-check messages log at this level
    sess = ort.InferenceSession(path, sess_options=so, providers=[
        ("CoreMLExecutionProvider", {"ModelFormat": "MLProgram", "MLComputeUnits": "ALL"}),
        "CPUExecutionProvider",
    ])
    # get_providers() lists the providers registered with the session;
    # per-node placement is reported in the verbose log.
    print(f"scalar_idx={scalar} -> {sess.get_providers()}")

Expected: Both variants place the Gather on CoreMLExecutionProvider.
Actual: The scalar-indices variant logs Gather does not support scalar 'indices' and the op lands on CPUExecutionProvider.

Suggested fix

Inside GatherOpBuilder::AddToModelBuilderImpl, when indices_shape.empty():

  1. Insert a Reshape/ExpandDims in the emitted CoreML subgraph to promote indices from [] to [1].
  2. Emit the CoreML gather op.
  3. Insert a Squeeze on the gather axis to restore the original output rank.

This matches exactly what the existing code comment describes, keeps the caller-visible Gather semantics unchanged, and allows the check at line ~91 to be removed or relaxed.

Urgency

Not urgent. StyleGAN-family models are a common real-world case, and the speedup is meaningful (~7%) but not blocking — workaround is a one-pass ONNX rewrite at load time.

Platform

Mac

OS Version

macOS 15.3 (Darwin 25.3.0)

ONNX Runtime Installation

Released Package (onnxruntime-silicon==1.24.4)

ONNX Runtime Version or Commit ID

1.24.4

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

CoreML

Is this a quantized model?

No


    Labels

    ep:CoreML (issues related to the CoreML execution provider)
