
[Performance][CoreML] GatherOpBuilder rejects rank-0 (scalar) indices, forcing CPU fallback for StyleGAN-family models #28180

@maxwbuckley

Description


Describe the issue

GatherOpBuilder::IsOpSupportedImpl in onnxruntime/core/providers/coreml/builders/impl/gather_op_builder.cc explicitly rejects Gather nodes with rank-0 (scalar) indices:

// Don't allow scalar 'indices' input.
// We convert scalar inputs to tensors with shape [1] before providing them to CoreML.
// This modification changes the shape of the Gather output.
if (indices_shape.empty()) {
    LOGS(logger, VERBOSE) << "Gather does not support scalar 'indices'";
    return false;
}

The comment acknowledges the workaround (reshape scalar → [1]), but concludes it isn't applied because it would change the output shape. The fixup, however, is straightforward: reshape indices to [1], run the gather, then squeeze the extra axis on the output — everything remains internal to the builder.
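The equivalence behind that fixup can be checked in a few lines of NumPy, whose `take` mirrors single-axis ONNX Gather semantics (a sketch for illustration, not the EP code path):

```python
import numpy as np

x = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)

# Scalar-index gather: the gathered axis disappears -> shape (2, 3)
direct = np.take(x, 2, axis=1)

# Workaround: promote the index to shape [1], gather (axis is kept
# with size 1), then squeeze that axis back out.
promoted = np.take(x, [2], axis=1)       # shape (2, 1, 3)
squeezed = np.squeeze(promoted, axis=1)  # shape (2, 3)

assert direct.shape == squeezed.shape == (2, 3)
assert np.array_equal(direct, squeezed)
```

Both paths produce the same values and the same rank, which is why the reshape–gather–squeeze sequence can stay entirely internal to the builder.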

Real-world impact

Any StyleGAN/StyleGAN2-derived model exported from PyTorch slices per-layer style codes using scalar-index Gathers. The resulting pattern is:

  • data: [1, N_layers, D] (float32)
  • indices: [] (int64 scalar constant)
  • axis: 1
  • output: [1, D]

GFPGAN (1024×1024 variant) contains 16 such Gathers. Because they're interleaved with the rest of the generator, they split the CoreML subgraph in two and force 16 CPU nodes. Rewriting the Gathers at the ONNX level (to 1D-index Gather + Squeeze — exactly what the builder could do internally) collapses the graph back to a single CoreML partition and measurably speeds things up.

Measured on an Apple M-series Mac, ONNX Runtime 1.24.4 (onnxruntime-silicon), CoreML EP with MLProgram + MLComputeUnits: ALL + AllowLowPrecisionAccumulationOnGPU: 1, 512² input:

variant                                                   CoreML partitions  CPU nodes  inference
baseline                                                  2                  16         86.8 ms
scalar Gather → 1D Gather + Squeeze (ONNX-level rewrite)  1                  0          80.6 ms

The absolute speedup is ~7%. The real value is architectural: the entire generator now runs on ANE without boundary crossings.

To reproduce

Minimal model that triggers the rejection:

import onnx, onnxruntime as ort
from onnx import helper, TensorProto

def make(scalar_idx: bool):
    X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 16, 512])
    if scalar_idx:
        Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 512])
        inits = [helper.make_tensor("idx", TensorProto.INT64, [], [3])]
        nodes = [helper.make_node("Gather", ["X", "idx"], ["Y"], axis=1)]
    else:
        Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 512])
        inits = [
            helper.make_tensor("idx", TensorProto.INT64, [1], [3]),
            helper.make_tensor("ax", TensorProto.INT64, [1], [1]),
        ]
        nodes = [
            helper.make_node("Gather",  ["X", "idx"], ["G"], axis=1),
            helper.make_node("Squeeze", ["G", "ax"], ["Y"]),
        ]
    g = helper.make_graph(nodes, "t", [X], [Y], inits)
    return helper.make_model(g, opset_imports=[helper.make_opsetid("", 13)])

for scalar in (True, False):
    path = f"/tmp/gather_{scalar}.onnx"
    onnx.save(make(scalar), path)
    so = ort.SessionOptions()
    so.log_severity_level = 0  # 0 = VERBOSE; the EP's capability-check messages log at this level
    sess = ort.InferenceSession(path, sess_options=so, providers=[
        ("CoreMLExecutionProvider", {"ModelFormat": "MLProgram", "MLComputeUnits": "ALL"}),
        "CPUExecutionProvider",
    ])
    # get_providers() lists the providers registered with the session;
    # per-node placement is reported in the verbose log.
    print(f"scalar_idx={scalar} -> {sess.get_providers()}")

Expected: Both variants place the Gather on CoreMLExecutionProvider.
Actual: The scalar-indices variant logs Gather does not support scalar 'indices' and the op lands on CPUExecutionProvider.

Suggested fix

Inside GatherOpBuilder::AddToModelBuilderImpl, when indices_shape.empty():

  1. Insert a Reshape/ExpandDims in the emitted CoreML subgraph to promote indices from [] to [1].
  2. Emit the CoreML gather op.
  3. Insert a Squeeze on the gather axis to restore the original output rank.

This matches exactly what the existing code comment describes, keeps the caller-visible Gather semantics unchanged, and allows the check at line ~91 to be removed or relaxed.

Urgency

Not urgent. StyleGAN-family models are a common real-world case, and the speedup is meaningful (~7%) but not blocking — workaround is a one-pass ONNX rewrite at load time.

Platform

Mac

OS Version

macOS 15.3 (Darwin 25.3.0)

ONNX Runtime Installation

Released Package (onnxruntime-silicon==1.24.4)

ONNX Runtime Version or Commit ID

1.24.4

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

CoreML

Is this a quantized model?

No


    Labels

    ep:CoreML (issues related to the CoreML execution provider)
