Vertex AI – Autoscaling Behavior for Inference Tasks

Is there a way to tweak the scaling behaviour for custom model deployments on Vertex AI?

While we can set the min and max replica counts on custom model deployments, is there a way to control how many concurrent requests are routed to a single instance?
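
For reference, the deployment is set up roughly like this (a minimal sketch using the google-cloud-aiplatform Python SDK; the project, model ID, machine type, and replica counts are placeholders, not our real values):

```python
from google.cloud import aiplatform

# Placeholder project/region and model resource name.
aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)

# Min/max replica counts bound the autoscaling, but as far as I can tell
# they don't limit how many concurrent requests each replica receives.
endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=4,
)
```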

We’re performing GPU-intensive operations that typically allow only a single inference task per GPU, since one task consumes more than 50% of the GPU memory.

Since the container is just an HTTP server, I’m wondering whether concurrency will cause an out-of-memory error if a second request is sent to the same node.
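
The workaround I’m considering is to serialize GPU work inside the container itself, e.g. with a semaphore around the prediction handler (sketch only, assuming a FastAPI-based custom container; `run_inference` is a placeholder for our actual model call):

```python
import asyncio
from fastapi import FastAPI, Request

app = FastAPI()

# Allow only one inference at a time on this replica, since a single task
# already uses >50% of GPU memory and a second concurrent task would OOM.
gpu_lock = asyncio.Semaphore(1)


def run_inference(instances):
    # Placeholder for the real GPU inference call.
    return [None for _ in instances]


@app.post("/predict")
async def predict(request: Request):
    payload = await request.json()
    async with gpu_lock:
        # The blocking model call runs in a worker thread so the event loop
        # stays responsive while the GPU is busy.
        result = await asyncio.to_thread(run_inference, payload["instances"])
    return {"predictions": result}
```

That only pushes the problem into request queuing and timeouts inside the container, though, which is why I’d prefer a way to cap per-instance concurrency at the routing/autoscaling level.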
