Is there a way to tweak the scaling behaviour for custom model deployments on Vertex AI?
While we can set the min and max instance counts on custom model deployments, is there a way to control how many concurrent requests are routed to a given instance?
We’re performing GPU-intensive operations, and a single inference task uses more than 50% of the GPU’s memory, so we can typically only run one task per GPU at a time.
Since the serving container is an HTTP server, I’m wondering whether concurrency will cause an out-of-memory error if a second request is sent to the same node while the first is still running.
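For reference, this is roughly how we deploy today. A minimal sketch using the Vertex AI Python SDK; the project, model, and endpoint IDs plus the machine and accelerator specs are placeholders, not our actual setup:

```python
from google.cloud import aiplatform

# Placeholder project/region.
aiplatform.init(project="my-project", location="us-central1")

# Placeholder model resource name and endpoint.
model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)
endpoint = aiplatform.Endpoint.create(display_name="gpu-inference-endpoint")

model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,  # lower bound on instances
    max_replica_count=4,  # upper bound on instances
    # We can tune the autoscaling trigger, but as far as I can tell this
    # doesn't cap the number of concurrent requests sent to one replica:
    autoscaling_target_accelerator_duty_cycle=60,
)
```

Is there a deployment setting (or a recommended pattern, e.g. in the serving container itself) to limit each replica to one in-flight request?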