
Start command could be useful at least for the kubernetes driver

MichalAugustyn opened this issue 4 years ago · 3 comments

The Kubernetes deployment is currently only created the first time the buildx build command is run.

It can take quite a while to initialize the pod, especially when the Kubernetes cluster needs to scale up.

It would be useful if I could retry just the initialization step. Right now, that is possible by building a dummy image to initialize the pod, but it's not very convenient:

docker buildx create --name test --driver kubernetes    # create a local builder
retry {
  docker buildx build . -f ./dummy.Dockerfile           # build a dummy image to initialize the pod
}
docker buildx build .                                   # build the final image

A start command could create the deployment and scale the pods up to the minReplicas value:

docker buildx create --name test --driver kubernetes
retry {
  docker buildx start test
}
docker buildx build .   

I could also initialize the builder at the beginning of a pipeline to save some time.

If you like the idea, I am willing to prepare a pull request adding a start command and handling the stop command for the kubernetes driver.

@tonistiigi @AkihiroSuda I need your opinion

MichalAugustyn · May 30 '21 14:05

Another approach could be to change the default create behaviour of the kubernetes driver: buildx create --driver kubernetes could already create the deployment with the expected number of replicas.

We could have a kubernetes-driver-specific option stopped=(true|false) if needed.

MichalAugustyn · May 30 '21 15:05
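For illustration only, here is a minimal sketch of how a stopped=(true|false) value passed via --driver-opt could be parsed. The option itself is just the proposal above; parseStoppedOpt and its default behaviour are made up here and are not existing buildx code.

package main

import (
	"fmt"
	"strconv"
)

// parseStoppedOpt reads a hypothetical "stopped" driver option from the
// key=value driver-opt map. With stopped=false (the illustrative default),
// `buildx create` would create the deployment and scale it to minReplicas
// right away; stopped=true would keep today's lazy behaviour of deploying
// on the first build.
func parseStoppedOpt(driverOpts map[string]string) (bool, error) {
	v, ok := driverOpts["stopped"]
	if !ok {
		return false, nil // default: create the deployment eagerly
	}
	stopped, err := strconv.ParseBool(v)
	if err != nil {
		return false, fmt.Errorf("invalid value %q for driver-opt stopped: %w", v, err)
	}
	return stopped, nil
}

func main() {
	stopped, err := parseStoppedOpt(map[string]string{"stopped": "true"})
	if err != nil {
		panic(err)
	}
	fmt.Println("stopped:", stopped)
}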

I'd be happy if I could specify the timeout as a driver option. The current timeout ends up being close to 120s:

[+] Building 115.0s (1/1) FINISHED                                                                  
 => ERROR [internal] booting buildkit                                                        115.0s
 => => waiting for 4 pods to be ready                                                        114.9s
------
 > [internal] booting buildkit:
------
error: expected 4 replicas to be ready, got 2

I couldn't quite correlate that figure with the code that prints the error:

func (d *Driver) wait(ctx context.Context) error {
    // TODO: use watch API
    var (
        err  error
        depl *appsv1.Deployment
    )
    for try := 0; try < 100; try++ {
        depl, err = d.deploymentClient.Get(ctx, d.deployment.Name, metav1.GetOptions{})
        if err == nil {
            if depl.Status.ReadyReplicas >= int32(d.minReplicas) {
                return nil
            }
            err = errors.Errorf("expected %d replicas to be ready, got %d",
                d.minReplicas, depl.Status.ReadyReplicas)
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Duration(100+try*20) * time.Millisecond):
        }
    }
    return err
}

Either way, it is just a hair too short for our build cluster to auto-scale.

andras-kth · Sep 25 '21 15:09
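For context, the loop above sleeps 100 + 20*try ms per attempt, which over its 100 attempts adds up to roughly 109 s of sleep before counting the Get calls, consistent with the ~115 s in the log. Below is a minimal, self-contained sketch of what a deadline-bounded wait could look like instead, with the timeout supplied by the caller rather than a fixed attempt count. The waitForReplicas function, the readyReplicas callback, and the 5-minute example timeout are all made up for illustration; this is not the actual buildx implementation.

package main

import (
	"context"
	"fmt"
	"time"
)

// waitForReplicas polls readyReplicas until it reports at least minReplicas,
// using the same growing backoff as the current driver code, but bounded by
// the context deadline instead of a fixed number of attempts. The caller can
// derive the context from a (hypothetical) timeout driver option.
func waitForReplicas(ctx context.Context, minReplicas int32, readyReplicas func(context.Context) (int32, error)) error {
	var lastErr error
	for try := 0; ; try++ {
		ready, err := readyReplicas(ctx)
		if err == nil {
			if ready >= minReplicas {
				return nil
			}
			err = fmt.Errorf("expected %d replicas to be ready, got %d", minReplicas, ready)
		}
		lastErr = err
		select {
		case <-ctx.Done():
			// Return the most recent readiness error for a clearer message
			// than a bare "context deadline exceeded".
			if lastErr != nil {
				return lastErr
			}
			return ctx.Err()
		case <-time.After(time.Duration(100+try*20) * time.Millisecond):
		}
	}
}

func main() {
	// Example: give the cluster up to 5 minutes to scale up and become ready.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Stand-in for d.deploymentClient.Get(...).Status.ReadyReplicas.
	fakeReady := func(context.Context) (int32, error) { return 4, nil }

	if err := waitForReplicas(ctx, 4, fakeReady); err != nil {
		fmt.Println("boot failed:", err)
		return
	}
	fmt.Println("buildkit pods ready")
}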

@crazy-max just reiterating @andras-kth's suggestion to introduce a configurable timeout for the buildkit pod provisioning phase. As @andras-kth mentioned, the current static timeout is sometimes just not enough, especially when the K8s cluster has to spin up new nodes for these pods. From an end-user perspective, a longer timeout would be better UX than a failure requiring a manual retry (in our case, using GitHub Actions, it means rerunning the whole workflow).

psypuff · Dec 29 '23 10:12
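To sketch the user-facing side of such a timeout, the snippet below parses a hypothetical timeout driver option (e.g. --driver-opt timeout=10m), falls back to a default close to today's ~2-minute budget, and derives a context deadline from it. The option name, the default, and the provisionTimeout helper are placeholders for illustration, not a statement about what buildx ships.

package main

import (
	"context"
	"fmt"
	"time"
)

// provisionTimeout reads a hypothetical "timeout" driver option and falls
// back to a default roughly matching the current hard-coded ~2-minute budget.
func provisionTimeout(driverOpts map[string]string) (time.Duration, error) {
	const defaultTimeout = 120 * time.Second
	v, ok := driverOpts["timeout"]
	if !ok {
		return defaultTimeout, nil
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return 0, fmt.Errorf("invalid value %q for driver-opt timeout: %w", v, err)
	}
	return d, nil
}

func main() {
	timeout, err := provisionTimeout(map[string]string{"timeout": "10m"})
	if err != nil {
		panic(err)
	}
	// The boot/wait phase would then run under this deadline.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	_ = ctx // would be passed down to the driver's wait/boot logic
	fmt.Println("provisioning deadline:", timeout)
}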