Fix bug with Jobs

Open chrislovecnm opened this issue 4 years ago • 4 comments

Occasionally we hit a bug while looping to find a job. Here is the error I am getting:

    logger.go:130: 2021-05-27T17:00:56.697Z	WARN	job pod is ready	{"action": "Crdb Version Validator"}
    logger.go:130: 2021-05-27T17:00:56.782Z	WARN	completed version checker	{"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb", "calVersion": "v20.2.8", "containerImage": "cockroachdb/cockroach:v20.2.8"}
    logger.go:130: 2021-05-27T17:00:56.782Z	INFO	request was interrupted	{"CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z	INFO	reconciling CockroachDB cluster	{"CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z	INFO	Running action with index: 0 and  name: Decommission	{"CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z	WARN	check decommission oportunities	{"action": "decommission", "CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z	INFO	replicas decommisioning	{"action": "decommission", "CrdbCluster": "crdb-test-pxntwh/crdb", "status.CurrentReplicas": 3, "expected": 3}
    logger.go:130: 2021-05-27T17:00:56.782Z	INFO	Running action with index: 1 and  name: VersionCheckerAction	{"CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z	WARN	starting to check the crdb version of the container provided	{"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z	WARN	User set image.name, using that field instead of cockroachDBVersion	{"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.794Z	ERROR	failed to reconcile job only err	{"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb", "error": "Job.batch \"crdb-vcheck-27035580\" is invalid: spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\"app.kubernetes.io/component\":\"database\", \"app.kubernetes.io/instance\":\"crdb\", \"app.kubernetes.io/name\":\"cockroachdb\", \"controller-uid\":\"a0255182-4d4a-4c98-af46-1cf5eee46a3e\", \"job-name\":\"crdb-vcheck-27035580\"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:\"crdb\", Image:\"cockroachdb/cockroach:v20.2.9\", Command:[]string{\"/bin/bash\"}, Args:[]string{\"-c\", \"/cockroach/cockroach.sh version | grep 'Build Tag:'| awk '{print $3}'; sleep 150\"}, WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar(nil), Resources:core.ResourceRequirements{Limits:core.ResourceList(nil), Requests:core.ResourceList(nil)}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"File\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), 
RestartPolicy:\"Never\", TerminationGracePeriodSeconds:(*int64)(0xc0158f7b20), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\"ClusterFirst\", NodeSelector:map[string]string(nil), ServiceAccountName:\"cockroach-database-sa\", AutomountServiceAccountToken:(*bool)(0xc0158f7b28), NodeName:\"\", SecurityContext:(*core.PodSecurityContext)(0xc01b036500), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\"\", Subdomain:\"\", SetHostnameAsFQDN:(*bool)(nil), Affinity:(*core.Affinity)(nil), SchedulerName:\"default-scheduler\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\"\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable"}
    logger.go:130: 2021-05-27T17:00:56.794Z	WARN	version checker	{"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb", "job": "crdb-vcheck-27035580"}
    logger.go:130: 2021-05-27T17:00:56.799Z	WARN	job pod is ready	{"action": "Crdb Version Validator"}
    logger.go:130: 2021-05-27T17:00:56.883Z	WARN	completed version checker	{"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb", "calVersion": "v20.2.8", "containerImage": "cockroachdb/cockroach:v20.2.8"}

We are recovering, but this will look weird to an end user.

chrislovecnm avatar May 27 '21 17:05 chrislovecnm

@alinadonisa @keith-mcclellan PTAL

chrislovecnm avatar May 27 '21 17:05 chrislovecnm

@chrislovecnm the job already has Image:"cockroachdb/cockroach:v20.2.9", but you are reconciling for version "v20.2.8". What scenario are you running? If you run things in parallel, or reconcile more than once within the same minute, the generated timestamp is identical, so the job name will be the same across different runs.

alinadonisa avatar May 27 '21 17:05 alinadonisa

It happens occasionally when running our e2e tests.

chrislovecnm avatar May 27 '21 17:05 chrislovecnm

@davidwding can we close this?

chrislovecnm avatar Sep 14 '21 16:09 chrislovecnm