scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle
In our production environment, where we run Nomad v1.8.2, we noticed overlapping cpusets and the Nomad reserve/share slices being out of sync. Specifically, this happened in a setup where we have various tasks in prestart and poststart hooks alongside the main lifecycle. I managed to reproduce it with the job spec below on the latest main (v1.9.1) in my sandbox environment:
```hcl
job "redis-job-{{SOME_SED_MAGIC}}" {
  type = "service"

  group "cache" {
    count = 1

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cores = 4
      }
    }

    task "redis-start-side" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cores = 4
      }
    }
  }
}
```
Spinning up two jobs with this spec resulted in the following overlap:
```console
[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c/cpuset.effective_cpus  8-11
```
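For illustration, the collision in the cpuset strings above can be checked with a short standalone Go snippet (this is not Nomad code; `parseCpuset` and `overlap` are hypothetical helpers written for this example):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCpuset expands a cpuset string such as "0-3,8" into a set of core IDs.
func parseCpuset(s string) map[int]bool {
	cores := map[int]bool{}
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		bounds := strings.SplitN(part, "-", 2)
		lo, _ := strconv.Atoi(bounds[0])
		hi := lo
		if len(bounds) == 2 {
			hi, _ = strconv.Atoi(bounds[1])
		}
		for c := lo; c <= hi; c++ {
			cores[c] = true
		}
	}
	return cores
}

// overlap returns the core IDs shared by two cpusets.
func overlap(a, b map[int]bool) []int {
	var shared []int
	for c := range a {
		if b[c] {
			shared = append(shared, c)
		}
	}
	return shared
}

func main() {
	// cpuset.effective_cpus values of the two sidecar containers above.
	a := parseCpuset("4-7")
	b := parseCpuset("4-7")
	fmt.Println(len(overlap(a, b))) // 4 shared cores: the two allocations collide
}
```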
Full output
```console
[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-7
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:8-123
[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7
[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
c9049b1b3f2c   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 15 seconds   6379/tcp   redis-start-side-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
6e06a9ed1631   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 16 seconds   6379/tcp   redis-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-11
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:12-123
[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179/cpuset.effective_cpus  8-11
[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179",
            "CpusetCpus": "8,9,10,11",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
```
Fixes a bug in the BinPackIterator.Next method, where the scheduler only took into
account the cpusets of the tasks in the largest lifecycle. This could result in
overlapping cgroup cpusets. By using Allocation.ReservedCores, the scheduler now
uses the same cpuset view as Partition.Reserve. Also added logging to surface
future regressions without requiring manual inspection of cgroup files.