It was an ordinary day in the life of a Kubernetes Admin working the cluster upgrade lifecycle:
- ✅ New cluster created
- ✅ Control plane accessible
- ✅ Nodes online
- ❌ Monitoring daemon: CrashLoopBackOff
Let's back up for a minute...
The Cluster Update Lifecycle
At MyFitnessPal our software delivery platform of choice is Kubernetes. One of the challenges of maintaining a Kubernetes platform in-house is keeping it up to date. The platform engineering team at MyFitnessPal approaches this challenge by prioritizing "platform modernization" initiatives 1-2 quarters a year. During these initiatives, members of the platform team stand up new, isolated Kubernetes clusters with the latest stable version and begin evaluating the cluster with the relevant upgraded components installed.
During our most recent "modernization" cycle, we encountered a mysterious error in a component we weren't expecting to break during the upgrade:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec /opt/datadog-agent/embedded/bin/system-probe: operation not permitted: unknown
The failing component was our monitoring daemon: the DataDog Agent. The most perplexing part of the failure is that nothing changed in our deployment of the agent between our current clusters and the new, upgraded clusters. In both we were running the same version of the agent, with the same version of the Helm chart and the same chart parameters (values). What gives?
We began troubleshooting the issue by doing exactly what any accomplished platform engineer would do: googling the error message.
Trusty Google to the rescue. It looked like somebody else was having the same problem as us:
Unfortunately, the issue was too new to have any answers for us. The common denominator among those commenting on the issue, however, was Kubernetes upgrades.
So if the version of the agent and the Helm chart did not change between clusters, the problem must have come from a change in the execution environment: something on the cluster nodes. The error seemed to be coming from the container runtime; could something have changed there?
| Component | Current cluster | New cluster |
| --- | --- | --- |
| OS | Ubuntu 20.04 | Ubuntu 20.04 |
Well look at that, a few things did change. We tried to read through the changelogs for each of these components, but no smoking guns stood out. We needed more information than that ambiguous error message was providing. All we had was the feeling of a permissions issue. What methods of permissioning could be at play here? To answer this question, we reviewed how the application was deployed, and discovered two interesting annotations on the pods.
```yaml
spec:
  template:
    metadata:
      annotations:
        container.apparmor.security.beta.kubernetes.io/system-probe: unconfined
        container.seccomp.security.alpha.kubernetes.io/system-probe: localhost/system-probe
```
According to the Kubernetes documentation for the seccomp and AppArmor annotations, they enable cluster administrators to restrict what containers are allowed to do. These annotations disabled AppArmor on the system-probe container but applied a custom seccomp profile to it. It certainly looked like our containers were being restricted; maybe the seccomp profile was at fault. But then again, what changed? These annotations had been present all along in both clusters.
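As an aside, the seccomp annotation form was deprecated once seccomp support went GA in Kubernetes v1.19; a rough sketch of the first-class securityContext equivalent (AppArmor, by contrast, was still annotation-only at the time) would look something like:

```yaml
# Sketch: first-class equivalent of the seccomp annotation above
# (fields available in Kubernetes >= v1.19).
spec:
  template:
    spec:
      containers:
        - name: system-probe
          securityContext:
            seccompProfile:
              type: Localhost
              # path relative to the kubelet's configured seccomp directory
              localhostProfile: system-probe
```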
After some more research, we found this KEP, which announced changes to the seccomp defaults for containers, targeted for Kubernetes v1.25. We obviously had not upgraded to v1.25 yet, but the monitoring section of the KEP had some golden troubleshooting advice that pointed us in the right direction:
Question: What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Answer: A workload is exiting unexpectedly after the feature has been enabled:
- The termination reason is a "permission denied" error.
- The termination is reproducible.
- SCMP_ACT_LOG in the default profile will show seccomp error messages in auditd or syslog.
- There are no other reasons for container termination (like eviction or exhausting resources)
More logs were exactly what we felt we needed to diagnose the issue! We located the ConfigMap containing the custom seccomp profile for the system-probe container, changed the default action from SCMP_ACT_ERRNO (block syscall execution) to SCMP_ACT_LOG (log the violating syscall), and then reviewed /var/log/syslog on a node with a crashing DataDog pod.
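As a sketch, the change to the profile looked roughly like this; the actual allow list belongs to the DataDog chart and the syscall names shown here are placeholders, only the defaultAction field is the point:

```json
{
  "defaultAction": "SCMP_ACT_LOG",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

With SCMP_ACT_LOG as the default, syscalls outside the allow list are permitted but recorded by the kernel audit subsystem instead of being blocked.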
```
containerd: level=info msg="StartContainer for \"789abed7eed7a1318926884ef65f747a4cedd2c5214d79944ff5abe8c7c9f802\""
systemd: Started libcontainer container 789abed7eed7a1318926884ef65f747a4cedd2c5214d79944ff5abe8c7c9f802.
kernel: [1109855.783490] audit: type=1326 audit(1664902715.647:992801): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=3918867 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=269 compat=0 ip=0x4b0c3b code=0x7ffc0000
kernel: [1109855.791402] audit: type=1326 audit(1664902715.655:992802): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=3918867 comm="system-probe" exe="/opt/datadog-agent/embedded/bin/system-probe" sig=0 arch=c000003e syscall=334 compat=0 ip=0x7fede4f871d1 code=0x7ffc0000
kernel: [1109855.792513] audit: type=1326 audit(1664902715.655:992803): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=3918867 comm="system-probe" exe="/opt/datadog-agent/embedded/bin/system-probe" sig=0 arch=c000003e syscall=435 compat=0 ip=0x7fede4d33a3d code=0x7ffc0000
```
This finding revealed two things:
- The system-probe binary itself was issuing two syscalls (334 and 435) that were not allowed by its custom seccomp profile.
- While starting the system-probe container, runc itself was getting denied by the custom seccomp profile when issuing system call 269.
The fact that runc itself was being blocked from executing a syscall during the initial container setup explained our symptoms perfectly: the container could never even launch, so system-probe never got the chance to execute at all. It must be the case that the container runtime itself is restricted by the seccomp profile of the containers it is responsible for launching.
Now that we had the system call numbers, we had to translate them into their actual names. The Linux kernel has a lookup table for system calls, and finding our kernel's version of the table on GitHub allowed us to identify which calls were being blocked:
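The lookup itself is trivial to script. Here's a minimal sketch using a hand-copied excerpt of the x86_64 syscall table (numbers taken from arch/x86/entry/syscalls/syscall_64.tbl in the kernel sources; on a node with the audit tools installed, `ausyscall <num>` does the same job):

```python
# Excerpt of the Linux x86_64 syscall table, copied from
# arch/x86/entry/syscalls/syscall_64.tbl in the kernel sources.
SYSCALL_NAMES = {
    269: "faccessat",
    334: "rseq",
    435: "clone3",
}

def syscall_name(num: int) -> str:
    """Translate an x86_64 syscall number from an audit record into its name."""
    return SYSCALL_NAMES.get(num, f"unknown({num})")

# The three numbers logged by the kernel audit subsystem on our crashing node:
for num in (269, 334, 435):
    print(f"syscall {num} -> {syscall_name(num)}")
```

Note that clone3 (435) is a relatively recent addition; newer runtimes and libc versions issuing it against older seccomp profiles is a common source of exactly this kind of breakage.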
Sure enough, none of these system calls were on the seccomp profile's allow list. Adding them and changing the default action back to SCMP_ACT_ERRNO allowed our container to start!
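Sketched out, the fixed profile would look roughly like the following; the names faccessat, rseq, and clone3 correspond to syscall numbers 269, 334, and 435 in the x86_64 table, and in reality they are merged into the chart's existing (much longer) allow list:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["faccessat", "rseq", "clone3"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```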
Before encountering this problem, our team was not very familiar with seccomp profiles. After being forced to troubleshoot this mysterious error, we gained some valuable knowledge and became that much more familiar with the magic subsystems that work together to make Kubernetes tick.
We also got a heads up about the seccomp changes that are targeted for v1.25. The TL;DR of the change is:
[by default] Every workload created will then get the RuntimeDefault (SeccompProfileTypeRuntimeDefault) as SeccompProfile.type value for the PodSecurityContext as well as the SecurityContext for every container
If [the feature flag is] enabled, the kubelet will use the RuntimeDefault seccomp profile by default, which is defined by the container runtime, instead of using the Unconfined (seccomp disabled) mode. The default profiles aim to provide a strong set of security defaults while preserving the functionality of the workload. It is possible that the default profiles differ between container runtimes and their release versions, for example when comparing those from CRI-O and containerd.
Therefore, if the container runtime's default profile restricts things a workload needs to be able to do, it makes sense to create a custom seccomp profile for that workload. Otherwise, it should be fine to embrace a more secure execution environment for containerized workloads and use the default runtime profile.
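Concretely, opting a workload into the runtime's default profile ahead of the v1.25 behavior change is a one-field addition (the field has been available since seccomp went GA in v1.19):

```yaml
# Explicitly opt a container into the runtime's default seccomp
# profile, rather than waiting for the v1.25 default to change.
securityContext:
  seccompProfile:
    type: RuntimeDefault
```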
See you in Kubernetes v1.25 when we all have to put this troubleshooting knowledge to use!