Readiness Probe

A mechanism that prevents traffic failures while the app is starting up.
- Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup (Kubernetes documentation)
The container can be Running while the app inside it is not yet ready (for example, while it is still booting).
When the probe fails, the Pod is removed from the Service's endpoints, so users are no longer routed to that Pod.

Liveness Probe

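Stops traffic from failing continuously when the app is broken.
Commonly set on Pods for highly sensitive services, where failures must be handled immediately.
- Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted (Kubernetes documentation)
When the probe fails, the kubelet kills the container and restarts it according to the Pod's restartPolicy (the Pod object itself is not deleted and recreated).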

StartUp Probe

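A mechanism that makes the app's startup point explicit.
Commonly set on Pods whose containers are large and take a long time to start.
- Sometimes, you have to deal with legacy applications that might require an additional startup time on their first initialization (Kubernetes documentation)
If the Readiness and Liveness probes start before the app is up, they can only fail.
The Startup Probe delays the other health checks until the app is up: while a Startup Probe is configured and has not yet succeeded, the Liveness and Readiness probes are not run.
When the probe fails, the kubelet kills the container and restarts it according to the Pod's restartPolicy.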
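
To make the gating concrete, here is a minimal sketch of a Startup Probe protecting a slow-starting container. The thresholds are illustrative assumptions, not values from the load test; only the path and port mirror the spec used later in this post.

```yaml
# Sketch only: thresholds are illustrative assumptions.
startupProbe:
  httpGet:
    path: /
    port: 15000
  periodSeconds: 10
  failureThreshold: 30   # startup budget = 30 x 10s = up to 300s to come up
livenessProbe:
  httpGet:
    path: /
    port: 15000
  periodSeconds: 10
  failureThreshold: 3    # only evaluated after the startup probe has succeeded
```

With this shape the app gets a generous window to boot, while the Liveness Probe can stay strict once the app is running.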

DEPLOYMENT SPEC // HPA SPEC (CPU 20%)

livenessProbe:
  failureThreshold: 6
  httpGet:
    path: /
    port: 15000
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10

readinessProbe:
  failureThreshold: 6
  httpGet:
    path: /
    port: 15000
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5

startupProbe:
  failureThreshold: 6
  httpGet:
    path: /
    port: 15000
    scheme: HTTP
  initialDelaySeconds: 90
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
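
The HPA manifest itself is not shown in this post. As a rough sketch, an HPA targeting 20% average CPU utilization could look like the following; the resource name and the replica bounds are hypothetical placeholders, and only the 20% target comes from the heading above.

```yaml
# Sketch only: name and replica bounds are assumptions, not the tested values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # hypothetical Deployment name
  minReplicas: 1               # assumption
  maxReplicas: 10              # assumption
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 20   # the "CPU 20%" target
```

CPU utilization here is measured relative to the containers' CPU requests, so the Deployment needs `resources.requests.cpu` set for this metric to work.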

While CPU has headroom, the service responds normally. Once CPU reaches 100%, the log below appears, which means the Pod can no longer serve requests properly. (Client.Timeout exceeded while awaiting headers)

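Why the 503 errors occur

1. The number of client requests increases.
2. Pod response time increases (a normal reaction to a busy CPU).
3. The response does not arrive within the time configured on the Liveness Probe, so the Pod is treated as failed and restarted.
   The node's CPU still has headroom, but the Pod has to wait until its Readiness Probe succeeds before it receives traffic again.
4. The Pod waits until the Startup Probe succeeds, then returns to Available.
5. As a result, the Pod repeatedly dies and comes back.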
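
To see how quickly a busy Pod gets restarted, the liveness values from the spec above can be read as a time budget. The fragment below only annotates those same values; the timing in the comments is an approximation, since the exact moment depends on when each probe fires.

```yaml
livenessProbe:
  timeoutSeconds: 10     # a response slower than 10s counts as one failed probe
  periodSeconds: 10      # the probe runs every 10s
  failureThreshold: 6    # 6 consecutive failures (roughly 60s of slow responses)
                         # cause the kubelet to restart the container
```

Under sustained 100% CPU, a Pod that is slow but still working crosses this budget and gets restarted, which is what starts the cycle described above.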

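Conclusion

No matter how much the health check and HPA scaling options are tuned, once a Pod's CPU reaches 100% it can no longer serve requests normally, and the Pod is restarted when the Liveness Probe's time limit is exceeded.
A similar case: https://github.com/kubernetes/kubernetes/issues/89898#issuecomment-728908503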

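- When a Pod restarts, its share of the load momentarily shifts to the remaining Pods, and they go down one after another.

- Eventually no Pod is bound to the Service (they drop out in turn, like people carrying a log together), and at that point HTTP 50x errors are returned.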

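- Plan for about 300 users per core (for 100,000 users that is about 333 cores, or roughly 150 to 170 two-core replicas), provision the infrastructure in advance, and set the HPA CPU target slightly below 50% to keep the infrastructure stable.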

- Remove the Liveness and Startup probes and configure only the Readiness Probe, so that Pods are not restarted while the app is running under load.

Adding the Readiness Probe to the Deployment manifest

kind: Deployment
apiVersion: apps/v1
metadata:
  name: {{ .Release.Name }}
#  namespace: vsaidt-math
  labels:
    app: {{ .Release.Name }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: {{ .Release.Name }}
          image: {{ .Values.image }}
          readinessProbe:
            failureThreshold: 6
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          ports:
            - containerPort: 80
              protocol: TCP
          env:
            - name: TZ
              value: Asia/Seoul
            - name: SPRING_PROFILES_ACTIVE
              valueFrom:
                configMapKeyRef:
                  key: SPRING_PROFILES_ACTIVE
                  name: {{ .Release.Name }}-configmap
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
