Cloud-Native Observability Part 5 - Debugging Production Issues with Observability Data

Paul Lee · January 29, 2026 · about 7 min read

Series Introduction

Part 1: OpenTelemetry Instrumentation
Part 2: Distributed Tracing for Microservices
Part 3: Structured Logging and Correlation ID
Part 4: Metrics and Alerting with Prometheus/Grafana
Part 5: Debugging Production Issues with Observability Data (current post)

Debugging Workflow

MELT Approach

Metrics → Events → Logs → Traces

Detect the problem with metrics
Pin down when it started with events and alerts
Fill in the details with logs
Follow the request path with traces

Real-World Incident Scenarios

Scenario 1: Intermittent Timeouts

Symptom: Some order creation requests time out after 30 seconds

Step 1: Check Metrics

# Check for a P99 latency spike
histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket{uri="/api/orders"}[5m])) by (le)
)
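
To make the query above less of a black box, here is a minimal Kotlin sketch of how a quantile can be estimated from cumulative histogram buckets: locate the bucket that contains the target rank, then interpolate linearly inside it. The names `Bucket` and `estimateQuantile` are made up for this illustration, and the sketch covers only the common case (sorted buckets, a final +Inf bucket, non-negative bounds), not every edge case Prometheus handles.

```kotlin
// Cumulative histogram bucket: `count` observations were <= `upperBound` seconds.
data class Bucket(val upperBound: Double, val count: Double)

// Estimates the q-quantile from cumulative buckets.
// Assumes buckets are sorted by upperBound and the last one is the +Inf bucket.
fun estimateQuantile(q: Double, buckets: List<Bucket>): Double {
    require(q in 0.0..1.0) { "q must be within [0, 1]" }
    require(buckets.size >= 2) { "need at least one finite bucket plus +Inf" }
    val total = buckets.last().count
    if (total == 0.0) return Double.NaN

    // Rank of the observation we are looking for, then the first bucket that reaches it.
    val rank = q * total
    val index = buckets.indexOfFirst { it.count >= rank }

    // Rank falls in the +Inf bucket: the largest finite bound is the best available answer.
    if (index == buckets.lastIndex) return buckets[buckets.lastIndex - 1].upperBound

    val lowerBound = if (index == 0) 0.0 else buckets[index - 1].upperBound
    val countBelow = if (index == 0) 0.0 else buckets[index - 1].count
    val inBucket = buckets[index].count - countBelow
    if (inBucket == 0.0) return lowerBound

    // Linear interpolation between the bucket's lower and upper bound.
    return lowerBound + (buckets[index].upperBound - lowerBound) * (rank - countBelow) / inBucket
}
```

Because the result is interpolated, its precision depends on the bucket layout. With buckets at 1s, 10s, and 30s, a P99 that lands in the 10s to 30s bucket can only be placed somewhere inside that range, so a reported value near 30 seconds is a strong hint rather than an exact measurement.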

Confirmed in Grafana: P99 latency spikes to 30 seconds during specific time windows

Step 2: Trace Analysis

Search for slow requests in Jaeger:

service=order-service minDuration=10s

Finding: The inventory.checkStock span takes 29 seconds

Step 3: Check Logs

{service="inventory-service"} | json | latency > 10000

Finding: The database query is slow for specific product IDs

Step 4: Root Cause

-- Inspect the execution plan
EXPLAIN ANALYZE SELECT * FROM inventory WHERE product_id = 'PROD-12345';

Cause: Missing index on product_id

Solution:

CREATE INDEX idx_inventory_product_id ON inventory(product_id);

Scenario 2: Memory Leak

Symptom: The service periodically restarts due to OOM

Step 1: Check Metrics

# Heap memory usage trend
jvm_memory_used_bytes{area="heap",application="order-service"}

Pattern found: Heap usage climbs gradually, then drops sharply at each restart
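
The "gradual climb" can be quantified instead of judged by eye. The Kotlin sketch below fits a least-squares line through heap samples and reports the slope in bytes per second, which is the same idea behind PromQL's `deriv()` and `predict_linear()`. The types and function names (`HeapSample`, `heapGrowthRate`, `secondsUntilFull`) are hypothetical, and the samples are assumed to come from a series such as the heap query above.

```kotlin
// One heap reading: seconds since an arbitrary start point, and used heap in bytes.
data class HeapSample(val timeSeconds: Double, val usedBytes: Double)

// Least-squares slope of used heap over time, in bytes per second.
// A slope that stays positive across several GC cycles hints at a leak.
fun heapGrowthRate(samples: List<HeapSample>): Double {
    require(samples.size >= 2) { "need at least two samples" }
    val meanT = samples.map { it.timeSeconds }.average()
    val meanY = samples.map { it.usedBytes }.average()
    val covariance = samples.sumOf { (it.timeSeconds - meanT) * (it.usedBytes - meanY) }
    val variance = samples.sumOf { (it.timeSeconds - meanT) * (it.timeSeconds - meanT) }
    require(variance > 0.0) { "samples must span more than one timestamp" }
    return covariance / variance
}

// Rough time until the heap limit is reached if the trend continues; null when heap is not growing.
fun secondsUntilFull(samples: List<HeapSample>, maxBytes: Double): Double? {
    val rate = heapGrowthRate(samples)
    if (rate <= 0.0) return null
    return (maxBytes - samples.last().usedBytes) / rate
}
```

A window that spans a restart will mix the climb with the drop, so in practice the samples should be taken from a single uptime period.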

Step 2: Analyze GC Activity

# GC frequency over time
rate(jvm_gc_pause_seconds_count{application="order-service"}[5m])
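
Since the service in this scenario keeps restarting, its counters keep resetting to zero, and it helps to know how a rate calculation copes with that. The following Kotlin sketch is a simplified take on a per-second counter rate that treats any drop in value as a reset. `CounterSample` and `simpleRate` are illustrative names, and unlike PromQL's `rate()` this version does not extrapolate to the edges of the time window.

```kotlin
// One scrape of a monotonically increasing counter such as jvm_gc_pause_seconds_count.
data class CounterSample(val timeSeconds: Double, val value: Double)

// Simplified per-second rate over a list of samples ordered by time.
// A drop in value is treated as a counter reset (for example, a process restart),
// so the value seen after the reset counts as new increase.
fun simpleRate(samples: List<CounterSample>): Double {
    require(samples.size >= 2) { "need at least two samples" }
    var increase = 0.0
    for ((previous, current) in samples.zipWithNext()) {
        increase += if (current.value >= previous.value) {
            current.value - previous.value
        } else {
            current.value
        }
    }
    val elapsed = samples.last().timeSeconds - samples.first().timeSeconds
    require(elapsed > 0.0) { "samples must span a positive time range" }
    return increase / elapsed
}
```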

Finding: Full GC frequency is gradually increasing

Step 3: Analyze Heap Dump

# Generate a heap dump
jmap -dump:format=b,file=heapdump.hprof <pid>

# Analyze with MAT or VisualVM

Finding: OrderCache objects occupy 80% of memory

Step 4: Code Review

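// Problematic code
import java.util.concurrent.ConcurrentHashMap
import org.springframework.stereotype.Component

@Component
class OrderCache {
    private val cache = ConcurrentHashMap<String, Order>()

    fun put(orderId: String, order: Order) {
        cache[orderId] = order  // Entries are never evicted, so the map grows without bound
    }
}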

Solution:

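import com.github.benmanes.caffeine.cache.Caffeine
import java.time.Duration
import org.springframework.stereotype.Component

@Component
class OrderCache {
    // Bounded cache: at most 10,000 entries, each dropped one hour after it was written
    private val cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofHours(1))
        .build<String, Order>()

    fun put(orderId: String, order: Order) = cache.put(orderId, order)
}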
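
To show what a size bound actually does, here is a dependency-free Kotlin sketch of a cache with least-recently-used eviction built on `java.util.LinkedHashMap`. This is not how Caffeine works internally and it has no time-based expiry; `BoundedCache` is a name invented for this example, meant only to illustrate why a capped cache cannot grow into an OOM the way the unbounded map did.

```kotlin
// Size-bounded cache with least-recently-used eviction.
// Illustration only: no time-based expiry, and coarse locking via @Synchronized.
class BoundedCache<K, V>(private val maxEntries: Int) {
    init {
        require(maxEntries > 0) { "maxEntries must be positive" }
    }

    // accessOrder = true moves an entry to the end on every read,
    // so the eldest entry is always the least recently used one.
    private val entries = object : LinkedHashMap<K, V>(16, 0.75f, true) {
        override fun removeEldestEntry(eldest: MutableMap.MutableEntry<K, V>?): Boolean =
            size > maxEntries
    }

    @Synchronized
    fun put(key: K, value: V) {
        entries[key] = value
    }

    @Synchronized
    fun get(key: K): V? = entries[key]

    @Synchronized
    fun size(): Int = entries.size
}
```

With a capacity of two, inserting a third key evicts whichever of the first two was touched least recently, so memory use stays proportional to `maxEntries` no matter how many orders pass through.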

Scenario 3: Cascading Failure Between Services

Symptom: A Payment Service failure leads to a complete system outage

Step 1: Check Dependency Graph

Confirmed in Jaeger Service Map:

Order Service → Payment Service (synchronous call)
When Payment Service fails, Order Service threads block

Step 2: Check Metrics

# Connection pool exhaustion
hikaricp_connections_active{application="order-service"}
hikaricp_connections_pending{application="order-service"}

Finding: While calls to Payment Service are timing out, all connections are tied up and requests wait for the pool

Step 3: Check Logs

{service="order-service"} |= "Connection pool exhausted"

Solution: Apply the Circuit Breaker pattern

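import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JCircuitBreakerFactory
import org.springframework.stereotype.Service

@Service
class PaymentClient(
    private val circuitBreakerFactory: Resilience4JCircuitBreakerFactory,
    // Assumed collaborators: the API client and the retry queue used below
    private val paymentApi: PaymentApi,
    private val paymentQueue: PaymentQueue
) {
    private val circuitBreaker = circuitBreakerFactory.create("payment")

    fun processPayment(order: Order): PaymentResult {
        return circuitBreaker.run(
            { paymentApi.charge(order) },
            { _ -> handleFallback(order) }  // Receives the Throwable when the call fails or the circuit is open
        )
    }

    private fun handleFallback(order: Order): PaymentResult {
        // Queue the payment and process it later
        paymentQueue.add(order)
        return PaymentResult.PENDING
    }
}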
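
For readers who have not seen what happens inside a circuit breaker, the Kotlin sketch below implements the basic state machine by hand. It is a deliberately simplified stand-in, not Resilience4j's implementation: it counts consecutive failures rather than using a sliding window, it is not thread-safe, and `SimpleCircuitBreaker` with its parameters is hypothetical.

```kotlin
// Minimal circuit breaker:
// CLOSED -> OPEN after `failureThreshold` consecutive failures,
// OPEN -> HALF_OPEN once `openMillis` has elapsed,
// HALF_OPEN -> CLOSED on success, or back to OPEN on failure.
// Not thread-safe; a sketch of the state machine only.
class SimpleCircuitBreaker(
    private val failureThreshold: Int,
    private val openMillis: Long,
    private val clock: () -> Long = System::currentTimeMillis
) {
    enum class State { CLOSED, OPEN, HALF_OPEN }

    var state = State.CLOSED
        private set
    private var consecutiveFailures = 0
    private var openedAt = 0L

    fun <T> run(call: () -> T, fallback: (Throwable) -> T): T {
        if (state == State.OPEN) {
            if (clock() - openedAt < openMillis) {
                // Fail fast without touching the struggling dependency.
                return fallback(IllegalStateException("circuit open"))
            }
            // Wait period is over: let one trial call through.
            state = State.HALF_OPEN
        }
        return try {
            val result = call()
            consecutiveFailures = 0
            state = State.CLOSED
            result
        } catch (e: Exception) {
            consecutiveFailures++
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN
                openedAt = clock()
            }
            fallback(e)
        }
    }
}
```

The fail-fast branch is what breaks the cascade in this scenario: while the circuit is open, Order Service threads return the fallback immediately instead of blocking until the Payment Service call times out.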

Debugging Toolkit

1. Distributed Trace Search Queries

# Slow requests
service=order-service minDuration=1s

# Failed requests
service=order-service error=true

# A specific customer
service=order-service tag.customer.id=CUST-123

2. Useful PromQL Queries

# Services with the highest rate of 5xx responses
topk(5,
  sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
)

# Endpoints with the highest P99 latency
topk(5,
  histogram_quantile(0.99,
    sum(rate(http_server_requests_seconds_bucket[5m])) by (uri, le)
  )
)

# Services with the highest heap utilization
topk(5,
  jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}
)

3. Useful LogQL Queries

# Error log counts by error type
sum by (errorType) (
  count_over_time({service="order-service"} | json | level="ERROR" [1h])
)

# All logs for a specific traceId
{service=~".+"} | json | traceId="abc123"

# Slow query logs
{service=~".+"} | json | queryTime > 1000

On-Call Playbook

When a Service Is Down

Immediate Check
- Check the up{job="spring-boot-apps"} metric
- Check Pod status: kubectl get pods
Check Recent Changes
- Recent deployment history
- Configuration changes
Check Logs
- Check for errors in startup logs
- Check for OOM
Rollback Decision
- Roll back to the previous version if quick recovery is needed

When Performance Degrades

Assess Impact Scope
- Entire service? Specific endpoints?
Identify Bottleneck
- Check slow spans via traces
- External dependency issues?
Check Resources
- CPU, memory, disk I/O
- Connection pool status
Temporary Measures
- Scale out
- Apply rate limiting

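Postmortem Template

# Incident Report: [Title]

## Overview
- Time of occurrence: YYYY-MM-DD HH:MM ~ HH:MM (KST)
- Impact scope: [service name, number of users]
- Severity: [Critical/High/Medium/Low]

## Timeline
- HH:MM - First alert fired
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Recovery confirmed

## Root Cause
[Detailed description]

## Resolution
[Actions taken]

## Impact
- Error rate: X%
- Affected requests: N

## Lessons Learned
### What went well
-

### What to improve
-

## Action Items
- [ ] [Owner] Action description (due date)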

Series Wrap-up

What this series covered:

Part	Topic	Key Point
1	OpenTelemetry	Instrumentation basics
2	Distributed Tracing	Visualizing request flows
3	Structured Logging	Searchable logs
4	Metrics/Alerting	Proactive monitoring
5	Debugging	Real-world problem solving

Observability is not just monitoring. It is the ability to understand your system, prevent problems, and resolve issues quickly.
