
Cloud Native Observability Part 4 - Metrics and Alerting with Prometheus/Grafana


Series Introduction

  1. Part 1: OpenTelemetry Instrumentation
  2. Part 2: Distributed Tracing for Microservices
  3. Part 3: Structured Logging and Correlation ID
  4. Part 4: Metrics and Alerting with Prometheus/Grafana (Current Post)
  5. Part 5: Debugging Production Issues with Observability Data


The Importance of Metrics

Metrics show the health status of your system in numbers:

  • Request Throughput
  • Response Latency
  • Error Rate
  • Resource Usage (CPU, Memory)


Spring Boot + Micrometer Setup

Adding Dependencies

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("io.micrometer:micrometer-registry-prometheus")
}
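If your build uses Maven instead of Gradle, the equivalent dependencies are below (versions are managed by the Spring Boot BOM, so none need to be pinned here):

```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```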


Application Configuration

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      show-details: always
  metrics:
    tags:
      application: order-service
      environment: production
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
        http.server.requests: 100ms,500ms,1000ms


Built-in Metrics

HTTP Request Metrics

http_server_requests_seconds_count{method="POST",uri="/api/orders",status="200"}
http_server_requests_seconds_sum{method="POST",uri="/api/orders",status="200"}
http_server_requests_seconds_bucket{method="POST",uri="/api/orders",status="200",le="0.1"}
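Because `_sum` and `_count` are cumulative counters, dividing their rates gives the average request duration over the window, which is a useful complement to the percentile queries shown later:

```promql
# Average request duration over the last 5 minutes (in seconds)
rate(http_server_requests_seconds_sum{uri="/api/orders"}[5m])
/
rate(http_server_requests_seconds_count{uri="/api/orders"}[5m])
```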


JVM Metrics

jvm_memory_used_bytes{area="heap",id="G1 Eden Space"}
jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause"}
jvm_threads_live_threads


Implementing Custom Metrics

Counter

@Service
class OrderMetrics(private val meterRegistry: MeterRegistry) {

    private val ordersCreated = Counter.builder("orders.created")
        .description("Total number of orders created")
        .tag("service", "order-service")
        .register(meterRegistry)

    fun recordOrderCreated() {
        ordersCreated.increment()
    }

    fun recordOrderFailed(reason: String) {
        // Registered lazily, once per distinct reason tag; register() returns
        // the existing counter when the same name and tags already exist.
        // Note: in the Prometheus registry a metric name must keep the same
        // tag keys, so avoid also registering orders.failed without a reason tag.
        Counter.builder("orders.failed")
            .description("Total number of failed orders")
            .tag("service", "order-service")
            .tag("reason", reason)
            .register(meterRegistry)
            .increment()
    }
}
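Keep in mind that the Prometheus registry rewrites meter names: dots become underscores and counters get a `_total` suffix, so `orders.created` is exposed as `orders_created_total`. Typical queries against these counters look like:

```promql
# Orders created per second, averaged over 5 minutes
rate(orders_created_total{service="order-service"}[5m])

# Failure rate broken down by reason tag
sum(rate(orders_failed_total{service="order-service"}[5m])) by (reason)
```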


Gauge

@Component
class QueueMetrics(
    meterRegistry: MeterRegistry,
    // Held as a constructor property on purpose: the gauge keeps only a weak
    // reference to orderQueue, so something else must keep it alive or the
    // gauge will report NaN after garbage collection.
    private val orderQueue: OrderQueue
) {
    init {
        Gauge.builder("order.queue.size", orderQueue) { queue ->
            queue.size().toDouble()
        }
            .description("Current size of order processing queue")
            .register(meterRegistry)
    }
}


Timer

@Service
class PaymentService(
    private val meterRegistry: MeterRegistry,
    private val paymentGateway: PaymentGateway
) {

    private val paymentTimer = Timer.builder("payment.processing.time")
        .description("Time taken to process payments")
        .publishPercentiles(0.5, 0.95, 0.99)
        .register(meterRegistry)

    fun processPayment(order: Order): PaymentResult {
        // recordCallable times the lambda and returns its (nullable) result
        return paymentTimer.recordCallable {
            // Payment processing logic
            paymentGateway.charge(order.customerId, order.totalAmount)
        }!!
    }
}

Distribution Summary

@Service
class OrderAnalytics(private val meterRegistry: MeterRegistry) {

    private val orderAmountSummary = DistributionSummary.builder("order.amount")
        .description("Distribution of order amounts")
        .baseUnit("KRW")
        .publishPercentiles(0.5, 0.75, 0.95)
        .register(meterRegistry)

    fun recordOrderAmount(amount: BigDecimal) {
        orderAmountSummary.record(amount.toDouble())
    }
}


Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - 'order-service:8080'
          - 'payment-service:8081'
          - 'inventory-service:8082'

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
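As written, this configuration scrapes metrics but never loads the alert rules or talks to Alertmanager. Assuming the alert-rules.yml file and the Alertmanager service defined later in this post, the two missing sections look like this:

```yaml
# Additions to prometheus.yml: load alert rules and point at Alertmanager
rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```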


Grafana Dashboard

RED Method Dashboard

Rate, Errors, Duration - Service Perspective:

# Request Rate
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Error Rate
sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Duration (P99)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le))


USE Method Dashboard

Utilization, Saturation, Errors - Resource Perspective:

# CPU Utilization
system_cpu_usage{application="order-service"}

# Memory Utilization
jvm_memory_used_bytes{application="order-service",area="heap"}
/
jvm_memory_max_bytes{application="order-service",area="heap"}

# Thread Pool Saturation
hikaricp_connections_pending{application="order-service"}


SLI/SLO Definition

Service Level Indicators

# SLI Definition
slis:
  - name: availability
    query: |
      sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
      /
      sum(rate(http_server_requests_seconds_count[5m]))

  - name: latency_p99
    query: |
      histogram_quantile(0.99,
        sum(rate(http_server_requests_seconds_bucket[5m])) by (le)
      )

  - name: error_rate
    query: |
      sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
      /
      sum(rate(http_server_requests_seconds_count[5m]))

Service Level Objectives

slos:
  - name: availability
    target: 99.9%
    window: 30d

  - name: latency_p99
    target: 500ms
    window: 30d

  - name: error_rate
    target: 0.1%
    window: 30d
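A useful sanity check on an availability target is the error budget it implies. For 99.9% over a 30-day window:

```latex
\text{error budget} = (1 - 0.999) \times 30 \times 24 \times 60\ \text{min}
                    = 0.001 \times 43{,}200\ \text{min}
                    \approx 43.2\ \text{min of downtime per month}
```

In other words, the HighErrorRate alert below has at most about 43 minutes of budget to protect each month.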


Alert Configuration

Alertmanager Rules

# alert-rules.yml
groups:
  - name: order-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value }}s"

      - alert: PodDown
        expr: up{job="spring-boot-apps"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"


Slack Alert Configuration

# alertmanager.yml
route:
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  # Default receiver referenced by the route above; Alertmanager rejects a
  # configuration whose default receiver is undefined (channel is an example)
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true
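One thing this file omits is the Slack incoming-webhook URL; without it, `slack_configs` cannot deliver anything. It can be set once globally (the URL below is a placeholder for your own webhook):

```yaml
# Addition to alertmanager.yml: Slack webhook URL shared by all receivers
global:
  slack_api_url: 'https://hooks.slack.com/services/...'
```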


Complete Docker Compose Configuration

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources

  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
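The scrape targets in prometheus.yml (order-service:8080 and friends) assume the application services run on the same Docker network. If you run them from the same Compose file, one service entry might look like this (the image name is a placeholder for your own build):

```yaml
  order-service:
    image: example/order-service:latest  # placeholder image name
    ports:
      - "8080:8080"
```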


Summary

Key points of metrics and alerting:

Item         Description
Micrometer   Spring Boot metrics abstraction
RED Method   Rate, Errors, Duration - service perspective
USE Method   Utilization, Saturation, Errors - resource perspective
SLI/SLO      Service quality objective definition
Alerting     Threshold-based automatic alerts

In the next post, we will cover debugging production issues using Observability data.