
Monitoring and Alerting Documentation

Overview

This document describes the monitoring system, alerting mechanisms, and operational monitoring strategy of the Jingyun Service Center backend, ensuring observability and fast fault response.

Monitoring Architecture

Monitoring System Architecture

Monitoring Stack

  • Metrics: Prometheus + Grafana
  • Logging: Loki + Grafana
  • Tracing: Jaeger + OpenTelemetry
  • Alerting: AlertManager + Webhook
  • Service discovery: Consul + Prometheus
  • Health checks: Kratos built-in health checks

Metrics Monitoring

Core Metric Definitions

1. Business Metrics

User activity metrics
# User registration counter
user_registrations_total:
  type: counter
  description: "Total user registrations"
  labels: [tenant_id, source]

# Daily active users
daily_active_users:
  type: gauge
  description: "Number of daily active users"
  labels: [tenant_id]

# Monthly active users
monthly_active_users:
  type: gauge
  description: "Number of monthly active users"
  labels: [tenant_id]
Agent usage metrics
# Agent creation counter
agents_created_total:
  type: counter
  description: "Total agents created"
  labels: [tenant_id, agent_type, platform_type]

# Agent usage counter
agent_usage_total:
  type: counter
  description: "Total agent usage"
  labels: [tenant_id, agent_id, usage_type]

# Agent request counter
agent_requests_total:
  type: counter
  description: "Total agent requests"
  labels: [tenant_id, agent_id, status]
Payment transaction metrics
# Order creation counter
orders_created_total:
  type: counter
  description: "Total orders created"
  labels: [tenant_id, order_type, payment_method]

# Successful payment counter
payments_success_total:
  type: counter
  description: "Total successful payments"
  labels: [tenant_id, payment_method, amount_range]

# Failed payment counter
payments_failed_total:
  type: counter
  description: "Total failed payments"
  labels: [tenant_id, payment_method, error_type]
Points consumption metrics
# Points granted counter
points_granted_total:
  type: counter
  description: "Total points granted"
  labels: [tenant_id, source, amount_range]

# Points consumed counter
points_consumed_total:
  type: counter
  description: "Total points consumed"
  labels: [tenant_id, usage_type, amount_range]

# Points balance gauge
points_balance:
  type: gauge
  description: "User points balance"
  labels: [tenant_id, user_id]

2. System Metrics

HTTP service metrics
# Total HTTP requests
http_requests_total:
  type: counter
  description: "Total HTTP requests"
  labels: [method, endpoint, status_code, tenant_id]

# HTTP request latency
http_request_duration_seconds:
  type: histogram
  description: "HTTP request latency distribution"
  labels: [method, endpoint, tenant_id]
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# HTTP request size
http_request_size_bytes:
  type: histogram
  description: "HTTP request size distribution"
  labels: [method, endpoint]
  buckets: [100, 1000, 10000, 100000, 1000000]

# HTTP response size
http_response_size_bytes:
  type: histogram
  description: "HTTP response size distribution"
  labels: [method, endpoint]
  buckets: [100, 1000, 10000, 100000, 1000000]
gRPC service metrics
# Total gRPC requests
grpc_requests_total:
  type: counter
  description: "Total gRPC requests"
  labels: [method, status_code, tenant_id]

# gRPC request latency
grpc_request_duration_seconds:
  type: histogram
  description: "gRPC request latency distribution"
  labels: [method, tenant_id]
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# gRPC message size
grpc_message_size_bytes:
  type: histogram
  description: "gRPC message size distribution"
  labels: [method, message_type]
  buckets: [100, 1000, 10000, 100000, 1000000]
Database metrics
# Connection pool metrics
db_connections_active:
  type: gauge
  description: "Active database connections"
  labels: [database, service]

db_connections_idle:
  type: gauge
  description: "Idle database connections"
  labels: [database, service]

db_connections_max:
  type: gauge
  description: "Maximum database connections"
  labels: [database, service]

# Query metrics
db_query_duration_seconds:
  type: histogram
  description: "Database query latency distribution"
  labels: [database, operation, table]
  buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5]

# Error metrics
db_errors_total:
  type: counter
  description: "Total database errors"
  labels: [database, operation, error_type]
Cache metrics
# Redis connection metrics
redis_connections_active:
  type: gauge
  description: "Active Redis connections"
  labels: [service]

# Redis command metrics
redis_commands_total:
  type: counter
  description: "Total Redis commands executed"
  labels: [service, command, status]

# Redis latency metrics
redis_command_duration_seconds:
  type: histogram
  description: "Redis command latency distribution"
  labels: [service, command]
  buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1]

# Cache hit ratio
redis_cache_hit_ratio:
  type: gauge
  description: "Redis cache hit ratio"
  labels: [service, cache_type]
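The redis_cache_hit_ratio gauge above is derived from hit and miss counts rather than observed directly. A minimal sketch of that derivation (the hitRatio helper is illustrative, not part of the codebase):

```go
package main

import "fmt"

// hitRatio computes hits / (hits + misses), returning 0 when there is no
// traffic so the gauge never reports NaN.
func hitRatio(hits, misses float64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return hits / total
}

func main() {
	fmt.Println(hitRatio(90, 10)) // 0.9
	fmt.Println(hitRatio(0, 0))   // 0
}
```

The returned value would be written to the gauge periodically, e.g. from the same loop that samples the Redis INFO stats.
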
Message queue metrics
# RabbitMQ queue metrics
rabbitmq_queue_messages:
  type: gauge
  description: "Messages in RabbitMQ queue"
  labels: [queue, vhost]

rabbitmq_queue_messages_ready:
  type: gauge
  description: "Ready messages in RabbitMQ queue"
  labels: [queue, vhost]

rabbitmq_queue_messages_unacknowledged:
  type: gauge
  description: "Unacknowledged messages in RabbitMQ queue"
  labels: [queue, vhost]

# Publish/consume metrics
rabbitmq_messages_published_total:
  type: counter
  description: "Total RabbitMQ messages published"
  labels: [exchange, routing_key]

rabbitmq_messages_consumed_total:
  type: counter
  description: "Total RabbitMQ messages consumed"
  labels: [queue, consumer]

3. Infrastructure Metrics

System resource metrics
# CPU usage
cpu_usage_percent:
  type: gauge
  description: "CPU usage percentage"
  labels: [instance, cpu]

# Memory usage
memory_usage_percent:
  type: gauge
  description: "Memory usage percentage"
  labels: [instance]

# Disk usage
disk_usage_percent:
  type: gauge
  description: "Disk usage percentage"
  labels: [instance, device]

# Network traffic
network_bytes_total:
  type: counter
  description: "Total network bytes"
  labels: [instance, device, direction]
Container metrics
# Container CPU time
container_cpu_usage_seconds_total:
  type: counter
  description: "Container CPU time consumed"
  labels: [container, pod, namespace]

# Container memory usage
container_memory_usage_bytes:
  type: gauge
  description: "Container memory usage"
  labels: [container, pod, namespace]

# Container network traffic
container_network_bytes_total:
  type: counter
  description: "Container network traffic"
  labels: [container, pod, namespace, interface, direction]

Metrics Collection Implementation

1. Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Service discovery via Consul
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,metrics,.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_node]
        target_label: node

  # Application metrics
  - job_name: 'jingyun-services'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - jingyun
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Infrastructure exporters
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'rabbitmq-exporter'
    static_configs:
      - targets: ['rabbitmq-exporter:9419']

2. Application Metrics Implementation

// metrics/metrics.go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Total HTTP requests
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status_code", "tenant_id"},
	)

	// HTTP request latency
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency distribution",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint", "tenant_id"},
	)

	// Database query latency
	dbQueryDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "db_query_duration_seconds",
			Help:    "Database query latency distribution",
			Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
		},
		[]string{"database", "operation", "table"},
	)

	// Business metrics
	userRegistrationsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "user_registrations_total",
			Help: "Total number of user registrations",
		},
		[]string{"tenant_id", "source"},
	)

	agentUsageTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "agent_usage_total",
			Help: "Total number of agent usages",
		},
		[]string{"tenant_id", "agent_id", "usage_type"},
	)
)

// Metric recording helpers
func RecordHTTPRequest(method, endpoint, statusCode, tenantID string) {
	httpRequestsTotal.WithLabelValues(method, endpoint, statusCode, tenantID).Inc()
}

func RecordHTTPRequestDuration(method, endpoint, tenantID string, duration float64) {
	httpRequestDuration.WithLabelValues(method, endpoint, tenantID).Observe(duration)
}

func RecordDBQuery(database, operation, table string, duration float64) {
	dbQueryDuration.WithLabelValues(database, operation, table).Observe(duration)
}

func RecordUserRegistration(tenantID, source string) {
	userRegistrationsTotal.WithLabelValues(tenantID, source).Inc()
}

func RecordAgentUsage(tenantID, agentID, usageType string) {
	agentUsageTotal.WithLabelValues(tenantID, agentID, usageType).Inc()
}

3. Middleware Integration

// middleware/metrics.go
package middleware

import (
	"context"
	"time"

	"github.com/go-kratos/kratos/v2/middleware"
	"github.com/go-kratos/kratos/v2/transport"

	"your-project/metrics"
)

func Metrics() middleware.Middleware {
	return func(handler middleware.Handler) middleware.Handler {
		// Named return values let the deferred function observe the final error.
		return func(ctx context.Context, req interface{}) (reply interface{}, err error) {
			start := time.Now()

			// Extract request information from the transport context
			if tr, ok := transport.FromServerContext(ctx); ok {
				method := tr.Operation()
				endpoint := extractEndpoint(tr)
				tenantID := extractTenantID(ctx)

				defer func() {
					duration := time.Since(start).Seconds()
					metrics.RecordHTTPRequestDuration(method, endpoint, tenantID, duration)

					// Record the request outcome
					if err != nil {
						metrics.RecordHTTPRequest(method, endpoint, extractStatusCode(err), tenantID)
					} else {
						metrics.RecordHTTPRequest(method, endpoint, "200", tenantID)
					}
				}()
			}

			return handler(ctx, req)
		}
	}
}

Log Monitoring

Log Architecture Design

1. Log Collection Architecture

2. Log Format Specification

{
  "timestamp": "2025-12-27T10:30:00.000Z",
  "level": "info",
  "service": "user-service",
  "method": "CreateUser",
  "request_id": "req_123456789",
  "user_id": 12345,
  "tenant_id": 1,
  "ip_address": "192.168.1.100",
  "user_agent": "Mozilla/5.0...",
  "message": "User created successfully",
  "duration_ms": 150,
  "metadata": {
    "phone": "138****1234",
    "source": "wechat"
  },
  "error": null,
  "stack_trace": null
}

3. Log Level Definitions

  • error: errors and exceptions that need immediate attention
  • warn: warnings that may affect system functionality
  • info: significant business operations
  • debug: debugging information, used only in development environments
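In Kratos, a minimum level like this is typically enforced with log.NewFilter and log.FilterLevel; a dependency-free sketch of the underlying rank comparison (shouldLog and levelRank are illustrative names):

```go
package main

import "fmt"

// levelRank orders the four levels defined above.
var levelRank = map[string]int{"debug": 0, "info": 1, "warn": 2, "error": 3}

// shouldLog reports whether a record at level `record` passes a logger
// configured with minimum level `min`; unknown levels are always kept.
func shouldLog(min, record string) bool {
	minRank, okMin := levelRank[min]
	recRank, okRec := levelRank[record]
	if !okMin || !okRec {
		return true
	}
	return recRank >= minRank
}

func main() {
	fmt.Println(shouldLog("info", "debug")) // false
	fmt.Println(shouldLog("info", "error")) // true
}
```
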

Log Collection Configuration

1. Fluentd Configuration

# fluentd.conf
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<filter kubernetes.**>
  @type record_transformer
  <record>
    hostname ${hostname}
    environment "#{ENV['ENVIRONMENT']}"
  </record>
</filter>

<match kubernetes.var.log.containers.**jingyun**.log>
  @type loki
  url "http://loki:3100/loki/api/v1/push"
  line_format json
  # Promote these record keys to Loki stream labels
  <label>
    service
    environment
    level
  </label>
</match>

2. Application Logging Configuration

// logger/logger.go
package logger

import (
	"context"
	"os"
	"runtime/debug"

	"github.com/go-kratos/kratos/v2/log"
	"github.com/go-kratos/kratos/v2/middleware/tracing"
)

func NewLogger(level string) log.Logger {
	logger := log.With(
		log.NewStdLogger(os.Stdout),
		"ts", log.DefaultTimestamp,
		"caller", log.DefaultCaller,
		"service", "user-service",
		"version", "v1.0.0",
		"trace_id", tracing.TraceID(),
		"span_id", tracing.SpanID(),
	)
	// Apply the configured minimum level instead of ignoring the parameter.
	return log.NewFilter(logger, log.FilterLevel(log.ParseLevel(level)))
}

// Structured logging helpers
func LogUserCreated(ctx context.Context, userID int64, tenantID int64, phone string, source string) {
	log.NewHelper(log.FromContext(ctx)).Infow(
		"msg", "user created",
		"user_id", userID,
		"tenant_id", tenantID,
		"phone", maskPhone(phone),
		"source", source,
	)
}

func LogError(ctx context.Context, err error, operation string) {
	log.NewHelper(log.FromContext(ctx)).Errorw(
		"msg", "operation failed",
		"operation", operation,
		"error", err.Error(),
		"stack_trace", string(debug.Stack()),
	)
}
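maskPhone is referenced above but not defined in this document; a minimal sketch matching the 138****1234 shape used in the log format example:

```go
package main

import "fmt"

// maskPhone hides the middle four digits of an 11-digit phone number,
// e.g. "13812341234" -> "138****1234"; other lengths are returned unchanged.
func maskPhone(phone string) string {
	if len(phone) != 11 {
		return phone
	}
	return phone[:3] + "****" + phone[7:]
}

func main() {
	fmt.Println(maskPhone("13812341234")) // 138****1234
}
```
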

Log Query and Analysis

1. Grafana Log Queries (LogQL)

# Error logs mentioning the user service
{level="error"} |= "user service"

# All logs from a specific service
{service="user-service"}

# Slow requests (duration_ms is a JSON field, not a stream label, so it must be extracted first)
{service="user-service"} | json | duration_ms > 1000

# Operations by a specific user
{service="user-service"} | json | user_id = "12345"

# Logs containing a specific date string (the time range itself is set in the Grafana time picker)
{service="user-service"} |= "2025-12-27"

2. Log Aggregation and Analysis

# Count error log lines over 5 minutes
count_over_time({level="error"}[5m])

# Request volume per service over the last hour
sum by (service) (count_over_time({service!=""}[1h]))

# 95th percentile response time (PromQL, from metrics)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate (PromQL, from metrics)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Distributed Tracing

Tracing Architecture

1. Architecture Overview

The tracing architecture is layered as follows:

Application layer:

  • Gateway
  • Auth Service
  • User Service
  • Agent Service

Trace collection:

  • OpenTelemetry Collector (OTEL)
  • Jaeger Collector

Storage layer:

  • Jaeger Storage (JAEGERDB)

Query layer:

  • Jaeger UI (JAEGERUI)

Data flow: application services → OpenTelemetry Collector → Jaeger Collector → Jaeger Storage → Jaeger UI

2. OpenTelemetry Configuration

// tracing/tracing.go
package tracing

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func InitTracer(serviceName string) (*trace.TracerProvider, error) {
	// Create the Jaeger exporter
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint())
	if err != nil {
		return nil, err
	}

	// Create the tracer provider
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(serviceName),
			semconv.ServiceVersionKey.String("v1.0.0"),
		)),
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}

3. Tracing Middleware

// middleware/tracing.go
package middleware

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"

	"github.com/go-kratos/kratos/v2/middleware"
	"github.com/go-kratos/kratos/v2/transport"
)

func Tracing(serviceName string) middleware.Middleware {
	tracer := otel.Tracer(serviceName)

	return func(handler middleware.Handler) middleware.Handler {
		return func(ctx context.Context, req interface{}) (interface{}, error) {
			if tr, ok := transport.FromServerContext(ctx); ok {
				spanName := tr.Operation()
				ctx, span := tracer.Start(ctx, spanName, trace.WithAttributes(
					attribute.String("service", serviceName),
					attribute.String("operation", spanName),
				))
				defer span.End()

				return handler(ctx, req)
			}
			return handler(ctx, req)
		}
	}
}

Alert Configuration

Alert Rule Definitions

1. System-Level Alert Rules

# alert_rules.yml
groups:
  - name: system.rules
    rules:
      # Service availability
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.service }} is down"
          description: "Service {{ $labels.service }} has been down for more than 1 minute."

      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}."

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "95th percentile latency is {{ $value }}s for service {{ $labels.service }}."

      # High CPU usage
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on instance {{ $labels.instance }}."

      # High memory usage
      - alert: HighMemoryUsage
        expr: memory_usage_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on instance {{ $labels.instance }}."

      # High disk usage
      - alert: HighDiskUsage
        expr: disk_usage_percent > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}% on instance {{ $labels.instance }}."

2. Business-Level Alert Rules

  - name: business.rules
    rules:
      # Abnormally low user registrations
      - alert: UserRegistrationAnomaly
        expr: rate(user_registrations_total[5m]) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low user registration rate"
          description: "User registration rate is {{ $value }} per second, which is unusually low."

      # High payment failure rate (failures over all attempts, not failures over successes)
      - alert: HighPaymentFailureRate
        expr: rate(payments_failed_total[5m]) / (rate(payments_success_total[5m]) + rate(payments_failed_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High payment failure rate"
          description: "Payment failure rate is {{ $value | humanizePercentage }}."

      # No agent usage
      - alert: AgentUsageAnomaly
        expr: rate(agent_usage_total[5m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No agent usage detected"
          description: "No agent usage has been detected in the last 15 minutes."

      # Abnormal points balance
      - alert: PointsBalanceAnomaly
        expr: sum(points_balance) < 1000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Low total points balance"
          description: "Total points balance is {{ $value }}, which is unusually low."

3. Database Alert Rules

  - name: database.rules
    rules:
      # Connection pool saturation
      - alert: HighDatabaseConnections
        expr: db_connections_active / db_connections_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database connection usage"
          description: "Database connection usage is {{ $value | humanizePercentage }}."

      # Query latency
      - alert: HighDatabaseLatency
        expr: histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database query latency"
          description: "95th percentile database query latency is {{ $value }}s."

      # Error rate (the histogram's _count series serves as the query total,
      # since no db_queries_total counter is defined above)
      - alert: HighDatabaseErrorRate
        expr: rate(db_errors_total[5m]) / rate(db_query_duration_seconds_count[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High database error rate"
          description: "Database error rate is {{ $value | humanizePercentage }}."

Alert Management Configuration

1. AlertManager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@jingyun.design'
  smtp_auth_username: 'alerts@jingyun.design'
  smtp_auth_password: 'smtp_password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://alert-webhook:8080/webhook'
        send_resolved: true

  - name: 'critical-alerts'
    email_configs:
      - to: 'ops-team@jingyun.design'
        headers:
          Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
    webhook_configs:
      - url: 'http://alert-webhook:8080/critical'
        send_resolved: true

  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@jingyun.design'
        headers:
          Subject: '[WARNING] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
    webhook_configs:
      - url: 'http://alert-webhook:8080/warning'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

2. Webhook Alert Handling

// webhook/handler.go
package webhook

import (
	"encoding/json"
	"net/http"
	"time"
)

type Alert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

type WebhookPayload struct {
	Receiver string  `json:"receiver"`
	Status   string  `json:"status"`
	Alerts   []Alert `json:"alerts"`
}

func HandleAlertWebhook(w http.ResponseWriter, r *http.Request) {
	var payload WebhookPayload
	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	for _, alert := range payload.Alerts {
		// Dispatch each alert
		processAlert(alert)
	}

	w.WriteHeader(http.StatusOK)
}

func processAlert(alert Alert) {
	severity := alert.Labels["severity"]
	service := alert.Labels["service"]
	summary := alert.Annotations["summary"]

	// Route to the appropriate notification channel
	switch severity {
	case "critical":
		sendCriticalAlert(service, summary)
	case "warning":
		sendWarningAlert(service, summary)
	default:
		sendInfoAlert(service, summary)
	}
}

Visualization Dashboards

Grafana Dashboard Configuration

1. System Overview Dashboard

{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "Service Status",
        "type": "stat",
        "targets": [
          {
            "expr": "up",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "95th percentile - {{ service }}"
          }
        ]
      }
    ]
  }
}

2. Business Metrics Dashboard

{
  "dashboard": {
    "title": "Business Metrics",
    "panels": [
      {
        "title": "User Registration Trend",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(user_registrations_total[5m])) by (tenant_id)",
            "legendFormat": "Tenant {{ tenant_id }}"
          }
        ]
      },
      {
        "title": "Agent Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(agent_usage_total[5m])) by (tenant_id)",
            "legendFormat": "Tenant {{ tenant_id }}"
          }
        ]
      },
      {
        "title": "Payment Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(payments_success_total[5m])) / (sum(rate(payments_success_total[5m])) + sum(rate(payments_failed_total[5m])))",
            "legendFormat": "Payment success rate"
          }
        ]
      },
      {
        "title": "Points Consumption Trend",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(points_consumed_total[5m])) by (tenant_id)",
            "legendFormat": "Tenant {{ tenant_id }}"
          }
        ]
      }
    ]
  }
}

Monitoring Best Practices

1. Metric Design Principles

  • Actionable: every metric should map to a concrete response
  • Understandable: metric names and labels should be clear and self-explanatory
  • Consistent: metrics of the same kind follow the same naming convention
  • Complete: cover the key aspects of the system, the business, and the user experience

2. Alert Design Principles

  • Relevance: alerts should correspond to real problems
  • Actionability: every alert should have a defined remediation procedure
  • Timeliness: alerts should fire promptly when a problem occurs
  • Low noise: minimize false positives and duplicate alerts

3. Monitoring Operations Practices

  • Regular review: periodically verify that metrics are still meaningful
  • Performance: the monitoring system must not degrade the business systems it observes
  • Capacity planning: base capacity planning on monitoring data
  • Fault drills: run regular fault drills to validate monitoring effectiveness

4. Data Retention Policy

  • Raw metrics: retained for 15 days
  • Aggregated data: retained for 1 year
  • Alert history: retained for 6 months
  • Traces: retained for 7 days

Incident Response Process

1. Alert Response Levels

  • P0 - Emergency: the system is completely unavailable, affecting all users
  • P1 - Severe: core functionality is unavailable, affecting most users
  • P2 - Major: some functionality is degraded, affecting a subset of users
  • P3 - Minor: non-core functionality is degraded, affecting few users

2. Response Time Requirements

  • P0: respond within 5 minutes, resolve within 30 minutes
  • P1: respond within 15 minutes, resolve within 2 hours
  • P2: respond within 30 minutes, resolve within 4 hours
  • P3: respond within 1 hour, resolve within 24 hours

3. Incident Handling Process

4. Post-Incident Review Requirements

  • Problem description: describe the symptoms and the scope of impact in detail
  • Timeline: record the key moments of occurrence, detection, response, and resolution
  • Root cause analysis: dig into the underlying cause of the incident
  • Improvement actions: define concrete remediation and prevention measures
  • Lessons learned: capture takeaways and refine the incident response playbook