
Monitoring and Alerting Documentation

Overview

This document describes the monitoring system, alerting mechanisms, and operational monitoring strategy of the Jingyun Service Center backend, ensuring observability and fast fault response.

Monitoring Architecture

Monitoring System Architecture

Monitoring Stack

  • Metrics: Prometheus + Grafana
  • Logging: Loki + Grafana
  • Tracing: Jaeger + OpenTelemetry
  • Alerting: AlertManager + Webhook
  • Service discovery: Consul + Prometheus
  • Health checks: Kratos built-in health checks

Metrics Monitoring

Core Metric Definitions

1. Business Metrics

User activity metrics
# User registration counter
user_registrations_total:
  type: counter
  description: "Total user registrations"
  labels: [tenant_id, source]

# Daily active users
daily_active_users:
  type: gauge
  description: "Number of daily active users"
  labels: [tenant_id]

# Monthly active users
monthly_active_users:
  type: gauge
  description: "Number of monthly active users"
  labels: [tenant_id]
Agent usage metrics
# Agent creation counter
agents_created_total:
  type: counter
  description: "Total agents created"
  labels: [tenant_id, agent_type, platform_type]

# Agent usage counter
agent_usage_total:
  type: counter
  description: "Total agent usage"
  labels: [tenant_id, agent_id, usage_type]

# Agent request counter
agent_requests_total:
  type: counter
  description: "Total agent requests"
  labels: [tenant_id, agent_id, status]
Payment transaction metrics
# Order creation counter
orders_created_total:
  type: counter
  description: "Total orders created"
  labels: [tenant_id, order_type, payment_method]

# Successful payment counter
payments_success_total:
  type: counter
  description: "Total successful payments"
  labels: [tenant_id, payment_method, amount_range]

# Failed payment counter
payments_failed_total:
  type: counter
  description: "Total failed payments"
  labels: [tenant_id, payment_method, error_type]
Points consumption metrics
# Points granted counter
points_granted_total:
  type: counter
  description: "Total points granted"
  labels: [tenant_id, source, amount_range]

# Points consumed counter
points_consumed_total:
  type: counter
  description: "Total points consumed"
  labels: [tenant_id, usage_type, amount_range]

# Points balance gauge
points_balance:
  type: gauge
  description: "User points balance"
  labels: [tenant_id, user_id]

2. System Metrics

HTTP service metrics
# Total HTTP requests
http_requests_total:
  type: counter
  description: "Total HTTP requests"
  labels: [method, endpoint, status_code, tenant_id]

# HTTP request latency
http_request_duration_seconds:
  type: histogram
  description: "HTTP request latency distribution"
  labels: [method, endpoint, tenant_id]
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# HTTP request size
http_request_size_bytes:
  type: histogram
  description: "HTTP request size distribution"
  labels: [method, endpoint]
  buckets: [100, 1000, 10000, 100000, 1000000]

# HTTP response size
http_response_size_bytes:
  type: histogram
  description: "HTTP response size distribution"
  labels: [method, endpoint]
  buckets: [100, 1000, 10000, 100000, 1000000]
gRPC service metrics
# Total gRPC requests
grpc_requests_total:
  type: counter
  description: "Total gRPC requests"
  labels: [method, status_code, tenant_id]

# gRPC request latency
grpc_request_duration_seconds:
  type: histogram
  description: "gRPC request latency distribution"
  labels: [method, tenant_id]
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# gRPC message size
grpc_message_size_bytes:
  type: histogram
  description: "gRPC message size distribution"
  labels: [method, message_type]
  buckets: [100, 1000, 10000, 100000, 1000000]
Database metrics
# Connection pool metrics
db_connections_active:
  type: gauge
  description: "Active database connections"
  labels: [database, service]

db_connections_idle:
  type: gauge
  description: "Idle database connections"
  labels: [database, service]

db_connections_max:
  type: gauge
  description: "Maximum database connections"
  labels: [database, service]

# Query metrics
db_query_duration_seconds:
  type: histogram
  description: "Database query latency distribution"
  labels: [database, operation, table]
  buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5]

# Error metrics
db_errors_total:
  type: counter
  description: "Total database errors"
  labels: [database, operation, error_type]
Cache metrics
# Redis connection metrics
redis_connections_active:
  type: gauge
  description: "Active Redis connections"
  labels: [service]

# Redis command metrics
redis_commands_total:
  type: counter
  description: "Total Redis commands executed"
  labels: [service, command, status]

# Redis latency metrics
redis_command_duration_seconds:
  type: histogram
  description: "Redis command latency distribution"
  labels: [service, command]
  buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1]

# Cache hit ratio
redis_cache_hit_ratio:
  type: gauge
  description: "Redis cache hit ratio"
  labels: [service, cache_type]
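The redis_cache_hit_ratio gauge above is derived from hit and miss counts rather than observed directly. A minimal sketch of that derivation (the hitRatio helper is illustrative, not part of the codebase):

```go
package main

import "fmt"

// hitRatio computes hits / (hits + misses), returning 0 when there is no
// traffic so the gauge never reports NaN.
func hitRatio(hits, misses float64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return hits / total
}

func main() {
	fmt.Println(hitRatio(90, 10)) // 0.9
	fmt.Println(hitRatio(0, 0))   // 0
}
```

The returned value would be written to the gauge periodically, e.g. from the same loop that samples the Redis INFO stats.
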
Message queue metrics
# RabbitMQ queue metrics
rabbitmq_queue_messages:
  type: gauge
  description: "Messages in RabbitMQ queue"
  labels: [queue, vhost]

rabbitmq_queue_messages_ready:
  type: gauge
  description: "Ready messages in RabbitMQ queue"
  labels: [queue, vhost]

rabbitmq_queue_messages_unacknowledged:
  type: gauge
  description: "Unacknowledged messages in RabbitMQ queue"
  labels: [queue, vhost]

# Publish/consume metrics
rabbitmq_messages_published_total:
  type: counter
  description: "Total RabbitMQ messages published"
  labels: [exchange, routing_key]

rabbitmq_messages_consumed_total:
  type: counter
  description: "Total RabbitMQ messages consumed"
  labels: [queue, consumer]

3. Infrastructure Metrics

System resource metrics
# CPU usage
cpu_usage_percent:
  type: gauge
  description: "CPU usage percentage"
  labels: [instance, cpu]

# Memory usage
memory_usage_percent:
  type: gauge
  description: "Memory usage percentage"
  labels: [instance]

# Disk usage
disk_usage_percent:
  type: gauge
  description: "Disk usage percentage"
  labels: [instance, device]

# Network traffic
network_bytes_total:
  type: counter
  description: "Total network bytes"
  labels: [instance, device, direction]
Container metrics
# Container CPU time
container_cpu_usage_seconds_total:
  type: counter
  description: "Container CPU time consumed"
  labels: [container, pod, namespace]

# Container memory usage
container_memory_usage_bytes:
  type: gauge
  description: "Container memory usage"
  labels: [container, pod, namespace]

# Container network traffic
container_network_bytes_total:
  type: counter
  description: "Container network traffic"
  labels: [container, pod, namespace, interface, direction]

Metrics Collection Implementation

1. Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Service discovery via Consul
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,metrics,.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_node]
        target_label: node

  # Application metrics
  - job_name: 'jingyun-services'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - jingyun
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Infrastructure exporters
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'rabbitmq-exporter'
    static_configs:
      - targets: ['rabbitmq-exporter:9419']

2. Application Metrics Implementation

// metrics/metrics.go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Total HTTP requests
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status_code", "tenant_id"},
	)

	// HTTP request latency
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency distribution",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint", "tenant_id"},
	)

	// Database query latency
	dbQueryDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "db_query_duration_seconds",
			Help:    "Database query latency distribution",
			Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
		},
		[]string{"database", "operation", "table"},
	)

	// Business metrics
	userRegistrationsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "user_registrations_total",
			Help: "Total number of user registrations",
		},
		[]string{"tenant_id", "source"},
	)

	agentUsageTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "agent_usage_total",
			Help: "Total number of agent usages",
		},
		[]string{"tenant_id", "agent_id", "usage_type"},
	)
)

// Metric recording helpers
func RecordHTTPRequest(method, endpoint, statusCode, tenantID string) {
	httpRequestsTotal.WithLabelValues(method, endpoint, statusCode, tenantID).Inc()
}

func RecordHTTPRequestDuration(method, endpoint, tenantID string, duration float64) {
	httpRequestDuration.WithLabelValues(method, endpoint, tenantID).Observe(duration)
}

func RecordDBQuery(database, operation, table string, duration float64) {
	dbQueryDuration.WithLabelValues(database, operation, table).Observe(duration)
}

func RecordUserRegistration(tenantID, source string) {
	userRegistrationsTotal.WithLabelValues(tenantID, source).Inc()
}

func RecordAgentUsage(tenantID, agentID, usageType string) {
	agentUsageTotal.WithLabelValues(tenantID, agentID, usageType).Inc()
}

3. Middleware Integration

// middleware/metrics.go
package middleware

import (
	"context"
	"time"

	"github.com/go-kratos/kratos/v2/middleware"
	"github.com/go-kratos/kratos/v2/transport"

	"your-project/metrics"
)

func Metrics() middleware.Middleware {
	return func(handler middleware.Handler) middleware.Handler {
		// Named return values let the deferred function observe the final error.
		return func(ctx context.Context, req interface{}) (reply interface{}, err error) {
			start := time.Now()

			// Extract request information from the transport context
			if tr, ok := transport.FromServerContext(ctx); ok {
				method := tr.Operation()
				endpoint := extractEndpoint(tr)
				tenantID := extractTenantID(ctx)

				defer func() {
					duration := time.Since(start).Seconds()
					metrics.RecordHTTPRequestDuration(method, endpoint, tenantID, duration)

					// Record the request outcome
					if err != nil {
						metrics.RecordHTTPRequest(method, endpoint, extractStatusCode(err), tenantID)
					} else {
						metrics.RecordHTTPRequest(method, endpoint, "200", tenantID)
					}
				}()
			}

			return handler(ctx, req)
		}
	}
}

Log Monitoring

Log Architecture Design

1. Log Collection Architecture

2. Log Format Specification

{
  "timestamp": "2025-12-27T10:30:00.000Z",
  "level": "info",
  "service": "user-service",
  "method": "CreateUser",
  "request_id": "req_123456789",
  "user_id": 12345,
  "tenant_id": 1,
  "ip_address": "192.168.1.100",
  "user_agent": "Mozilla/5.0...",
  "message": "User created successfully",
  "duration_ms": 150,
  "metadata": {
    "phone": "138****1234",
    "source": "wechat"
  },
  "error": null,
  "stack_trace": null
}

3. Log Level Definitions

  • error: errors and exceptions that need immediate attention
  • warn: warnings that may affect system functionality
  • info: significant business operations
  • debug: debugging information, used only in development environments
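In Kratos, a minimum level like this is typically enforced with log.NewFilter and log.FilterLevel; a dependency-free sketch of the underlying rank comparison (shouldLog and levelRank are illustrative names):

```go
package main

import "fmt"

// levelRank orders the four levels defined above.
var levelRank = map[string]int{"debug": 0, "info": 1, "warn": 2, "error": 3}

// shouldLog reports whether a record at level `record` passes a logger
// configured with minimum level `min`; unknown levels are always kept.
func shouldLog(min, record string) bool {
	minRank, okMin := levelRank[min]
	recRank, okRec := levelRank[record]
	if !okMin || !okRec {
		return true
	}
	return recRank >= minRank
}

func main() {
	fmt.Println(shouldLog("info", "debug")) // false
	fmt.Println(shouldLog("info", "error")) // true
}
```
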

Log Collection Configuration

1. Fluentd Configuration

# fluentd.conf
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<filter kubernetes.**>
  @type record_transformer
  <record>
    hostname ${hostname}
    environment "#{ENV['ENVIRONMENT']}"
  </record>
</filter>

<match kubernetes.var.log.containers.**jingyun**.log>
  @type loki
  url "http://loki:3100/loki/api/v1/push"
  line_format json
  # Promote these record keys to Loki stream labels
  <label>
    service
    environment
    level
  </label>
</match>

2. Application Logging Configuration

// logger/logger.go
package logger

import (
	"context"
	"os"
	"runtime/debug"

	"github.com/go-kratos/kratos/v2/log"
	"github.com/go-kratos/kratos/v2/middleware/tracing"
)

func NewLogger(level string) log.Logger {
	logger := log.With(
		log.NewStdLogger(os.Stdout),
		"ts", log.DefaultTimestamp,
		"caller", log.DefaultCaller,
		"service", "user-service",
		"version", "v1.0.0",
		"trace_id", tracing.TraceID(),
		"span_id", tracing.SpanID(),
	)
	// Apply the configured minimum level instead of ignoring the parameter.
	return log.NewFilter(logger, log.FilterLevel(log.ParseLevel(level)))
}

// Structured logging helpers
func LogUserCreated(ctx context.Context, userID int64, tenantID int64, phone string, source string) {
	log.NewHelper(log.FromContext(ctx)).Infow(
		"msg", "user created",
		"user_id", userID,
		"tenant_id", tenantID,
		"phone", maskPhone(phone),
		"source", source,
	)
}

func LogError(ctx context.Context, err error, operation string) {
	log.NewHelper(log.FromContext(ctx)).Errorw(
		"msg", "operation failed",
		"operation", operation,
		"error", err.Error(),
		"stack_trace", string(debug.Stack()),
	)
}
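maskPhone is referenced above but not defined in this document; a minimal sketch matching the 138****1234 shape used in the log format example:

```go
package main

import "fmt"

// maskPhone hides the middle four digits of an 11-digit phone number,
// e.g. "13812341234" -> "138****1234"; other lengths are returned unchanged.
func maskPhone(phone string) string {
	if len(phone) != 11 {
		return phone
	}
	return phone[:3] + "****" + phone[7:]
}

func main() {
	fmt.Println(maskPhone("13812341234")) // 138****1234
}
```
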

Log Query and Analysis

1. Grafana Log Queries (LogQL)

# Error logs mentioning the user service
{level="error"} |= "user service"

# All logs from a specific service
{service="user-service"}

# Slow requests (duration_ms is a JSON field, not a stream label, so it must be extracted first)
{service="user-service"} | json | duration_ms > 1000

# Operations by a specific user
{service="user-service"} | json | user_id = "12345"

# Logs containing a specific date string (the time range itself is set in the Grafana time picker)
{service="user-service"} |= "2025-12-27"

2. Log Aggregation and Analysis

# Count error log lines over 5 minutes
count_over_time({level="error"}[5m])

# Request volume per service over the last hour
sum by (service) (count_over_time({service!=""}[1h]))

# 95th percentile response time (PromQL, from metrics)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate (PromQL, from metrics)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Distributed Tracing

Tracing Architecture

1. Architecture Overview

The tracing architecture is layered as follows:

Application layer:

  • Gateway
  • Auth Service
  • User Service
  • Agent Service

Trace collection:

  • OpenTelemetry Collector (OTEL)
  • Jaeger Collector

Storage layer:

  • Jaeger Storage (JAEGERDB)

Query layer:

  • Jaeger UI (JAEGERUI)

Data flow: application services → OpenTelemetry Collector → Jaeger Collector → Jaeger Storage → Jaeger UI

2. OpenTelemetry Configuration

// tracing/tracing.go
package tracing

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func InitTracer(serviceName string) (*trace.TracerProvider, error) {
	// Create the Jaeger exporter
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint())
	if err != nil {
		return nil, err
	}

	// Create the tracer provider
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(serviceName),
			semconv.ServiceVersionKey.String("v1.0.0"),
		)),
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}

3. Tracing Middleware

// middleware/tracing.go
package middleware

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"

	"github.com/go-kratos/kratos/v2/middleware"
	"github.com/go-kratos/kratos/v2/transport"
)

func Tracing(serviceName string) middleware.Middleware {
	tracer := otel.Tracer(serviceName)

	return func(handler middleware.Handler) middleware.Handler {
		return func(ctx context.Context, req interface{}) (interface{}, error) {
			if tr, ok := transport.FromServerContext(ctx); ok {
				spanName := tr.Operation()
				ctx, span := tracer.Start(ctx, spanName, trace.WithAttributes(
					attribute.String("service", serviceName),
					attribute.String("operation", spanName),
				))
				defer span.End()

				return handler(ctx, req)
			}
			return handler(ctx, req)
		}
	}
}

Alert Configuration

Alert Rule Definitions

1. System-Level Alert Rules

# alert_rules.yml
groups:
  - name: system.rules
    rules:
      # Service availability
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.service }} is down"
          description: "Service {{ $labels.service }} has been down for more than 1 minute."

      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}."

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "95th percentile latency is {{ $value }}s for service {{ $labels.service }}."

      # High CPU usage
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on instance {{ $labels.instance }}."

      # High memory usage
      - alert: HighMemoryUsage
        expr: memory_usage_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on instance {{ $labels.instance }}."

      # High disk usage
      - alert: HighDiskUsage
        expr: disk_usage_percent > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}% on instance {{ $labels.instance }}."

2. Business-Level Alert Rules

  - name: business.rules
    rules:
      # Abnormally low user registrations
      - alert: UserRegistrationAnomaly
        expr: rate(user_registrations_total[5m]) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low user registration rate"
          description: "User registration rate is {{ $value }} per second, which is unusually low."

      # High payment failure rate (failures over all attempts, not failures over successes)
      - alert: HighPaymentFailureRate
        expr: rate(payments_failed_total[5m]) / (rate(payments_success_total[5m]) + rate(payments_failed_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High payment failure rate"
          description: "Payment failure rate is {{ $value | humanizePercentage }}."

      # No agent usage
      - alert: AgentUsageAnomaly
        expr: rate(agent_usage_total[5m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No agent usage detected"
          description: "No agent usage has been detected in the last 15 minutes."

      # Abnormal points balance
      - alert: PointsBalanceAnomaly
        expr: sum(points_balance) < 1000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Low total points balance"
          description: "Total points balance is {{ $value }}, which is unusually low."

3. Database Alert Rules

  - name: database.rules
    rules:
      # Connection pool saturation
      - alert: HighDatabaseConnections
        expr: db_connections_active / db_connections_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database connection usage"
          description: "Database connection usage is {{ $value | humanizePercentage }}."

      # Query latency
      - alert: HighDatabaseLatency
        expr: histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database query latency"
          description: "95th percentile database query latency is {{ $value }}s."

      # Error rate (the histogram's _count series serves as the query total,
      # since no db_queries_total counter is defined above)
      - alert: HighDatabaseErrorRate
        expr: rate(db_errors_total[5m]) / rate(db_query_duration_seconds_count[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High database error rate"
          description: "Database error rate is {{ $value | humanizePercentage }}."

Alert Management Configuration

1. AlertManager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@jingyun.design'
  smtp_auth_username: 'alerts@jingyun.design'
  smtp_auth_password: 'smtp_password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://alert-webhook:8080/webhook'
        send_resolved: true

  - name: 'critical-alerts'
    email_configs:
      - to: 'ops-team@jingyun.design'
        headers:
          Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
    webhook_configs:
      - url: 'http://alert-webhook:8080/critical'
        send_resolved: true

  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@jingyun.design'
        headers:
          Subject: '[WARNING] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
    webhook_configs:
      - url: 'http://alert-webhook:8080/warning'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

2. Webhook Alert Handling

// webhook/handler.go
package webhook

import (
	"encoding/json"
	"net/http"
	"time"
)

type Alert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

type WebhookPayload struct {
	Receiver string  `json:"receiver"`
	Status   string  `json:"status"`
	Alerts   []Alert `json:"alerts"`
}

func HandleAlertWebhook(w http.ResponseWriter, r *http.Request) {
	var payload WebhookPayload
	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	for _, alert := range payload.Alerts {
		// Dispatch each alert
		processAlert(alert)
	}

	w.WriteHeader(http.StatusOK)
}

func processAlert(alert Alert) {
	severity := alert.Labels["severity"]
	service := alert.Labels["service"]
	summary := alert.Annotations["summary"]

	// Route to the appropriate notification channel
	switch severity {
	case "critical":
		sendCriticalAlert(service, summary)
	case "warning":
		sendWarningAlert(service, summary)
	default:
		sendInfoAlert(service, summary)
	}
}

Visualization Dashboards

Grafana Dashboard Configuration

1. System Overview Dashboard

{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "Service Status",
        "type": "stat",
        "targets": [
          {
            "expr": "up",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "95th percentile - {{ service }}"
          }
        ]
      }
    ]
  }
}

2. Business Metrics Dashboard

{
  "dashboard": {
    "title": "Business Metrics",
    "panels": [
      {
        "title": "User Registration Trend",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(user_registrations_total[5m])) by (tenant_id)",
            "legendFormat": "Tenant {{ tenant_id }}"
          }
        ]
      },
      {
        "title": "Agent Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(agent_usage_total[5m])) by (tenant_id)",
            "legendFormat": "Tenant {{ tenant_id }}"
          }
        ]
      },
      {
        "title": "Payment Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(payments_success_total[5m])) / (sum(rate(payments_success_total[5m])) + sum(rate(payments_failed_total[5m])))",
            "legendFormat": "Payment success rate"
          }
        ]
      },
      {
        "title": "Points Consumption Trend",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(points_consumed_total[5m])) by (tenant_id)",
            "legendFormat": "Tenant {{ tenant_id }}"
          }
        ]
      }
    ]
  }
}

Monitoring Best Practices

1. Metric Design Principles

  • Actionable: every metric should map to a concrete response
  • Understandable: metric names and labels should be clear and self-explanatory
  • Consistent: metrics of the same kind follow the same naming convention
  • Complete: cover the key aspects of the system, the business, and the user experience

2. Alert Design Principles

  • Relevance: alerts should correspond to real problems
  • Actionability: every alert should have a defined remediation procedure
  • Timeliness: alerts should fire promptly when a problem occurs
  • Low noise: minimize false positives and duplicate alerts

3. Monitoring Operations Practices

  • Regular review: periodically verify that metrics are still meaningful
  • Performance: the monitoring system must not degrade the business systems it observes
  • Capacity planning: base capacity planning on monitoring data
  • Fault drills: run regular fault drills to validate monitoring effectiveness

4. Data Retention Policy

  • Raw metrics: retained for 15 days
  • Aggregated data: retained for 1 year
  • Alert history: retained for 6 months
  • Traces: retained for 7 days

Incident Response Process

1. Alert Response Levels

  • P0 - Emergency: the system is completely unavailable, affecting all users
  • P1 - Severe: core functionality is unavailable, affecting most users
  • P2 - Major: some functionality is degraded, affecting a subset of users
  • P3 - Minor: non-core functionality is degraded, affecting few users

2. Response Time Requirements

  • P0: respond within 5 minutes, resolve within 30 minutes
  • P1: respond within 15 minutes, resolve within 2 hours
  • P2: respond within 30 minutes, resolve within 4 hours
  • P3: respond within 1 hour, resolve within 24 hours

3. Incident Handling Process

4. Post-Incident Review Requirements

  • Problem description: describe the symptoms and the scope of impact in detail
  • Timeline: record the key moments of occurrence, detection, response, and resolution
  • Root cause analysis: dig into the underlying cause of the incident
  • Improvement actions: define concrete remediation and prevention measures
  • Lessons learned: capture takeaways and refine the incident response playbook