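Monitoring and Alerting Documentation
Overview
This document describes the monitoring system, alerting mechanisms, and operational monitoring strategy for the backend of 井云服务中心 (Jingyun Service Center). Together they provide observability and fast incident response.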
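Monitoring Architecture
Monitoring System Architecture
Monitoring Technology Stack
- Metrics: Prometheus + Grafana
- Logs: Loki + Grafana
- Distributed tracing: Jaeger + OpenTelemetry
- Alert management: Alertmanager + Webhook
- Service discovery: Consul + Prometheus
- Health checks: Kratos built-in health checks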
Metrics Monitoring
Core Metric Definitions
1. Business Metrics
User Activity Metrics
# User registrations
user_registrations_total:
  type: counter
  description: "Total number of user registrations"
  labels: [tenant_id, source]
# Daily active users
daily_active_users:
  type: gauge
  description: "Number of daily active users"
  labels: [tenant_id]
# Monthly active users
monthly_active_users:
  type: gauge
  description: "Number of monthly active users"
  labels: [tenant_id]
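The difference between the `counter` and `gauge` types above can be illustrated with a minimal in-memory sketch. This is not the Prometheus client API; the `Counter` and `Gauge` classes, the sample events, and the user IDs are hypothetical and only model the semantics: a counter only ever increases, while a gauge is overwritten with the current value.

```python
from collections import defaultdict


class Counter:
    """Monotonic counter keyed by a tuple of label values."""

    def __init__(self):
        self.values = defaultdict(float)

    def inc(self, labels, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.values[labels] += amount


class Gauge:
    """Point-in-time value keyed by a tuple of label values."""

    def __init__(self):
        self.values = {}

    def set(self, labels, value):
        self.values[labels] = value


user_registrations_total = Counter()
daily_active_users = Gauge()

# Hypothetical registration events as (tenant_id, source) pairs.
for tenant_id, source in [("t1", "web"), ("t1", "web"), ("t2", "app")]:
    user_registrations_total.inc((tenant_id, source))

# Hypothetical set of distinct users seen today for tenant t1.
seen_today = {"u1", "u2", "u3"}
daily_active_users.set(("t1",), len(seen_today))
```

Each distinct tuple of label values is tracked separately, which mirrors how every label combination becomes its own time series.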
Agent Usage Metrics
# Agents created
agents_created_total:
  type: counter
  description: "Total number of agents created"
  labels: [tenant_id, agent_type, platform_type]
# Agent usage
agent_usage_total:
  type: counter
  description: "Total number of agent uses"
  labels: [tenant_id, agent_id, usage_type]
# Agent requests
agent_requests_total:
  type: counter
  description: "Total number of agent requests"
  labels: [tenant_id, agent_id, status]
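Because `agent_requests_total` carries a `status` label, a failure ratio can be derived by comparing the non-successful counts against the total. The sketch below assumes that a `status` value of `"success"` marks a successful request; the other status values and the counts are hypothetical, since the document does not list them.

```python
def failure_ratio(counts_by_status, ok_statuses=("success",)):
    """Return the share of requests whose status is not in ok_statuses."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0  # no traffic, so no failures to report
    failed = sum(
        count
        for status, count in counts_by_status.items()
        if status not in ok_statuses
    )
    return failed / total


# Hypothetical request counts for one agent over some time window.
window_counts = {"success": 95, "error": 4, "timeout": 1}
ratio = failure_ratio(window_counts)
```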
Payment Transaction Metrics
# Orders created
orders_created_total:
  type: counter
  description: "Total number of orders created"
  labels: [tenant_id, order_type, payment_method]
# Successful payments
payments_success_total:
  type: counter
  description: "Total number of successful payments"
  labels: [tenant_id, payment_method, amount_range]
# Failed payments
payments_failed_total:
  type: counter
  description: "Total number of failed payments"
  labels: [tenant_id, payment_method, error_type]
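The `amount_range` label implies that raw amounts are mapped to a small, fixed set of label values before the counter is incremented. One way this could look is sketched below; the boundaries and label strings are assumptions chosen for illustration, not values defined by this document.

```python
import bisect

# Hypothetical range boundaries and the label value for each range.
# Ranges are [0, 10), [10, 100), [100, 1000), and [1000, +inf).
AMOUNT_BOUNDS = [10, 100, 1000]
AMOUNT_LABELS = ["0-10", "10-100", "100-1000", "1000+"]


def amount_range(amount):
    """Map a non-negative amount to its amount_range label value."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    # bisect_right counts how many boundaries are <= amount,
    # which is exactly the index of the matching range.
    return AMOUNT_LABELS[bisect.bisect_right(AMOUNT_BOUNDS, amount)]
```

Keeping the set of ranges small and fixed keeps the number of time series per metric bounded, which is the usual reason to label by range instead of by raw amount.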
Points Consumption Metrics
# Points granted
points_granted_total:
  type: counter
  description: "Total points granted"
  labels: [tenant_id, source, amount_range]
# Points consumed
points_consumed_total:
  type: counter
  description: "Total points consumed"
  labels: [tenant_id, usage_type, amount_range]
# Points balance
points_balance:
  type: gauge
  description: "User points balance"
  labels: [tenant_id, user_id]
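Every distinct combination of label values becomes a separate time series, so the label sets above determine how many series each metric can produce. The sketch below computes an upper bound on that count; all cardinality figures are hypothetical and only illustrate why a per-user label such as `user_id` grows much faster than a bounded label such as `amount_range`.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical numbers of distinct values per label.
points_granted = {"tenant_id": 50, "source": 5, "amount_range": 4}
points_balance = {"tenant_id": 50, "user_id": 20000}

granted_series = max_series(points_granted)
balance_series = max_series(points_balance)
```

The product is only an upper bound: if each user belongs to a single tenant, the real series count for `points_balance` is closer to the number of users.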
2. System Metrics
HTTP Service Metrics
# HTTP requests
http_requests_total:
  type: counter
  description: "Total number of HTTP requests"
  labels: [method, endpoint, status_code, tenant_id]
# HTTP request latency
http_request_duration_seconds:
  type: histogram
  description: "HTTP request latency distribution"
  labels: [method, endpoint, tenant_id]
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
# HTTP request size
http_request_size_bytes:
  type: histogram
  description: "HTTP request size distribution"
  labels: [method, endpoint]
  buckets: [100, 1000, 10000, 100000, 1000000]
# HTTP response size
http_response_size_bytes:
  type: histogram
  description: "HTTP response size distribution"
  labels: [method, endpoint]
  buckets: [100, 1000, 10000, 100000, 1000000]
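The `buckets` lists above are upper bounds: an observation is counted in every bucket whose bound is greater than or equal to the observed value, plus an implicit `+Inf` bucket. The following sketch models that behaviour with the latency buckets from `http_request_duration_seconds`. The `Histogram` class and the sample latencies are illustrative assumptions, not the Prometheus client implementation.

```python
import bisect

HTTP_DURATION_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


class Histogram:
    """Simplified histogram with 'less than or equal' upper bounds."""

    def __init__(self, buckets):
        self.bounds = sorted(buckets)
        # One slot per finite bound plus a final slot for +Inf.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # Index of the first bound >= value; len(bounds) means +Inf.
        index = bisect.bisect_left(self.bounds, value)
        self.counts[index] += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        pairs, running = [], 0
        bounds = self.bounds + [float("inf")]
        for bound, count in zip(bounds, self.counts):
            running += count
            pairs.append((bound, running))
        return pairs


h = Histogram(HTTP_DURATION_BUCKETS)
for seconds in [0.003, 0.03, 0.1, 12]:  # hypothetical request latencies
    h.observe(seconds)

by_bound = dict(h.cumulative())
```

In this example the 12 second request exceeds the largest finite bound and is only visible in the `+Inf` bucket, which is why the largest bucket should sit above the slowest latency worth distinguishing.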
gRPC Service Metrics
# gRPC requests
grpc_requests_total:
  type: counter
  description: "Total number of gRPC requests"
  labels: [method, status_code, tenant_id]
# gRPC request latency
grpc_request_duration_seconds:
  type: histogram
  description: "gRPC request latency distribution"
  labels: [method, tenant_id]
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
# gRPC message size
grpc_message_size_bytes:
  type: histogram
  description: "gRPC message size distribution"
  labels: [method, message_type]
  buckets: [100, 1000, 10000, 100000, 1000000]
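Latency histograms such as `grpc_request_duration_seconds` are usually read as quantiles (for example p95) rather than as raw bucket counts. The sketch below estimates a quantile from cumulative bucket counts by linear interpolation within the matching bucket, which is similar in spirit to PromQL's `histogram_quantile` but is a simplified stand-in, not that function. The sample bucket counts are hypothetical.

```python
import math


def estimate_quantile(q, cumulative):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The pairs must be sorted by bound and end with a +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = cumulative[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in cumulative:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Cannot interpolate into +Inf or an empty bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for four buckets.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p50 = estimate_quantile(0.5, sample)
p95 = estimate_quantile(0.95, sample)
```

The estimate assumes observations are spread evenly inside each bucket, so its accuracy depends on how closely the bucket bounds bracket the latencies of interest.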
Database Metrics
# Database connection pool metrics
db_connections_active:
  type: gauge
  description: "Number of active database connections"
  labels: [database, service]
db_connections_idle:
  type: gauge
  description: "Number of idle database connections"
  labels: [database, service]
db_connections_max:
  type: gauge
  description: "