Performance Monitoring and Tuning

Introduction: Performance Is Central to User Experience

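Users will not tolerate a sluggish AI application. A 3-second response and a 300 ms response feel worlds apart.

In this lesson we will learn how to monitor and optimize the performance of AI applications.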

Performance Metrics for AI Applications

**Key metrics**:

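**1. Latency (response time)**
- P50: the response time that 50% of requests stay within
- P95: the response time that 95% of requests stay within (watch the long tail)
- P99: the response time that 99% of requests stay within
- TTFB: time to first byte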

**2. Throughput**
- QPS: queries per second
- RPS: requests per second
- TPM: tokens per minute

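**3. Resource utilization**
- GPU utilization: ideally 70-90%
- Memory usage: keep enough headroom to avoid out-of-memory (OOM) errors
- CPU utilization: a steady load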

**4. Error rate**
- HTTP error rate (4xx, 5xx)
- Model call failure rate
- Timeout rate

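**5. User experience metrics**
- Time to first token (TTFT) for streaming output
- Perceived latency (from the user's input to the first visible output)
- Completion rate (the percentage of requests that complete)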
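TTFT can be measured by timing a token stream as it is consumed. The sketch below is a minimal illustration: fake_token_stream is a hypothetical stand-in for a real streaming LLM response, and the delays are made-up values.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(tokens: Iterable[str], first_delay: float, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    for i, token in enumerate(tokens):
        # The first token is slow (model warm-up), later tokens arrive faster
        time.sleep(first_delay if i == 0 else gap)
        yield token


def measure_stream(stream: Iterable[str]) -> dict:
    """Consume a token stream and record TTFT and total latency in seconds."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # Time elapsed until the first token arrived
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {"ttft": ttft, "total": total, "tokens": count}


stats = measure_stream(
    fake_token_stream(["Hello", ",", " world"], first_delay=0.05, gap=0.01)
)
print(stats)
```

In a real service, the same measurement would typically be reported to the metrics system as its own histogram, separate from total response time.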

💡 Watch the P95/P99 tail latency, not just the average. For streaming output, time to first token (TTFT) is the metric users feel most directly.
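To see why the average hides the tail, here is a small sketch using only the standard library. The sample data is made up for illustration: 90 requests at 300 ms and 10 requests at 3 seconds.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict:
    """Return the mean, P50, P95 and P99 of a list of latency samples."""
    # quantiles(n=100) returns the 99 cut points P1..P99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Made-up data: 90 fast requests (300 ms) and 10 slow ones (3000 ms)
samples = [300.0] * 90 + [3000.0] * 10
summary = latency_summary(samples)
print(summary)
```

With this data the mean is 570 ms, which describes neither group of users: the median request takes 300 ms, while the P95 and P99 requests take 3 seconds.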

Performance Monitoring Tools

**APM tools**:
- Datadog, New Relic: full-stack monitoring
- Grafana + Prometheus: open-source stack
- Lightstep: distributed tracing

**AI-specific monitoring**:
- LangSmith: LLM application monitoring
- Weights & Biases: model experiment tracking
- Arize: model performance monitoring

**Basic monitoring metrics**:

```yaml
metrics:
  - name: response_time_p99
    type: histogram
    labels: [endpoint, model]
  - name: gpu_utilization
    type: gauge
    labels: [gpu_id, model]
  - name: token_throughput
    type: counter
    labels: [model]
  - name: error_rate
    type: gauge
    labels: [endpoint, error_type]
```

Code Example: Collecting Performance Metrics

Collecting performance metrics with Prometheus:

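```python
import time
from contextlib import contextmanager

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Metric definitions
response_time = Histogram(
    'api_response_time_seconds',
    'API response time in seconds',
    ['endpoint', 'model']
)

token_throughput = Counter(
    'token_throughput_total',
    'Total number of tokens processed',
    ['model']
)

gpu_utilization = Gauge(
    'gpu_utilization_percent',
    'GPU utilization in percent',
    ['gpu_id', 'model']
)

error_count = Counter(
    'api_error_count_total',
    'Number of API errors',
    ['endpoint', 'error_type']
)

@contextmanager
def track_api_call(endpoint: str, model: str):
    """Record latency and errors for one API call."""
    start = time.perf_counter()
    try:
        yield
    except Exception as e:
        error_count.labels(endpoint=endpoint, error_type=type(e).__name__).inc()
        raise
    finally:
        # Runs on success and on failure, so failed calls are timed too
        duration = time.perf_counter() - start
        response_time.labels(endpoint=endpoint, model=model).observe(duration)

def call_llm_api(prompt: str) -> str:
    """Placeholder for the real LLM client call; replace with your own."""
    return f"echo: {prompt}"

# Expose /metrics for Prometheus to scrape (port 8000 is an arbitrary choice)
start_http_server(8000)

# Usage inside an API handler
with track_api_call('/api/chat', 'gpt-4'):
    result = call_llm_api("Hello")
    # Rough proxy only: counts words, not tokens. In production, use the
    # token usage reported by the API or the model's tokenizer.
    token_throughput.labels(model='gpt-4').inc(len(result.split()))
```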

Performance Optimization Techniques

**1. Reduce network round trips**
- Batch API calls
- Merge requests
- Use connection pooling

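**2. Optimize model calls**
- Batching: process multiple inputs in one call
- KV cache: reuse the attention keys and values already computed for earlier tokens
- Quantization: use quantized models (INT8/INT4)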

**3. Caching strategy**
- Cache request results
- Cache intermediate results
- Preload hot data
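Request-result caching can be sketched with a small in-memory cache that expires entries after a fixed time. This is an illustrative sketch, not a production design: TTLCache, slow_llm_call and the 60-second TTL are all assumptions, and caching only makes sense for requests where a repeated identical answer is acceptable.

```python
import hashlib
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (not thread-safe)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Hash model + prompt so keys have a fixed size
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)


call_log = []  # records every real (uncached) model call


def slow_llm_call(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real, expensive model call."""
    call_log.append((model, prompt))
    return f"[{model}] answer to: {prompt}"


cache = TTLCache(ttl_seconds=60)


def cached_llm_call(model: str, prompt: str) -> str:
    key = TTLCache.make_key(model, prompt)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = slow_llm_call(model, prompt)
    cache.set(key, result)
    return result


first = cached_llm_call("gpt-4", "What is AI?")
second = cached_llm_call("gpt-4", "What is AI?")
print(len(call_log))  # number of real model calls made
```

The second call is served from the cache, so the model is only invoked once for the two identical requests.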

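**4. Concurrency**
- Asynchronous I/O (asyncio)
- Concurrent requests (for example, a thread pool around the synchronous requests library)
- Streaming output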

**5. Data processing optimization**
- Vectorized computation (NumPy)
- Database index optimization
- Avoid N+1 queries
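The N+1 query pattern is easiest to recognize side by side with its fix. The sketch below uses an in-memory SQLite database; the conversations and messages tables are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE conversations (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        conversation_id INTEGER,
        body TEXT
    );
    CREATE INDEX idx_messages_conversation ON messages(conversation_id);
""")
conn.executemany(
    "INSERT INTO conversations VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)
conn.executemany(
    "INSERT INTO messages (conversation_id, body) VALUES (?, ?)",
    [(1, "hi"), (1, "hello"), (2, "hey"), (3, "yo")],
)


def load_n_plus_one(conn):
    """One query for the parents, then one query per parent: N+1 in total."""
    queries = 1
    result = {}
    rows = conn.execute("SELECT id FROM conversations ORDER BY id").fetchall()
    for (conv_id,) in rows:
        bodies = conn.execute(
            "SELECT body FROM messages WHERE conversation_id = ? ORDER BY id",
            (conv_id,),
        ).fetchall()
        queries += 1
        result[conv_id] = [body for (body,) in bodies]
    return result, queries


def load_with_join(conn):
    """A single JOIN that fetches the same data in one round trip."""
    result = {}
    rows = conn.execute("""
        SELECT c.id, m.body
        FROM conversations AS c
        LEFT JOIN messages AS m ON m.conversation_id = c.id
        ORDER BY c.id, m.id
    """)
    for conv_id, body in rows:
        result.setdefault(conv_id, [])
        if body is not None:  # conversations without messages yield NULL
            result[conv_id].append(body)
    return result, 1


slow_result, slow_queries = load_n_plus_one(conn)
fast_result, fast_queries = load_with_join(conn)
print(slow_queries, fast_queries)
```

Both functions return the same data, but the first issues one query per conversation on top of the initial one, so its cost grows with the number of rows.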

Code Example: Asynchronous Concurrent Calls

Using asyncio to handle multiple requests concurrently:

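```python
import asyncio
import aiohttp

# Placeholder endpoint from the original example; replace with a real API
API_URL = "https://api.example.com/api/chat"

async def fetch_with_timeout(session, url, params=None, timeout=10):
    """Async GET request with a total timeout in seconds."""
    try:
        async with session.get(
            url, params=params, timeout=aiohttp.ClientTimeout(total=timeout)
        ) as response:
            return await response.json()
    except asyncio.TimeoutError:
        return {"error": "timeout"}

async def parallel_llm_calls(queries: list, max_concurrent=5):
    """Call the LLM API concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_call(session, query):
        async with semaphore:
            # Simulated API call; params= lets aiohttp URL-encode the query
            return await fetch_with_timeout(session, API_URL, params={"q": query})

    # One shared session reuses connections across all requests
    async with aiohttp.ClientSession() as session:
        # Run all requests concurrently; exceptions are returned, not raised
        results = await asyncio.gather(
            *[bounded_call(session, q) for q in queries],
            return_exceptions=True
        )

        return results

# Usage example
queries = ["What is AI?", "Explain ML", "Tell me about Python"]
results = asyncio.run(parallel_llm_calls(queries))
```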

Hands-On: Diagnosing a Slow Request

**Problem**: an API's P99 latency has reached 5 seconds.

**Diagnosis steps**:

1. **Inspect the trace data**
   - Database queries: 2.5 s
   - LLM call: 2 s
   - Other processing: 0.5 s

2. **Dig deeper**
   - Database: an N+1 query problem is found
   - LLM: the model choice is a poor fit; an oversized model is being used

3. **Optimizations**
   - Database: add an index and fix the N+1 query → down to 0.2 s
   - LLM: switch to a mid-sized model → down to 1 s

4. **Result**
   - P99 latency: 5 s → 1.7 s (0.2 s + 1 s + 0.5 s)
   - Cost reduced by 40%
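The arithmetic behind this case study can be checked with a few lines of code. The sketch assumes the three stages run one after another within a single slow trace, so their durations add up; in general, per-stage percentiles do not simply sum to the end-to-end percentile.

```python
# Stage timings in seconds, copied from the case study above
before = {"database": 2.5, "llm": 2.0, "other": 0.5}
after = {"database": 0.2, "llm": 1.0, "other": 0.5}


def summarize(stages: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return the total latency and each stage's share of it."""
    total = sum(stages.values())
    shares = {name: seconds / total for name, seconds in stages.items()}
    return total, shares


total_before, shares_before = summarize(before)
total_after, shares_after = summarize(after)
reduction = 1 - total_after / total_before

print(f"before: {total_before:.1f}s, after: {total_after:.1f}s, reduction: {reduction:.0%}")
for name in before:
    print(f"{name}: {shares_before[name]:.0%} -> {shares_after[name]:.0%}")
```

Before the fix, the database accounts for half of the latency. Afterwards the LLM call is the largest share, which makes it the next place to look.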

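Optimization Principles

Measure before you optimize: do not assume where the bottleneck is. Base optimization on data, not on intuition. Focus on P95/P99 tail latency rather than the average.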