Use Case Study: Prometheus Important Metrics On Games Service

Basic Monitoring

CCU

CCU (concurrent users) is the number of users currently playing our games. It is a key service status metric and also a key business metric in Games.

Service Protocol (HTTP)

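CCU is calculated from the QPS of the regular heartbeat request.

For example, if the client sends a heartbeat request every 10s, the service's CCU is 10*QPS. In Prometheus the request metric is reported as a counter, so the request rate is normally derived from the counter's increase over a time window. The query below takes the increase over 1 minute and divides it by 6, the number of heartbeats each client sends per minute.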

sum(increase(server_handled_total{service="${serviceName}", method="${heartbeatAPIName}",env=~"${env}"}[1m])) by (cid) / 6
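The arithmetic behind the division by 6 can be sketched in a few lines of Python. This is only an illustration: the function name and the sample numbers are made up, and it assumes every connected client sends exactly one heartbeat per interval.

```python
def ccu_from_heartbeats(heartbeat_increase: float,
                        window_seconds: float,
                        heartbeat_interval_seconds: float) -> float:
    """Estimate concurrent users from the heartbeat count seen in one window.

    Assumption: each client sends one heartbeat per interval, so one user
    contributes window / interval heartbeats to the counter increase.
    """
    if window_seconds <= 0 or heartbeat_interval_seconds <= 0:
        raise ValueError("window and heartbeat interval must be positive")
    heartbeats_per_user = window_seconds / heartbeat_interval_seconds
    return heartbeat_increase / heartbeats_per_user


# Hypothetical numbers: 3000 heartbeats in 60 s with a 10 s interval.
# Each user sends 6 heartbeats per minute, so this is 3000 / 6 users.
print(ccu_from_heartbeats(3000, 60, 10))
```

With a 1 minute window and a 10s heartbeat this is the same value as 10*QPS, because QPS is the increase divided by 60.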

Service Protocol (WebSocket)

Count the established WebSocket connections directly.

sum(websocket_connected_client_count{service="websocket-gateway",upstream=~"${serviceName}",env=~"${env}"}) by (cid)

HTTP Status Code: 5xx

Availability is the core metric of a service, and the number of HTTP 5xx responses is a close indicator of whether the web service is available.

For services integrated with the nginx api-gateway, the SGW metrics cannot be broken down by location, so we use the HTTP metrics reported by the web service or by the api-gateway instead of the SGW metrics.

So, for api-gateway upstreams, the PromQL looks like:

sum(increase(apigateway_handled_total{service="games-apigateway",upstream="${serviceName}",env="$env",status_code=~"5.*"}[1m])) by (cid)

Other services can use the nginx metrics from their load balancer directly:

sum(increase(nginx_v2_requests_total{host=~"your-domain",status_code=~'5xx',endpoint=~"${location}"}[1m])) by (host,endpoint)

HTTP Status Code: 4xx

Although HTTP 4xx status codes fall outside the scope of service-side errors, we still need to monitor them, because an abnormal fluctuation can reveal serious bugs, such as 401 for a broken login state and 403 for a misconfigured anti-fraud strategy.

sum(increase(nginx_v2_requests_total{host=~"your-domain",status_code=~'4xx',endpoint=~"${location}"}[1m])) by (host,endpoint)

Play Time Success Rate

Play time success rate is the key metric in Games. It means the core function APIs must maintain a high success rate.

100 - sum(job:nginx_requests_total_v2:increase_by_endpoint{host=~"your-domain.*",status_code=~"5xx", endpoint=~"${location}"}) by (host,endpoint) / sum(job:nginx_requests_total_v2:increase_by_endpoint{host=~"your-domain.*", endpoint=~"${location}"}) by (host,endpoint) * 100
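The same formula, written out as a small Python sketch for one host and endpoint. The function names and sample counts are hypothetical. The 99.0 default is taken from the "> 99%" standard in the summary, and returning None when there is no traffic is an assumption about how an empty window should be treated.

```python
from typing import Optional


def success_rate_percent(total_requests: float,
                         failed_5xx: float) -> Optional[float]:
    """Return 100 - (5xx / total) * 100, or None when there was no traffic."""
    if total_requests <= 0:
        # No requests in the window: the ratio is undefined.
        return None
    return 100.0 - failed_5xx / total_requests * 100.0


def meets_target(rate: Optional[float], target_percent: float = 99.0) -> bool:
    """True when the success rate is known and strictly above the target."""
    return rate is not None and rate > target_percent


# Hypothetical counts for one endpoint over the evaluation window.
rate = success_rate_percent(total_requests=20000, failed_5xx=50)
print(rate, meets_target(rate))
```

Treating an empty window as "unknown" rather than as 0% or 100% keeps a quiet endpoint from either firing or masking an alert.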

API Latency

We should not only look at the average latency but also keep an eye on the P90 and P95.

quantile(0.50, (job:nginx_request_latency_seconds_v2:avg_by_endpoint{host=~"your-domain.*",endpoint=~"${location}"}))
quantile(0.90, (job:nginx_request_latency_seconds_v2:avg_by_endpoint{host=~"your-domain.*",endpoint=~"${location}"}))
quantile(0.95, (job:nginx_request_latency_seconds_v2:avg_by_endpoint{host=~"your-domain.*",endpoint=~"${location}"}))
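Note that quantile here aggregates across the series returned by the selector, which (judging by the recording rule's name) are per-endpoint averages, so the result is a percentile over endpoints rather than over individual requests. The Python sketch below shows what such a quantile computes, assuming linear interpolation between ranks. The latency values are invented for illustration.

```python
import math
from typing import Sequence


def quantile(q: float, values: Sequence[float]) -> float:
    """Quantile of a set of values using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)          # fractional position in the sorted list
    lower = int(math.floor(rank))
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower                  # how far we are towards the upper value
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


# Hypothetical average latencies (seconds) of five endpoints.
avg_latency_by_endpoint = [0.010, 0.012, 0.015, 0.020, 0.080]
for q in (0.50, 0.90, 0.95):
    print(q, round(quantile(q, avg_latency_by_endpoint), 4))
```

In this made-up data a single slow endpoint barely moves the median but dominates the P90 and P95, which is why the higher percentiles are worth tracking next to the average.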

Business Error

A business error code is not a system error. It is usually caused by an unusual user operation that hits an edge case of the business logic. The client should still handle these edge cases in advance to improve the user experience. If one kind of business error occurs too often, we should check whether there is a bug or whether the flow can be optimized, so it is helpful to monitor business errors.
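The summary recommends alerting on the wave ratio of business errors instead of a fixed value. The source does not define that ratio, so the sketch below assumes it means the relative change of the current error count against a baseline window (for example the same window one day earlier). The function names are hypothetical, and the 0.8 threshold is only a placeholder borrowed from the "> 80%" figure in the summary.

```python
def wave_ratio(current: float, baseline: float) -> float:
    """Relative change of the current count against the baseline count."""
    if baseline <= 0:
        # No baseline errors: any new error is an unbounded increase.
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline


def should_alert(current: float, baseline: float,
                 threshold: float = 0.8) -> bool:
    """Alert when errors grew by more than `threshold` relative to the baseline."""
    return wave_ratio(current, baseline) > threshold


# Hypothetical counts of one business error code in two comparable windows.
print(should_alert(current=250, baseline=100))  # 150% growth
print(should_alert(current=110, baseline=100))  # 10% growth
```

A relative threshold scales with traffic, so the same rule works for a busy game and a quiet one, whereas a fixed count would need tuning per service.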

Panic Error

A panic always means there is a bug. Business code should always try to recover from a panic so that the instance does not exit, but a recovered panic still impacts the business, and some panics cannot be recovered. So we should fix panic issues as soon as possible.


(sum(increase(studio_fclp_panic_total{service=~"${serviceName}"}[5m])) by (cid) or vector(0)) +
(sum(increase(server_recovery_total{service=~"${serviceName}"}[5m])) by (cid) or vector(0)) +
(sum(increase(panic_total{service=~"${serviceName}"}[5m])) by (cid) or vector(0))

Summary

| Indicator | Brief | Standard | Need Rules | Need To Be In Report | Remark |
| --- | --- | --- | --- | --- | --- |
| HTTP Status Code: 5xx | An increase in 5xx responses means something is wrong with the service availability | < 10 per min | Yes | Yes | For api-gateway upstream services, use the api-gateway layer metric. For other web services, use the SGW layer metrics. |
| Business Error | Business errors are expected, but pay attention to a high business error rate, since it may mean there are serious bugs | > 80% | Yes | No | Alerts: use the wave ratio instead of a fixed value |
| CCU | For WebSocket services, use the established WebSocket connections. For HTTP services, use the regular heartbeat API QPS to calculate CCU | - | Yes | Yes | For retention games, use a fixed value or the change rate of CCU as alert rules. For campaign games, other metrics can be used instead of CCU to monitor service status |
| API Latency | Games are a latency-sensitive business; low API latency means a better user experience | P50 < 20ms | Yes | Yes | Same as above |
| Panic Error | A panic always means a bug that needs to be fixed | 0 | Yes | No | - |
| Play Time Success Rate | The core business function should stay highly available | > 99% | Yes | Yes | Same as above |
| HTTP Status Code: 4xx | Pay attention to the 401 and 403 HTTP status codes to catch serious bugs | < 50% per api | Yes | No | Same as above |