Case Study: Important Prometheus Metrics for the Games Service

Basic Monitoring

CCU

CCU (concurrent users) is the number of users currently playing our games. It is both a key service-status metric and a key business metric in Games.

Service Protocol (HTTP)

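For HTTP services, CCU is calculated from the QPS of the regular heartbeat request.

For example, if the client sends a heartbeat request every 10s, the service's CCU is 10*QPS. The request metric is reported as a counter in Prometheus, so QPS is normally derived from the counter's increase over a time window. The query below takes the one-minute increase of the heartbeat counter (60*QPS) and divides it by 6, which gives 10*QPS.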

sum(increase(server_handled_total{service="${serviceName}", method="${heartbeatAPIName}",env=~"${env}"}[1m])) by (cid) / 6
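
To make the divisor in the query concrete, here is a small worked sketch in Python. The function name and the sample numbers are illustrative assumptions; only the 10-second heartbeat interval and the one-minute window come from the text above.

```python
def ccu_from_heartbeats(increase_per_window: float,
                        window_seconds: float = 60.0,
                        heartbeat_interval_seconds: float = 10.0) -> float:
    """Estimate CCU from the heartbeat counter's increase over one window.

    Each connected client sends window / interval heartbeats per window,
    so dividing the increase by that number gives the client count.
    """
    if window_seconds <= 0 or heartbeat_interval_seconds <= 0:
        raise ValueError("window and interval must be positive")
    heartbeats_per_client = window_seconds / heartbeat_interval_seconds
    return increase_per_window / heartbeats_per_client


# Hypothetical sample: 12,000 heartbeats counted in the last minute.
increase_1m = 12_000
qps = increase_1m / 60                  # 200 requests per second
ccu = ccu_from_heartbeats(increase_1m)  # 12,000 / 6 = 2,000 users
print(ccu, 10 * qps)                    # both expressions give the same estimate
```

If the heartbeat interval changes, the divisor has to change with it: a 30-second heartbeat would mean dividing the one-minute increase by 2 instead of 6.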

Service Protocol (WebSocket)

Count the established WebSocket connections directly.

sum(websocket_connected_client_count{service="websocket-gateway",upstream=~"${serviceName}",env=~"${env}"}) by (cid)

HTTP Status Code: 5xx

Availability is the core metric of a service, and the number of HTTP 5xx responses is the most direct signal of whether a web service is available.

For services that are integrated with the nginx api-gateway, the SGW metrics cannot be broken down by location, so we use the HTTP metrics reported by the web service or by the api-gateway instead of the SGW metrics.

For api-gateway upstreams, the PromQL looks like:

sum(increase(apigateway_handled_total{service="games-apigateway",upstream="${serviceName}",env="$env",status_code=~"5.*"}[1m])) by (cid)

Other services can use the nginx metrics from their load balancer directly:

sum(increase(nginx_v2_requests_total{host=~"your-domain",status_code=~'5xx',endpoint=~"${location}"}[1m])) by (host,endpoint)

HTTP Status Code: 4xx

Although HTTP 4xx responses fall outside the scope of the service code, we still need to monitor them, because abnormal fluctuations can help us find serious bugs, such as 401 for an abnormal login status and 403 for a wrong anti-fraud strategy.

sum(increase(nginx_v2_requests_total{host=~"your-domain",status_code=~'4xx',endpoint=~"${location}"}[1m])) by (host,endpoint)

Play Time Success Rate

Play time success rate is the key metric in Games. It means the APIs behind the core functions must keep a high success rate.

100 - sum(job:nginx_requests_total_v2:increase_by_endpoint{host=~"your-domain.*",status_code=~"5xx",endpoint=~"${location}"}) by (host,endpoint) / sum(job:nginx_requests_total_v2:increase_by_endpoint{host=~"your-domain.*",endpoint=~"${location}"}) by (host,endpoint) * 100
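
The same arithmetic can be sketched in Python for a single host and endpoint. The function name and the request counts are hypothetical; the 99% target is the standard listed in the summary table.

```python
from typing import Optional


def success_rate_percent(requests_5xx: float,
                         requests_total: float) -> Optional[float]:
    """Return 100 - (5xx / total) * 100, or None when there was no traffic."""
    if requests_total <= 0:
        return None  # no requests in the window, so the rate is undefined
    return 100.0 - requests_5xx / requests_total * 100.0


# Hypothetical window: 1,000 requests, 5 of them answered with a 5xx.
rate = success_rate_percent(5, 1000)
meets_target = rate is not None and rate > 99.0
print(round(rate, 2), meets_target)
```

Returning None for an empty window keeps "no traffic" distinct from "100% success", which matters for endpoints that are only called occasionally.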

API Latency

We should watch not only the average latency but also the P90 and P95.

quantile(0.50, (job:nginx_request_latency_seconds_v2:avg_by_endpoint{host=~"your-domain.*",endpoint=~"${location}"}))
quantile(0.90, (job:nginx_request_latency_seconds_v2:avg_by_endpoint{host=~"your-domain.*",endpoint=~"${location}"}))
quantile(0.95, (job:nginx_request_latency_seconds_v2:avg_by_endpoint{host=~"your-domain.*",endpoint=~"${location}"}))
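
PromQL's quantile aggregation works across the series matched by the selector, so these queries return percentiles of the per-endpoint average latencies (as the recording rule name suggests), not of individual requests. The Python sketch below reproduces that calculation with linear interpolation between sorted values, which is my understanding of how Prometheus evaluates the operator. The function name and the latency samples are hypothetical.

```python
import math
from typing import Sequence


def quantile_across_series(q: float, values: Sequence[float]) -> float:
    """Quantile of a set of series values, interpolating linearly by rank."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not values:
        return math.nan
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical average latency (seconds) of five endpoints; one is slow.
avg_latency_by_endpoint = [0.012, 0.015, 0.018, 0.020, 0.250]

mean = sum(avg_latency_by_endpoint) / len(avg_latency_by_endpoint)
p50 = quantile_across_series(0.50, avg_latency_by_endpoint)
p90 = quantile_across_series(0.90, avg_latency_by_endpoint)
print(f"mean={mean:.3f}s p50={p50:.3f}s p90={p90:.3f}s")
```

In this sample the single slow endpoint pulls the mean well above the median while the P90 exposes it directly, which is why the average alone is not enough.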

Business Error

A business error code is not a system error. It is usually caused by an abnormal user operation that hits an edge case of the business logic. The client should handle these edge cases in advance to improve the user experience. If the volume of a particular business error is too high, we should check whether there is a bug or whether the flow can be optimized, so monitoring business errors is helpful.
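
The summary table suggests alerting on the wave ratio of business errors rather than on a fixed value. The sketch below shows one possible reading of that idea: it compares the count of each error code in the current window with the previous window. The function name, the error codes, the 2x ratio, and the minimum count are all hypothetical choices for illustration.

```python
from typing import Dict, List


def spiking_error_codes(previous: Dict[str, int],
                        current: Dict[str, int],
                        max_ratio: float = 2.0,
                        min_count: int = 20) -> List[str]:
    """Return error codes whose count grew by more than max_ratio."""
    spiking = []
    for code, count in sorted(current.items()):
        if count < min_count:
            continue  # too few errors to be meaningful
        baseline = previous.get(code, 0)
        # A code with no baseline is new in this window; report it as well.
        if baseline == 0 or count / baseline > max_ratio:
            spiking.append(code)
    return spiking


# Hypothetical counts for two consecutive windows.
previous_window = {"INSUFFICIENT_BALANCE": 100, "ROOM_FULL": 40}
current_window = {"INSUFFICIENT_BALANCE": 110, "ROOM_FULL": 200, "INVALID_ITEM": 5}
print(spiking_error_codes(previous_window, current_window))
```

A ratio-based check adapts to each error code's normal volume, so a frequent but harmless code does not need its own hand-tuned threshold.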

Panic Error

A panic always means there is a bug. Services should always try to recover from panics so that the instance does not exit, but a recovered panic still affects the business, and some panics cannot be recovered. Panics should therefore be fixed as soon as possible.

(sum(increase(studio_fclp_panic_total{service=~"${serviceName}"}[5m])) by (cid) or vector(0)) +
(sum(increase(server_recovery_total{service=~"${serviceName}"}[5m])) by (cid) or vector(0)) +
(sum(increase(panic_total{service=~"${serviceName}"}[5m])) by (cid) or vector(0))
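
The fallback to 0 on each term matters because a PromQL binary + only returns results for label sets present on both sides, so a counter that has not been reported would otherwise make the whole sum disappear. The Python sketch below shows the intended result, not how PromQL evaluates the expression: per-cid totals where a missing counter counts as zero. The function name and the cid values are hypothetical.

```python
from typing import Dict


def total_panics_by_cid(*sources: Dict[str, float]) -> Dict[str, float]:
    """Add up panic counts per cid, treating a missing entry as zero."""
    totals: Dict[str, float] = {}
    for source in sources:
        for cid, count in source.items():
            totals[cid] = totals.get(cid, 0.0) + count
    return totals


# Hypothetical 5-minute increases of the three panic counters.
studio_panics = {"cid-a": 2.0}
recovered_panics = {"cid-a": 1.0, "cid-b": 3.0}
plain_panics: Dict[str, float] = {}  # this counter reported nothing

print(total_panics_by_cid(studio_panics, recovered_panics, plain_panics))
```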

Summary

Indicator	Brief	Standard	Needs Alert Rules	Include In Report	Remark
HTTP Status Code: 5xx	A rising 5xx count means something is wrong with service availability	< 10 per min	Yes	Yes	For api-gateway upstream services, use the api-gateway layer metrics. For other web services, use the SGW layer metrics.
Business Error	Business errors are expected, but a high business error rate may indicate serious bugs	> 80%	Yes	No	Alerts: use the wave ratio instead of a fixed value
CCU	For WebSocket services, use the established WebSocket connections. For HTTP services, use the QPS of the regular heartbeat API to calculate CCU	-	Yes	Yes	For retention games, use a fixed value or the change rate of CCU as alert rules. For campaign games, other metrics can be used instead of CCU to monitor service status
API Latency	Games are a latency-sensitive business, and low API latency means a better user experience	P50 < 20ms	Yes	Yes	Same as above
Panic Error	A panic always means a bug that needs to be fixed	0	Yes	No	-
Play Time Success Rate	The core business functions must stay highly available	> 99%	Yes	Yes	Same as above
HTTP Status Code: 4xx	Pay attention to 401 and 403 status codes to catch serious bugs	< 50% per api	Yes	No	Same as above
