SignalPrint, Rules APIs briefly unavailable
Nov 07 at 12:13pm CST
At 10:19:06 am CST, a node in Verosint's API backend cluster failed, causing a brief outage of an internal middleware service used by the SignalPrint and Rules APIs. During the incident, requests to https://api.verosint.com/v1/rules endpoints would have returned an HTTP 401 error indicating an invalid API token. The issue was detected and remediated automatically, with all services fully recovered by 10:22:16 am CST.
Requests sent to either API during the incident failed. Log streams sent via Verosint's Auth0 integration also failed initially but succeeded on subsequent automatic retries once service was restored, with no data loss.
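For API clients that want similar resilience to the automatic retries described above, the following is a minimal sketch of retry with exponential backoff. It is not Verosint's or Auth0's implementation. The function name `call_with_retry`, the `send` callable, and the choice of retryable status codes are all assumptions made for illustration. A 401 would not normally be retried; it is included here only because this incident returned 401 for valid tokens.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=0.5,
                    retry_statuses=(401, 502, 503, 504), sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical zero-argument callable that performs one request
    and returns the HTTP status code. Returns the last status code seen.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status not in retry_statuses:
            return status
        # Back off before the next attempt: 0.5s, 1s, 2s, ... (no sleep after the last one).
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status
```

With the defaults above, the total wait is at most 7.5 seconds, which would not have outlasted this incident. Clients that must ride through an outage of a few minutes would need more attempts, a larger base delay, or a queue that replays failed requests later.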
Response and Root Cause
Normally, the loss of a node should be a non-event, because Verosint's engineering standards require all services to be highly available across multiple availability zones. In this case, however, the middleware service was not properly configured: when the cluster node failed, it took the lone instance of the middleware service with it. Fortunately, the cluster detected and remediated the failure 2 minutes after it occurred, with full recovery validated within 1 minute of redeployment.
Verosint engineering has corrected the misconfiguration and is taking additional steps to audit and test every platform service for improper configurations.
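As an illustration of the kind of audit described above, the sketch below flags services that run fewer than two instances or occupy fewer than two availability zones. It is a hypothetical example, not Verosint's actual tooling. The service names, the dictionary fields (`name`, `replicas`, `zones`), and the thresholds are all assumed for the purpose of the example.

```python
def find_single_points_of_failure(services, min_replicas=2, min_zones=2):
    """Return the names of services that do not meet the availability policy.

    Each service is a dict with hypothetical fields:
      name (str), replicas (int), zones (list of zone identifiers).
    """
    findings = []
    for svc in services:
        distinct_zones = set(svc["zones"])
        if svc["replicas"] < min_replicas or len(distinct_zones) < min_zones:
            findings.append(svc["name"])
    return findings


# Hypothetical inventory; names and values are illustrative only.
services = [
    {"name": "rules-api", "replicas": 3, "zones": ["a", "b", "c"]},
    {"name": "middleware", "replicas": 1, "zones": ["a"]},
    {"name": "signalprint-api", "replicas": 2, "zones": ["a", "a"]},
]

print(find_single_points_of_failure(services))
# prints ['middleware', 'signalprint-api']
```

The third entry shows why the check counts distinct zones rather than replicas alone: two instances in the same zone still go down together if that zone is lost.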