Slow autonomous queue message processing
Monitor - https://app.axiom.co/legion-security-ry4e/monitors/edit/o2pYw0TdyOsZMnZWKV
Overview
All investigation sessions waiting for a worker to be assigned, both newly triggered investigations from a poller and investigations continued after a user decision, have messages written to the autonomous triggers SQS queue. Each message in the queue correlates to a specific investigation id of a specific customer that needs to be assigned a worker.
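For orientation only, a queue message can be pictured roughly as below; the actual payload shape is not documented here, and the field names are assumptions based on the span attributes (`message_id`, `org_id`, `investigation_id`) used by the queries later in this runbook.

```python
# Illustrative only - the real message body may differ; field names are assumed
# from the span attributes referenced by the queries in this runbook.
example_message = {
    "message_id": "sqs-message-id",          # shows up as attributes.message_id in traces
    "org_id": "customer-org-id",             # which customer the investigation belongs to
    "investigation_id": "investigation-id",  # the investigation session waiting for a worker
}
```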
The monitor detects messages that were read from the queue more than X minutes after they were written to it. This can be caused by one of two issues:
- (more likely) The message was read from the queue multiple times: we failed to process it successfully (e.g. failed to start a worker, failed to update the DB) and returned it to the queue repeatedly, until one of the retry attempts landed later than the monitor threshold
- (possible but less likely) A high volume of incoming messages to the queue delays the first time we reach a given message in the queue
Finding the problem
- Run the monitor's query to find the message ids that caused the monitor to trigger an alarm. Note down the message ids where the `max_attributes.time_in_queue_seconds` column contains a value higher than the monitor threshold, as indicated in the raised alarm or the monitor configuration.

  ```
  traces
  | where name == "process_message"
  | where ['resource.deployment.environment'] == "production"
  | summarize max(['attributes.time_in_queue_seconds']) by ['attributes.org_id'], ['attributes.message_id']
  ```
- For each message id found by the above query, we want to check how many times it was read from the queue to determine which of the two possible issues (listed in the overview section) caused the delay. Run the following query to check how many times the message was read:

  ```
  traces
  | where name == "process_message"
  | where ['resource.deployment.environment'] == "production"
  | where ['attributes.message_id'] == "message-id"
  | project _time, ['status.code'], ['status.message'], duration, trace_id
  ```

  - If there is only a single result in the relevant time range, indicating successful processing, the issue is a delay in reading from the queue. Continue to Issue #2 below.
  - Otherwise, there are multiple results, indicating the message was read multiple times and the attempts ended in failure. Continue to Issue #1 below.
Issue #1 - Failure to process message
In the queries above we listed all the attempts to read a specific message id and determined that the same message was read multiple times, with the attempts ending in failure.
The first thing to check is whether the last attempt to process the message was successful. If it was, the issue has already been mitigated (a worker was started for the investigation); we still need to find the root cause, but it is not as urgent.
- Copy one of the trace_ids from the query that listed message processing attempts and run the following query to get the investigation id the message correlates to.

  ```
  traces
  | where name == "start_worker_investigation"
  | where ['resource.deployment.environment'] == "production"
  | where trace_id == "trace-id"
  | project ['attributes.investigation_id']
  ```
- With the investigation id we can review all the attempts to start a worker for this investigation and see why they failed. In most cases it's worth focusing on the first failed attempt to process the message: if it didn't clean up properly after itself, leaving the DB indicating that a worker exists for the investigation when it actually doesn't, that could be the root problem.

  ```
  traces
  | where name == "start_worker_investigation"
  | where ['resource.deployment.environment'] == "production"
  | where ['attributes.investigation_id'] == "investigation-id"
  ```

  ```
  logs
  | where ['resource.deployment.environment'] == "production"
  | where body contains "investigation-id"
  | project _time, trace_id, body
  ```
If, after investigating the telemetry, the root issue appears to be that the first attempt to process the message did not remove the DB entry for the worker, causing subsequent attempts to fail on a MongoDB unique index constraint, a possible quick mitigation is to manually remove the corresponding document from the `investigation_workers` collection in the correct DB, after verifying in AWS that the worker is indeed not actively running an investigation.
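As an illustration only, the manual cleanup could look roughly like the sketch below. It assumes the stale document can be looked up by an `investigation_id` field; the connection string, database name, and field names are placeholders, so verify the actual schema in the relevant cluster before deleting anything.

```python
# Hypothetical cleanup sketch - connection string, database name and field names
# are assumptions, not the confirmed schema.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<cluster-uri>")  # placeholder connection string
db = client["<db-name>"]                             # the correct DB for the customer/region

investigation_id = "investigation-id"                # id found in the queries above

# Inspect the stale worker document first.
print(db["investigation_workers"].find_one({"investigation_id": investigation_id}))

# Only after verifying in AWS that no worker is actually running for this
# investigation, remove the stale document so a retry can start a new worker.
db["investigation_workers"].delete_one({"investigation_id": investigation_id})
```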
Issue #2 - High volume of messages in queue
Check the Axiom dashboard that tracks the autonomous triggers queue to verify whether there is indeed an anomaly in the number of messages being written to the queue. If that is the case, we have a few ways to mitigate the problem:
- If the chart shows the issue is not a transient volume spike but is still ongoing, we can decide to temporarily disable the customer poller causing the volume issue until the problem is resolved on the customer's side. Note that this should only be done after careful consideration and must be communicated to the customer, as it will cause triggers to either not be run by Legion or be run with very high delays.

  Run the following query to find which poller is creating a high number of autonomous triggers:

  ```
  traces
  | where name == "run_job"
  | where ['resource.deployment.environment'] == "production"
  | summarize sum(['attributes.results_count']) by ['resource.cloud.region'], ['attributes.org_id'], ['attributes.workflow_id'], ['attributes.job_id']
  ```

  To disable the poller, go to the MongoDB cluster for the relevant region, find the job in the `polling_jobs` collection matching the job_id found in the above query, and change its `enabled` field to `false`. When re-enabling the job later on, carefully choose whether we want to continue running from the point in time the poller was stopped (just change `enabled` back to `true`) or from the current time (change `enabled` to `true` and also update `last_handled` to the current time; this can be done more easily from the workflow graph page in the webapp). A hedged sketch of these updates appears after this list.
- We can increase the number of consumers from the queue to allow us to read more messages in parallel. This can be done by updating the default value of the `AUTONOMOUS_TRIGGERS_QUEUE_CONSUMERS_COUNT` environment variable in the backend repo, or by adding it to the infra repo with override values for specific regions and deploying the infra (note that after deploying the infra, the backend service needs to be restarted for the change to apply). Note that increasing the number of consumers may not always increase the speed of reading messages from the queue, since SQS by design does not allow parallel processing of messages with the same message group ID (see the consumer sketch after this list).
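As an illustration of the poller toggle described above, the updates could look roughly like this. The field names match the description above, but the connection string, database name, and the exact key used to match the job (`job_id` here) are assumptions to verify against the real `polling_jobs` documents.

```python
# Hypothetical sketch for disabling/re-enabling a poller - connection string,
# database name and the job lookup key are assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<cluster-uri>")  # cluster of the relevant region
db = client["<db-name>"]

job_id = "job-id"  # job_id found in the run_job query above

# Disable the noisy poller until the customer-side problem is resolved.
db["polling_jobs"].update_one({"job_id": job_id}, {"$set": {"enabled": False}})

# Later, re-enable and continue from the point in time the poller was stopped:
db["polling_jobs"].update_one({"job_id": job_id}, {"$set": {"enabled": True}})

# ...or re-enable and skip the backlog by also moving last_handled to now:
db["polling_jobs"].update_one(
    {"job_id": job_id},
    {"$set": {"enabled": True, "last_handled": datetime.now(timezone.utc)}},
)
```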
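To make the message-group caveat concrete, here is a minimal, illustrative consumer loop (not the actual backend implementation) that reads the consumer count from `AUTONOMOUS_TRIGGERS_QUEUE_CONSUMERS_COUNT`. Because SQS FIFO queues never deliver two in-flight messages from the same message group at once, adding consumers only helps when the backlog spans many group IDs.

```python
# Illustrative consumer sketch - queue URL and processing logic are placeholders,
# not the real backend code.
import os
import threading

import boto3

QUEUE_URL = "https://sqs.<region>.amazonaws.com/<account>/<autonomous-triggers-queue>.fifo"
CONSUMERS = int(os.environ.get("AUTONOMOUS_TRIGGERS_QUEUE_CONSUMERS_COUNT", "1"))


def consume() -> None:
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            # A real consumer would process the message here (start worker, update DB).
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


# N consumers read in parallel, but SQS will not hand out a second message from a
# message group while one from that group is still in flight, so per-group
# throughput is unchanged.
threads = [threading.Thread(target=consume, daemon=True) for _ in range(CONSUMERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```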