Deduplication, cooldowns, and expiry in operational alerting
An alerting system without state is a scheduled spam machine. It needs durable identity, cooldowns, expiry, reminders, suppression, and reopening rules to stay useful.
Stateless alerts become spam
A scheduled query does not become an alerting system because it posts rows every morning.
Without state, it has no memory. It can’t tell whether a situation is new, already sent, already handled, still active, expired, suppressed, waiting for a reminder, or worth reopening because the facts changed. It only sees that a condition is true right now. Then it sends another message, because that is what machines do when no one gives them memory.
The rule may be logically correct, but the recipient experiences it as repeated interruption. The same account appears again. The same order appears again. The same opportunity appears again because one timestamp changed, a source model rebuilt, or the detection query ran on a new schedule. After a few cycles, the question stops being “Does this matter?” and becomes “How do I make this stop?” The system loses trust.
Once an alert assigns work, the system has to remember the work it already created. Otherwise detection keeps rediscovering the same condition and delivery keeps pretending the rediscovery is useful.
Even a good signal can be ruined by bad repetition.
Dedupe keys
The dedupe key defines what counts as the same alert.
It is one of the highest-risk design choices in the system. It decides whether work is repeated, suppressed, continued, or reopened. Get it wrong and the alerting system either floods the channel or hides real work behind a key that was too broad.
A useful dedupe key is based on business identity, not incidental row shape. It may include alert type, customer, account, opportunity, order, product, location, owner group, variant, or event window. It should include the fields that make the situation distinct and exclude fields that only describe how the situation was detected or displayed.
Run timestamp is usually not identity. Formatted message text is not identity. A localized label is not identity. A supporting metric might not be identity either if it only explains the case rather than defining it.
If the key is too narrow, small payload changes create new alerts even though the underlying case is already open. If the key is too broad, different cases collapse into one and valid work gets suppressed.
The key also needs a stable home. Detection should emit it explicitly. Candidate queues should carry it. Alert history should store it. Reminder, repeat, suppression, and reopening logic should use it. The payload should include enough of it for downstream systems to trace the alert back to the decision that created it.
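As a minimal sketch of that idea, the Python below builds a key from identity fields only and ignores everything else on the row. The field names (alert_type, variant, account_id, event_window) and the normalization rules are assumptions for illustration, not a prescribed schema; each rule would choose its own identity fields.

```python
import hashlib

# Hypothetical identity fields. Pick the ones that define "the same case" for your rule.
IDENTITY_FIELDS = ("alert_type", "variant", "account_id", "event_window")


def dedupe_key(candidate: dict) -> str:
    """Build a stable key from business identity fields only.

    Run timestamps, message text, and labels are ignored on purpose,
    so a rebuilt source model or a new schedule does not create a new case.
    """
    parts = []
    for field in IDENTITY_FIELDS:
        value = candidate.get(field)
        if value is None:
            # A candidate without identity cannot be deduplicated safely.
            raise ValueError(f"missing identity field: {field}")
        # Normalize so that casing or stray whitespace does not change identity.
        parts.append(f"{field}={str(value).strip().lower()}")
    canonical = "|".join(parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first_run = {
    "alert_type": "inactive_account",
    "variant": "default",
    "account_id": "A-1001",
    "event_window": "2024-W18",
    "run_timestamp": "2024-05-01T06:00:00Z",   # not identity
    "message": "Account A-1001 is inactive",   # not identity
}
# Same case detected a day later, with a different timestamp and localized text.
second_run = {**first_run, "run_timestamp": "2024-05-02T06:00:00Z",
              "message": "Konto A-1001 ist inaktiv"}
# A genuinely different case.
other_account = {**first_run, "account_id": "A-2002"}

print(dedupe_key(first_run) == dedupe_key(second_run))     # same case
print(dedupe_key(first_run) == dedupe_key(other_account))  # different case
```

Hashing the canonical string is a convenience, not a requirement. Storing the readable form next to the hash makes it easier to explain why two rows were treated as the same alert.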
Cooldown windows
A cooldown window controls when the same situation is allowed to surface again. Dedupe says, “we have already seen this case.” Cooldown says, “we may send it again, but not yet.”
Cooldowns matter when the underlying condition can remain true for a while, even while relevant work is being done. An account can keep drifting. A customer can stay inactive. An open issue can remain unresolved. A product condition can stay risky. If every run treats that continuing state as a fresh signal, the alert punishes the recipient for not resolving the case instantly.
The cooldown should match the response window. If the expected action takes three business days, repeating the alert every hour is noise. If the expected action is urgent, a shorter window may be justified. Some rules need default cooldowns with variant-level overrides. Some need limits by route or owner group because one queue can absorb only so much open work before the system starts manufacturing neglect.
Cooldown logic should read alert history, not only candidate history. A candidate that failed before delivery should not necessarily start the window. A successfully delivered alert usually should. A reminder may extend the window because it continues the same workflow rather than creating a new case.
These semantics need to be explicit. Otherwise retries, reminders, and follow-ups start behaving like separate systems.
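One way to make those semantics explicit is a cooldown check that reads alert history and counts only delivered rows. The sketch below assumes a hypothetical history row shape (dedupe_key, kind, status, sent_at), a three-day default window, and a shorter override for an "urgent" variant. It uses calendar time rather than business days to stay short.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical defaults; real values should follow the expected response window.
DEFAULT_COOLDOWN = timedelta(days=3)
VARIANT_COOLDOWNS = {"urgent": timedelta(hours=4)}


def last_delivered_at(history: list, key: str) -> Optional[datetime]:
    """Latest successful send for this key. Delivered reminders count too,
    because they continue the same workflow. Failed attempts never count."""
    times = [
        row["sent_at"]
        for row in history
        if row["dedupe_key"] == key
        and row["status"] == "delivered"
        and row["kind"] in ("alert", "reminder")
    ]
    return max(times) if times else None


def in_cooldown(history: list, key: str, variant: str, now: datetime) -> bool:
    """True when the same situation was delivered too recently to resurface."""
    last = last_delivered_at(history, key)
    if last is None:
        return False  # nothing was delivered, so no window has started
    window = VARIANT_COOLDOWNS.get(variant, DEFAULT_COOLDOWN)
    return now - last < window


utc = timezone.utc
history = [
    {"dedupe_key": "k1", "kind": "alert", "status": "failed",
     "sent_at": datetime(2024, 5, 1, 6, tzinfo=utc)},
    {"dedupe_key": "k1", "kind": "alert", "status": "delivered",
     "sent_at": datetime(2024, 5, 1, 7, tzinfo=utc)},
]
now = datetime(2024, 5, 2, 7, tzinfo=utc)
print(in_cooldown(history, "k1", "default", now))  # inside the three-day window
print(in_cooldown(history, "k1", "urgent", now))   # the four-hour window has passed
```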
Expiry
Expiry controls when an alert is no longer useful.
Some alerts expire because the business window closes. An order risk may only matter before fulfillment. A customer follow-up may only be useful while the context is fresh. A pricing exception may only matter until the quote is accepted or rejected. After that, the alert arrives too late to be actionable.
Expiry should be separate from closure. Closure is a response state. Expiry is a rule about usefulness. An alert can expire without being handled. It can be handled before it expires. An expired alert may still be useful for audit, feedback analysis, and rule tuning, but it should not keep competing with current work.
Expiry also applies before delivery. A candidate queue should not process stale work forever because an external writer was down, a payload was invalid, or a retry policy was too optimistic. At some point the candidate should be marked expired or dead with a reason.
Otherwise a repaired delivery layer can suddenly send a batch of alerts that were correct when detected and wrong by the time they arrived.
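A pre-delivery expiry pass might look like the sketch below. It assumes each candidate carries an expires_at value set at detection time. The Candidate fields and the status names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Candidate:
    candidate_id: str
    status: str                 # hypothetical states: "pending", "sent", "expired"
    expires_at: datetime        # when the business window closes
    status_reason: Optional[str] = None


def expire_stale(candidates: List[Candidate], now: datetime) -> List[Candidate]:
    """Mark pending candidates that are past their window as expired, with a
    reason, and return only the candidates that are still worth delivering."""
    deliverable = []
    for candidate in candidates:
        if candidate.status != "pending":
            continue  # already sent or already terminal
        if now >= candidate.expires_at:
            candidate.status = "expired"
            candidate.status_reason = "business window closed before delivery"
        else:
            deliverable.append(candidate)
    return deliverable


utc = timezone.utc
queue = [
    Candidate("c-1", "pending", datetime(2024, 5, 1, 12, tzinfo=utc)),
    Candidate("c-2", "pending", datetime(2024, 5, 3, 12, tzinfo=utc)),
]
# The delivery layer comes back after an outage on 2 May.
ready = expire_stale(queue, now=datetime(2024, 5, 2, 9, tzinfo=utc))
print([c.candidate_id for c in ready])
print(queue[0].status, "-", queue[0].status_reason)
```

Running this pass before every dispatch cycle means a recovered writer only sees work that is still inside its window, while the expired rows stay available for audit.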
Reminders
Reminders are not duplicates when they are modeled as reminders. A reminder has lineage. It points back to the original alert, carries the original context or a replayable snapshot, records who requested it, and has its own due date. It should also have its own idempotency key so the same reminder request is not sent twice because a scheduler ran twice, a retry happened, or a source flag stayed set longer than expected.
A reminder may reuse the original context. It may also recalculate the current state. The underlying situation may have improved, worsened, or disappeared. The alert can say that the condition changed since the original send. If the current scope no longer resolves, it can fall back to the previous snapshot and say so. Both are better than pretending the reminder is a brand-new alert.
Reminder logic has to check pending work and sent history. If a reminder with the same identity is already pending, don’t enqueue another one. If it was already sent, don’t send it again. If the original alert closed, decide whether the reminder should disappear, convert into an audit event, or reopen the case.
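The pending and sent checks can be sketched with an idempotency key derived from the original alert and the reminder's due date. The key format, the in-memory pending and sent stores, and the field names are assumptions for illustration. A real system would back them with durable storage.

```python
from typing import Dict, Optional, Set


def reminder_key(original_alert_id: str, due_date: str) -> str:
    """Idempotency key: one reminder per original alert per due date."""
    return f"reminder:{original_alert_id}:{due_date}"


def enqueue_reminder(
    original_alert_id: str,
    due_date: str,
    requested_by: str,
    pending: Dict[str, dict],
    sent: Set[str],
) -> Optional[dict]:
    """Create a reminder unless the same one is already pending or already sent."""
    key = reminder_key(original_alert_id, due_date)
    if key in pending or key in sent:
        return None  # scheduler ran twice, a retry fired, or the flag stayed set
    reminder = {
        "idempotency_key": key,
        "original_alert_id": original_alert_id,  # lineage back to the first alert
        "due_date": due_date,
        "requested_by": requested_by,
    }
    pending[key] = reminder
    return reminder


pending: Dict[str, dict] = {}
sent: Set[str] = set()

first = enqueue_reminder("alert-42", "2024-05-06", "j.doe", pending, sent)
second = enqueue_reminder("alert-42", "2024-05-06", "j.doe", pending, sent)
print(first is not None, second is None)  # the second request is a no-op
```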
Suppression
Suppression is an intentional state. An alert can be suppressed because today is a non-working day, a blackout window is active, the schedule does not allow dispatch, a route is at capacity, a required field is missing, the owner group is unavailable, or the payload failed validation. Some suppression is temporary. Some is terminal. The record should say which one happened.
That matters later. If a condition was detected and then suppressed, that’s useful evidence. It explains why a known issue did not produce a message. It helps tune schedules, route limits, context requirements, and rule variants. If suppressed candidates simply disappear, someone will eventually ask why no one was warned and the system will answer with a shrug.
Suppression can happen at several points: before candidate creation, during enrichment, or at dispatch. Cheap scope filters often belong close to detection. Runtime policy, availability, localized configuration, payload validation, and route limits usually belong closer to dispatch.
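To keep suppressed candidates from disappearing, the decision itself can be returned as a record. The sketch below is one possible shape. The reason codes, the weekend rule, and the choice to treat a missing owner group as terminal are assumptions, not rules taken from any particular system.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Dict, Optional


class SuppressionKind(Enum):
    TEMPORARY = "temporary"  # retry later
    TERMINAL = "terminal"    # will never be sent in this form


@dataclass(frozen=True)
class Suppression:
    dedupe_key: str
    reason: str
    kind: SuppressionKind


def check_suppression(
    candidate: dict,
    today: date,
    open_by_route: Dict[str, int],
    route_limit: int,
) -> Optional[Suppression]:
    """Return a suppression record to store, or None if dispatch may proceed."""
    key = candidate["dedupe_key"]
    if not candidate.get("owner_group"):
        # Assumption: without an owner the alert cannot be routed at all.
        return Suppression(key, "missing_required_field:owner_group",
                           SuppressionKind.TERMINAL)
    if today.weekday() >= 5:  # Saturday or Sunday
        return Suppression(key, "non_working_day", SuppressionKind.TEMPORARY)
    if open_by_route.get(candidate["route"], 0) >= route_limit:
        return Suppression(key, "route_at_capacity", SuppressionKind.TEMPORARY)
    return None


candidate = {"dedupe_key": "k1", "owner_group": "sales-emea", "route": "emea"}
print(check_suppression(candidate, date(2024, 5, 4), {"emea": 2}, route_limit=10))
print(check_suppression(candidate, date(2024, 5, 6), {"emea": 2}, route_limit=10))
```

Persisting the returned record alongside the candidate is what later answers the question of why a detected condition produced no message.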
Reopening
Reopening is needed when a situation changes after it was handled, or when the previous response no longer closes the loop.
Reopening should not be implemented by forgetting history. If the old state is erased, the system loses the ability to explain why the case returned. A reopened alert should carry lineage to the original situation and a reason the old state is no longer enough.
The cause might be a material change in the signal, a new event on the same business object, an expired snooze, a reminder response, or a manual return from the recipient. Those aren’t the same thing, and the system shouldn’t flatten them into “new alert.”
The reopened state should be visible in the payload and history. “This is new” and “this has returned” are different messages. They create different expectations. They also tell a different story during tuning.
A rule that creates many reopened alerts may be correctly catching unresolved situations. It may also have weak closure criteria, bad cooldowns, or a dedupe key that hides too much change. You can’t tell unless reopening is modeled as state.
Reopening is also a guard against overaggressive dedupe. If the condition materially changes, the system may need to create new work even though the business identity is similar. That decision should come from state and change detection, not from making dedupe keys so narrow that everything looks new.
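A reopening decision driven by state and change detection, rather than by a narrower key, could be sketched as follows. The 25 percent threshold, the metric_value field, and the reason codes are hypothetical. The point is that the result carries lineage and a reason instead of becoming a new alert.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold for what counts as a material change in the signal.
MATERIAL_CHANGE_RATIO = 0.25


@dataclass(frozen=True)
class ReopenDecision:
    original_alert_id: str  # lineage to the closed alert
    reason: str             # why the old state is no longer enough


def decide_reopen(
    closed_alert: dict,
    current_value: float,
    manual_return: bool = False,
) -> Optional[ReopenDecision]:
    """Decide whether a closed alert should come back, and say why."""
    alert_id = closed_alert["alert_id"]
    if manual_return:
        return ReopenDecision(alert_id, "manual_return")

    previous = closed_alert["metric_value"]
    if previous == 0:
        changed = current_value != 0
    else:
        changed = abs(current_value - previous) / abs(previous) >= MATERIAL_CHANGE_RATIO
    if changed:
        return ReopenDecision(alert_id, "material_change")
    return None  # same identity, no material change: dedupe keeps it closed


closed = {"alert_id": "alert-42", "dedupe_key": "k1", "metric_value": 100.0}
print(decide_reopen(closed, current_value=104.0))  # small drift stays closed
print(decide_reopen(closed, current_value=140.0))  # material change reopens
```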
What to log
Stateful alerting needs joinable logs across the lifecycle. The candidate layer should record candidate ID, dedupe key, alert type, variant, target entity, route, source, status, next run time, attempts, execution ID, last error, created time, and updated time. That tells you whether work was created and what happened while the system tried to process it.
Alert history should store the business-facing state: alert ID, dedupe key, recipient or route, alert type, variant, created time, due date, closed state, reminder state, reminder lineage, repeat count, reopened marker, selected answers, response text, attachments, payload snapshot, and last modification time. That tells you what was sent and how the workflow evolved.
The writer log should record side effects: event ID, operation type, target entity, status, error code, error message, source, batch context, timestamp, and enough payload detail to debug the request without leaking unnecessary data. That tells you whether the external system accepted the write.
Those logs need stable identifiers between layers. Candidate ID, alert ID, dedupe key, execution ID, and writer event ID should form a path, not a scavenger hunt.
Without that path, every incident becomes a search across message history, warehouse rows, API responses, scheduler logs, and whatever someone remembers.
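The path can be illustrated with a small join across the three logs. This sketch assumes that alert history rows carry the candidate_id they came from and that writer log rows carry the alert_id they acted on. Those link fields are an assumption added here to make the join possible, and the row shapes are simplified.

```python
from typing import Dict, List


def trace(
    dedupe_key: str,
    candidates: List[dict],
    alerts: List[dict],
    writer_events: List[dict],
) -> List[Dict[str, object]]:
    """Follow one dedupe key from candidate to alert to writer side effects."""
    path = []
    for candidate in candidates:
        if candidate["dedupe_key"] != dedupe_key:
            continue
        for alert in alerts:
            if alert["candidate_id"] != candidate["candidate_id"]:
                continue
            event_ids = [
                event["event_id"]
                for event in writer_events
                if event["alert_id"] == alert["alert_id"]
            ]
            path.append({
                "candidate_id": candidate["candidate_id"],
                "execution_id": candidate["execution_id"],
                "alert_id": alert["alert_id"],
                "writer_event_ids": event_ids,
            })
    return path


candidates = [{"candidate_id": "c-1", "dedupe_key": "k1", "execution_id": "run-9"}]
alerts = [{"alert_id": "a-1", "candidate_id": "c-1", "dedupe_key": "k1"}]
writer_events = [
    {"event_id": "w-1", "alert_id": "a-1", "status": "error"},
    {"event_id": "w-2", "alert_id": "a-1", "status": "ok"},
]
for step in trace("k1", candidates, alerts, writer_events):
    print(step)
```

In a warehouse the same path would normally be a join on these identifiers. The loop form is used here only to keep the example self-contained.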
The business effect of fewer, better alerts
Good dedupe prevents duplicate interruptions. Cooldowns give a real response window. Expiry keeps stale work from arriving late. Reminders continue the same workflow without pretending to be new signals. Suppression respects timing, capacity, and business rules. Reopening brings back only the cases that deserve another look.
The result is not just fewer alerts. It is more trust in the alerts that remain. Recipients learn that the system remembers what it already sent, waits when waiting is correct, and returns only when the situation deserves attention again.
That trust is the actual asset. Once alerts are treated as noise, even the correct ones lose force. The organization stops responding to the system and starts negotiating with it, muting it, ignoring it, or building side channels around it.
An alerting system without state can be analytically correct and operationally useless. It will repeat itself, hide its own decisions, and train recipients to ignore it.
State is what lets the system be selective without becoming blind.