Dec 08, 2022

How we Built our Automated Secrets Rotation Engine

Nic Manoogian

Head of Engineering, Doppler

Teams understand the value of rotating secrets:

Secrets are used in applications that often send logs and other metadata to third-party services, making accidental leaks a significant risk
Some employees need access to certain application secrets but should only have access while it’s needed
Secrets are often shared between several applications (perhaps even across multiple teams), making the exposure surface area massive

Great, rotation is a best practice. But when was the last time your team rotated a third-party service token or even your database credentials? If it was recently, how long did that process take?

Performing manual rotation is risky and cumbersome. Individuals with access to production running manual workflows for every secret, every N days? Not going to happen. Like any good “thing that has to be done over and over” — this process is ripe for automation, provided it can be done safely.

Doppler is in a unique position to build rotation automation because teams already:

Use Doppler as their source-of-truth for secrets
Configure their applications to load secrets from Doppler
Rely on Doppler’s access control, observability, and auditing features

However, a secret rotation feature needs to check some very important boxes before it can be trusted for production use.

Requirements for Doppler’s Rotation Engine

Encryption and Storage

All rotation state data, including credentials and sensitive parameters, must be encrypted to the same standards as any other secret in our system.

Atomic Operations

Rotation must occur in “one shot”. Keeping all rotation operations atomic ensures that applications that fetch secrets from Doppler will always receive a valid (i.e. working) credential — even if Doppler’s infrastructure is interrupted (e.g. power or network failure).

Graceful Error Handling

Doppler facilitates rotation via third-party integrations, and integration points are infamous for being the most brittle parts of any software system. Some errors that the engine encounters can be retried, while others will require user intervention.

Above all, data loss must not be possible. If an error or interruption occurs during any point in the rotation process, Doppler’s rotation engine must be able to recover without an issue.

No Public Access Required

If a secret source (e.g. a database) isn’t publicly accessible, it won’t need to be made public for Doppler to rotate its secrets. Automatic rotation is meant to improve your security posture, not trade one set of risks for another.

We used an interesting strategy to meet this requirement, and the solution deserves its own post. Keep an eye out for our next rotation post on Proxied Rotation to learn more.

The Two-Secret Strategy

To make automatic rotation safe and atomic, we can’t just go revoking old API keys and generating new ones. We need to generate a new key, transition applications to use it, and revoke the old key when it is no longer in use.

Doppler accomplishes this by always maintaining two valid credentials: the active and the inactive. Every N days (we call this the rotation interval), the active credential becomes the inactive credential and the inactive credential is updated or replaced with a new active credential.

Let’s take database rotation on a 15-day rotation interval as an example:

During setup (day 0), the user provides Doppler with two valid starting credentials (username/password pairs) which will be used for rotation. Let’s say they’re called appuser1 and appuser2.
At this point, when an application fetches secrets from Doppler, it’ll get back appuser1 with the original password
On day 15, Doppler will update the password (more on how in our Proxied Rotation post) for appuser2 and make it the active credential
When an application fetches secrets, it’ll receive appuser2 and its new password
The credentials for appuser1 are still valid but the credential is inactive and cannot be requested from Doppler
On day 30, Doppler will update the password for appuser1 and make it the active credential
When an application fetches secrets, it’ll receive appuser1 and its new password
The credentials for appuser2 are still valid but the credential is inactive and cannot be requested from Doppler

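With this strategy, each credential stays valid for two rotation intervals (2*N days) from the moment it becomes active: one interval as the active credential and one as the inactive credential. A credential fetched just before a rotation therefore has only about one interval of validity left, so applications that re-fetch secrets from Doppler at least once per rotation interval (every N days) are guaranteed to always be holding a valid credential.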
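To make the timeline concrete, here is a minimal Python sketch of the updater example above. It is an illustration under stated assumptions, not Doppler's implementation: the `database` dict is a hypothetical stand-in for the source database's user table, and `UpdaterState`, `fetch`, `rotate`, and `still_valid` are names invented for this sketch.

```python
import secrets
from dataclasses import dataclass


@dataclass
class UpdaterState:
    usernames: tuple          # the two users supplied at setup
    active_index: int = 0     # which of the two is handed to applications


def fetch(state, database):
    """What an application receives: the active username and its password."""
    user = state.usernames[state.active_index]
    return user, database[user]


def rotate(state, database):
    """Give the inactive user a new password, then make it the active one."""
    inactive = 1 - state.active_index
    database[state.usernames[inactive]] = secrets.token_urlsafe(16)
    state.active_index = inactive


def still_valid(credential, database):
    """True if the held username/password pair still works in the database."""
    user, password = credential
    return database[user] == password


# Hypothetical stand-in for the source database's user table.
database = {"appuser1": "initial-pw-1", "appuser2": "initial-pw-2"}
state = UpdaterState(usernames=("appuser1", "appuser2"))

held = fetch(state, database)        # fetched on day 14: appuser1
rotate(state, database)              # day 15: appuser2 updated and activated
valid_after_day_15 = still_valid(held, database)
rotate(state, database)              # day 30: appuser1's password is replaced
valid_after_day_30 = still_valid(held, database)
```

In this model, the credential fetched on day 14 keeps working through the day-15 rotation and stops working at the day-30 rotation, when appuser1 receives its new password.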

Types of Rotation

Doppler supports two types of rotation: Updater and Issuer

The database rotation example above is updater rotation. The user provides Doppler with two starting credentials and Doppler updates them “in-place”.

Not all credentials can be updated in place; some need to be freshly created and revoked. With issuer rotation, the user provides parameters for creating new credentials and Doppler will revoke and reissue credentials to perform the rotation.

Let’s take SendGrid API key rotation with a 15-day rotation interval as an example:

During setup (day 0), the user provides Doppler with the SendGrid API scopes that the API key needs
Doppler immediately creates a new API key with the requested scopes and this key is available to applications. Let’s call it key1.
On day 15, Doppler creates a new SendGrid API key, again with the requested SendGrid scopes. Let’s call it key2.
Applications fetching secrets from Doppler now get key2
key1 is still valid but becomes inactive and cannot be requested from Doppler
On day 30, Doppler revokes the inactive credential (key1) and issues a new one (key3) to be the active credential
Applications now get key3
key2 is still valid but becomes inactive and cannot be requested from Doppler

As with updater rotation, each key stays valid for two rotation intervals after it is issued, so applications are guaranteed to always hold a valid credential as long as they re-fetch at least once per rotation interval. The difference with issuer rotation is that we’re continuously revoking old credentials and issuing new ones.
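The issuer timeline can be sketched the same way. This is a hedged illustration only: `FakeKeyProvider` is a hypothetical in-memory stand-in for a key-issuing service (it does not call SendGrid), and `IssuerRotation` and the `"mail.send"` scope string are assumptions chosen for the example.

```python
import itertools


class FakeKeyProvider:
    """Hypothetical in-memory stand-in for a service that issues API keys."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.live_keys = set()

    def create_key(self, scopes):
        key = f"key{next(self._counter)}"
        self.live_keys.add(key)
        return key

    def revoke_key(self, key):
        # discard() keeps revocation idempotent: a missing key is not an error.
        self.live_keys.discard(key)


class IssuerRotation:
    def __init__(self, provider, scopes):
        self.provider = provider
        self.scopes = scopes
        self.inactive = None
        self.active = provider.create_key(scopes)    # day 0: key1

    def rotate(self):
        # Revoke the old inactive key first so that no key is left behind.
        if self.inactive is not None:
            self.provider.revoke_key(self.inactive)
        self.inactive = self.active
        self.active = self.provider.create_key(self.scopes)


provider = FakeKeyProvider()
rotation = IssuerRotation(provider, scopes=["mail.send"])

rotation.rotate()    # day 15: key2 issued, key1 becomes inactive
after_day_15 = (rotation.active, rotation.inactive, sorted(provider.live_keys))

rotation.rotate()    # day 30: key1 revoked, key3 issued, key2 becomes inactive
after_day_30 = (rotation.active, rotation.inactive, sorted(provider.live_keys))
```

At every point exactly two keys are live at the provider once the first rotation has happened: the one applications receive and the one they may still be holding from the previous interval.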

Rotation State

To meet the atomic rotation and encryption requirements, Doppler stores rotation state as a tokenized JSON string.

For those unfamiliar with tokenization: All sensitive data in Doppler is stored and managed by an isolated tokenization service in our infrastructure. Our web applications exchange sensitive data for tokens and vice versa. There’s more information on tokenization in our security facts sheet.

Here’s an example JSON state object for a SendGrid rotated secret:

The parameters object contains static information about how to connect with the secret source and/or how to create credentials (in the case of issuer rotation).
The activeIndex field contains the index of the active credential in the credentials list.
The credentials field contains the list of valid credentials. Depending on the type of secret, the credential may contain several fields, some sensitive and some informational.
The pendingCredential field contains transient state that is used during the rotation. We’ll discuss this more in the next section.
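As a rough illustration of the four fields just described, the state might be shaped like the Python dictionary below before it is serialized to JSON and tokenized. Only the top-level names (`parameters`, `activeIndex`, `credentials`, `pendingCredential`) come from the description above; every nested key and value here is a hypothetical placeholder, as is the `active_credential` helper.

```python
import json

# Hypothetical rotation state for an issuer-style (API key) rotated secret.
example_state = {
    # Static information for creating credentials (nested keys are assumed).
    "parameters": {"scopes": ["mail.send"]},
    # Index into "credentials" of the credential handed to applications.
    "activeIndex": 1,
    # The currently valid credentials (nested keys are assumed).
    "credentials": [
        {"id": "key-id-1", "apiKey": "<sensitive>"},
        {"id": "key-id-2", "apiKey": "<sensitive>"},
    ],
    # Transient state used only while a rotation is in flight.
    "pendingCredential": None,
}


def active_credential(state):
    """Return the credential that applications should currently receive."""
    return state["credentials"][state["activeIndex"]]


# The JSON string is what would be exchanged for a token.
serialized = json.dumps(example_state)
```

Because the core logic only needs `activeIndex` and the length of `credentials`, the contents of each credential can vary by secret type without changing how the active credential is selected.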

We found this structure to be sufficiently generic to accommodate all types of rotated secrets, while still allowing our core logic to perform rotation operations abstractly.

When a rotation is performed, the rotation engine applies the necessary changes, builds new state, and commits the tokenized JSON object to the database.

Two-Phase Commits

The rotation state must always be accurate to meet our atomicity requirements, but interruptions (e.g. power or network failure) make this tricky for updater rotations.

To update a database password, the rotation engine needs to:

1. Generate a new password in memory
2. Update the DB user’s password in the source database
3. Commit the new rotation state to the Doppler database

This leaves a critical period between steps 2 and 3 where our engine could experience an interruption (e.g. power or network failure). When rotation is attempted after the interruption, our state might be out-of-sync with the true state of the database. Big yikes.

To avoid this, we generate a new password and immediately commit the credential to the Doppler database in the pendingCredential field. This occurs before we attempt to update the DB user’s password in the source database. If the engine encounters an interruption, it can check the pendingCredential field during the next attempt. If a pending credential is present, the engine can test the pending username and password to verify whether or not the credential was truly updated.

Issuer rotation isn’t affected by this problem, but the engine is careful to ensure that the old credential is revoked (or missing, to be resilient to interruptions) before issuing the next credential. This ensures that used credentials are not “abandoned” in error scenarios.
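The pending-credential flow for updater rotation can be sketched as follows. This is a simplified model under stated assumptions, not the engine's actual code: the `state` dict stands in for the rotation state committed to the Doppler database, `source_db` stands in for the source database, and `try_login`, `rotate_updater`, `Interrupted`, and the `crash_after_update` flag are hypothetical names used to simulate an interruption.

```python
import secrets


class Interrupted(Exception):
    """Simulated power or network failure."""


def try_login(source_db, username, password):
    """Test whether a username/password pair works against the source."""
    return source_db.get(username) == password


def rotate_updater(state, source_db, crash_after_update=False):
    inactive = 1 - state["activeIndex"]

    # A pending credential means an earlier attempt was interrupted.
    pending = state["pendingCredential"]
    if pending is None:
        # Phase 1: record the intended credential before touching the source.
        pending = {
            "username": state["credentials"][inactive]["username"],
            "password": secrets.token_urlsafe(16),
        }
        state["pendingCredential"] = pending

    # Phase 2: update the source unless the earlier attempt already did.
    if not try_login(source_db, pending["username"], pending["password"]):
        source_db[pending["username"]] = pending["password"]

    if crash_after_update:
        raise Interrupted("failure between source update and final commit")

    # Final commit: promote the pending credential and clear the marker.
    state["credentials"][inactive] = pending
    state["activeIndex"] = inactive
    state["pendingCredential"] = None


source_db = {"appuser1": "pw-1", "appuser2": "pw-2"}
state = {
    "activeIndex": 0,
    "credentials": [
        {"username": "appuser1", "password": "pw-1"},
        {"username": "appuser2", "password": "pw-2"},
    ],
    "pendingCredential": None,
}

try:
    rotate_updater(state, source_db, crash_after_update=True)
except Interrupted:
    pass
pending_after_crash = dict(state["pendingCredential"])

rotate_updater(state, source_db)    # the retry reuses the pending password
```

After the simulated crash, the source database already holds the new password while the state still points at the old active credential. The retry finds the pending credential, confirms it works, and completes the commit without generating a second password.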

Delivering Secret Fields to the Application

Whenever rotated secret state is created or updated, the rotation engine saves the resulting secrets as “injected secrets” into the config. This allows advanced functionality like config logs, sync integrations, webhooks, secret references, and access logs to work for rotated secrets the same way they do for static secrets.

For example, if our rotated database secret was named DB_USER, then the secrets DB_USER_NAME, DB_USER_PASSWORD, and DB_USER_SECRET would all be available in the config.

Automatic Redeployment

Integrating Doppler into your deployment process is the most reliable way to ensure that your applications are always consuming valid secrets. When a secret is rotated, the new injected secret values are immediately synced to any integrations you’ve configured, just like any other secret change in Doppler. Automatic restarts can be implemented with the Doppler Secrets Operator, External Secrets Operator, or a webhook automation.

Handling Errors

In order for automatic rotation to be useful, error states need to be friendly and the dashboard needs to provide clear directions to users on how to resolve problems.

Doppler distinguishes between:

Authentication Errors: These disable the whole integration between Doppler and the service, which effectively disables all rotated secrets that use the integration to authenticate. Users are prompted to correct their authentication to continue.
Access Errors: These disable just the rotated secret. Users are prompted to grant appropriate access back to the credential that Doppler is using to facilitate rotation.
Transient Errors: These should be retried at some point in the future. Doppler will automatically retry these errors with an exponential backoff, eventually disabling the rotated secret and notifying the user if the error persists. When a rotated secret is in an error backoff state, this information is available in the dashboard, including the time and error message of the last attempt and the time of the next attempt.
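One way to picture the three error classes is as a small decision function. The sketch below is an assumption-laden illustration: the base delay, the attempt limit, and the names `ErrorKind`, `Outcome`, and `handle_error` are all hypothetical, and Doppler's actual backoff schedule is not stated in this post.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    AUTHENTICATION = "authentication"
    ACCESS = "access"
    TRANSIENT = "transient"


@dataclass
class Outcome:
    action: str                               # what the engine does next
    retry_in_seconds: Optional[float] = None  # set only when retrying


BASE_DELAY_SECONDS = 60    # hypothetical first retry delay
MAX_ATTEMPTS = 5           # hypothetical limit before giving up


def handle_error(kind, attempt):
    """Decide what to do after failed attempt number `attempt` (1-based)."""
    if kind is ErrorKind.AUTHENTICATION:
        # The whole integration is disabled until the user fixes auth.
        return Outcome("disable_integration")
    if kind is ErrorKind.ACCESS:
        # Only this rotated secret is disabled.
        return Outcome("disable_secret")
    if attempt >= MAX_ATTEMPTS:
        # Transient errors that never clear eventually disable the secret.
        return Outcome("disable_secret")
    # Exponential backoff: the delay doubles after every failed attempt.
    return Outcome("retry", BASE_DELAY_SECONDS * 2 ** (attempt - 1))
```

With these placeholder values, the first transient failure is retried after 60 seconds and the third after 240 seconds, and the fifth consecutive failure disables the rotated secret.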

Whenever an error requires user intervention, we notify users with the appropriate permissions to correct the issue via email.

Closing Thoughts

Manual rotation is complex, time-consuming, and error-prone. We believe that automation is the answer — but only if the tooling is secure, reliable, and fails gracefully when necessary. After all, a tool is only as good as its failure modes.

Why not kick the tires and see for yourself?
