BIG BOX Hosting — Guides — Migrate from SendGrid in 21 days № 60.01

Migrate from SendGrid
to dedicated infrastructure
in 21 days.

Discovery, DNS audit, SPF/DKIM/DMARC remediation, IP warmup, two-stage cutover, DPA execution. Written for the senior engineer running the migration. The technical work is not difficult; doing it correctly while keeping production running is. This guide describes the typical case (1-10M emails/month, 1-4 sending domains) and explains where the 21-day pattern does not fit.


01 / Why this guide exists

Who should read this.

Senior engineers running the migration. Volume between 1 and 10 million emails per month. One to four sending domains. One engineer dedicated for three weeks.

This guide describes how to migrate an email infrastructure off SendGrid to dedicated hosting in 21 days. It is written for the senior engineer or technical lead who has been told that SendGrid is being replaced and needs to deliver the migration without breaking ongoing production sends. The 21-day timeline assumes a moderately complex setup: between one and four sending domains, a single SendGrid account, monthly volume between one and ten million emails, and an internal team that can dedicate one engineer's time over the full three weeks. Setups outside that envelope need a different timeline, and the last section of this guide explains where the boundaries are.

We have run this migration roughly thirty times over the last four years. The typical reasons customers move are Schrems II compliance review, a procurement framework that has rejected US-domiciled email processors, an OVH Canada-style ruling that has reframed the legal analysis for cloud providers with foreign group exposure, a sustained deliverability decline that the customer has traced to shared-IP reputation, or a cost analysis that has concluded that dedicated infrastructure is more economical at the customer's volume. The technical work is largely the same regardless of motivation. The order changes slightly. The compliance-driven migrations front-load the legal work; the deliverability-driven migrations front-load the technical audit.

Read this guide front to back the first time. Do not skip the discovery section. The single biggest mistake we see in self-driven migrations is underestimating how much SendGrid configuration drift has accumulated over two to six years of production use. The DKIM key rotation in days 4-7 is what catches most teams off-guard, and the most common source of pain is a hardcoded selector reference somewhere in a third-party templating service that nobody on the current engineering team knows about.

─────────────────────────────────────────────────────────────────────────

02 / Timeline overview

Five phases. Twenty-one days.

The 21-day timeline is achievable but not comfortable. Teams that succeed run it as primary work, not as a side project alongside ongoing feature delivery.

The 21-day timeline breaks into five phases. Days 1 through 3 are discovery and DNS audit, a read-only phase that changes no configuration and that the engineering lead can run alone with read-only access to the production DNS. Days 4 through 7 are SPF, DKIM, and DMARC remediation, where the bulk of the configuration changes happen. Days 8 through 14 are IP warmup on the new infrastructure, with parallel sending against the existing SendGrid path for comparison. Days 15 through 18 are cutover, executed in two stages with a 48-hour gap. Days 19 through 21 are DPA execution, DNS cleanup, and decommissioning of the old SendGrid configuration.

The timeline assumes the new infrastructure has been provisioned by day 1. If you are evaluating providers in parallel with reading this guide, add five to seven business days to the front of the timeline for vendor selection, contract negotiation, and dedicated IP allocation. The provisioning step is not technical; it is a procurement step. We have seen procurement teams take three weeks for a vendor selection that the engineering team had pre-decided in three days, which is a reason to start the procurement conversation earlier than you think.

The 21-day timeline is achievable but not comfortable. The teams that run this in 21 days are running it as their primary work for three weeks. The teams that try to run this in 21 days as a side project alongside ongoing feature work consistently miss the timeline by 50 to 100 percent and arrive at week six with half the migration done and the production environment in a hybrid state that is harder to manage than either the old or the new infrastructure alone. If you cannot dedicate one engineer's time, the realistic timeline is closer to six to eight weeks.

03 / Days 1-3

Discovery and DNS audit.

The first three days are spent finding everything that exists. The objective is a complete inventory before any technical work begins.

The first three days are spent finding everything that exists. The objective is a complete inventory of sending domains, DKIM selectors, SPF includes, MTA-STS policies, BIMI records, and any third-party services that have direct sending integration. Do not start the technical migration before the inventory is complete. Migrations that start technical work on incomplete inventory routinely discover dormant integrations on day twelve that would have been a one-hour fix on day one and turn into a two-day debugging session at the worst possible moment.

DNS audit specifics. For each sending domain, run a complete DNS audit. The records you need to capture: the SPF record on the apex (TXT v=spf1), every include: directive in that SPF record, every DKIM selector currently published (SendGrid domain authentication typically publishes s1._domainkey.example.com and s2._domainkey.example.com as CNAMEs, but check what SendGrid configured), the DMARC record on _dmarc.example.com, the MTA-STS TXT record at _mta-sts.example.com together with the policy it announces at https://mta-sts.example.com/.well-known/mta-sts.txt, the TLS-RPT record at _smtp._tls.example.com, and any BIMI record at default._bimi.example.com. Use dig +short TXT for the standard records. For MTA-STS, fetch the policy file directly with curl; do not rely on the DNS record alone, which only announces that a policy exists and carries its version id.
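As a rough illustration of the audit list above, the sketch below builds the dig and curl commands for one domain so the output can be saved per domain. The function names and the selector list passed in are hypothetical; substitute whatever selectors your own audit finds.

```python
def audit_targets(domain, dkim_selectors):
    """Return (record name, record type) pairs to capture for one sending domain."""
    targets = [
        (domain, "TXT"),                     # SPF is published as a TXT record (v=spf1)
        (f"_dmarc.{domain}", "TXT"),         # DMARC policy
        (f"_mta-sts.{domain}", "TXT"),       # MTA-STS announcement record
        (f"_smtp._tls.{domain}", "TXT"),     # TLS-RPT reporting address
        (f"default._bimi.{domain}", "TXT"),  # BIMI, if published
    ]
    for selector in dkim_selectors:
        targets.append((f"{selector}._domainkey.{domain}", "TXT"))
    return targets


def audit_commands(domain, dkim_selectors):
    """Build the shell commands to run; nothing is executed here."""
    commands = [
        f"dig +short {rtype} {name}"
        for name, rtype in audit_targets(domain, dkim_selectors)
    ]
    # The MTA-STS policy itself is served over HTTPS, not DNS.
    commands.append(f"curl -s https://mta-sts.{domain}/.well-known/mta-sts.txt")
    return commands


if __name__ == "__main__":
    # Selector names below are placeholders for illustration only.
    for command in audit_commands("example.com", ["s1", "s2"]):
        print(command)
```

Keeping the command list in code makes it easy to rerun the same audit after cleanup in days 19-21 and diff the two outputs.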

SPF lookup count. Count the DNS lookups in your SPF record. RFC 7208 section 4.6.4 sets the limit at 10 DNS lookups per SPF evaluation. Most teams who have been on SendGrid for more than two years are at or over the limit, often without knowing it because some receivers fail open on permerror and others silently treat the SPF result as neutral or none. The tool to use is spf-record-lookup or any of the public web tools (we publish one at /tools/spf-flattener/). The fix in days 4-7 will involve flattening or restructuring the SPF chain. The discovery output you need from days 1-3 is a written count of current lookups per domain.
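To make the counting rule concrete, here is a minimal sketch that counts lookups over SPF records you have already captured. It assumes the records are supplied as an in-memory mapping (no live DNS queries), and every domain name in the example is invented. It counts the mechanisms that trigger a DNS query (include, a, mx, ptr, exists) plus the redirect modifier; ip4, ip6, and all cost nothing.

```python
LOOKUP_MECHANISMS = ("include", "a", "mx", "ptr", "exists")


def count_spf_lookups(domain, records, _path=()):
    """Count DNS-querying terms reachable from `domain`.

    `records` maps a domain name to its SPF TXT string, as captured
    during the audit. `_path` guards against include loops.
    """
    if domain in _path:
        return 0
    path = _path + (domain,)
    total = 0
    for term in records.get(domain, "").split()[1:]:  # skip "v=spf1"
        term = term.lstrip("+-~?")                    # drop the qualifier
        if term.startswith("redirect="):
            total += 1 + count_spf_lookups(term.split("=", 1)[1], records, path)
            continue
        name, _, argument = term.partition(":")
        name = name.split("/")[0]                     # "mx/24" -> "mx"
        if name in LOOKUP_MECHANISMS:
            total += 1
            if name == "include":
                total += count_spf_lookups(argument, records, path)
    return total


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    captured = {
        "example.com": "v=spf1 include:esp.example include:_spf.crm.example ip4:192.0.2.0/24 mx ~all",
        "esp.example": "v=spf1 ip4:198.51.100.0/24 ~all",
        "_spf.crm.example": "v=spf1 include:a.crm.example include:b.crm.example ~all",
        "a.crm.example": "v=spf1 ip4:203.0.113.0/25 ~all",
        "b.crm.example": "v=spf1 ip4:203.0.113.128/25 ~all",
    }
    print(count_spf_lookups("example.com", captured))
```

Nested includes are what push most records over the limit: the example has only three lookup terms at the top level but five lookups in total.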

Hidden services. Dormant or undocumented services are the single largest source of migration overruns. Pull the SendGrid sending logs for the previous 90 days and identify every API key that has sent more than 100 messages in that window. Cross-reference each API key against the production codebase. Any API key that is sending mail but is not referenced in code is either a third-party integration (typical: customer support tools, CRM platforms, billing systems) or a dormant service running unsupervised. The customer support integration we found in the UK Fintech case study was sending 80,000 messages a month from a tool whose provisioning engineer had left the company eighteen months earlier. Do this audit before starting the technical work.
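The cross-reference step can be sketched as a small set operation. This assumes you have exported the 90-day send log into rows carrying an API key identifier; the field name api_key_id and the key names are hypothetical, not a SendGrid export format.

```python
from collections import Counter


def unreferenced_keys(send_log, keys_in_code, min_messages=100):
    """Return {api_key_id: message_count} for keys that sent more than
    `min_messages` but are not referenced anywhere in the codebase."""
    counts = Counter(row["api_key_id"] for row in send_log)
    return {
        key: sent
        for key, sent in counts.items()
        if sent > min_messages and key not in keys_in_code
    }


if __name__ == "__main__":
    # Synthetic log rows for illustration only.
    log = (
        [{"api_key_id": "key-app"}] * 150
        + [{"api_key_id": "key-unknown"}] * 500
        + [{"api_key_id": "key-test"}] * 20
    )
    print(unreferenced_keys(log, keys_in_code={"key-app"}))
```

Every key in the result needs an owner before the migration proceeds: either a named third-party integration or a decision to decommission it.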

04 / Days 4-7

SPF, DKIM, DMARC remediation.

The technical fixes that make the rest of the migration work. SPF flattening if needed. DKIM rotation with a seven-day overlap window. DMARC progression toward enforcement.

SPF flattening. The SPF remediation is the first technical change. If your domain is over the 10-lookup limit, you need to flatten the SPF chain or restructure it so that the SendGrid-related includes are removed cleanly when you cut over to the new infrastructure. Flattening means resolving the include chain manually and publishing the resulting set of IP addresses or CIDR blocks directly in your SPF record, instead of leaving the includes in place. The trade-off is that flattened SPF records do not auto-update when the upstream provider changes its IP ranges; you have to maintain the record manually. We typically recommend flattening for senders whose SPF chain only changes when they migrate providers (which is most senders), and recommend keeping includes for senders who change peering arrangements frequently (which is unusual).
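A flattened record is just the resolved networks written out as ip4: and ip6: terms. The sketch below assumes you have already resolved the include chain into a list of addresses or CIDR blocks; the function name and the sample networks are illustrative.

```python
import ipaddress


def flatten_spf(networks, policy="~all"):
    """Build a flattened SPF record from resolved addresses or CIDR blocks.

    Duplicates are removed and the output is sorted so that successive
    versions of the record diff cleanly.
    """
    parsed = {ipaddress.ip_network(n, strict=False) for n in networks}
    terms = ["v=spf1"]
    for net in sorted(parsed, key=lambda n: (n.version, n)):
        terms.append(f"ip{net.version}:{net}")
    terms.append(policy)
    return " ".join(terms)


if __name__ == "__main__":
    # Documentation-range networks for illustration only.
    print(flatten_spf(["198.51.100.0/24", "192.0.2.10", "2001:db8::/32", "198.51.100.0/24"]))
```

Because ip4: and ip6: terms do not trigger DNS queries, a fully flattened record uses none of the 10-lookup budget. The cost is the manual maintenance described above.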

DKIM key rotation. The DKIM rotation is the part of the migration that catches most teams off-guard. The procedure is straightforward: generate a new DKIM key pair (2048-bit minimum; 1024-bit keys are being progressively rejected by some receivers and should not be used in 2026), publish the public half on a new selector (e.g. selector-2026._domainkey.example.com), and configure the new infrastructure to sign with the new selector. The complication is that some senders have hardcoded DKIM selector references in places nobody on the current team knows about. Templating services, marketing automation platforms, customer support tools, and legacy CMS integrations are the typical culprits. The discovery audit in days 1-3 should have surfaced these. If it did not, the DKIM rotation will surface them, with mail bouncing or failing DMARC alignment until the missing reference is updated.
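One practical detail when publishing a 2048-bit key: the base64 public key is roughly 392 characters, and a single DNS TXT character-string holds at most 255 bytes, so the record value has to be split into several quoted strings. A minimal sketch, assuming the key is already available as a base64 string (key generation is out of scope here, and the function name is illustrative):

```python
def dkim_txt_value(public_key_b64, chunk_size=255):
    """Format a DKIM TXT value as quoted strings of at most `chunk_size` characters.

    DNS resolvers concatenate the strings back together, so the split
    points do not need to fall on tag boundaries.
    """
    record = f"v=DKIM1; k=rsa; p={public_key_b64}"
    chunks = [record[i:i + chunk_size] for i in range(0, len(record), chunk_size)]
    return " ".join(f'"{chunk}"' for chunk in chunks)


if __name__ == "__main__":
    placeholder_key = "A" * 392  # stands in for a real base64 public key
    print(dkim_txt_value(placeholder_key))
```

Some DNS providers do this splitting for you in their UI; check how yours handles long TXT values before pasting pre-split strings.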

Selector overlap window. Run a seven-day overlap window. Publish the new selector immediately. Keep the old selector published in DNS for seven days while production transitions to signing with the new selector. After seven days, the old selector has stopped being used by any active code path and can be removed from DNS. The overlap window is necessary because mail in flight at the moment of cutover may be signed with the old selector and only verified by the receiver hours or days later. Removing the old selector before in-flight mail is verified results in DMARC alignment failure on a small but real fraction of mail.

DMARC enforcement progression. If you are still at p=none, this is the moment to begin moving toward enforcement. Do not jump straight to p=reject. The progression we recommend: publish DMARC at p=none; rua=mailto:[email protected] if you have not already, collect aggregate reports for two weeks, identify any unauthorised sending sources, then move to p=quarantine; pct=25, monitor for one week, increase to pct=50, monitor for another week, then move to p=quarantine; pct=100. After three to four weeks at 100 percent quarantine with no remaining issues, move to p=reject. The full progression takes six to eight weeks, longer than the migration itself, but you can begin the progression in week one of the migration and finish it after the migration is complete. DMARC enforcement is not a precondition for the migration.
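The progression above can be written down as data, which makes it easier to review and to schedule. This is a sketch only: the reporting address is a placeholder, and the stage durations simply restate the weeks given in the text.

```python
def dmarc_record(policy, pct=100, rua=None):
    """Render a DMARC TXT value. pct is omitted at 100 because that is the default."""
    parts = ["v=DMARC1", f"p={policy}"]
    if pct != 100:
        parts.append(f"pct={pct}")
    if rua:
        parts.append(f"rua=mailto:{rua}")
    return "; ".join(parts)


# (policy, pct, minimum weeks to hold before moving on); None means final state.
STAGES = [
    ("none", 100, 2),
    ("quarantine", 25, 1),
    ("quarantine", 50, 1),
    ("quarantine", 100, 3),
    ("reject", 100, None),
]

if __name__ == "__main__":
    reports_to = "dmarc-reports@example.com"  # placeholder address
    for policy, pct, weeks in STAGES:
        hold = f"hold {weeks}+ weeks" if weeks else "final"
        print(f"{dmarc_record(policy, pct, reports_to)}    # {hold}")
```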

TLS-RPT and MTA-STS. If your domain does not have MTA-STS and TLS-RPT published, this is the moment to add them. Both govern inbound mail delivered to your domain's MX hosts rather than the mail you send, so they are passive improvements: nothing breaks if they are absent, but they signal to receivers that you are operating a mature infrastructure. Publish the MTA-STS policy at mode: enforce if your infrastructure supports it (which any modern MTA does), or at mode: testing if you want a buffer period. Add the TLS-RPT record pointing to a reporting endpoint you actually monitor, not a destination you intend to set up later. Receivers occasionally use TLS-RPT as a signal of operational maturity, and pointing it to an unmonitored endpoint is worse than not having it at all.
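For reference, the three artefacts involved are small. The sketch below renders the policy file served at /.well-known/mta-sts.txt, the _mta-sts TXT value, and the _smtp._tls TXT value. The MX hostname, policy id, and reporting address are placeholders, and the one-day max_age is an arbitrary starting value rather than a recommendation from this guide.

```python
def mta_sts_policy(mx_hosts, mode="testing", max_age=86400):
    """Render the MTA-STS policy file body."""
    if mode not in ("enforce", "testing", "none"):
        raise ValueError(f"unknown MTA-STS mode: {mode}")
    lines = ["version: STSv1", f"mode: {mode}"]
    lines += [f"mx: {host}" for host in mx_hosts]
    lines.append(f"max_age: {max_age}")
    return "\n".join(lines) + "\n"


def mta_sts_txt(policy_id):
    """TXT value for _mta-sts.<domain>; change the id whenever the policy changes."""
    return f"v=STSv1; id={policy_id}"


def tls_rpt_txt(rua):
    """TXT value for _smtp._tls.<domain>."""
    return f"v=TLSRPTv1; rua=mailto:{rua}"


if __name__ == "__main__":
    print(mta_sts_policy(["mx1.example.com"]))       # placeholder MX host
    print(mta_sts_txt("20260101T000000"))            # placeholder id
    print(tls_rpt_txt("tls-reports@example.com"))    # placeholder mailbox
```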

05 / Days 8-14

IP warmup.

The seven-day window where new IPs build reputation. Split traffic between old and new infrastructure. Watch for complaint rate spikes. Pause and resume rather than push through.

Why warmup is necessary. Mailbox providers track sending IP reputation independently of domain reputation. A new IP, even when sending DKIM-signed mail from a domain with established reputation, is treated cautiously by major receivers until it has accumulated enough sending history to establish its own reputation profile. Sending a million messages from a cold IP on day one will result in the majority being deferred or sent to spam. The warmup process is the gradual ramp from a few hundred messages a day to full production volume over seven to fourteen days, allowing the IP to build reputation without triggering the volume-based defences that receivers apply to suspicious new senders.

The warmup curve. A typical warmup curve doubles volume every 24 to 48 hours, starting at 50 to 100 messages a day and reaching 50,000 to 100,000 messages per IP per day by day fourteen. The exact curve depends on the receiver mix and the engagement quality of the recipient list. High-engagement lists (recent subscribers, transactional traffic to known customers) warm faster than low-engagement lists. Yahoo and Gmail both publish guidance on warmup pacing alongside their bulk sender requirements (in effect since February 2024; Microsoft's equivalent requirements took effect in May 2025). Follow the published guidance for whichever receiver represents the largest share of your audience.
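A doubling schedule is easy to tabulate in advance, and doing so shows how sensitive the end date is to the doubling interval. The sketch below is a plain geometric ramp with a cap; the starting volume, target, and function name are illustrative, and real schedules should be adjusted to receiver feedback.

```python
def warmup_schedule(start=100, target=100_000, doubling_days=1):
    """Return [(day, per-IP daily volume)] doubling every `doubling_days` days,
    capped at `target`. Day 1 is the first day of warmup."""
    schedule = []
    volume, day = start, 1
    while volume < target:
        schedule.append((day, volume))
        volume *= 2
        day += doubling_days
    schedule.append((day, target))
    return schedule


if __name__ == "__main__":
    for day, volume in warmup_schedule():
        print(f"warmup day {day:2d}: {volume:7,d} messages")
```

With these inputs, doubling every 24 hours from 100 messages reaches 100,000 on warmup day 11, while doubling every 48 hours (doubling_days=2) does not get there until day 21. Which end of the 24-to-48-hour range you can sustain largely determines whether the warmup fits the planned window.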

Splitting traffic during warmup. During the warmup window, split your sending across the old SendGrid IPs and the new dedicated IPs. The split should match the warmup capacity of the new IPs. A typical split on day one of warmup might be 99 percent SendGrid, 1 percent new IP. By day eight, the split should be roughly 70/30 in favour of SendGrid, with the new IPs having absorbed about 30 percent of total volume. By day fourteen, the split should be 30/70 or 50/50, with the new IPs ready to absorb full production volume after cutover. Use traffic routing logic that selects between the two paths per-message, not per-domain, so that the warmup is spread evenly across recipient mailbox providers.
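Per-message routing can be as simple as hashing a message identifier into a bucket and comparing it with the current share for the new IPs. This sketch is one way to do it, not the only one: the function name and the path labels are hypothetical, and it assumes each message has a stable id. Hashing rather than drawing a random number means a retried message stays on the same path.

```python
import hashlib


def choose_path(message_id, new_ip_share):
    """Route one message to "new" or "sendgrid".

    `new_ip_share` is the fraction of traffic (0.0 to 1.0) the new IPs
    should carry today, taken from the warmup schedule.
    """
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "new" if bucket < new_ip_share else "sendgrid"


if __name__ == "__main__":
    sample = [f"msg-{i}" for i in range(10_000)]
    on_new = sum(choose_path(m, 0.30) == "new" for m in sample)
    print(f"{on_new / len(sample):.1%} of the sample routed to the new IPs")
```

Because the decision ignores the recipient domain, each mailbox provider sees roughly the same share of traffic from the new IPs, which is the even spread the paragraph above asks for.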

Troubleshooting during warmup. The most common problem during warmup is a complaint rate spike. Yahoo and Gmail flag sustained complaint rates above 0.3 percent (per the bulk sender requirements). A spike during warmup typically traces back to a specific list segment with old or low-quality consent. Identify the segment by partitioning the recent sends by source list and checking the complaint rate per source. Pause the offending segment, drop the affected addresses from the active list, and resume warmup from where you stopped. The complaint rate typically falls back below threshold within 24 to 48 hours after pausing the source. We had a 27,000-address segment trigger this exact pattern during the UK Fintech migration; the segment was a 2022 acquisition list that had never been re-permissioned.
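Partitioning by source list is a small aggregation. The sketch below assumes you can produce, per source list, a count of delivered messages and complaints for the recent window; the list names and figures are invented, and the 0.3 percent threshold is the one cited above.

```python
def complaint_rates(sends):
    """`sends` is an iterable of (source_list, delivered, complaints).
    Returns {source_list: complaint rate}, skipping lists with no deliveries."""
    rates = {}
    for source, delivered, complaints in sends:
        if delivered:
            rates[source] = complaints / delivered
    return rates


def segments_to_pause(sends, threshold=0.003):
    """Source lists whose complaint rate exceeds the threshold."""
    return sorted(
        source for source, rate in complaint_rates(sends).items() if rate > threshold
    )


if __name__ == "__main__":
    # Invented figures for illustration only.
    recent = [
        ("list-a", 50_000, 40),
        ("list-b", 20_000, 150),
        ("list-c", 120_000, 12),
    ]
    print(segments_to_pause(recent))
```

A blended rate across all lists can sit under the threshold while one segment is far above it, which is why the per-source breakdown matters.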

IP allocation strategy. Allocate at least two IPs for transactional traffic and at least two for marketing traffic, and warm them on separate curves. Mixing transactional and marketing on the same IP introduces complaint risk to the transactional path, which is the path that should be most reliable. If your monthly volume is below 500,000 emails, two IPs total may suffice. Above five million emails a month, consider four to six IPs split by traffic type and by recipient region. The general principle: separate traffic profiles that have different complaint risk, give each its own warmup curve, and converge them only after both have established reputation independently.

06 / Days 15-18

Two-stage cutover.

Transactional first. Wait 48 hours. Marketing second. Monitor in real time. Have the rollback plan written before cutover starts.

Two-stage cutover. Cut transactional traffic first. Wait 48 hours. Cut marketing traffic second. The reason for the two-stage cutover is that transactional traffic produces immediate, visible signal: customers expect to receive password reset emails within seconds, and a problem with transactional cutover surfaces within minutes. Marketing traffic is usually batched and the signal of a problem can take hours to surface. Cutting transactional first lets you confirm that the new infrastructure is operating correctly before exposing the higher-volume marketing path. If transactional cutover fails, you have 48 hours to roll back without affecting marketing.

Cutover monitoring. During the cutover window, monitor the following metrics in real time: SMTP response codes (specifically the rate of 421/451/4xx temporary deferrals and 5xx permanent rejections), bounce rate by recipient mailbox provider, DMARC alignment passing rate (accessible through aggregate reports if your DMARC record is configured to receive them; these typically arrive daily rather than in real time), and inbox placement at Gmail Postmaster Tools. Set alerting thresholds for each: deferral rate above 5 percent, bounce rate above 2 percent, DMARC alignment below 95 percent, Gmail reputation drop into Medium or Low. The monitoring should run for a full 72 hours after each cutover stage.
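The alerting thresholds translate directly into a check you can run against whatever metrics snapshot your monitoring produces. The metric names below are hypothetical; the limits are the ones stated above, and any Gmail reputation other than High is treated as an alert.

```python
# name -> (kind, limit): "max" alerts when the value rises above the limit,
# "min" alerts when it falls below.
THRESHOLDS = {
    "deferral_rate": ("max", 0.05),
    "bounce_rate": ("max", 0.02),
    "dmarc_alignment": ("min", 0.95),
}


def breached(metrics):
    """Return the names of metrics that are outside their thresholds.
    Metrics missing from the snapshot are skipped, not treated as healthy."""
    alerts = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (kind == "max" and value > limit) or (kind == "min" and value < limit):
            alerts.append(name)
    if metrics.get("gmail_reputation") not in (None, "High"):
        alerts.append("gmail_reputation")
    return alerts


if __name__ == "__main__":
    snapshot = {
        "deferral_rate": 0.07,
        "bounce_rate": 0.01,
        "dmarc_alignment": 0.97,
        "gmail_reputation": "High",
    }
    print(breached(snapshot))
```

Writing the thresholds down as data also gives the rollback plan something unambiguous to reference as trigger criteria.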

Rollback plan. The rollback plan should be written before cutover starts, not improvised mid-incident. The plan should specify: the precise DNS changes to roll back (which TXT records, what values), who has access to make them, the time-to-rollback (typically 5 to 15 minutes for DNS changes plus the receiver TTL window before the change propagates), the criteria that trigger rollback (which thresholds in the cutover monitoring would justify rollback rather than mid-flight remediation), and the communication plan to internal stakeholders. Rolling back is rare in our experience, but the cases where rollback was necessary always involved a problem that had not been caught during warmup, and the speed of the rollback was the difference between a minor incident and a customer-visible outage.

07 / Days 19-21

DPA execution and DNS cleanup.

Counter-sign the DPA. Remove the SendGrid SPF includes. Retire the legacy DKIM selectors after the seven-day overlap. Verify each production sending integration with a test message.

DPA execution. The Data Processing Agreement should be in legal review during the warmup phase, not started after cutover. By day 19, the DPA should be ready for counter-signature. The standard EU GDPR Article 28 instrument is straightforward; customer-supplied DPAs typically require small amendments (most commonly: a shorter breach notification window, sector-specific audit rights references such as FCA SYSC 8 or HDS Article L1111-8, or specific sub-processor consent language). Counter-sign and store the executed DPA in your contract management system. Update the entry on your sub-processor list and the corresponding entry on your published vendor list.

DNS cleanup. Remove the SendGrid SPF includes from your production DNS. Retire the legacy DKIM selectors after the seven-day overlap window completes. Decommission any dormant sending domains that the discovery audit surfaced. Publish the final SPF record (without SendGrid includes), the final DKIM selector list (without legacy selectors), the production DMARC record (at whatever enforcement level you have reached in the progression), the MTA-STS policy at mode: enforce, and the TLS-RPT record pointing to your monitored endpoint.

Verification. Send a test message from each production sending integration after DNS cleanup completes. Verify that the message authenticates correctly (SPF pass, DKIM pass, DMARC alignment pass), that it is received in the inbox (not Promotions, not spam), and that the headers do not contain residual references to SendGrid infrastructure. Run the verification against test accounts on the major mailbox providers (Gmail, Outlook, Yahoo, Apple iCloud). Save the headers as documentation of successful migration. If any verification fails, the migration is not complete; investigate and remediate before declaring the migration finished.
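Checking saved headers by hand across four providers and several integrations is error-prone, so a small parser helps. The sketch below reads a raw message, pulls the spf, dkim, and dmarc verdicts from its Authentication-Results headers, and flags any header that still mentions SendGrid. It is a simple pattern match, not a full parser of the header grammar, and the sample message is invented.

```python
import re
from email import message_from_string


def auth_results(raw_message):
    """Return {"spf": verdict, "dkim": verdict, "dmarc": verdict} using the
    first verdict found for each method in Authentication-Results headers."""
    msg = message_from_string(raw_message)
    results = {}
    for header in msg.get_all("Authentication-Results", []):
        for method, verdict in re.findall(r"\b(spf|dkim|dmarc)=(\w+)", str(header)):
            results.setdefault(method, verdict)
    return results


def mentions_sendgrid(raw_message):
    """True if any header name or value still references SendGrid."""
    msg = message_from_string(raw_message)
    return any("sendgrid" in f"{name}: {value}".lower() for name, value in msg.items())


if __name__ == "__main__":
    # Invented sample message for illustration only.
    sample = (
        "Authentication-Results: mx.example.net;\n"
        " spf=pass smtp.mailfrom=bounce.example.com;\n"
        " dkim=pass header.d=example.com header.s=selector-2026;\n"
        " dmarc=pass header.from=example.com\n"
        "From: sender@example.com\n"
        "Subject: verification test\n"
        "\n"
        "body\n"
    )
    print(auth_results(sample), mentions_sendgrid(sample))
```

A message passes this check only if all three verdicts are pass and mentions_sendgrid returns False. Inbox placement still has to be confirmed by looking at the test mailbox.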

08 / Failure modes

What can go wrong.

The migrations that fail are usually migrations on an unrealistic timeline by a team without dedicated capacity. The technical work is not difficult; doing it correctly while keeping production running is.

The migrations that fail are typically migrations that were attempted on an unrealistic timeline by a team that did not have dedicated capacity. The technical work is not difficult. What is difficult is doing the technical work correctly while also keeping production running, while also handling any unexpected discovery from days 1-3, while also managing internal stakeholders who want status updates and may not understand why the timeline is what it is. The single most common failure mode is trying to migrate during a high-volume sending period (Black Friday week, end-of-quarter campaigns, product launches), where the additional production load makes the migration work twice as expensive and the consequences of a mistake materially larger.

The specific patterns we have seen go wrong in self-driven migrations: rushing the DKIM overlap window to less than seven days (causing in-flight mail to fail alignment); skipping the SPF flattening step because the team thought they were under the 10-lookup limit when they were over it; warming up new IPs during a complaint spike from the old infrastructure (which corrupts the warmup data and makes the new IPs harder to establish); cutting marketing traffic before transactional traffic (inverting the safer order); failing to write the rollback plan before cutover; running the migration alongside ongoing feature work without a dedicated engineer.

09 / Boundaries

What this guide doesn't cover.

The 21-day pattern adapts to Mailgun, Postmark, Amazon SES with small changes. It does not adapt to in-house Postfix or Exim migrations. Three categories of customer should not attempt the 21-day timeline.

Boundaries of applicability. This guide describes migration from SendGrid to dedicated infrastructure. The pattern adapts to migrations from Mailgun, Postmark, Amazon SES, and similar US-domiciled email service providers, with small changes to the discovery and DNS audit specifics. The pattern does not adapt directly to migrations from in-house Postfix or Exim deployments, which involve a different set of considerations (custom queue manager logic, retention reduction, schema migration on bounce-handler systems). The German media case study we published describes one of those engagements and the timeline for that pattern is six months rather than 21 days.

When the 21-day timeline does not fit. Three categories of customer should not attempt this migration in 21 days. First: very high volume senders (above 50 million emails per month) need a longer warmup window because the IP allocation is larger and the per-IP warmup runs in parallel rather than sequentially. Second: customers under active regulatory inquiry need to coordinate the technical migration with the regulatory response timeline, which typically runs slower than 21 days. Third: customers with multi-tenant architectures (where the same SendGrid account services multiple business units with different sending profiles) need to migrate each tenant separately. After the first tenant, the per-tenant timeline is closer to 14 days than 21 because the discovery is already done, but the total still exceeds 21 days.

When to ask for help. If the discovery audit in days 1-3 surfaces something the team does not understand, stop and ask for help. The cost of pausing for a week to bring in a specialist is materially smaller than the cost of attempting to migrate a configuration that nobody on the team fully understands. We are happy to take a 30-minute call to assess whether your specific situation fits this pattern, and we route the conversation to specialists in our network where it does not. The honest answer in some cases is that the customer's situation is straightforward enough to do alone with this guide; in others, the answer is that the situation needs more attention than a guide can provide.

Migration conversation?

About one in three of our inbound conversations comes from a team running this exact migration. The 30-minute call we offer covers the discovery checklist for your specific setup, the realistic timeline given your volume and complexity, and an honest assessment of whether the 21-day pattern fits or whether your situation needs the longer six-to-eight-week version. About one in five calls ends with us recommending the customer run the migration themselves with this guide rather than engaging us; that is the honest answer in those cases.
