Salesforce · 13 min read

Gigantic Data Loads in Salesforce

How to handle massive data migrations into Salesforce — choosing the right tools, preventing data skew, ensuring data quality, and testing strategies for large-scale loads.

Part 107: Gigantic Data Loads in Salesforce

Welcome back to the Salesforce series. At some point in every Salesforce implementation, you will face a data load that is not small. Maybe it is 500,000 Account records from a legacy CRM. Maybe it is 15 million Case history records from a system that has been running for a decade. Maybe it is a nightly sync of 2 million Opportunity line items from an ERP.

Whatever the scenario, loading large volumes of data into Salesforce is a different discipline than loading a few thousand records. The tools change, the risks change, and the planning required goes up dramatically. This post covers how to approach gigantic data loads — the tools, the data quality concerns, the skew traps, and the environments you need to test safely before going live.


Tools for Large Data Loads

Not all data loading tools are built for the same scale. Choosing the wrong tool for a large migration is one of the most common mistakes teams make.

Salesforce Data Loader

Data Loader is Salesforce’s free, official desktop tool. It supports both the SOAP API and the Bulk API. For large loads, you always want to use the Bulk API setting. The SOAP API processes records in batches of 200, which is painfully slow at scale. The Bulk API processes batches of up to 10,000 records asynchronously, which is a massive improvement.

Data Loader is a solid choice for loads up to about 5 million records. Beyond that, you start running into limitations around error handling, restartability, and monitoring. If your load fails at record 3.2 million, Data Loader does not make it easy to pick up where you left off without manual work.

Key settings for large loads in Data Loader:

  • Enable Bulk API mode (Settings > Use Bulk API)
  • Set batch size to 10,000 (the Bulk API maximum)
  • Enable serial mode if you are hitting lock errors — this processes batches one at a time instead of in parallel
  • Note that Insert Null Values is not supported in Bulk API mode; if your source data intentionally has blanks that should overwrite existing values, put #N/A in those CSV fields instead

Informatica Cloud Data Integration

Informatica is the enterprise-grade ETL tool most commonly used with Salesforce. It handles complex transformations, supports millions of records, provides detailed logging, and integrates with virtually any source system. It uses the Bulk API under the hood but wraps it with orchestration, scheduling, and error recovery features that Data Loader lacks.

Informatica shines when you have ongoing data syncs (not just one-time migrations), when you need to pull from multiple source systems, or when your data requires significant transformation before it hits Salesforce. The downside is cost and complexity — Informatica is not cheap, and it requires dedicated skills to configure properly.

Jitterbit

Jitterbit sits between Data Loader and Informatica in terms of capability. It supports the Bulk API, handles transformations, and provides a visual mapping interface. It is a good mid-market option when you need more than Data Loader but do not need the full weight of Informatica or MuleSoft. Jitterbit also has a Salesforce-specific connector that simplifies common patterns like upserts and relationship mapping.

MuleSoft

MuleSoft is Salesforce’s own integration platform (acquired in 2018). It is API-led, meaning it is designed around building reusable API layers rather than point-to-point data loads. For large migrations, MuleSoft can orchestrate complex multi-step loads with built-in error handling, retry logic, and monitoring through Anypoint Platform.

MuleSoft is the right choice when the data load is part of a larger integration architecture — when you are not just loading data once but building a permanent pipeline. For a one-time migration, MuleSoft is often overkill unless your organization already has it deployed.

Custom Bulk API Scripts

Sometimes the best tool is one you build yourself. The Salesforce Bulk API 2.0 is a REST-based API that accepts CSV data and processes it asynchronously. You create a job, upload your CSV, and poll for results. The API handles chunking internally and supports up to 150 MB per file upload.

Custom scripts make sense when you need fine-grained control over retry logic, when your transformation logic is complex enough that a visual tool becomes unwieldy, or when you need to load data from a source that does not have a pre-built connector. Python with the simple-salesforce library or Node.js with jsforce are common choices.
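
To make the job lifecycle concrete, here is a minimal Python sketch of a Bulk API 2.0 ingest job using the requests library. The instance URL, API version, access token, and file name are placeholders; a real script needs proper OAuth authentication, logging, and retry handling.

```python
import time
import requests

# Placeholders: supply your own instance URL and a valid OAuth access token.
INSTANCE = "https://yourInstance.my.salesforce.com"
API = f"{INSTANCE}/services/data/v59.0/jobs/ingest"
HEADERS = {"Authorization": "Bearer <ACCESS_TOKEN>", "Content-Type": "application/json"}

# 1. Create an ingest job for the target object and operation.
job = requests.post(API, headers=HEADERS, json={
    "object": "Account",
    "operation": "insert",
    "contentType": "CSV",
    "lineEnding": "LF",
}).json()
job_id = job["id"]

# 2. Upload the CSV payload; the API chunks it into batches internally.
with open("accounts.csv", "rb") as f:
    requests.put(
        f"{API}/{job_id}/batches",
        headers={**HEADERS, "Content-Type": "text/csv"},
        data=f,
    ).raise_for_status()

# 3. Mark the upload complete so Salesforce starts processing.
requests.patch(f"{API}/{job_id}", headers=HEADERS, json={"state": "UploadComplete"})

# 4. Poll until the job reaches a terminal state, then pull the failure report.
while True:
    state = requests.get(f"{API}/{job_id}", headers=HEADERS).json()["state"]
    if state in ("JobComplete", "Failed", "Aborted"):
        break
    time.sleep(30)

failed = requests.get(f"{API}/{job_id}/failedResults/", headers=HEADERS)
print(state, failed.text[:500])  # failed rows come back as CSV with error messages
```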

Bulk API 2.0 limits to know:

  • 150 MB maximum CSV size per job, measured after base64 encoding (plan on roughly 100 MB of raw CSV)
  • 10,000 records per batch (the API chunks internally)
  • 15,000 batches per rolling 24-hour period (an allocation shared between Bulk API and Bulk API 2.0; each internal chunk of up to 10,000 records consumes one batch)
  • Each internal batch gets up to 10 minutes of processing time; batches that exceed it are requeued and retried before the job is marked failed

Data Quality Considerations

Loading 10 million records of bad data is worse than loading nothing at all. Expect data quality work to consume a large share of your total migration effort; experienced teams often budget 40% or more. Here are the areas that matter most.

Deduplication

Before you load anything, deduplicate your source data. Salesforce has built-in duplicate rules, but they are not designed to catch duplicates during a Bulk API load — duplicate rules do not fire in Bulk API mode by default. You need to deduplicate before you load, or you will end up with thousands of duplicate records that are painful to clean up after the fact.

Use your source system’s unique identifiers to deduplicate. If you are migrating Accounts, match on a combination of company name (normalized), address, and phone number. If you have a legacy system ID, bring it over as an External ID field in Salesforce — this gives you a key for upsert operations and makes incremental loads much simpler.
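
As a sketch of what that looks like in practice, assuming a pandas workflow and hypothetical source columns (CompanyName, Phone, LastModified), you can build a normalized match key and split duplicates out for manual review:

```python
import pandas as pd

accounts = pd.read_csv("legacy_accounts.csv")  # hypothetical source export

# Build a normalized match key: lowercased name plus digits-only phone.
accounts["match_key"] = (
    accounts["CompanyName"].fillna("").str.lower().str.strip()
    + "|"
    + accounts["Phone"].fillna("").astype(str).str.replace(r"\D", "", regex=True)
)

# Keep the most recently modified record per key (assumes ISO-format dates);
# everything else goes to a review file instead of being silently dropped.
accounts = accounts.sort_values("LastModified", ascending=False)
dupes = accounts[accounts.duplicated("match_key", keep="first")]
clean = accounts.drop_duplicates("match_key", keep="first")

dupes.to_csv("dupes_for_review.csv", index=False)
clean.to_csv("accounts_deduped.csv", index=False)
```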

Data Cleanup

This is the unglamorous work that saves you from disaster. Common cleanup tasks include:

  • Standardize picklist values. Your source system might have “US”, “USA”, “United States”, and “U.S.” for the same country. Salesforce picklists need one consistent value.
  • Trim whitespace. Leading and trailing spaces in text fields cause matching failures and look wrong in the UI.
  • Validate email formats. Salesforce will reject records with invalid email addresses. Catch these before you load, not during.
  • Normalize phone numbers. Decide on a format and apply it consistently. Salesforce does not enforce a phone format, but inconsistent data makes reporting unreliable.
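
These tasks script well. Here is a pandas sketch of all four; the column names (Country, Phone, Email) and the country mapping are illustrative, not a complete rule set:

```python
import pandas as pd

df = pd.read_csv("accounts_deduped.csv")

# Collapse country variants onto the single picklist value Salesforce expects.
COUNTRY_MAP = {"US": "United States", "USA": "United States", "U.S.": "United States"}
df["Country"] = df["Country"].str.strip().replace(COUNTRY_MAP)

# Trim leading and trailing whitespace on every text column in one pass.
text_cols = df.select_dtypes("object").columns
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())

# Normalize phone numbers to digits (keeping a leading +) before formatting.
df["Phone"] = df["Phone"].str.replace(r"[^\d+]", "", regex=True)

# Flag invalid emails for correction instead of letting Salesforce reject rows mid-load.
bad_email = df["Email"].notna() & ~df["Email"].str.match(
    r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False
)
df[bad_email].to_csv("invalid_emails.csv", index=False)
df[~bad_email].to_csv("accounts_clean.csv", index=False)
```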

Field Mapping

Document every field mapping between your source system and Salesforce. This is not optional for large loads. You need a spreadsheet that maps every source field to a Salesforce field, notes the data type, the max length, whether the field is required, and any transformation logic.

Pay special attention to:

  • Text field length limits. A standard text field in Salesforce maxes out at 255 characters. Long text areas max out at 131,072 characters. If your source data exceeds these limits, you need to truncate or split.
  • Rich text vs plain text. If you are loading HTML content into a rich text field, make sure it is valid HTML. Malformed HTML will render poorly.
  • Lookup relationships. You cannot load a lookup field with arbitrary text — you need the Salesforce record ID of the related record (or a reference through an External ID field on the related object). In practice, that means loading parent records first, extracting their IDs, and then loading child records with those IDs mapped.
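
The parent-then-child pattern in that last point is worth sketching. Assuming simple-salesforce and an External ID field named Legacy_ID__c (both assumptions; substitute your own library and field names), the ID-mapping pass looks like this:

```python
import pandas as pd
from simple_salesforce import Salesforce

sf = Salesforce(username="user@example.com", password="...", security_token="...")

# After the Account load, pull every Salesforce Id keyed by the legacy External ID.
result = sf.query_all("SELECT Id, Legacy_ID__c FROM Account WHERE Legacy_ID__c != null")
id_map = {r["Legacy_ID__c"]: r["Id"] for r in result["records"]}

# Stamp the real Salesforce record Ids onto the child rows before loading Contacts.
contacts = pd.read_csv("legacy_contacts.csv")
contacts["AccountId"] = contacts["LegacyAccountId"].map(id_map)

# Orphans (children whose parent never loaded) need investigation, not loading.
contacts[contacts["AccountId"].isna()].to_csv("orphan_contacts.csv", index=False)
contacts.dropna(subset=["AccountId"]).to_csv("contacts_ready.csv", index=False)
```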

Date and Timezone Handling

This is where migrations silently go wrong. Salesforce stores DateTime fields in UTC. When you load a DateTime value, Salesforce interprets it based on the format you provide. If your source system stores dates in local time without timezone information, you need to know which timezone those dates represent and convert them to UTC before loading.

Date fields (without time) are not timezone-sensitive — they store the date as-is. But DateTime fields will shift based on timezone interpretation. A record created at “2025-12-31 23:00:00” in US Eastern time is actually “2026-01-01 04:00:00” in UTC. If you load the Eastern time value without conversion, your users will see the wrong date.

Always use ISO 8601 format for dates in your CSV files: 2025-12-31T23:00:00.000Z for UTC, or 2025-12-31T23:00:00.000-05:00 for a specific offset.
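
A small standard-library example of the conversion above (the source format and timezone are assumptions about your legacy system):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# The legacy system stored local US Eastern timestamps with no offset info.
raw = "2025-12-31 23:00:00"
local = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").replace(
    tzinfo=ZoneInfo("America/New_York")
)

# Convert to UTC and emit the ISO 8601 form Salesforce expects in CSV loads.
utc = local.astimezone(timezone.utc)
print(utc.strftime("%Y-%m-%dT%H:%M:%S.000Z"))  # 2026-01-01T04:00:00.000Z
```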


Preventing Data Skew

Data skew is one of the most dangerous performance problems in Salesforce, and large data loads are exactly when it gets introduced. Skew means an uneven distribution of records that causes locking, slow queries, and degraded performance for your entire org.

Account Skew (Parent Skew)

Account skew occurs when a single Account has an extremely large number of child records — typically Contacts, Opportunities, or Cases. Salesforce’s recommendation is to keep child record counts below 10,000 per parent Account. Beyond that threshold, you will start seeing record lock errors on concurrent updates and slow page loads when users open that Account.

How to prevent it during migration:

  • Identify accounts in your source data that have more than 10,000 child records. If a legacy “catch-all” account has 500,000 Cases attached to it, you need a strategy — either split that account, archive old records, or use a different object model.
  • If you must load a parent with many children, load the children in serial mode to reduce lock contention.
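
Finding those catch-all parents before the load is a one-line groupby. A sketch, assuming a hypothetical LegacyAccountId column in the child export:

```python
import pandas as pd

cases = pd.read_csv("legacy_cases.csv")  # hypothetical child-record export

# Count children per parent and flag anything over the 10,000-child guideline.
counts = cases.groupby("LegacyAccountId").size().sort_values(ascending=False)
skewed = counts[counts > 10_000]
print(skewed)  # these parents need a split or archive strategy before loading
```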

Ownership Skew

Ownership skew occurs when a single user owns more than 10,000 records of the same object. This is common during migration when teams assign all migrated records to a single integration user or a default owner. The problem is that Salesforce’s sharing architecture creates implicit share records based on ownership, and calculating sharing for a user with hundreds of thousands of records is expensive.

How to prevent it during migration:

  • Distribute ownership across multiple users. If you need a temporary owner during migration, create multiple integration users and round-robin ownership across them.
  • After migration, reassign records to their correct owners using a batch process.
  • If your org uses private sharing, consider switching to public read-only during migration and then reverting — but only if the security implications are acceptable.
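
A minimal round-robin sketch for the first point, assuming you have pre-created several integration users (the 005-prefixed Ids are placeholders):

```python
from itertools import cycle

import pandas as pd

records = pd.read_csv("accounts_clean.csv")

# Rotate ownership across several pre-created integration users; create enough
# users that each ends up under the ~10,000-record ownership-skew threshold.
INTEGRATION_USERS = ["005XXXXXXXXXXXA", "005XXXXXXXXXXXB", "005XXXXXXXXXXXC"]
owners = cycle(INTEGRATION_USERS)
records["OwnerId"] = [next(owners) for _ in range(len(records))]
records.to_csv("accounts_with_owners.csv", index=False)
```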

Lookup Skew

Lookup skew occurs when many records reference the same value in a lookup field. For example, if 200,000 Contacts all have the same Account in a lookup, updates to any of those Contacts require a lock on the parent Account. This is similar to Account skew but applies to any lookup relationship.

How to prevent it during migration:

  • Audit your lookup fields before loading. Identify any lookup values that will be referenced by more than 10,000 records.
  • For custom lookup fields that are not required, consider loading the records first without the lookup populated, then updating the lookup in a separate pass using serial mode.
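
The audit in the first bullet can run against the raw source file with nothing but the standard library (the file and column names are placeholders):

```python
import csv
from collections import Counter

# Count how many rows reference each value of a lookup column in the source.
with open("legacy_contacts.csv", newline="") as f:
    counts = Counter(row["LegacyAccountId"] for row in csv.DictReader(f))

# Anything referenced by more than ~10,000 rows is a lookup-skew candidate.
for value, n in counts.most_common():
    if n <= 10_000:
        break
    print(f"{value}: {n} referencing rows")
```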

Testing Environments for Large Data Loads

You should never run a large data load directly into production without testing it first. But not all sandboxes are suitable for testing large loads.

Full-Copy Sandbox

A full-copy sandbox is the gold standard for testing large data migrations. It contains a complete copy of your production data and metadata, which means you can test your load against realistic data volumes and see how it interacts with existing records, triggers, flows, and validation rules.

Full-copy sandboxes have limitations though. They can only be refreshed once every 29 days, and they take time to provision — sometimes 24 hours or more for large orgs. Plan your testing cycles around this refresh window.

Use a full-copy sandbox to test:

  • The complete end-to-end load process with production-scale data
  • Performance under realistic conditions (triggers, flows, sharing rules all active)
  • Data skew impact with real record distributions
  • Rollback procedures — practice deleting loaded records and verifying you can recover
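
For the rollback rehearsal in particular, the External ID gives you a clean handle on everything the migration created. A sketch using simple-salesforce against a sandbox (Legacy_ID__c and the result-dict shape are assumptions to verify against your own setup):

```python
from simple_salesforce import Salesforce

# domain="test" routes the login to a sandbox rather than production.
sf = Salesforce(username="user@example.com.full", password="...",
                security_token="...", domain="test")

# Everything this migration created carries the External ID; find and delete it.
loaded = sf.query_all("SELECT Id FROM Account WHERE Legacy_ID__c != null")
ids = [{"Id": r["Id"]} for r in loaded["records"]]
if ids:
    results = sf.bulk.Account.delete(ids)
    failures = [r for r in results if not r.get("success")]
    print(f"deleted {len(ids) - len(failures)}, failed {len(failures)}")
```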

Partial Sandbox Strategy

If you do not have access to a full-copy sandbox (or you are between refresh cycles), you can use a partial-copy sandbox with a strategy. Partial sandboxes copy metadata and a configurable subset of data — up to 5 GB, with the objects and record sample controlled by a sandbox template.

The approach is:

  1. Load your migration data into the partial sandbox at reduced volume — typically 10% of the full load.
  2. Validate field mappings, transformations, and relationship integrity.
  3. Test all triggers, flows, and validation rules fire correctly on the loaded data.
  4. Measure API consumption and batch processing times, then extrapolate to full volume.

This will not catch every performance issue that appears at full scale, but it will catch the majority of mapping errors, transformation bugs, and automation conflicts that would otherwise surface in production.
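
Step 1's reduced-volume sample should preserve referential integrity: sample the parents, then keep only the children of sampled parents. A sketch with assumed column names:

```python
import pandas as pd

accounts = pd.read_csv("accounts_clean.csv")
contacts = pd.read_csv("contacts_ready.csv")

# Sample 10% of parents, then keep only children of the sampled parents so
# the test load still exercises relationship mapping end to end.
sample_accounts = accounts.sample(frac=0.10, random_state=42)
keep = set(sample_accounts["Legacy_ID"])
sample_contacts = contacts[contacts["LegacyAccountId"].isin(keep)]

sample_accounts.to_csv("sandbox_accounts.csv", index=False)
sample_contacts.to_csv("sandbox_contacts.csv", index=False)
```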

Developer Sandbox as a Smoke Test

Developer sandboxes contain no production data, but they are free and refresh quickly (once per day). Use them for quick smoke tests of your load scripts — verifying CSV formats, field mappings, and basic automation behavior. Do not rely on them for performance or skew testing.


Section Notes

Large data loads are project management problems as much as technical problems. Here are the things that experienced teams always do:

  • Build a runbook. Document every step of the load process — the order of operations, the expected record counts, the estimated time for each step, and the rollback procedure. When you are loading at 2 AM on a go-live weekend, you do not want to be figuring things out on the fly.
  • Disable unnecessary automation. Before loading, deactivate triggers, flows, validation rules, and workflow rules that are not required for the migration. Every automation that fires on insert adds processing time and API consumption. Re-enable them after the load completes.
  • Use External IDs. Always create an External ID field for your source system’s primary key. This enables upsert operations, makes incremental loads idempotent, and gives you a reliable way to trace records back to their source.
  • Load in dependency order. Parent objects first (Accounts), then child objects (Contacts, Opportunities), then grandchild objects (Opportunity Line Items, Cases). This ensures lookup fields can be populated correctly.
  • Monitor governor limits. If you have triggers or flows that fire during the load, monitor API call consumption, batch processing queue depth, and async job limits. The Bulk API itself counts against your org’s 24-hour API call limit, and each batch of 10,000 records counts as one API call.
  • Plan for failure. Your load will not work perfectly the first time. Build in time for error analysis, data correction, and reload cycles. A common pattern is to load, review the error file, fix the source data, and reload only the failed records using the External ID for matching (a sketch of this follows the list).
  • Communicate broadly. Large data loads affect the entire org. Users will see new records appearing, reports will show different numbers, and dashboards will change. Notify stakeholders before the load, provide a timeline, and have a plan for questions.
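
That reload cycle scripts naturally too. A sketch assuming Data Loader's error CSV format, which appends an ERROR column to your original rows (verify the exact format your tool produces):

```python
import pandas as pd

# Data Loader writes an error file: the original columns plus an ERROR column.
errors = pd.read_csv("error042.csv")  # hypothetical error-file name
print(errors["ERROR"].value_counts().head(10))  # triage failure reasons first

# After fixing the source issues, reload only the failed rows. Upserting on
# the External ID keeps the retry idempotent: already-loaded rows just update.
errors.drop(columns=["ERROR"]).to_csv("retry_batch.csv", index=False)
```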

Large data loads are stressful, but they are manageable with the right preparation. Choose the right tool for your volume, clean your data thoroughly before loading, watch for skew, test in a realistic environment, and document everything. The teams that struggle with migrations are almost always the ones that skipped the preparation and tried to brute-force their way through.