By: David Taber
Every enterprise system has the risk of data quality problems, but customer relationship management (CRM) and marketing automation systems are incredibly vulnerable to incomplete records, bad data, and duplicate records. Data comes from too many flaky sources, and there aren’t enough incentives for organizations and people to really follow standards. Fortunately, a number of services and tools have arisen to automagically improve data quality (in particular, Salesforce’s Data.com). When it comes to deduping, though, even the best tools out there need to be used with careful attention to detail and the prep work that can take days. We’ll show you how to avoid the main pitfalls here.
The big deal about dupes
Duplicate records are dangerous to system credibility because users can’t find the updates that they’ve made (they’re looking at the wrong copy of the record) and each of the dupes represents an incomplete record with multiple data quality issues. The longer a dupe exists, the more the data quality issues metastasize, making rectification ever more costly.
Duplicate records are indicators of process problems, either in user training or because there something that makes the original record invisible or unusable for that user in a specific stage of the process. Dupes are also key indicators of system problems (Is the weekly upload not using fuzzy matching? Is this integration not properly checking for existing records? Is that trigger using the right logic to check for and resolve dupes before doing a record update?)
I hope I’ve scared you about this issue, because fixing it can require a surprising amount of time.
Prevention is the best cure
The starting and ending point of any deduping project is this: finding defects in processes and systems to prevent the creation of dupes in the first place. Step one is to identify which process and system elements are creating which flavor of dupe (and yes, you will have to do pattern recognition on the particular types of dupe you’re experiencing). Once you’ve done this, you need to have a brief checklist of situations to flag, with solution strategies for each one.
What do you do when two people share the same email address?
What do you do when a person has 10 email addresses?
What do you do when one company you do business with acquires another one that’s a prospect?
What do you do with international divisions of conglomerates, some handled by your sales team and others handled by a channel partner?
These checklists of business rules need to go to your relevant developers and consultants so they know what to look for and build in handlers in their code.
There are tools that help prevent dupes, but they are absolutely not created equal. For imports, you’re looking for five main attributes:
Compatibility with your file formats (mostly, this is just XLS and CSV)
Compatibility with the object/table you’re working with (there’s more to life than just leads and contacts)
Matching on more than just a couple of fields
Fuzzy matching of most fields
Ability to configure the fuzzy parameters (hopefully, for each field)
For records created by real-time processes, there’s one additional attribute that has to be mentioned: how does the tool behave when a dupe is created by code or via a web-service call (e.g., SOAP or ReST)? Most tools block creation of such records, which might seem OK…that is, until you try to deploy a bug-fix and discover that you can’t because some piece of test code creates a duplicate record and prevents you from pushing anything into production. Sometimes, the only choice is to temporarily disable the dupe blocker, do the push, and then remember to re-enable the blocking system. Your developers and admins will thank you for buying a dupe preventer that is more forgiving and/or flexible in error handling.
There are three key thoughts before I go on. First, the best approach to processing imports is to do the deduping outside the system (before they get in)…and conversely, the best approach to removing existing dupes from the database is to do that inside the system (any dedupe sequence that requires an export-process-import cycle is almost certainly wrong). Second, there’s no universal answer to the question, “should we do data enrichment before the merge cycle, or after.” But typically, doing it after the merge is better (and it’s always safer and less work). And finally, the third thought that will drive Mac lovers nuts: you really want to be using the best tools, and those only run on a PC (even if you run Parallels on a Mac, you’ll need to get a PC keyboard and make sure everything is mapped properly). Also, it’s best if you have two or even three screens for doing the heavy lifting.
Where to start
The only place to start is to stop and think. Because you can’t do anything good without developing a strategy to fit your specific situation. And there are several situations where something that looks like a dupe actually isn’t one. Here are some examples:
Companies that have the same name because they’ve been abbreviated (for example, Ford for Ford Motor Company and Ford for Ford Modeling Agency). Solution strategy: use official company names in the system-standard Account Name field to avoid confusion, and add a custom field on the company record to reflect “friendly” or “street” names.
Companies with tons of divisions (think Mitsubishi or Tyco). Solution strategy: keep all the divisions as companies, but name them more appropriately and make the divisions as child accounts of the corporate parent (I like naming the parent like this: Company [stock symbol]).
Companies with tons of international operations (think Volkswagen of America vs. Volkswagen do Brasil vs. Volkswagen de France). Solution strategy: name the companies completely and make them children of the “mother ship” (Volkswagenwerke AG).
People who have left one company and joined another. Solution strategy: either rename the person as Joe Blow [former employee] or merge the two Joe Blows and add a Contact Role to Joe’s former employer indicating the former employee status (note that these choices have different implications about the way you manage your long-term relationships, so think through which is the closest match for how you want to see relationships over several years).
Contracts that are obsolete or that went through several iterations before signature. If these need to be kept (and typically, old contracts should be kept for 7 years), the solution strategy should be to rename the obsolete ones to reflect their status (e.g., “BigCompany – 2015 replenishment [concluded]”).
Cases that were re-opening a supposedly-resolved issue. Solution strategy: leave the cases (as “re-opens” are an important indicator of service quality), but rename the earlier avatars and provide a pointer to the final case in the series.
Once you have confirmed that you really have dupes, the next step is to develop a strawman set of criteria for the dupe-detection cycle. Some tools call this a “scenario,” others “matching rules,” but no matter what the nomenclature, the idea is to set up an algorithm that will catch 80 percent of the real dupes but have zero false positives (be willing to compromise on the 80 percent, but you have to have a zero tolerance policy for false positives at this point). The algorithm has to identify the “winner” record (the one that will survive), but errors on this detail aren’t too impactful. The best tools use a scoring system for determining the winner, so you can tune the weighting of various fields. You may discover the need to create new formula fields on the object to properly identify the winner (for example, isVP and isStrategic Booleans to make sure that important records get routed to important people).
Do not use the strawman criteria until you’ve tested them with a good-sized sample (using spreadsheets) and validated the results with the people who will be affected by the merges (e.g., marketing for Leads, sales for Contacts, Accounts, and Opportunities, and service for Cases and Contracts). If you have a full sandbox, use it for your tests! Of course, that won’t let you see the merge repercussions with integrated systems, but it’ll get you a long way.
Once you are satisfied with the strategies that capture dupes, you need to choose the merge parameters. The best tools let you select things like:
Does a “winner” record’s blank values over-write a loser’s non-blank value?
For text fields, does the loser’s text get concatenated to the end of the winner’s text?
Are exceptions possible, where certain “loser” records’ fields trump the winners’?
The merge process in Salesforce.com (SFDC) preserves all “related lists” (child or related records) from both of the merged records. But for the fields within the two records being merged, only one value per field can be preserved. If you need to preserve or concatenate data from the “dying” record into the surviving one, you’ll have to do that with some supplemental steps after the fact (which means you need to do some prep work before the merge).
Integrated systems and plug-ins
This is where things get hairy. Let’s start with the simple stuff: plug-ins. Most plug-ins installed in SFDC don’t really mind deduping and probably won’t corrupt anything when you merge. But read that word “most” – you must test with your particular configuration of plug-ins to make sure nothing nasty happens.
Things get tougher with external systems, which may be integrated via a plug-in or a SOAP/ReST connector. The most valuable integrations are the trickiest:
Marketing automation (MA) systems probably won’t pay attention when you merge most tables, but may throw a hissy-fit if you merge Leads and Contacts from within SFDC. The smarter MA systems will probably have their own merge features, so you’ll need to do those merges on the MA side of the integration. If your MA system doesn’t, contact your vendor’s support (or even consulting) team to ask for guidance.
Accounting, billing, collections, e-commerce and other business operations systems probably will ignore the merges on most tables, but will freak out if you try to merge Accounts, closed Opportunities, Contracts, or related records. The external system may block the merge or may instantly recreate the dupe or all kinds of surprising things. The only way to know what’s going to happen is to read the vendor’s docs, reach out to their support/consulting team and do a test with dummy records in your production system. Note the word and in that sentence: you need to do all three steps with each system.
Enterprise system buses, integration hubs and the like are the unpredictable Frankenstein because you probably don’t know what all the end-points are. In addition to the first two bullets’ considerations, you need to think about what will happen to data warehouses, web sites/portals, and even cash registers. Test with some fake records in production, then wait 48 hours and look at the error logs of the integration hub and key applications before doing anything with data of any value.
What if testing determines you can’t merge certain kinds of records? There are two general solutions: one is using record ownership fields and SFDC roles/sharing privileges to hide the dupes from most users (the admins and other apps must still see them), and the other is to modify the records’ Name field to indicate the “ignore this one” status (for example, “Company XYZ [do not use]” or “▼Joe Blow”) in reports and user screens.
Danger Will Robinson!
Deduping is such a tricky business that I have to stress these best practices:
Never start a deduping session before completing a full system backup (all tables, really, every merge cycle).
Do not allow regular users to do merges.
Do not deploy triggers or classes that do merges unless you have done a code review to understand all the conditions under which the logic may fire, and make sure that the repercussions of such automated merges (including reverberations from systems integrated with SFDC) are reasonable.
Almost never merge accounts.
If you’re doing merges of any size, always do them in Sandbox first… and look for error messages or any unexpected behaviors.
If you’re doing merges of any size in production, it’s typically best to do them when the system is quiesced (read: nights and weekends) and it may be a good idea to turn off relevant flows, workflows and triggers; consider turning off relevant scheduled APEX jobs as well.
Do merges in small batches (typically less than 10 percent of the entire set of candidates in each pass) and pause a few hours to review the impact of the merges (listen for panicked calls from someone who’s lost their pet records or found some reason why your solution strategy is bogus).
Document the strategies, parameters, thresholds and procedures used in each deduping session. Do not throw away log files that are generated by your deduping tool.
Never do merges when you are stressed out, pressed for time or just plain tired.