Why Automate Deduplication?
Quick Checklist
- Inspect the dataset to spot common issues
- Standardise formats (dates, cases, phone numbers)
- Deduplicate using fuzzy/phonetic matching
- Flag and fill missing values where possible
- Validate results and audit changes
Duplicates silently reduce trust in your data and cost teams time. Manual cleanup is error-prone and does not scale. Automation makes deduplication repeatable, auditable and schedulable.
Sample Dataset and Goals
We will use a contacts sheet with name, email and phone columns. The goal: Find likely duplicate rows, mark them with a confidence score and optionally merge or flag them for review.
Apps Script Workflow (copy-paste)
Drop this script into Extensions → Apps Script in your sheet. The example below reads rows from a Contacts sheet, computes match signals (name normalisation, email similarity and optional phonetic keys) and writes back a match confidence score and suggested merge action in adjacent columns.
Key ideas:
- Generate lightweight match keys for fast pairwise scans.
- Use Flookup (add-on or external matching endpoint) to compute high-quality similarity when available.
- Write suggestions to a review sheet to avoid destructive edits.
- Batch work to stay within Apps Script quotas.
/**
* Dedupe runner: read rows, compute signals, call external matching endpoint for scored matches.
* Adjust column indices to match your sheet. This writes results to columns F..H.
*/
function runDedupe() {
const ss = SpreadsheetApp.getActive();
const sh = ss.getSheetByName('Contacts');
const lastRow = sh.getLastRow();
if (lastRow < 2) return; // nothing to process
const rows = sh.getRange(2,1,Math.max(0,lastRow-1),4).getValues();
const out = [];
for (let i=0;i<rows.length;i++){
const [name,email,phone,notes] = rows[i];
const key = (name||"").toString().toLowerCase().replace(/[^a-z0-9]/g, "");
// naive local score placeholder
let localScore = (email && email.toString().trim()) ? 0.6 : 0.2;
// call external matching endpoint for a better score (example - replace URL and key)
try{
const payload = { record: { name: name||"", email: email||"", phone: phone||"" }, top: 3 };
const res = UrlFetchApp.fetch('https://REPLACE_WITH_MATCHING_ENDPOINT/v1/match', {
method: "POST",
contentType: "application/json",
headers: { "Authorization": "Bearer YOUR_API_KEY" },
payload: JSON.stringify(payload),
muteHttpExceptions: true
});
const json = JSON.parse(res.getContentText()||"{}");
const best = (json.matches && json.matches[0]) ? json.matches[0].score : 0;
const finalScore = Math.max(localScore, best || 0);
out.push([finalScore, best || "", JSON.stringify(json.matches || [])]);
} catch(e) {
out.push([localScore, "", "error"]);
}
}
// write confidence (col F), apiScore (col G), rawMatches (col H)
if (out.length > 0) {
sh.getRange(2,6,out.length,3).setValues(out);
}
}
Integrating with Flookup
Flookup provides higher-quality similarity scores and contextual matching than simple key comparisons. Use the add-on for quick, UI-driven workflows or schedule programmatic deduplication.
When calling the API from Apps Script: Keep requests small (batch 10-50 records), cache responses where possible and store the raw match payload in a review sheet so humans can inspect suggested merges before applying them.
// Example: minimal UrlFetchApp call (already shown in runDedupe above)
// Remember to protect your API key and store it in PropertiesService rather than inlined.
Scheduling, Notifications and Approvals
Add a time-based trigger to run the dedupe regularly. Best practice: Run a conservative scoring pass daily that writes suggestions to a Review sheet, then run a weekly manual approval that applies merges. You can also email a short summary when the number of high-confidence matches exceeds a threshold.
function scheduleDedupe(){
ScriptApp.newTrigger('runDedupe').timeBased().everyDays(1).atHour(2).create();
}
For notifications, generate a CSV summary and email it via MailApp.sendEmail() or integrate with Slack using an incoming webhook.
Testing, Rollback and Metrics
Always run the script on a copy first. Store suggested edits in a separate sheet with a proposed_action column. Keep a change log (timestamp, user, rows affected) and provide a one-click rollback that replays the log to revert changes.
Track metrics: Matches found, merges applied, false positive rate (from manual review). These numbers inform threshold tuning.
Conclusion: The Future of Automated Data Hygiene
Automating deduplication saves time and reduces risk, but start slow. Use suggested matches in a review workflow, monitor false positives and tune thresholds. When ready, automate the merge step for high-confidence matches only.