Hybrid Fuzzy Matching With Embeddings in Google Sheets and Excel
Key Takeaways
- Hybrid matching combines the lexical efficiency of traditional fuzzy logic with the semantic understanding of embeddings to optimise cost and accuracy.
- By generating a small candidate list first, you minimise API calls to expensive embedding services while maintaining high-precision results.
- Implementation in Google Sheets via Apps Script or Excel via Office Scripts enables scalable, low-friction data deduplication directly within spreadsheet environments.
- Flookup Data Wrangler provides the lexical fuzzy matching pass, text normalisation and confidence scoring that form the first stage of this hybrid pipeline, all within Google Sheets without custom code.
- Proactive candidate reduction, batch processing and caching are critical to managing costs and performance at scale.
Quick Checklist
| Step | Action | Why It Matters |
|---|---|---|
| 1 | Normalise text fields in the source sheet | Consistent casing and formatting improve both lexical and semantic match quality |
| 2 | Run a lexical fuzzy pass for initial filtering | Quickly discard clear non-matches before invoking expensive embedding calls |
| 3 | Generate embeddings for shortlisted candidates | Capture semantic meaning where surface-level character overlap is low |
| 4 | Score semantic similarity against a confidence threshold | Translate embedding distance into an actionable match or reject decision |
| 5 | Merge hybrid scores into the final match verdict | Combine lexical and semantic signals for higher overall accuracy |
Why Hybrid Matching?
Classical fuzzy matching remains efficient for typographical errors and close variants; embeddings capture semantic similarity across phrasing, abbreviations and synonyms. Combining both reduces calls to embedding services while preserving precision. The hybrid approach is particularly suitable for spreadsheet audiences who require low-friction, low-cost integration.
A practical scenario: you are comparing supplier names from a procurement database. A simple edit-distance check catches "Walmart Inc" versus "WalMart Inc". But "Walmart Inc" and "Wal Mart Stores" share little character overlap. An embedding approach maps both to the same semantic region, returning a high match score. By running the lexical filter first, you only pay for embedding API calls on the small subset of candidates that survive the initial pass.
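To make the lexical half of that scenario concrete, here is a minimal sketch of a normalised edit-distance score. It uses a generic Levenshtein implementation, which is an assumption for illustration and not Flookup's own algorithm; the function names `levenshtein` and `lexicalSimilarity` are hypothetical.

```javascript
// Classic Levenshtein distance using two rolling rows of the DP table.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Scale the distance into a 0..1 similarity (1 means identical strings).
function lexicalSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

const typo = lexicalSimilarity('Walmart Inc', 'WalMart Inc');          // distance 1
const rephrased = lexicalSimilarity('Walmart Inc', 'Wal Mart Stores'); // distance 8
console.log(typo > rephrased); // true
```

The typo pair scores well above the rephrased pair, which is why the second pair needs the semantic pass to be recognised as a match.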
High-level Pattern
- Normalise text in the sheet (for example with Standardize Data).
- Use Flookup tools such as Match and Merge or Remove Duplicates to produce a short candidate list per row.
- Compute embeddings for candidates only and score with a semantic service.
- Combine semantic score with a lexical check (for example Compare Text) and apply thresholds to auto-accept, reject or flag for review.
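The final step of the pattern above can be sketched as a small decision function. This is one possible approach, not a prescribed formula: it assumes both scores are already scaled to the 0 to 1 range, and the weights and thresholds in `DEFAULTS` are hypothetical values that should be tuned against a labelled sample of your own data.

```javascript
// Hypothetical weights and thresholds; tune them for your data.
const DEFAULTS = { lexicalWeight: 0.4, semanticWeight: 0.6, accept: 0.85, reject: 0.5 };

// Blend the two scores and map the result to accept / reject / review.
function decideMatch(lexicalScore, semanticScore, options = {}) {
  const cfg = { ...DEFAULTS, ...options };
  const combined =
    cfg.lexicalWeight * lexicalScore + cfg.semanticWeight * semanticScore;
  let decision = 'review'; // anything between the two thresholds goes to a human
  if (combined >= cfg.accept) decision = 'accept';
  else if (combined < cfg.reject) decision = 'reject';
  return { combined, decision };
}

console.log(decideMatch(0.95, 0.95).decision); // accept
console.log(decideMatch(0.47, 0.9).decision);  // review
console.log(decideMatch(0.2, 0.3).decision);   // reject
```

Keeping a middle "review" band, rather than a single cut-off, sends borderline pairs to manual triage instead of forcing an automatic verdict.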
Google Sheets via Apps Script
The snippet below is a minimal Apps Script that reads an ID, a query string and a JSON array of candidates from columns A to C of a sheet named Matches, sends them in batches to a hypothetical embedding match endpoint and writes the returned matches and top score to columns D and E. Adapt the endpoint and authentication to the chosen provider.
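function batchSemanticMatch() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheetByName('Matches');
  var rowCount = sheet.getLastRow() - 1; // data rows below the header
  if (rowCount < 1) return; // getRange throws on a zero-row range
  // Columns A-C: id, query text, JSON array of candidate strings.
  var data = sheet.getRange(2, 1, rowCount, 3).getValues();
  var batchSize = 50;
  for (var i = 0; i < data.length; i += batchSize) {
    var batch = data.slice(i, i + batchSize);
    var payload = batch.map(function (r) {
      return {id: r[0], query: r[1], candidates: JSON.parse(r[2])};
    });
    // Hypothetical endpoint; add the provider's authentication header here.
    var resp = UrlFetchApp.fetch('https://your-embedding-service.example/v1/match', {
      method: 'post',
      contentType: 'application/json',
      payload: JSON.stringify({items: payload, top_k: 5}),
      muteHttpExceptions: true
    });
    if (resp.getResponseCode() !== 200) {
      // Log and skip the failed batch instead of dropping it silently.
      Logger.log('Batch starting at row ' + (i + 2) + ' failed: HTTP ' + resp.getResponseCode());
      continue;
    }
    // Assumes the service returns one result per item, in request order.
    var results = JSON.parse(resp.getContentText());
    var output = results.map(function (r) {
      return [JSON.stringify(r.matches), r.top_score];
    });
    // Write columns D-E for the whole batch in a single call.
    if (output.length > 0) sheet.getRange(i + 2, 4, output.length, 2).setValues(output);
  }
}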
Notes:
- Batching reduces network overhead and avoids hitting per-request rate limits.
- Store and cache computed embeddings where records are static to avoid recomputation.
- Use Schedule Functions or Apps Script triggers for periodic rechecks.
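The caching note above can be illustrated with a minimal in-memory sketch. The names `normaliseKey`, `createEmbeddingCache` and `fakeEmbed` are hypothetical, and `fakeEmbed` is only a stand-in for a real embedding call. In Apps Script you would likely back this with a persistent store, such as a hidden sheet or `CacheService`, rather than a `Map` that disappears when the script ends.

```javascript
// Collapse case and whitespace so trivially different strings share one key.
function normaliseKey(text) {
  return String(text).trim().toLowerCase().replace(/\s+/g, ' ');
}

// Wrap an embedding function so each distinct normalised string is embedded once.
function createEmbeddingCache(embedFn) {
  const store = new Map();
  let missCount = 0;
  return {
    get(text) {
      const key = normaliseKey(text);
      if (!store.has(key)) {
        missCount += 1;              // a miss corresponds to one paid API call
        store.set(key, embedFn(key));
      }
      return store.get(key);
    },
    misses() {
      return missCount;
    }
  };
}

// Stand-in embedder for demonstration only; replace with a real service call.
const fakeEmbed = (text) => [text.length, text.split(' ').length];

const cache = createEmbeddingCache(fakeEmbed);
cache.get('Walmart Inc');
cache.get('  walmart   INC ');
console.log(cache.misses()); // 1
```

Normalising before the lookup means the cache also absorbs casing and spacing variants, which are common in hand-entered spreadsheet data.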
Excel via Office Scripts
Office Scripts can perform an equivalent flow in Excel on the web. The following is a concise example using the fetch API available in the Office Scripts runtime.
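// Assumed response shape; adjust to the provider's actual schema.
interface MatchResult { matches: object[]; top_score: number; }

async function main(workbook: ExcelScript.Workbook) {
  const sheet = workbook.getWorksheet('Matches');
  // Columns A-C hold id, query text and a JSON array of candidates.
  const values = sheet.getRange('A2:C101').getValues();
  // Keep only populated rows (assumes blanks occur only at the end of the range).
  const items = values
    .filter(r => String(r[2]).trim() !== '')
    .map(r => ({id: r[0], query: r[1], candidates: JSON.parse(String(r[2])) as string[]}));
  if (items.length === 0) return;
  const resp = await fetch('https://your-embedding-service.example/v1/match', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({items: items, top_k: 5})
  });
  if (!resp.ok) {
    console.log(`Match request failed: HTTP ${resp.status}`);
    return;
  }
  // Assumes the service returns one result per item, in request order.
  const results: MatchResult[] = await resp.json();
  if (results.length === 0) return;
  const output = results.map(r => [JSON.stringify(r.matches), r.top_score]);
  // Write columns D-E in a single call, starting at row 2.
  sheet.getRangeByIndexes(1, 3, output.length, 2).setValues(output);
}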
Performance and Cost Benchmarks
Worked example: 10,000 rows, comparing one embedding call per row (naive) against hybrid approaches that embed every shortlisted candidate.
| Approach | Embedding calls | Cost (@ $0.0005/call) |
|---|---|---|
| Naive: One embedding per row | 10,000 | $5.00 |
| Hybrid: FLOOKUP with ~20 candidates (no optimisation) | 200,000 | $100.00 |
| Optimised hybrid: Blocking + caching (avg 5 candidates) | 50,000 | $25.00 |
Interpretation: Unoptimised hybrid approaches can actually increase embedding costs due to candidate volume. Candidate reduction and caching are essential to keep costs manageable. The cost figures above are illustrative; adapt pricing to your provider and model.
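The arithmetic behind the table is simple enough to check with a few lines of code. The sketch below assumes one embedding call per candidate and a flat per-call price; `estimateEmbeddingCost` is a hypothetical helper, and the naive case is modelled as one candidate per row.

```javascript
// Estimate calls and cost, rounding the cost to whole cents.
function estimateEmbeddingCost(rows, candidatesPerRow, pricePerCall) {
  const calls = rows * candidatesPerRow;
  const cost = Math.round(calls * pricePerCall * 100) / 100;
  return { calls, cost };
}

console.log(estimateEmbeddingCost(10000, 1, 0.0005));  // { calls: 10000, cost: 5 }
console.log(estimateEmbeddingCost(10000, 20, 0.0005)); // { calls: 200000, cost: 100 }
console.log(estimateEmbeddingCost(10000, 5, 0.0005));  // { calls: 50000, cost: 25 }
```

Swapping in your own row count, average candidate count and provider price gives a quick estimate before committing to a full run.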
A good rule of thumb: if your Flookup lexical step produces fewer than 10 candidates per row on average, the semantic scoring step remains cost effective at typical embedding API rates. When candidate counts rise above 20, consider adding blocking fields such as industry code or geographic region to narrow the pool before sending candidates to the embedding service.
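Blocking can be sketched as grouping reference records by a shared field so that each query is only compared against records in its own group. The field names (`name`, `region`) and the helpers `buildBlocks` and `candidatesFor` are hypothetical, and this simple version assumes the blocking field is reliably populated.

```javascript
// Group records by a normalised blocking field (for example region or industry code).
function buildBlocks(records, blockKey) {
  const blocks = new Map();
  for (const record of records) {
    const key = String(record[blockKey] ?? '').trim().toLowerCase();
    if (!blocks.has(key)) blocks.set(key, []);
    blocks.get(key).push(record);
  }
  return blocks;
}

// Return only the records that share the query's block.
function candidatesFor(query, blocks, blockKey) {
  const key = String(query[blockKey] ?? '').trim().toLowerCase();
  return blocks.get(key) || [];
}

const reference = [
  { name: 'Acme Ltd', region: 'EU' },
  { name: 'Acme Limited', region: 'eu ' },
  { name: 'Apex Corp', region: 'US' }
];
const blocks = buildBlocks(reference, 'region');
console.log(candidatesFor({ name: 'ACME', region: 'Eu' }, blocks, 'region').length); // 2
```

The trade-off is recall: a record with a wrong or missing blocking value never reaches the lexical or semantic stages, so choose fields that are clean and stable.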
ANN and Production Notes
- For larger datasets, persist an approximate nearest-neighbour (ANN) index (FAISS, Annoy, Milvus) and precompute embeddings offline.
- Use the spreadsheet flow for verification and human review, and push large-scale deduplication tasks to batch services that update the sheet with results.
- Monitor embedding stability; reindex or retrain if nearest-neighbour behaviour changes over time.
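For context, an ANN index approximates the exact search sketched below, which compares the query against every stored vector using cosine similarity. This brute-force version is not an ANN implementation; it is a baseline that stays practical for small candidate sets and can serve as a reference when checking whether an ANN index still returns the expected neighbours. The names `cosineSimilarity` and `topK` and the toy vectors are illustrative assumptions.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact top-k search: score every entry, sort descending, keep the first k.
function topK(queryVec, index, k) {
  return index
    .map(entry => ({ id: entry.id, score: cosineSimilarity(queryVec, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const index = [
  { id: 'a', vector: [1, 0] },
  { id: 'b', vector: [0, 1] },
  { id: 'c', vector: [1, 1] }
];
console.log(topK([1, 0.1], index, 2).map(r => r.id)); // [ 'a', 'c' ]
```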
UX and Review Workflows
Recommended columns in the review sheet:
- candidate_score: Semantic score from embedding service
- lexical_score: Compare Text or similar
- decision: Accept / reject / review
- review_flag: Boolean for manual triage
Frequently Asked Questions
What is hybrid fuzzy matching?
Hybrid fuzzy matching combines traditional string similarity algorithms (such as Levenshtein or Jaro-Winkler) with semantic embedding models to improve matching accuracy. The lexical component catches typographical variations while the semantic component understands meaning, so "car" matches "automobile" even when the strings share no common characters.
When should I use embeddings instead of traditional fuzzy matching?
Embeddings are preferred when data contains synonyms, paraphrases or conceptually equivalent terms that share few characters in common. For example, matching job titles ("CEO" vs "Chief Executive Officer") or product descriptions across different catalogues. Traditional edit-distance algorithms remain better for typo correction and short-code matching.
Does Flookup support embedding-based matching?
Flookup integrates with embedding services to provide semantic matching alongside its built-in fuzzy matching algorithms. This allows you to combine lexical and semantic scores in a single decision framework, getting the best of both approaches within your Google Sheets workflow.