STANDARDIZE DATA
Overview
The Standardize Data tool helps you clean and normalise your data by removing noise such as stop words, punctuation and diacritical marks. This is a crucial pre-processing step that significantly improves the accuracy of fuzzy matching and data reconciliation by focusing on the substantive content of your text.
To begin, open the tool by navigating to Extensions > Flookup Data Wrangler > Matching and Analysis > Standardize data in your Google Sheets menu.
How to Use the Standardize Data Tool
- Select the cleaning operation
Choose the type of normalization to perform, such as Remove punctuation marks, Remove unwanted words, Remove diacritical marks, Keep URL path only, or Keep URL domain only. - Select the data to normalize
Highlight the column or range of data you wish to process and click the Grab selected range button for the Range of entries to normalize. - Select entries to remove (if applicable)
For punctuation or word removal, highlight a range containing the specific characters or words you wish to exclude and click the Grab selected range button for the Range with entries to be removed. - Specify the output location
Click an empty cell within your spreadsheet to mark the starting position where the cleaned results should be displayed. - Set the threshold (if applicable)
If you selected Remove unwanted words, you can adjust the Threshold value (default 0.75) to account for slight variations in the words you wish to remove. - Execute the normalization
Click the Normalize text entries button to finish.
Notes on Standardizing Data
- URL Normalization: Use Keep URL domain only to strip protocols and paths, leaving just the root domain (e.g., example.com). Use Keep URL path only to extract the subdirectory structure.
- Fuzzy Word Removal: The Threshold allows you to remove words that are phonetically or visually similar to those in your removal list, ensuring a more thorough cleaning of inconsistent datasets.
- In-place vs New Location: The tool writes results starting from the specified output position, preserving your original data for easy comparison.
- Diacritical Marks: Selecting Remove diacritical marks will convert accented characters (like é, ñ, or ö) to their base equivalents, which is often necessary before performing character-based fuzzy matching.
Visual Learning Resources
Labels might differ slightly but the steps remain the same.