I recently hired a panel survey company to conduct surveys in several countries around the world. For some of the questions, we gave participants the option to select “other” and then fill in the blank. Under the ethics approval we have for the data, we are not allowed to retain anything in the dataset that could personally identify a participant. With roughly 8,000 responses in the survey, examining each one to determine whether it was personally identifying could have taken a lot of time. That’s when I realized that ChatGPT can find such information. I have a paid account with ChatGPT 4, which allows you to upload a CSV file. I went through my dataset in R and exported each variable individually into its own CSV file. Examining one variable at a time took longer, but I figured it would minimize the risk of personally identifiable information being linked with any other data in the dataset. I then uploaded each CSV file along with the following prompt into ChatGPT:
Will you please look through every row in the attached CSV file and check to see if there is any information that could personally identify someone? If there is personally identifying information, please indicate the row of the spreadsheet it is on.
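For readers who want to reproduce the per-variable export described above, here is a minimal sketch of the idea. I did this step in R; the version below is a Python equivalent using only the standard library, and the function name, file names, and column names are hypothetical placeholders rather than anything from my actual dataset. It writes each free-text column to its own CSV alongside a row number, so that a flagged row can be traced back to the original data without any other variables being shared.

```python
import csv
from pathlib import Path


def export_columns(source_csv, out_dir, columns):
    """Write each named column to its own CSV file.

    Each output file has two columns: a 1-based row number (counting data
    rows, not the header) and the response text. Blank responses are skipped.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Read the whole survey once; DictReader keys each row by column name.
    with open(source_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    written = []
    for col in columns:
        target = out_path / f"{col}.csv"
        with open(target, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["row", col])
            for i, record in enumerate(rows, start=1):
                text = (record.get(col) or "").strip()
                if text:  # only keep rows where the participant typed something
                    writer.writerow([i, text])
        written.append(target)
    return written
```

A call might look like `export_columns("survey.csv", "per_variable", ["religion_other", "origin_other"])`, with those names swapped for whatever the "other" fields are called in your own data.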
ChatGPT returned the row identifier for each response that it thought might contain personally identifiable information (PII). In all but one case, the flagged response was not actually personally identifying. For instance, ChatGPT flagged “Roman Catholic” as possible PII but didn’t flag “roman catholic” (capitalizing two words in a row seemed to be one of the indicators it used).
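To illustrate why a capitalization cue would produce exactly this kind of false positive, here is a small Python sketch of such a rule. This is only my guess at one signal, written out as a regular expression; it is not a description of how ChatGPT actually works, and the function name is made up for the example.

```python
import re

# Two or more consecutive words that each start with a capital letter,
# which is roughly what a first name followed by a last name looks like.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def looks_like_name(text):
    """Return True if the text contains a run of capitalized words."""
    return bool(CAPITALIZED_RUN.search(text))


print(looks_like_name("Roman Catholic"))  # True
print(looks_like_name("roman catholic"))  # False
```

A rule like this cannot tell a religion from a person's name, which is why every flagged row still needs a human to look at it.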
ChatGPT was able to quickly look through every response and flag any that might be personally identifying. The closest it came to finding PII was a response from someone who entered part of a street address instead of what we asked for, the country from which they emigrated. I was able to quickly delete that information from the dataset.