Profanity filter: removing English swear words

It’s not nice to be browsing your favourite word clouds and suddenly see a rude word appear. WordItOut will now filter out such word clouds by default. Of course, you still have the freedom to see these word clouds if you want to. Details of our approach are outlined below for developers who may be interested.

Why

Not everybody on WordItOut is out of school, and even adults don’t always like to see rude words appear on a website. Whilst we cannot guarantee that every offending word cloud will be identified, we have done our best to ensure that most of them will be. It’s also possible that some harmless words (especially non-English ones) will be filtered by mistake. Please let us know if that happens.

Criteria: the basics

For now, we only check for English words. We include many spelling variations, as people using these words are not always careful about how they type. We also try not to catch harmless words that merely contain a naughty one (suppose we want to filter luck but keep pluck). We’ve also worked hard so that the filter applies to words written in leet, where letters are substituted to evade filters but the word remains legible (for example, hello becomes h3110). Whenever even one word matches, the whole word cloud is filtered out.

How: advanced information for developers

There appears to be little documentation available to help developers write filters that are more sophisticated than simply checking individual words against a very large, predefined blacklist. We’re proud of how our filter works and think that this approach may be useful for others, so we’d like to share briefly how it works. Knowledge of regular expressions (regex) is assumed, and the code here is in JavaScript.

We start by creating two arrays of words, one for nouns and the other for verbs. If a word is both a noun and a verb, we put it with the verbs. Simple spelling variations or even different words can be included with regex.

var naughtyNouns = ['apple','bann?ana','(base|foot)?ball'];
var naughtyVerbs = ['colou?r','fetch','play'];

We then turn the arrays into a string, separated by the vertical bar (pipe symbol). What’s more, we simultaneously add various suffixes (and their common misspellings) to the nouns and the verbs. Note that we append the suffix regex to the string so that the final word is also included. This extends our initial list so that it will now include the following (and more): apples, playa, played, player, playin, playing, plays, fetches, …

var naughtyWords = naughtyNouns.join('s?|') + 's?';
naughtyWords += '|' +
  naughtyVerbs.join('(a|e|ed|er|ing?)?s?|') +
  '(a|e|ed|er|ing?)?s?';

The naughtyWords string is now a long, difficult-to-read regular expression, but building it this way gives us confidence that it is correctly written. With this method we do, of course, risk filtering some words that should not be filtered, but that risk is minimal, and we’d rather filter a few blocks of text incorrectly than let something slip through.
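
To see what the suffix step produces, here is a small sketch that wraps the same joins in a helper and tries a few words against the result. The names buildPattern and matchesWholeWord are hypothetical, and the ^(...)$ anchoring is an assumption added only so that each test word has to match in full.

```javascript
// Illustrative sketch: buildPattern and matchesWholeWord are hypothetical
// helpers, not part of the filter described in this post.
var NOUN_SUFFIX = 's?';
var VERB_SUFFIX = '(a|e|ed|er|ing?)?s?';

function buildPattern(nouns, verbs) {
  // Append the suffix regex to every word, then join everything with pipes.
  var nounPart = nouns.map(function (word) { return word + NOUN_SUFFIX; });
  var verbPart = verbs.map(function (word) { return word + VERB_SUFFIX; });
  return nounPart.concat(verbPart).join('|');
}

var pattern = buildPattern(
  ['apple', 'bann?ana', '(base|foot)?ball'],
  ['colou?r', 'fetch', 'play']
);

// Anchor the pattern so that a test word must match from start to end.
var wholeWord = new RegExp('^(' + pattern + ')$');

function matchesWholeWord(word) {
  return wholeWord.test(word);
}

matchesWholeWord('fetches');    // true: fetch + e + s
matchesWholeWord('playin');     // true: play + in
matchesWholeWord('playground'); // false: no suffix covers "ground"
```

Mapping over each array gives the same string as the join-and-append version above, but it may be a little easier to read when the suffix lists grow.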

We’re almost done, but now we want to filter out leet spellings and even more common misspellings. To do this, we replace every occurrence of certain letters in our current regular expression with a group of possible variations. Some examples are below, and they could easily be extended. Our list of variations is not fully comprehensive, but it includes the most common ones. Note that some characters are escaped or even double escaped, once for JavaScript and once for the regex.

naughtyWords = naughtyWords.replace(/[a]/gim,'(a|4|@|\\*)');
naughtyWords = naughtyWords.replace(/[s]/gim,'([sz5$]+)');

These steps have transformed our simple word ‘play’ into the following regular expression (shown as the regex itself, so the asterisk is escaped only once):

pl(a|4|@|\*)y((a|4|@|\*)|e|ed|er|ing?)?([sz5$]+)?

This means that we will also filter out the following words: pl4yer, pl*yin, pl@y$, and so on. Our final step makes sure that we only look for whole words. An example of how to apply the string as a regular expression is also given below.

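naughtyWords = '\\b('+naughtyWords+')\\b';
// Compile once; the 'i' flag also catches capitalised words.
var naughtyRegex = new RegExp(naughtyWords, 'i');
if (naughtyRegex.test(testThisString)) { /* ... */ }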
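
Putting the steps together, a minimal end-to-end sketch might look like the following. The function names toLeetPattern, buildFilter and isCloudFiltered are hypothetical, and treating a word cloud as one plain string of text, as well as matching case-insensitively, are assumptions made for illustration.

```javascript
// Hypothetical helpers for illustration only.
function toLeetPattern(pattern) {
  // The same two substitutions as above; more letters could be added.
  return pattern
    .replace(/a/gi, '(a|4|@|\\*)')
    .replace(/s/gi, '([sz5$]+)');
}

function buildFilter(pattern) {
  // \b limits matches to whole words; the 'i' flag ignores case.
  return new RegExp('\\b(' + toLeetPattern(pattern) + ')\\b', 'i');
}

// Start from the suffixed pattern for a single verb.
var filter = buildFilter('play(a|e|ed|er|ing?)?s?');

function isCloudFiltered(cloudText) {
  // One matching word is enough to filter the whole cloud.
  return filter.test(cloudText);
}

isCloudFiltered('we like pl4yer names'); // true
isCloudFiltered('a display case');       // false: "play" is not a whole word here
```

One thing to watch: \b only recognises a boundary next to a letter, digit or underscore, so a disguised word that begins with a symbol at the very start of the text (for example one starting with @) may not satisfy the leading \b. Whether that matters depends on the word list.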

Our approach is modular and thus easy to update: you can include new words, apply the method to other languages, and adapt it to new types of leet and spelling variations.
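
As a sketch of that modularity, the substitutions could be kept in a table so that adding a new variation is a one-line change. The function name applySubstitutions and the e, l and o rows are hypothetical examples, not the list WordItOut actually uses; only the a and s rows come from the code above.

```javascript
// Hypothetical substitution table: letter -> regex fragment of look-alikes.
var substitutions = {
  a: '(a|4|@|\\*)',
  e: '(e|3)',
  l: '(l|1)',
  o: '(o|0)',
  s: '([sz5$]+)'
};

function applySubstitutions(pattern, table) {
  // A single pass over the pattern, so letters inside an inserted
  // fragment are never substituted a second time.
  return pattern.replace(/[a-z]/gi, function (letter) {
    var fragment = table[letter.toLowerCase()];
    return fragment === undefined ? letter : fragment;
  });
}

// Apply the table before adding \b, as in the steps above.
var helloPattern = applySubstitutions('hello', substitutions);
var helloRegex = new RegExp('^' + helloPattern + '$', 'i');

helloRegex.test('h3110'); // true
helloRegex.test('help');  // false
```

Using a replacer function instead of chained replace calls is a design choice: it keeps the order of the table rows from mattering, and the fragments are inserted exactly as written.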

We hope you find this small tutorial useful. Please share it if you did.

Sunday 28 February 2010