Mastodon: filtering words using regex filters

July 11, 2018

Regular expressions (regex) are quite an arcane language. It's easy to express a simple instruction like “no nastythings, please”, but then you have to construct something like “no bad news unless it's about the weather in Ireland” and it becomes unpronounceable.

In the case of Mastodon, we'll be working in a small text filter field, so here's how to find it:

I'm lazy, so the screenshots are in Russian. Shouldn't be a problem.

Oh, and Mastodon filters are case-insensitive, so NaStYThInGs would get caught by the filter, too.

Testing

Don't test your regex on Mastodon. The UI is horrendous for that.

Use a service like Regex Tester. Just copy the pattern into the regex field and a few toots into the text field: “no match” means nothing is filtered, and a match means that toot would be hidden.
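If you'd rather test locally, the same check can be sketched in a few lines of Python. This is only an approximation: is_filtered is a hypothetical helper, the sample toots are made up, and Python's regex dialect is not guaranteed to match whatever engine Mastodon uses in every corner case.

```python
import re

# Hypothetical helper: True means the toot would be hidden by the filter.
# re.IGNORECASE mirrors the case-insensitive behaviour described above.
def is_filtered(pattern: str, toot: str) -> bool:
    return re.search(pattern, toot, flags=re.IGNORECASE) is not None

pattern = r"nastythings"
toots = [
    "so many NaStYThInGs today",
    "nothing to see here",
]
for toot in toots:
    print("match" if is_filtered(pattern, toot) else "no match", "|", toot)
```

Paste your own pattern and a handful of real toots into the two variables and rerun it after every change.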

Filtering

A word is something between two spaces, right? There's a special instruction, \s, that means “any space” – spaces, tabs, newlines… all good.

\snastythings\s

This is not good enough. What if we add another word later? This filter would catch either of two words:

\s(nastythings|vilethings)\s

Still not good enough. What if these words are hashtags? What if someone writes i hate "nastythings", so the word is inside quotes, not spaces? \s doesn't cut it, but there's another special instruction, \W, which means “anything that's not a letter”:

\W(nastythings|vilethings)\W
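The difference between the two filters is easier to see side by side. Here is a rough Python sketch, assuming Python's \s and \W behave like the ones in the filter field for plain English text; the sample toots are invented.

```python
import re

SPACES = re.compile(r"\s(nastythings|vilethings)\s", re.IGNORECASE)
NON_WORD = re.compile(r"\W(nastythings|vilethings)\W", re.IGNORECASE)

samples = [
    'i hate nastythings a lot',      # plain word between spaces
    'i hate "nastythings" a lot',    # word inside quotes
    'so many #vilethings today',     # hashtag
    'nastythings',                   # nothing at all around the word
]
for text in samples:
    by_spaces = bool(SPACES.search(text))
    by_non_word = bool(NON_WORD.search(text))
    print(by_spaces, by_non_word, text)
```

The quoted word and the hashtag are only caught by the \W version. The last sample shows a side effect of both patterns: they need an actual character on each side of the word, so in this sketch a text consisting of the word alone is not matched. Whether that matters on Mastodon depends on what text the filter really sees, which this sketch does not try to verify.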

What if there's one thing instead of many? Either we write four words (nastything, nastythings, vilething, vilethings) or make the final “s” optional with a question mark:

\W(nastythings?|vilethings?)\W
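A quick way to convince yourself that the question mark only affects the letter right before it, again as a Python sketch with a hypothetical helper named caught:

```python
import re

pattern = re.compile(r"\W(nastythings?|vilethings?)\W", re.IGNORECASE)

def caught(word: str) -> bool:
    # Wrap the word in spaces so \W has something to match on both sides.
    return pattern.search(f" {word} ") is not None

for word in ["nastything", "nastythings", "vilething", "vilethings", "nastythingss"]:
    print(word, caught(word))
```

Singular and plural forms are caught; the made-up "nastythingss" is not, because only one trailing “s” is allowed.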

It gets worse if you don't write in English. See, the \W shortcut expands into “everything that's not A to Z, a digit or an underscore”. As you can guess, that's not enough even for the Extended Latin alphabet. So here's another arcane glyph: character sets.

[abc] means “letter a, b or c”.

[^abc] means “anything except for letters a, b and c”.

\W means [^a-zA-Z0-9_]
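The three rules above can be checked in Python. One assumption to flag: Python's own \W is Unicode-aware by default, so the sketch passes re.ASCII to get the A-to-Z-only expansion described here.

```python
import re

word = "cabbie"
in_set = re.findall(r"[abc]", word)       # letters a, b or c
not_in_set = re.findall(r"[^abc]", word)  # everything else
print(in_set, not_in_set)

sample = "snake_case 42!"
# With re.ASCII, \W and the explicit set pick out the same characters.
shortcut = re.findall(r"\W", sample, flags=re.ASCII)
spelled_out = re.findall(r"[^a-zA-Z0-9_]", sample)
print(shortcut == spelled_out, shortcut)
```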

For the Cyrillic alphabet the equivalent would be [^а-яА-Яa-zA-Z0-9_], because we're including Latin too. So the same filter becomes this:

[^а-яА-Яa-zA-Z0-9_](nastythings?|vilethings?)[^а-яА-Яa-zA-Z0-9_]
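To see why the longer character set matters, here is a Python sketch with a Cyrillic filter word. The word кот ("cat") and both sample sentences are purely illustrative, and re.ASCII is used to imitate the A-to-Z-only \W described above.

```python
import re

# ASCII-only \W: every Cyrillic letter counts as "not a letter".
ascii_only = re.compile(r"\W(кот)\W", re.ASCII)

# The explicit set treats Cyrillic letters as part of a word, too.
NOT_LETTER = "[^а-яА-Яa-zA-Z0-9_]"
cyrillic_aware = re.compile(f"{NOT_LETTER}(кот){NOT_LETTER}")

samples = [
    "мой кот спит",           # the word on its own
    "нужен скотч и ножницы",  # the same letters inside a longer word
]
for text in samples:
    print(bool(ascii_only.search(text)), bool(cyrillic_aware.search(text)), text)
```

The ASCII-only version hides the second sentence as well, because it finds "кот" inside "скотч"; the explicit set does not. One caveat: the letters ё and Ё sit outside the а-я and А-Я ranges in Unicode, so they may need to be added to the set by hand.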

Good enough for English, but there's still room for improvement if your language is more complex and needs word stemming. There are articles (un, une, le, la, les in French), personal names (I'm writing this because I'm really not interested in yet another Silicon Valley billionaire entrepreneur) and custom emojis in the middle of words (I have no idea how to catch those).

So, in a nutshell, that is how you have to write toot filters.