SEO Slugification in Dotnet aka Unicode to Ascii aka Diacritic Stripping

This article looks at SEO using url slugs, and how we can generate them in dotnet/asp.

I’ve recently conducted quite a lot of research, looking for a class to create url slugs in C#. If you are not familiar with these, a slug is a sentence or title converted to a url friendly format. Normally this involves converting spaces and punctuation to hyphens and removing accents from characters, e.g. “This is my resumé” becomes “this-is-my-resume”.

Slugifying urls to make them human readable is considered good SEO strategy, as it gets keywords into the url. How much weight search engines put on this these days is a matter of debate, as it has been abused by spammers. Nonetheless, I think humans are more likely to click on a human readable link if a search engine throws it up, so there is no downside.

Before digging into the implementation, one thing worth mentioning, as it took me a while to realise, is that it’s not really worth trying to use the slug as the unique key to whatever resource you want to offer. Dealing with collisions between matching slugs is just not worth the pain, especially if your slugs contain something highly repeatable like people’s names.

What you will see the experts do is include both the id of the resource (e.g. a GUID) and the slug in the url. You can see this on both Stack Overflow and Facebook:

http://stackoverflow.com/questions/3769457/how-can-i-remove-accents-on-a-string

http://ms-my.facebook.com/people/Joe-Bloggs/12343243267683877

For Stack Overflow, the id is actually 3769457, and changing the slug string makes no difference to what you get back.

Facebook is exactly the same, except they place the real id at the end instead of the middle. This seems a little smarter to me: Google and others truncate very long urls when displaying them in search results, so by putting the human readable part at the front you keep it visible whilst still preserving an id that means nothing to humans.

They also both issue permanent redirects if you type in the wrong slug. This tackles the canonical-url problem of having the same content available at multiple urls when people mess up the slug, which is pure SEO poison if not addressed.
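To make the redirect idea concrete, here is a minimal sketch of how it might look in ASP.NET MVC. The Article, IArticleRepository and ArticlesController types here are made up purely for illustration; the point is the 301 to the canonical url whenever the slug in the request doesn’t match:

	using System;
	using System.Web.Mvc;

	public class Article
	{
		public int Id { get; set; }
		public string Slug { get; set; }
	}

	public interface IArticleRepository
	{
		Article Find(int id);
	}

	public class ArticlesController : Controller
	{
		private readonly IArticleRepository articles;

		public ArticlesController(IArticleRepository articles)
		{
			this.articles = articles;
		}

		// Mapped to a route like /articles/{id}/{slug}.
		public ActionResult Show(int id, string slug)
		{
			var article = articles.Find(id);
			if (article == null)
				return HttpNotFound();

			// If the slug in the url isn't the canonical one, 301 to it so
			// search engines only ever index a single url for this content.
			if (!string.Equals(slug, article.Slug, StringComparison.Ordinal))
				return RedirectToActionPermanent("Show", new { id, slug = article.Slug });

			return View(article);
		}
	}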

Right, to business – how do we do it? Well, the first part, punctuation removal and hyphenation, is trivial.

Removing accents, however, is rather more complex, and goes by many names including diacritic (accent) stripping and Unicode to ascii conversion. My research took me to Stack Overflow, where it became apparent that whilst there are libraries to help, C# coverage is very weak.

Looking at other languages’ libraries, they tend to rely on lookup tables and regex. The tables are very difficult to make complete, and regex is quite an expensive operation.
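For comparison, the regex part of such an approach (leaving the accent lookup tables aside) typically boils down to something like the sketch below. Note that on its own it simply drops accented letters, which is exactly the gap the lookup tables are there to fill:

	using System.Text.RegularExpressions;

	public static class RegexSlug
	{
		// Collapse any run of characters that aren't ascii letters or digits
		// into a single hyphen, then trim stray hyphens from the ends.
		public static string Create(string value)
		{
			var hyphenated = Regex.Replace(value.ToLowerInvariant(), @"[^a-z0-9]+", "-");
			return hyphenated.Trim('-');
		}
	}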

There is also a more standards-based way of doing this, known as normalisation, which is described in detail in the Unicode standard (http://www.unicode.org/reports/tr15/tr15-34.html). That report covers the different normalisation forms, which I’m not going to go into here. What normalisation does is split each Unicode character into the base letter it represents, followed by a series of combining accents (see the Unicode link for more details).

Having got this string, one then needs to remove the accent characters, which should leave just the base character with no accent. This is great in theory; unfortunately certain characters don’t map to a low ascii character when normalised, so even with this approach one needs a lookup table for exceptions.
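As an illustration of the idea (this is not the code used further down), a normalise-then-strip pass in C# might look like the following. Characters such as ‘ø’ and ‘ß’ come through untouched because they have no decomposition, which is why the exceptions table is still needed:

	using System.Globalization;
	using System.Text;

	public static class Diacritics
	{
		public static string Strip(string value)
		{
			// FormD splits each character into its base letter plus combining marks.
			var decomposed = value.Normalize(NormalizationForm.FormD);
			var sb = new StringBuilder(decomposed.Length);

			foreach (var c in decomposed)
			{
				// The combining accents come through as non-spacing marks; drop them.
				if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
					sb.Append(c);
			}

			// Recompose what's left, e.g. "resumé" -> "resume", but "ø" stays "ø".
			return sb.ToString().Normalize(NormalizationForm.FormC);
		}
	}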

So the answer I’ve come to is based on two snippets which Jeff Atwood kindly shared on Stack Overflow (the links are among the references in the code below). These are apparently the functions Stack Overflow uses for this very operation, so you have the warm glow of knowing they are production tested on a high volume site and performance is solid – note the lack of regex!

I have however, made a few changes:

  • The hyphens were appended in such a way that one could be added and then need removing because it was the last character in the string (we never want “my-slug-”). That meant an extra string allocation. I’ve worked around this by delay-hyphening; if you compare my code to Jeff’s, the logic for this is easy to follow.
  • His approach is purely lookup based and missed a lot of characters I found in examples whilst researching on Stack Overflow. To counter this, I first perform a normalisation pass, and then ignore any characters outside the acceptable ranges. This works most of the time…
  • …For when it doesn’t I’ve also had to add a lookup table. As mentioned above, some characters don’t map to a low ascii value when normalised. Rather than drop these I’ve got a manual list of exceptions that is doubtless full of holes, but better than nothing. The normalisation code was inspired by Jon Hanna’s great post here.
  • The case conversion is now also optional.

The upshot of all this is that my version has better coverage than Jeff’s original and is a bit smarter with the hyphenation. I suspect the Microsoft-implemented normalisation is likely to be slower than Jeff’s lookup table, so we are trading completeness for performance on that aspect. The hyphenation I would expect to be a bit faster, but not by much, as his extra string copy was only an edge case.

Anyway here’s the code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

	public static class Slug
	{
		public static string Create(bool toLower, params string[] values)
		{
			return Create(toLower, String.Join("-", values));
		}

		/// <summary>
		/// Creates a slug.
		/// Author: Daniel Harman, based on original code by Jeff Atwood
		/// References:
		/// http://www.unicode.org/reports/tr15/tr15-34.html
		/// http://meta.stackoverflow.com/questions/7435/non-us-ascii-characters-dropped-from-full-profile-url/7696#7696
		/// http://stackoverflow.com/questions/25259/how-do-you-include-a-webpage-title-as-part-of-a-webpage-url/25486#25486
		/// http://stackoverflow.com/questions/3769457/how-can-i-remove-accents-on-a-string
		/// </summary>
		/// <param name="toLower"></param>
		/// <param name="normalised"></param>
		/// <returns></returns>
		public static string Create(bool toLower, string value)
		{
			if (value == null) return "";

			var normalised = value.Normalize(NormalizationForm.FormKD);

			const int maxlen = 80;
			int len = normalised.Length;
			bool prevDash = false;
			var sb = new StringBuilder(len);
			char c;

			for (int i = 0; i < len; i++)
			{
				c = normalised[i];
				if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
				{
					if (prevDash)
					{
						sb.Append('-');
						prevDash = false;
					}
					sb.Append(c);
				}
				else if (c >= 'A' && c <= 'Z')
				{
					if (prevDash)
					{
						sb.Append('-');
						prevDash = false;
					}
					// tricky way to convert to lowercase
					if (toLower)
						sb.Append((char)(c | 32));
					else
						sb.Append(c);
				}
				else if (c == ' ' || c == ',' || c == '.' || c == '/' || c == '\\' || c == '-' || c == '_' || c == '=')
				{
					if (!prevDash && sb.Length > 0)
					{
						prevDash = true;
					}
				}
				else
				{
					string swap = ConvertEdgeCases(c, toLower);

					if (swap != null)
					{
						if (prevDash)
						{
							sb.Append('-');
							prevDash = false;
						}
						sb.Append(swap);
					}
				}

				if (sb.Length == maxlen) break;
			}

			return sb.ToString();
		}

		static string ConvertEdgeCases(char c, bool toLower)
		{
			string swap = null;
			switch (c)
			{
				case 'ı':
					swap = "i";
					break;
				case 'ł':
					swap = "l";
					break;
				case 'Ł':
					swap = toLower ? "l" : "L";
					break;
				case 'đ':
					swap = "d";
					break;
				case 'ß':
					swap = "ss";
					break;
				case 'ø':
					swap = "o";
					break;
				case 'Þ':
					swap = "th";
					break;
			}
			return swap;
		}
	}
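
For reference, typical usage looks like this; the expected values follow from the code above:

	var articleSlug = Slug.Create(true, "This is my resumé");
	// articleSlug == "this-is-my-resume"

	var nameSlug = Slug.Create(false, "Łukasz's Blog Post");
	// nameSlug == "Lukaszs-Blog-Post"

	// The params overload joins the values with hyphens before slugifying.
	var joined = Slug.Create(true, "2011", "06", "my resumé");
	// joined == "2011-06-my-resume"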

and here are my MbUnit tests:

	[TestFixture]
	public class When_Creating_Slug
	{
		[Test]
		[Row("ṃ,ỹ,ṛ,è,ş,ư,ḿ,ĕ", "m-y-r-e-s-u-m-e")]
		[Row("á-é-í-ó-ú", "a-e-i-o-u")]
		[Row("à,å,á,â,ä,ã,å,ą", "a-a-a-a-a-a-a-a")]
		[Row("è,é,ê,ë,ę", "e-e-e-e-e")]
		[Row("ì,í,î,ï,ı", "i-i-i-i-i")]
		[Row("ò,ó,ô,õ,ö,ø", "o-o-o-o-o-o")]
		[Row("ù,ú,û,ü", "u-u-u-u")]
		[Row("ç,ć,č", "c-c-c")]
		[Row("ż,ź,ž", "z-z-z")]
		[Row("ś,ş,š", "s-s-s")]
		[Row("ñ,ń", "n-n")]
		[Row("ý,Ÿ", "y-Y")]
		[Row("ł,Ł", "l-L")]
		[Row("đ", "d")]
		[Row("ß", "ss")]
		[Row("ğ", "g")]
		[Row("Þ", "th")]
		public void Should_Remove_Accents_Case_Invariant(string value, string expected)
		{
			var result = Slug.Create(false, value);
			
			Assert.AreEqual(expected, result);
		}

		[Test]
		[Row("ý,Ÿ", "y-y")]
		[Row("ł,Ł", "l-l")]
		public void Should_Remove_Accents_To_Lower(string value, string expected)
		{
			var result = Slug.Create(true, value);
			
			Assert.AreEqual(expected, result);
		}

		[Test]
		[Row("Slug Me ", "Slug-Me")]
		[Row("Slug Me,", "Slug-Me")]
		[Row("Slug Me.", "Slug-Me")]
		[Row("Slug Me/", "Slug-Me")]
		[Row("Slug Me\", "Slug-Me")]
		[Row("Slug Me-", "Slug-Me")]
		[Row("Slug Me_", "Slug-Me")]
		[Row("Slug Me=", "Slug-Me")]
		[Row("Slug Me--", "Slug-Me")]
		[Row("Slug Me---,", "Slug-Me")]
		public void Should_Remove_Trailing_Punctuation(string value, string expected)
		{
			var result = Slug.Create(false, value);

			Assert.AreEqual(expected, result);
		}
	}

After all this, I’ve now realised I don’t really need this code where I thought I did. My use case was converting people’s names into slugs for a name directory, but having done all this work, I had a look at how Facebook handles it (after all, there’s no harm in copying the industry leaders). I wish I’d done that first, as it turns out they just leave the accents in the names when displaying them in a directory! Oh well. I’m pretty sure they normalise for searching to maximise matches, and this code is still sound for slugifying things like article titles rather than names.
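If I did build that directory, one plausible approach (sketched below with made-up PersonEntry and NameIndex types) would be to keep the display name intact and use the slug code only to build a normalised search key:

	// Hypothetical sketch: keep the display name as entered, but index and
	// match on a key produced by the slug code, so "Renée" is found by "renee".
	public class PersonEntry
	{
		public string DisplayName { get; set; }	// shown with accents intact
		public string SearchKey { get; set; }	// normalised for matching
	}

	public static class NameIndex
	{
		public static PersonEntry CreateEntry(string name)
		{
			return new PersonEntry
			{
				DisplayName = name,
				SearchKey = Slug.Create(true, name)
			};
		}

		public static bool Matches(PersonEntry entry, string query)
		{
			return entry.SearchKey.Contains(Slug.Create(true, query));
		}
	}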

With thanks to Tom Chantler for spotting the bug in handling strings containing a lot of whitespace.

3 thoughts on “SEO Slugification in Dotnet aka Unicode to Ascii aka Diacritic Stripping”

  1. Tom Chantler

    Hey Dan this is great, thanks.

    So far I have just made one tiny edit which is to change line 84 from if (i == maxlen) break; to if (sb.Length == maxlen) break; so that it will return a slug of the desired length even if the passed string contains a lot of whitespace/invalid characters.

    I also benchmarked it (roughly) against Jeff Atwood’s code doing ten million iterations each and they were about the same speed.

    1. Dan Harman (post author)

      Hey thanks for the positive comments and for benchmarking! Good to hear performance is ok, as I’ve not put this into production yet. Thanks also for the bug fix. I’ll amend the article when I get a moment.

