Monthly Archives: July 2011

SEO Slugification in Dotnet aka Unicode to Ascii aka Diacritic Stripping

This article looks at SEO using url slugs, and how we can generate them in dotnet/asp.

I’ve recently conducted quite a lot of research, looking for a class to create url slugs in C#. If you are not familiar with these, they are when sentences are converted to a url friendly format. Normally this involves converting spaces and punctuation to hyphens and removing accents from characters. e.g. “This is my resumé” becomes “this-is-my-resume”.

Slugifying URLs to make them human readable is considered good SEO strategy, as it gets keywords into the URLs. How much weight search engines put on it these days is a matter of debate, as it has been abused by spammers. Nonetheless, I think humans are more likely to click on a human readable link if a search engine throws it up, so there is no downside.

Before digging into the implementation aspects, one thing worth mentioning, as it took me a while to realise, is that it’s not really worth trying to use the slug as the unique key to whatever resource you want to offer. Dealing with the problem of collisions where you have matching slugs is just not worth the pain. Especially if you want your slugs to contain something highly repeatable like names.

What you will see the experts do is include the id of the resource (e.g. a GUID) and the slug. You can see it on both Stack Overflow and Facebook:

http://stackoverflow.com/questions/3769457/how-can-i-remove-accents-on-a-string

http://ms-my.facebook.com/people/Joe-Bloggs/12343243267683877

For StackOverflow, the id is actually 3769457, and if you change the slug string it makes no odds to what you get back.

Facebook is exactly the same except they place the real id at the end instead of the middle. This seems a little smarter to me, as Google etc. truncate very long URLs when they display them in search results. With the human readable part at the front, truncation only costs you the id, which means nothing to a human anyway.

They also both do permanent redirects if you type in a random slug. This tackles the canonical problem of having the same content on multiple links if people mess up the slug, which is pure SEO poison if not tackled.
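The id-plus-slug scheme and the permanent redirect can be sketched as below. This is a minimal illustration, not how either site implements it: the `/questions/{id}/{slug}` layout mirrors the Stack Overflow example above, while `SlugRoute`, `TryParse`, `GetRedirect` and the `lookupSlug` delegate are hypothetical names made up for the example.

```csharp
using System;

public static class SlugRoute
{
    // Splits a path such as "/questions/3769457/some-slug" into its numeric id
    // and the slug that follows it. Returns false when the path does not match
    // the assumed "/questions/{id}/{slug}" layout.
    public static bool TryParse(string path, out long id, out string slug)
    {
        id = 0;
        slug = "";
        var parts = (path ?? "").Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 2 || parts[0] != "questions") return false;
        if (!long.TryParse(parts[1], out id)) return false;
        slug = parts.Length > 2 ? parts[2] : "";
        return true;
    }

    // Returns the canonical path to issue a permanent (301) redirect to, or
    // null when the requested path is already canonical or is not recognised.
    // lookupSlug stands in for whatever fetches the stored slug for an id.
    public static string GetRedirect(string path, Func<long, string> lookupSlug)
    {
        long id;
        string slug;
        if (!TryParse(path, out id, out slug)) return null;
        string canonical = "/questions/" + id + "/" + lookupSlug(id);
        return path == canonical ? null : canonical;
    }
}
```

Only the id is used to find the resource. The slug is compared against the stored one purely to decide whether a redirect is needed, so a mistyped slug still lands on the right page under a single canonical URL.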

Right, to business – how do we do it? Well, the first part, punctuation removal and hyphenation, is trivial.

Removing accents, however, is rather more complex, and goes by many names including diacritic (accent) stripping and Unicode to ASCII. My research took me to Stack Overflow, where it became apparent that whilst there are libraries to help, C# coverage is very weak.

Looking at other languages’ libraries, they tend to rely on lookup tables and regex. The tables are very difficult to make complete, and regex is quite an expensive operation.

There is also a more standards based way of doing this, which is known as Normalisation and is described in more detail on the Unicode website here. This describes the different forms of normalisation which I’m not going to go into here. What normalisation does, is split Unicode characters into the represented letter and a series of accents etc that follow (see the unicode link for more details).

Having got this string one then needs to remove the accent characters which should leave us with just the character and no accent. This is great in theory, unfortunately certain characters don’t map to a low Ascii character with normalise, so even with this approach one needs a lookup table for exceptions.
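The normalise-then-strip idea on its own looks roughly like the sketch below. It is only an illustration of the technique, separate from the slug code further down: the `Diacritics` class and `Strip` method are names invented here, and it uses canonical decomposition (FormD) and drops every character the framework classes as a non-spacing mark.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class Diacritics
{
    // Decomposes each character into a base letter followed by its combining
    // marks, then drops the marks. Characters with no decomposition, such as
    // 'ł', pass through unchanged, which is why an exceptions table is needed.
    public static string Strip(string value)
    {
        if (value == null) return "";

        string decomposed = value.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        return sb.ToString();
    }
}
```

Calling `Diacritics.Strip("resumé")` gives "resume", whereas `Diacritics.Strip("ł")` hands back "ł" untouched, because that letter has no decomposed form.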

So the answer I’ve come to is based on two snippets which Jeff Atwood kindly shared on Stack Overflow. These are apparently the functions Stack Overflow uses for this very operation. You can find these here and here. Using these you have the warm glow of knowing they are production tested on a high volume site and performance is solid – note the lack of regex!

I have however, made a few changes:

  • The hyphens were appended in such a way that one could be added and then need removing because it was the last character in the string, i.e. we never want “my-slug-”. This means an extra string allocation. I’ve worked around this by delay-hyphening: a hyphen is only written once the next valid character turns up. If you compare my code to Jeff’s, the logic for this is easy to follow.
  • His approach is purely lookup based and missed a lot of characters I found in examples whilst researching on Stack Overflow. To counter this, I first perform a normalisation pass, and then ignore any characters outside the acceptable ranges. This works most of the time…
  • …For when it doesn’t, I’ve also had to add a lookup table. As mentioned above, some characters don’t map to a low ASCII value when normalised. Rather than drop these, I’ve got a manual list of exceptions that is doubtless full of holes, but better than nothing. The normalisation code was inspired by Jon Hanna’s great post here.
  • The case conversion is now also optional.

The upshot of all this, is that my version has better coverage than Jeff’s original and is a bit smarter with the hyphenation. I have a suspicion that the Microsoft implemented normalisation is likely to be slower than Jeff’s lookup table, so we are trading completeness for performance on that aspect. The hyphenation I would expect to be a bit faster, but not much as his extra string copy was only an edge case.

Anyway here’s the code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

	public static class Slug
	{
		public static string Create(bool toLower, params string[] values)
		{
			return Create(toLower, String.Join("-", values));
		}

		/// <summary>
		/// Creates a slug.
		/// Author: Daniel Harman, based on original code by Jeff Atwood
		/// References:
		/// http://www.unicode.org/reports/tr15/tr15-34.html
		/// http://meta.stackoverflow.com/questions/7435/non-us-ascii-characters-dropped-from-full-profile-url/7696#7696
		/// http://stackoverflow.com/questions/25259/how-do-you-include-a-webpage-title-as-part-of-a-webpage-url/25486#25486
		/// http://stackoverflow.com/questions/3769457/how-can-i-remove-accents-on-a-string
		/// </summary>
		/// <param name="toLower">When true, ASCII letters are lowercased.</param>
		/// <param name="value">The text to convert into a slug.</param>
		/// <returns></returns>
		public static string Create(bool toLower, string value)
		{
			if (value == null) return "";

			var normalised = value.Normalize(NormalizationForm.FormKD);

			const int maxlen = 80;
			int len = normalised.Length;
			bool prevDash = false;
			var sb = new StringBuilder(len);
			char c;

			for (int i = 0; i < len; i++)
			{
				c = normalised[i];
				if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
				{
					if (prevDash)
					{
						sb.Append('-');
						prevDash = false;
					}
					sb.Append(c);
				}
				else if (c >= 'A' && c <= 'Z')
				{
					if (prevDash)
					{
						sb.Append('-');
						prevDash = false;
					}
					// tricky way to convert to lowercase: OR-ing with 32 sets bit 5,
					// which maps ASCII 'A'-'Z' onto 'a'-'z'
					if (toLower)
						sb.Append((char)(c | 32));
					else
						sb.Append(c);
				}
				else if (c == ' ' || c == ',' || c == '.' || c == '/' || c == '\\' || c == '-' || c == '_' || c == '=')
				{
					if (!prevDash && sb.Length > 0)
					{
						prevDash = true;
					}
				}
				else
				{
					string swap = ConvertEdgeCases(c, toLower);

					if (swap != null)
					{
						if (prevDash)
						{
							sb.Append('-');
							prevDash = false;
						}
						sb.Append(swap);
					}
				}

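				// >= rather than ==: a delayed hyphen plus a letter, or a two letter swap
				// such as "ss", adds two characters in one pass and could step over maxlen.
				if (sb.Length >= maxlen) break;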
			}

			return sb.ToString();
		}

		static string ConvertEdgeCases(char c, bool toLower)
		{
			string swap = null;
			switch (c)
			{
				case 'ı':
					swap = "i";
					break;
				case 'ł':
					swap = "l";
					break;
				case 'Ł':
					swap = toLower ? "l" : "L";
					break;
				case 'đ':
					swap = "d";
					break;
				case 'ß':
					swap = "ss";
					break;
				case 'ø':
					swap = "o";
					break;
				case 'Þ':
					swap = "th";
					break;
			}
			return swap;
		}
	}

And here are my MbUnit tests:

	[TestFixture]
	public class When_Creating_Slug
	{
		[Test]
		[Row("ṃ,ỹ,ṛ,è,ş,ư,ḿ,ĕ", "m-y-r-e-s-u-m-e")]
		[Row("á-é-í-ó-ú", "a-e-i-o-u")]
		[Row("à,å,á,â,ä,ã,å,ą", "a-a-a-a-a-a-a-a")]
		[Row("è,é,ê,ë,ę", "e-e-e-e-e")]
		[Row("ì,í,î,ï,ı", "i-i-i-i-i")]
		[Row("ò,ó,ô,õ,ö,ø", "o-o-o-o-o-o")]
		[Row("ù,ú,û,ü", "u-u-u-u")]
		[Row("ç,ć,č", "c-c-c")]
		[Row("ż,ź,ž", "z-z-z")]
		[Row("ś,ş,š", "s-s-s")]
		[Row("ñ,ń", "n-n")]
		[Row("ý,Ÿ", "y-Y")]
		[Row("ł,Ł", "l-L")]
		[Row("đ", "d")]
		[Row("ß", "ss")]
		[Row("ğ", "g")]
		[Row("Þ", "th")]
		public void Should_Remove_Accents_Case_Invariant(string value, string expected)
		{
			var result = Slug.Create(false, value);
			
			Assert.AreEqual(expected, result);
		}

		[Test]
		[Row("ý,Ÿ", "y-y")]
		[Row("ł,Ł", "l-l")]
		public void Should_Remove_Accents_To_Lower(string value, string expected)
		{
			var result = Slug.Create(true, value);
			
			Assert.AreEqual(expected, result);
		}

		[Test]
		[Row("Slug Me ", "Slug-Me")]
		[Row("Slug Me,", "Slug-Me")]
		[Row("Slug Me.", "Slug-Me")]
		[Row("Slug Me/", "Slug-Me")]
		[Row("Slug Me\\", "Slug-Me")]
		[Row("Slug Me-", "Slug-Me")]
		[Row("Slug Me_", "Slug-Me")]
		[Row("Slug Me=", "Slug-Me")]
		[Row("Slug Me--", "Slug-Me")]
		[Row("Slug Me---,", "Slug-Me")]
		public void Should_Remove_Trailing_Punctuation(string value, string expected)
		{
			var result = Slug.Create(false, value);

			Assert.AreEqual(expected, result);
		}
	}

After all this, I’ve now realised I don’t really need this code where I thought I did. My use case was to convert people’s names into slugs for a name directory, but having done all this work, I had a look at how Facebook does it (after all, no harm copying the industry leaders). Well, I wish I’d done this first, as it turns out they just leave the accents in the names when they display them in a directory! Oh well, I’m pretty sure they will normalise for searching to maximise matches, and this code is sound for article type slugification rather than names.

With thanks to Tom Chantler for spotting the bug handling large whitespace strings.

Storing Custom Data in Forms Authentication Tickets

This article looks at storing custom data in ASP.NET forms authentication tickets. I recently updated the article to make the custom model binder generic, and to add the necessary registration code which was missing from the first draft.

So you’ve decided to use FormsAuthentication, and perhaps enhanced it with your own custom providers. In your AccountController Login method you probably have a call along these lines:

FormsAuthentication.SetAuthCookie(account.Id.ToString(), model.RememberMe);

That all works great, but what if you need to store some extra data in the cookie? Perhaps the name you are passing into the AuthTicket isn’t actually the user’s name, but a GUID. Suddenly that built-in ASP.NET login widget, in the top right of the page, doesn’t seem so great when it looks like this:

Hello 5D1D4743-9941-40B5-8931-6BC12617946C

What we need to do is store some extra data in that AuthTicket cookie, right? That way we can keep the GUID as the authentication id, but still store things like the user’s first name in the cookie, saving an expensive round trip to the db each time we render the widget.

Hmmm… what’s this ‘UserData’ property we see on the AuthTicket? Perfect!

Erk… It’s read only?!?!?!

At least that’s how my thought process went.

So we need to make an authentication ticket ourselves:

public FormsAuthenticationTicket(int version, string name, DateTime issueDate,
	DateTime expiration, bool isPersistent, string userData, string cookiePath);

Unfortunately that’s quite a few more parameters than SetAuthCookie(…) required and they should be coming from the web.config rather than hard-coded.

On the plus side, there is access to the UserData!

To avoid losing the web.config driven settings, we can do a little trick and get FormsAuthentication to do the parsing for us. All we need to do is ask it for an AuthTicket and copy the settings from that into a new one we create.

To do this, a few steps are required. First we get the cookie and decrypt its value to recover the ticket. Then we copy the ticket’s data into a new ticket along with our UserData, and encrypt that. Finally we add the cookie to the response.

Now before getting to the code, we should think about where it should live. It would seem logical to encapsulate this as an extension method on FormsAuthentication, but as that is a static class we can’t. Instead we can attach it to HttpResponseBase, which is not a bad home, especially as we have to add the cookie onto a response anyway. I’d recommend creating the following class in an ‘Infrastructure’ folder in your project:

	public static class HttpResponseBaseExtensions
	{
		public static int SetAuthCookie<T>(this HttpResponseBase responseBase, string name, bool rememberMe, T userData)
		{
			// In order to pick up the settings from config, we create a default cookie and use its values to create a
			// new one.
			var cookie = FormsAuthentication.GetAuthCookie(name, rememberMe);
			var ticket = FormsAuthentication.Decrypt(cookie.Value);
			
			var newTicket = new FormsAuthenticationTicket(ticket.Version, ticket.Name, ticket.IssueDate, ticket.Expiration,
				ticket.IsPersistent, userData.ToJson(), ticket.CookiePath);
			var encTicket = FormsAuthentication.Encrypt(newTicket);

			// Use existing cookie. Could create new one but would have to copy settings over...
			cookie.Value = encTicket;

			responseBase.Cookies.Add(cookie);

			return encTicket.Length;
		}
	}

There are a couple of things of note here. Firstly, we are accepting a generic type for the UserData, and secondly, we are encoding it to JSON!

Why? Well, let’s think about the UserData field. Being on a cookie, this can only contain string data. Now we could do our own custom serialisation into this string, but my preference is to use JSON as it’s designed for the task. In this instance I’m using the serialiser from MongoDB as I happen to be using that in my project, but any JSON serialiser will do. You might like to try the ServiceStack implementation, for example.
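To show what "any JSON serialiser will do" means in practice, here is a rough round trip using System.Text.Json instead of the MongoDB serialiser. That swap is my assumption for the sake of a self-contained example (it needs a recent .NET, or the NuGet package on older frameworks), and `SampleUserData` and `UserDataJson` are hypothetical names, not part of the code in this article.

```csharp
using System.Text.Json;

// Hypothetical stand-in for whatever class you keep in the ticket.
public class SampleUserData
{
    public string FirstName { get; set; }

    public SampleUserData()
    {
        FirstName = "Unknown";
    }
}

public static class UserDataJson
{
    // Serialises any object to a compact JSON string suitable for the
    // ticket's UserData field.
    public static string ToJson<T>(T value)
    {
        return JsonSerializer.Serialize(value);
    }

    // Reverses ToJson, rebuilding the object from the stored string.
    public static T FromJson<T>(string json)
    {
        return JsonSerializer.Deserialize<T>(json);
    }
}
```

Serialising a `SampleUserData` whose `FirstName` is "Dan" produces `{"FirstName":"Dan"}`, which is the kind of short string you want sitting in the ticket.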

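I’m also returning the length of the encrypted ticket as a rough guide to the size of the cookie. Browsers limit cookies to roughly 4KB (4096 bytes is typical, counting the name, value and attributes), and some will just discard anything larger. It’s worth keeping an eye on this, as it’s not just the size of your UserData that counts but the other mandatory parts of the cookie too.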
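If you want to act on that number, a crude budget check might look like the following. It is a sketch under stated assumptions: the 4096 byte figure is the commonly quoted per-cookie limit rather than something read from a browser, the allowance for attributes is a guess you supply, and `CookieBudget` with its members is a hypothetical helper.

```csharp
using System.Text;

public static class CookieBudget
{
    // Assumption: 4096 bytes is the commonly quoted per-cookie limit.
    public const int TypicalLimitBytes = 4096;

    // Bytes taken by "name=value" on the wire (one byte for the '=').
    public static int EstimateBytes(string name, string value)
    {
        return Encoding.UTF8.GetByteCount(name) + 1 + Encoding.UTF8.GetByteCount(value);
    }

    // True when name, value and a caller supplied allowance for attributes
    // (path, expires and so on) stay within the assumed limit.
    public static bool Fits(string name, string value, int reservedForAttributes)
    {
        return EstimateBytes(name, value) + reservedForAttributes <= TypicalLimitBytes;
    }
}
```

You could pass in the forms cookie name (".ASPXAUTH" by default) and the encrypted ticket string, and log a warning when `Fits` comes back false.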

So let’s get this wired into our AccountController.

First we define a UserData class with a FirstName in it:

	public class UserData
	{
		public string FirstName { get; set; }

		public UserData()
		{
			FirstName = "Unknown";
		}
	}

Now here’s an example Login Action. There are some extras in here around validation, but you can use whatever approach here that fits your project.

		[HttpPost]
		public ActionResult LogIn(AccountLoginVM model, string returnUrl)
		{
			try
			{
				if (ModelState.IsValid)
				{
					// Some code to validate and check authentication
					if (!Membership.ValidateUser(model.Email, model.Password))
						throw new RulesException("Incorrect username or password");

					Account account = _accounts.GetByEmail(model.Email);

					UserData userData = new UserData
					{
						FirstName = account.FirstName
					};

					Response.SetAuthCookie(account.Id.ToString(),
						model.RememberMe, userData);
				
					if (Url.IsLocalUrl(returnUrl))
					{
						return Redirect(returnUrl);
					}
					else
					{
						return RedirectToAction("Index", "Home");
					}
				}
			}
			catch (RulesException ex)
			{
				ex.CopyTo(ModelState);
			}

			model.Password = "";
			return View(model);
		}

That’s it. We’ve now got a cookie with our extra UserData in it.

Hang on… what about fixing that login widget in the top right?

One elegant way to crack this is to create a custom model binder, then if we swap the example widget from being a partial view to a partial action, all we need to do is demand a UserData object as an input param and the magic of binding will save us.

So, the custom model binder, again leveraging the MongoDb Json deserialiser:

	/// <summary>
	/// Binder to pull the UserData out for any actions that may want it.
	/// </summary>
	public class UserDataModelBinder<T> : IModelBinder
	{
		public object BindModel(ControllerContext controllerContext,
			ModelBindingContext bindingContext)
		{
			if (bindingContext.Model != null)
				throw new InvalidOperationException("Cannot update instances");
			if (controllerContext.RequestContext.HttpContext.Request.IsAuthenticated)
			{
				var cookie = controllerContext
					.RequestContext
					.HttpContext
					.Request
					.Cookies[FormsAuthentication.FormsCookieName];

				if (null == cookie)
					return null;

				var decrypted = FormsAuthentication.Decrypt(cookie.Value);

				if (!string.IsNullOrEmpty(decrypted.UserData))
					return BsonSerializer.Deserialize<T>(decrypted.UserData);
			}
			return null;
		}
	}

This is generic, so you can use whatever class suits to store the user data. The binder then needs to be registered in Application_Start() in ‘Global.asax.cs’:

ModelBinders.Binders.Add(typeof(UserData), new UserDataModelBinder<UserData>());

Now our login widget action, which passes a UserData object into our view (wrapped in a view model as we may not always want to pass all the UserData into the view).

		public ActionResult LoginWidget(UserData userData)
		{
			AccountLoginWidgetVM model = new AccountLoginWidgetVM();
			if (null != userData)
				model.UserData = userData;

			// Pass the view model, not the raw UserData, to match the view's @model type.
			return PartialView(model);
		}

And the matching partial view:

@model TestProj.Web.Models.AccountLoginWidgetVM
         
@if(Request.IsAuthenticated) {
    <text>Welcome <b>@Model.UserData.FirstName</b>!
    [ @Html.ActionLink("Logout", "Logout", "Account") ]</text>
}
else {
...
}

We’ve covered quite a broad range of topics here, but hopefully it’s clear and of use. If you need any clarification, leave a comment.

Next time… a change of tack. I’m going to look at how to get some performance out of a DevExpress WPF grid.