Sunday, July 21, 2013

Removing Diacritical Marks

I recently needed to remove the diacritical marks from a string. The poor man's way would have been to substitute the individual letters with their unadorned equivalents (é, è, ê => e; à, á, â => a; etc.), but I was looking for a more elegant solution.

A little googling found me this little gem in C#. The article is in Dutch, but the code speaks for itself:

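// Needs: using System.Linq; using System.Text;
string q = "Wûnseradiel";
// Decompose each accented character into base letter + combining mark.
char[] normalised = q.Normalize(NormalizationForm.FormD).ToCharArray();
// Keep only the ASCII characters, which drops the combining marks.
q = new string(normalised.Where(c => (int) c <= 127).ToArray());
// q == "Wunseradiel"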

The issue is that in Unicode two characters that look the same can have different binary representations (look here for the full details). The process of giving all equivalent characters the same representation is called normalization. In .NET the String class has a built-in method to do just that. The form chosen here is Normalization Form D (NFD), which separates each 'accented' character into the unadorned character followed by the diacritical mark. This makes it easy to iterate over the resulting character array and keep only the ASCII characters (code points up to 127), which drops the diacritical marks and leaves us with just the unadorned characters. Note that any other non-ASCII character is dropped along with them. Elegant and concise. But what if I wanted to apply the same solution in Go?
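To make the "same look, different representation" point concrete, here is a minimal Go sketch using only the standard library. The constant names and the `describe` helper are made up for this illustration; the two string literals spell "é" once as a single precomposed code point and once as "e" followed by a combining acute accent, which is the shape NFD produces.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Two spellings of "é" that render identically (names are illustrative).
const (
	precomposed = "\u00e9"  // one code point: LATIN SMALL LETTER E WITH ACUTE
	decomposed  = "e\u0301" // "e" followed by COMBINING ACUTE ACCENT
)

// describe reports the UTF-8 byte length and code point count of s.
func describe(s string) string {
	return fmt.Sprintf("%d bytes, %d code points", len(s), utf8.RuneCountInString(s))
}

func main() {
	fmt.Println("precomposed:", describe(precomposed))
	fmt.Println("decomposed: ", describe(decomposed))
	// A plain string comparison sees two different byte sequences.
	fmt.Println("equal:", precomposed == decomposed)
}
```

The two strings compare as unequal even though they display the same, which is exactly why a normalization step is needed before comparing or filtering.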

The heart of the above solution is the ability to normalize a Unicode string as NFD, so I checked the standard Go libraries for Unicode normalization support. At the time of writing, such support is not included in the standard Go libraries, but it is available as a third-party package at: https://code.google.com/p/go.text. The documentation is available here.

To make use of this package, simply cd into the ./src of your GOPATH directory and execute:

hg clone https://code.google.com/p/go.text/

Now the unicode/norm package is available to be used in your own project. The following code illustrates how to achieve the same effect as with the C# snippet above, but fleshed out into an executable project:

package main

import (
	"fmt"

	"code.google.com/p/go.text/unicode/norm"
)

// StripDiacritics decomposes value into NFD and keeps only the ASCII
// characters, which drops the combining marks (and any other non-ASCII rune).
func StripDiacritics(value string) string {
	normalized := norm.NFD.String(value)
	var buffer []rune
	for _, char := range normalized {
		if char < 128 {
			buffer = append(buffer, char)
		}
	}
	return string(buffer)
}

func main() {
	message := "Buén día, mundo!"
	// Println already separates its operands with a space.
	fmt.Println("message:", message)
	fmt.Println("stripped:", StripDiacritics(message))
}

The output:
message: Buén día, mundo!
stripped: Buen dia, mundo!
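One caveat of the `char < 128` test is that it discards every non-ASCII character, not only the diacritical marks, so a letter such as "ß" would disappear entirely. A possible variation, sketched below, is to drop only the runes in the Unicode category Mn (nonspacing marks) using the standard `unicode` package. `StripMarks` is a hypothetical name, and the sketch assumes its input is already in NFD form (for example, the result of `norm.NFD.String`); the sample string is written out in decomposed form by hand so that the example runs without the normalization package.

```go
package main

import (
	"fmt"
	"unicode"
)

// StripMarks drops nonspacing combining marks (Unicode category Mn) from s
// and keeps every other character. It assumes s is already in NFD form.
func StripMarks(s string) string {
	kept := make([]rune, 0, len(s))
	for _, r := range s {
		if !unicode.Is(unicode.Mn, r) {
			kept = append(kept, r)
		}
	}
	return string(kept)
}

func main() {
	// Hand-decomposed input: each accent is a separate U+0301 after its letter.
	decomposed := "Bue\u0301n di\u0301a, Stra\u00dfe!"
	fmt.Println(StripMarks(decomposed))
}
```

Here the accents are removed while "ß" survives, at the cost of letting other non-ASCII characters through; which behaviour is preferable depends on what the stripped string is used for.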
