html entities with utf-8

html, internet, php, unicode apr 2008 misterhaan

a while ago i started switching my php code from using htmlspecialchars() to htmlentities(), which as i understand converts a much larger group of characters into html entity form (for example, “—” becomes “&mdash;”). i decided to go with that partially because there’s an html_entity_decode() and partially because i was thinking that it’s better to have “&mdash;” in my html than “—.” i also started using smart quotes in my pages where i actually enter the html as &ldquo; &rdquo; &lsquo; and &rsquo;.
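to make the difference between the two functions concrete, here is a minimal sketch. the sample string is made up, and it assumes the file is saved as utf-8 and that the default html 4.01 entity table is in use.

```php
<?php
// made-up sample text: a curly apostrophe, an em dash, and a few html-special characters
// (assumes this file itself is saved as utf-8)
$text = "i’m <b>here</b> — & there";

// htmlspecialchars() only touches & < > " (and ' because of ENT_QUOTES)
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() also converts every character that has a named entity
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // i’m &lt;b&gt;here&lt;/b&gt; — &amp; there
echo $entities, "\n";  // i&rsquo;m &lt;b&gt;here&lt;/b&gt; &mdash; &amp; there
```

both calls escape the markup characters the same way; the only difference is what happens to the quote and the dash.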

more recently i got to thinking about the fact that i serve my html with the utf-8 character set, but convert everything that’s not printable ascii (character codes 32 through 126) into html entities, which themselves are composed of ascii characters. so why bother even specifying a character set if i’m not using it? i’m also making my html harder for me to read because it looks like i&rsquo;m instead of i’m, for example.

looking at utf-8, the right single quote character (U+2019 in unicode) gets encoded using 3 bytes. the html entity, &rsquo;, is all ascii, which is one byte per character, for a total of 7 bytes. so if i instead actually put the character into a utf-8 html document, i save 4 bytes on that one character. all html entities start with an ampersand and end with a semicolon, so with at least one character in between (though i don’t think there are any entities with only one-character names), that’s 3 bytes already. the shortest named entities i know of have two-letter names, like &lt; and &pi;, which makes 4 bytes. the most bytes any single character can take up in utf-8 is also 4, so as far as i can tell the actual character is never bigger than its html entity.
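the byte counts are easy to check with strlen(), which counts bytes rather than characters. this is just a sketch: the list of characters is my own pick for illustration, and it assumes the file is saved as utf-8.

```php
<?php
// characters picked for illustration, each paired with its named entity
// (assumes this file is saved as utf-8, so the keys are raw utf-8 bytes)
$pairs = array(
    '’' => '&rsquo;',
    '—' => '&mdash;',
    'é' => '&eacute;',
    'π' => '&pi;',
);

foreach ($pairs as $char => $entity) {
    // strlen() returns the number of bytes, which is the size on the wire
    printf("%s: %d bytes as utf-8, %d bytes as %s\n",
        $char, strlen($char), strlen($entity), $entity);
}
// ’: 3 bytes as utf-8, 7 bytes as &rsquo;
// —: 3 bytes as utf-8, 7 bytes as &mdash;
// é: 2 bytes as utf-8, 8 bytes as &eacute;
// π: 2 bytes as utf-8, 4 bytes as &pi;
```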

so with that, i’ll be switching back to htmlspecialchars() but decoding with html_entity_decode(), and using the actual characters in my files. i don’t foresee any problems with this unless someone visits my site with a browser that can’t handle utf-8.
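roughly, the plan could look like the sketch below. the two helper names are hypothetical (only the php built-ins are real), and it passes 'UTF-8' explicitly since older php versions default to a different character set.

```php
<?php
// hypothetical helper: escape text on the way out to the page
function escape_for_html($text) {
    // leaves ’ and — alone, escapes only & < > " '
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// hypothetical helper: turn stored html back into plain text
function unescape_from_html($html) {
    // also converts leftover named entities such as &rsquo; into real characters
    return html_entity_decode($html, ENT_QUOTES, 'UTF-8');
}

// made-up example of text that was stored the old htmlentities() way
$old = 'i&rsquo;m switching &mdash; &lt;again&gt;';

$plain = unescape_from_html($old);  // i’m switching — <again>
$safe = escape_for_html($plain);    // i’m switching — &lt;again&gt;

echo $safe, "\n";
```

decoding with html_entity_decode() rather than htmlspecialchars_decode() means anything already saved with &rsquo; or &mdash; in it still comes back as the real character.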