html entities with utf-8

posted in internet, html, unicode, php, apr 24, 2008

a while ago i started switching my php code from using htmlspecialchars() to htmlentities(), which as i understand it converts a much larger group of characters into html entity form (for example, “—” becomes “&mdash;”).  i decided to go with that partially because there’s an html_entity_decode() and partially because i was thinking that it’s better to have “&mdash;” in my html than “—.”  i also started using smart quotes in my pages, where i actually enter them in the html as &ldquo; &rdquo; &lsquo; and &rsquo;.
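to make the difference concrete, here’s a minimal sketch of the two functions run on the same string.  the sample text and variable names are made up for illustration, and it assumes the file itself is saved as utf-8.

```php
<?php
// sample text with a right single quote (U+2019), an em dash (U+2014) and some markup
// (assumes this file is saved as utf-8)
$text = "i’m here — <b>now</b>";

// htmlspecialchars() only escapes & < > " (and ' when ENT_QUOTES is set)
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() also converts every character that has a named entity
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // i’m here — &lt;b&gt;now&lt;/b&gt;
echo $entities, "\n";  // i&rsquo;m here &mdash; &lt;b&gt;now&lt;/b&gt;
```

passing 'UTF-8' explicitly matters on older php versions, where the default charset for these functions was not utf-8.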

more recently i got to thinking about the fact that i serve my html with the utf-8 character set, yet i’m converting the characters that fall outside printable ascii (character codes 32 through 126) into html entities, which themselves are composed of ascii characters.  so why bother even specifying a character set if i’m not using it?  i’m also making my html harder for me to read because it looks like i&rsquo;m instead of i’m, for example.

looking at utf-8, the right single quote character (U+2019 in unicode) gets encoded using 3 bytes.  the html entity &rsquo; is all ascii, which is one byte per character, for a total of 7 bytes.  so if i instead actually put the character into a utf-8 html document, i save 4 bytes on that one character.  all html entities start with an ampersand and end with a semicolon, so with at least one character in between that’s 3 bytes already, and the shortest named entities i know of (like &lt; and &gt;) have two-character names, which makes 4.  the most bytes any single character can take up in utf-8 is 4, so the actual character is never bigger than its entity and is usually smaller.
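the byte counts are easy to check, since strlen() in php counts bytes rather than characters.  this is just a quick sketch with made-up variable names, again assuming the file is saved as utf-8.

```php
<?php
// strlen() counts bytes, not characters, which is exactly what we want here
$char   = "’";        // U+2019, assuming this file is saved as utf-8
$entity = "&rsquo;";  // the same character written as a named entity

$charBytes   = strlen($char);    // bytes in the utf-8 encoding
$entityBytes = strlen($entity);  // one byte per ascii character

printf("character: %d bytes (%s)\n", $charBytes, bin2hex($char)); // character: 3 bytes (e28099)
printf("entity:    %d bytes\n", $entityBytes);                    // entity:    7 bytes
printf("saved:     %d bytes\n", $entityBytes - $charBytes);       // saved:     4 bytes
```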

so with that, i’ll be switching back to htmlspecialchars() but decoding with html_entity_decode(), and using the actual characters in my files.  i don’t foresee any problems with this unless someone visits my site with a browser that can’t handle utf-8.
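roughly, the round trip would look like the sketch below.  the strings and variable names are hypothetical, and the point about older text is my own assumption: html_entity_decode() should still cope with anything that was stored back when htmlentities() was in use.

```php
<?php
// assumes this file is saved as utf-8
$input = "“it’s” <em>fine</em> & dandy";

// escape only the characters that are special in html; the smart quotes stay as-is
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// decoding gets the original string back
$decoded = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');

// html_entity_decode() also handles named entities that htmlspecialchars() never produces
$legacy = html_entity_decode("i&rsquo;m &mdash; done", ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";  // “it’s” &lt;em&gt;fine&lt;/em&gt; &amp; dandy
echo $legacy, "\n";   // i’m — done
```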
