PHP Unicode support - or the lack thereof

Well, I just had the pleasure of fixing special character (umlaut) handling in a legacy PHP application. In short: it has been a while since I have seen as many i18n issues as I ran into in PHP (version 5) during the last hour:

  • PHP strings are just plain byte arrays with no encoding attached. Their content is non-portable, because how the bytes are interpreted depends on the current default encoding.

  • The same applies to the representation built by serialize. It stores each string as a byte-length prefix followed by the raw bytes, without any encoding information.

  • Most PHP (string) functions have no clue about Unicode. For a detailed list including each function’s risk level, refer to: http://www.phpwact.org/php/i18n/utf-8
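
To make the points above concrete, here is a minimal sketch. It assumes the mbstring extension is installed, and it writes the umlaut as explicit byte escapes so the result does not depend on the encoding of the source file. The variable names are mine and purely illustrative.

```php
<?php
// The letter "ü" as explicit bytes, independent of the source file's encoding.
$utf8   = "\xC3\xBC"; // UTF-8: two bytes
$latin1 = "\xFC";     // ISO-8859-1: one byte

// strlen() counts bytes, not characters.
$utf8Bytes   = strlen($utf8);   // 2
$latin1Bytes = strlen($latin1); // 1

// mb_strlen() counts characters, but only because the encoding is passed in explicitly.
$utf8Chars = mb_strlen($utf8, 'UTF-8'); // 1

// serialize() records the byte length and the raw bytes, nothing about the encoding.
$serializedUtf8   = serialize($utf8);   // s:2:"..."; with two raw bytes inside the quotes
$serializedLatin1 = serialize($latin1); // s:1:"..."; with one raw byte inside the quotes

// A byte-oriented function can tear a multi-byte character apart.
$reversed        = strrev($utf8);                         // bytes are now 0xBC 0xC3
$reversedIsValid = mb_check_encoding($reversed, 'UTF-8'); // false

// Converting only works if the source encoding is known from somewhere else.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'); // same bytes as $utf8
```

Both serialized values describe the same letter, yet they differ in length and content. A value serialized under one encoding can therefore not be reliably interpreted under another unless the encoding is recorded somewhere outside the string.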

Note to self: Never ever use PHP for a new project.