In our eBook "PHP 7 Explained", we have already explained why the successor of PHP 5 is PHP 7 rather than PHP 6.
Since the attempt to create a Unicode-based PHP implementation failed, PHP 7, just like PHP 5, does not handle Unicode strings natively. The commonly used UTF-8 encoding, for example, is a multibyte encoding, as opposed to ASCII, where each character is represented by a single byte. Calculating the length of an ASCII string is trivial: just count the bytes. Calculating the length of a UTF-8 encoded string is more challenging, because UTF-8 is a variable-length encoding in which each character (code point, to be exact) is represented by one to four bytes. For ASCII characters, everything works smoothly, because UTF-8 is a superset of ASCII. The problems start with non-ASCII characters:
var_dump(strlen('ö'));
This simple script, at least when saved as UTF-8, will produce a most interesting result:
int(2)
Encoding this single German umlaut as UTF-8 takes two bytes. Since PHP does not know about UTF-8 (or Unicode in general), the built-in strlen() function just counts bytes, which is the wrong result if you expected the number of characters.
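To make the "one to four bytes" point concrete, here is a small sketch that feeds one code point from each length class to strlen(). The sample characters are our own choice, and the script assumes that strlen() is not overloaded. It uses the "\u{...}" escape syntax (available since PHP 7.0), which always produces UTF-8 bytes, so the result does not depend on how the file itself is saved.

```php
<?php
// Illustrative sketch: how many bytes does one code point occupy in UTF-8?
// Assumes strlen() counts bytes, i.e. mbstring.func_overload is not active.
$samples = [
    'U+0061'  => "a",          // Latin small letter a
    'U+00F6'  => "\u{F6}",     // o with diaeresis (German umlaut)
    'U+20AC'  => "\u{20AC}",   // euro sign
    'U+1F600' => "\u{1F600}",  // grinning face emoji
];

$byteCounts = [];
foreach ($samples as $codePoint => $char) {
    // strlen() sees only bytes, never characters
    $byteCounts[$codePoint] = strlen($char);
    printf("%-8s %d byte(s)\n", $codePoint, $byteCounts[$codePoint]);
}
```

The four lines of output report 1, 2, 3, and 4 bytes, respectively.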
There are commonly used PHP extensions, for example iconv or mbstring ("multibyte string"), that offer Unicode-enabled string handling functions, for example mb_strlen() (which, of course, requires the mbstring extension):
var_dump(mb_strlen('ö'));
This function counts code points rather than bytes and thus yields the correct result:
int(1)
You can do the same with the iconv extension, if you have that one installed:
var_dump(iconv_strlen('ö'));
Unsurprisingly, this function yields the same result:
int(1)
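In both cases, we are cheating a little, since we are not specifying that our string is UTF-8 encoded. This works because both functions fall back to PHP's internal encoding when no encoding is passed, and since PHP 5.6 that setting defaults to UTF-8 (via the default_charset directive). UTF-8 is also, by convention, the assumed default encoding pretty much everywhere on the Internet.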
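A version that does not cheat passes the encoding as the second argument. The following sketch assumes that both the mbstring and the iconv extension are loaded. It also shows what happens when the stated encoding does not match the actual bytes.

```php
<?php
// Sketch: state the encoding explicitly instead of relying on defaults.
// Assumes the mbstring and iconv extensions are loaded.
$umlaut = "\u{F6}"; // "ö" as UTF-8, i.e. the two bytes 0xC3 0xB6

$mbUtf8    = mb_strlen($umlaut, 'UTF-8');      // counts code points
$iconvUtf8 = iconv_strlen($umlaut, 'UTF-8');   // same, using iconv
$mbLatin1  = mb_strlen($umlaut, 'ISO-8859-1'); // wrong encoding: every byte is one character

var_dump($mbUtf8, $iconvUtf8, $mbLatin1);
```

This prints int(1), int(1), and int(2). The last value shows that a multibyte-aware function is only as good as the encoding you tell it to use.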
Now we will add magic into the mix, and new problems will arise. If you are using the mbstring extension, you can use the php.ini directive mbstring.func_overload to overload built-in PHP functions with the multibyte-enabled mb_ functions. Depending on the value you set mbstring.func_overload to, the mail() function, string functions, and regular expressions (unfortunately not the preg_ ones, but the removed ereg_ ones) can be overloaded.
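The value of the directive is a bitmask: 1 stands for mail(), 2 for the string functions, and 4 for the regular expression functions, so 7 overloads everything. As a rough sketch, the following script decodes the setting of the running PHP process. Note that describeFuncOverload() is a hypothetical helper name of our own, not something PHP provides.

```php
<?php
// Sketch: decode the mbstring.func_overload bitmask.
// describeFuncOverload() is a hypothetical helper, not part of PHP.
function describeFuncOverload(int $mode): array
{
    $groups = [
        1 => 'mail()',
        2 => 'string functions',
        4 => 'ereg-style regular expression functions',
    ];

    $active = [];
    foreach ($groups as $bit => $label) {
        if ($mode & $bit) {
            $active[] = $label;
        }
    }

    return $active;
}

// ini_get() returns false where the directive does not exist;
// casting false to int yields 0, which means "nothing overloaded".
$current = (int) ini_get('mbstring.func_overload');
print_r(describeFuncOverload($current));
```

On a sanely configured system this prints an empty array.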
The problem with this magic is that your program cannot know whether PHP's string functions operate with or without support for multibyte characters. And you certainly do not want to wrap an if around every string function. So just like with magic quotes, which we wrote about earlier, using mbstring.func_overload is not a good idea. That is why this php.ini directive has been deprecated in PHP 7.2, and will likely be removed in PHP 8.
Even though it potentially means a lot of work, you have to walk through your code and make explicit which encodings you work with. Do not wait until PHP 8, because code that still depends on mbstring.func_overload would leave you unable to upgrade. You effectively want to get started with your PHP 8 migration right now.
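What "explicit" can look like in practice is sketched below, assuming the mbstring extension is loaded. Both helpers are hypothetical names of our own. One always counts bytes, the other always counts UTF-8 code points, and neither result depends on the mbstring.func_overload setting, because the encoding is always passed.

```php
<?php
// Sketch: helpers that state their encoding explicitly.
// byteLength() and utf8Length() are hypothetical names; mbstring is assumed.

/** Number of bytes, even where strlen() has been overloaded. */
function byteLength(string $data): int
{
    // '8bit' makes mbstring treat every byte as one character
    return mb_strlen($data, '8bit');
}

/** Number of code points in a string that is known to be UTF-8. */
function utf8Length(string $text): int
{
    return mb_strlen($text, 'UTF-8');
}

$word = "K\u{F6}ln"; // "Köln" as UTF-8

var_dump(byteLength($word)); // int(5)
var_dump(utf8Length($word)); // int(4)
```

Use the byte count wherever sizes matter, for example for a Content-Length header, and the code point count wherever you reason about characters.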