Internationalized domain names, are you ready?
Since May 11 TLDs (top-level domain names) have been added. For these to work, a lot of applications will have to be fixed.
Many email-validation scripts might use an approach like this:
$ok = preg_match('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i', $email);
This one is pretty simple: it matches the most common address formats, as long as the TLD (.com, .nl, .uk, etc.) is between 2 and 6 letters long. For a bit more sophistication you might want to restrict the TLD to a list of known values:
$ok = preg_match('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$/i',$email);
Note: both of these regexes were taken from regular-expressions.info. It's the top Google hit, and the examples are decent.
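To make the limits of these two patterns concrete, here is a small sketch that runs both against a handful of addresses. The sample addresses and the array names are picked purely for illustration; .travel and .photography are real TLDs that happen to be missing from the second pattern's list.

```php
<?php
// The two patterns quoted above, unchanged.
$patterns = [
    'simple' => '/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i',
    'strict' => '/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$/i',
];

// Sample addresses, chosen only for illustration.
$samples = [
    'user@example.com',
    'user@example.co.uk',
    'user@example.travel',
    'user@example.photography',
    'user@例子.测试',
];

// Record, per address, whether each pattern accepts it.
$results = [];
foreach ($samples as $address) {
    foreach ($patterns as $name => $pattern) {
        $results[$address][$name] = (preg_match($pattern, $address) === 1);
    }
}

foreach ($results as $address => $row) {
    printf(
        "%-28s simple: %-3s strict: %s\n",
        $address,
        $row['simple'] ? 'yes' : 'no',
        $row['strict'] ? 'yes' : 'no'
    );
}
```

On these samples, .travel passes the first pattern but not the second, while .photography and the non-ASCII domain are rejected by both.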
The new TLDs use non-ASCII characters, and they might become aliases for existing top-level domains or be new TLDs altogether. Here are the currently working examples:
- http://مثال.إختبار - Arabic
- http://例子.测试 - Chinese (simplified)
- http://例子.測試 - Chinese (traditional)
- http://παράδειγμα.δοκιμή - Greek
- http://उदाहरण.परीक्षा - Hindi
- http://例え.テスト - Japanese
- http://실례.테스트 - Korean
- http://مثال.آزمایشی - Persian
- http://пример.испытание - Russian
At first sight these look like regular UTF-8 characters, but if you look at the source code of this page, you'll notice that the links are actually encoded differently.
The Korean URL http://실례.테스트 is actually encoded as http://xn--9n2bp8q.xn--9t4b11yi5a/. This encoding is called Punycode.
If you want to support these new URLs (and thus these domain names in email addresses), you need to support Punycode. You will likely receive UTF-8 encoded domain names in email addresses (example@실례.테스트), but internally you must make sure that you only deal with the Punycode representation.
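As a rough sketch of that internal guard, the two helpers below check whether a host name is already pure ASCII and pick out the labels that carry the "xn--" prefix Punycode-encoded labels start with. Both function names are hypothetical and only meant as an illustration; they do not do any conversion themselves.

```php
<?php
// Hypothetical helpers for illustration; the names are not from any library.

// True when the host name consists only of printable ASCII bytes,
// i.e. it contains no raw UTF-8 and can be handled internally as-is.
function is_ascii_hostname(string $host): bool
{
    return preg_match('/^[\x21-\x7E]+\z/', $host) === 1;
}

// Returns the labels of a host name that start with the "xn--" prefix.
function punycode_labels(string $host): array
{
    $labels = explode('.', $host);
    return array_values(array_filter($labels, function ($label) {
        return stripos($label, 'xn--') === 0;
    }));
}

var_dump(is_ascii_hostname('xn--9n2bp8q.xn--9t4b11yi5a')); // already in ASCII form
var_dump(is_ascii_hostname('실례.테스트'));                 // still raw UTF-8
print_r(punycode_labels('xn--9n2bp8q.xn--9t4b11yi5a'));
```

A check like this only tells you whether conversion is still needed; it says nothing about whether the domain actually exists.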
This translation is also what modern browsers do. If you paste "http://xn--9n2bp8q.xn--9t4b11yi5a/" directly into the Firefox address bar, it will show you the UTF-8 characters instead. Firefox re-encodes the name to Punycode, though, and uses that format for HTTP requests.
Really, the best way to check for valid email addresses is to use a very liberal regex, and then verify with a simple MX record lookup that a mail server exists for the given domain. This example is an expansion on the first regex.
$email = 'user@example.com'; // placeholder address
// Liberal pattern: capture the host name and allow digits and hyphens in the TLD.
if (preg_match('/^[A-Z0-9._%+-]+@([A-Z0-9.-]+\.[A-Z0-9-]{2,})$/i', $email, $matches)) {
    $hostname = $matches[1];
    // getmxrr() returns true when at least one MX record was found.
    if (getmxrr($hostname, $hosts)) {
        echo "Host has an MX record\n";
    } else {
        echo "Host does not exist or does not have an MX record\n";
    }
} else {
    echo "Email address did not match regular expression\n";
}
The preceding code does not convert UTF-8 to Punycode, though. There's not yet an easy native way in PHP to do this, but PEAR's Net_IDNA2 provides a way. The implementation seems very complex, though, and leaves me wondering if there's an easier way to go about it.
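One candidate for an easier route, assuming the optional intl extension is installed (it is not part of every PHP build), is that extension's idn_to_ascii() function. The sketch below is untested against Net_IDNA2 and is not a drop-in replacement for it. The helper name email_domain_to_punycode() is made up for this example, and it simply gives up when intl is missing.

```php
<?php
// Hypothetical helper: returns the address with its domain converted to
// Punycode, or null when the address is malformed or cannot be converted.
function email_domain_to_punycode(string $email): ?string
{
    $at = strrpos($email, '@');
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return null; // no "@", empty local part, or empty domain
    }
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Pure-ASCII domains need no conversion.
    if (!preg_match('/[^\x00-\x7F]/', $domain)) {
        return $local . '@' . $domain;
    }

    // Assumption: the intl extension is loaded. Without it, give up.
    if (!function_exists('idn_to_ascii')) {
        return null;
    }

    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ascii === false ? null : $local . '@' . $ascii;
}

$converted = email_domain_to_punycode('example@실례.테스트');
echo ($converted === null ? 'could not convert' : $converted), "\n";
```

With intl available, the Korean example should come out with the same xn-- host name quoted earlier, and the result can then be fed into the regex and MX lookup above.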