Fixing Perl's Unicode problems on the command line on Windows: A trilogy in N parts

I have used Perl on Windows for decades without being seriously hampered by any of its past or current limitations. Still, it would be nice to solve some of the issues, if only so I can post cute screenshots.

Here are some problems with Perl and Unicode on the command line in Windows.

1. Can't pass interesting characters to perl on the command line

You can't pass characters that are outside of the Windows code page to perl on the command line. It doesn't matter whether you have set the code page to 65001 and use the -CA command line argument to perl: Because perl uses main instead of wmain as the entry point, it never sees anything other than characters in the ANSI code page.

For example:

$ chcp 65001
$ perl -CAS -E "say for @ARGV" şey
sey

That's because ş does not appear in CP 437, which is what my laptop is using. By the time the argument reaches perl's internals, it has already become s.

On the other hand,

$ perl -CAS -E "say for @ARGV" ünür
Malformed UTF-8 character (unexpected end of string) in say at -e line 1.
�n�r

because ü does appear in CP 437, so it remains intact. But we lied to perl with -CA: the command line is not UTF-8 encoded.

This "works":

$ perl -CS -E "say for @ARGV" ünür
ünür

but not for the right reasons.
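
To see why -CA trips over that argument, it helps to check whether the bytes perl receives are well-formed UTF-8 at all. The following is a minimal, portable sketch, not anything perl itself does: is_valid_utf8 is a hypothetical helper name, and the byte values are assumptions about what a single-byte code page would deliver (ü is 0x81 in CP 437 and 0xFC in Windows-1252).

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * forms, surrogates, and code points above U+10FFFF. */
int
is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char lead = buf[i];
        unsigned long cp;
        size_t need, k;

        if (lead < 0x80) {                    /* plain ASCII */
            ++i;
            continue;
        }
        if ((lead & 0xE0) == 0xC0) {          /* 2-byte sequence */
            need = 1; cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {   /* 3-byte sequence */
            need = 2; cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {   /* 4-byte sequence */
            need = 3; cp = lead & 0x07;
        } else {
            return 0;    /* continuation byte or invalid lead byte */
        }

        if (len - i <= need)
            return 0;    /* sequence runs past the end of the buffer */

        for (k = 1; k <= need; ++k) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        if ((need == 1 && cp < 0x80) ||
            (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000))
            return 0;    /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;    /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return 0;

        i += need + 1;
    }
    return 1;
}
```

Under those assumptions, both { 0x81, 'n', 0x81, 'r' } and { 0xFC, 'n', 0xFC, 'r' } are rejected, while the real UTF-8 encoding of ünür, { 0xC3, 0xBC, 'n', 0xC3, 0xBC, 'r' }, is accepted.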

2. Can't use interesting characters in perl one-liners

For example:

$ perl -Mutf8 -CS -E "say 'şey'"
sey

Again, by the time perl sees the source of the one-liner, it is too late for -Mutf8.

3. Can't use interesting characters in script names

For the same reason:

$ type şey
use v5.24;
use utf8;
say 'şey';


$ perl şey
Can't open perl script "sey": No such file or directory

This one comes with an added caveat: even if perl did get the name of the file right, it still would not be able to run the script, because it would open it through the ANSI API, which cannot deal with characters outside of the current code page.

4.a. Can't access environment variables with interesting characters in their names;

4.b. Can't access values of environment variables if they contain interesting characters

For example:

$ set iş=kârlı

$ set hava=karlı

$ echo %iş%
kârlı

$ echo %hava%
karlı

$ type t.pl
use v5.24;
use utf8;

say $ENV{$_} for qw(iş hava);

$ perl t.pl

karli

So, business is profitable, and the weather is snowy, but we can't look up $ENV{iş} and the value of $ENV{hava} is misspelled.

5. Can't read lines containing interesting characters from the console

For example:

$ perl -e "print while <>"
hava yağmurlu mu karlı mı olacak?
hava yagmurlu mu karli mi olacak?

Depending on your Windows version, this script may terminate prematurely.

6. Using standard Perl functions, interacting with data files with interesting characters in their names is weird

perl tries to access files using their short names, which doesn't work if you have disabled short name creation. Even when short names are available, the result is ugly:

$ dir
...
2017-02-17  09:45 AM                38 şey

$ perl -E "opendir $d, '.'; say for grep !/^\./, readdir $d"
EY61AE~1

This one is not a huge problem, because one can use Win32::LongPath to deal with the issue.

In fact, none of these are huge problems: I have done useful work with Perl on Windows for decades despite the occasional glitch.

However, they are things I thought I should make some effort to fix some day. After I fixed Perl6's Unicode issues on the command line in Windows, I felt slightly guilty that I had not given Perl the same TLC. Maybe "some day" has arrived.

The easiest to fix among the problems I mentioned above is the case of command line arguments, and that's what I am going to start with. I will dig deeper in subsequent posts.

I approached this problem a little sideways: I decided to leverage Perl's support for UTF-8 encoded command line arguments. I just had to modify the arguments before perl's internals saw them to make sure they were UTF-8 encoded. To try out my idea, I first wrote a wrapper for perl.exe. The wrapper was very simple: It used wmain as its entry point, and constructed a UTF-8 encoded command line argument array with -CA inserted between the first and the second elements to invoke perl.exe. It was ugly, but it worked in the sense that any interesting characters I used in command line arguments made it to the Perl side of things intact.
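
The array bookkeeping in such a wrapper is easy to get off by one, so here is a portable sketch of just that step, leaving out the wmain entry point and the UTF-16 to UTF-8 conversion. insert_flag is a hypothetical name chosen for illustration, not code from the wrapper itself.

```c
#include <stdlib.h>

/* Build a new NULL-terminated argument vector with `flag` inserted
 * between argv[0] and argv[1]. Only the pointer array is new; the
 * strings are shared with the input. Returns NULL on failure. */
char **
insert_flag(int argc, char **argv, const char *flag, int *new_argc)
{
    char **out;
    int i;

    if (argc < 1)
        return NULL;

    /* argc original entries + 1 inserted flag + 1 terminating NULL */
    out = malloc(((size_t)argc + 2) * sizeof *out);
    if (!out)
        return NULL;

    out[0] = argv[0];
    out[1] = (char *)flag;

    /* The remaining argc - 1 arguments shift up by one slot. */
    for (i = 1; i < argc; ++i)
        out[i + 1] = argv[i];

    out[argc + 1] = NULL;
    *new_argc = argc + 1;
    return out;
}
```

Given { "perl", "-E", "say for @ARGV" } and the flag "-CA", this yields { "perl", "-CA", "-E", "say for @ARGV", NULL } with a new count of 4. Note that only argc - 1 of the original arguments are copied after the flag.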

perl itself comes with a wrapper, runperl.c which becomes perlmain.c during build. This would be the ideal place to transform both the command line arguments and the environment array perl sees before it sets up anything. Basically, the idea is to always run perl with UTF-8 encoded arguments. This keeps any changes we want to make to Perl's internals minimal. Of course, Windows does not have an API for console programs to receive their arguments as UTF-8 encoded strings. Instead, we use wmain as the entry point so we receive the command line arguments and the environment as UTF-16 encoded strings. Then, we create UTF-8 encoded command line argument and environment arrays using standard Windows APIs. The patch is rather straightforward:

diff --git a/win32/runperl.c b/win32/runperl.c
index 2157224..9cd3c7c 100644
--- a/win32/runperl.c
+++ b/win32/runperl.c
@@ -2,6 +2,11 @@
 #include <crtdbg.h>
 #endif
 
+#include <windows.h>
+#include <fcntl.h>
+#include <io.h>
+#include <stdlib.h>
+
 #include "EXTERN.h"
 #include "perl.h"
 
@@ -21,9 +26,54 @@ int _CRT_glob = 0;
 
 #endif
 
+static void
+error_exit(const wchar_t *msg)
+{
+    int err = GetLastError();
+    _setmode(_fileno(stderr), _O_U16TEXT);
+    fwprintf(stderr, L"%s: %d\n", msg, err);
+    exit( err );
+}
+
+static char *
+utf8_encode_wstring(const wchar_t *src)
+{
+    char *encoded;
+    int len;
+
+    len = WideCharToMultiByte( CP_UTF8, WC_ERR_INVALID_CHARS, src,
+            -1, NULL, 0, NULL, NULL);
+
+    encoded = malloc(len + 1);
+    if (!encoded) {
+        error_exit(L"Failed to allocate memory for UTF-8 encoded string");
+    }
+
+    (void) WideCharToMultiByte( CP_UTF8, WC_ERR_INVALID_CHARS, src,
+            -1, encoded, len, NULL, NULL);
+
+    return encoded;
+}
+
+static void
+utf8_encode_warr(const wchar_t **warr, const int n, char **arr)
+{
+    int i;
+
+    for (i = 0; i < n; ++i) {
+        arr[i] = utf8_encode_wstring(warr[i]);
+    }
+
+    return;
+}
+
 int
-main(int argc, char **argv, char **env)
+wmain(int argc, wchar_t **wargv, wchar_t **wenv)
 {
+    char **argv;
+    char **env;
+    int env_count;
+
 #ifdef _MSC_VER
     /* Arrange for _CrtDumpMemoryLeaks() to be called automatically at program
      * termination when built with CFG = DebugFull. */
@@ -36,6 +86,30 @@ main(int argc, char **argv, char **env)
     _CrtSetBreakAlloc(-1L);
 #endif
 
+    ++argc; /* we are going to insert -CA between argv[0] and argv[1] */
+    argv = malloc((argc + 1) * sizeof(*argv));
+    if (!argv) {
+        error_exit(L"Failed to allocate memory for UTF-8 encoded argv");
+    }
+
+    argv[0] = utf8_encode_wstring(wargv[0]);
+    argv[1] = "-CA";
+    argv[ argc ] = NULL;
+
+    utf8_encode_warr(wargv + 1, argc - 2, argv + 2);
+
+    env_count = 0;
+    while ( wenv[env_count] ) {
+        ++env_count;
+    }
+    env = malloc( (env_count + 1) * sizeof(*env));
+    if (!env) {
+        error_exit(L"Failed to allocate memory for UTF-8 encoded environment");
+    }
+    env[ env_count ] = NULL;
+
+    utf8_encode_warr(wenv, env_count, env);
+
     return RunPerl(argc, argv, env);
 }
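
For readers without a Windows build environment, the conversion that the patch delegates to WideCharToMultiByte with CP_UTF8 can be approximated portably. This is only a sketch of the encoding rules under my own assumptions, not a replacement for the API call: utf16_to_utf8 is a hypothetical name, and it reports an unpaired surrogate as an error, which is roughly the role WC_ERR_INVALID_CHARS plays in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode n UTF-16 code units from src as UTF-8 into dst (capacity cap).
 * Returns the number of bytes written, or (size_t)-1 on an unpaired
 * surrogate or if dst is too small. No terminating NUL is added. */
size_t
utf16_to_utf8(const uint16_t *src, size_t n, unsigned char *dst, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < n) {
        uint32_t cp = src[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i >= n || src[i] < 0xDC00 || src[i] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;    /* low surrogate without a high one */
        }

        if (cp < 0x80) {
            if (cap - o < 1) return (size_t)-1;
            dst[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            if (cap - o < 2) return (size_t)-1;
            dst[o++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (cap - o < 3) return (size_t)-1;
            dst[o++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            if (cap - o < 4) return (size_t)-1;
            dst[o++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

With the code units { 0x015F, 'e', 'y' } for şey, the sketch writes the four bytes C5 9F 65 79, which is the form perl expects to find in its argument array once -CA is in effect.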

Here's what this change gets us:

$ ..\perl -Mopen=:std,:utf8 -E "say for @ARGV" iş
iş

or

$ ..\perl -CAS -E "say for @ARGV" iş
iş

Yes, if we are going to use -CS we must actually use -CAS because, apparently, -C flags are not cumulative. That is, it looks like perl -CA -CS is equivalent to perl -CS and not perl -CAS. I have considered whether to make -CAS the default, but that is of doubtful usefulness because it would require everyone using Perl to use the UTF-8 codepage in the console. That is a bigger change than transparently converting anything passed on the command line to UTF-8.

These changes solve only part of the problem:

$ ..\perl.exe şey
Can't open perl script "şey": No such file or directory

perl looks for the correct script file, but can't open it because it uses the ANSI functions in the Windows API.

$ ..\perl.exe -Mutf8 -E "say $ENV{iş}"
kârlı

But, of course,

$ ..\perl.exe -Mutf8 -Mopen=:std,:utf8 -E "say $ENV{iş}"
kÃ¢rlÄ±

By the way, I know why kârlı gets double UTF-8 encoded, I know what needs to be fixed, but, as the title says, this post is the first in a series, and I will discuss those issues and their fixes in follow-up posts. The most important criterion for me is to change the smallest number of lines possible to get correct behavior. Otherwise, making changes all over the place in such a large codebase with a long history is bound to get one in trouble by breaking things.
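
To make "double UTF-8 encoded" concrete in the meantime, here is a small sketch of the mechanics. It assumes, and this is only an assumption for illustration, that a string already holding UTF-8 bytes is treated as a sequence of Latin-1 characters and encoded to UTF-8 a second time. latin1_to_utf8 is a hypothetical name; the actual diagnosis belongs to the follow-up posts.

```c
#include <stddef.h>

/* Encode each byte of src as if it were a Latin-1 code point
 * (U+0000..U+00FF). dst must have room for 2 * n bytes.
 * Returns the number of bytes written. */
size_t
latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i, o = 0;

    for (i = 0; i < n; ++i) {
        if (src[i] < 0x80) {
            dst[o++] = src[i];                    /* ASCII is unchanged */
        } else {
            /* U+0080..U+00FF always take two bytes in UTF-8. */
            dst[o++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[o++] = (unsigned char)(0x80 | (src[i] & 0x3F));
        }
    }
    return o;
}
```

Feeding it the seven UTF-8 bytes of kârlı (6B C3 A2 72 6C C4 B1) produces eleven bytes (6B C3 83 C2 A2 72 6C C3 84 C2 B1): each non-ASCII byte of the first encoding has itself been encoded as a two-byte character.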

The good news is I got very few test failures due to these changes.

I know at least some people think no one ought to write about anything unless they have filed bug reports and sent patches in triplicate with the blue copy stamped and filed with the Open Source Planning Agency or something. Rest assured, I will do all that once I have a complete patch set I can submit with confidence (i.e., when the set of tests failing with the patched perl is identical to the set of tests failing with blead perl).

Along the way, I am going to share how I arrived at that state.

You can discuss this post on r/perl.