Notes on Unicode on the command line in Windows with applications to Perl and Perl 6

Handling of interesting characters on the command line in Windows or DOS environments has never been an annoyance-free experience. Heck, 30 years ago, I was patching lookup tables in keyboard drivers for IBM PCs and compatibles at METU so we could write stuff using Turkish characters. At the time, there wasn't even a standard Turkish keyboard layout. So, we have come a long way.

If you are writing a C program from scratch, it is simple to accept all sorts of characters on the command line and work solely with UTF-8 encoded stuff. Instead of main, use wmain:

int wmain(int argc, wchar_t *argv[], wchar_t *envp[]) {

Your program will now receive command line arguments in UTF-16. You can convert the argv and envp arrays to UTF-8 encoding and just work with them or stick with the wchar_t and compatible functions, depending on which makes the most sense for your specific situation.
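The conversion itself is mechanical. On Windows, one would normally call WideCharToMultiByte with CP_UTF8, but as a self-contained illustration of what that conversion amounts to, here is a hand-rolled UTF-16 to UTF-8 transcoder. It operates on uint16_t rather than wchar_t so it also compiles where wchar_t is not 16 bits; the function name and error handling are mine, not from any real codebase:

```c
#include <stdint.h>
#include <stdlib.h>

/* Transcode a NUL-terminated UTF-16 string to freshly allocated UTF-8.
 * Unpaired surrogates are replaced with U+FFFD. Returns NULL on
 * allocation failure. Illustrative sketch only. */
static char *utf16_to_utf8(const uint16_t *in)
{
    size_t len = 0;
    for (const uint16_t *p = in; *p; ++p) ++len;
    char *out = malloc(4 * len + 1);   /* worst case: 4 bytes per unit */
    if (!out) return NULL;
    char *o = out;
    for (size_t i = 0; i < len; ++i) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* combine a high/low surrogate pair into one code point */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;               /* unpaired surrogate */
        }
        if (cp < 0x80) {
            *o++ = (char)cp;
        } else if (cp < 0x800) {
            *o++ = (char)(0xC0 | (cp >> 6));
            *o++ = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *o++ = (char)(0xE0 | (cp >> 12));
            *o++ = (char)(0x80 | ((cp >> 6) & 0x3F));
            *o++ = (char)(0x80 | (cp & 0x3F));
        } else {
            *o++ = (char)(0xF0 | (cp >> 18));
            *o++ = (char)(0x80 | ((cp >> 12) & 0x3F));
            *o++ = (char)(0x80 | ((cp >> 6) & 0x3F));
            *o++ = (char)(0x80 | (cp & 0x3F));
        }
    }
    *o = '\0';
    return out;
}
```

Fed the UTF-16 code units for "kâr", this produces the byte sequence 6b c3 a2 72 — the same UTF-8 bytes the xxd dump further below shows.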

However, when you are dealing with a script interpreter written mostly for *nix folks, things get hairier. For example, the following behavior always annoyed me:

C:\> chcp 65001
Active code page: 65001

C:\> perl -CS -E "say 'kârlı iş'"
kârli is

C:\> perl -CS -E "say 'kârlı iş'"|xxd
00000000: 6bc3 a272 6c69 2069 730d 0a              k..rli is..

What happened there? Why do we see "â" but not "ı" or "ş"?

Simple: perl does not define a wmain, but uses the standard main function as its entry point. Therefore, its arguments are plain chars corresponding to entries in the current ANSI code page. Windows looks at the string passed to the program and tries to map the arguments to their best representation using the characters available in that code page. In my case, this is CP 437 (I have never used anything other than the US code page simply because, throughout the decades, it was easier to give up using "ş" in filenames than to deal with various uncertainties in various incarnations of DOS and Windows). As luck would have it, "â" does exist in CP 437 at 0xE2. Using -CS, I told perl to encode the output in UTF-8, and I had set the console code page to UTF-8, so I get the correctly encoded output displayed correctly. Phew!

But the string lost its original meaning in the process: a "profitable business" has become "profitable soot" (most Turks are not fooled by accidental substitutions of "i" for "ı" :-) That is because neither "ı" nor "ş" is in CP 437.

This behavior is not specific to perl, but Perl is the language I use most often.

What happens if we ask perl to execute a file whose name contains another character that does not exist in the ANSI code page?

C:\> perl yağmur
Can't open perl script "yagmur": No such file or directory

Yup, the Turkish soft g, "ğ" does not exist in CP 437 either, so "g" is substituted in its place with predictable effects.

None of this is original or new. And none of it prevented me from doing extremely useful work in many languages on Windows using Perl by avoiding the trouble spots. I avoided investing time into figuring out a solution, because I was convinced such a fix would have to touch way too many spots all throughout Perl's source code and I did not feel up for that.

So, that was a long intro. I am going to ask you to tuck that away for a bit while I digress a little.

A couple of weeks ago, brian and I were discussing a hidden gotcha with perl6. Currently, perl6 on Windows is a batch file and on *nix systems it is a shell script. Which means invoking it via system or opening a pipe ends up involving a shell no matter what you do ... That is not the end of the world, but it is problematic in certain contexts.

These shell scripts and batch files are just wrappers around moar invocations. The thought occurred to me that one could just templatize a simple OS-specific C file to wrap the invocation of moar. Then, Configure.pl would fill in the various paths, and, bingo, system perl6 => @args no longer needs to involve the shell. Of course, the Windows version of this idea is more fiddly because you have to take the command line arguments passed to the wrapper and flatten them correctly to a string containing both the arguments to moar and the arguments passed to the wrapper because CreateProcess expects command line arguments in a string.
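To illustrate the fiddly part, here is a minimal sketch of the quoting logic such a wrapper needs when flattening arguments into the single string CreateProcess expects. The helper name is mine; the escaping rules follow the way the Microsoft C runtime later re-parses a command line (quote arguments containing whitespace or quotes; double backslashes only when they precede a quote):

```c
#include <string.h>

/* Append one argument to the flattened command line, quoted so the
 * Microsoft C runtime will re-parse it as the original string.
 * dst is assumed to be large enough. Illustrative sketch only. */
static void append_quoted_arg(char *dst, const char *arg)
{
    if (*arg && !strpbrk(arg, " \t\"")) {
        strcat(dst, arg);               /* nothing special: copy as-is */
        return;
    }
    strcat(dst, "\"");
    for (const char *p = arg; ; ++p) {
        size_t nbs = 0;                 /* count a run of backslashes */
        while (*p == '\\') { ++nbs; ++p; }
        if (*p == '\0') {
            /* backslashes before the closing quote must be doubled */
            for (size_t i = 0; i < 2 * nbs; ++i) strcat(dst, "\\");
            break;
        }
        if (*p == '"') {
            /* double the backslashes, then escape the quote itself */
            for (size_t i = 0; i < 2 * nbs + 1; ++i) strcat(dst, "\\");
            strcat(dst, "\"");
        } else {
            for (size_t i = 0; i < nbs; ++i) strcat(dst, "\\");
            strncat(dst, p, 1);
        }
    }
    strcat(dst, "\"");
}
```

For example, the argument c"d comes out as "c\"d", and an empty argument comes out as "" — exactly the kind of detail that makes "just flatten the arguments" less trivial than it sounds.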

While writing the code for flattening the arguments (and making sure everything is correctly quoted and escaped), another thought popped up: perl has for a long time allowed one to specify that command line arguments are UTF-8 encoded. Except, on Windows, it doesn't work well because by the time perl's main sees the arguments, they have already been mapped to whatever ANSI code page by Windows.

What if my wrapper used wmain so it received the command line arguments in UTF-16, but used CreateProcessA to invoke perl with the -CA argument along with any additional arguments specified on the command line? (As far as I can tell, I can't use a similar flag with perl6 or moar.)

If I did that, I could encode the path to perl using the ANSI code page and append the arguments to the wrapper to the plain char array holding the command line after encoding them in UTF-8. I wrote a simple proof of concept. Lo and behold, it works on my simple setup:

C:\> p5run -Mutf8 -CS  -E "say 'kârlı iş'"
kârlı iş

except ...

C:\> p5run -CS yağmur
Can't open perl script "yağmur": No such file or directory

That's what we economists call a Pareto improvement: the situation is made better in some contexts and no worse in others. Not perfect, but a movement in the right direction.

At this point, I remembered that Perl 6 is designed from the ground up around Unicode and the wrapper may have more success there. So, I cobbled together something and I was met with disappointment:

C:\> p6run -e "say 'kârlı iş'"
kârlı iş

Ouch! Clearly, something somewhere was re-encoding things.

I must admit, I am still not comfortable with exactly how all the layers involved in executing Perl 6 code fit together, so I went searching in GitHub repositories. During the process, I filed a confused bug report because I got fooled by GitHub's syntax highlighting inside a POD section, but that serendipitously led to timo pointing me in the right direction.

The deed indeed happens in MoarVM/src/io/procops.c:

        MVMROOT(tc, clargs, {
            const MVMuint16 acp = GetACP();
            const MVMint64 num_clargs = instance->num_clargs;
            MVMint64 count;

            MVMString *prog_string = MVM_string_utf8_c8_decode(tc,
                instance->VMString,
                instance->prog_name, strlen(instance->prog_name));
            MVMObject *boxed_str = MVM_repr_box_str(tc,
                instance->boot_types.BOOTStr, prog_string);
            MVM_repr_push_o(tc, clargs, boxed_str);

            for (count = 0; count < num_clargs; count++) {
                char *raw_clarg = instance->raw_clargs[count];
                char * const _tmp = ANSIToUTF8(acp, raw_clarg); /* <-- here, line 1243  */
                MVMString *string = MVM_string_utf8_c8_decode(tc,
                    instance->VMString, _tmp, strlen(_tmp));
                MVM_free(_tmp);
                boxed_str = MVM_repr_box_str(tc,
                    instance->boot_types.BOOTStr, string);
                MVM_repr_push_o(tc, clargs, boxed_str);
            }
        });

So when my wrapper encodes the command line arguments in UTF-8 and passes them to moar, they go through the blender … and out come some minced guts or some such. To verify my intuition, I deleted lines 1243 and 1246 and rebuilt MoarVM. This time, my wrapper gave the correct output.

That meant I just had to make sure command line arguments got encoded in UTF-8 at the earliest opportunity. I added the following function to procops.c:

MVM_PUBLIC char **
UnicodeToUTF8_argv(const int argc, const wchar_t **wargv)
{
    int i;
    char **argv = MVM_malloc((argc + 1) * sizeof(*argv));
    for (i = 0; i < argc; ++i)
    {
        argv[i] = UnicodeToUTF8(wargv[i]);
    }
    argv[i] = NULL;
    return argv;
}

and modified MoarVM/main.c to use wmain on Windows:

#ifndef _WIN32
int main(int argc, char *argv[])
#else

char ** UnicodeToUTF8_argv(const int argc, const wchar_t **wargv);

int wmain(int argc, wchar_t *wargv[])

#endif
{
    MVMInstance *instance;
    const char  *input_file;
    const char  *executable_name = NULL;
    const char  *lib_path[8];

#ifdef _WIN32
    char **argv = UnicodeToUTF8_argv(argc, (const wchar_t **)wargv);
#endif

and rebuilt MoarVM (note that creating the UTF-8 encoded argv array involves allocating memory which needs to be freed at some point, but, at this point, I am just exploring).

And, here we go:

C:\> perl6 -e "say 'kârlı iş'"
kârlı iş

and

C:\> type yağmur
say "it's raining!";


C:\> perl6 yağmur
it's raining!

I haven't had time to run the test suites yet. In addition, MVM_proc_getenvhash also needs to be fixed in a similar manner:

C:\> set iş=kârlı

C:\> @echo %iş%
kârlı

C:\> perl6 -e "say %*ENV<iş>"
(Any)

C:\> perl6 -e "say %*ENV<is>"
kârli

That's why I haven't put together a pull request yet.

The discovery process itself was interesting enough for me to want to share it. I'll take care of the pull request as soon as I can. If someone decides to go ahead and patch MoarVM with these changes or improve upon them, I am OK with that, too. In that case I would really appreciate an acknowledgement. I think I deserved one in response to my discovery of erroneous EOL handling, among others.

I am not sure if the fix to perl will be so straightforward.

PS: For reference, examples using other interpreters:

C:\> ruby -e "print 'kârlı iş'"
kârlı iş

C:\> python3.6.exe -c "print('kârlı iş')"
kârlı iş

C:\> python2.7.exe -c "print 'kârlı iş'"
k�rli is

PPS: I still think wrapping moar using a proper C program is the way to go and I am working on a nice templatable wrapper on Windows which I'll make available soon.

PPPS: You can discuss this post on r/perl.

PPPPS: Here is the pull request.