UTF-8 everywhere and command line argument expansion on Windows

UTF-8 Everywhere is a good idea.

In particular, see their advice on how to do text on Windows. It is possible to follow their advice manually.

This morning, I thought of a utility I could write very easily in any scripting language, but decided I would implement it in modernish C++. In writing the utility, I thought I should take advantage of the standalone version of boost::nowide so as to minimize the amount of code I’d need to write to make sure it could handle command line arguments including fancy characters in both Windows and *nixy environments.

One of the facilities this library provides is nowide::args. It “temporarily replaces standard main() function arguments with their equal, but UTF-8 encoded values under Microsoft Windows for the lifetime of the instance.

“The class uses GetCommandLineW(), CommandLineToArgvW() and GetEnvironmentStringsW() in order to obtain Unicode-encoded values. It does not relate to actual values of argc, argv and env under Windows.”

This is not wrong per se, but it interacts badly with another dimension of handling command line arguments on Windows: cmd.exe does not do glob expansion. Instead, if you want prog *.txt to give you file1.txt, file2.txt, etc. in argv, you need to explicitly link with setargv.obj or wsetargv.obj. That way, the runtime sets up an expanded argv, encoded either in the system’s ANSI code page or in “Unicode” (UTF-16), depending on whether the program has a main or a wmain.

Since boost::nowide::args bypasses the actual argv and instead reparses the “Unicode” version of the command line as originally given, it is oblivious to the now expanded arguments. Since there is no Win32 API function you can call to do the filename expansion on the result of CommandLineToArgvW() (at least, I could not find one), this means the Windows version of my utility will need to have a wmain instead of main.

I wrote about fixing this in MoarVM a few years ago and submitted a PR. When I first read about boost::nowide::args, I thought it was going to help me avoid the need to engage in various contortions. Unfortunately, it seems that if you do want file name expansion in command line arguments, you cannot use boost::nowide::args (or its standalone equivalent).

It sure is not rocket surgery, but disappointing nevertheless.
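
To give a rough idea of what doing the expansion by hand would involve, here is a minimal sketch in portable C++17. The names wildcard_match and expand_in_dir are mine (hypothetical), not part of nowide or the Win32 API. The sketch only understands * and ? in the file name part of an argument, compares case-sensitively (unlike Windows file systems), and makes no attempt to reproduce exactly what setargv.obj does.

```cpp
#include <algorithm>
#include <cstddef>
#include <filesystem>
#include <string_view>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns true if `name` matches `pattern`, where '*'
// matches any run of characters (including none) and '?' matches exactly one.
template <typename CharT>
bool wildcard_match(std::basic_string_view<CharT> pattern,
                    std::basic_string_view<CharT> name)
{
    constexpr std::size_t npos = std::basic_string_view<CharT>::npos;
    std::size_t p = 0, n = 0;
    std::size_t star = npos; // index of the most recent '*' in pattern
    std::size_t mark = 0;    // index in name where that '*' started matching

    while (n < name.size()) {
        if (p < pattern.size() && pattern[p] == CharT('*')) {
            star = p++;
            mark = n;
        } else if (p < pattern.size()
                   && (pattern[p] == CharT('?') || pattern[p] == name[n])) {
            ++p;
            ++n;
        } else if (star != npos) {
            // Backtrack: let the last '*' swallow one more character.
            p = star + 1;
            n = ++mark;
        } else {
            return false;
        }
    }
    // Only trailing '*' characters may remain in the pattern.
    while (p < pattern.size() && pattern[p] == CharT('*')) ++p;
    return p == pattern.size();
}

// Hypothetical helper: expands `pattern` against the entries of `dir`.
// If nothing matches, the pattern itself is returned unchanged
// (a design choice of this sketch).
std::vector<fs::path> expand_in_dir(const fs::path& dir, const fs::path& pattern)
{
    using char_type = fs::path::value_type; // wchar_t on Windows, char elsewhere
    std::vector<fs::path> matches;
    for (const auto& entry : fs::directory_iterator(dir)) {
        const fs::path name = entry.path().filename();
        if (wildcard_match<char_type>(pattern.native(), name.native())) {
            matches.push_back(name);
        }
    }
    if (matches.empty()) {
        matches.push_back(pattern);
    }
    std::sort(matches.begin(), matches.end());
    return matches;
}
```

Because the matcher works on the native character type of std::filesystem::path, on Windows it compares UTF-16 names directly, so no code page conversion is involved in the matching itself.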

I am going to include a few examples to illustrate the problems I mentioned here.

No filename expansion in cmd

Consider the following C program:

#include <stdio.h>

int main(int argc, char *argv[])
{
    for (int i = 1; i < argc; ++i)
    {
        puts(argv[i]);
    }
    return 0;
}

Compile it using:

C:\Temp> cl t.c

and now run it in cmd:

C:\Temp> t t.*
t.*

Now, open a Cygwin or Git Bash shell and try again without re-compiling:

$ ./t t.*
t.c
t.c.swp
t.exe
t.obj

Now, let’s recompile:

C:\Temp> cl t.c /link setargv.obj

and try again in cmd.exe:

C:\Temp> t t.*
t.c
t.c.swp
t.exe
t.obj

Can’t handle “funny” characters

In cmd:

C:\Temp> dir /b k*
kârlı.txt

C:\Temp> t k*
kΓrli.txt

The argument reaches main encoded in the system code page, and the console then displays those bytes using its own OEM code page. Presumably those are Windows-1252 and CP437 here: â is byte 0xE2 in the former, which is Γ in the latter, and ı, which Windows-1252 cannot represent, has been replaced with a plain i.

No file name expansion with nowide::args

Let’s try this minimal program:

#include <nowide/args.hpp>
#include <nowide/iostream.hpp>

int
main(int argc, char* argv[])
{
    nowide::args a(argc, argv);
    nowide::cout << "With 'nowide::args'\n";

    for (int i = 1; i < argc; ++i) {
        nowide::cout << argv[i] << '\n';
    }

    return 0;
}

Compile using:

cl /EHsc /DUNICODE /D_UNICODE /MD /Ic:\...\opt\include t.cpp /link setargv.obj c:\...\opt\lib\nowide.lib Shell32.lib

In cmd:

C:\Temp> t k*
With 'nowide::args'
k*

In bash:

$ ./t k*
With 'nowide::args'
kârlı.txt

Let’s make a simple modification by deleting the instantiation of the nowide::args object:

#include <nowide/args.hpp>
#include <nowide/iostream.hpp>

int
main(int argc, char* argv[])
{
    nowide::cout << "Without 'nowide::args'\n";

    for (int i = 1; i < argc; ++i) {
        nowide::cout << argv[i] << '\n';
    }

    return 0;
}

Compile using the same command line and run in cmd:

C:\Temp> t k*
Without 'nowide::args'
k�li.txt
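
My guess at why the replacement character shows up: without nowide::args, argv holds bytes in the system code page, whereas nowide::cout expects UTF-8. The sketch below checks whether a byte string is well-formed UTF-8. The helper name is_valid_utf8 is hypothetical, and the byte values in the usage note assume the system code page is Windows-1252.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8. Rejects
// truncated sequences, overlong forms, surrogates, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        unsigned long cp = 0;   // code point being assembled
        unsigned long min = 0;  // smallest code point allowed for this length

        if (lead < 0x80) {
            ++i;                // plain ASCII
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return false;       // stray continuation byte or invalid lead byte
        }

        if (s.size() - i <= extra) {
            return false;       // sequence is cut off at the end of the string
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;   // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += extra + 1;
    }
    return true;
}
```

With this, is_valid_utf8("k\xE2rli.txt") is false, because 0xE2 announces a three-byte sequence but is followed by a plain r, while the UTF-8 spelling "k\xC3\xA2rl\xC4\xB1.txt" passes.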

So, why do we want to use nowide::args anyway? Simple:

C:\Temp> t kârlı.txt
Without 'nowide::args'
k�li.txt

whereas:

C:\Temp> t kârlı.txt
With 'nowide::args'
kârlı.txt

Conclusion

I want the utility I am writing to both handle filenames containing characters outside the system code page and have the benefit of file name expansion in command line arguments. Therefore, I can’t take advantage of nowide::args. Instead, I will need to ensure the entry point for the Windows version is wmain, link with wsetargv.obj, and handle the UTF-8 encoding of argv myself.
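
A minimal sketch of what that could look like, under my own assumptions: utf16_to_utf8 and run are hypothetical names, the conversion is written out by hand so that it does not depend on any particular library, and unpaired surrogates are replaced with U+FFFD (a choice of this sketch). The wmain part only applies on Windows, where wchar_t holds UTF-16 code units.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: converts UTF-16 to UTF-8.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical stand-in for the real program logic: receives UTF-8 arguments.
int run(const std::vector<std::string>& args)
{
    for (std::size_t i = 1; i < args.size(); ++i) {
        std::puts(args[i].c_str());
    }
    return 0;
}

#ifdef _WIN32
#include <cwchar>

// Windows entry point. When linked with wsetargv.obj, argv arrives here
// already expanded and encoded as UTF-16.
int wmain(int argc, wchar_t* argv[])
{
    std::vector<std::string> args;
    for (int i = 0; i < argc; ++i) {
        const std::u16string wide(argv[i], argv[i] + std::wcslen(argv[i]));
        args.push_back(utf16_to_utf8(wide));
    }
    return run(args);
}
#endif
```

On other platforms, an ordinary main would simply hand argv to run unchanged, since the shell has already expanded the arguments. Getting the UTF-8 strings to display correctly on the Windows console is a separate matter that this sketch does not address.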