UTF-8 output from Perl and C programs in cmd.exe on Windows 8

A. Sinan Unur

This all happened because I decided to put together a cute post involving objects representing functions for perltricks.com. One thing led to another, and I found myself completely incapable of understanding what was going on. So, let's start at the end:

#include <stdio.h>

int main(void) {
    /* UTF-8 encoded alpha, beta, gamma */
    char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
    puts(x);
    return 0;
}

Let's see what happens when we compile and run that program in cmd.exe on my system:

When I switch to the UTF-8 code page (65001), I get:

If this were all there was to it, there would be no need for a blog post.

Let's see some Perl:

use utf8;
use strict;
use warnings;
use warnings qw(FATAL utf8);

binmode STDOUT, ':utf8';

print 'αβγ', "\n";

And, here is the output:

Copying and pasting the output in the cmd.exe window, we have:

C:\Users\sinan\src\poly> pttt.pl
αβγ
�

xxd does not help at all:

C:\Users\sinan\src\poly> pttt.pl | xxd
0000000: ceb1 ceb2 ceb3 0d0a                      ........

C:\Users\sinan\src\poly> cttt.exe | xxd
0000000: ceb1 ceb2 ceb3 0d0a                      ........

So, the Perl program seems to output the exact same byte sequence as the C program, but when I run the Perl program, I get an extra line with a mystery character.
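For reference, the six octets ce b1 ce b2 ce b3 in both dumps are what the two-octet UTF-8 encoding rule yields for U+03B1, U+03B2 and U+03B3. The following is a minimal sketch of that rule only; the function names utf8_encode2 and print_greek_octets are made up for illustration and do not come from either program above.

```c
#include <stddef.h>
#include <stdio.h>

/* Encode a code point in the range U+0080..U+07FF as two UTF-8 octets.
   Returns 2 on success, 0 if the code point is outside that range.
   (Hypothetical helper, for illustration only.) */
size_t utf8_encode2(unsigned int cp, unsigned char out[2]) {
    if (cp < 0x80 || cp > 0x7FF) {
        return 0;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* 110xxxxx: lead octet */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* 10xxxxxx: continuation octet */
    return 2;
}

/* Print the octets for alpha, beta and gamma, one code point per line. */
void print_greek_octets(void) {
    const unsigned int greek[] = { 0x3B1, 0x3B2, 0x3B3 };
    for (size_t i = 0; i < sizeof greek / sizeof greek[0]; i++) {
        unsigned char buf[2];
        if (utf8_encode2(greek[i], buf) == 2) {
            printf("U+%04X -> %02x %02x\n", greek[i], buf[0], buf[1]);
        }
    }
}
```

Calling print_greek_octets() from main should list the same pairs that xxd reports, with b1, b2 and b3 each being a continuation octet that only makes sense after its ce lead octet.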

OK, let's reduce the Perl program to just output the octets like the C program does:

C:\Users\sinan\src\poly> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
αβγ
�

Copying and pasting that text and examining the octets in a hex editor gives me U+FFFD, which reveals no information. However, changing the print statement in the script to:

print 'αβγ1', "\n";

gives me the following output:

C:\Users\sinan\src\poly> pttt.pl
αβγ1
1

C:\Users\sinan\src\poly> pttt.pl | xxd
0000000: ceb1 ceb2 ceb3 310d 0a                   ......1..

When directed to a pipe, we don't get the extra line with the extra digit. However, when directed to the cmd.exe window where the code page is set to 65001, there is an extra line with the digit one.

This leads me to believe that somehow the last octet gets repeated on a separate line when output is not redirected. Given that the octet 0xb3 on its own is not a valid encoding of any character (it is a UTF-8 continuation octet with no lead octet before it), somewhere along the way, it gets replaced with U+FFFD.
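One mechanism that would produce exactly this pattern is sketched below. It is an unverified assumption, not something established in this post: suppose the console write under code page 65001 accepts every octet but reports the number of characters it displayed, while the caller treats that number as an octet count and retries the "unwritten" tail. Everything here (Sink, fake_console_write, write_all) is a hypothetical simulation that runs anywhere and calls no Windows API.

```c
#include <stddef.h>
#include <string.h>

/* Fixed-size buffer standing in for what ends up on the screen. */
typedef struct {
    unsigned char data[256];
    size_t len;
} Sink;

static void sink_put(Sink *s, unsigned char b) {
    if (s->len < sizeof s->data) {
        s->data[s->len++] = b;
    }
}

/* Number of continuation octets that should follow a given octet. */
static size_t continuation_count(unsigned char b) {
    if ((b & 0xE0) == 0xC0) return 1;
    if ((b & 0xF0) == 0xE0) return 2;
    if ((b & 0xF8) == 0xF0) return 3;
    return 0; /* ASCII, stray continuation octet, or invalid lead */
}

/* ASSUMED behaviour: all n octets are emitted, but the return value is a
   character count. A stray continuation octet counts as one character,
   the way a decoder would show it as U+FFFD. */
size_t fake_console_write(const unsigned char *buf, size_t n, Sink *s) {
    size_t chars = 0, pending = 0;
    for (size_t i = 0; i < n; i++) {
        sink_put(s, buf[i]);
        if (pending > 0 && (buf[i] & 0xC0) == 0x80) {
            pending--;
        } else {
            chars++;
            pending = continuation_count(buf[i]);
        }
    }
    return chars;
}

/* Ordinary "write everything" loop that trusts the return value as an
   octet count and retries whatever appears to be left over. */
void write_all(const unsigned char *buf, size_t n, Sink *s) {
    size_t done = 0;
    while (done < n) {
        done += fake_console_write(buf + done, n - done, s);
    }
}
```

Under these assumptions, feeding the nine octets of "αβγ1\r\n" to write_all leaves the full string in the sink followed by a second "1\r\n", and the eight octets of "αβγ\r\n" leave the full string followed by b3 0d 0a. Both match what the console shows, and a pipe would not show it if the pipe reports octet counts correctly. That agreement makes the hypothesis worth testing, but it does not confirm it.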

I have tried this on Windows 8.1 Pro (64-bit) and Windows Vista Home (32-bit), both with a self-compiled perl 5.18.2 and ActiveState's 5.16.3. The problem is not seen in mintty with Cygwin's perl 5.14.4. Nor do I see it when I run Cygwin's perl from a cmd.exe window set to code page 65001:

C:\Users\sinan\src\poly> c:\opt\cygwin64\bin\perl.exe pttt.pl
αβγ

What should I try next?

Update

For some reason, I had forgotten about ConEmu. I installed the 64-bit version, and everything works, presumably because it is capturing output from the Perl script:

My question on Stack Overflow.