Newline translation in Perl6 is broken

I started investigating some of the test failures I saw after my attempt to build Perl6 on Windows.

First, I looked at t\spec\S02-literals\heredocs.t mainly because it was at the top of the list.

Here is the output:

t\spec\S02-literals\heredocs.t .. 
1..22
ok 1 - q:to// is singular
not ok 2 - here doc interpolated

# Failed test 'here doc interpolated'
# at t\spec\S02-literals\heredocs.t line 21
# expected: "blah\r\nBAR\r\nblah\r\nFOO\r\n"
#      got: "blah\nBAR\nblah\nFOO\n"
not ok 3 - here doc interpolating with indentation

# Failed test 'here doc interpolating with indentation'
# at t\spec\S02-literals\heredocs.t line 34
# expected: "blah\r\nBAR\r\nblah\r\nFOO\r\n"
#      got: "blah\nBAR\nblah\nFOO\n"
ok 4 - q:to// is singular, also when indented
not ok 5 - indentation stripped

# Failed test 'indentation stripped'
# at t\spec\S02-literals\heredocs.t line 47
# expected: "blah blah\r\n\$foo\r\n"
#      got: "blah blah\n\$foo\n"
ok 6 - q:heredoc// is singular
not ok 7 - backslashes

# Failed test 'backslashes'
# at t\spec\S02-literals\heredocs.t line 57
# expected: "yoink\\n\r\nsplort\\n\r\n"
#      got: "yoink\\n\nsplort\\n\n"
not ok 8 - indent with multiline interpolation

# Failed test 'indent with multiline interpolation'
# at t\spec\S02-literals\heredocs.t line 70
# expected: "first line\r\nHello\r\nWorld\r\nanother line\r\n"
#      got: "first line\nHello\r\nWorld\nanother line\n"
not ok 9 - indent with multiline interpolation with spaces at the beginning

# Failed test 'indent with multiline interpolation with spaces at the beginning'
# at t\spec\S02-literals\heredocs.t line 81
# expected: "first line\r\nHello\r\n    World\r\nanother line\r\n"
#      got: "first line\nHello\r\n    World\nanother line\n"
not ok 10 - extra spaces after interpolation will be kept

# Failed test 'extra spaces after interpolation will be kept'
# at t\spec\S02-literals\heredocs.t line 90
# expected: "first line\r\nHello\r\n    World        something\r\nanother line\r\n"
#      got: "first line\nHello\r\n    World        something\nanother line\n"
not ok 11 - interpolations without constant strings in the middle

# Failed test 'interpolations without constant strings in the middle'
# at t\spec\S02-literals\heredocs.t line 100
# expected: "foobar\r\nstuff\r\n"
#      got: "foobar\nstuff\n"
not ok 12 - interpolations at the very end

# Failed test 'interpolations at the very end'
# at t\spec\S02-literals\heredocs.t line 107
# expected: "stuff\r\nfoobar\r\n"
#      got: "stuff\nfoobar\n"
not ok 13 - empty lines

# Failed test 'empty lines'
# at t\spec\S02-literals\heredocs.t line 117
# expected: "line one\r\n\r\nline two\r\n\r\nfoo\r\n"
#      got: "line one\n\nline two\n\nfoo\n"
not ok 14 - Tabs get correctly removed

# Failed test 'Tabs get correctly removed'
# at t\spec\S02-literals\heredocs.t line 126
# expected: "stuff\r\nstuff\r\n"
#      got: "stuff\nstuff\n"
not ok 15 - mixed tabs and spaces get correctly removed

# Failed test 'mixed tabs and spaces get correctly removed'
# at t\spec\S02-literals\heredocs.t line 133
# expected: "stuff\r\nbarfoo\r\n"
#      got: "stuff\nbarfoo\n"
not ok 16 - mixing tabs and spaces even more evil-ly

# Failed test 'mixing tabs and spaces even more evil-ly'
# at t\spec\S02-literals\heredocs.t line 140
# expected: "line one\r\nline two\r\n"
#      got: "line one\nline two\n"
not ok 17 - Constant heredocs work

# Failed test 'Constant heredocs work'
# at t\spec\S02-literals\heredocs.t line 150
# expected: "Hello world\r\n:)\r\n"
#      got: "Hello world\n:)\n"
not ok 18 - Heredoc leading and trailing empty lines

# Failed test 'Heredoc leading and trailing empty lines'
# at t\spec\S02-literals\heredocs.t line 162
# expected: "\r\n\r\nsomething\r\n\r\n\r\n"
#      got: "\n\nsomething\n\n\n"
ok 19 - Completely empty heredoc
not ok 20 - Heredoc one empty line

# Failed test 'Heredoc one empty line'
# at t\spec\S02-literals\heredocs.t line 171
# expected: "\r\n"
#      got: "\n"
not ok 21 - Heredoc two empty lines

# Failed test 'Heredoc two empty lines'
# at t\spec\S02-literals\heredocs.t line 176
# expected: "\r\n\r\n"
#      got: "\n\n"
not ok 22 - Heredoc tab explosion makefile use case is usesul.

# Failed test 'Heredoc tab explosion makefile use case is usesul.'
# at t\spec\S02-literals\heredocs.t line 185
# Looks like you failed 18 tests of 22
Dubious, test returned 18 (wstat 4608, 0x1200)
Failed 18/22 subtests 

Test Summary Report
-------------------
t\spec\S02-literals\heredocs.t (Wstat: 4608 Tests: 22 Failed: 18)
  Failed tests:  2-3, 5, 7-18, 20-22
  Non-zero exit status: 18
Files=1, Tests=22,  2 wallclock secs ( 0.06 usr +  0.00 sys =  0.06 CPU)
Result: FAIL

Look at test 20: expected: “\r\n” and got: “\n”. That is weird!

Here is the relevant test code:

    my $e = q:to<END>;

END
    is no-r($e), "\n", 'Heredoc one empty line';

It is almost like the literal string "\n" is being translated to "\r\n".

Let’s convert the test file to Windows line endings with C:\...\> unix2dos t\spec\S02-literals\heredocs.t, and run the test again:

t\spec\S02-literals\heredocs.t ..
1..22
ok 1 - q:to// is singular
ok 2 - here doc interpolated
ok 3 - here doc interpolating with indentation
ok 4 - q:to// is singular, also when indented
ok 5 - indentation stripped
ok 6 - q:heredoc// is singular
ok 7 - backslashes
ok 8 - indent with multiline interpolation
ok 9 - indent with multiline interpolation with spaces at the beginning
ok 10 - extra spaces after interpolation will be kept
ok 11 - interpolations without constant strings in the middle
ok 12 - interpolations at the very end
ok 13 - empty lines
ok 14 - Tabs get correctly removed
ok 15 - mixed tabs and spaces get correctly removed
ok 16 - mixing tabs and spaces even more evil-ly
ok 17 - Constant heredocs work
ok 18 - Heredoc leading and trailing empty lines
ok 19 - Completely empty heredoc
ok 20 - Heredoc one empty line
ok 21 - Heredoc two empty lines
ok 22 - Heredoc tab explosion makefile use case is usesul.
ok
All tests successful.
Files=1, Tests=22,  1 wallclock secs ( 0.06 usr +  0.02 sys =  0.08 CPU)
Result: PASS

Well, that’s just completely and utterly wrong!

Before I dive into why it is wrong, let me explain what happened:

Unix and Windows have different EOL conventions. On Unix, a single 0x0a (LF) marks the end of a line, whereas Windows has customarily used the pair 0x0d 0x0a (CR LF). In the olden days, Macs used a lone 0x0d (CR). If you are curious, do read the Wikipedia article.

Now, mind you, with the notorious exception of Notepad, almost all Windows programs deal perfectly well with either Unix or Windows newline representations.

The way this works is straightforward: when a file is read in text mode, 0x0d 0x0a sequences are translated to the canonical representation (i.e. a single 0x0a, as on Unix), and, when a file is written in text mode, newlines are translated to 0x0d 0x0a pairs. If a text file being read already uses Unix style EOLs, its contents pass through unchanged. This is what perl does with script files when compiled with -DPERL_TEXTMODE_SCRIPTS so, for example:

#!/usr/bin/env perl

use strict;
use warnings;

use Test::More;

my $x = <<END;

END

my $y = "\n";

is($x, $y, "'$x' eq '$y'");

done_testing;

works correctly on Windows regardless of whether the script was saved with Windows line endings or Unix ones.
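The translation described above can be tried on any platform by asking for perl's :crlf layer explicitly. What follows is only a sketch: the helper name read_text_mode and the sample strings are made up for illustration, and the explicit layer stands in for what a Windows perl applies to text-mode handles by default.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: read every line of an in-memory "file" through
# the :crlf layer, so that CR LF pairs become a single "\n" on input.
sub read_text_mode {
    my ($bytes) = @_;
    open my $fh, '<:crlf', \$bytes
        or die "Cannot open in-memory file: $!";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# The same two-line content, once with Windows EOLs and once with Unix EOLs.
my @from_windows = read_text_mode("this is a line\r\n\r\n");
my @from_unix    = read_text_mode("this is a line\n\n");

# Inside the program, both inputs should look the same, and the first
# line should compare equal to an ordinary string literal.
print "same lines\n"    if "@from_windows" eq "@from_unix";
print "literal match\n" if $from_windows[0] eq "this is a line\n";
```

If the layer behaves as documented, both messages are printed. The point is that the literal "\n" in the program never changes; only the bytes crossing the input boundary do.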

A long long time ago, I decided to stick with Unix line endings on all my computers. My archive utilities don’t do automatic EOL conversion, git and hg are set up not to touch line endings, my editors are set up to work for Unix EOLs only.

So, the test files were unpacked with Unix style line-endings.

Now, from the test failures, it looks like the Perl6 interpreter/compiler thingie is not doing the newline translation on input-output. Instead, it looks like it is altering literal strings embedded in the program. Otherwise, the line endings in the script file would not affect the outcome of the tests.

This is not just useless, it is harmful. It does away with the convention of translating EOLs at input-output boundaries, and transforms internal strings based on the platform on which the program is running.

Consider the following input file:

C:\...\Temp> xxd ttt
00000000: 7468 6973 2069 7320 6120 6c69 6e65 0a0a  this is a line..

and the following Perl script:

open my $fh, '<', 'ttt' or die "Cannot open 'ttt': $!";
my $line = <$fh>;
print "no\n" if $line ne "this is a line\n";

and the following Perl6 script:

use v6;
my $fh = open 'ttt', chomp => False;
my $line = $fh.get;
say 'no' if $line ne "this is a line\n";

Running the Perl script gives no output because conversion is done at input-output boundaries, not to internal strings in a running program.

With the Perl6 script, we get:

C:\...\Temp> perl6 t6.pl
no

because "this is a line\n" is transformed to "this is a line\r\n".

If we convert the input file to use Windows EOLs:

C:\...\Temp> unix2dos ttt
unix2dos: converting file ttt to DOS format...

C:\...\Temp> xxd ttt
00000000: 7468 6973 2069 7320 6120 6c69 6e65 0d0a  this is a line..
00000010: 0d0a                                     ..

C:\...\Temp> perl6 t6.pl

it “works.”

I don’t want to sound like I am ranting here. I know people are doing incredible work on Perl6, but right now it looks like I have a Ferrari with square wheels.

Here is what the docs say:

sub open

my $fh = open(IO::Path() $path, :$r, :$w, :$a, :$rw,
    :$bin, :$enc, :$nl, :$chomp)

Opens the $path (by default in text mode) with the given options, returning an IO::Handle object.

Text mode means all EOL sequences are translated on input so that, in your program, EOL is always a single "\n", everywhere.

I seriously hope this behavior is not by design, and that I did something wrong when building Rakudo Star.

Here is the list of test files with failures. I decided to try all of them after running unix2dos on them. Any reduction in failures would signal instances of problematic behavior.

  • t\spec\S02-literals\heredocs.t: Fixed by converting EOLs to 0x0d 0x0a.
  • t\spec\S02-literals\quoting.t: Only #126 and #129 still fail after converting EOLs.
  • t\spec\S16-filehandles\io_in_while_loops.t: All tests pass after converting EOLs.
  • t\spec\S17-supply\lines.t: Hot mess.
  • t\spec\S19-command-line-options\02-dash-n.t: All tests pass after EOL conversion.
  • t\spec\S19-command-line\dash-e.t: See above. EOL conversion does not make a difference.
  • t\spec\S19-command-line\repl.t: Still fails.
  • t\spec\S22-package-format\local.t: Fails due to wrong directory separators. Probably has something to do with RT #126876.
  • t\spec\S26-documentation\04-code.t: One more test fails after EOL conversion.
  • t\spec\S26-documentation\05-comment.t: All tests pass after EOL conversion.
  • t\spec\S32-io\IO-Socket-Async.rakudo.moar: Still gets stuck.
  • t\spec\S32-io\io-spec-win.t: Same tests fail.
  • t\spec\S32-io\move.t: Failures don’t change.
  • t\spec\S32-io\pipe.t: No change.
  • t\spec\S32-io\rename.t: No change.
  • t\spec\integration\advent2009-day21.t: All tests pass after EOL conversion.
  • t\spec\integration\advent2012-day06.t: All tests pass after EOL conversion.
  • t\spec\integration\advent2012-day23.t: All tests pass after EOL conversion.
  • t\spec\integration\advent2012-day24.t: Test passes after EOL conversion.
  • t\spec\integration\advent2013-day04.t: All tests pass after EOL conversion.
  • t\spec\integration\advent2014-day16.t: All tests pass after EOL conversion.
  • t\spec\rosettacode\greatest_element_of_a_list.t: Test passes after EOL conversion.
  • t\spec\rosettacode\sierpinski_triangle.t: Test passes after EOL conversion.

This really isn’t how any of this is supposed to work. Take, for example, t\spec\rosettacode\greatest_element_of_a_list.t:

# http://rosettacode.org/wiki/Greatest_element_of_a_list#Perl_6

use v6;
use Test;

plan 1;

my $rosetta-code = {

#### RC-begin
say [max] 17, 13, 50, 56, 28, 63, 62, 66, 74, 54;

say [max] 'my', 'dog', 'has', 'fleas';

sub max2 (*@a) { reduce -> $x, $y { $y after $x ?? $y !! $x }, @a }
say max2 17, 13, 50, 56, 28, 63, 62, 66, 74, 54;
#### RC-end

}

my $output;
{
    temp $*OUT = class {
        method print(*@args) {
            $output ~= @args.join;
        }
    }

    $rosetta-code.();
}

my $expected = "74
my
74
";

is($output.subst("\r\n", "\n", :g), $expected.subst("\r\n", "\n", :g), "Greatest element of a list");

The commit message says:

The completion of the NFG work making \r\n a single grapheme, per the Unicode spec, means that the approaches these tests took to work on Windows became bogus. Note that we’ll probably have to take a pass through these all again in the near future, when \n in a string will become \r\n on Windows by default (hopefully clearing up a lot of this cruft).

which I sincerely do not understand very well. I am afraid, though, that it does seem to suggest that converting \n characters in strings to \r\n is by design.

In every other programming language I have used, EOL conversion has always been something that happens at input/output boundaries. If that were true for Perl6, why would I have to run unix2dos on t\spec\rosettacode\greatest_element_of_a_list.t for the test to pass?

Let’s reduce that test to the following Perl code:

#!/usr/bin/env perl

use 5.022; # I'm lazy
use strict;
use warnings;
use Test::More;

my $output;

{
    local *STDOUT;
    open STDOUT, '>>', \$output
        or die "Cannot redirect STDOUT to string: $!";
    say for 74, 'my', 74;
}

my $expected = "74
my
74
";

# The s/// is needed because we did not binmode STDOUT
# in the block above.
is($output =~ s/\r\n/\n/gr, $expected =~ s/\r\n/\n/gr);

done_testing;

Note that the file was saved with Unix line endings.

C:\..\Temp> perl tt.pl
ok 1
1..1

It works. Now, let’s convert it to use Windows EOL and try:

C:\...\Temp> unix2dos tt.pl
unix2dos: converting file tt.pl to DOS format...

C:\...\Temp> perl tt.pl
ok 1
1..1

But, the Perl6 code only works if the source file uses Windows style line endings.

The only explanation I have is that when perl6 encounters $output.subst("\r\n", "\n", :g) on Windows, it turns it into $output.subst("\r\n", "\r\n", :g), an expensive NOP. If the file itself uses Unix line endings, that leaves the expected string as 74\nmy\n74\n whereas $output still contains 74\r\nmy\r\n74\r\n (by virtue of the fact that the filehandle used to write to it was opened in text mode). The substitution does not canonicalize the line endings.
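To make that reasoning concrete, here is a small Perl simulation of the hypothesis. It does not reproduce what perl6 actually does internally, which is an assumption on my part. It only spells out what happens to the two strings if the replacement "\n" is silently treated as "\r\n". The variable names are for illustration only.

```perl
#!/usr/bin/env perl

use 5.014; # for s///r and say
use strict;
use warnings;

my $output   = "74\r\nmy\r\n74\r\n"; # as written through a text-mode handle
my $expected = "74\nmy\n74\n";       # literal from a source file with Unix EOLs

# What the test intends: canonicalize both sides to "\n".
my $intended = ($output =~ s/\r\n/\n/gr) eq ($expected =~ s/\r\n/\n/gr);

# Simulation of the hypothesis (an assumption, not confirmed behavior):
# the replacement "\n" has become "\r\n", so the substitution changes nothing.
my $hypothesized = ($output =~ s/\r\n/\r\n/gr) eq ($expected =~ s/\r\n/\r\n/gr);

say 'intended:     ', $intended     ? 'strings match' : 'strings differ';
say 'hypothesized: ', $hypothesized ? 'strings match' : 'strings differ';
```

Under the intended semantics the strings match. Under the hypothesized semantics they differ, which is consistent with the test failing when the file uses Unix line endings.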

Incidentally, we would not have needed the s/// in the Perl code above, if we had written the test like:

#!/usr/bin/env perl

use 5.022; # I'm lazy
use strict;
use warnings;
use Test::More;

my $buffer;

{
    local *STDOUT;
    open STDOUT, '>>', \$buffer
        or die "Cannot redirect STDOUT to string: $!";
    say for 74, 'my', 74;
}

my $output = do {
    local $/;
    open my $fh, '<', \$buffer
        or die "Cannot open buffer for reading: $!";
    <$fh>;
};

my $expected = "74
my
74
";

is($output, $expected);

done_testing;

because the source code would be read in text mode, thereby letting perl take care of the EOL conversion. The strings written to $buffer would use "\r\n" thanks to EOL conversion at input-output boundaries, and the strings read from it, once again thanks to EOL conversion at input-output boundaries, would have any "\r\n" sequences converted to "\n".
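The round trip through the buffer can also be sketched so that it does not depend on the platform's default layers. The explicit :crlf layers below are my addition for illustration. They stand in for the text-mode defaults the paragraph above relies on, so this is an approximation of the idea rather than the exact test.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

my $buffer = '';

# Output boundary: each "\n" printed should be stored as a CR LF pair.
{
    open my $out, '>:crlf', \$buffer
        or die "Cannot open buffer for writing: $!";
    print {$out} "$_\n" for 74, 'my', 74;
    close $out
        or die "Cannot close buffer: $!";
}

# Input boundary: CR LF pairs should be turned back into "\n".
my $output = do {
    local $/; # slurp the whole buffer in one read
    open my $in, '<:crlf', \$buffer
        or die "Cannot open buffer for reading: $!";
    <$in>;
};

my $expected = "74\nmy\n74\n";

print "buffer holds CR LF pairs\n" if $buffer eq "74\r\nmy\r\n74\r\n";
print "round trip matches\n"       if $output eq $expected;
```

If the layers behave as documented, both messages are printed: the stored bytes use the Windows convention, while the strings on either side of the boundary only ever contain "\n".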

This is so elementary, I hope I am missing something, but, I am afraid I am not.

It looks like there is no Perl6 in my future because I can foresee how frustrating this can become.

PS: You can comment on this post on /r/perl6.

PPS: RT #126881. Of course, I forgot to put [BUG] in the subject line. Apologies.