[nem-en] Source files encoding auto detection and coding tag

Snaury snaury at gmail.com
Fri Jul 14 13:11:39 CEST 2006


Hi everyone,

Currently, when a Nemerle source file doesn't have a BOM, the compiler
always opens it as utf-8. Although utf-8 is widespread these days, this
might not always be a good thing: my favourite editor (or rather my
former favourite, since yesterday I decided to try emacs once more and,
surprisingly, found that I could configure it to my needs and liking)
doesn't support any form of unicode when editing files, and sometimes I
want to have string literals with national symbols right inside the
code. I previously wrote a similar patch for Boo (although Rodrigo still
hasn't applied it and I'm no longer sure he ever will), and now I have
ported it to Nemerle:

http://snaury.googlepages.com/nemerle-encoding-detection.patch

Here's what I'd like to propose; please send your comments, especially if you don't like it.

Instead of always falling back to utf-8, it would be good to analyze a
source file when it doesn't have a BOM and check whether all of its
byte sequences are valid utf-8, and if not, fall back to the default
system encoding (on my system it is cp1251, for instance): this is what
the FileStreamAutoEncoding module does. In addition, it would be nice if
the file encoding could also be specified right inside the source file
itself, for example in the first two lines, so that sources like this
would be possible:

// -*- coding: windows-1251 -*-
System.Console.WriteLine("проверка вывода текста");

(it actually mimics Python's PEP-<i forgot that number again> for
doing the same thing), which would force the file to compile in the same
encoding on all systems, not just in the same country as the one who
wrote the code. Also, if coding: <encoding> is wrapped in -*-, it is
detected by emacs and the file opens in the correct encoding, but that's
already described in the PEP. :)
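
The detection order described above can be sketched roughly as follows. This is an illustration in Python, not the code from the patch: the function name detect_encoding, the tag pattern, checking the tag before validating utf-8, and handling only the utf-8 BOM are all assumptions made for the example.

```python
import codecs
import re

# Hypothetical tag pattern, modelled on the "coding: <encoding>" form above.
CODING_TAG = re.compile(rb"coding:\s*([-\w.]+)")


def detect_encoding(data: bytes, system_default: str, tag_lines: int = 2) -> str:
    """Pick an encoding name for raw source bytes (illustrative only)."""
    # 1. A BOM always wins (only the utf-8 BOM is handled in this sketch).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # 2. Look for a coding tag in the first few lines.
    for line in data.splitlines()[:tag_lines]:
        match = CODING_TAG.search(line)
        if match:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)  # raises LookupError for an unknown encoding name
            return name
    # 3. No tag: accept utf-8 only if the whole file decodes cleanly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 4. Otherwise fall back to the default system encoding.
        return system_default
```

For example, with cp1251 passed as the system default, a file saved in cp1251 without a tag is detected as cp1251, the same text saved as utf-8 is detected as utf-8, and a file starting with the coding: windows-1251 line is read as windows-1251 regardless of the system default.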

However, there is currently a possible problem. For example, the following code:

<begin file>
def r = @"
// coding: nonsense
";
System.Console.WriteLine(r);
<end file>

would compile properly before my patch and will give an invalid
encoding name error after my patch. If a complete solution is needed,
we could reduce the number of lines searched for the coding tag to just
one (for Python and Boo two lines are the minimum, since both can have
#!<whatever> as the first line; I often had #!/usr/bin/env booi in my
scripts, for example; for Nemerle there's no such necessity).
Alternatively, we could just ignore such an extremely rare case.
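
To make the false positive concrete, here is a small Python sketch of a tag search limited to the first N lines. The pattern and the find_coding_tag helper are hypothetical and only approximate what the patch does; the point is that the second line of the file above looks like a tag even though it sits inside a string literal.

```python
import re

# Hypothetical tag pattern, modelled on the "coding: <encoding>" form.
CODING_TAG = re.compile(r"coding:\s*([-\w.]+)")


def find_coding_tag(source: str, max_lines: int):
    """Return the encoding named in the first max_lines lines, or None."""
    for line in source.splitlines()[:max_lines]:
        match = CODING_TAG.search(line)
        if match:
            return match.group(1)
    return None


# The problematic file: the "tag" is really part of a string literal.
source = 'def r = @"\n// coding: nonsense\n";\nSystem.Console.WriteLine(r);\n'

print(find_coding_tag(source, max_lines=2))  # nonsense (false positive)
print(find_coding_tag(source, max_lines=1))  # None
```

Searching only the first line avoids this particular case, at the cost of not allowing anything else before the tag.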

P.S.
If you decide that coding tags are a bad idea, then at least checking
whether the file is valid utf-8 is imho a good idea. That's what msc#
does, at least (if you have national text in your source file, it will
be parsed in the national encoding). In that case you can change
NemerleSourceAutoEncoding.OpenText to FileStreamAutoEncoding.OpenText.
But I still think coding tags are good. :)

P.P.S.
I checked: it compiles, the automatic tests pass, and I think it
shouldn't break any existing code (except maybe some code that is utf-8
but has an encoding error in at least one character; if someone
absolutely needs it that way, they can just specify coding: utf-8 and
forget about it...).

