[nem-en] Source files encoding auto detection and coding tag

Kamil Skalski kamil.skalski at gmail.com
Sat Jul 15 10:22:33 CEST 2006


Well, from the beginning we chose to allow only utf-8 files, as it is
the best format for handling multi-language text. Any other
approach encourages people to keep using old encodings, which causes
problems. Your patch adds a lot of complexity to the lexing process,
and I'm also a little bit concerned about performance.
Maybe forcing people to use utf-8 is not the friendliest thing we could
do... but IMHO this is the best we can do to spread this standard, and
it has the growing benefit of interoperability between operating
systems. What if you saved the sources on Windows using cp1251 (or
something similar) and then tried to read them in an editor on Linux? I
doubt you could easily find one that shows the national characters
correctly.

Maybe we should just add a command-line switch to specify the encoding?
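
To make the switch idea concrete, here is a rough sketch in Python
(not Nemerle, and not taken from the compiler). The flag name
--encoding, the program name, and both function names are made up for
illustration; the default stays utf-8.

```python
import argparse
import codecs


def parse_args(argv):
    """Parse a hypothetical compiler command line with an encoding switch."""
    parser = argparse.ArgumentParser(prog="compiler-sketch")
    # "--encoding" is an invented flag name; utf-8 remains the default.
    parser.add_argument("--encoding", default="utf-8")
    parser.add_argument("sources", nargs="+")
    args = parser.parse_args(argv)
    try:
        # Normalize aliases and reject unknown names before reading any file.
        args.encoding = codecs.lookup(args.encoding).name
    except LookupError:
        parser.error("unknown encoding: " + args.encoding)
    return args


def read_source(path, encoding):
    """Read one source file using the encoding chosen on the command line."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()
```

With this shape the lexer never has to guess: the encoding is fixed
once per invocation, and a misspelled name fails before compilation
starts.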

BTW, I use XEmacs/Emacs and it handles utf-8, though I had to play
with it a little bit (installing xemacs-mule, specifying some special
settings, etc.)

On 7/14/06, Snaury <snaury at gmail.com> wrote:
> Hi everyone,
>
> Currently, when Nemerle source files don't have a BOM, the compiler
> always opens them as utf-8. Although utf-8 is widespread these days,
> this might not always be a good thing: I have (or rather had, since
> yesterday I decided to try emacs once more and was surprised to find
> I could configure it to my needs and liking) a favourite editor that
> doesn't support any form of unicode when editing files, and sometimes
> I want string literals with national symbols right inside the code.
> I previously wrote a similar patch for Boo (although Rodrigo still
> hasn't applied it and I'm no longer sure he ever will); now I have
> ported it to Nemerle:
>
> http://snaury.googlepages.com/nemerle-encoding-detection.patch
>
> Here's what I'd like to propose; I'd like to hear your comments if
> you don't like it.
>
> Instead of always falling back to utf-8, it would be good to analyze
> a source file when it doesn't have a BOM and check whether all of its
> characters are valid utf-8, and if not, fall back to the default
> system encoding (on my system it is cp1251, for instance): this is
> what the FileStreamAutoEncoding module does. In addition, it would be
> nice if the file encoding could also be specified right inside the
> source file itself, for example in the first two lines, so that
> sources like this become possible:
>
> // -*- coding: windows-1251 -*-
> System.Console.WriteLine("проверка вывода текста");
>
> (this actually mimics Python's PEP-<i forgot that number again>, which
> does the same thing). That would force the file to compile in the same
> encoding on all systems, not just in the same country as the person who
> wrote the code. Also, if coding: <encoding> is wrapped in -*-, it is
> detected by emacs and the file opens in the correct encoding, but
> that's already described in the PEP. :)
>
> However, there is currently one possible problem. For example, the
> following code:
>
> <begin file>
> def r = @"
> // coding: nonsense
> ";
> System.Console.WriteLine(r);
> <end file>
>
> would compile properly before my patch but gives an invalid encoding
> name error after it. If a complete solution is needed, we could just
> reduce the number of lines searched for the coding tag to one (for
> python and boo two lines are the minimum, since both can have
> #!<whatever> as the first line; I often had #!/usr/bin/env booi in my
> scripts, for example. For nemerle there's no such necessity).
> Alternatively, we could just ignore such an extremely rare case.
>
> P.S.
> If you decide that coding tags are a bad idea, then at least checking
> whether the file is valid utf-8 is imho a good idea; that's what
> msc# does, at least (if you have national text in your source file it
> will be parsed in the national encoding), and then you can change
> NemerleSourceAutoEncoding.OpenText to FileStreamAutoEncoding.OpenText.
> But I still think coding tags are good. :)
>
> P.P.S.
> I checked: it compiles and the automatic tests pass, and I think it
> shouldn't break any existing code (except maybe code that is utf-8
> but has an encoding error in at least one character, but anyone who
> absolutely needs it that way can just specify coding: utf-8 and
> forget about it...).

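For reference, the fallback described in the quoted message (BOM first,
then strict utf-8, then the system default) can be sketched roughly as
below. This is Python, not the patch itself; decode_source and its
fallback parameter are invented names, only the utf-8 BOM is handled,
and taking the "default system encoding" from the locale is an
assumption of the sketch.

```python
import codecs
import locale


def decode_source(data, fallback=None):
    """Decode raw source bytes: BOM first, then strict utf-8, then fallback.

    Returns a (text, encoding_name) pair. `fallback` stands in for the
    default system encoding; when omitted it is read from the locale.
    """
    if data.startswith(codecs.BOM_UTF8):
        # A utf-8 BOM settles the question; strip it before decoding.
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    try:
        # Strict decoding raises on the first byte sequence that is not utf-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        if fallback is None:
            fallback = locale.getpreferredencoding(False)
        return data.decode(fallback), fallback
```

Note that the whole file has to be validated before any of it is
lexed, which is where the extra pass over the input comes from.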

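The coding-tag lookup, and the string-literal case from the quoted
message, can be illustrated the same way. Again a Python sketch only:
the regular expression, the requirement that the tag sits on a //
comment line, and the names CODING_TAG and find_coding_tag are
assumptions, not the patch's actual rules.

```python
import re

# Assumed tag shape: a line starting with "//" that contains "coding: <name>".
CODING_TAG = re.compile(rb"^\s*//.*?coding[:=]\s*([-\w.]+)")


def find_coding_tag(data, max_lines=2):
    """Return the encoding named in the first `max_lines` lines, or None."""
    for line in data.splitlines()[:max_lines]:
        match = CODING_TAG.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


# The example from the quoted message: line 2 looks like a tag but is
# really the inside of a string literal.
tricky = b'def r = @"\n// coding: nonsense\n";\nSystem.Console.WriteLine(r);\n'
two_line_result = find_coding_tag(tricky)               # picks up the fake tag
one_line_result = find_coding_tag(tricky, max_lines=1)  # finds no tag
```

Scanning two lines reports the bogus name from inside the literal,
while scanning only the first line does not, which is the trade-off
the quoted message describes.
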
-- 
Kamil Skalski
http://nazgul.omega.pl

