[nem-en] Language support and BOM

Alexey Borzenkov snaury at gmail.com
Tue Jan 30 20:16:19 CET 2007


Hi Kamil,

What's the point of detecting the BOM manually? Implementations of
IO.StreamReader *must* do that on their own, and, more importantly, they
can evolve to detect future BOM variants that don't exist today, something
you can't foresee in your own implementation (or you would have to keep
changing it). Check this:

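using System;
using System.IO;
using System.Text;

// Strict UTF-8: the exception fallbacks make invalid byte sequences throw.
def file = StreamReader("1.txt",
  Encoding.GetEncoding(Encoding.UTF8.CodePage,
                       EncoderExceptionFallback(), DecoderExceptionFallback()));
def text = file.ReadToEnd();
// CurrentEncoding is only meaningful after the read above.
Console.WriteLine(file.CurrentEncoding.HeaderName);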

It detects BOMs very well (at least on MS.NET; I haven't checked with Mono,
but if it doesn't work there, that's a Mono bug), and it throws an exception
when a BOM-less file contains non-UTF-8 sequences (it also seems to throw on
invalid sequences even when the file actually has a BOM).

Hint (just in case): StreamReader.CurrentEncoding is *not* detected until
you actually read at least one character from the file. :)
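
To illustrate the hint, here is a minimal sketch. It assumes "1.txt" is the
same sample file as in the snippet above and that it starts with a BOM; the
variable name "reader" is only illustrative. I print WebName here, but any
property of the encoding would do.

```nemerle
using System;
using System.IO;
using System.Text;

// Third argument asks StreamReader to look for a BOM (assumed present in 1.txt).
def reader = StreamReader("1.txt", Encoding.UTF8, true);

// Nothing has been read yet, so this is still the encoding passed in above.
Console.WriteLine(reader.CurrentEncoding.WebName);

// Reading a single character is enough to trigger the detection.
_ = reader.Read();

// Now CurrentEncoding reflects whatever the BOM indicated.
Console.WriteLine(reader.CurrentEncoding.WebName);
reader.Close();
```

If the file has, say, a UTF-16 BOM, the two lines printed should differ; with
a UTF-8 BOM or no BOM at all they should be the same.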

On 1/30/07, Kamil Skalski <kamil.skalski at gmail.com> wrote:
>
> Ok, I need volunteers using various codepages to test attached bom
> parsing / encoding enforcement program.
>
> You must save it as t.n, edit the comment at the beginning to your
> country's native characters, compile, run executable.
>
> We need the following confirmations:
> - program fails when you process t.n saved in non-utf codepage
> - program runs fine with the same file saved as utf-8 (with and without
> bom)
> - you can also test other utfs, but they must always have explicit BOM
>
> UTF32 Big Endian is unfortunately not supported by mono, so we will
> skip it at the moment.
>
>
> 07-01-30, Michal Moskal <michal.moskal at gmail.com> wrote:
> > On 1/30/07, vc <vc at rsdn.ru> wrote:
> > > > On Behalf Of Michal Moskal
> > >
> > >
> > > > > But, in my opinion, UTF-16 also should be supported.
> > > >
> > > > But with BOM, then UTF-32 should also be fine.
> > >
> > > Well...
> > >
> > > But the old problem remains: we can't warn the user if they use a
> > > "non utf file".
> >
> > How come? We would just reject non-utf8 file with no BOM.
> >
> > --
> >    Michał
> >
> > _______________________________________________
> > https://nemerle.org/mailman/listinfo/devel-en
> >
> >
> >
>
>
> --
> Kamil Skalski
> http://nazgul.omega.pl
>
> _______________________________________________
> https://nemerle.org/mailman/listinfo/devel-en
>
>
>
>