[nem-en] Language support and BOM

Alexey Borzenkov snaury at gmail.com
Tue Jan 30 20:36:56 CET 2007


Oh, btw, I just found a better way to create a proper UTF-8 encoding:

def file = StreamReader(filename, UTF8Encoding(true,true));

It's important to set both parameters to true. With the first one set, the
encoding has a proper preamble, which StreamReader checks first, so invalid
sequences are caught even when the file has a BOM. If you want to ignore
invalid sequences when the file has a BOM (but still catch them when the
file has no BOM), just use UTF8Encoding(false,true).
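
A minimal sketch of how this might be wired up, assuming the strict decoder
reports bad bytes as DecoderFallbackException (that is what I'd expect on
MS.NET 2.0; untested on Mono). The file name "1.txt" is only a placeholder,
and the messages printed are made up for illustration:

```nemerle
using System;
using System.IO;
using System.Text;

// Placeholder file name, used only for illustration.
def filename = "1.txt";

// First argument: the encoding exposes a UTF-8 preamble (BOM).
// Second argument: invalid byte sequences raise an exception.
def reader = StreamReader(filename, UTF8Encoding(true, true));

try
{
  def text = reader.ReadToEnd();
  // CurrentEncoding is only settled after at least one read.
  Console.WriteLine("decoded " + text.Length.ToString() + " chars as "
                    + reader.CurrentEncoding.WebName)
}
catch
{
  // Assumption: the strict decoder signals bad input with this exception type.
  | e is DecoderFallbackException =>
    Console.WriteLine("not valid UTF-8: " + e.Message)
}
finally
{
  reader.Close()
}
```

Swapping in UTF8Encoding(false,true) should give the more lenient behaviour
for files that carry a BOM, as described above.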

On 1/30/07, Alexey Borzenkov <snaury at gmail.com> wrote:
>
> Hi Kamil,
>
> What's the point in detecting BOM manually? Implementations of
> IO.StreamReader *must* do that on their own, and more importantly they
> can evolve to detect future BOM cases that don't exist today, something
> you can't foresee in your own implementation (or you'd have to keep
> changing it). Check this:
>
> using System;
> using System.IO;
> using System.Text;
>
> def file = StreamReader("1.txt", Encoding.GetEncoding(
> Encoding.UTF8.CodePage, EncoderExceptionFallback(),
> DecoderExceptionFallback()));
> def text = file.ReadToEnd();
> Console.WriteLine(file.CurrentEncoding.HeaderName);
>
> It detects BOMs very well (at least on MS.NET; I haven't checked with Mono,
> but if it doesn't work there it's a Mono bug), and throws exceptions when
> a BOM-less file has non-UTF-8 sequences (and it seems to throw on invalid
> sequences even when the file actually has a BOM).
>
> Hint (just in case): StreamReader.CurrentEncoding is *not* detected until
> you actually read at least one character from the file. :)
>
> On 1/30/07, Kamil Skalski <kamil.skalski at gmail.com> wrote:
> >
> > Ok, I need volunteers using various codepages to test the attached BOM
> > parsing / encoding enforcement program.
> >
> > You must save it as t.n, edit the comment at the beginning to use your
> > country's native characters, compile it, and run the executable.
> >
> > We need the following confirmations:
> > - the program fails when you process t.n saved in a non-UTF codepage
> > - the program runs fine with the same file saved as UTF-8 (with and
> > without BOM)
> > - you can also test other UTFs, but they must always have an explicit BOM
> >
> > UTF32 Big Endian is unfortunately not supported by mono, so we will
> > skip it at the moment.
> >
> >
> > 07-01-30, Michal Moskal <michal.moskal at gmail.com> wrote:
> > > On 1/30/07, vc <vc at rsdn.ru> wrote:
> > > > > On Behalf Of Michal Moskal
> > > >
> > > >
> > > > > > But, in my opinion, UTF-16 also should be supported.
> > > > >
> > > > > But with BOM, then UTF-32 should also be fine.
> > > >
> > > > Well...
> > > >
> > > > But the old problem remains: we can't warn the user if they use a
> > > > "non-UTF file".
> > >
> > > How come? We would just reject non-utf8 file with no BOM.
> > >
> > > --
> > >    Michał
> > >
> > > _______________________________________________
> > > https://nemerle.org/mailman/listinfo/devel-en
> > >
> > >
> > >
> >
> >
> > --
> > Kamil Skalski
> > http://nazgul.omega.pl
> >
>