[nem-en] Language support and BOM

Alexey Borzenkov snaury at gmail.com
Tue Jan 30 20:40:57 CET 2007


Correction: what I said applies only to invalid sequences in UTF-8. For
other codepages StreamReader creates the encodings on its own, and of course
those don't have an exception fallback. On the other hand, how likely is it
that someone creates a UTF-16 or UTF-32 file with invalid sequences? I guess
that would be a very minor problem.
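
A minimal sketch for checking the strict UTF-8 behaviour without preparing
files on disk. It is not part of the test program discussed in this thread.
It assumes a runtime where UTF8Encoding(true, true) reports invalid bytes with
an ArgumentException or a subclass of it, and the in-memory byte array is made
up purely for illustration:

```nemerle
using System;
using System.IO;
using System.Text;

// first 'true': the encoding exposes the UTF-8 preamble (BOM)
// second 'true': invalid byte sequences raise an exception
def strict = UTF8Encoding(true, true);

// illustrative input: "a" followed by 0xFF, a byte that never
// occurs in well-formed UTF-8; there is no BOM in this stream
def bytes = array[(0x61 :> byte), (0xFF :> byte)];

try
{
  def reader = StreamReader(MemoryStream(bytes), strict);
  Console.WriteLine(reader.ReadToEnd());
}
catch
{
  // assumption: the decoder exception derives from ArgumentException
  | e is ArgumentException =>
    Console.WriteLine("invalid UTF-8: " + e.Message);
}
```

With a throwing encoding the read should end up in the catch branch. Replacing
the second byte with plain ASCII should let ReadToEnd() succeed.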

On 1/30/07, Alexey Borzenkov <snaury at gmail.com> wrote:
>
> Oh, btw, I just found a better way to create a proper UTF-8 encoding:
>
> def file = StreamReader(filename, UTF8Encoding(true,true));
>
> It's important to set both parameters to true: the first one ensures that
> the encoding has a proper preamble, which StreamReader checks for first, so
> invalid sequences are caught even when the file has a BOM. If you want to
> ignore invalid sequences when the file has a BOM (but still catch them when
> it has no BOM), just use UTF8Encoding(false, true).
>
> On 1/30/07, Alexey Borzenkov <snaury at gmail.com> wrote:
> >
> > Hi Kamil,
> >
> > What's the point of detecting the BOM manually? Implementations of
> > IO.StreamReader *must* do that on their own, and more importantly they
> > can evolve to detect future BOM variants that don't exist today,
> > something you can't foresee in your own implementation (or you'd have to
> > keep changing it). Check this:
> >
> > using System;
> > using System.IO;
> > using System.Text;
> >
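> > // UTF-8 with exception fallbacks, so invalid sequences throw
> > def file = StreamReader("1.txt", Encoding.GetEncoding(
> >     Encoding.UTF8.CodePage, EncoderExceptionFallback(),
> >     DecoderExceptionFallback()));
> > def text = file.ReadToEnd();
> > // CurrentEncoding is only meaningful after the first read
> > Console.WriteLine(file.CurrentEncoding.HeaderName);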
> >
> > It detects BOMs very well (at least on MS.NET; I haven't checked with
> > Mono, but if it doesn't work there, that's a Mono bug), and it throws
> > exceptions when a BOM-less file contains non-UTF-8 sequences (and it seems
> > to throw on invalid sequences even when the file actually has a BOM).
> >
> > Hint (just in case): StreamReader.CurrentEncoding is *not* detected
> > until you actually read at least one character from the file. :)
> >
> > On 1/30/07, Kamil Skalski <kamil.skalski at gmail.com> wrote:
> > >
> > > Ok, I need volunteers using various codepages to test the attached BOM
> > > parsing / encoding enforcement program.
> > >
> > > You must save it as t.n, edit the comment at the beginning to contain
> > > your country's native characters, compile it, and run the executable.
> > >
> > > We need the following confirmations:
> > > - the program fails when you process t.n saved in a non-UTF codepage
> > > - the program runs fine with the same file saved as UTF-8 (with and
> > > without BOM)
> > > - you can also test other UTFs, but they must always have an explicit BOM
> > >
> > >
> > > UTF-32 Big Endian is unfortunately not supported by Mono, so we will
> > > skip it for the moment.
> > >
> > >
> > > On 07-01-30, Michal Moskal <michal.moskal at gmail.com> wrote:
> > > > On 1/30/07, vc <vc at rsdn.ru> wrote:
> > > > > > On Behalf Of Michal Moskal
> > > > >
> > > > >
> > > > > > > > But, in my opinion, UTF-16 should also be supported.
> > > > > >
> > > > > > But with BOM, then UTF-32 should also be fine.
> > > > >
> > > > > Well...
> > > > >
> > > > > > But the old problem remains: we can't warn the user if they use a
> > > > > > "non-UTF file".
> > > >
> > > > How come? We would just reject non-utf8 file with no BOM.
> > > >
> > > > --
> > > >    Michał
> > > >
> > > > _______________________________________________
> > > > https://nemerle.org/mailman/listinfo/devel-en
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Kamil Skalski
> > > http://nazgul.omega.pl
> > >
> > > _______________________________________________
> > > https://nemerle.org/mailman/listinfo/devel-en
> > >
> > >
> > >
> > >
> >
>