I have the following piece of code:

using (Stream inputFileStream = File.OpenRead("C:\\Users\\User\\Downloads\\test.txt"))
{
    using (Stream transcodingStream = Encoding.CreateTranscodingStream(inputFileStream, Encoding.GetEncoding(500), new UnicodeEncoding(bigEndian: true, byteOrderMark: true)))
    {
        using (Stream outputStream = File.OpenWrite("C:\\Users\\User\\Downloads\\test.txt"))
        {
            await transcodingStream.CopyToAsync(outputStream, cancellationToken);
        }
    }
}

Before transcoding, my file is EBCDIC-encoded (code page 500) and begins with the following 16 bytes:

F5 F1 F1 F0 F2 C2 D4 E6 40 40 40 40 40 40 F1 F1 = 51102BMW 11

After transcoding to big-endian Unicode with a byte order mark, I expect the file to begin with:

FF FE

However, I get:

00 35 00 31 00 31 00 30 00 32 00 42 00 4D 00 57 = �5�1�1�0�2�B�M�W

Where am I going wrong with this?

  • It looks like all that's wrong here is that it's missing the byte order mark. To be honest, the documentation is unclear about when the byte order mark will be emitted... you could always just write that yourself. Is there anything else in the data that's not as expected? Commented Jan 23, 2023 at 15:38
  • I see. What's also bothering me is that each character is preceded by a 00 byte, @Jon Skeet. My understanding is that Unicode is sort of a superset of EBCDIC, so regardless of the input file content, the output should be more or less the same as Unicode. Commented Jan 23, 2023 at 15:41
  • No, that's entirely incorrect. Unicode is a superset of the set of characters supported by EBCDIC, but how those characters are encoded can differ significantly between Unicode encodings. For example, UTF-8 is one-byte-per-character for ASCII, but varies in length for non-ASCII characters. UTF-16 (which is what UnicodeEncoding uses) has 2 bytes for every character in the Basic Multilingual Plane (BMP) - it's behaving as you asked it to. What are your actual requirements here? (I'd normally advise UTF-8 as the encoding to choose where possible.) Commented Jan 23, 2023 at 15:48
  • The basic requirement here is to support a fixed set of encodings between which files can be transcoded; the most notable ones are EBCDIC, ASCII, UTF-8 and UTF-16. However, for the UTF encodings I need to specify byte endianness and whether to include byte order marks or not. The above example is only illustrative and the application itself is a bit more convoluted logically. Based on everything you've just said, I need to write some custom logic to prepend (or append if little-endian) the FF and FE bytes. Does this sound about right, @Jon Skeet? Also, where can I learn more about this? Commented Jan 23, 2023 at 15:54
  • "However, for UTF protocols I need to specify byte endianness and whether to include byte order marks or not" You don't need to specify endianness for UTF-8. And no, you shouldn't always add FF and FE - you should write U+FEFF in the chosen encoding. In UTF-8 for example, the byte order mark is three bytes: EF BB BF. I suggest you read the Wikipedia article about byte order marks: en.wikipedia.org/wiki/Byte_order_mark Commented Jan 23, 2023 at 15:59
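The two points made in the comments above (the byte order mark is U+FEFF serialized in the chosen encoding, and the same character takes a different number of bytes in each encoding) can be checked with a minimal sketch like the one below. It assumes .NET 5 or later (for Convert.ToHexString and top-level statements); the variable names are illustrative and not from the original post.

```csharp
using System;
using System.Text;

// Encodings constructed so that they report a preamble (BOM).
var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
var utf16BigEndian = new UnicodeEncoding(bigEndian: true, byteOrderMark: true);
var utf16LittleEndian = new UnicodeEncoding(bigEndian: false, byteOrderMark: true);

// GetPreamble() returns U+FEFF as that encoding serializes it.
string utf8Bom = Convert.ToHexString(utf8.GetPreamble());
string utf16BigEndianBom = Convert.ToHexString(utf16BigEndian.GetPreamble());
string utf16LittleEndianBom = Convert.ToHexString(utf16LittleEndian.GetPreamble());

Console.WriteLine(utf8Bom);              // EFBBBF
Console.WriteLine(utf16BigEndianBom);    // FEFF
Console.WriteLine(utf16LittleEndianBom); // FFFE

// The same character "5" is one byte in UTF-8 and two bytes in UTF-16.
string fiveInUtf8 = Convert.ToHexString(utf8.GetBytes("5"));
string fiveInUtf16BigEndian = Convert.ToHexString(utf16BigEndian.GetBytes("5"));

Console.WriteLine(fiveInUtf8);           // 35
Console.WriteLine(fiveInUtf16BigEndian); // 0035
```

Note that the big-endian UTF-16 mark is FE FF, not FF FE, and that the 00 bytes in the question's output are simply the high bytes of each UTF-16 code unit.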

1 Answer

It seems the transcoding stream does not emit the BOM for the target encoding, so it's something you have to write yourself.

I've implemented the following solution, where targetEncoding is of type Encoding:

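// Write the target encoding's preamble (BOM) at the very start of the output.
// Do this before copying the transcoded data; writing at position 0 afterwards would overwrite its first bytes.
outputStream.Seek(0, SeekOrigin.Begin);
await outputStream.WriteAsync(targetEncoding.Preamble.ToArray(), 0, targetEncoding.Preamble.Length, cancellationToken);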
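For context, here is a minimal end-to-end sketch of the same idea using in-memory streams and the first eight bytes from the question. It assumes .NET 5 or later; TranscodeWithBomAsync is a hypothetical helper name, not part of the original post or of any library, and the provider registration is needed because code page 500 is not available by default on .NET Core and later.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Make code page encodings such as IBM500 (EBCDIC) available.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: writes the target encoding's preamble first,
// then streams the transcoded bytes after it.
static async Task TranscodeWithBomAsync(
    Stream input,
    Stream output,
    Encoding sourceEncoding,
    Encoding targetEncoding,
    CancellationToken cancellationToken = default)
{
    // Empty for encodings constructed without a BOM, so nothing is written for those.
    byte[] preamble = targetEncoding.GetPreamble();
    await output.WriteAsync(preamble, cancellationToken);

    // Reading from the transcoding stream yields the input re-encoded as targetEncoding.
    using Stream transcodingStream = Encoding.CreateTranscodingStream(
        input, sourceEncoding, targetEncoding, leaveOpen: true);
    await transcodingStream.CopyToAsync(output, cancellationToken);
}

// "51102BMW" in EBCDIC code page 500, taken from the question.
byte[] ebcdicBytes = { 0xF5, 0xF1, 0xF1, 0xF0, 0xF2, 0xC2, 0xD4, 0xE6 };
using var source = new MemoryStream(ebcdicBytes);
using var destination = new MemoryStream();

await TranscodeWithBomAsync(
    source,
    destination,
    Encoding.GetEncoding(500),
    new UnicodeEncoding(bigEndian: true, byteOrderMark: true));

// Expected to begin with FEFF, the big-endian UTF-16 byte order mark.
string resultHex = Convert.ToHexString(destination.ToArray());
Console.WriteLine(resultHex);
```

Writing the preamble before the copy avoids any need to seek, so the sketch also works with non-seekable output streams.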