Reading two bytes from array of bytes

Question

I am creating a simple application in Java that reads a text file. I have a byte array wrapped in a ByteBuffer:

 FileInputStream inputStream = new FileInputStream(name);
 FileChannel channel = inputStream.getChannel();
 byte[] bArray = new byte[8192];
 ByteBuffer byteBuffer = ByteBuffer.wrap(bArray);
 int read;

and then I use a while loop to go through the text file:

while ((read = channel.read(byteBuffer)) != -1)
{
    for (int i = 0; i < read; i++)
    {
        // my code
    }
    byteBuffer.clear();
}

My question is how to read a Unicode character in this case. As I understand it, a Unicode character consists of 2 bytes (16 bits), so I suppose that bArray[i] holds the first (higher) 8 bits and bArray[i+1] holds the remaining 8 bits. For instance, if I need to find out whether the character "#" is at indices i and i + 1, can I do it like this? ("#" in binary representation: 0010 0011)

if (bArray[i] == (byte)10 && bArray[i+1] == (byte) 11)

Thanks for your responses.

What exactly are you trying to do? Why do you want to read a text file at such a low level? Do you even know the encoding of the file you're reading?
– Diego Basch, Commented Dec 11, 2012 at 20:19
If "#" is 0010 0011, shouldn't you only be checking if bArray[i] == 0x0 and bArray[i+1] == 0x23? Unicode is two bytes, and since "#" is part of the standard set of ASCII characters, it does not have any bits set in the higher byte, so its representation is 0000 0000 0010 0011
– jonhopkins, Commented Dec 11, 2012 at 20:23
@jonhopkins Actually, since Java doesn't have a binary representation it really should be 0x0 and 0x23 respectively
– Jeff, Commented Dec 11, 2012 at 20:24
@Jeff fair enough. I was just going off the provided example. I haven't worked with bytes in Java before
– jonhopkins, Commented Dec 11, 2012 at 20:25
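
To make the point in these comments concrete, here is a minimal sketch of the byte-level check. It assumes the file is encoded as big-endian UTF-16 without a byte order mark, which the question does not confirm. The class and method names (HashCheck, isHashUtf16BE) are made up for illustration. Note that (byte)10 in the question is the decimal value ten, not the bit pattern 0010; Java 7 and later also accept binary literals such as 0b0010_0011.

```java
class HashCheck {
    // Returns true if data[i], data[i+1] hold '#' (U+0023) as UTF-16BE.
    // Assumption: the data is big-endian UTF-16 with no byte order mark.
    static boolean isHashUtf16BE(byte[] data, int i) {
        return i + 1 < data.length && data[i] == 0x00 && data[i + 1] == 0x23;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x23, 0x00, 0x41}; // "#A" in UTF-16BE
        System.out.println(isHashUtf16BE(data, 0)); // true
        System.out.println(isHashUtf16BE(data, 2)); // false

        // (byte) 10 is decimal ten, i.e. the bit pattern 0000 1010.
        System.out.println((byte) 10 == 0b0000_1010); // true
        // The bit pattern 0010 0011 is 0x23, the code for '#'.
        System.out.println(0b0010_0011 == 0x23); // true
    }
}
```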

Matt Ball · Accepted Answer · 2012-12-11 20:21:03Z

6

The simple answer is that you should not treat textual data as a stream of bytes. Specifically that means: don't use ByteBuffer.

Use an InputStreamReader, which knows how to interpret sequences of bytes using a given encoding.
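
As a rough sketch of what that could look like: the example below decodes a stream with an explicitly chosen charset and counts occurrences of '#'. The class name CountHashes, the method countHashes, and the choice of UTF-16BE for the sample data are assumptions made for illustration; the actual encoding of the file has to be known in advance.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CountHashes {
    // Counts '#' characters, letting InputStreamReader turn bytes into chars.
    static int countHashes(InputStream in, Charset charset) {
        int count = 0;
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[8192];
            int read;
            while ((read = reader.read(buf)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (buf[i] == '#') {
                        count++;
                    }
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // In-memory sample data; a real program would pass a FileInputStream.
        byte[] sample = "a#b#c".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(
            countHashes(new ByteArrayInputStream(sample), StandardCharsets.UTF_16BE)); // 2
    }
}
```

The loop has the same shape as the one in the question, but it iterates over decoded chars rather than raw bytes, so the comparison with '#' does not depend on how the encoding lays out its bytes.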


6 Comments

Louis Wasserman Over a year ago

+1. If you want to read characters, use a Reader that knows which Charset to use to convert between bytes and characters.

Husky Over a year ago

The problem is that reading this text file has to be very fast, and if I read the file at such a low level, I can skip some characters and improve efficiency...

Matt Ball Over a year ago

@Husky have you benchmarked the code and found that an InputStreamReader is too slow? I seriously doubt it would be a bottleneck.

Joop Eggen Over a year ago

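For Unicode variants, the charset names are "UTF-8", "UTF-16LE" and "UTF-16BE". Unicode numbers its characters well into the 3-byte range. UTF-8 is a multibyte encoding; UTF-16 uses 2-byte units, and some characters need two such units (a surrogate pair), so a single unit can be an incomplete character.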
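
To illustrate the difference between these encodings, here is a small sketch that prints the bytes each one produces for "#" and for a character outside the 16-bit range. The class name EncodingBytes and the helper hex are made up for this example, and U+1F600 is just an arbitrary code point above U+FFFF.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Formats bytes as space-separated uppercase hex, e.g. "00 23".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex("#".getBytes(StandardCharsets.UTF_8)));    // 23
        System.out.println(hex("#".getBytes(StandardCharsets.UTF_16BE))); // 00 23
        System.out.println(hex("#".getBytes(StandardCharsets.UTF_16LE))); // 23 00

        // U+1F600 does not fit in 16 bits: four bytes in UTF-8,
        // and a surrogate pair (two 2-byte units) in UTF-16.
        String big = new String(Character.toChars(0x1F600));
        System.out.println(hex(big.getBytes(StandardCharsets.UTF_8)));    // F0 9F 98 80
        System.out.println(hex(big.getBytes(StandardCharsets.UTF_16BE))); // D8 3D DE 00
    }
}
```

So a fixed "two bytes per character" check only works for one specific encoding and byte order, and even then only for characters that fit in a single 16-bit unit.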

Matt Ball Over a year ago

That's a 5-year-old benchmark. Much has changed in Java and JVM since then. I'm serious: do the simple, sensible thing first, then measure (1) if your program is too slow and only then (2) if the biggest speedup will come from avoiding InputStreamReader. I don't see what this is if not premature optimization. If you need any encoding fancier than ASCII, 99%+ of the time, it simply does not make sense to roll your own implementation.
