3

I am creating a simple application in Java that reads a text file. I have a byte array wrapped in a ByteBuffer:

 FileInputStream inputStream = new FileInputStream(name);
 FileChannel channel = inputStream.getChannel();
 byte[] bArray = new byte[8192];
 ByteBuffer byteBuffer = ByteBuffer.wrap(bArray);
 int read;

and then I use a while loop to go through the text file:

while ((read = channel.read(byteBuffer)) != -1)
{
    for (int i = 0; i < read; i++)
    {
        // my code
    }
    byteBuffer.clear();
}

My question is how to read a Unicode character in this case. Unicode characters consist of 2 bytes (16 bits), so I suppose that bArray[i] holds the first (higher) 8 bits and bArray[i + 1] holds the remaining 8 bits of the character. For instance, if I need to find out whether the character "#" is currently at indices i and i + 1, can I do it like this? ("#" in binary representation: 0010 0011)

if (bArray[i] == (byte)10 && bArray[i+1] == (byte) 11)

Thanks for your responses.

4
  • What exactly are you trying to do? Why do you want to read a text file at such a low level? Do you even know the encoding of the file you're reading? Commented Dec 11, 2012 at 20:19
  • 1
    If "#" is 0010 0011, shouldn't you only be checking if bArray[i] == 0x0 and bArray[i+1] == 0x23? Unicode is two bytes, and since "#" is part of the standard set of ASCII characters, it does not have any bits set in the higher byte, so its representation is 0000 0000 0010 0011 Commented Dec 11, 2012 at 20:23
  • 1
    @jonhopkins Actually, since Java doesn't have a binary representation, it really should be 0x0 and 0x23 respectively. Commented Dec 11, 2012 at 20:24
  • @Jeff fair enough. I was just going off the provided example. I haven't worked with bytes in Java before Commented Dec 11, 2012 at 20:25

1 Answer

6

The simple answer is that you should not treat textual data as a stream of bytes. Specifically that means: don't use ByteBuffer.

Use an InputStreamReader, which knows how to interpret sequences of bytes using a given encoding.
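As a minimal sketch of that approach (not the asker's actual code), the loop from the question can be rewritten to work on decoded characters instead of raw bytes. The class name `HashCounter` and the method `countChar` are hypothetical names chosen for illustration, and the example assumes the file's encoding is known in advance; here UTF-16BE is used only because the question assumes two bytes per character.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HashCounter {
    // Hypothetical helper: counts how often `target` occurs in a stream
    // whose bytes are decoded with the given charset.
    static int countChar(InputStream in, Charset charset, char target) throws IOException {
        int count = 0;
        // InputStreamReader turns bytes into chars; BufferedReader adds buffering.
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (buffer[i] == target) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file; with a real file, pass
        // new FileInputStream(name) instead.
        byte[] data = "a#b#".getBytes(StandardCharsets.UTF_16BE);
        int hashes = countChar(new ByteArrayInputStream(data), StandardCharsets.UTF_16BE, '#');
        System.out.println(hashes); // prints 2
    }
}
```

The comparison `buffer[i] == '#'` now works regardless of how many bytes the encoding uses per character, because the decoding has already happened inside the reader.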


6 Comments

+1. If you want to read characters, use a Reader that knows which Charset to use to convert between bytes and characters.
The problem is that reading the text file has to be very fast, and if I read the file at such a low level, I can skip some characters and increase efficiency...
@Husky have you benchmarked the code and found that an InputStreamReader is too slow? I seriously doubt it would be a bottleneck.
For the Unicode encodings, use the charset names "UTF-8", "UTF-16LE", or "UTF-16BE". Unicode numbers its characters well into the 3-byte range. UTF-8 is a multibyte encoding; UTF-16 uses 2-byte units, some of which are only part of a character.
That's a 5-year-old benchmark. Much has changed in Java and JVM since then. I'm serious: do the simple, sensible thing first, then measure (1) if your program is too slow and only then (2) if the biggest speedup will come from avoiding InputStreamReader. I don't see what this is if not premature optimization. If you need any encoding fancier than ASCII, 99%+ of the time, it simply does not make sense to roll your own implementation.
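The byte patterns discussed in the question and the comments can be checked directly. The sketch below is illustrative only (the class name `HashBytes` and the method `encode` are hypothetical); it encodes "#" (U+0023, decimal 35) with the charsets named above and shows that the byte layout depends on the encoding, which is why a fixed two-byte comparison only works if the file is known to be UTF-16 with a known byte order.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HashBytes {
    // Hypothetical helper: returns the bytes of `s` in the given charset.
    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        // One byte in UTF-8: 0x23.
        System.out.println(Arrays.toString(encode("#", StandardCharsets.UTF_8)));    // [35]
        // Two bytes in UTF-16, high byte first for big-endian.
        System.out.println(Arrays.toString(encode("#", StandardCharsets.UTF_16BE))); // [0, 35]
        // Same two bytes in reverse order for little-endian.
        System.out.println(Arrays.toString(encode("#", StandardCharsets.UTF_16LE))); // [35, 0]

        // A character outside the 16-bit range (U+1F600) needs a surrogate
        // pair in UTF-16, so it takes four bytes rather than two.
        String emoji = "\uD83D\uDE00";
        System.out.println(encode(emoji, StandardCharsets.UTF_16BE).length); // 4
        System.out.println(encode(emoji, StandardCharsets.UTF_8).length);    // 4
    }
}
```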