1

After more than 20 years off, I started to get back into coding again, and started re-learning C++ (a language I practically never used, I used to be an Object Pascal man). So, I consider myself a complete n00b at C++.

I try to write a program that reads an entire file into a ByteStream, and then read values out of that. The (numerical) values I expect are 2-bytes or 4-bytes, unsigned, little-endian (yes, I got the basics down).

Here's my (global) variables:

std::string File_Content = "";  // entire file contents in ByteArray
unsigned __int32 File_Pos = 0;  // position in the File_Content, next Byte about to be read

Here's the function doing the reading (return Value by reference, so there's no pointless copying, because there's strings, too):

static __int8 Read2Bytes(unsigned __int32 iPos, unsigned __int8 iEndian, unsigned __int16 &iReturn)
{
   if (iEndian == 0)  // Little Endian
   { 
      // Push "second" byte to High Byte
      iReturn = File_Content[iPos] | (File_Content[iPos + 1] << 8);
   }  // end if (iEndian == 0)
   else
   {  // Big Endian, not important right now
   }  // end else (iEndian == 0)

   File_Pos += 2;  // move the Position-Marker along
   return 0;
}  // end Read2Bytes

Here's me calling the function to read a 2-byte int:

unsigned __int16 iTimeStamp = 0;      // 2 bytes
Read2Bytes(File_Pos, 0, iTimeStamp);  // Position in the FileStream, Endian, Return Value)

Now, the bytes I expect to read from the ByteStream read as 0xC6 0x9D (in that order), according to WinHex. So, after endianizing, I expect iTimeStamp to return as 0x9DC6, right?

But in fact, it returns as 0xFFC6, and for the life of me, I can't figure out why.

I have another function that is supposed to read 4 bytes, little-endian, LowWord-HighWord, and there I have the same problem.

Please, can anyone open my eyes as to why my HighBytes get lost in translation somewhere?


Edit:

I have experimented a bit to debug the problem, and tried this:

static __int8 Read2Bytes(unsigned __int32 iPos, unsigned __int8 iEndian, unsigned __int16 &iReturn)
{
   unsigned __int8 bHiByte, bLoByte;  // both bytes separately

   if (iEndian == 0)  // Little Endian
   { 
      // Not-Working Version
      //iReturn = File_Content[iPos] | (File_Content[iPos + 1] << 8);

      // new attempt, byte-by-byte
      bLoByte = File_Content[iPos];       // read "first" byte
      bHiByte = File_Content[iPos + 1];   // read "2nd" byte
      iReturn = (bHiByte << 8) | bLoByte; // endian them together

   }  // end if (iEndian == 0)

   (Rest unchanged)

and suddenly it works! From the bytes 0xC6 and 0x9D I do get 0x9DC6!

What did I do wrong with my first attempt?

This program is supposed to favor performance, and I don't think declaring 2 extra variables (including their garbage collection) is the fastest way to do this.

Or, is there a different, better way?

6
  • 2
    I would guess an integer signedness issue. 0xC6 when promoted to an int becomes 0xFFFFFFC6. So the simple fix is to use unsigned integers. You are using non-standard type names, the standard 8 bit integer is int8_t and the standard 8 bit unsigned integer is uint8_t. Similarly for 16 and 32 bit integers. These types are defined in the header <cstdint>. So switch to unsigned bytes, and the problem should go away. Commented Aug 7, 2024 at 15:51
  • Hello john, thanks for your idea, it would make sense. I tried it, and rewrote my code like this (and optimized the File_Pos parameter out, since its global): static uint8_t Read2Bytes(uint8_t iEndian, uint16_t &iReturn) { if (iEndian == 0) { iReturn = File_Content[File_Pos] | (File_Content[File_Pos + 1] << 8); } // end if (iEndian == 0) else bla bla unfortunately, it didnt change anything, the HighByte still gets lost -> 0xFFC6. Commented Aug 7, 2024 at 16:10
  • 1
    Here's your minimal reproducible example. Commented Aug 7, 2024 at 16:14
  • What i wonder: When i do it step by step, with separate unsigned-byte variables, it works fine. When i try to do it in one line, without separate specifically unsigned variables, it doesnt. Is it possible, that the "specifically unsigned" part gets somehow lost, when i do it in one line? If so, how do i get that fact back in? Commented Aug 7, 2024 at 16:16
  • 2
    including their garbage collection There is no garbage collection. They are not on the heap. The compiler's optimizer is incredibly good. Write code for people. Do not try to write code for micro optimizations that the compiler will (probably) do for you — but if you do, make sure you're finely tuned hand-mangled code actually performs better to justify writing harder to comprehend code. For example, some programmers reuse variables because "less resources is better", but the compiler does static single assignment (SSA) optimization and "undoes" all of that inefficient shenanigans. Commented Aug 7, 2024 at 16:53

2 Answers 2

4

Your compiler's char type is signed. char is what you get from std::string.

You should never shift or bit-fiddle with signed type values unless you know extremely well what you are doing. Cast to unsigned char immediately:

iReturn = (unsigned char)File_Content[iPos] 
        | ((unsigned char)File_Content[iPos + 1] << 8);

Live demo.

Sign up to request clarification or add additional context in comments.

2 Comments

Consider using static_cast instead of C-style casting. Or better, just use std::vector<unsigned char> instead of std::string to hold the file content.
John, you steered me gently in the right direction, and Remy, you slapped the solution right in my face. To both: THANK YOU, you are (in hindsight: obviously) correct. The re-casting to (unsigned char) did the trick. I do not yet know what std::vector does, but i will investigate. John, if you still want the minimal reproducable example, i have it now ready (was doing it while remy wrote his answer). Feel free to let me know if you still want it. Again: Thanx to you both.
1

Decided to post it anyway, for possible future readers: My initial problem, in (what i think) minimal reproducable example), with the solution by Remy Lebeau worked into it. Enjoy!

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

//==============================================================================
// Global Variables
//==============================================================================
std::string File_Content = "";
uint32_t File_Pos = 0;

//==============================================================================
// read the entire file into a string/ByteArray
//==============================================================================
static uint8_t GetFileContents(const std::string sFileName)
{
   uint32_t iFileSize = 0;  

   iFileSize = (uint32_t)std::filesystem::file_size(sFileName);

   std::ifstream strStream(sFileName, std::ios::in | std::ios::binary);
   if (strStream)
   {  
      File_Content.resize(iFileSize);
      strStream.read(File_Content.data(), iFileSize);
      strStream.close();
   }  // end if (sStream)

   return 0;
}  // end get_file_contents

//==============================================================================
// read 2-bytes-int out of the string, considering Endian
//==============================================================================
static uint8_t Read2Bytes(uint8_t iEndian, uint16_t &iReturn)
{
   uint8_t bHiByte, bLoByte;

   if (iEndian == 0)  // little Endian
   { 
      // my first attempt, not-working, HiByte gets lost
      //iReturn = File_Content[File_Pos] | (File_Content[File_Pos + 1] << 8);

      // my 2nd attempt, WORKING, but too complicated
      //bLoByte = File_Content[File_Pos];
      //bHiByte = File_Content[File_Pos + 1];
      //iReturn = (bHiByte << 8) | bLoByte;

      // Remys solution, with re-casting to unsigned char per byte
      iReturn = (unsigned char)File_Content[File_Pos] | ((unsigned char)File_Content[File_Pos + 1] << 8);

   }  // end if (iEndian == 0)
   else
   {  // Big Endian
      // not importand here
   }  // end else (iEndian == 0)

   // 2 bytes read, move along PosMarker accordingly
   File_Pos += 2;

   return 0;
}  // end Read2Bytes

//==============================================================================
// read 4-bytes-int out of the string, considering Endian
//==============================================================================
static uint8_t Read4Bytes(uint8_t iEndian, uint32_t &iReturn)
{
   uint8_t bByte1, bByte2, bByte3, bByte4;

   if (iEndian == 0)  // Little Endian
   {
      // my first attempt, not-working, HiByte gets lost
      //iReturn = File_Content[File_Pos] | File_Content[File_Pos + 1] << 8 | File_Content[File_Pos + 2] << 16 | File_Content[File_Pos + 3] << 24;

      // my 2nd attempt, WORKING, but too complicated
      //bByte1 = File_Content[File_Pos];
      //bByte2 = File_Content[File_Pos + 1];
      //bByte3 = File_Content[File_Pos + 2];
      //bByte4 = File_Content[File_Pos + 3];
      //iReturn = bByte1 | (bByte2 << 8) | (bByte3 << 16) | (bByte4 << 24);

      // Remys solution, with re-casting to unsigned char per byte
      iReturn = (unsigned char)File_Content[File_Pos] | ((unsigned char)File_Content[File_Pos + 1] << 8) | ((unsigned char)File_Content[File_Pos + 2] << 16) | ((unsigned char)File_Content[File_Pos + 3] << 24);
   }
   else
   { // Big Endian
     // not important here
   }
   
   // 4 bytes read, move along PosMarker accordingly
   File_Pos += 4;

   return 0;
}  // end Read4Bytes

static uint8_t AnalyzeFileContents()
{

   uint32_t Some4ByteInt = 0;    // 4 bytes, unsigned
   uint16_t Some2ByteInt = 0;    // 2 bytes, unsigned

   Read2Bytes(0, Some2ByteInt);  // read 2 byte int from the string
   Read4Bytes(0, Some4ByteInt);  // read 4 byte int from the string

   // ... and so on
   // omitted: whatever is done with the values afterwards, not essential to the problem

   return 0;
} // end AnalyzeFileContents


int main(int, char const* argv[])
{
   std::string sFileName;
   sFileName = "SomeFile.bin";

   // read the entire file into the string
   GetFileContents(sFileName);  

   // analyze the data in the string
   AnalyzeFileContents(); 
}  // end int main()

In hindsight, it is kind of obvious.

My attempt, where i process byte by byte in separate, uint8_t (i.e. specifically unsigned int) variables working perfectly, as opposed to my "all in one sentence", where i had no idea what the compiler is doing with the bytes while mashing them together... oh well.

And yes, i did not know, that taking a byte out of a string is essentially a signed byte, whereas i needed it to be unsigned.

Kudos to John and Remy, and i hope this helps future readers.

Cheers everybody! AlexisVanWien

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.