Compiler dead space: is it all the same?
#1
And what the hell is compiler dead space, you may ask.

People who write compilers have a habit of being lazy, very lazy.  The code part of your executable will be the same from computer to computer, as long as the compiler versions are the same, so code compiled on one computer should have the same SHA hash as on the next.  What will change is the data and string areas, the same areas that contain uninitialized memory for storage.  A dumb compiler will just assign the space, which shows up as a random segment of the compiling computer's memory.  Looking at the EXE in raw mode may show some interesting information.  A smart compiler will zero out all data and string areas before sending them to the linker/EXE module.

So is QB64pe using a smart or dumb compiler/linker?  The reason I am asking: if I send someone source code and a SHA of the resulting EXE, I can determine whether the compilers are the same version on my computer and on theirs, thus resolving compiler issues before they become a problem.  AKA: they said they were up to date, only to find they are 3 versions behind.

Thanks.
Nobody here may have an answer.  This is a very inside question about the GCC compiler QB64pe is using.
Reply
#2
I know you can spot a QB64 exe from another exe but not sure you can get the version also?
b = b + ...
Reply
#3
(08-20-2023, 11:08 AM)bplus Wrote: I know you can spot a QB64 exe from another exe but not sure you can get the version also?

Version is not really the problem.
The following is not a thing anymore.  Long ago, to infect an EXE, the virus code was tacked onto the end and the start point was redirected to it.  Antivirus could detect that the file was larger than normal.  The next infection point was to use the compiler's dead space in the EXE and jump in there.  To kill that one, PXE packing was added to EXE files.  My problem still exists: the dead space after unpacking the PXE still holds the random data, which would change the SHA hash every time.

Unless you are really into open source fully, being sure the open code compiles like the original writer's code is hard.  That's one of the reasons original coders will only publish the EXE and hash today.  A perfect example of the problem with that is glaringly evident in the "BEST" sorter program out there: QSORT!  Nobody has the original code at all.  The author took it to the grave.

IMHO, qsort will see no more improvements; even running it in Windows requires DOS emulation.  QB64pe will not suffer the death qsort did, because wise men chose open source.
Reply
#4
Oh, Qsort wasn't a sorting routine? (confused)

Well that's why we share bas code to compile ourselves.

And if you want to keep secrets don't make public posts at forums.
b = b + ...
Reply
#5
(08-20-2023, 09:10 AM)doppler Wrote: So is QB64pe using a smart or dumb compiler/linker?  The reason I am asking: if I send someone source code and a SHA of the resulting EXE, I can determine whether the compilers are the same version on my computer and on theirs.

Why would this matter? You usually send a SHA along with an EXE, or binary packages in general, so the recipient can verify it wasn't hacked/modified on the way from you to them.

If you wanna send the source, then you should make sure the source code arrives in its original form, hence you need to send a SHA of the source, not the EXE.

How that source is then compiled on the target system doesn't matter anymore and depends on the system. The EXE will definitely be different if you compile/SHA it on Windows and your recipient does it on Linux/MacOS, so a SHA for the EXE in this case is pointless.
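
To make the "SHA of the source" idea concrete, here is a minimal sketch in Python (not something QB64-PE ships; the function names `sha256_of_file` and `matches_expected` are made up for illustration). It only assumes the sender publishes a SHA-256 hex digest alongside the .bas file.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

One thing to watch: the hash is over the raw bytes, so a source file whose line endings get converted in transit (CRLF vs LF) will hash differently even though the code is "the same".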
Reply
#6
Calling compiler creators lazy is harsh, because development of a compiler is difficult to begin with. Granted, it's way easier than developing an operating system complete with drivers and a visual environment.

I'd tell you one thing. As far as "gcc" is concerned, the program "ld" is a sorry excuse. Generally it just appends one object file to another and then does the necessary to make sure that mess could be executed, as program or as library. It has basic searching capabilities. It would have to unpack a "dot-a" file, or I think it summons "ar" for it.

EDIT: Otherwise "ld" requires precise instructions. Well so does the compiler but there are many more ways to get "ld" to refuse to create an executable file.

Previously I underestimated "LINK.EXE" and "QLINK.EXE" from M$ language products. Because I noticed what was going on with the Power C linker. All it did was search two "library" files after it appended the object files the programmer asked from the compiler.

Some compilers don't directly create object files, it is left to the assembler. So the assembler in some cases should be termed as lazy. The move from 16-bit to 32-bit to 64-bit was just to give slightly different names to the registers according to their size. To make it easy to transition from small to big bit depth, almost nothing was changed about code generation. Except having to deal with those ugly pseudo-assembler blocks in C and C++ programs.

EDIT: Take into account the many different CPUs that the assemblers have to generate code for. Take a short look at the "man pages" for "gcc", at the compiler switches having to do with code generation specific to CPUs. There are like 50 different CPUs in there!

I'm just rambling. But creating a compiler and/or assembler isn't a small undertaking for a group of people who expect later to sit back and reap the rewards of their work. There was a lot of grousing from the AT&T employees who were trying to get the C programming language right. This was before many things had to be added to the language, especially function prototyping which made things more complicated for code generation but it was necessary to reduce the difficult bugs that resulted from calling functions improperly. Before "gcc" became "mature" there was almost no type-casting. Now you see it everywhere in C and C++ code. (Especially what QB64 has to generate for "g++" to compile.) Because it was another thing that must have been driving "lazy" programmers nuts and they refused to blame themselves for it, they blamed it on the compiler and assembler creators. There is still a lot of C "legacy" code that has never been fixed and the companies that own those vaults refuse to fix it, they complain it costs too much money.
Reply
#7
(08-20-2023, 09:10 AM)doppler Wrote: The same areas which contain uninitialized memory for storage.  A dumb compiler will just assign the space.  Which shows up as a random segment of the compiling computers memory.
This is incorrect: modern binary formats like PE (Windows) and ELF (Linux) do not take up disk space with uninitialized memory. What actually happens is the compiler will create a memory segment that is listed as larger than its on-disk contents. The "extra space" that doesn't exist on disk is then allocated by the OS as a bunch of zero'd memory.
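
If anyone wants to see this in their own EXE, here is a rough Python sketch (the function name `sections_with_zero_fill` is hypothetical, and it assumes a well-formed PE file). It walks the PE section table and lists the sections whose in-memory size is larger than what is stored on disk; that difference is the part the loader fills with zeros.

```python
import struct


def sections_with_zero_fill(data):
    """List (name, virtual_size, raw_size) for sections larger in memory than on disk."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE file")
    # e_lfanew at offset 0x3C points at the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # COFF header follows the signature: NumberOfSections at +6,
    # SizeOfOptionalHeader at +20 (both relative to the signature).
    (num_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    (opt_size,) = struct.unpack_from("<H", data, pe_offset + 20)
    table = pe_offset + 24 + opt_size
    found = []
    for index in range(num_sections):
        entry = table + 40 * index  # each section header is 40 bytes
        name = data[entry:entry + 8].rstrip(b"\0").decode("ascii", "replace")
        # VirtualSize, VirtualAddress, SizeOfRawData follow the 8-byte name.
        virtual_size, _address, raw_size = struct.unpack_from("<III", data, entry + 8)
        if virtual_size > raw_size:
            found.append((name, virtual_size, raw_size))
    return found
```

Usage would be something like `sections_with_zero_fill(open("prog.exe", "rb").read())`, where `prog.exe` is whatever you compiled.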

That said, the produced EXEs or executables are very unlikely to come out the same every time. You _can_ do that with `gcc`, people do, but it requires a little work to ensure that the compiler doesn't embed things like the date of compilation into the program (which will obviously screw it up). You can read about some of the challenges here.

I will say, I don't think QB64-PE uses the `__DATE__` and friends macros anywhere in the source, but I also don't really know for sure. `__FILE__` should only be used if you've turned on some C++ debugging information (you can't turn it on without editing the C++ code, so unlikely). Point being, I think there's a decent chance they could be deterministic just accidentally, but we've never checked and it's typically a hard thing to get right so I doubt it.
Reply
#8
(08-20-2023, 02:56 PM)DSMan195276 Wrote: That said, the produced EXEs or executables are very unlikely to come out the same every time. You _can_ do that with `gcc`, people do, but it requires a little work to ensure that the compiler doesn't embed things like the date of compilation into the program (which will obviously screw it up).

Programs compiled with QB64(pe), or in general using gcc, will ALWAYS be different. Even if you compile the very same program several times (compile once, rename the exe, compile again), I always get at least 3 different bytes within the first 256 bytes. Maybe a binary date/time entry?
Reply
#9
(08-20-2023, 03:32 PM)RhoSigma Wrote: Programs compiled with QB64(pe), or in general using gcc will ALWAYS be different. Even if you compile the very same program several times (compile once, rename the exe, compile again) I always have at least 3 different bytes within the first 256 bytes, maybe a binary date/time entry?
You would be correct, that's the start of the PE header. I didn't know this until now but the PE header contains a timestamp, which matches up with the spot you're seeing as different. You can see it on the Wikipedia page about PE format.

The other byte I think is part of the checksum, but you'd expect all 4 to be different so maybe I counted wrong. That wouldn't be surprising though, that would just indicate there's more differences if you keep going.
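
For anyone who wants to check this on their own builds, here is a small Python sketch (the helper names `pe_volatile_fields` and `normalized_sha256` are made up). It reads the two header fields mentioned above, the COFF `TimeDateStamp` and the optional-header `CheckSum`, and hashes the file with both zeroed out. It assumes a well-formed PE file and does not prove two builds are otherwise identical; any other differing byte will still change the hash.

```python
import hashlib
import struct


def pe_volatile_fields(data):
    """Return {field name: (file offset, value)} for TimeDateStamp and CheckSum."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE file")
    # e_lfanew at offset 0x3C points at the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    stamp_offset = pe_offset + 8           # 4-byte signature + 4 bytes into the COFF header
    checksum_offset = pe_offset + 24 + 64  # optional header starts at +24; CheckSum is at +64
    return {
        "TimeDateStamp": (stamp_offset, struct.unpack_from("<I", data, stamp_offset)[0]),
        "CheckSum": (checksum_offset, struct.unpack_from("<I", data, checksum_offset)[0]),
    }


def normalized_sha256(data):
    """SHA-256 of the file contents with TimeDateStamp and CheckSum zeroed."""
    masked = bytearray(data)
    for offset, _value in pe_volatile_fields(data).values():
        masked[offset:offset + 4] = b"\0\0\0\0"
    return hashlib.sha256(masked).hexdigest()
```

If two builds of the same source give the same `normalized_sha256`, the timestamp and checksum were the only differences; if not, there is more going on further into the file.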
Reply
#10
Well, the last couple of messages kinda prove my point, but not exactly as I expected.  The original EXE file format was an unlinked COM file, expanded.  Inside the EXE is a link table that describes all the absolute calls and jumps, so a program that did a call X or jmp X would have a table entry for X listing all references to X's locations.  At load time the location of X becomes known and the references are backfilled.  Since Windows is 95% Intel architecture, 64K code segments are the norm.  Any even memory address could be the CS register location.  Never use an odd start address: 16/32/64-bit processors take a very bad cycle hit for it.  So all the EXE link table entries are processed with CS:offset memory locations, and execution then starts at the location in the EXE header, which is a CS:IP address.  Since the old EXE format can be hacked and its code modified, PXE compression makes it almost impossible to change.

So it's a bad idea to send source to someone (working on the project) and expect a matching hash.  In short, it won't happen.
My bad idea is shot down in flames and spirals into the channel.
Reply



