32

The ultimate goal is to compare two binaries built from exactly the same source in exactly the same environment and be able to tell that they are indeed functionally equivalent.

One application for this would be focusing QA time on things that were actually changed between releases, as well as change monitoring in general.

MSVC in tandem with the PE format naturally makes this very hard to do.

So far I have found and neutralized these things:

  • PE timestamp and checksum
  • Digital signature directory entry
  • Debugger section timestamp
  • PDB signature, age and file path
  • Resources timestamp
  • All file/product versions in VS_VERSION_INFO resource
  • Digital signature section

I parse the PE, find the offsets and sizes of all those things, and ignore those byte ranges when comparing binaries. Works like a charm (well, for the few tests I've run). I can tell that a signed executable with version 1.0.2.0 built on Win Server 2008 is equal to an unsigned one of version 10.6.6.6 built on my Win XP dev box, as long as the compiler version and all sources and headers are the same. This seems to work for VC 7.1 -- 9.0 (for release builds).
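
A minimal Python sketch of that masking idea, not the actual tool: it covers only three of the fields listed above (PE timestamp, checksum, and the digital signature directory entry). The function names `pe_volatile_ranges` and `equal_ignoring` are made up for illustration; the offsets are the standard PE/COFF header offsets.

```python
import struct

def pe_volatile_ranges(data: bytes):
    """Return (offset, size) ranges for the COFF timestamp, the optional
    header checksum and the certificate-table directory entry."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)    # e_lfanew
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    opt = pe_off + 24                                   # optional header start
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:                                  # PE32
        dirs = opt + 96
    elif magic == 0x20B:                                # PE32+
        dirs = opt + 112
    else:
        raise ValueError("unknown optional header magic")
    return [
        (pe_off + 8, 4),     # COFF TimeDateStamp
        (opt + 64, 4),       # CheckSum
        (dirs + 4 * 8, 8),   # data directory entry 4: certificate table
    ]

def equal_ignoring(a: bytes, b: bytes, ranges) -> bool:
    """Compare two images after zeroing the given byte ranges in both."""
    if len(a) != len(b):
        return False
    ma, mb = bytearray(a), bytearray(b)
    for off, size in ranges:
        ma[off:off + size] = bytes(size)
        mb[off:off + size] = bytes(size)
    return ma == mb
```

Typical use would be `equal_ignoring(old, new, pe_volatile_ranges(old))`. This sketch treats files of different length as different, so a signed versus unsigned comparison would first need the appended signature data stripped.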

With one caveat.

Absolute paths for both builds must have the same length.

cl.exe converts relative paths to absolute ones and puts them right into the object files along with compiler flags and so on. This has a disproportionate effect on the whole binary. A one-character change in the path results in one byte changed here and there, several times over the whole .text section (as many times as there are linked objects, I suspect). Changing the length of the path results in significantly more differences, both in the obj files and in the linked binary.

It feels like the file path together with the compile flags is used as some kind of hash, which makes it into the linked binary or even affects the placement order of unrelated pieces of compiled code.

So here is the 3-part question (summarized as "what now?"):

  • Should I abandon the whole project and go home because what I am trying to do breaks laws of physics and corporate policy of MS?

  • Assuming I handle absolute path issue (on policy level or by finding a magical compiler flag), are there any other things I should look out for? (things like __TIME__ do mean changed code, so I don't mind those not being ignored)

  • Is there a way to either force compiler to use relative paths, or to fool it into thinking the path is not what it is?

Reason for the last one is beautifully annoying Windows file system. You just never know when deleting several gigs worth of sources and objects and svn metadata will fail because of a rogue file lock. At least creating new root always succeeds while there is space left. Running multiple builds at once is an issue too. Running bunch of VMs, while a solution, is a rather heavy one.

I wonder if there is a way to set up a virtual file system for a process and its children so that several process trees will see different "C:\build" dirs, private to them only, all at the same time... A lightweight virtualization of sorts...

UPDATE: we recently open-sourced the tool on GitHub. See the Compare section in the documentation.

Eugene
  • (+1) Thanks for the `--compare` option of peparser. But this part `PDB ... file path` doesn't seem to work in all cases. If I rebuild a VC++ 2015 project after just adding [`/PDBALTPATH:%_PDB%`](https://learn.microsoft.com/en-us/cpp/build/reference/pdbaltpath-use-alternate-pdb-path?view=vs-2015) to the linker command line (which causes the actual path to be stripped from the binary image), then peparser reports it as `not equivalent` to the original build. – dxiv May 09 '19 at 04:07
  • @dxiv can you open a bug on github with attached binaries if possible? – Eugene May 09 '19 at 15:48
  • [Done](https://github.com/smarttechnologies/peparser/issues/2), thank you for looking into this. – dxiv May 10 '19 at 02:23

5 Answers

13

I solved this to an extent.

Currently we have a build system that makes sure all new builds are on a path of constant length (builds/001, builds/002, etc.), thus avoiding shifts in the PE layout. After the build, a tool compares the old and new binaries, ignoring relevant PE fields and other locations with known superficial changes. It also runs some simple heuristics to detect dynamic ignorable changes. Here is the full list of things to ignore:

  • PE timestamp and checksum
  • Digital signature directory entry
  • Export table timestamp
  • Debugger section timestamp
  • PDB signature, age and file path
  • Resources timestamp
  • All file/product versions in VS_VERSION_INFO resource
  • Digital signature section
  • MIDL vanity stub for embedded type libraries (contains timestamp string)
  • __FILE__, __DATE__ and __TIME__ macros when they are used as literal strings (can be wide or narrow char)
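
One way such a heuristic could look, as a rough Python sketch rather than the tool's actual logic: scan for strings shaped like `__DATE__` ("Mmm dd yyyy") and `__TIME__` ("hh:mm:ss") literals in both narrow and UTF-16LE form and report their byte ranges. `__FILE__` is not handled here, the name `find_build_stamps` is hypothetical, and any unrelated string with the same shape would also be flagged.

```python
import re

MONTHS = (b"Jan", b"Feb", b"Mar", b"Apr", b"May", b"Jun",
          b"Jul", b"Aug", b"Sep", b"Oct", b"Nov", b"Dec")

def _widen(raw: bytes, wide: bool) -> bytes:
    """Interleave NUL bytes to mimic a UTF-16LE literal when wide is set."""
    return b"".join(bytes([c]) + b"\x00" for c in raw) if wide else raw

def _build(wide: bool):
    """Build (date, time) regexes for narrow or wide string literals."""
    z = rb"\x00" if wide else b""
    month = b"(?:" + b"|".join(re.escape(_widen(m, wide)) for m in MONTHS) + b")"
    digit = rb"[0-9]" + z
    space = rb" " + z
    colon = rb":" + z
    # __DATE__ pads single-digit days with a space: "Jul  5 2009"
    date = month + space + rb"[ 0-9]" + z + digit + space + digit * 4
    time = digit * 2 + colon + digit * 2 + colon + digit * 2
    return re.compile(date), re.compile(time)

def find_build_stamps(data: bytes):
    """Return sorted (offset, size) ranges of date/time-looking literals."""
    ranges = []
    for wide in (False, True):
        for pattern in _build(wide):
            ranges += [(m.start(), m.end() - m.start())
                       for m in pattern.finditer(data)]
    return sorted(ranges)
```

The returned ranges can be masked out in the same way as the fixed PE fields before comparing.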

Once in a while the linker makes some PE sections bigger without throwing anything else out of alignment. It looks like it moves the section boundary inside the padding (it is zeros all around anyway), but because of that I get binaries with a 1-byte difference.

UPDATE: we recently open-sourced the tool on GitHub. See the Compare section in the documentation.

Eugene
  • Here is a simple workaround for the TLB timestamp (tested only on msvs_2015 + MIDL version 7.00.0555): [peparser_with_tlb](https://github.com/smalti/peparser) – Smalti Jun 29 '17 at 10:22
8

Standardise Build Paths

A simple solution would be to standardise on your build paths, so they are always of the form, for example:

c:\buildXXXX

Then, when you compare, say, build0434 to build0398, just preprocess the binary to change all occurrences of build0434 to build0398. Choose a pattern you know is unlikely to show up in your actual source/data, except in those strings the compiler/linker embed into the PE.

Then you can just do your normal difference analysis. By using the same length pathnames, you won't shift any data around and cause false positives.
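
A small Python sketch of that preprocessing step, assuming the build directory name appears as plain narrow or UTF-16LE text. The function name `normalize_build_id` is hypothetical, and any place where the path is stored in some other encoded form will not be touched by it.

```python
def normalize_build_id(data: bytes, build_id: str, reference_id: str) -> bytes:
    """Rewrite every occurrence of build_id to reference_id without
    changing the length of the image."""
    if len(build_id) != len(reference_id):
        raise ValueError("build ids must have the same length")
    # Cover both narrow strings and wide (UTF-16LE) strings.
    for encoding in ("ascii", "utf-16-le"):
        data = data.replace(build_id.encode(encoding),
                            reference_id.encode(encoding))
    return data
```

For example, `normalize_build_id(new_image, "build0434", "build0398")` would be run on the newer binary before diffing it against the older one.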

Dumpbin utility

Another tip is to use dumpbin.exe (ships with MSVC). Use dumpbin /all to dump all details of a binary to a text/hex dump. This can make it more obvious to see what/where is changing.

For example:

dumpbin /all program1.exe > program1.txt
dumpbin /all program2.exe > program2.txt
windiff program1.txt program2.txt

Or use your favourite text diffing tool, instead of Windiff.
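
The same comparison can be scripted. Below is a Python sketch that assumes `dumpbin` is on the PATH (for example in a Visual Studio command prompt); the helper names `dump` and `diff_dumps` are made up for illustration.

```python
import difflib
import subprocess

def dump(path: str) -> str:
    """Run 'dumpbin /all' on a binary and return its text output."""
    result = subprocess.run(["dumpbin", "/all", path],
                            capture_output=True, text=True, check=True)
    return result.stdout

def diff_dumps(old_text: str, new_text: str):
    """Return unified-diff lines between two dumps, without context lines."""
    return list(difflib.unified_diff(old_text.splitlines(),
                                     new_text.splitlines(),
                                     "old", "new", lineterm="", n=0))
```

Calling `diff_dumps(dump("program1.exe"), dump("program2.exe"))` gives a list of changed lines that a script can filter further, for instance to drop timestamp lines.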

Bindiff utility

You may find Microsoft's bindiff.exe tool useful, which can be obtained here:

Windows XP Service Pack 2 Support Tools

It has a /v option, to instruct it to ignore certain binary fields, such as timestamps, checksums, etc.:

"BinDiff uses a special compare routine for Win32 executable files that masks out various build time stamp fields in both files when performing the compare. This allows two executable files to be marked as "Near Identical" when the files are truely identical, except for the time they were built."

However, it sounds like you may be already doing a superset of what bindiff.exe does.

Slacker
  • Unfortunately source path is not kept in plain text, and I couldn't find any info on what is actually affected by it and if I can safely ignore it. (false negatives are much worse than positives after all). – Eugene Jul 25 '09 at 01:29
3

Have you tried disassembling the executable and comparing the disassembly? That should remove a lot of the distracting details you mention, and make removing others a lot easier.

Ori Pessach
  • Didn't try that, no. Even if it works it can't really be reliably automated... Although this might bring some light into what exactly is different. I'll try that, thanks. – Eugene Jul 25 '09 at 01:18
  • I'm sure you could automate disassembling software. Run from a command line... It might be a good solution depending on what kind of snags you hit with the disassembler's output ;) – Kieveli Jul 25 '09 at 01:42
3

Is there a way to either force compiler to use relative paths, or to fool it into thinking the path is not what it is?

You have two ways to do this:

  1. Use the subst.exe command and map a drive letter to the build folder (this may not be reliable).
  2. If subst.exe doesn't work, then create shares for each of your build folders and use the "net use" command. This one almost certainly should work.

In either case, you're going to map and reuse the same drive letter for a folder before you start a particular build, so that the path appears identical to the compiler.
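
A hedged Python sketch of the first option: a context manager that maps a drive letter with `subst` for the duration of one build and removes the mapping afterwards. The name `mapped_drive` and the `run` parameter are illustrative only. Because a drive letter can point at only one folder at a time, this does not help with builds that run concurrently.

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def mapped_drive(letter: str, folder: str, run=subprocess.check_call):
    """Map letter: to folder via subst, yield the drive root, then unmap."""
    drive = letter.rstrip(":") + ":"
    run(["subst", drive, folder])           # subst B: C:\builds\0434
    try:
        yield drive + "\\"
    finally:
        run(["subst", drive, "/D"])         # remove the mapping again
```

A build script would then do something like `with mapped_drive("B", build_folder) as root:` and invoke the compiler with `root` as its working directory, so every build sees the same absolute path.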

hythlodayr
  • I'd suggest the same but using symbolic links under a common directory like C:\BUILD\XXX – Preet Sangha Jul 25 '09 at 01:45
  • Preet, how do you create a symbolic link on Windows? – Rob Kennedy Jul 25 '09 at 01:50
  • NTFS supports junction points. But you'll need to download a utility OR be on Vista+. Windows does technically treat junction points differently, so like subst.exe this may or may not work. – hythlodayr Jul 25 '09 at 02:16
  • Junctions would work except for requirement of same path pointing to different place for 2 processes running at the same time. They would simplify cleanup I guess... – Eugene Jul 25 '09 at 03:18
  • I didn't see your "at the same time" requirement tucked at the end there. Why not simplify matters by building sequentially? – hythlodayr Jul 25 '09 at 03:41
  • There are a lot of things to build every night, each can take several hours (on a quite good machine too). Also lots of builds during day. (and those are clean release builds too, not CC) – Eugene Jul 25 '09 at 04:38
  • Aren't there builds that can be safely kicked-off at the same time; i.e., nothing to diff? Anyway, if you don't mind complexity (and you should...), you can set a junction-point at a more granular step. Say, at the project-step rather than at the beginning of the build. As long as the project retains the same link name, the target & object files of each project will consistently use the same path. But you'll need some sort of mutex to prevent two build processes from trying to build the same project at the same time. – hythlodayr Jul 25 '09 at 05:37
  • That didn't come out so clearly. As long as the project consistently creates the same link-name, then the binaries for that project ought to refer to the same path. You just have to handle the case when multiple build processes try to build the same project at the same time... – hythlodayr Jul 25 '09 at 05:47
1

I came across an additional tool to help solve this problem: Ducible on GitHub

"This is a tool to make builds of Portable Executables (PEs) and PDBs reproducible."

It modifies the provided *.exe, *.dll and *.pdb files, in place, replacing non-deterministic data with deterministic data.

Jason