
I have a .chm file (from 7-Zip, but I don't think it matters). I extracted the contents of the .chm and got the expected .hhc, .hhk, .htm, and .css files. However, I also got 10 more files with no extension, eight of which begin with a hash (e.g. '#OBJINST') and two of which start with a dollar sign. When I open these files in Atom or VS Code, I get a bunch of random characters (empty squares, triangles with question marks, and so on) with a few actual words scattered here and there, like "HHA Version 4.74.8702" or "7zip.hhk".

I'm trying to parse these files to learn more about how .chm files work, and I'd really like to figure out how these extensionless files work and how they fit into the picture. I've done Google searches, but nothing relevant turned up. It looks like an encoding issue, but none of Atom's encoding options fixed the problem.

Any idea what's going on here? More specifically, how can I view the contents of these files (if I even can)?

W. MacTurk

2 Answers


The Microsoft CHM help file format is a proprietary binary format which is basically an LZX-compressed archive including:

  • Topic contents as HTML or MHT files
  • Accompanying asset files such as images, CSS, JavaScript...
  • Various textual project-related files (table of contents, topic IDs...)
  • Some binary files which I believe include indexes (such as search engine data...) for faster operation
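
Because those binary files are not text in any encoding, an editor will show mostly garbage. A minimal sketch of one way to look inside them is to pull out only the runs of printable ASCII, similar to what the Unix `strings` tool does. The function name `printable_strings` and the minimum length of 4 are my own choices for illustration, not part of any CHM tooling.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII bytes that is at least min_len long."""
    # 0x20-0x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Hypothetical usage on one of the extracted files:
#   from pathlib import Path
#   for text in printable_strings(Path("#OBJINST").read_bytes()):
#       print(text)
```

This only surfaces readable fragments; the bytes in between are structured binary data and need the reverse-engineered format descriptions to interpret.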

Those files are usually produced by the Microsoft HTML Help Workshop compiler, either directly or via a help authoring tool such as HelpNDoc, RoboHelp...

The Microsoft HTML Help Workshop software can be used to decompile CHM help files. Decompression software supporting the LZX algorithm (such as 7-Zip) and help authoring tools can usually be used to extract content from those files.

As far as I know, there is no official Microsoft documentation for that format, but it has been reverse engineered by Matthew T. Russotto.

jonjbar

As you may know, Windows HTML Help is delivered as an LZX-compressed binary file with the .chm extension. It contains a set of HTML files, a hyperlinked table of contents, and an index file. The file format has been reverse-engineered, and documentation of it is freely available, e.g. the Unofficial (Preliminary) HTML Help Specification. This is the best source I know of.

In relation to your question, you should look at the Internal file formats section in particular. Please also note the image in the $FIftiMain section.

But I would like to warn you that dealing with this internal file format can cost you a lot of time.

The file starts with the bytes "ITSF" (in ASCII), for "Info-Tech Storage Format" (see Microsoft's HTML Help (.chm) format documentation). The CHM can be opened using FAR HTML, as shown (see screenshot) in my answer in this SO thread on how to get CHM details from a help ID.
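
As a small illustration of the signature mentioned above, here is a hedged Python sketch that checks whether a file begins with the four ASCII bytes "ITSF". The helper names `looks_like_chm` and `is_chm_file` are made up for this example, and the check says nothing about whether the rest of the file is valid.

```python
CHM_MAGIC = b"ITSF"  # the ASCII signature at the start of a .chm file


def looks_like_chm(data: bytes) -> bool:
    """Return True if the buffer starts with the ITSF signature."""
    return data[:4] == CHM_MAGIC


def is_chm_file(path: str) -> bool:
    """Read only the first four bytes of the file and test them."""
    with open(path, "rb") as handle:
        return looks_like_chm(handle.read(4))
```

For example, `is_chm_file("7-zip.chm")` (a hypothetical path) should return `True` for a real CHM, while any of the extracted internal files would normally return `False`.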

For more information on decompiling, have a look at Decompile CHM too.

help-info.de