Damn. I just thought up another piece of software (that I’m capable of writing) that I can’t find. This is bad; it means it’s going to haunt me until I code it.
So, frequently, I’m faced with streams of bytes of unknown origin/purpose. (For example, the .TiVo file format, RTMP streams, and most recently, Outlook “NK2” address autocompletion cache files.) I’ve had experience finding patterns, but it’s always so time-consuming. Usually I’m compiling some little C program over and over, slowly tweaking some guessed-at structure. This is basically the advice I got from Andrew Tridgell when I asked how he went about reverse engineering protocols. His methods deal more with sending/receiving, so it’s much more interactive. Most of what I’ve mucked with are just unknown file formats.
What I want is a nice GUI tool that will let me specify a language to describe a data file’s contents. I can see lots of meta-specifications like “repeat this structure until EOF”, and “if byte 5 is 1, read X bytes, otherwise, read X+50 bytes”, etc. Most data formats have pretty simple layouts after you figure them out. As you create the structure for the data to fit into, you can see the data from your example file displayed live. This way you can quickly tweak lengths, offsets, encoding types, endianness, etc, without needing to totally recompile your test harness.
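To make the idea a bit more concrete, here is a minimal sketch of what such a description might look like if it were plain Python data rather than a GUI. Everything in it is an assumption for illustration: the RECORD layout, the field names, the "+2" length rule, and the parse_record/parse_all helpers are all made up, and the only library used is the standard struct module. It covers the two meta-specifications mentioned above, a length that depends on an earlier byte and "repeat this structure until EOF".

```python
import io
import struct

# Hypothetical layout. Each entry is (name, kind), where kind is either a
# struct format string (fixed-size field) or a callable that takes the fields
# parsed so far and returns how many raw bytes to read.
RECORD = [
    ("tag", "<B"),        # unsigned 8-bit
    ("flag", "<B"),       # unsigned 8-bit
    ("length", "<H"),     # unsigned 16-bit, little-endian
    # "if flag is 1, read length bytes, otherwise read length+2 bytes"
    ("payload", lambda f: f["length"] if f["flag"] == 1 else f["length"] + 2),
]


def parse_record(stream, layout):
    """Parse one record; return None on a clean EOF at a record boundary."""
    fields = {}
    for name, kind in layout:
        size = kind(fields) if callable(kind) else struct.calcsize(kind)
        data = stream.read(size)
        if size > 0 and not data and not fields:
            return None  # nothing left to read before the first field
        if len(data) != size:
            raise ValueError(f"truncated field {name!r}")
        if callable(kind):
            fields[name] = data  # variable-length fields stay as raw bytes
        else:
            (fields[name],) = struct.unpack(kind, data)
    return fields


def parse_all(data, layout):
    """Apply the layout repeatedly until EOF."""
    stream = io.BytesIO(data)
    records = []
    while True:
        record = parse_record(stream, layout)
        if record is None:
            return records
        records.append(record)


# Two made-up records: one with flag=1, one with flag=0.
SAMPLE = (bytes([0x01, 0x01, 0x03, 0x00]) + b"abc"
          + bytes([0x02, 0x00, 0x01, 0x00]) + b"xyz")

for record in parse_all(SAMPLE, RECORD):
    print(record)
```

Changing "<H" to ">H" or editing the lambda and re-running is the text-mode version of the quick tweaking I'm after; the GUI would just re-run the parse on every edit and show the result live.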
Hell, it could even spit out the C code to process it, too. :)
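A rough sketch of that last bit, under heavy assumptions: it only handles fixed-size integer fields described with struct-style format strings, the HEADER layout and the emit_c_struct helper are hypothetical, and the generated struct ignores byte order entirely. It relies on "#pragma pack", which is a compiler extension (GCC, Clang and MSVC accept it) rather than standard C.

```python
# Hypothetical mapping from struct format characters to C99 fixed-width types.
C_TYPES = {
    "b": "int8_t", "B": "uint8_t",
    "h": "int16_t", "H": "uint16_t",
    "i": "int32_t", "I": "uint32_t",
    "q": "int64_t", "Q": "uint64_t",
}

# Made-up fixed-size layout: (field name, struct format string).
HEADER = [
    ("tag", "<B"),
    ("flag", "<B"),
    ("length", "<H"),
]


def emit_c_struct(name, layout):
    """Return C source text for a packed struct matching the layout."""
    lines = ["#include <stdint.h>", "", "#pragma pack(push, 1)",
             f"struct {name} {{"]
    for field, fmt in layout:
        code = fmt[-1]  # drop any byte-order prefix such as "<" or ">"
        if code not in C_TYPES:
            raise ValueError(f"unsupported format {fmt!r} for field {field!r}")
        lines.append(f"    {C_TYPES[code]} {field};")
    lines += ["};", "#pragma pack(pop)"]
    return "\n".join(lines)


print(emit_c_struct("record_header", HEADER))
```

Variable-length fields and conditionals would need generated reading code rather than a struct declaration, which this sketch does not attempt.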
I’m thinking about using Gtk and Python. We’ll see how rapid that path is for developing a nice GUI. I’ve heard good things. :)
© 2005, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
If you look in the documentation that ships with most distributions, you should be able to find a bunch of PyGTK examples.
Included with them is an example that lets you take a Glade file and preview it running using Python. I’ve found this quite useful, although I haven’t taken the next step of actually writing a Python backend for any of my mockups. It might interest you to check it out.
Comment by Alan — July 14, 2005 @ 9:03 am
I’ve also been in the same boat, reverse engineering file formats, and I’ve found that instead of using a GUI to do something like this, you can just use XML. Yes, XML …
Comment by jay vaughan — July 14, 2005 @ 9:59 am
[…] Kees idea is brilliant! It would’ve worked well for Alan Turing and others when working on Enigma back during WW2. Posted by jon @ 11:49:53 2005.07.14 […]
Pingback by rejon.org = jon phillips portal — July 14, 2005 @ 10:11 am
I’ve wanted something like that as well in the past, I’ve just been too busy/lazy to produce it.
Another way of looking at it is that it basically amounts to a protocol analyzer whose definitions could be updated on-the-fly.
Comment by MenTaLguY — July 14, 2005 @ 11:31 am
A very brilliant idea, as rejon said. I would suggest it accept multiple files (at least two) of the same format to allow deeper pattern analysis.
Comment by verbalshadow — July 14, 2005 @ 11:46 am
I’ve got to recommend Python. Not only is it a great language, it seems well suited for this project. You can extend it with C or asm for the tricky parts, and Python itself could serve as the built-in language used for simple tasks.
@ Jay V. XML … I never would have thought of that.
Comment by Jinx — April 12, 2007 @ 7:19 am
Just for the record, Hachoir pretty much does this nowadays – maybe it was inspired by this post?
Comment by oliver — November 4, 2011 @ 1:57 pm