Monday, December 25, 2006

Another script: revision controlled documents

I mentioned that I used OpenOffice for my thesis.

However, a problem is that binary files, like .ODT, don't work well with revision control systems.

Firstly, they aren't diff-able, so you also have to keep a .TXT file in the repository, in sync with the .ODT. This is error-prone, and the .TXT diffs don't reflect formatting changes.

Secondly, because .ODT files are compressed (they are .ZIP archives), the binary deltas that revision control systems (e.g. SVN) use to save space fall apart for even the tiniest of changes to documents: a small edit to the text can change most of the compressed bytes.
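Here is a rough Python illustration of that second point. It is only a sketch: the filler text, the helper names (shared_ends, make_document) and the "count the bytes that match at the start and at the end" measure are all made up for this example, and that measure is a crude stand-in for a real delta algorithm, not what SVN actually does.

```python
import random
import zlib


def shared_ends(a: bytes, b: bytes) -> int:
    """Count bytes that match at the start plus bytes that match at the end."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix + suffix


def make_document(seed: int = 0, paragraphs: int = 2000) -> str:
    """Build deterministic filler text standing in for a document body."""
    rng = random.Random(seed)
    words = ["kernel", "thread", "page", "fault", "stack", "cache", "timer", "message"]
    lines = [" ".join(rng.choice(words) for _ in range(12)) for _ in range(paragraphs)]
    return "\n".join(lines)


original = make_document().encode("utf-8")
edited = b"One new word. " + original  # a tiny edit at the very start

# Fraction of the uncompressed file that is byte-identical around the edit.
raw_reuse = shared_ends(original, edited) / len(edited)

# The same measure after compressing both versions.
packed_a = zlib.compress(original)
packed_b = zlib.compress(edited)
packed_reuse = shared_ends(packed_a, packed_b) / max(len(packed_a), len(packed_b))

print(f"uncompressed: {raw_reuse:.1%} of bytes reusable")
print(f"compressed:   {packed_reuse:.1%} of bytes reusable")
```

The uncompressed versions are identical apart from the inserted words, while the two compressed streams diverge almost immediately and never line up again under this measure.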

But thanks to the second anonymous commenter on that blog post, I came up with a new way of revision-controlling documents:

What I now do is keep the unzipped document in revision control, not the binary .ODT file. This means that I can diff the content.xml between revisions, not bother keeping a .TXT file in sync and avoid SVN's space-inefficient binary deltas since I'm really only keeping text files in the repository (more on this later).

If you check out such an unzipped document from SVN, you can use my magic Makefile to reconstruct the .ODT from these unzipped contents by typing make. It's just like adding water to milk powder.

And every time you make a change to the .ODT file, the same make command works the other way and updates the unzipped contents to match the changed .ODT. Then you can svn commit a new revision.
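For anyone who wants to see the round trip without reading the Makefile, here is a minimal Python sketch of the same idea. It is not the Makefile itself: the function names unpack_odt and pack_odt are hypothetical, and the only format detail it relies on is the OpenDocument packaging convention that the mimetype entry comes first in the archive and is stored uncompressed.

```python
import os
import zipfile


def unpack_odt(odt_path: str, tree_dir: str) -> None:
    """Extract every member of the .ODT into tree_dir."""
    with zipfile.ZipFile(odt_path) as archive:
        archive.extractall(tree_dir)


def pack_odt(tree_dir: str, odt_path: str) -> None:
    """Rebuild the .ODT from the files under tree_dir."""
    members = []
    for root, dirs, files in os.walk(tree_dir):
        # Skip SVN's administrative directories inside the checkout.
        dirs[:] = [d for d in dirs if d != ".svn"]
        for name in files:
            full = os.path.join(root, name)
            # Zip member names always use forward slashes.
            members.append(os.path.relpath(full, tree_dir).replace(os.sep, "/"))

    # "mimetype" sorts first; everything else follows alphabetically.
    members.sort(key=lambda member: (member != "mimetype", member))

    with zipfile.ZipFile(odt_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for member in members:
            # Store "mimetype" uncompressed, deflate the rest.
            method = zipfile.ZIP_STORED if member == "mimetype" else zipfile.ZIP_DEFLATED
            archive.write(os.path.join(tree_dir, member), member, compress_type=method)
```

With a hypothetical thesis.odt, unpack_odt("thesis.odt", "thesis.odt.unzipped") would refresh the directory you commit, and pack_odt("thesis.odt.unzipped", "thesis.odt") would rebuild the document after a checkout. The sketch does not delete files that have disappeared from the .ODT and does not recreate empty directories, so treat it as a starting point only.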


* Warning: svn status will not report changes to the .ODT until you type make. Be careful, or you might think that your checkout has no local changes and decide to delete the checkout to "save space"! A "foolproof" way to get around this problem is to skip the svn propedit svn:ignore . [and type kolourpaint-developer-guide.odt] step.

* It actually does store a binary file in revision control, namely Thumbnails/thumbnail.png. I haven't dared try to work around this though, for fear that OpenOffice won't like me playing with it.

* Some files, such as layout-cache and settings.xml, are not binary (unlike Thumbnails/thumbnail.png) but change on every save, so they probably shouldn't be revision controlled either.

* You can't have 2 people working on the same document as merging two content.xml files is asking for trouble. But at least you can diff between revisions.

* A lot of lines in content.xml may change in response to even a small layout change due to lots of tags changing control numbers.
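To make the "at least you can diff" point concrete, here is a small Python sketch that re-indents two versions of an XML file and prints a line-based diff. The helper names (pretty_lines, diff_content) and the tiny sample documents are invented for the example, and real content.xml files are far noisier than this.

```python
import difflib
from xml.dom import minidom


def pretty_lines(xml_text: str) -> list:
    """Re-indent XML so that each element sits on its own line."""
    pretty = minidom.parseString(xml_text).toprettyxml(indent="  ")
    # Drop blank lines that the pretty-printer may leave behind.
    return [line for line in pretty.splitlines() if line.strip()]


def diff_content(old_xml: str, new_xml: str) -> list:
    """Return a unified diff between two re-indented XML documents."""
    return list(
        difflib.unified_diff(
            pretty_lines(old_xml),
            pretty_lines(new_xml),
            "old/content.xml",
            "new/content.xml",
            lineterm="",
        )
    )


old = "<doc><p>Hello world</p><p>Second paragraph</p></doc>"
new = "<doc><p>Hello there</p><p>Second paragraph</p></doc>"

for line in diff_content(old, new):
    print(line)
```

Re-indenting changes whitespace, so this is only for reading diffs. The re-indented text should never be written back into the document.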

If you try this scheme, please let me know how it goes.

BTW, in the end, my thesis turned out to be 192 pages (probably about 100 pages too long :)). It was on porting the L4 microkernel to processors without virtual memory, specifically the Blackfin processor. It was written in a rush, so apologies for the awful number of spelling and grammatical errors!


Dave Page said...

I'm no Subversion expert, but couldn't you handle the unpacking / repacking automatically in a pre-commit hook?

Lee B said...

Agreed: pre-commit and post-checkout hooks would be the best way to handle this. Although if you're looking for the best way, you should rm -rf svn and get a modern vcs like bzr or mercurial ;)

It always bugged me too, that OpenOffice used nice XML all wrapped up in binary.

p.s.: xmldiff is probably going to be a good friend here too ;)

Thomas Z said...

Did you consider using the OpenDocument Flat XML (.fodt) format? I have no idea whether OpenOffice can handle large documents using this format, and maybe you have to adapt the XSLT stylesheets for better XML indenting.

Anonymous said...

use latex, stupid!
the time you wasted with your handmade problems would be enough to learn latex a dozen times.

liquidat said...

I think the best would be to have a revision control system which can handle such things on the server side. A plugin-like structure could manage different file formats and could perform all the necessary steps transparently for the browser.

That could give a real boost - and/or could become a good running web service with paying customers.

Anonymous said...

latex + metapost + metafont rule

Héctor said...

Last week I found this OOoSVN extension, but I haven't tried it yet...

Anonymous said...


I would rather take latex for such a thing. It's pretty nice to write all the stuff in a simple editor without thinking about the format of every line and such.
Although I see that it can take some time to get into it, to find the right modules and bring it all together.
Plus there are sometimes things that aren't solved well in latex. E.g. pagebreaks before the last line of a three-line code snippet.
But those are solvable.
For my last paper I used it and wrote a little make file to clean up the mess a latex run leaves behind, and had all the text files in a git archive. Pretty nice. :)

On a completely different front, I think it's not a good idea "hosting" the makefile in the KDE SVN. Or does it have a connection to the module it resides in?

Clarence Dang said...

Thanks for everyone's comments.

> use latex, stupid!

I try to avoid using LaTeX because OpenOffice already has character, paragraph and page styles accessible from the GUI. Sure, it won't look as nice as LaTeX, but it's good enough for me.

> I think it's not a good idea "hosting" the makefile in the KDE SVN.
> Or does it have a connection to the module it resides in?

It used to be used to generate the KolourPaint developer documentation. However, I now use simple .txt files instead, so the WebSVN link you see is to a deleted revision.

This blog post is actually an old post -- I just updated the links and instructions yesterday.

JohnFlux said...

For my thesis, I wrote most of it in MS Word, as my supervisor wanted to be able to modify it etc. from Word. Then I loaded it in OpenOffice, and exported to latex. This actually did a pretty good job. I cleaned up the result, and went from there.

Graeme Geldenhuys said...

Did you know OpenOffice has a built-in diff tool? Open one version of the odt document, then go to Edit > Compare Document... and select the other version of the odt document you want to compare with.

That way you can easily merge changes from one document into another. :-)

But yes, LaTeX or Markdown documents make this so much easier. Either way, I still prefer OpenOffice, even though I have written numerous documents with LaTeX.

Chris said...

> Did you know OpenOffice has a built-in diff tool?

The built-in diff isn't very useful when performing automatic merges in the VCS itself.