OpenOffice Hacking

OpenOffice.org Hacking
OpenOffice.org documents are ZIP archives containing several XML (and other) files. If you know what you are doing, you can create documents in your own programs which can be read by OpenOffice.org applications.

Why?
Why would you want to do this? Well, one use I found was for a web application which -- based on form input and the contents of a MySQL database -- generates customised documents ready to download straight into OpenOffice.org for printing.

Another use was for printing my holiday photographs -- I wanted to print out pre-defined layouts, such as two pictures 160x120, or five 80x60 and one 160x20, on a single A4 sheet with guillotine marks. I did not want to have to click my way through the file requester and the position-and-size requester for each and every single picture.

And being a hacker myself, I would need a good reason not to attempt something like this!

What With?
Apart from this document, you will need OpenOffice.org, unzip, zip and a good text editor such as Pico, Vim or Emacs. If you prefer a graphical editor, gedit (Gnome) and Kate (KDE) both work well.

How?
First of all, create any old document you like in OpenOffice.org (but make it something you can mess around with). Now, open an xterm and navigate to the directory where you saved the document. Create a temporary directory and copy your document in there. Change to the temporary directory and unzip the copy of your document.

$ unzip file.sxw
$ ls
content.xml  meta.xml  settings.xml  styles.xml  META-INF/

There should be a subdirectory called META-INF and depending on the contents of your document, there may be other subdirectories for pictures and other embedded objects. In META-INF will be a file called manifest.xml -- this is important, as it lists all the files which make up the OpenOffice document. If you split or combine any of the files, you will have to edit this manifest file to reflect any changes that you made.

$ cd META-INF/
$ ls
manifest.xml
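The same inspection can be done from a program. The following is a minimal Python sketch, not part of the original recipe: it opens the archive with the standard zipfile module and returns the paths listed in META-INF/manifest.xml. The helper names (local_name, list_manifest_entries, build_sample_document) are made up for illustration, and the namespace URI in the sample manifest is an assumption about OpenOffice.org 1.x files, which is why the code matches elements by local name only.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def local_name(qualified):
    """Strip the '{namespace}' prefix ElementTree puts on tag and attribute names."""
    return qualified.rsplit("}", 1)[-1]


def list_manifest_entries(source):
    """Return the full-path of every file-entry in META-INF/manifest.xml.

    source may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("META-INF/manifest.xml"))
    paths = []
    for element in root.iter():
        if local_name(element.tag) != "file-entry":
            continue
        for attribute, value in element.attrib.items():
            if local_name(attribute) == "full-path":
                paths.append(value)
    return paths


def build_sample_document():
    """Create a tiny stand-in archive in memory (not a complete Writer file)."""
    # The namespace URI below is an assumption; check your own manifest.xml.
    manifest = (
        '<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">'
        '<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>'
        '<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="styles.xml"/>'
        "</manifest:manifest>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/manifest.xml", manifest)
        archive.writestr("content.xml", "<placeholder/>")
        archive.writestr("styles.xml", "<placeholder/>")
    buffer.seek(0)
    return buffer
```

Calling list_manifest_entries("file.sxw") on a real document should list every file the archive is expected to contain, which is a quick way to check the manifest after adding or removing files.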

The real juicy stuff is in a file called content.xml in the main directory. This file holds the content of your document. Its root element declares a series of namespaces, which cover all the common definitions used in OpenOffice.org.

The next section of content.xml holds styles, which are separate from the ones in the styles.xml file. These styles include font, paragraph and text property definitions taken from the Stylist in the Writer UI.

Finally comes the content itself, wrapped in tags such as text:p for paragraphs, text:h for headings, text:list for lists and text:a for hyperlinks. This is very similar to HTML, except that these tags also carry attributes that are set paragraph by paragraph. The most important is the style name, which selects the paragraph style to be used. For example, text:p text:style-name="P17" uses the P17 style previously defined in the styles section.
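To see how the tags and their style names fit together, here is a small Python sketch that walks a content.xml string and reports each heading and paragraph with its style. The sample XML is invented for illustration and is far shorter than a real file. The namespace URIs are assumptions about OpenOffice.org 1.x (.sxw) documents; other versions declare different URIs, so check the root element of your own content.xml.

```python
import xml.etree.ElementTree as ET

# Assumed text namespace for .sxw files; adjust to match your content.xml.
TEXT_NS = "http://openoffice.org/2000/text"

# Invented sample, much smaller than a real content.xml.
SAMPLE_CONTENT = """<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <office:body>
    <text:h text:style-name="Heading 1" text:level="1">Holiday</text:h>
    <text:p text:style-name="P17">Two pictures per page.</text:p>
  </office:body>
</office:document-content>"""


def paragraphs_with_styles(content_xml):
    """Return (kind, style name, text) for each text:h and text:p element."""
    root = ET.fromstring(content_xml)
    style_attribute = "{%s}style-name" % TEXT_NS
    wanted = {"{%s}h" % TEXT_NS: "h", "{%s}p" % TEXT_NS: "p"}
    results = []
    for element in root.iter():  # iter() visits elements in document order
        if element.tag in wanted:
            text = "".join(element.itertext())
            results.append((wanted[element.tag], element.get(style_attribute), text))
    return results
```

For the sample above, paragraphs_with_styles(SAMPLE_CONTENT) yields one heading entry with style "Heading 1" and one paragraph entry with style "P17".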

More information about the structure of the content.xml file can be found in OASIS OpenDocument Essentials, a book by J. David Eisenberg.

(todo: expand)

Hint
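You may find that these XML files have very long lines, with no breaks between the closing > of one tag and the opening < of the next. Though legitimate XML, it's harder to edit. Use

 sed -e 's/></>\n</g' foo.xml > foo_lines.xml

to sort this out. Write to a new file as shown: redirecting the output back onto foo.xml would truncate the file before sed gets to read it. (The \n in the replacement is understood by GNU sed; other versions of sed may need a literal newline instead.)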

OpenOffice.org can also write its XML in a readable form by itself: go to Tools > Options > Load/Save > General and clear the option that optimizes the XML format for file size.

Documents saved after that will contain line breaks and indentation, which makes the internal XML files easier to view and edit.
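If sed is not to hand, the same line-breaking trick is easy to reproduce in a script. This is a minimal Python sketch under the assumption that the file is UTF-8 encoded; the function names and the idea of a separate target path are mine, chosen so that the original file is never overwritten.

```python
def break_between_tags(xml_text):
    """Insert a newline between every '>' that is directly followed by '<'."""
    return xml_text.replace("><", ">\n<")


def reflow_file(source_path, target_path):
    """Read source_path, break its long lines, and write the result to target_path."""
    # UTF-8 is an assumption; check the encoding named in the XML declaration.
    with open(source_path, encoding="utf-8") as source:
        text = source.read()
    with open(target_path, "w", encoding="utf-8") as target:
        target.write(break_between_tags(text))
```

For example, reflow_file("content.xml", "content_lines.xml") leaves content.xml untouched and writes the reflowed copy next to it.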

Ready to go
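Once you have made the edits you want, zip up the contents of the temporary directory and give the result a suitable name (e.g. foo_new.sxw). Zip the files themselves rather than the directory, so that content.xml and META-INF/ sit at the top level of the archive, and leave out the copy of the original document. Now, try loading it into OpenOffice.org and see what happens!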

You can also use a GUI archive manager such as File Roller for Gnome or Ark for KDE.
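Repacking can be scripted as well. The sketch below uses Python's zipfile module to store every file under a directory with a path relative to that directory, so the XML files end up at the top level of the archive. The function names are hypothetical, and the target file should live outside the source directory so that it does not get packed into itself.

```python
import os
import tempfile
import zipfile


def zip_directory(source_dir, target_path):
    """Pack the contents of source_dir into target_path, paths relative to source_dir."""
    with zipfile.ZipFile(target_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, filenames in os.walk(source_dir):
            for filename in sorted(filenames):
                full_path = os.path.join(folder, filename)
                archive_name = os.path.relpath(full_path, source_dir)
                # ZIP member names always use forward slashes.
                archive.write(full_path, archive_name.replace(os.sep, "/"))


def demo_round_trip():
    """Build a throwaway directory, zip it, and return the archive's member names."""
    with tempfile.TemporaryDirectory() as work_dir:
        source_dir = os.path.join(work_dir, "unpacked")
        os.makedirs(os.path.join(source_dir, "META-INF"))
        for name in ("content.xml", os.path.join("META-INF", "manifest.xml")):
            with open(os.path.join(source_dir, name), "w") as handle:
                handle.write("<placeholder/>")  # stand-in, not real document XML
        target_path = os.path.join(work_dir, "foo_new.sxw")
        zip_directory(source_dir, target_path)
        with zipfile.ZipFile(target_path) as archive:
            return sorted(archive.namelist())
```

In real use you would call something like zip_directory("tmpdir", "foo_new.sxw") from the directory above the temporary one.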

Web Drive
A CGI script needs to give a MIME type as part of the HTTP headers, then a blank line to mark the end of the headers, then finally the ZIPped output. For a Writer document in .sxw format the MIME type is application/vnd.sun.xml.writer. The output does not need to be base-64 encoded, but can just be in simple 8-bit binary form, since httpd (which is parsing the output you generate) will take care of whether the connection is binary-safe or not.

If you are running on a fast server and your clients' timeout is not excessively short, you should have plenty of time to send the headers, generate the zipfile on the fly (make a directory under /tmp; put your PID in the dirname to keep it unique; build up files in the directory; use a shell command to zip it) and just dump it to stdout.
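The output side can be sketched as follows, in Python rather than Perl. Note one deliberate change from the recipe above: instead of a PID-named directory under /tmp and a shell call to zip, this sketch builds the archive in memory, which avoids any cleanup. The function names, the placeholder payload and the Content-Disposition header are my own additions for illustration, and the payload is not a complete Writer document.

```python
import io
import sys
import zipfile

# MIME type for OpenOffice.org 1.x Writer (.sxw) documents.
SXW_MIME_TYPE = "application/vnd.sun.xml.writer"


def build_document(files):
    """Zip a {member name: bytes} mapping into an in-memory archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def cgi_response(document, download_name):
    """Return the full CGI output: headers, a blank line, then the raw ZIP bytes."""
    headers = (
        "Content-Type: %s\r\n"
        "Content-Disposition: attachment; filename=\"%s\"\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (SXW_MIME_TYPE, download_name, len(document))
    )
    return headers.encode("ascii") + document


def main():
    # Hypothetical payload; a real script would generate these files
    # from the form input and the database.
    files = {
        "content.xml": b"<placeholder/>",
        "META-INF/manifest.xml": b"<placeholder/>",
    }
    document = build_document(files)
    # Write bytes, not text, so the ZIP data reaches the server unaltered.
    sys.stdout.buffer.write(cgi_response(document, "foo_new.sxw"))
```

A script built on this would finish by calling main(). The Content-Disposition header is optional; it only suggests a file name to the browser.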

(Todo: Perl example)