Archive for the ‘docx4j’ Category

docx4j v2.0 released

July 22nd, 2008 by Jason

We’re pleased to announce that we’ve released v2.0 of docx4j.

docx4j is an open source Java library for manipulating OpenXML WordprocessingML documents, released under the Apache software licence. docx is the default file format of Word 2007, which is part of Microsoft Office 2007.

docx4j supports the following:

  • Open existing docx (from filesystem, SMB/CIFS, WebDAV using VFS)
  • Create new docx (just one line of code)
  • Programmatically manipulate the docx document (of course), including tables, images
  • Import a binary doc (proof of concept)
  • Import/export Word 2007’s xmlPackage (pkg) format
  • Save docx to filesystem as a docx (ie zipped), or to JCR (unzipped)
  • Apply transforms, including common filters
  • Export as HTML or PDF
  • Diff/compare paragraphs or sdt (content controls), outputting OpenXML with changes marked up
  • Font support (font substitution, and use of any fonts embedded in the document)
  • Use the power of JAXB to do other cool stuff

Get it from here.

What is it about this release that warrants being labeled v2.0?

The new features include image support, diff, and xmlPackage.  Another factor is the version numbering convention Microsoft has chosen for its Open XML SDK: it is v2.0 of that SDK which will be the first to contain an API for WordprocessingML.

So think of a “level 1” API as one which handles the Open Packaging conventions (basically, the unzipping step), but leaves you to handle the document (part) content using low level XML (DOM, SAX, etc).
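To make that concrete, here is a minimal sketch of what working at “level 1” might look like, using only the JDK rather than docx4j. The class name LevelOneSketch and the helper tinyPackage are made up for illustration, and the part name word/document.xml is an assumption (it is the name Word normally uses; a real reader would find the part via the package relationships). The in-memory zip is not a complete docx, just enough to exercise the code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class LevelOneSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** The "level 1" step: unzip the package and hand back the part as a DOM tree. */
    static Document readMainDocumentPart(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                // Assumption: the main part has the name Word normally gives it.
                if ("word/document.xml".equals(e.getName())) {
                    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
                    f.setNamespaceAware(true);
                    byte[] xml = zip.readAllBytes(); // reads the current entry only
                    return f.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    /** Everything past unzipping is left to you: here, walking w:t elements by hand. */
    static String plainText(Document doc) {
        NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.getLength(); i++) {
            sb.append(texts.item(i).getTextContent());
        }
        return sb.toString();
    }

    /** Hypothetical helper: a zip holding only a main document part (not a full docx). */
    static byte[] tinyPackage(String text) throws IOException {
        // The text is inserted unescaped, so keep it free of XML special characters.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p><w:r><w:t>"
                + text + "</w:t></w:r></w:p></w:body></w:document>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = readMainDocumentPart(
                new ByteArrayInputStream(tinyPackage("Hello")));
        System.out.println(plainText(doc));
    }
}
```

Note how much of the code is about element names and namespaces rather than about the document. That is the burden a higher level API is meant to remove.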

A “level 2” API is one which gives you a higher level API to manipulate the part content.  At the very least, this would include objects to represent paragraphs, tables, styles etc.  But you’d also expect it to be easy, for example, to add a paragraph using a specified style (maybe this is “level 3”?  In any case, docx4j can do it).
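As a rough illustration of what such a convenience hides, the sketch below adds a styled paragraph with plain DOM calls. It is not docx4j's API: LevelTwoSketch, newBody and addStyledParagraph are hypothetical names, and the style id passed in is assumed to exist in the document's Styles part.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class LevelTwoSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Creates an empty w:document/w:body skeleton to add paragraphs to. */
    static Document newBody() throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder().newDocument();
        Element root = doc.createElementNS(W_NS, "w:document");
        root.appendChild(doc.createElementNS(W_NS, "w:body"));
        doc.appendChild(root);
        return doc;
    }

    /**
     * Appends <w:p><w:pPr><w:pStyle w:val="styleId"/></w:pPr><w:r><w:t>text</w:t></w:r></w:p>
     * to the body. Simplification: a real body may end with w:sectPr, which would
     * have to stay last; this sketch does not handle that.
     */
    static Element addStyledParagraph(Document doc, String styleId, String text) {
        Element pStyle = doc.createElementNS(W_NS, "w:pStyle");
        pStyle.setAttributeNS(W_NS, "w:val", styleId);
        Element pPr = doc.createElementNS(W_NS, "w:pPr");
        pPr.appendChild(pStyle);

        Element t = doc.createElementNS(W_NS, "w:t");
        t.setTextContent(text);
        Element r = doc.createElementNS(W_NS, "w:r");
        r.appendChild(t);

        Element p = doc.createElementNS(W_NS, "w:p");
        p.appendChild(pPr);
        p.appendChild(r);

        Element body = (Element) doc.getElementsByTagNameNS(W_NS, "body").item(0);
        body.appendChild(p);
        return p;
    }

    public static void main(String[] args) throws Exception {
        Document doc = newBody();
        addStyledParagraph(doc, "Heading1", "Release notes");
        System.out.println(doc.getElementsByTagNameNS(W_NS, "p").getLength());
    }
}
```

With a higher level API, the caller would expect to write something closer to the single call in main and never see the element plumbing.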

Given that docx4j brought a “level 2” WordML API to the Java world 6 months ago, it is appropriate that it be labelled version 2.0.

docx4j now released under Apache License

April 10th, 2008 by Jason

We’re pleased to announce that docx4j is now available under the Apache License (v2).

This is a response to feedback on an earlier post.  This is also the last license change we’ll be making to docx4j. Word documents are mostly manipulated in corporate environments.  This change removes barriers to adoption of docx4j by businesses and institutions.

docx4j uses org.merlin.io to efficiently turn streams inside out. That package had been available under the GPL.  Its author, Merlin Hughes, today kindly released it under v2 of the Apache License, so we now use it under that license.

There’s a new nightly build of docx4j available from the downloads page if you want to grab it.  This build can load/save to/from a WebDAV server - more on that in another post.

Click to try docx4all

February 21st, 2008 by Jason

We’ve now got a proof of concept of docx4all, our cross-platform WYSIWYG docx editor, ready for you to try. Here is the launch page to run it from your browser.  Give it a try in Linux, XP, Vista, or (if you are game) OSX.

This proof of concept includes:

  • file: new | open | save
  • text formatting (font, size, bold, italics, underline)
  • paragraph formatting (alignment)
  • cut/copy/paste
  • styles
  • printing

There is still a lot to do, but with the introduction this week of support for styles, we’ve got a basic feature set which you can use to do actual work (not that we’d recommend that just yet) - assuming you can live without tables (for the moment at least).

It is set up to run offline - it should offer to create a shortcut or a menu item so you can run it again later.

Let us know what you think of it, here or in the forums. Cheers.

.docx to HTML or PDF using Java

January 13th, 2008 by Jason

Doug Mahugh recently mentioned someone using the DocX2Html.xsl that ships with SharePoint to preview DOCX files in HTML.

As it happens, we’ve just implemented HTML and PDF output in docx4j using a similar approach. We’re using the earlier WordML2HTML XSLT stylesheet available from Oleg Tkachenko. (It would be great if Microsoft also made the presumably newer DocX2Html.xsl that ships with SharePoint freely available).
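For readers who have not driven XSLT from Java, the general approach can be sketched with the JDK's built-in javax.xml.transform API. The stylesheet below is a toy written for this example (it handles only paragraphs and run text) and is in no way the WordML2HTML stylesheet mentioned above; the class name XsltHtmlSketch is likewise hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltHtmlSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Toy stylesheet (an assumption for illustration): each w:p becomes an HTML p,
    // containing the text of its runs. All formatting is ignored.
    static final String TOY_XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='" + W_NS + "' exclude-result-prefixes='w'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes' indent='no'/>"
        + "<xsl:template match='/'>"
        + "<html><body><xsl:apply-templates select='//w:p'/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match='w:p'>"
        + "<p><xsl:apply-templates select='w:r/w:t'/></p>"
        + "</xsl:template>"
        + "<xsl:template match='w:t'><xsl:value-of select='.'/></xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the main document part's XML through the stylesheet and returns the markup. */
    static String toHtml(String wordml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(TOY_XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(wordml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String wordml = "<w:document xmlns:w='" + W_NS + "'><w:body><w:p>"
                + "<w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println(toHtml(wordml));
    }
}
```

A production stylesheet also has to deal with styles, numbering, tables and images, which is where most of the real work lies.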

To create the HTML, we use Sun’s xhtmlrenderer (thanks Sun!). See the obligatory tutorial.

To create the PDF, we take the HTML, and run it through Sun’s pdf-renderer (thanks again, Sun). And again, the tutorial.

The icing on the cake is the PDF Viewer which comes with pdf-renderer. That will give us print preview and printing in docx4all.

Finally, thanks Lars for bringing pdf-renderer to my attention.

Howto: create a new document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically create a new document from scratch.

Tutorial: opening an existing document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically open and edit an existing document.

docx4j license change

January 11th, 2008 by Jason

A note for the record that we’ve changed the docx4j license from the GPL v3 to the Affero General Public License v3.   All users of which we are aware are happy with this change.

The logic for the change is the same as the logic for licensing plutext-server under the Affero GPL.  That is, to ensure that people who use docx4j in a SaaS environment are treated the same as people who distribute docx4j to end users.

Licensing docx4j under an Apache style licence also has its attractions - let us know if this would make a difference to you.

OOXML, boolean values and binding

January 6th, 2008 by Jason

ST_OnOff is used extensively in the XML Schema. Here is the openiso.org link (nice resource!).

Basically, it is used for things which should use the built in boolean schema data type:

This simple type specifies a set of values for any binary (on or off) property defined in a WordprocessingML document.

For example, the b (bold) element has an attribute @val of type ST_OnOff.

There are several problems with how this is done.

The first is that its possible values are “on, 1, or true”. OOXML should just use the XSD boolean data type, which doesn’t allow “on” (or “off”). For related comments, see here, here, and here. Denmark and France seem to be the strongest advocates of the use of xsd:boolean, and I hope they get their way.

The second is that it is left to the specification text to say that if the attribute is omitted, its value is implied to be true. That should be expressed as part of the schema.
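To show what these two problems mean for anyone reading the XML directly, here is a small sketch of the logic a consumer ends up writing by hand. The names OnOffSketch, parseOnOff and isOn are made up for this example; the false counterparts (“off”, “0”, “false”) are included because ST_OnOff defines them alongside the true values.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class OnOffSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Maps the ST_OnOff lexical forms to a boolean; xsd:boolean would not need "on"/"off". */
    static boolean parseOnOff(String lexical) {
        switch (lexical) {
            case "true":
            case "1":
            case "on":
                return true;
            case "false":
            case "0":
            case "off":
                return false;
            default:
                throw new IllegalArgumentException("not an ST_OnOff value: " + lexical);
        }
    }

    /**
     * Reads w:val from an on/off element such as w:b. The "absent means true"
     * rule lives in the specification text, so it has to be coded here too.
     */
    static boolean isOn(Element onOffElement) {
        if (!onOffElement.hasAttributeNS(W_NS, "val")) {
            return true;
        }
        return parseOnOff(onOffElement.getAttributeNS(W_NS, "val"));
    }

    /** Helper for the demo: parses a snippet and returns its root element. */
    static Element parse(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        return f.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isOn(parse("<w:b xmlns:w='" + W_NS + "'/>")));
        System.out.println(isOn(parse("<w:b xmlns:w='" + W_NS + "' w:val='off'/>")));
    }
}
```

If the schema used xsd:boolean with a declared default, both methods could be replaced by whatever the binding framework or a validating parser already provides.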

For CT_OnOff, it would be:

<xsd:complexType name="BooleanDefaultTrue">
  <xsd:attribute name="val" type="xsd:boolean" default="true" />
</xsd:complexType>

I don’t think Denmark or anyone else made this second point.

The schema we are using in docx4j to generate classes uses these sorts of definitions instead of ST_OnOff or CT_OnOff.  For CT_OnOff, this results in a BooleanDefaultTrue type, which is used in fields like (for bold):

protected BooleanDefaultTrue b;

Which brings me to the third problem with ST_OnOff (and the schema in general), which is that it generates ugly code in JAXB and other binding frameworks (presumably .NET included). The built in schema data types produce much nicer code.
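For a sense of what “nicer” means here, the following is a sketch of the kind of class a binding framework typically produces for an optional xsd:boolean attribute with default="true". It is an approximation written by hand, not the actual generated docx4j source: the JAXB annotations are left out so that it compiles on a plain JDK, and the RunProperties holder is a hypothetical stand-in for the real run properties class.

```java
final class BooleanDefaultTrueSketch {

    /** Hand-written approximation of a generated class; annotations omitted. */
    static class BooleanDefaultTrue {
        // null means the val attribute was absent in the XML
        protected Boolean val;

        /** An absent attribute reads as true, matching default="true" in the schema. */
        public boolean isVal() {
            if (val == null) {
                return true;
            }
            return val;
        }

        public void setVal(Boolean value) {
            this.val = value;
        }
    }

    /** Hypothetical holder showing how the field from the post would be used. */
    static class RunProperties {
        // null means there is no b element in these run properties
        protected BooleanDefaultTrue b;

        /** True only if bold is switched on here; inheritance from styles is ignored. */
        boolean isBoldSetHere() {
            return b != null && b.isVal();
        }
    }

    public static void main(String[] args) {
        RunProperties rPr = new RunProperties();
        rPr.b = new BooleanDefaultTrue(); // corresponds to a bare <w:b/>
        System.out.println(rPr.isBoldSetHere());
    }
}
```

The calling code deals in plain booleans throughout, with no string comparison against “on” or “1” anywhere.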

As a general remark, running the schema through JAXB is a good way to find places where the schema can be improved. Schema design goals should include:

  1. that it can be processed out of the box by binding frameworks (since that makes it easier for people to pick up a schema and start using it). [This is not currently the case]
  2. that the schema be expressed in such a way as to generate the simplest code.

docx4j trunk now uses JAXB

December 22nd, 2007 by Jason

10 days ago, we created a proof of concept for using JAXB on a subset of wml.xsd (one of the OpenXML schema files).

We’ve declared that a success, and moved it from a branch into the trunk of docx4j. Here be the generated classes.

plutext-server has now been migrated to use it.

And Jo is working with it as he codes docx4all.

So we’re pretty committed at this point!

We’re tidying up bits of the object model as we go (ie editing our xsd to generate Java that we like). So far, paragraphs (p, pPr, r, rPr, t) and structured document tags (sdt, sdtPr, sdtContent) have had our attention.

We’re also making a few changes to the generated classes, so we need to think about how best to prevent those changes from getting lost when the classes are re-generated. There’s a bit of support in XJC for this, and diff may come in handy, but I’d love to hear best practices.

What we have now is an object model for key pieces of the Main Document part (document.xml), in package name org.docx4.jaxb.document. Next cab off the rank is the Styles part, which we’ll put in org.docx4.jaxb.styles.

Docx4j branch: Using JAXB to unmarshall OOXML to Java

December 12th, 2007 by Jason

docx4j contains classes which represent key parts of a WordprocessingML docx.

For example, we have a paragraph class to represent the p element; another to represent a run, etc.

Each class knows how to unmarshall its docx XML representation, and marshall it again.

It will create specialised objects for things it knows how to handle (for example the paragraph content collection contains run objects). For XML we don’t have strongly typed objects for, the class will simply store that XML node, so that it can be round tripped.
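The round-tripping idea can be sketched in a few lines of DOM code. This is an illustration of the pattern only, not the docx4j classes themselves: RoundTripSketch, Paragraph and Run are hypothetical names, and Run is simplified to carry just its text (a real class would also keep the run properties).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class RoundTripSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Strongly typed object for w:r. Simplification: only the text is modelled. */
    static final class Run {
        String text;
        Run(String text) { this.text = text; }
    }

    /** Content holds Run objects for w:r children and raw DOM nodes for anything else. */
    static final class Paragraph {
        final List<Object> content = new ArrayList<>();

        static Paragraph unmarshal(Element p) {
            Paragraph result = new Paragraph();
            for (Node n = p.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                if (W_NS.equals(n.getNamespaceURI()) && "r".equals(n.getLocalName())) {
                    result.content.add(new Run(n.getTextContent()));
                } else {
                    // No typed object for this element: store the node untouched.
                    result.content.add(n.cloneNode(true));
                }
            }
            return result;
        }

        Element marshal(Document target) {
            Element p = target.createElementNS(W_NS, "w:p");
            for (Object item : content) {
                if (item instanceof Run) {
                    Element t = target.createElementNS(W_NS, "w:t");
                    t.setTextContent(((Run) item).text);
                    Element r = target.createElementNS(W_NS, "w:r");
                    r.appendChild(t);
                    p.appendChild(r);
                } else {
                    // Stored XML is written back exactly as it was read.
                    p.appendChild(target.importNode((Node) item, true));
                }
            }
            return p;
        }
    }

    static DocumentBuilderFactory factory() {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        return f;
    }

    static Document parse(String xml) throws Exception {
        return factory().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    static Document newDocument() throws Exception {
        return factory().newDocumentBuilder().newDocument();
    }
}
```

Editing then happens on the typed objects (for example changing a Run's text), while elements such as bookmarks pass through marshal unchanged even though the model knows nothing about them.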

Instead of coding these classes one by one by hand, we wanted to see whether one of the Java-XML data binding frameworks could make our lives easier.

Given there is a standard for doing this (JSR 222 - The Java Architecture for XML Binding), we tried the JAXB reference implementation (JAXB 2.1.5).

The JAXB web presence leaves a lot to be desired. I’ll write a post on that shortly.

Having said that, I’m quite impressed with the spec and the reference implementation.

You feed your schema into xjc, and it generates Java classes.

The @XmlAnyElement annotation allows unknown elements to be round tripped, mimicking our existing code.

Why would there be any unknown elements you ask?

The answer is that we are using a subset of the wml.xsd schema from TC45. So there can be a lot of stuff in a docx document which falls outside the subset.

There are a number of reasons we are using a subset:

  1. running the entire schema through XJC produces lots of errors, both in the parsing phase, and once you overcome those, in the compiling stage
  2. more importantly, we’re unlikely to ever implement the entire WordML spec. So it makes sense to work with the subset of key features which are on our roadmap.
  3. you have to add annotations to the schema to ensure the resulting Java classes use names which make sense (this is called customizing the binding).

Anyway, this approach seems to work well. That is:

  • the JAXB version can read a Word document, edit it, and save it again, and Word 2007 can consume the result. See sample.java
  • the resulting classes can be made quite intuitive (though there is more tweaking to do)
  • unknown elements can be round-tripped

The JAXB version of docx4j is in subversion at the following branch:

http://dev.plutext.org/trac/docx4j/browser/branches/jaxb

You can’t just check out the branch and use it right now, since classes need to be generated. There are maybe 50 generated, but I have only committed 3 of them.

Where to from here?

If this approach continues to look promising, we are likely to move the JAXB code into the trunk, and upgrade plutext-server and docx4all to use it.