Archive for the ‘docx4j’ Category

docx4j v2.3.0 released

February 23rd, 2010 by Jason

I’m pleased to announce the release of docx4j v2.3.0.

docx4j is an open source (Apache license) project which facilitates the manipulation of Microsoft OpenXML docx (and now pptx) documents in Java, using JAXB.

The main features of this release are support for pptx files, and improvements to both HTML export (via NG2) and PDF export (via XSL FO).

For further details, please see the release announcement.

docx4j v2.1.0 released

November 11th, 2008 by Jason

We’re pleased to announce that we’ve released v2.1.0 of docx4j.  Get it from our downloads page.

docx4j is an open source Java library for manipulating OpenXML WordprocessingML documents, released under the Apache software licence. docx is the default file format in Word 2007 in Microsoft Office 2007, and part of an ISO standard (more or less unchanged).

v2.1.0 is mainly a maintenance release.

Attention has been paid to ease of use of hyperlinks, images, and headers/footers.

The HTML output has been redone to use the XSLT from the OpenXMLViewer project; it can be configured to save images as files, and automatic list numbers are handled.

This release should also work under Java 1.5, now that I have re-built fop-fonts.  I had contributed TTC (TrueType Collection) handling code to FOP, and it was accepted, so fop-fonts now uses that (i.e. the patch which makes fop-fonts is that much smaller).

docx4j v2.0 released

July 22nd, 2008 by Jason

We’re pleased to announce that we’ve released v2.0 of docx4j.

docx4j is an open source Java library for manipulating OpenXML WordprocessingML documents, released under the Apache software licence. docx is the default file format in Word 2007 in Microsoft Office 2007.

docx4j supports the following:

  • Open existing docx (from filesystem, SMB/CIFS, WebDAV using VFS)
  • Create new docx (just one line of code)
  • Programmatically manipulate the docx document (of course), including tables, images
  • Import a binary doc (proof of concept)
  • Import/export Word 2007’s xmlPackage (pkg) format
  • Save docx to filesystem as a docx (i.e. zipped), or to JCR (unzipped)
  • Apply transforms, including common filters
  • Export as HTML or PDF
  • Diff/compare paragraphs or sdt (content controls), outputting OpenXML with changes marked up
  • Font support (font substitution, and use of any fonts embedded in the document)
  • Use the power of JAXB to do other cool stuff

Get it from here.

What is it about this release that warrants being labeled v2.0?

The new features include image support, diff, and xmlPackage.  Another factor is the version numbering convention Microsoft has chosen for its Open XML SDK: v2.0 is the first version which will contain an API for WordprocessingML.

So think of a “level 1” API as one which handles the Open Packaging conventions (basically, the unzipping step), but leaves you to handle the document (part) content using low level XML (DOM, SAX, etc).
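
To make the “level 1” idea concrete, here is a minimal sketch using only the JDK, not docx4j. The class and method names are made up for illustration, and the in-memory zip is a stand-in for a real docx: a real package has more parts, and the main document part should be located via the package relationships rather than a hard-coded name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical illustration of a "level 1" workflow: unzip, then raw DOM.
class Level1Sketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds a tiny stand-in package holding only word/document.xml.
    static byte[] tinyPackage() {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
            + "<w:p><w:r><w:t>World</w:t></w:r></w:p>"
            + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // The packaging step finds the part; everything after that is plain DOM.
    static int countParagraphs(byte[] pkg) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                    factory.setNamespaceAware(true);
                    Document doc = factory.newDocumentBuilder().parse(zip);
                    return doc.getElementsByTagNameNS(W_NS, "p").getLength();
                }
            }
            throw new IllegalArgumentException("no word/document.xml entry");
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not read package", e);
        }
    }
}
```

Even counting paragraphs means dealing with namespaces and element names by hand, which is the gap a higher level API is meant to close.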

A “level 2” API is one which gives you a higher level API to manipulate the part content.  At the very least, this would include objects to represent paragraphs, tables, styles etc.  But you’d also expect it to be easy, for example, to add a paragraph using a specified style (maybe this is “level 3”?  In any case, docx4j can do it).

Given that docx4j brought a “level 2” WordML API to the Java world 6 months ago, it is appropriate that it be labelled version 2.0.

docx4j now released under Apache License

April 10th, 2008 by Jason

We’re pleased to announce that docx4j is now available under the Apache License (v2).

This is a response to feedback on an earlier post.  This is also the last license change we’ll be making to docx4j. Word documents are mostly manipulated in corporate environments.  This change removes barriers to adoption of docx4j by business and institutions.

docx4j uses org.merlin.io to efficiently turn streams inside out. That package had been available under the GPL.  Its author, Merlin Hughes, today kindly released it under v2 of the Apache License, so we now use it under that license.

There’s a new nightly build of docx4j available from the downloads page if you want to grab it.  This build can load/save to/from a WebDAV server – more on that in another post.

Click to try docx4all

February 21st, 2008 by Jason

We’ve now got a proof of concept of docx4all, our cross-platform WYSIWYG docx editor, ready for you to try. Here is the launch page to run it from your browser.  Give it a try in Linux, XP, Vista, or (if you are game) OSX.

This proof of concept includes:

  • file: new | open | save
  • text formatting (font, size, bold, italics, underline)
  • paragraph formatting (alignment)
  • cut/copy/paste
  • styles
  • printing

There is still a lot to do, but with the introduction this week of support for styles, we’ve got a basic feature set which you can use to do actual work (not that we’d recommend that just yet) – assuming you can live without tables (for the moment at least).

It is set up to run offline – it should offer to create a shortcut or a menu item so you can run it again later.

Let us know what you think of it, here or in the forums. Cheers.

.docx to HTML or PDF using Java

January 13th, 2008 by Jason

Doug Mahugh recently mentioned someone using the DocX2Html.xsl that ships with SharePoint to preview DOCX files in HTML.

As it happens, we’ve just implemented HTML and PDF output in docx4j using a similar approach. We’re using the earlier WordML2HTML XSLT stylesheet available from Oleg Tkachenko. (It would be great if Microsoft also made the presumably newer DocX2Html.xsl that ships with SharePoint freely available).
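
For readers who have not applied a stylesheet from Java before, the general approach looks roughly like the sketch below, which uses the JDK’s javax.xml.transform API. The stylesheet here is a toy written for this example, not WordML2HTML or DocX2Html.xsl, and the class name, method name and choice of namespace are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration: turn each w:p into an HTML <p> via XSLT.
class XsltToHtmlSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Toy stylesheet; a real one also handles runs, styles, tables, lists...
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='" + W_NS + "' exclude-result-prefixes='w'>"
        + "<xsl:output method='html'/>"
        + "<xsl:template match='/'>"
        + "<html><body><xsl:apply-templates select='//w:p'/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match='w:p'><p><xsl:value-of select='.'/></p></xsl:template>"
        + "</xsl:stylesheet>";

    static String toHtml(String documentXml) {
        try {
            // Compile the stylesheet, then run the document part through it.
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(documentXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }
}
```

Swapping in a full stylesheet changes only the StreamSource passed to newTransformer; the surrounding Java stays the same.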

To create the HTML, we use Sun’s xhtmlrenderer (thanks Sun!). See the obligatory tutorial.

To create the PDF, we take the HTML, and run it through Sun’s pdf-renderer (thanks again, Sun). And again, the tutorial.

The icing on the cake is the PDF Viewer which comes with pdf-renderer. That will give us print preview and printing in docx4all.

Finally, thanks Lars for bringing pdf-renderer to my attention.

Howto: create a new document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically create a new document from scratch.

Tutorial: opening an existing document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically open and edit an existing document.

docx4j license change

January 11th, 2008 by Jason

A note for the record that we’ve changed the docx4j license from the GPL v3 to the Affero General Public License v3.   All users of which we are aware are happy with this change.

The logic for the change is the same as the logic for licensing plutext-server under the Affero GPL.  That is, to ensure that people who use docx4j in a SaaS environment are treated the same as people who distribute docx4j to end users.

Licensing docx4j under an Apache style licence also has its attractions – let us know if this would make a difference to you.

OOXML, boolean values and binding

January 6th, 2008 by Jason

ST_OnOff is used extensively in the XML Schema. Here is the openiso.org link (nice resource!).

Basically, it is used for things which should use the built-in boolean schema data type:

This simple type specifies a set of values for any binary (on or off) property defined in a WordprocessingML document.

For example, the b (bold) element has an attribute @val of type ST_OnOff.

There are several problems with how this is done.

The first is that its possible values are “on, 1, or true”. OOXML should just use the XSD boolean data type, which doesn’t allow “on” (or “off”). For related comments, see here, here, and here. Denmark and France seem to be the strongest advocates of the use of xsd:boolean, and I hope they get their way.

The second is that it is left to the specification text to say that if the attribute is omitted, its value is implied to be true. That should be expressed as part of the schema.

For CT_OnOff, it would be:

<xsd:complexType name="BooleanDefaultTrue">
  <xsd:attribute name="val" type="xsd:boolean" default="true" />
</xsd:complexType>

I don’t think Denmark or anyone else made this second point.
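
To see what the first two problems cost in practice, consider what anyone reading the raw XML by hand has to write. The following is a minimal sketch, assuming the attribute value arrives as a String (null when the attribute is omitted) and that the permitted values are on/off, 1/0 and true/false; the class and method names are hypothetical.

```java
// Hypothetical helper: interpret an ST_OnOff @val read from raw XML.
class OnOffSketch {
    static boolean parse(String val) {
        // The "omitted means true" rule lives in the specification text,
        // so every consumer has to remember to code it.
        if (val == null) {
            return true;
        }
        switch (val) {
            case "on":
            case "1":
            case "true":
                return true;
            case "off":
            case "0":
            case "false":
                return false;
            default:
                throw new IllegalArgumentException("Not an ST_OnOff value: " + val);
        }
    }
}
```

With xsd:boolean and a schema-level default, the “on”/“off” cases disappear, and the null check can be generated by the binding tool rather than written by each consumer.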

The schema we are using in docx4j to generate classes uses these sorts of definitions instead of ST_OnOff or CT_OnOff.  For CT_OnOff, this results in a BooleanDefaultTrue type, which is used in fields like (for bold):

protected BooleanDefaultTrue b;

Which brings me to the third problem with ST_OnOff (and the schema in general), which is that it generates ugly code in JAXB and other binding frameworks (presumably .NET included). The built-in schema data types produce much nicer code.
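
As an illustration of “nicer”, the class a binding tool derives from an xsd:boolean attribute with default="true" has roughly the shape below. This is a hand-written stand-in with the JAXB annotations left out so that it compiles on its own; it is not the actual docx4j source.

```java
// Hand-written stand-in for the kind of class a binding tool generates
// from: <xsd:attribute name="val" type="xsd:boolean" default="true"/>
class BooleanDefaultTrue {
    // In generated code this field would carry an @XmlAttribute annotation.
    protected Boolean val;

    // The schema default is applied here: an absent attribute reads as true.
    public boolean isVal() {
        return val == null ? true : val;
    }

    public void setVal(Boolean value) {
        this.val = value;
    }
}
```

Calling code then works with a plain boolean: `new BooleanDefaultTrue().isVal()` is true, and it stays true until `setVal(false)` is called.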

As a general remark, running the schema through JAXB is a good way to find places where the schema can be improved. Schema design goals should include:

  1. that it can be processed out of the box by binding frameworks (since that makes it easier for people to pick up a schema and start using it). [This is not currently the case]
  2. that the schema be expressed in such a way as to generate the simplest code.