Archive for the ‘docx’ Category

docx4j v2.0 released

July 22nd, 2008 by Jason

We’re pleased to announce that we’ve released v2.0 of docx4j.

docx4j is an open source Java library for manipulating OpenXML WordprocessingML documents, released under the Apache software licence. docx is the default file format in Word 2007 in Microsoft Office 2007.

docx4j supports the following:

  • Open existing docx (from filesystem, SMB/CIFS, WebDAV using VFS)
  • Create new docx (just one line of code)
  • Programmatically manipulate the docx document (of course), including tables, images
  • Import a binary doc (proof of concept)
  • Import/export Word 2007’s xmlPackage (pkg) format
  • Save docx to filesystem as a docx (ie zipped), or to JCR (unzipped)
  • Apply transforms, including common filters
  • Export as HTML or PDF
  • Diff/compare paragraphs or sdt (content controls), outputting OpenXML with changes marked up
  • Font support (font substitution, and use of any fonts embedded in the document)
  • Use the power of JAXB to do other cool stuff

Get it from here.

What is it about this release that warrants being labeled v2.0?

The new features include image support, diff, and xmlPackage.  A factor is the version numbering convention Microsoft has chosen for their Open XML SDK: its v2.0 which will first contain an API for WordprocessingML.

So think of a “level 1″ API as one which handles the Open Packaging conventions (basically, the unzipping step), but leaves you to handle the document (part) content using low level XML (DOM, SAX, etc).

A “level 2″ API is one which gives you a higher level API to manipulate the part content.  At the very least, this would include objects to represent paragraphs, tables, styles etc.  But you’d also expect it to be easy, for example, to add a paragraph using a specified style (maybe this is “level 3″?  In any case, docx4j can do it)

Given that docx4j brought a “level 2″ WordML API to the Java world 6 months ago, it is appropriate that it be labelled version 2.0.

Click to try docx4all v0.2

May 3rd, 2008 by Jason

Jo and I are pleased to have just uploaded a new version of docx4all for you to try.

We’ve added quite a few features since I last blogged about docx4all (21 Feb).

New features include:

The VFS file chooser allows docx4all to open documents not just from the local file system, but also from a WebDAV server (such as Alfresco), and potentially, CIFS etc.  To do this, docx4all uses VFSJFileChooser, and webdavclient4j (a project we’ve started to address the gap left when Apache retired Slide, including its WebDAV client).

The incoming document filter is used to convert certain features of WordprocessingML which docx4all can’t yet handle, into something it can.   Examples include proofErr, hyperlink, and lastRenderedPageBreak.  This behaviour relies on a feature of docx4j, which makes it easy to apply a transform to a docx package (by converting it to pkg:package format).

Docx4all can’t yet render tables (let alone edit them), but we’re working on changing that.

docx4j now released under Apache License

April 10th, 2008 by Jason

We’re pleased to announce that docx4j is now available under the Apache License (v2).

This is a response to feedback on an earlier post.  This is also the last license change we’ll be making to docx4j. Word documents are mostly manipulated in corporate environments.  This change removes barriers to adoption of docx4j by business and institutions.

docx4j uses org.merlin.io to efficiently turn streams inside out. That package had been available under the GPL.  Its author, Merlin Hughes, today kindly released it under v2 of the Apache License, so we now use it under that license.

There’s a new nightly build of docx4j available from the downloads page if you want to grab it.  This build can load/save to/from a WebDAV server - more on that in another post.

Microsoft Office Online .. soon?

March 3rd, 2008 by Jason

Nick Carr has sparked speculation that Microsoft will soon unveil its strategy for bringing its Office suite online - which to me means a way of working with Office documents on any computer which has an internet connection.  If you are connected, I’d expect you to be able to collaborate with others in real time; if you are not connected, I’d expect the software to work in offline mode.

When I say “any computer”, I don’t mean to restrict that to any particular operating system (and indeed, Silverlight runs on the Mac, and Microsoft has announce it is working with Novell on a linux implementation).  What good is collaboration software if some of the people you need to collaborate with can’t play?I thought I’d make some predictions about the business model.

There seem to be 2 key questions:

  •  does each end user pay, or does a collaboration originator pay for the right to invite a certain number of collaborators?
  • what support for Mac and Linux users, and when?

Whether each individual user is required to pay, or the originator pays, will reveal much about how Microsoft regards its online offering.  The latter model, that the person who originates a collaboration session pays for a certain number of people to be able to collaborate (ie whatever their platform), would show that their focus is firmly on collaboration.  This is the model we would use for any plutext SAAS offering (available to people who don’t want to install plutext server internally, for free or a fee). 

Here are my predictions:

  1. Enterprise version (ie behind the firewall).  There will be a version an enterprise can install on its Sharepoint server, for those businesses which are not comfortable with their documents being hosted externally.  I’m sure Microsoft can work out how to let people give access to people outside the firewall as necessary.  An enterprise licensee will be able to invite people outside the enterprise without charge.
  2. Cloud version. I expect there will be a cloud version for SMBs.  I think you will be able to use this for free, provided you have a license for the traditional Office product.  You will definitely need this (2007 version) to originate collaboration around a document (ie invite other users) - unless you are prepared to pay a full price for the online offering.  Maybe anyone will be able to accept a collaboration invitation (ie whether or not they are licensed to use Office), making the “who pays” question mute.  To create a new document (or print it?), I expect you will need to have a licence for the traditional Office product, or pay for the SAAS offering.
  3. Mac and Linux support.  I think Microsoft will offer Mac support sooner or later, but delay any hint of support for Linux for as long as possible.  This is because Linux is much more of a threat than OSX (two reasons: (1) Linux is free, and (2) it is very easy to install it on your existing Windows PC).  That said, they might have it “only on Windows” to try to keep people there - until some critical tipping point is reached.  I would say that even now, the only thing stopping Microsoft from seeking revenues from Linux users are the inevitable press headlines along the lines of “Microsoft admits defeat” that would come with this.  The cost of this in terms of perception would surely outweigh any incremental revenues in the short term.  Mac users may be able to use it for free - provided they had an Office license they were able to associate with their online user ID.  
  4. docx only. The documents which come out of this online service will be docx documents, not binary or RTF.  This will help to make the new format ubiquitous.

I wonder whether the collaboration protocols will be published under the recent interoperability initiative?  If they are, the way would be open for a rich world, in which docx4all could potentially play…  I’d be pleasantly surprised if they were, and there was nothing stopping someone from making a client or server of their own.  If anyone else could create a server, then why not get rid of it altogether and go peer-to-peer?  Maybe, just maybe, the thinking is that it would take forever for someone other than Microsoft to create a fully featured server, so third party implementations are to be encouraged (as is presently the case for OpenXML), since Microsoft’s offering will always be the RollsRoyce implementation which attracts the most usage, with the other implementations adding value to the ecosystem.

 The announcement, if/when it comes, will be fascinating!  (more…)

.docx to HTML or PDF using Java

January 13th, 2008 by Jason

Doug Mahugh recently mentioned someone using the DocX2Html.xsl that ships with SharePoint to preview DOCX files in HTML.

As it happens, we’ve just implemented HTML and PDF output in docx4j using a similar approach. We’re using the earlier WordML2HTML XSLT stylesheet available from Oleg Tkachenko. (It would be great if Microsoft also made the presumably newer DocX2Html.xsl that ships with SharePoint freely available).

To create the HTML, we use Sun’s xhtmlrenderer (thanks Sun!). See the obligatory tutorial.

To create the PDF, we take the HTML, and run it through Sun’s pdf-renderer (thanks again, Sun). And again, the tutorial.

The icing on the cake is the PDF Viewer which comes with pdf-renderer. That will give us print preview and printing in docx4all.

Finally, thanks Lars for bringing pdf-renderer to my attention.

Styles and numbering

January 11th, 2008 by Jason

This week, thanks to JAXB, we added strongly typed content models for the Styles part, and the Numbering definitions part of a docx.

Have a look at Styles.java and Numbering.java, used by their respective parts.

Howto: create a new document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically create a new document from scratch.

Tutorial: opening an existing document with docx4j

January 11th, 2008 by Jason

I’ve added a page to the wiki, showing how easy it is to programmatically open and edit an existing document.