Jan 26 2008

Plutext docx collaboration under Alfresco

Twelve days ago, I checked out Alfresco.

I thought Alfresco would be a good way to get access control sorted out. There are a number of other features in Alfresco which might prove interesting down the track, but access control was the immediate priority. Alfresco provides each user with a home directory, and lets them invite other people to access their resources.

I also think that plutext-style document collaboration would be a great fit for many of Alfresco’s customers. Like most other document management systems, Alfresco uses the classic check-out/check-in model (detested by users the world over!). plutext collaboration frees users from that paradigm.

By Monday, a week in, plutext was basically working with Alfresco. This included the Word 2007 add-in authenticating itself when it makes web service calls (something I hadn’t implemented before). I found a few little bugs in Alfresco which I reported (here and here), but everything was going remarkably smoothly.

Sweet, I thought. I’d have a relaxed Tuesday checking the code in and updating the build and wiki, before declaring success in a blog post.

Well, that wasn’t to be. It turns out there are some major issues (here and here) with Alfresco’s JCR support and/or repository which need to be resolved. It’s not so easy to identify simple test cases, since the problems seem to arise when a series of operations is performed in one session after another, and manifest themselves sometime later, but at least they are repeatable.

Hopefully the Alfresco guys will get onto these problems quickly. Otherwise I will have to learn more about Alfresco internals and its use of Hibernate than I’d care to!

Early next week (a week later than I expected) I will update the build procedures so you can easily build it for Alfresco, and then I’ll make sure it works with Jackrabbit again (we’d like to have a single content model that works for both repositories - more on that later).

If we can make good headway with the issues in Alfresco over the next week, we’ll probably regard that as our flagship configuration. If not, I’ll take another look at building access control around Jackrabbit. Although Jackrabbit lacks Alfresco’s bells and whistles, my experience with it (in the single-user load scenarios which are causing Alfresco problems) was trouble free. That’s not to say I expect it to be perfect under heavy load, but it sounds very promising based on what Jukka wrote recently following the announcement of version 1.4.

Jan 13 2008

.docx to HTML or PDF using Java

Doug Mahugh recently mentioned someone using the DocX2Html.xsl that ships with SharePoint to preview DOCX files in HTML.

As it happens, we’ve just implemented HTML and PDF output in docx4j using a similar approach. We’re using the earlier WordML2HTML XSLT stylesheet available from Oleg Tkachenko. (It would be great if Microsoft also made the presumably newer DocX2Html.xsl that ships with SharePoint freely available).
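For readers who haven’t driven an XSLT stylesheet from Java before, here is a minimal sketch of the general technique using the JDK’s built-in javax.xml.transform API. It is not the docx4j code: the class name, the toy stylesheet and the <doc>/<p> input vocabulary are all made up for illustration, standing in for the real WordML2HTML stylesheet and a real WordML document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltToHtmlSketch {

    // Hypothetical stand-in stylesheet: turns each <p> under <doc> into an HTML <p>.
    // A real conversion would load the WordML2HTML stylesheet from a file instead.
    static final String TOY_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/doc'>"
      + "<html><body><xsl:apply-templates select='p'/></body></html>"
      + "</xsl:template>"
      + "<xsl:template match='p'><p><xsl:value-of select='.'/></p></xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given stylesheet to the given XML and returns the result as a string. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers of this sketch need not handle a checked exception.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<doc><p>Hello</p><p>World</p></doc>", TOY_XSLT));
    }
}
```

Swapping StreamSource(new StringReader(...)) for StreamSource(new File(...)) is all it should take to point the same code at a stylesheet and document on disk.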

To create the HTML, we use Sun’s xhtmlrenderer (thanks Sun!). See the obligatory tutorial.

To create the PDF, we take the HTML, and run it through Sun’s pdf-renderer (thanks again, Sun). And again, the tutorial.

The icing on the cake is the PDF Viewer which comes with pdf-renderer. That will give us print preview and printing in docx4all.

Finally, thanks Lars for bringing pdf-renderer to my attention.

Jan 11 2008

Styles and numbering

This week, thanks to JAXB, we added strongly typed content models for the Styles part, and the Numbering definitions part of a docx.

Have a look at Styles.java and Numbering.java, used by their respective parts.

Jan 11 2008

Howto: create a new document with docx4j

I’ve added a page to the wiki, showing how easy it is to programmatically create a new document from scratch.

Jan 11 2008

Tutorial: opening an existing document with docx4j

I’ve added a page to the wiki, showing how easy it is to programmatically open and edit an existing document.

Jan 11 2008

docx4j license change

A note for the record that we’ve changed the docx4j license from the GPL v3 to the Affero General Public License v3. All users of which we are aware are happy with this change.

The logic for the change is the same as the logic for licensing plutext-server under the Affero GPL. That is, to ensure that people who use docx4j in a SaaS environment are treated the same as people who distribute docx4j to end users.

Licensing docx4j under an Apache style licence also has its attractions - let us know if this would make a difference to you.

Jan 06 2008

OOXML, boolean values and binding

ST_OnOff is used extensively in the XML Schema. Here is the openiso.org link (nice resource!).

Basically, it is used for things which should use the built in boolean schema data type:

This simple type specifies a set of values for any binary (on or off) property defined in a WordprocessingML document.

For example, the b (bold) element has an attribute @val of type ST_OnOff.

There are several problems with how this is done.

The first is that its possible values are “on, 1, or true”. OOXML should just use the XSD boolean data type, which doesn’t allow “on” (or “off”). For related comments, see here, here, and here. Denmark and France seem to be the strongest advocates of the use of xsd:boolean, and I hope they get their way.

The second is that it is left to the specification text to say that if the attribute is omitted, its value is implied to be true. That should be expressed as part of the schema.

For CT_OnOff, it would be:

<xsd:complexType name="BooleanDefaultTrue">
<xsd:attribute name="val" type="xsd:boolean" default="true" />
</xsd:complexType>

I don’t think Denmark or anyone else made this second point.
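To make the first two problems concrete, this is roughly what anyone reading the raw XML has to do by hand today. It is only a sketch: the class and method names are hypothetical, and I’m assuming the “off” side of ST_OnOff mirrors the “on” side (off, 0, false).

```java
class OnOffSketch {

    /**
     * Interprets an ST_OnOff-style attribute value.
     * Pass null when the attribute was omitted from the element.
     */
    static boolean parseOnOff(String val) {
        if (val == null) {
            // Only the specification text, not the schema, says omitted means true.
            return true;
        }
        switch (val) {
            case "on":
            case "1":
            case "true":
                return true;
            case "off":   // assumed counterparts of the three "true" spellings
            case "0":
            case "false":
                return false;
            default:
                throw new IllegalArgumentException("Not an ST_OnOff value: " + val);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOnOff(null));
        System.out.println(parseOnOff("on"));
        System.out.println(parseOnOff("0"));
    }
}
```

With xsd:boolean and a schema-level default, none of this would need to be written by the person consuming the document.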

The schema we are using in docx4j to generate classes uses these sorts of definitions instead of ST_OnOff or CT_OnOff. For CT_OnOff, this results in a BooleanDefaultTrue type, which is used in fields like (for bold):

protected BooleanDefaultTrue b

Which brings me to the third problem with ST_OnOff (and the schema in general), which is that it generates ugly code in JAXB and other binding frameworks (presumably .NET included). The built in schema data types produce much nicer code.
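By way of illustration, here is a hand-written sketch of the shape of class you would hope to get for a boolean attribute with default="true". It is not the actual generated docx4j source: the JAXB annotations are left out so it compiles on its own, and RunPropertiesSketch is a hypothetical stand-in for the real run properties class.

```java
class BooleanDefaultTrue {

    // Generated code would annotate this field as an XML attribute named "val".
    // Null means the attribute was not present in the XML.
    protected Boolean val;

    /** Returns the attribute value, falling back to the schema default of true. */
    public boolean isVal() {
        return val == null ? true : val;
    }

    public void setVal(Boolean value) {
        this.val = value;
    }
}

// Hypothetical holder showing how a field such as "b" (bold) might be read.
class RunPropertiesSketch {

    // Null means the <b> element is absent from these run properties.
    protected BooleanDefaultTrue b;

    /** True only when a <b> element is present and its val is (or defaults to) true. */
    boolean isBoldSwitchedOn() {
        return b != null && b.isVal();
    }

    public static void main(String[] args) {
        RunPropertiesSketch rPr = new RunPropertiesSketch();
        rPr.b = new BooleanDefaultTrue();          // like <b/> with no val attribute
        System.out.println(rPr.isBoldSwitchedOn());
    }
}
```

The point is that the caller only ever deals with a plain boolean; the “omitted means true” rule lives in one place.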

As a general remark, running the schema through JAXB is a good way to find places where the schema can be improved. Schema design goals should include:

  1. that it can be processed out of the box by binding frameworks (since that makes it easier for people to pick up a schema and start using it). [This is not currently the case]
  2. that the schema be expressed in such a way as to generate the simplest code.

Dec 22 2007

docx4j trunk now uses JAXB

10 days ago, we created a proof of concept for using JAXB on a subset of wml.xsd (one of the OpenXML schema files).

We’ve declared that a success, and moved it from a branch into the trunk of docx4j. Here be the generated classes.

plutext-server has now been migrated to use it.

And Jo is working with it as he codes docx4all.

So we’re pretty committed at this point!

We’re tidying up bits of the object model as we go (ie editing our xsd to generate Java that we like). So far, paragraphs (p, pPr, r, rPr, t) and structured document tags (sdt, sdtPr, sdtContent) have had our attention.

We’re also making a few changes to the generated classes, so we need to think about how best to prevent those changes from getting lost when the classes are re-generated. There’s a bit of support in XJC for this, and diff may come in handy, but I’d love to hear best practices.

What we have now is an object model for key pieces of the Main Document part (document.xml), in package org.docx4j.jaxb.document. Next cab off the rank is the Styles part, which we’ll put in org.docx4j.jaxb.styles.

Dec 17 2007

“View Page Source” from within Word 2007

When developing software which uses WordprocessingML, you often need to look at the XML.

Wouter’s Package Explorer is a great way to do this, particularly if you want to look at an existing file.
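If you would rather do this from code, a docx file is a ZIP package, so the JDK’s java.util.zip classes are enough to pull a part out. The following is a minimal sketch rather than anything from docx4j; the class and method names are made up, and it assumes the part is encoded as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class DocxPartReader {

    /**
     * Reads one part (for example "word/document.xml") from a docx stream.
     * Returns the part as a string, or null if the package has no such part.
     */
    static String readPart(InputStream docx, String partName) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals(partName)) {
                    continue;
                }
                // Copy the current entry's bytes into a buffer.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                // Assumption: the part is UTF-8 encoded.
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling readPart(new FileInputStream("some.docx"), "word/document.xml") should hand back the main document part, ready to be pretty-printed or searched.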

Wouldn’t it be great (well, at least a little bit useful), if you could look at the WordML for a document from within Word? Then you could quickly see the WordML produced when you do something in Word (format some text, create a table, add a comment etc).

ActiveDocument.WordOpenXML provides the OpenXML corresponding to the document. plutext-client-word2007 uses this extensively in C#.

Anyway, we can also use it in VB from within Word to open that in an Internet Explorer window, syntax highlighted and with collapsible sections (similar to IE’s default stylesheet for XML documents).

The result:

word2007-viewpagesource.png

The very straightforward code to do this can be cut and pasted from here; use the “download in other formats” links at the bottom of the page. In Word, choose Developer > Visual Basic to open Word 2007’s Visual Basic IDE. You can then just paste the code into a new module. Create or open a document, then run the VB. That’s all there is to it.

I specifically chose to do it using VB and not VSTO, so you don’t need Visual Studio installed to get this running.

Also I cobbled this code together quickly, and I know it can be improved. If you’d like me to incorporate your improvements, please feel free to send them in!

Dec 12 2007

Running a community - lessons from jaxb.dev.java.net

As described in my last post, we’re experimenting with using JAXB to unmarshal/marshal docx documents.

The specification is thorough, and the reference implementation (v2.1.5) seems to work well.

Unfortunately, the same can’t be said of jaxb.dev.java.net.

Given that one of my hats is to develop a community around the plutext projects, I’m trying to be aware of what helps or hinders this process.

So in the spirit of constructive criticism (I’d really like to see momentum grow around JAXB-RI), here are some observations:

  1. there are at least two places to go to for discussion (the mailing lists, and the Metro and JAXB forum). Where should you post? Which is going to get the better response? Why two options? In this case, the forum seems more active.
  2. it’s much harder than it needs to be to get the source code. There is no anonymous CVS (or SVN) access. You need to be registered, and to have applied for the Observer role. Then the instructions omit the cvs login step. Eventually it worked, but in the meantime, it took a bit of digging to find a link to the zipped up sources. There are outdated blog entries to disregard along the way.
  3. once you do have the source code, and given that JDK 1.6 introduced JAXB 2.0 in rt.jar, there should be prominent instructions for using 2.1 in Eclipse (ie use JDK 1.5)
  4. I couldn’t find JAXB 2.1.5 in Maven repositories. Again, outdated blog entries.
  5. the website is pretty slow

Now, none of these problems will stop the determined user. But I’m sure their cumulative effect is to make many others give up.

For those like me who try to get a quick sense of how active a project is by looking at the volume of traffic on the mailing list or forum before making any further commitment, problem #1 above amounts to bad marketing if nothing else.

This is a pity, because as I said, JAXB 2.1.5 is good stuff.