Archive for the ‘Microsoft Word’ Category

Office pptx/xlsx/docx to PDF in docx4j 8.2.3

September 5th, 2020 by Jason

docx4j 8.2.3 supports three distinct ways of converting Microsoft Word docx documents to PDF. Two of them can also convert pptx or xlsx to PDF.

The three approaches:

  • export-fo: the content is converted to XSL FO, and from there, to PDF (or any of the other formats supported by Apache FOP)
  • documents4j: since 8.2.0, use Microsoft Word to do the conversion
  • via-Microsoft-Graph: new in 8.2.3, use java-docx-to-pdf-using-Microsoft-Graph to do the conversion

So which should you choose? The following table covers some of the things you might want to consider:

Overview
  • export-FO: converts the docx to XSL FO, then uses Apache FOP to convert that to PDF
  • Microsoft Graph: uses Microsoft’s cloud
  • documents4j: uses your Microsoft Office installation

Fidelity
  • export-FO: suitable for simple documents (text, tables, supported image types, headers/footers)
  • Microsoft Graph: 100% (Microsoft’s fidelity)
  • documents4j: 100% (Microsoft’s fidelity)

Suitability
  • export-FO: simple docx
  • Microsoft Graph: docx, pptx, xlsx
  • documents4j: docx, pptx, xlsx

License considerations
  • export-FO: ASL v2
  • Microsoft Graph: refer to the applicable Microsoft cloud terms
  • documents4j: refer to the Microsoft EULA governing your Office install (increasingly restricted with each release)

Cost
  • export-FO: free
  • Microsoft Graph: Microsoft cloud costs
  • documents4j: (Microsoft Office)

Confidentiality
  • export-FO: documents don’t leave your server
  • Microsoft Graph: documents go to Microsoft cloud
  • documents4j: documents need not leave your servers

Other advantages
  • export-FO: fast XSL FO/PDF templating for high volume PDF creation; open source, so can be extended
  • Microsoft Graph: Microsoft encourages this approach; Microsoft cloud handles scalability
  • documents4j: can update a docx table of contents; can convert RTF and binary .doc

Other disadvantages
  • export-FO: two step (docx to XSL FO to PDF) processing is slower (except for XSL FO templating)
  • Microsoft Graph: dependency on 3rd party cloud; currently can’t update a docx table of contents (vote to fix)
  • documents4j: not supported by Microsoft

Tool review: Merging Word documents on the desktop

July 25th, 2016 by Jason

Sometimes Microsoft Word users need to join several Word documents into a single file, without loss of formatting.

An example would be a cover letter, a quote, and a contract.

Or a proposal or contract, plus appendices.

(In legal industry parlance, the finalised collection of documents is called a “closing binder”, an “electronic bundle” or, less commonly, a “deal bible”.  And it’s usually in PDF…)

Word itself doesn’t do this for you.  So this blog post is a review of tools you might download/install to try to get the job done.

TLDR: I don’t want to be negative, but bottom line is they won’t help … the 5 tools I found and tested each do a poor job at this.  If you, dear reader, know of a better tool, please share in a comment!   Most people seem to convert their documents to PDF first, then merge the PDFs – for good reason!

By way of background/disclosure, our Docx4j Enterprise product is good at this (if I may say so myself), but we don’t sell it to end users.  This background allowed me to make a couple of very simple documents to test the products.

Without further ado, the 5 products I tested were:

My first test was to merge 2 documents which define Heading 1 differently.  Does the merged document keep the distinct appearance of the 2 headings?

The only product which was able to pass this basic test was Icestand’s.

Unfortunately Icestand failed my second test: it lost section formatting (page orientation, headers/footers).

And my third test: list numbering.  If doc 1 contains “1,2,3” and so does doc 2, in the merged output, do you get “1,2,3  1,2,3” or “1,2,3 4,5,6”?
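
To see why both outcomes are possible, here is a toy model in Java. It is not docx4j or Word code, and every name in it (MergeNumberingSketch, ListItem, render and so on) is made up for illustration. It assumes only that a list paragraph points at a numbering instance by ID (as w:numId does in WordprocessingML) and that the counter runs per instance. If a merge tool simply concatenates two documents which both use ID 1, the numbering runs on; if it gives the second document fresh IDs, the numbering restarts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of list numbering across a merge (hypothetical names, not docx4j API).
final class MergeNumberingSketch {

    // A list paragraph pointing at a numbering instance, in the way w:numId does.
    static final class ListItem {
        final int numId;
        final String text;

        ListItem(int numId, String text) {
            this.numId = numId;
            this.text = text;
        }
    }

    // Number the items with one counter per numbering instance.
    static List<String> render(List<ListItem> items) {
        Map<Integer, Integer> counters = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (ListItem item : items) {
            int n = counters.merge(item.numId, 1, Integer::sum);
            out.add(n + ". " + item.text);
        }
        return out;
    }

    // Naive merge: plain concatenation, so equal IDs from both documents share a counter.
    static List<ListItem> naiveMerge(List<ListItem> a, List<ListItem> b) {
        List<ListItem> merged = new ArrayList<>(a);
        merged.addAll(b);
        return merged;
    }

    // Remapping merge: shift the second document's IDs past the first's, so its lists restart.
    static List<ListItem> remappingMerge(List<ListItem> a, List<ListItem> b) {
        int offset = a.stream().mapToInt(item -> item.numId).max().orElse(0);
        List<ListItem> merged = new ArrayList<>(a);
        for (ListItem item : b) {
            merged.add(new ListItem(item.numId + offset, item.text));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<ListItem> doc1 = Arrays.asList(
                new ListItem(1, "a"), new ListItem(1, "b"), new ListItem(1, "c"));
        List<ListItem> doc2 = Arrays.asList(
                new ListItem(1, "x"), new ListItem(1, "y"), new ListItem(1, "z"));
        System.out.println(render(naiveMerge(doc1, doc2)));     // numbering runs on, 1 to 6
        System.out.println(render(remappingMerge(doc1, doc2))); // numbering restarts at 1
    }
}
```

A real merge also has to carry across the numbering definitions themselves, which this sketch ignores.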

Since all the others failed test 1, I didn’t subject them to tests 2 or 3.  But I found:

  • most/all of these programs require Word to be installed.  That’s probably OK for a desktop utility.
  • with the exception of kutools, which appears in Word’s ribbon, each presents as standalone/free-standing software with its own UI.
  • you’d expect to be able to arrange your input documents in the order you want, but 3steps lacked even that!

Since these products all lose formatting, unless there’s a better product out there somewhere, you’d get better results by converting to PDF, then merging the PDF files.  In this respect, bundledocs, CaseLines and others look interesting; they do the PDF conversion as well, removing a step from the process.

Using PDF might be fine, if the deal is done.  And interestingly, in some cases, PDF might even be a requirement.  For example, the UK Supreme Court “recommends” it.

But if the documents are still being finalised, Word is preferable, since it’s an editing format. Has anyone tried converting the resulting merged PDF file back to Word?!

 

 

High fidelity PDF output

February 10th, 2015 by Jason

This post introduces our new commercial component for docx to PDF output.

The background is that docx4j’s standard method of producing PDF output has been via XSL FO, using Apache FOP.

This has worked well enough for some docx4j users, but it has certain limitations which can bite you, for example lack of tab and tab stop support in XSL FO.

And because there are differences between FOP’s layout engine and Word’s, page breaks may fall in different places.

This means the FO based PDF output in docx4j is about as good as it’s going to get (short of enhancing the FO renderer).

To do better, we’ve had to invest in a non-FO approach, using layout algorithms specifically designed to give the same results Word does.

You can try it now.

A side benefit is that this new approach is much faster than the FO approach.

The component is actually independent of docx4j.  This means it’ll also work well if you need to convert docx to PDF from C# (without Word), Python, PHP etc.

Pricing is at plutext.com

Docx4jHelper Word AddIn

December 4th, 2014 by Jason

The dream:

  • View Open XML right from within Word, and see what happens when you edit it.
  • Or generate corresponding docx4j Java code, with deep links into the corresponding docx4j source code and Open XML spec.

Regular users of docx4j will be aware of our webapp, which amongst other things, generates docx4j Java code for the specified Open XML in your sample docx/pptx/xlsx.

The webapp is useful, but it has a few drawbacks:

  • you have to upload your docx/pptx/xlsx, which takes time
  • if your docx/pptx/xlsx contains sensitive data, you probably want to remove that first
  • the webapp might be down

To address these issues, we’re now offering the code gen functionality as a Word AddIn.

If you install the Word AddIn, this means you can now generate code without your docx leaving your computer.

This is all feasible because docx4j can run as a DLL in a .NET project, thanks to IKVM!

Where to get it

You can download the installer.  After you complete the landing form (using your corporate email address, not gmail etc), you’ll be sent a download link.

Getting Started

After a successful installation, after restarting Word, you should see a “Docx4j” menu, containing:

To generate code, first press the “Load Helper” button.

You’ll see the following form:

It’s inviting you to start a local web server which will run the same code as the existing webapp.  Just choose a port you aren’t already using.  If for some reason you want to browse using Internet Explorer (as opposed to your default browser), check the box.

It’ll take a little while to start the server; you’ll see a dialog when it’s started.

Now you can generate code.  To do so, select something in your docx, then click the “Generate Code” button.

After a while, a window will open in your web browser, and you’ll see:

That’s the view of the docx package, which will be familiar if you’ve used the webapp.   For how to generate code from here, see our earlier post.

Code generating is done on your computer.  (But note, the links on that page to docx4j source code and the OpenXML spec are external links)

What about the “Edit OpenXML” button?

If you select something in your docx, then click that button, after a while (maybe 30 secs the first time!), you’ll see the corresponding XML in an editor window:

You can go ahead and edit it, then click the “Apply” button.

If Word likes your XML, you’ll see your changes on the document surface.  Ctrl Z should work for undo.

So there are 2 ways to see the underlying XML

The first way we described uses your web browser; the second is a Windows Form.

These two views have different features; maybe a later release will unify them?

What about pptx, xlsx?

There’s no reason in principle we couldn’t make a similar AddIn for Powerpoint and Excel.  In fact, we plan to make these, once any teething issues have been ironed out in the WordAddIn.

In the meantime, for pptx and xlsx, you can continue to use the webapp.

Help, Suggestions and other Discussion

If you are a Plutext customer experiencing an issue, please email support@plutext.com

Otherwise, please check the Docx4jHelper AddIn forum.

We’ve got some ideas for where the AddIn goes from here, but we’d love to hear yours.

Web-based docx editing?

October 4th, 2014 by Jason

Following on from the previous post on content tracking, some people have been asking about how to edit a docx in a web browser.

So I thought I’d link to a proof of concept we did a year or so ago.

The idea is:

  • use docx4j to convert the docx to XHTML
  • use CKEditor to edit that XHTML in the web browser
  • on submit, convert the XHTML back to docx content

The general problem with converting to/from XHTML is the “impedance mismatch”.  That is, losing stuff during the round trip.  This will be a familiar problem to anyone who has ever edited a docx in Google Docs or LibreOffice.

This demo addresses that problem by identifying docx content which CKEditor would mangle, and then on submit/save, using the original docx content for those bits.

In this demo, the problematic content is replaced with visual placeholders, so you can see it is there.

The intent is that you can add/edit text content in the browser, without other document content (headers/footers, text boxes etc) getting lost.
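
The placeholder idea can be sketched in a few lines of Java. This is not the demo’s actual code: the class, the editorSafe test (here, anything that isn’t a plain paragraph is treated as fragile) and the placeholder markup are all assumptions made for illustration. The point is only the shape of the round trip: stash what the editor would mangle, show a placeholder in its place, and put the original back on submit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the placeholder round trip (hypothetical names, not the demo's implementation).
final class PlaceholderRoundTripSketch {

    // Original content the editor would mangle, keyed by placeholder id.
    private final Map<String, String> stash = new LinkedHashMap<>();

    // Assumption for illustration: only plain paragraphs are safe to hand to the editor.
    static boolean editorSafe(String block) {
        return block.startsWith("<p>");
    }

    // Before editing: swap each fragile block for a visible placeholder.
    List<String> toEditable(List<String> blocks) {
        List<String> out = new ArrayList<>();
        for (String block : blocks) {
            if (editorSafe(block)) {
                out.add(block);
            } else {
                String id = "ph" + stash.size();
                stash.put(id, block);
                out.add("<img class=\"placeholder\" id=\"" + id + "\"/>");
            }
        }
        return out;
    }

    // On submit: keep the user's edits, but restore the original for every placeholder.
    List<String> fromEdited(List<String> edited) {
        List<String> out = new ArrayList<>();
        for (String block : edited) {
            String restored = block;
            for (Map.Entry<String, String> entry : stash.entrySet()) {
                if (block.contains("id=\"" + entry.getKey() + "\"")) {
                    restored = entry.getValue();
                }
            }
            out.add(restored);
        }
        return out;
    }
}
```

If the user deletes a placeholder in the editor, the stashed content is simply dropped, which is probably what they meant.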

To give it a try, go to the upload page and choose a docx file from your computer.

You should see your docx open with the CKEditor toolbars above it:

(In the demo and screenshot above, the grey “B” image represents a bookmark)

Make some edits, then hit the Submit button (at the bottom).

The docx will be streamed back to your computer as a download in your browser.

Now open it in Word, and compare it to the original.

Feedback

If you want to add this type of functionality to your application, please let us know by emailing jharrop@plutext.com

We’d love to hear:

  • a bit more about your use case,
  • where you see your users doing their web-based editing:- on your intranet, extranet, or the web at large?
  • what kind of editing? is it proof reading,  customising particular sections, a step in a workflow..?
  • do you need to cater for iPads or Android tablets?  And if so, is a dedicated app on your roadmap?
  • any additional requirements you might have!

Update (Oct 2015)

Source code is available at https://github.com/plutext/docx-html-editor

 

C#/.NET: Import XHTML into docx without Word

September 5th, 2014 by Jason

How do you import HTML into a Word document without using Microsoft Word?

Honouring the CSS, so the Word document looks similar to the input XHTML.  Alternatively, converting @class values to Word styles.

It’s a common requirement in our increasingly web-centric world.

docx4j-ImportXHTML.NET is open source (LGPL v2.1 or later), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test; with very little effort, you can run it from a sample project in Visual Studio, because docx4j-ImportXHTML.NET is in the NuGet.org repository:

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, it’s installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j-ImportXHTML.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the docx4j-ImportXHTML DLL, and the file src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs.  Most of the rest of the stuff comes from the docx4j dependency, which NuGet fetches.

If you have a look at ConvertInXHTMLFragment.cs, you’ll see it contains

Let’s run it, to convert that xhtml to docx content.

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in something like:

You can see there the WordML equivalent for the tail of the XHTML list we were converting.

Obviously, you can modify src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs to read your own XHTML.

A few comments.

Well formed XML! Only well formed XML works, ie XHTML, not tag-soup HTML.  If you have tag soup, it’s your responsibility to convert that to XHTML with some tidy tool.  You’ll get a SAXParseException if your input is not well formed.
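
If you’d rather find out before handing your input to the importer, a pre-flight check is easy to sketch.  Since the DLL is just the Java library run through IKVM, the sketch below is in Java and uses only the JDK’s own XML parser; the class and method names are invented for illustration and are not part of docx4j-ImportXHTML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Pre-flight well-formedness check (hypothetical helper, JDK classes only).
final class WellFormedCheck {

    static boolean isWellFormed(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores warnings,
            // so nothing is printed to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xhtml)));
            return true;
        } catch (SAXException e) {
            // SAXParseException (a SAXException) is what tag soup produces.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<ul><li>one</li><li>two</li></ul>")); // true
        System.out.println(isWellFormed("<ul><li>one<li>two</ul>"));           // false
    }
}
```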

Word styles: if the target docx contains a style matching @class, it can be used.  This’ll be the subject of a separate blog post.

Other examples: the Java repository on GitHub contains examples for reading from a file etc.  Converting these to C# is left as an exercise for the reader.  If you do that, we’d be delighted to receive a pull request on https://github.com/plutext/docx4j-ImportXHTML.NET

Logging, Commons Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in  DocxToPDF.cs).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving XHTML import support. To implement a new feature in the XHTML import, typically you’d make the improvement to docx4j-ImportXHTML first (ie the Java version), then create a new DLL using the ant build target dist.NET.   docx4j-ImportXHTML is on GitHub, and is most easily setup using Maven (see earlier blog post).

Alternatives. There are a couple of projects on CodePlex you could try:

I’d be interested in feedback on how they compare.

Help/support/discussion. You can post in the docx4j XHTML import forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, xhtml etc as you think appropriate).  Please don’t cross post at both!


docx to PDF in C#/.NET

September 5th, 2014 by Jason

How to convert docx to PDF without using Microsoft Word?

If your docx is mainly text, tables and images, docx4j.NET may work well for you.  Edit (Feb 2015): if not, you may be interested in our new commercial high fidelity PDF renderer.

docx4j.NET is open source (Apache software license v2), identical to the Java version, but made into a DLL using IKVM.  Currently we’re at v3.2.0, released last week.

It is easy to test; you can upload your docx to the docx4j demo webapp

Or with very little effort, you can run it from a sample project in Visual Studio.  It’s very easy, because docx4j.NET is in the NuGet.org repository:

To create your sample project:

  1. make sure you have NuGet Package Manager installed
    • for VS 2012 and later, it’s installed by default
    • for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
  2. create a new project in Visual Studio (File > New > Project).  A Console Application is fine.  I chose that from the .NET 3.5 list.
  3. from the Tools menu, choose NuGet Package Manager > Package Manager Console
  4. type Install-Package docx4j.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there!  Notice the file src/samples/c_sharp/Docx4NET/DocxToPDF.cs

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in “done! Press any key to continue..”

What just happened?  All being well, the sample docx “src\samples\resources\sample-docx.docx” was saved as a PDF “OUT_sample-docx.pdf” in your project directory.

You can modify src/samples/c_sharp/Docx4NET/DocxToPDF.cs to read your own test docx.

A few comments.

XSL FO; Apache FOP. docx4j creates PDF via XSL FO.  It generates XSL FO, then uses Apache FOP (v1.1) to convert the XSL FO to PDF.  FOP also supports other output formats (the subject of another blog post).

Logging, Commons Logging. Logging is via Commons Logging.  In the demo, it is configured programmatically (ie in  DocxToPDF.cs).  Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving PDF support. To improve the quality of the PDF output, typically you’d make the improvement to docx4j first (ie the Java version), then create a new DLL using the ant build target dist.NET.   docx4j is on GitHub, and is most easily setup using Maven (see earlier blog post).

Help/support/discussion. You can post in the docx4j PDF output forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, pdf, fop, xslfo as you think appropriate).  Please don’t cross post at both!


SQL Server Reporting Services (SSRS) emits dodgy Word docx documents

May 12th, 2014 by Jason

By now we’re used to products which emit docx files which are umm, not .. quite .. right.

But it’s more noteworthy when the product in question is from Microsoft.  After all, it’s their file format (ECMA etc standardisation notwithstanding).

The product in question here is SQL Server Reporting Services 2012 and its Word export.

It seems they didn’t bother to validate their documents (eg using Open XML SDK 2.0 Productivity Tool):

Apparently there’s a reason for this:

“Word and SSRS treat page headers and footers differently. Word actually positions them inside the page margins, whereas SSRS positions them inside the area that the margins surround. As a result, in Word, the page margins do not control the distance between the top edge of the page and that of the page header (or similarly for the page footer). Instead, Word has separate “Header from Top” and “Footer from Bottom” properties to control those distances. Since RDL does not have equivalent properties, the Word renderer sets these properties to zero.”
But the problem is that it is actually setting them to blank (as opposed to zero), which is not valid.

Another problem:

JAXB doesn’t like invalid documents, so docx4j has to fix these sorts of things before it can construct a content model.  (Maybe that’s why SSRS calls it Word export, not docx export:- they just check Word can open the document, then call it job done)
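
To give a feel for the kind of fix-up involved, here is a sketch in plain Java (JDK DOM only).  It is not how docx4j does it, and the names are made up.  It also assumes the blank values are the w:header and w:footer attributes on w:pgMar, which is where WordprocessingML stores the “Header from Top” and “Footer from Bottom” distances; the sketch replaces a blank value with the zero the SSRS documentation says was intended.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Repairs blank header/footer distances before schema-aware processing
// (hypothetical helper; the affected attributes are an assumption).
final class BlankMarginRepairSketch {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Sets blank w:header / w:footer on every w:pgMar to "0"; returns how many were fixed.
    static int repair(Document doc) {
        int fixed = 0;
        NodeList pgMars = doc.getElementsByTagNameNS(W_NS, "pgMar");
        for (int i = 0; i < pgMars.getLength(); i++) {
            Element pgMar = (Element) pgMars.item(i);
            for (String name : new String[] {"header", "footer"}) {
                Attr attr = pgMar.getAttributeNodeNS(W_NS, name);
                if (attr != null && attr.getValue().trim().isEmpty()) {
                    attr.setValue("0");
                    fixed++;
                }
            }
        }
        return fixed;
    }

    // Namespace-aware parse of an XML string, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```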

There are other problems with SSRS docx which the Productivity Tool doesn’t report.

Take a look at the styles part:

Notice anything wrong?  It’d be better if the EmptyCellLayoutStyle had @w:styleId and @w:type, like so:

It’d also be nice if it defined the “Normal” style it is basedOn!

docx4j and other consumers could/should detect such problems and degrade gracefully in the face of them, but Microsoft (of all companies!) should exercise better quality control.

Hello Maven Central

October 29th, 2011 by Jason

With version 2.7.1, docx4j – a library for manipulating Word docx, Powerpoint pptx, and Excel xlsx xml files in Java – and all its dependencies, are available from Maven Central.

This makes it really easy to get going with docx4j.  With Eclipse and m2eclipse installed, you just add docx4j, and you’re done.  No need to mess around with manually installing jars, setting class paths etc.

This post demonstrates that, starting with a fresh OS (Win 7 is used, but these steps would work equally well on OSX or Linux).

Step 1 – Install the JDK

For the purposes of this article, I used JDK 7, but docx4j works with Java 6 and 1.5.

Step 2 – Install Eclipse Indigo (3.7.1)

I normally download the version for J2EE developers. Unzip it and run eclipse.

Step 3 – Install m2eclipse.

In Eclipse, click Help > Install New Software.

Type “http://download.eclipse.org/technology/m2e/releases” in the “Work with” field as shown:

then follow the prompts.

Step 4 – Create your Maven project

In Eclipse, File > New > Project.., then choose Maven project

You should see:

Check “Create a simple project (skip archetype selection)” then press next.

Allocate group and artifact id (what you choose as your artifact id will become the name of your new project in Eclipse):

Press finish

This will create a project with directories using Maven conventions:

(Note: If your starting point is a new or existing Java project in Eclipse, you can right click on the project, then choose Configure > Convert to Maven project)

Step 5 – Add docx4j to your POM

Double Click on pom.xml

Next click on the dependencies tab, then click the “add dependency” button, and enter the docx4j coordinates as shown in the image below:

The result is this pom:


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>mygroup</groupId>
  <artifactId>myartifact</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
  	<dependency>
  		<groupId>org.docx4j</groupId>
  		<artifactId>docx4j</artifactId>
  		<version>2.7.1</version>
  	</dependency>
  </dependencies>
</project>

Ctrl-S to save it.

m2eclipse may take some time to download the dependencies.

When it has finished, you should be able to see them:

Step 6 – Create HelloMavenCentral.java

If you made a Maven project as per step 4 above, you should already have src/main/java on your build path.

If not, create the folder and add it.

Now add a new class:

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class HelloMavenCentral {

	public static void main(String[] args) throws Exception {
		
		WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
		
		wordMLPackage.getMainDocumentPart()
			.addStyledParagraphOfText("Title", "Hello Maven Central");

		wordMLPackage.getMainDocumentPart().addParagraphOfText("from docx4j!");
				
		// Now save it 
		wordMLPackage.save(new java.io.File(System.getProperty("user.dir") + "/helloMavenCentral.docx") );
		
	}	
}

Step 7 – Click Run

When you click run, all being well, a new docx called helloMavenCentral.docx will be saved.

You can open it in Word (or anything else which can read docx), or unzip it to inspect its contents.

Step 8 – Adding docx4j.properties

One final thing. If you plan on creating documents from scratch using docx4j, it is useful to set paper size etc, via docx4j.properties. Put something like the following on your classpath:

# Page size: use a value from org.docx4j.model.structure.PageSizePaper enum
# eg A4, LETTER
docx4j.PageSize=LETTER
# Page margins: use a value from org.docx4j.model.structure.MarginsWellKnown enum
docx4j.PageMargins=NORMAL
docx4j.PageOrientationLandscape=false

# Page size: use a value from org.pptx4j.model.SlideSizesWellKnown enum
# eg A4, LETTER
pptx4j.PageSize=LETTER
pptx4j.PageOrientationLandscape=false

# These will be injected into docProps/app.xml
# if docx4j.App.write=true
docx4j.App.write=true
docx4j.Application=docx4j
docx4j.AppVersion=2.7.1
# of the form XX.YYYY where X and Y represent numerical values

# These will be injected into docProps/core.xml
docx4j.dc.write=true
docx4j.dc.creator.value=docx4j
docx4j.dc.lastModifiedBy.value=docx4j

#
#docx4j.McPreprocessor=true

# If you haven't configured log4j yourself
# docx4j will autoconfigure it.  Set this to true to disable that
docx4j.Log4j.Configurator.disabled=false
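
docx4j reads this file itself, so you don’t need any code for it.  But if you want to sanity check what you’ve written, java.util.Properties will parse the same format.  The sketch below is just that, a check; the class and method names are made up, and it parses a string so that it is self-contained (in a real project you’d load the file from the classpath instead).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sanity check for docx4j.properties content (hypothetical helper, JDK only).
final class Docx4jPropertiesCheck {

    // Parse properties text; lines starting with # are comments.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // True only if the landscape key is present and set to "true".
    static boolean landscape(Properties props) {
        return Boolean.parseBoolean(
                props.getProperty("docx4j.PageOrientationLandscape", "false"));
    }

    public static void main(String[] args) {
        Properties props = parse(String.join("\n",
                "# Page size",
                "docx4j.PageSize=LETTER",
                "docx4j.PageMargins=NORMAL",
                "docx4j.PageOrientationLandscape=false"));
        System.out.println(props.getProperty("docx4j.PageSize") + ", landscape=" + landscape(props));
    }
}
```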

And that’s it. For more information on docx4j, see our Getting Started document.


docx4j 2.7.0 released

July 8th, 2011 by Jason

I’m pleased to announce the release today of docx4j 2.7.0.

What is docx4j?

docx4j is an open source (Apache v2) library for creating, editing, and saving OpenXML “packages”, including docx, pptx, and xlsx.  It is similar to Microsoft’s OpenXML SDK, but for Java rather than .NET.  It uses JAXB to create the Java objects out of the OpenXML parts.

Notable features for docx include export as HTML or PDF, and CustomXML databinding for document generation (including our OpenDoPE convention support for processing repeats and conditions).

The docx4j project started in October 2007.

What’s new?

This is mainly a maintenance release; things of note include:

  • Improvements to Maven build
  • ContentAccessor interface
  • AlteredParts: identify parts in this pkg which are new or altered; Patcher
    which adds new or altered parts.
  • Support for .glox SmartArt package (/src/glox/)
  • JAXB RI 2.2.3 compatibility
  • OpenDoPE support improvements

Where do you get it?

Binaries: You can download a jar alone or a tar.gz with all deps or pick and choose.

Source: Checkout the source from SVN (use the pom.xml file to satisfy the dependencies eg with m2eclipse, or download them from one of the links above)

Maven: Please see forum for details (since XML doesn’t paste nicely here right now).

Dependency changes

Antlr is now required for OpenDoPE processing; this gives us better XPath processing.  The required jars are:

Getting Started

See the “Getting Started” guide.

Thanks to our contributors

A number of contributions have made this release what it is; thanks very much to those who contributed.

Contributors to this release and a more complete list of changes may be found in README.txt

A request to docx4j users

If you are happily using docx4j, it would be great if you could reply to this post with some words of recommendation for others who might be wondering whether docx4j is a good choice. I know there are thousands of you out there :-)

Some users have been kind enough to make such statements already; these may be found on the trac homepage.

Of course, there are a number of other ways you can contribute back.  Please consider doing so, especially if you think you might find yourself looking for support from volunteers in the docx4j forums.