Going further with w2x

When you execute the following command:

..\..\bin\w2x -o docbook5 manual.docx out\manual.xml

you are in fact executing a sequence of three conversion steps:

  1. Convert the DOCX file to a styled, valid XHTML 1.0 Transitional document that closely resembles the input DOCX file.
  2. Apply a number of XED scripts to this document to convert CSS styles into semantic tags. For example, numbered paragraphs are converted to proper ordered lists.

    The entry point of these “semantic” XED scripts is found in w2x_install_dir/xed/main.xed.

    The XED scripts edit the input XHTML document in place. Therefore, the result of this step is the same XHTML document, still valid, but this time containing no CSS styles whatsoever.

  3. Apply an XSLT 1.0 stylesheet to the unstyled, valid, XHTML 1.0 Transitional document in order to generate the desired semantic XML format.

    The XSLT stylesheets are all found in w2x_install_dir/xslt/. In the above case, we want to generate DocBook v5, so we use w2x_install_dir/xslt/docbook5.xslt.
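
Steps 2 and 3 above can be pictured with a small toy sketch. It is not how w2x is implemented: it uses neither XED nor XSLT, and it assumes step 1 has already produced the styled XHTML. The "numbered" class, the function names and the tag mapping are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of step 1: styled XHTML in which list items are plain
# paragraphs carrying a CSS class (the class name is invented for this sketch).
STYLED_XHTML = (
    '<body>'
    '<p class="numbered" style="margin-left: 2em">First</p>'
    '<p class="numbered" style="margin-left: 2em">Second</p>'
    '<p style="color: red">Plain text</p>'
    '</body>'
)

def edit_in_place(body):
    """Toy 'edit' step: group consecutive numbered paragraphs into an
    <ol>, then drop all styling attributes. Modifies the tree in place."""
    children = list(body)
    for child in children:
        body.remove(child)
    current_list = None
    for child in children:
        if child.tag == "p" and child.get("class") == "numbered":
            if current_list is None:
                current_list = ET.SubElement(body, "ol")
            item = ET.SubElement(current_list, "li")
            ET.SubElement(item, "p").text = child.text
        else:
            current_list = None
            body.append(child)
    for elem in body.iter():
        elem.attrib.pop("style", None)
        elem.attrib.pop("class", None)

# Assumed XHTML-to-DocBook element name mapping, for illustration only.
TAG_MAP = {"body": "section", "ol": "orderedlist", "li": "listitem", "p": "para"}

def transform(elem):
    """Toy 'transform' step: build a new tree with renamed elements."""
    out = ET.Element(TAG_MAP.get(elem.tag, elem.tag))
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(transform(child))
    return out

def run_pipeline(styled_xhtml):
    doc = ET.fromstring(styled_xhtml)   # stands in for the result of step 1
    edit_in_place(doc)                  # step 2: same document, now unstyled
    return ET.tostring(transform(doc), encoding="unicode")  # step 3
```

The point of the sketch is the division of labor: the edit step changes the document's structure but not its vocabulary, while the transform step changes the vocabulary and relies on the structure already being semantic.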

This sequence of conversion steps can be shown in full detail by specifying the -vv option (very verbose):

..\..\bin\w2x -vv -o docbook5 manual.docx out\manual.xml

VERBOSE: Converting "manual.docx" to XHTML...
DEBUG: convert.xhtml-file=C:\w2x-1_12_0\doc\manual\out\manual.xhtml

VERBOSE: Editing XHTML document using "C:\w2x-1_12_0\xed\main.xed"...
DEBUG: edit.xed-url-or-file=file:/C:/w2x-1_12_0/xed/main.xed
DEBUG: Loading script "file:/C:/w2x-1_12_0/xed/main.xed"...
DEBUG: Loading script "file:/C:/w2x-1_12_0/xed/after-translate.xed"...
[...]
DEBUG: Loading script "file:/C:/w2x-1_12_0/xed/before-save.xed"...

VERBOSE: Transforming document using "C:\w2x-1_12_0\xslt\docbook5.xslt" then saving it to "C:\w2x-1_12_0\doc\manual\out\manual.xml"...
DEBUG: transform.out-file=C:\w2x-1_12_0\doc\manual\out\manual.xml transform.xslt-url-or-file=file:/C:/w2x-1_12_0/xslt/docbook5.xslt
[...]

In fact, option -o docbook5 is shorthand for the following w2x command-line options:

If you need to learn about the details of the conversion steps to be executed, the simplest way is to use the -liststeps command-line option.
Example: w2x -o docbook5 -liststeps

The order of the -c, -e and -t options is significant because it means: first convert, then edit, and finally transform. The order of the -p (and -pu) options is not important, because a parameter name must be prefixed by the name of the step to which it applies.
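
The reason parameter order does not matter can be illustrated with a minimal sketch. It assumes that a parameter name is split at its first dot into a step name and a parameter name, which matches names such as convert.xhtml-file seen in the verbose output above, but is not taken from the w2x source code. The function name group_by_step is made up for this example.

```python
def group_by_step(params):
    """Group (name, value) pairs by the step-name prefix of each name.

    Assumption for this sketch: the step name is everything before the
    first dot, e.g. "convert.xhtml-file" -> step "convert".
    """
    grouped = {}
    for name, value in params:
        step, dot, param = name.partition(".")
        if not dot or not step or not param:
            raise ValueError(f"parameter name lacks a step prefix: {name!r}")
        grouped.setdefault(step, {})[param] = value
    return grouped

# Parameter names and values copied from the verbose output above.
PARAMS = [
    ("convert.xhtml-file", r"C:\w2x-1_12_0\doc\manual\out\manual.xhtml"),
    ("edit.xed-url-or-file", "file:/C:/w2x-1_12_0/xed/main.xed"),
    ("transform.out-file", r"C:\w2x-1_12_0\doc\manual\out\manual.xml"),
]
```

Because each parameter names its own step, group_by_step(PARAMS) and group_by_step(list(reversed(PARAMS))) yield the same mapping, whereas the steps themselves still run in the order given on the command line.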

The Convert, Edit and Transform steps are the most important ones. There are other conversion steps, though, all of which are documented in chapter Conversion step reference. Moreover, a Java™ programmer may implement their own custom conversion steps[5] and instruct the w2x command line to give them names (required in order to pass them parameters) and to execute them. See option -step.

A w2x processor executes a sequence of conversion steps whatever the output format. Only the conversion steps themselves, along with their order, number and parameters, depend on the desired output format. This is depicted in the figure below.

[Figure: Anatomy of a w2x processor]

The first sequence in the above figure reads as follows: in order to convert a DOCX file to styled XHTML, first convert the DOCX file to an XHTML+CSS document, then “polish up” this document (e.g. process consecutive paragraphs having identical borders) using XED script w2x_install_dir/xed/main-styled.xed, and finally save the possibly modified XHTML+CSS document to disk.
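
The two step sequences described in this section can be written side by side as plain data, which makes it easy to see what they share and where they differ. This is only an illustrative sketch: the dictionary keys and the "save" step label are names made up here, not w2x option values or step names.

```python
# Step sequences as described in this section. Each entry is
# (step, resource used by the step, or None when the text names none).
# Keys and the "save" label are hypothetical names for this sketch.
STEP_SEQUENCES = {
    "styled-xhtml": [
        ("convert", None),
        ("edit", "w2x_install_dir/xed/main-styled.xed"),
        ("save", None),
    ],
    "docbook5": [
        ("convert", None),
        ("edit", "w2x_install_dir/xed/main.xed"),
        ("transform", "w2x_install_dir/xslt/docbook5.xslt"),
    ],
}

def describe(output_format):
    """Return a one-line, human-readable summary of a step sequence."""
    parts = []
    for step, resource in STEP_SEQUENCES[output_format]:
        parts.append(step if resource is None else f"{step} ({resource})")
    return " -> ".join(parts)
```

Both sequences start with the same convert step; they differ in the XED script used for editing and in whether the document is finally saved as is or transformed by an XSLT stylesheet.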


[5] A custom conversion step derives from abstract class com.xmlmind.w2x.processor.ProcessStep.