The “Word To XML” servlet

The “Word To XML” servlet is a Java™ Servlet (a standard server-side component) which offers the same functions as the w2x-app desktop application.

Because it’s a server-side component and not a desktop application, please do not attempt to deploy the “Word To XML” servlet if you are an end-user of “Word To XML”. Please ask your IT personnel to do that for you.

Contents of the servlet software distribution

The “Word To XML” servlet comes in a software distribution of its own: w2x_servlet-1_12_0.zip. This distribution contains a ready-to-deploy binary w2x.war, as well as the full Java™ source code of the servlet.

w2x.war
Ready-to-deploy Web application ARchive (WAR) containing the servlet.
src/
src/build.xml
The Java™ source code of the servlet. Run ant in src/ in order to use src/build.xml to rebuild w2x.war.
w2x/
Directory containing unpacked w2x.war. Needed to rebuild w2x.war.
lib/
Contains Java™ libraries needed to rebuild w2x.war.

Installing the servlet

File w2x.war may be easily installed in any servlet container implementing at least the Servlet 2.3 standard. Examples of such servlet containers: Apache Tomcat, Jetty, Caucho Resin.

About Apache Tomcat version 10 and above

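Beware that there is a major breaking change between the latest versions of Apache Tomcat (>= 10) and older versions (<= 9): Tomcat 10 and above implement Jakarta EE, whose APIs use the jakarta.* package namespace instead of javax.*. This is documented in the Apache Tomcat migration guide.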

To make a long story short, if you need to deploy the “Word To XML” servlet on Tomcat version 10+, you must first create a webapps-javaee/ folder next to TOMCAT_INSTALL_DIR/webapps/, then copy w2x.war to this TOMCAT_INSTALL_DIR/webapps-javaee/ folder.
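
The Tomcat 10+ steps described above can be sketched as a small shell function. This is a minimal sketch, not an official procedure: the function name deploy_w2x_legacy and its two arguments are hypothetical, and the Tomcat directory you pass stands for TOMCAT_INSTALL_DIR on your machine.

```shell
#!/bin/sh
# Hypothetical helper: deploy w2x.war on Tomcat 10+ through webapps-javaee/.
# $1: path to w2x.war; $2: Tomcat installation directory (TOMCAT_INSTALL_DIR).
deploy_w2x_legacy() {
    war=$1
    tomcat_dir=$2

    # webapps-javaee/ must be created next to the existing webapps/ folder.
    if [ ! -d "$tomcat_dir/webapps" ]; then
        echo "error: $tomcat_dir/webapps/ not found" >&2
        return 1
    fi
    if [ ! -f "$war" ]; then
        echo "error: $war not found" >&2
        return 1
    fi

    # Create the folder if needed, then copy the WAR into it.
    mkdir -p "$tomcat_dir/webapps-javaee" &&
        cp "$war" "$tomcat_dir/webapps-javaee/"
}
```

A typical call would be deploy_w2x_legacy w2x.war "$TOMCAT_INSTALL_DIR". Depending on how Tomcat is configured, a restart may be needed before the copied file is picked up.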

Though copying file w2x.war to the webapps/ folder of the servlet container and then restarting the servlet container is generally sufficient to deploy the “Word To XML” servlet, please refer to the documentation of your servlet container to learn about the best deployment procedure.

On Windows, the .dll files contained in w2x_servlet_deployment_dir\WEB-INF\lib\ must be copied to a directory referenced by the PATH environment variable of the computer running the servlet.

Configuring the servlet

The “Word To XML” servlet is configured by specifying a number of init-param parameters. These parameters are found in WEB-INF/web.xml, where folder WEB-INF/ is contained in w2x.war.

All these init-param parameters are documented in web.xml. For example, parameter workDir:

<!-- workDir =============================================================
     Uploaded files and files generated during the conversion process 
     are stored in temporary subdirectories of this directory.
     If specified directory does not exist, it will be created.

     Value: this directory and its contents must be readable and writable
     by the operating system account used to run the Word To XML servlet.

     Default: dynamic; supplied by the Servlet Container.
====================================================================== -->

<init-param>
  <param-name>workDir</param-name><param-value></param-value>
</init-param>

Using the servlet to convert DOCX files

Let’s suppose your servlet container runs on host localhost and uses 8080 as its port. In order to use the “Word To XML” servlet, please point your Web browser to http://localhost:8080/w2x/. This will cause the browser to display a page containing a simple DOCX convert form.

The Convert DOCX form (servlet container running on host 192.168.1.202 and using port 8080)

In order to convert a DOCX file to another format:

  1. Click “Choose File” to select the DOCX file to be converted.
  2. Select the desired output format using the “Output format” combobox.
  3. Click “Convert” to download a .zip (or .epub) archive containing the result of the conversion. Generating this .zip (or .epub) file may take several seconds to several minutes depending on the size of the DOCX input file.

If the name of the DOCX input file contains non-ASCII characters (e.g. accented characters), please make sure to use Zip extractor software supporting .zip files having UTF-8 encoded filenames.

Note that most Zip extractor software does not support .zip files having UTF-8 encoded filenames[1]. Such extractors will succeed in unpacking the .zip file, but will generate files having incorrect names.

Non-interactive requests

It’s also possible to use the conversion services of the “Word To XML” servlet by sending an HTTP POST request with multipart/form-data encoding to URL /w2x/convert.

cURL[2] example:

curl -s -S -o manual_docbook5.zip \
  -F "docx=@manual.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
  -F "conv=docbook5" \
  http://localhost:8080/w2x/convert

Another example:

curl -s -S -o manual.epub \
 -F "docx=@manual.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
 -F "conv=epub" \
 -F "params=-p epub.identifier urn:x-mlmind:w2x:manual -p epub.split-before-level 8" \
 http://localhost:8080/w2x/convert

The conversion request has three emulated form fields:

docx
Emulated <input type="file"> field. Required. Contains the DOCX input file.
conv
Emulated <input type="text"> field. Required. Contains the name of a conversion, that is, the value of one of the conversionN.name init-params defined in WEB-INF/web.xml.
The stock WEB-INF/web.xml defines the following conversions to styled HTML:
xhtml_css (single page styled HTML), frameset (multi-page styled HTML, split on Heading 1), frameset2 (multi-page styled HTML, split on Heading 1, 2), frameset3 (multi-page styled HTML, split on Heading 1, 2, 3), webhelp (split on Heading 1), webhelp2 (split on Heading 1, 2), webhelp3 (split on Heading 1, 2, 3), epub (split on Heading 1), epub2 (split on Heading 1, 2), epub3 (split on Heading 1, 2, 3)
and also the following conversions to “semantic” XML:
docbook, docbook5, topic, map, bookmap, xhtml_strict, xhtml_loose, xhtml1_1, xhtml5.
params
Emulated <input type="text"> field. Optional. Contains some w2x command-line options, generally -p parameters. These options are appended to the options of the conversion specified in the conv emulated form field.

The response to a successful conversion request is a .zip (or .epub) archive containing the result of the conversion.
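
For scripted use, the cURL requests above can be wrapped in a small shell function. This is a sketch under stated assumptions: the function name w2x_convert and the W2X_URL variable are hypothetical, and the use of curl's --fail option assumes that the servlet reports a failed conversion with an HTTP error status, which this document does not state.

```shell
#!/bin/sh
# Hypothetical wrapper around the cURL requests shown above.
# Assumption: the servlet is reachable at this URL (set W2X_URL to override).
W2X_URL=${W2X_URL:-http://localhost:8080/w2x/convert}

# Usage: w2x_convert INPUT.docx CONVERSION OUTPUT [PARAMS]
w2x_convert() {
    if [ "$#" -lt 3 ]; then
        echo "usage: w2x_convert INPUT.docx CONVERSION OUTPUT [PARAMS]" >&2
        return 2
    fi
    docx=$1
    conv=$2
    out=$3
    params=${4:-}
    mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document"

    # --fail makes curl exit with a non-zero status on an HTTP error response
    # (assumed here to be how the servlet signals a failed conversion).
    if [ -n "$params" ]; then
        # Send the optional "params" field only when it was supplied.
        curl -s -S --fail -o "$out" \
            -F "docx=@$docx;type=$mime" \
            -F "conv=$conv" \
            -F "params=$params" \
            "$W2X_URL"
    else
        curl -s -S --fail -o "$out" \
            -F "docx=@$docx;type=$mime" \
            -F "conv=$conv" \
            "$W2X_URL"
    fi
}
```

For instance, w2x_convert manual.docx docbook5 manual_docbook5.zip reproduces the first request above, and a fourth argument such as "-p epub.split-before-level 8" is sent as the params field.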


[1] However, “jar xvf converted.zip” works fine. jar is a command-line utility which comes with all Java Development Kits (JDK).

[2] curl is an open source command line tool and library for transferring data with URL syntax.