About Document Conversion Service

The Document Conversion Service provides a REST API to convert binary data from one format to another. For example, a JPEG image could be converted to a PDF file. Another use case is text extraction from different document types. The actual content conversion is not implemented by the Conversion Service. The service offers an interface for plugins that perform the actual conversion.

Configuration

Several plugins may be able to fulfill a given request, so you can configure which plugin to use. By default, the service selects the first available plugin that claims to support the requested source and target mimetypes.

To configure this, specify a list of use cases in your application.yaml under the key rendering (for render plugins) or under the key extraction (for fulltext plugins).

Use cases for render plugins

Each use case consists of

sourceType
- a regular expression matching the mimetype to be converted
targetType
- a regular expression matching the mimetype to be converted to
plugin
- the name of the plugin to use (by default the fully qualified class name, but can be specified by each plugin)

For a given request, the service searches the list in the given order and uses the first plugin whose configured sourceType and targetType match those of the request.

Example of a rendering configuration

rendering:
  useCases:
    - sourceType: application/pdf
      targetType: image/jpe?g
      plugin: my-pdf-to-image-plugin
    - sourceType: .*
      targetType: .*
      plugin: my-fallback-plugin

The configuration above instructs the service to use the plugin named my-pdf-to-image-plugin for requests to render PDFs to image/jpg or image/jpeg. In any other case it uses the plugin my-fallback-plugin. Note that if my-fallback-plugin were listed before the other entry, every request would be handled by my-fallback-plugin.
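
The first-match lookup described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the class UseCaseSelectionSketch and its nested UseCase type are hypothetical names, and the sketch assumes that each regular expression must match the whole mimetype.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the service's actual implementation. */
final class UseCaseSelectionSketch {

    /** Hypothetical stand-in for one entry of the rendering use case list. */
    static final class UseCase {
        private final Pattern sourceType;
        private final Pattern targetType;
        private final String plugin;

        UseCase(String sourceType, String targetType, String plugin) {
            this.sourceType = Pattern.compile(sourceType);
            this.targetType = Pattern.compile(targetType);
            this.plugin = plugin;
        }

        /** Assumption: the expressions have to match the complete mimetype. */
        boolean matches(String source, String target) {
            return sourceType.matcher(source).matches()
                    && targetType.matcher(target).matches();
        }
    }

    /** Returns the plugin of the first matching use case, honoring list order. */
    static Optional<String> selectPlugin(List<UseCase> useCases, String source, String target) {
        return useCases.stream()
                .filter(useCase -> useCase.matches(source, target))
                .map(useCase -> useCase.plugin)
                .findFirst();
    }

    public static void main(String[] args) {
        List<UseCase> useCases = List.of(
                new UseCase("application/pdf", "image/jpe?g", "my-pdf-to-image-plugin"),
                new UseCase(".*", ".*", "my-fallback-plugin"));

        System.out.println(selectPlugin(useCases, "application/pdf", "image/jpeg"));
        System.out.println(selectPlugin(useCases, "text/plain", "application/pdf"));
    }
}
```

Because sourceType and targetType are regular expressions, characters such as + have a special meaning. A literal mimetype like application/xhtml+xml therefore has to be written as application/xhtml\+xml.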

Use cases for extraction plugins

Each use case consists of

sourceType
- a regular expression matching the mimetype to extract text from
plugin
- the name of the plugin to use (by default the fully qualified class name, but can be specified by each plugin)

For a given request, the service searches the list in the given order and uses the first plugin whose configured sourceType matches that of the request.

Example of an extraction configuration

extraction:
  useCases:
    - sourceType: application/pdf
      plugin: de.eitco.commons.conversion.plugins.oss.TikaFulltextOcrExtractionPlugin

The configuration above instructs the service to use the plugin de.eitco.commons.conversion.plugins.oss.TikaFulltextOcrExtractionPlugin for requests to extract text from PDF files.

General configuration settings

Limiting the size of uploaded data

To limit the size of the uploaded data to be rendered, the following configuration properties can be used:

server:
  undertow:
    max-http-post-size: 30MB
spring:
  servlet:
    multipart:
      max-file-size: 30MB
      max-request-size: 30MB
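
A client can avoid a rejected upload by checking the content size before sending it. The following is a minimal sketch under stated assumptions: UploadLimitSketch and its methods are hypothetical helpers, not part of the service or its client API, and the sketch assumes binary units (1MB = 1024 * 1024 bytes) and an inclusive limit.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical client-side helper; not part of the Document Conversion Service. */
final class UploadLimitSketch {

    private static final Pattern SIZE = Pattern.compile("(\\d+)\\s*(B|KB|MB|GB)");

    /** Parses values such as "30MB" into bytes, assuming binary units. */
    static long toBytes(String size) {
        Matcher matcher = SIZE.matcher(size.trim().toUpperCase(Locale.ROOT));
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unsupported size: " + size);
        }
        long value = Long.parseLong(matcher.group(1));
        switch (matcher.group(2)) {
            case "KB": return value * 1024L;
            case "MB": return value * 1024L * 1024L;
            case "GB": return value * 1024L * 1024L * 1024L;
            default:   return value; // plain bytes
        }
    }

    /** Assumption: content exactly as large as the limit is still accepted. */
    static boolean fitsLimit(long contentLength, String limit) {
        return contentLength <= toBytes(limit);
    }

    public static void main(String[] args) {
        long contentLength = 12L * 1024L * 1024L; // example: a 12 MB document
        System.out.println(fitsLimit(contentLength, "30MB"));
    }
}
```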

Plugins Overview

The service provides two plugin interfaces: RenderPlugin and FulltextPlugin. A RenderPlugin converts from one document type to another. It defines a list of supported source mimetypes and a list of supported target mimetypes. A FulltextPlugin can extract text from a document. It defines a list of supported source mimetypes. The target mimetype is always text/plain.

Open source plugins

The document-conversion-plugins-oss library provides several plugins based on open source libraries. To use the plugins, the jar file of the library must be added to the classpath of the service. The plugins will be registered automatically.

A ZIP containing the library and additional dependencies can be downloaded from Nexus using the following Maven coordinates:

<dependency>
    <groupId>de.eitco.commons</groupId>
    <artifactId>document-conversion-plugins-oss</artifactId>
    <type>zip</type>
    <classifier>zip</classifier>
    <version>1.0.0</version>
</dependency>

Plugins contained in document-conversion-plugins-oss

de.eitco.commons.conversion.plugins.oss.OpenHtmlRenderPlugin

Renders from XHTML to PDF by using com.openhtmltopdf.

Source media types	Target media types
application/xhtml+xml	application/pdf

de.eitco.commons.conversion.plugins.oss.PdfToMultiPageTiffRenderPlugin

Renders PDF to multi-page TIFF using Apache PDFBox.

Source media types	Target media types
application/pdf	image/tiff

de.eitco.commons.conversion.plugins.oss.PdfToImagesRenderPlugin

Renders PDF to JPG, PNG, GIF or TIFF using Apache PDFBox.

Source media types	Target media types
application/pdf	image/jpeg image/png image/gif image/tiff

de.eitco.commons.conversion.plugins.oss.ImagesToPdfRenderPlugin

Renders JPG, PNG, GIF and BMP images to PDF using Apache PDFBox.

Source media types	Target media types
image/jpeg image/png image/gif image/bmp	application/pdf

de.eitco.commons.conversion.plugins.oss.PdfMergingContainerPlugin

Merges PDF files.

Source media types	Target media types
application/pdf	application/pdf

de.eitco.commons.conversion.plugins.oss.OpenPdfRenderPlugin

Renders plain text to PDF using OpenPDF.

Source media types	Target media types
text/plain	application/pdf

de.eitco.commons.conversion.plugins.oss.TiffToPdfRenderPlugin

Renders TIFF to PDF using iText.

Source media types	Target media types
image/tiff	application/pdf

de.eitco.commons.conversion.plugins.oss.TikaFulltextOcrExtractionPlugin

Extracts text by using Apache Tika and Tesseract OCR.

Source media types	Target media types
application/pdf application/xml text/html application/msword application/vnd.openxmlformats-officedocument.wordprocessingml.document application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet application/vnd.ms-powerpoint application/vnd.openxmlformats-officedocument.presentationml.presentation application/epub+zip application/vnd.ms-outlook application/rtf application/vnd.oasis.opendocument.presentation application/vnd.oasis.opendocument.spreadsheet application/vnd.oasis.opendocument.text text/plain	text/plain

de.eitco.commons.conversion.plugins.oss.TikaFulltextExtractionPlugin

Extracts text by using Apache Tika.

Source media types	Target media types
application/pdf application/xml text/html application/msword application/vnd.openxmlformats-officedocument.wordprocessingml.document application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet application/vnd.ms-powerpoint application/vnd.openxmlformats-officedocument.presentationml.presentation application/epub+zip application/vnd.ms-outlook application/rtf application/vnd.oasis.opendocument.presentation application/vnd.oasis.opendocument.spreadsheet application/vnd.oasis.opendocument.text text/plain	text/plain

de.eitco.commons.conversion.plugins.oss.MsgAttachmentsExtractionPlugin

Extracts attachments from Outlook Message files and renders them to a single PDF.

Source media types	Target media types
application/vnd.ms-outlook	application/pdf

The OpenHtmlRenderPlugin cannot process XML structures that contain unclosed tags. Such input causes an exception.

Fulltext plugins

The following fulltext extraction plugins exist:

de.eitco.commons.conversion.plugins.oss.TikaFulltextExtractionPlugin
- extracts text from pdf, doc(x), xls(x), epub, html, msg, odp, ods, odt, pptx, rtf and xml files using open source Java libraries (namely Apache Tika)
de.eitco.commons.conversion.plugins.oss.TikaFulltextOcrExtractionPlugin
- extracts text with OCR from pdf, doc(x), xls(x), epub, html, msg, odp, ods, odt, pptx, rtf and xml files using open source Java libraries (namely Apache Tika, with Tesseract for OCR)

Enabling Tesseract OCR:

If you want to use the TikaFulltextOcrExtractionPlugin to extract text from images, you need to install Tesseract. The following steps explain the installation.

Download and install Tesseract from https://github.com/tesseract-ocr/tessdoc/blob/main/Downloads.md.
- On Ubuntu, for example: sudo apt install tesseract-ocr
Add Tesseract to the PATH environment variable. On Windows, for example, add the following entries to the path:
- {path to tesseract}\Tesseract-OCR
- {path to tesseract}\Tesseract-OCR\tessdata
Add any additional language packages to the Tesseract tessdata directory. The packages can be downloaded via https://ocrmypdf.readthedocs.io/en/latest/languages.html
- The default language is English (Tesseract short name: eng)
- If you add a new language, you must also add it in the YAML configuration
- You can also change the DPI used for Tesseract image extraction

Example of configuring English and German for Tesseract and changing the DPI

extraction:
  tikaOcrLanguage: "eng+deu"
  tikaOcrDpi: 300
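
Before starting the service, it can be useful to verify that a language package exists for every language configured in tikaOcrLanguage. The following sketch relies on Tesseract's convention of naming language files <lang>.traineddata. The class TessdataCheckSketch is a hypothetical helper, not part of the service, and the tessdata location passed to it is an assumption about your installation.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check; not part of the Document Conversion Service. */
final class TessdataCheckSketch {

    /**
     * Returns the languages of a value such as "eng+deu" for which
     * no <lang>.traineddata file exists in the given tessdata directory.
     */
    static List<String> missingLanguages(Path tessdataDir, String tikaOcrLanguage) {
        List<String> missing = new ArrayList<>();
        for (String language : tikaOcrLanguage.split("\\+")) {
            if (!Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"))) {
                missing.add(language);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the tessdata directory is passed as the first argument.
        Path tessdataDir = Path.of(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Missing language packages: "
                + missingLanguages(tessdataDir, "eng+deu"));
    }
}
```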

Levigo Jadice plugins

The document-conversion-plugins-jadice library provides several plugins based on Levigo Jadice libraries. To use the plugins, the jar file of the library must be added to the classpath of the service. The plugins will be registered automatically.

A ZIP containing the library and additional dependencies can be downloaded from Nexus using the following Maven coordinates:

<dependency>
    <groupId>de.eitco.commons</groupId>
    <artifactId>document-conversion-plugins-jadice</artifactId>
    <type>zip</type>
    <classifier>zip</classifier>
    <version>1.0.0</version>
</dependency>

Plugins contained in document-conversion-plugins-jadice

de.eitco.commons.conversion.plugins.jadice.JadiceToPdfPlugin

Renders various formats to PDF using Jadice.

Source media types	Target media types
application/msword application/vnd.openxmlformats-officedocument.wordprocessingml.document application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet application/vnd.ms-powerpoint application/vnd.openxmlformats-officedocument.presentationml.presentation text/plain message/rfc822 application/xhtml+xml text/html image/jpeg image/gif image/png image/bmp	application/pdf

Please note that Microsoft Office documents (PowerPoint and Word) are rendered by this plugin without formatting-related issues.

Configuration considerations of Levigo Jadice usage

A RenditionUseCase describes the source and target mimetype of a conversion supported by a specific plugin. The configuration below shows how the conversion from the source mimetype 'application/msword' to the target mimetype 'application/pdf' is assigned to the JadiceToPdfPlugin.

rendering:
  useCases:
    - plugin: de.eitco.commons.conversion.plugins.jadice.JadiceToPdfPlugin
      sourceType: "application/msword"
      targetType: "application/pdf"
  containerUseCases:
    - targetType: "application/pdf"
      plugin: de.eitco.commons.conversion.plugins.jadice.JadiceToPdfPlugin

Microsoft Graph plugins

The document-conversion-plugins-ms-graph library provides several plugins based on Microsoft Graph libraries. To use the plugins, the jar file of the library must be added to the classpath of the service. The plugins will be registered automatically.

If you want to use this plugin, you need a Microsoft 365 account that can use SharePoint and Azure.

A ZIP containing the library and additional dependencies can be downloaded from Nexus using the following Maven coordinates:

<dependency>
    <groupId>de.eitco.commons</groupId>
    <artifactId>document-conversion-plugins-ms-graph</artifactId>
    <type>zip</type>
    <classifier>zip</classifier>
    <version>1.0.0</version>
</dependency>

Plugins contained in document-conversion-plugins-ms-graph

de.eitco.commons.conversion.plugins.msgraph.GraphRenderPlugin

Renders Microsoft Office documents and some other formats to PDF using the Microsoft Graph API.

Source media types	Target media types
application/msword application/vnd.openxmlformats-officedocument.wordprocessingml.document application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet application/vnd.ms-powerpoint application/vnd.openxmlformats-officedocument.presentationml.presentation application/vnd.ms-outlook application/rtf text/html application/xhtml+xml application/vnd.oasis.opendocument.presentation application/vnd.oasis.opendocument.spreadsheet application/vnd.oasis.opendocument.text message/rfc822 image/tiff	application/pdf

Setting up the Microsoft Graph render plugin

To use the Graph render plugin, you need a technical user and a SharePoint drive. In Azure, a technical user is called an "app registration". The following explains how to create an Azure app registration and how to obtain the SharePoint drive ID.

App Registration

Log in to https://portal.azure.com/ and search for App registrations. The following window appears, where you can create a new app registration:

Figure 1. Azure - App-Registration

When you open your new app registration, the overview shows the client ID and the tenant ID.

Figure 2. App-Registration details

Next, go to the API permissions. The minimum permission required for the Microsoft Graph render plugin is "Files.ReadWrite.All".

Figure 3. App-Registration api authorization

You can also configure the authentication settings of the app registration. It can be allowed to work in every Azure tenant or only in your own tenant.

Figure 4. App-Registration authentication

Finally, you need a client secret, which can be generated in the following window.

Figure 5. App-Registration secrets

SharePoint

First, create a SharePoint site and a drive. Once they exist, you need the SharePoint drive ID. The following explains how to obtain it.

You need a SharePoint team site. After creating the site, you can optionally create a new document library.

The Microsoft Graph Explorer is a helpful tool for sending the following requests.

If you use this tool, you must be logged in with a user account and grant the required permissions under "Modify permissions".

Get sharepoint site id
GET https://graph.microsoft.com/v1.0/sites/root:/sites/{site_url_name}
Example https://graph.microsoft.com/v1.0/sites/root:/sites/xxxxxxx

Response:

{
    "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#sites/$entity",
    "id": "xxxxxxxx.sharepoint.com,0bbbfad6-xxxx-xxxx-xxxx-xxxxxxxxxxxx,3c5f2d82-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "name": "xxxxxxxx",
    "displayName": "arveo"
}

Get sharepoint drive id
GET https://graph.microsoft.com/v1.0/sites/{site_id}/drives
Example https://graph.microsoft.com/v1.0/sites/xxxxxxxx.sharepoint.com,0bbbfad6-xxxx-xxxx-xxxx-xxxxxxxxxxxx,3c5f2d82-xxxx-xxxx-xxxx-xxxxxxxxxxxx/drives

Response:

{
    "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#drives",
    "value": [
        {
            "id": "b!XXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
            "name": "Conversion"
        }
    ]
}
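
The two lookups above can also be sent from code. The following sketch uses the JDK's java.net.http client. It is an illustration under stated assumptions: GraphLookupSketch is a hypothetical class, and the access token for the Microsoft Graph API has to be obtained separately (for example via the app registration), which is not shown here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper for looking up the site ID and the drives of a site. */
final class GraphLookupSketch {

    private static final String GRAPH = "https://graph.microsoft.com/v1.0";

    /** URI of the request that returns the SharePoint site, including its ID. */
    static URI siteUri(String siteUrlName) {
        return URI.create(GRAPH + "/sites/root:/sites/" + siteUrlName);
    }

    /** URI of the request that lists the drives of the given site. */
    static URI drivesUri(String siteId) {
        return URI.create(GRAPH + "/sites/" + siteId + "/drives");
    }

    /** Builds a GET request that authenticates with a bearer token. */
    static HttpRequest get(URI uri, String accessToken) {
        return HttpRequest.newBuilder(uri)
                .header("Authorization", "Bearer " + accessToken)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: GraphLookupSketch <site_url_name> <access_token>");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> site = client.send(
                get(siteUri(args[0]), args[1]), HttpResponse.BodyHandlers.ofString());
        // The "id" field of this JSON response is the site ID for drivesUri(...).
        System.out.println(site.body());
    }
}
```

The id of the desired entry in the value array of the drives response is the value to configure as the drive ID.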

YAML configuration

In the YAML configuration of the Document Conversion Service, you must set the following to use the Graph render plugin:

graph:
    clientId: {app_registration_client_id}
    clientSecret: {app_registration_client_secret}
    tenantGuid: {app_registration_tenant_id}
    driveId: {sharepoint_drive_id}
    timeout: 120s # Default: 60s

Microsoft Azure Plugins

The document-conversion-plugins-ms-azure library provides several plugins based on Microsoft Azure libraries. To use the plugins, the jar file of the library must be added to the classpath of the service. The plugins will be registered automatically.

A ZIP containing the library and additional dependencies can be downloaded from Nexus using the following Maven coordinates:

<dependency>
    <groupId>de.eitco.commons</groupId>
    <artifactId>document-conversion-plugins-ms-azure</artifactId>
    <type>zip</type>
    <classifier>zip</classifier>
    <version>1.0.0</version>
</dependency>

Plugins contained in document-conversion-plugins-ms-azure

de.eitco.commons.conversion.plugins.msazure.AzureCognitiveOcrExtractionPlugin

Extracts text from images and PDFs using Azure Cognitive Services.

Source media types	Target media types
image/jpeg image/png image/bmp application/pdf image/tiff	text/plain

Setting up the Microsoft Azure fulltext plugin

To use the Azure fulltext plugin, you need a Cognitive Services resource in your Azure portal.

Setting up the cognitive service

Log in to https://portal.azure.com/ and search for Cognitive Services (Computer Vision). The following window appears, where you can create a new Computer Vision service:

Figure 6. Azure - Create computer vision service

After creating the Computer Vision service, you need its endpoint, which is shown in the overview of the service.

Figure 7. Computer vision service overview

Now you need an access key of your computer vision service.

Figure 8. App-Registration api authorization

YAML configuration

In the YAML configuration of the Document Conversion Service, you must set the following to use the Azure fulltext plugin:

azure:
  cognitive:
    key: "XXX"
    endpoint: "XXX"
    ocrDetectionLanguage: "de"
    modelVersion: "latest"

The default value of "ocrDetectionLanguage" is de (German). The only other supported language is English, which you select by setting the value to "en".

For the available model versions, see the Azure Computer Vision documentation.

Amazon AWS plugins

The document-conversion-plugins-amazon-aws library provides several plugins based on Amazon AWS. To use the plugins, the jar file of the library must be added to the classpath of the service. The plugins will be registered automatically.

If you want to use this plugin, you need an Amazon AWS account (also named IAM account).

A ZIP containing the library and additional dependencies can be downloaded from Nexus using the following Maven coordinates:

<dependency>
    <groupId>de.eitco.commons</groupId>
    <artifactId>document-conversion-plugins-amazon-aws</artifactId>
    <type>zip</type>
    <classifier>zip</classifier>
    <version>1.0.0</version>
</dependency>

Plugins contained in document-conversion-plugins-amazon-aws

de.eitco.commons.conversion.plugins.aws.AwsTextractPdfToSearchablePdfPlugin

Generates a searchable PDF from a scanned PDF using AWS Textract.

Source media types	Target media types
application/pdf	application/pdf

The AwsTextractPdfToSearchablePdfPlugin depends on the PdfToImagesRenderPlugin (de.eitco.commons.conversion.plugins.oss).

Setup IAM account and credentials

If you do not want to create a separate technical user for this plugin, skip step 1 and continue with step 2 (Add Permission).

1 Create User: Log in to the AWS Console and search for AWS IAM. Here you can create a new user with the Add users button.

Figure 9. AWS IAM create user

2 Add Permission: After you create a new user you must add permissions. To do that, go to the newly created user account.

Figure 10. AWS IAM add permissions

Click on Add permission, then on Attach existing policies and add the AmazonTextractFullAccess permission to the user.

Figure 11. AWS IAM add permissions for textract

3 Create Access Key: Now you need to create an access key. Switch to the tab Security credentials.

Figure 12. AWS IAM security credentials

The access keys are listed here. Create a new access key.

Figure 13. AWS IAM add create access key

YAML configuration

For valid region values, see the AWS region list.

In the YAML configuration of the Document Conversion Service, you must set the following to use the AWS Textract render plugin:

aws:
    accessKey: "XXXXXXXXXXXXXXXXXXXX"
    secretKey: "X1XX2XXXXXXXXXXX3XXXXXXXX4XXXXXXX567XXX8"
    region: "EU_WEST_1"

e-iceblue plugins

If you want to use the e-iceblue plugins, you must have an e-iceblue license.

The document-conversion-plugins-e-iceblue library provides several plugins based on e-iceblue libraries. To use the plugins, the jar file of the library must be added to the classpath of the service. The plugins will be registered automatically.

A ZIP containing the library and additional dependencies can be downloaded from Nexus using the following Maven coordinates:

<dependency>
    <groupId>de.eitco.commons</groupId>
    <artifactId>document-conversion-plugins-e-iceblue</artifactId>
    <type>zip</type>
    <classifier>zip</classifier>
    <version>1.0.0</version>
</dependency>

Plugins contained in document-conversion-plugins-e-iceblue

de.eitco.commons.conversion.plugins.eiceblue.EIcebluePdfToPdfAPlugin

Converts a PDF to a PDF/A variant (for example PDF/A-1a) using e-iceblue.

Source media types	Target media types
application/pdf	application/pdf

YAML configuration

In the YAML configuration of the Document Conversion Service, you must set the following to use the e-iceblue plugin:

e-iceblue:
  pdfType: "PdfA1A"
  license: ""

iText Plugins

If you want to use the iText Plugins, you must have an iText license.

The document-conversion-plugins-itext library provides several plugins based on iText libraries. To use the plugins, the jar file of the library must be added to the classpath of the service. The plugins will be registered automatically.

A ZIP containing the library and additional dependencies can be downloaded from Nexus using the following Maven coordinates:

<dependency>
    <groupId>de.eitco.commons</groupId>
    <artifactId>document-conversion-plugins-itext</artifactId>
    <type>zip</type>
    <classifier>zip</classifier>
    <version>1.0.0</version>
</dependency>

Plugins contained in document-conversion-plugins-itext

de.eitco.commons.conversion.plugins.itext.ITextPdfToSearchablePdfaPlugin

Generates a searchable PDF/A from a scanned PDF using iText.

Source media types	Target media types
application/pdf	application/pdf

The ITextPdfToSearchablePdfaPlugin depends on the PdfToImagesRenderPlugin (de.eitco.commons.conversion.plugins.oss).

Installing Tesseract

If you want to use the ITextPdfToSearchablePdfaPlugin, you need to install Tesseract on your operating system.

Download and install Tesseract from https://github.com/tesseract-ocr/tessdoc/blob/main/Downloads.md.
- On Ubuntu, for example: sudo apt install tesseract-ocr
Add Tesseract to the PATH environment variable. On Windows, for example, add the following entry to the path:
- {path to tesseract}\Tesseract-OCR\tessdata

YAML configuration

In the yaml of the document conversion service you must set the following:

itext:
  pathToTessData: "{path to tesseract}/Tesseract-OCR/tessdata"
  pdfLang: "de"
  iTextLicensePath: "{path to itext license}/itextkey.json"

Usage

The following examples show how to perform different conversions using the client API of the conversion service. The following dependency is required to get access to the http client API:

<dependency>
  <groupId>de.eitco.commons</groupId>
  <artifactId>document-conversion-http-client-spring-boot-starter</artifactId>
  <version>${project.version}</version>
</dependency>

The client API instances can be obtained using injectable factory classes.

@Autowired
private FulltextResourceClientFactory fulltextClientFactory;
@Autowired
private DocumentConversionResourceClientFactory conversionClientFactory;

The utility class de.eitco.commons.io.ContentAnalyzer contained in cmn-commons-io can be used to determine the mime type of a file.
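
If cmn-commons-io is not available, the JDK offers a simple name-based guess that may be sufficient for well-known file extensions. The following sketch is an alternative technique, not the ContentAnalyzer itself: MimeTypeGuessSketch is a hypothetical helper, and the guess only looks at the file name, never at the file content.

```java
import java.net.URLConnection;

/** Hypothetical helper; guesses a mime type from the file name only. */
final class MimeTypeGuessSketch {

    /** Returns the guessed mime type, or application/octet-stream if unknown. */
    static String guessMimeType(String fileName) {
        String guess = URLConnection.guessContentTypeFromName(fileName);
        return guess != null ? guess : "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(guessMimeType("jpgSample.jpg"));
    }
}
```

A name-based guess can be wrong for renamed files, so content-based detection remains the more reliable choice.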

Converting an image to PDF

This example shows how to convert an image of type JPEG to PDF.

DocumentConversionResourceClient client = conversionClientFactory.newClient(); // (1)
File jpeg = new File("src/test/data/source/source_inputstreamlist_images/jpgSample.jpg");

// try-with-resources closes the source, the rendition and the target file
try (FileInputStream inputStream = new FileInputStream(jpeg);
     InputStream rendition = client.render(MediaType.IMAGE_JPEG_VALUE, MediaType.APPLICATION_PDF_VALUE, inputStream); // (2)
     FileOutputStream out = new FileOutputStream("target/rendition.pdf")) {
    IOUtils.copy(rendition, out); // (3)
}

1	Creates a new client instance using the DocumentConversionResourceClientFactory
2	Sends a request for the rendition to the service. The source- and target-mimetypes are strings and can be obtained from any utility class containing standard mime type strings.
3	Saves the rendition to a file using Apache Commons IO IOUtils

Extracting text from a PDF file

This example shows how the API can be used to extract text content from a PDF file.

FulltextResourceClient client = fulltextClientFactory.newClient(); // (1)
File pdf = new File("src/test/fulltext-data/source/test.pdf");

try (FileInputStream inputStream = new FileInputStream(pdf)) {
    String text = client.extractText(MediaType.APPLICATION_PDF_VALUE, inputStream); // (2)
}

1	Creates a new client instance using the FulltextResourceClientFactory
2	Sends a request for the extraction to the service. The source-mimetype is a string and can be obtained from any utility class containing standard mime type strings.

Combining multiple images to one PDF

This example shows how to combine multiple images into a single PDF file.

DocumentConversionResourceClient client = conversionClientFactory.newClient(); // (1)

File image1 = new File("src/test/data/source/source_inputstreamlist_images/jpgSample.jpg");
File image2 = new File("src/test/data/source/source_inputstreamlist_images/pngSample2.png");

// try-with-resources closes both streams, even if opening the second one fails
try (FileInputStream fileInputStream1 = new FileInputStream(image1);
     FileInputStream fileInputStream2 = new FileInputStream(image2)) {

    List<ConversionMultipartBodyElement> inputStreamAndMediaTypeList = new ArrayList<>();
    inputStreamAndMediaTypeList.add(new ConversionMultipartBodyElement(MediaType.IMAGE_JPEG_VALUE, fileInputStream1)); // (2)
    inputStreamAndMediaTypeList.add(new ConversionMultipartBodyElement(MediaType.IMAGE_PNG_VALUE, fileInputStream2));

    InputStream combinedPdf = client.combineToPdf(inputStreamAndMediaTypeList); // (3)
}

1	Creates a new client instance using the DocumentConversionResourceClientFactory
2	Creates the body elements for the multipart request that will be sent to the service
3	Sends the request to the service

Monitoring

Metrics

The Document Conversion Service provides metrics that can be used to monitor the performance. The following metrics are available:

dcs.render.time: Records the time it took to perform one rendering request as well as a counter for the number of performed requests.
dcs.render.errors: A counter for errors that occurred while rendering.
dcs.extract.time: Records the time it took to perform one fulltext extraction request as well as a counter for the number of performed requests.
dcs.extract.errors: A counter for errors that occurred while performing fulltext extractions.

Each of these metrics carries tags for the source and (if applicable) the target mime type in standard string representation (for example: image/jpeg) and for the size of the original content. The size tag uses size ranges: 0-1MB, 1-10MB, 10-100MB, 100-1000MB and >1000MB.
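
The mapping from a content size to one of these ranges can be sketched as follows. This is an illustration, not the service's code: SizeRangeSketch is a hypothetical class, and the sketch assumes binary megabytes (1MB = 1024 * 1024 bytes) and inclusive upper bounds, neither of which is specified above.

```java
/** Hypothetical illustration of the size ranges used as metric tag values. */
final class SizeRangeSketch {

    private static final long MB = 1024L * 1024L; // assumption: binary megabytes

    /** Maps a content size in bytes to its range; upper bounds are assumed inclusive. */
    static String sizeRange(long bytes) {
        if (bytes <= 1 * MB) {
            return "0-1MB";
        }
        if (bytes <= 10 * MB) {
            return "1-10MB";
        }
        if (bytes <= 100 * MB) {
            return "10-100MB";
        }
        if (bytes <= 1000 * MB) {
            return "100-1000MB";
        }
        return ">1000MB";
    }

    public static void main(String[] args) {
        System.out.println(sizeRange(5L * MB));
    }
}
```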

The recording of those metrics can be disabled by setting the parameters management.metrics.enable.dcs.render or management.metrics.enable.dcs.extract to false.

OpenTelemetry

The Document Conversion Service supports OpenTelemetry. Spans are created for the methods of each render plugin. The outermost span contains the ID and the tenant (if available) of the user who performed the request.