Universal Import Service 2.0.0-SNAPSHOT

The Universal Import Service imports data from third-party systems into arveo. Currently, the service supports reading data from CSV files and from the filesystem.

The service is based on Apache Camel and monitors configurable directories for files to import. It provides plugin interfaces for CsvLineMappers and FileMappers. CsvLineMappers are used to create one or more arveo entities from one line in a CSV file. FileMappers do the same for arbitrary files. A highly configurable default mapper implementation for both use cases is provided, which should be sufficient for most import scenarios.

Configuration

The service offers some generic settings that apply to all mapping configurations. Because the service is based on Apache Camel, several configuration options for the Camel components apply.

The Camel routes used to import files take a URI parameter that configures the start of the route. The URI scheme selects the Camel component used at the start of the route, and the URI parameters configure it.

A URI must be configured for each directory to import files from. The keys in the configuration maps define the IDs of the Camel routes. In the following example, three CSV import routes and one file import route are configured.

configuring the URIs for CSV imports
universal-import-service:
  csv:
    csv-import-1:
      uri: "file://${project.build.testOutputDirectory}/csv?antInclude=**/test-Demo.csv&noop=false"
    csv-import-2:
      uri: "file://${project.build.testOutputDirectory}/csv?antInclude=**/test-100.csv&noop=false"
    csv-import-3:
      uri: "file://${project.build.testOutputDirectory}/csv?antInclude=**/test-100_2.csv&noop=false"
configuring the URI for file imports
universal-import-service:
  file:
    file-import-1:
      uri: "file://${project.build.testOutputDirectory}/content?noop=true&moveFailed=.error"

In the example, the file component of Apache Camel is activated by the file: scheme of the URI. The Camel documentation contains information about the available parameters for the file endpoint as well as the other available endpoints.

The way CSV files are processed is controlled by the CSV data format of Apache Camel. It offers various configuration properties that are listed in the Camel documentation. The properties can be used in the configuration file for the Universal Import Service as shown below:

configuring the CSV data format
camel:
  dataformat:
    csv:
      delimiter: ";"

In the example, the delimiter for the CSV columns is set to ;.

The configuration settings for the CSV data format apply to all CSV import routes.

Generic CSV mapper

The generic CSV mapper is the default implementation of the CsvLineMapper interface contained in the service. The mapper maps each column in the CSV file to an attribute of an arveo entity. Content to import is read from a configurable column, which can contain zero or more file names to import. The file names can either contain a fully qualified path or just the name of the file. In the latter case, the directory containing the files with the actual content can be configured (see below).

The mapper offers three different modes:

  • SIMPLE: This is the default. Each line in the CSV file is mapped to one arveo document, which might contain zero or more content elements. The mapping of files to content elements is fixed and can either map file names to content element names or positions in the list of files to content element names.

  • COUNTING: Each line is mapped to one or more arveo documents. Each document contains either one content element or no content at all. A counter can be used as prefix or suffix for any imported attribute to distinguish the documents created for one line.

  • REFERENCING: Each line is mapped as a simple record structure consisting of a container entity containing all attributes and zero or more document components referenced by a foreign key, each containing one content element.

The generic CSV mapper supports individual configurations for each configured import route as shown in the following example:

configuring the generic mapper
universal-import-service:
  csv:
    csv-import-1: (1)
      uri: "file://..."
  generic-csv-mapper:
    settings:
      csv-import-1: (2)
        type-definition-name: "demo_document"
1 The ID of the import route for the configured directory
2 The ID of the configuration for the generic CSV mapper. Must match the ID of the import route.

Attribute mapping

The mapping of CSV columns to arveo attributes works the same in each mode. An attribute mapping must be configured for each CSV column that is supposed to be imported. Attribute mappings are configured in a map, the keys being the names of the columns of the CSV file. An attribute mapping consists of the following parameters:

Table 1. attribute mapping parameters
Parameter          Explanation
attribute-name     The name of the arveo attribute (in snake-case).
type               The type of the attribute (SHORT, INTEGER, LONG, DOUBLE, STRING, BOOLEAN, DATE_TIME, DATE or TIME).
array              Whether the attribute is multivalued or not (the default is false).
delimiter          The delimiter of multivalued attributes. Ignored when array is set to false.
date-pattern       The pattern used to parse attributes of type DATE.
time-pattern       The pattern used to parse attributes of type TIME.
date-time-pattern  The pattern used to parse attributes of type DATE_TIME.
zone-id            The time zone ID used when the value in the CSV column for DATE_TIME attributes does not contain time zone information.
local-date         The local date for a date-time attribute used if the value does not contain a date. Must be in ISO-8601 format such as '2011-12-03'.
local-time         The local time for a date-time attribute used if the value does not contain a time. Must be in ISO-8601 format such as '10:15' or '10:15:30'.
prefix             An optional prefix added to imported attributes of type STRING.
suffix             An optional suffix added to imported attributes of type STRING.
default-value      The default value used when the line does not contain a value for a configured attribute mapping. Must follow the same format rules as the other values in the column.

Attributes are parsed using the default Java parsers, e.g. Integer.parseInt() for INTEGER, or using the supplied patterns for date, time or date-time values. Booleans can either be Strings ('true', 'false') or integers (0,1).
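The parsing rules above can be illustrated with a small Java sketch. It is not the service's implementation: the class and method names are hypothetical, and accepting 'true' and 'false' case-insensitively is an assumption. It shows a boolean that may arrive as a string or as 0/1, and a multivalued INTEGER column that is split by its delimiter before each part is handed to Integer.parseInt().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only, not part of the Universal Import Service.
final class ValueParsingSketch {

    // Accepts 'true'/'false' (case-insensitive, which is an assumption) or 0/1.
    static boolean parseBoolean(String raw) {
        String value = raw.trim();
        if (value.equalsIgnoreCase("true") || value.equals("1")) {
            return true;
        }
        if (value.equalsIgnoreCase("false") || value.equals("0")) {
            return false;
        }
        throw new IllegalArgumentException("Not a boolean value: " + raw);
    }

    // Splits a multivalued column by the given delimiter and parses each part.
    static List<Integer> parseIntegerArray(String raw, String delimiter) {
        List<Integer> values = new ArrayList<>();
        // Pattern.quote treats the delimiter literally, so "|" is not read as a regex.
        for (String part : raw.split(Pattern.quote(delimiter))) {
            values.add(Integer.parseInt(part.trim()));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parseBoolean("1"));
        System.out.println(parseIntegerArray("1|2|3", "|"));
    }
}
```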

The default date-time-pattern used by the importer for date-time attributes ([u-M-d]['T'][H:m:s][X]) allows parsing partial values for date-time attributes. The parts in square brackets (the date, the letter 'T', the time and the zone) are optional. If no value for these parts is contained in the parsed value, it is replaced by the configured default local-date, local-time or zone-id.
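To make the effect of the optional sections concrete, the following Java sketch parses partial values with the default pattern and fills in the missing parts. It assumes that the pattern is interpreted by java.time.format.DateTimeFormatter, which the pattern syntax suggests but which is not stated here. The class name and the parse method are hypothetical; the three default parameters stand in for the configured local-date, local-time and zone-id.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.time.temporal.TemporalQueries;

// Hypothetical illustration only, not the importer's actual code.
final class PartialDateTimeSketch {

    // The default pattern quoted above; every bracketed section is optional.
    private static final DateTimeFormatter PATTERN =
            DateTimeFormatter.ofPattern("[u-M-d]['T'][H:m:s][X]");

    static ZonedDateTime parse(String raw, LocalDate defaultDate,
                               LocalTime defaultTime, ZoneId defaultZone) {
        TemporalAccessor parsed = PATTERN.parse(raw);
        // Each query returns null when the corresponding section was absent.
        LocalDate date = parsed.query(TemporalQueries.localDate());
        LocalTime time = parsed.query(TemporalQueries.localTime());
        ZoneId zone = parsed.query(TemporalQueries.offset());
        return ZonedDateTime.of(
                date != null ? date : defaultDate,
                time != null ? time : defaultTime,
                zone != null ? zone : defaultZone);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2011, 12, 3);
        LocalTime time = LocalTime.of(10, 15);
        ZoneId zone = ZoneId.of("UTC");
        System.out.println(parse("2024-03-05", date, time, zone));
        System.out.println(parse("18:30:00", date, time, zone));
        System.out.println(parse("2024-03-05T18:30:00Z", date, time, zone));
    }
}
```

A value such as 2024-03-05 takes its time and zone from the defaults, while 18:30:00 takes its date and zone from them.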

Attribute mappings are configured for each mapper mode. For example, when the counting mode is used, the attribute mappings would be configured in the setting universal-import-service.generic-csv-mapper.counting.attributes.

The example below shows a mapping configuration for the import of CSV columns called 'sysrowid', 'systimestamp' and 'ispdf'.

attribute mapping
attributes:
  sysrowid:
    attribute-name: "sys_row_id"
    type: STRING
  systimestamp:
    attribute-name: "sys_time_stamp"
    type: DATE_TIME
    date-time-pattern: "u-M-d H:m:s"
    zone-id: "UTC"
  ispdf:
    attribute-name: "pdf"
    type: BOOLEAN
  archive:
    attribute-name: "archive"
    type: STRING
    default-value: "records"
Attributes using a default value do not have to be contained in the CSV file. This makes it possible to add new attributes that were not contained in the original data.

Simple mode

The simple mode is the default operating mode of the generic CSV mapper. In this mode, a 1:1 mapping between file names read from the CSV file and content element names must be configured. The mapping can either be from file name to content element name or from the position of the file name in the list to a content element name. Because the simple mode is the default, it does not have to be explicitly enabled in the configuration.

simple mode configuration
generic-csv-mapper:
  settings:
    csv-import-1: (1)
      type-definition-name: "demo_document" (2)
      simple:
        content:
          csv-field-name: "filename" (3)
          content-path: "${project.build.testOutputDirectory}/content" (4)
          position-mappings:
            0: "content" (5)
1 The name of the import configuration. Must match the name of the configured route (see Configuration)
2 The name of the type definition that will contain the imported documents
3 The name of the field in the CSV file containing the file names
4 The path of the directory that contains the files. In this case, the CSV is expected to contain only the file names.
5 Mapping by position. The first file will be stored in the content element named "content".

A complete configuration example can be found in the system-test module in the file src/test/resource-templates/config/universal-import-service.yaml.

Counting mode

In the counting mode, CSV lines containing more than one file name are mapped to multiple independent document entities. Each document entity will contain one content element. If a line in the CSV file does not contain any file names, one document with no content elements will be created. A counter can be added to the suffix or prefix of any string attribute by using the placeholder ${contentElementNumber}.

counting mode configuration
generic-csv-mapper:
  settings:
    csv-import-2:
      type-definition-name: "document" (1)
      mode: COUNTING (2)
      counting:
        content:
          csv-field-name: "filename" (3)
          delimiter: "," (4)
          content-path: "${project.build.testOutputDirectory}/content" (5)
        attributes:
          xhdoc:
            attribute-name: "xhdoc"
            type: STRING
            suffix: "_${contentElementNumber}" (6)
1 The name of the type definition that will contain the imported documents
2 The counting mode must be enabled explicitly
3 The name of the field in the CSV file containing the file names
4 The delimiter used to separate file names
5 The path of the directory that contains the files. In this case, the CSV is expected to contain only the file names.
6 Adds a suffix with the counter (starting at 1) of the file

A complete configuration example can be found in the system-test module in the file src/test/resource-templates/config/universal-import-service-counting.yaml.
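The effect of the counter placeholder in the counting mode can be sketched in Java as follows. This is a simplified model and not the mapper's code: the class and method names are hypothetical, and numbering the single content-less document of a line without files as 1 is an assumption. For each file name in the line, the suffix template is appended to the attribute value with the placeholder replaced by the counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only, not the generic CSV mapper's code.
final class CountingSuffixSketch {

    // Placeholder name as used in the counting mode configuration.
    static final String PLACEHOLDER = "${contentElementNumber}";

    // Returns the attribute value of each document created for one CSV line.
    static List<String> expand(String value, String suffixTemplate,
                               String fileNames, String delimiter) {
        List<String> files = new ArrayList<>();
        for (String name : fileNames.split(Pattern.quote(delimiter))) {
            if (!name.trim().isEmpty()) {
                files.add(name.trim());
            }
        }
        // Assumption: a line without files still yields one document, numbered 1.
        int documents = Math.max(1, files.size());
        List<String> result = new ArrayList<>();
        for (int counter = 1; counter <= documents; counter++) {
            String suffix = suffixTemplate.replace(PLACEHOLDER, String.valueOf(counter));
            result.add(value + suffix);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("4711", "_${contentElementNumber}", "a.pdf,b.pdf", ","));
    }
}
```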

Referencing mode

In the referencing mode, a record container is created for each imported document. This record will contain all attributes, but no content. For each imported file, a document is created that contains only the data of the imported file. The documents are referenced by a foreign key containing the ID of the record container.

The imported documents do not contain any custom attributes, but arveo’s inheritance feature can be used to automatically inherit attributes from the referenced record.

referencing mode configuration
generic-csv-mapper:
  settings:
    csv-import-3:
      type-definition-name: "component" (1)
      mode: REFERENCING (2)
      referencing:
        container-type-definition-name: "container" (3)
        reference-field-name: "container_id" (4)
        content:
          csv-field-name: "filename" (5)
          delimiter: "," (6)
          content-path: "${project.build.testOutputDirectory}/content" (7)
1 The name of the type definition that will contain the imported documents
2 The referencing mode must be enabled explicitly
3 The name of the type definition containing the record containers
4 The name of the attribute in the documents containing the foreign key
5 The name of the field in the CSV file containing the file names
6 The delimiter used to separate file names
7 The path of the directory that contains the files. In this case, the CSV is expected to contain only the file names.

A complete configuration example can be found in the system-test module in the file src/test/resource-templates/config/universal-import-service-referencing.yaml.

Configuring the foreign key

The foreign key links the document entities to a container entity. As shown above, the reference field name in the document entities must be configured. The field in the container entity referenced by the foreign key is, by default, the system field id. The value will be retrieved automatically by the batch processing API of the content repository service. It is possible to change the name of the referenced field using the property reference-target-field-name. It is also possible to disable the automatic retrieval of the foreign key value by configuring no value for the reference-field-name property. The value must be provided by the attribute mappings in this case.

Document attributes

By default, the document does not have any attributes except the required system fields and the reference field. In some cases, the document type definition might require additional custom fields. Those can be configured by providing additional attribute mappings for the document entities as shown below:

configuring document attribute mappings
generic-csv-mapper:
  settings:
    csv-import-3:
      mode: REFERENCING
      referencing:
        document-attributes:
          docid:
            attribute-name: "doc_id"
            type: STRING
          contentRep:
            attribute-name: "repository_id"
            type: STRING
          componentName:
            attribute-name: "component_id"
            type: STRING
            default-value: "data_${documentNumber}"

The generic CSV mapper provides one placeholder, ${documentNumber}, that contains the document’s number (starting at 1) and that can be used in a prefix, a suffix or a default value.

Content preprocessing

The generic CSV mapper provides an additional extension mechanism that can be used to preprocess the binary content for each individual line in the CSV file before it is handed over to the mapper. This way it is possible to perform tasks like decryption, conversion or merging of content.

The extension mechanism works by registering a bean of type de.eitco.uis.mappers.common.ContentPreprocessor (for example, by creating a custom Spring Boot starter). Each ContentPreprocessor must implement the preprocess method, which has the following parameters:

  • fileNames: A list of file names that was parsed from the current line in the CSV file (never null)

  • csvFilePath: The directory that contains the CSV file (never null)

  • contentPath: The configured path used to find the actual files (might be null)

The ContentPreprocessor returns a list of PreprocessedContent instances, each consisting of an InputStream, the file name that will be used in arveo and the position of the file in the original list of file names. How many PreprocessedContent instances are contained in the returned list depends on the preprocessor. For example, an implementation might take all files and merge them into a single PDF file. Other implementations might just wrap the returned streams for decryption. Custom implementations can extend the class de.eitco.uis.mappers.common.AbstractContentPreprocessor, which contains a utility method to open new FileInputStreams.

It is possible to register several ContentPreprocessor beans. The processor used for a specific import route is selected by calling the processor’s usedForRoute method with the ID of the route. The first processor found that returns true in this method will be used.
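The selection rule can be sketched in Java as follows. The RouteAware interface is a hypothetical stand-in that models only the usedForRoute method named above; the real ContentPreprocessor interface and the order in which the service iterates over the registered beans are not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only, not the service's selection code.
final class RouteSelectionSketch {

    // Stand-in for the part of a preprocessor that is relevant for selection.
    interface RouteAware {
        boolean usedForRoute(String routeId);
    }

    // Returns the first candidate that accepts the route, if there is one.
    static <T extends RouteAware> Optional<T> selectFor(String routeId, List<T> candidates) {
        return candidates.stream()
                .filter(candidate -> candidate.usedForRoute(routeId))
                .findFirst();
    }

    public static void main(String[] args) {
        RouteAware onlyFirstRoute = routeId -> routeId.equals("csv-import-1");
        RouteAware anyCsvRoute = routeId -> routeId.startsWith("csv-import");
        List<RouteAware> candidates = Arrays.asList(onlyFirstRoute, anyCsvRoute);
        // Both candidates accept csv-import-1; the one listed first wins.
        System.out.println(selectFor("csv-import-1", candidates).get() == onlyFirstRoute);
    }
}
```

Because the first match wins, overlapping usedForRoute implementations make the result depend on the iteration order, so it is safest to let each route be accepted by exactly one processor.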

The interfaces to implement as well as the other classes required to implement a ContentPreprocessor are contained in the following artifact:

<dependency>
    <groupId>de.eitco.uis</groupId>
    <artifactId>universal-import-mappers-common</artifactId>
    <version>${import.service.version}</version>
</dependency>

To enable custom extensions, the Jar containing the Spring Boot Starter (and any additional Jar, if required) must be placed in the directory configured as loader path of the service using the -Dloader.path parameter.

Writing a custom line mapper

Custom line mappers have to implement the interface CsvLineMapper. Mappers can use the typed or the generic API of arveo. A mapper that uses the generic API has to return true in the isGeneric method implementation and has to implement the mapLineGeneric method. Typed mappers have to return false in the isGeneric method and have to implement the mapLine method.

The custom mapper implementation has to be registered as a Spring bean in a custom Spring Boot starter. To replace the provided default mapper, either the auto-configuration of the custom starter has to run before the auto-configuration class de.eitco.uis.mappers.csv.GenericCsvMapperAutoConfiguration, or the default mapper bean registrations have to be disabled by setting the property universal-import-service.generic-csv-mapper.enabled to false.

Line mappers return a list of batch operations. The arveo entities created for one line can be created using the respective batch operation(s). The operations will be executed in the order in which they are contained in the list.

The custom mapper can be activated by adding the jar of the custom starter to the service’s libs directory configured by the parameter -Dloader.path.

Generic file mapper

The generic file mapper is the default implementation of the FileMapper interface contained in the service. It can import files from a configurable directory into arveo. Properties of the files, for example parts of the path or file name, can be extracted as attributes of the arveo entities.

The mapper operates in two phases:

  • Phase 1: Collect properties of the file to import. These can be parts of the path, the length or type of the file.

  • Phase 2: Map collected properties to arveo attributes.

Property collection

Which properties to collect is configurable. Properties can be extracted from the path and file name by position or by using a regular expression. The path is split using the system’s path separator. The file name can be split using a configurable separator. The value of each property is a string that can be mapped to an attribute using the attribute mappings described below.

Positional properties

Positional properties are collected from the path or file name, after it is split into an array of strings either by using the path separator or the configured file name separator character. The position can be given as a positive or negative integer. A positive integer (starting at 1) defines the position from left to right. A negative integer (starting at -1) defines the position from right to left. For example, in the path /path/to/my/file.txt, the position 1 would match path and the position -2 would match my. Likewise, when the file name is used to collect a property and the file name separator is configured as _, the position 2 in the file name invoice_20250982.pdf would match 20250982 (the extension is stripped from the file name by default).
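The positional lookup can be sketched in Java as follows. This is a simplified model and not the mapper's code: the class and method names are hypothetical, and dropping the empty segment in front of a leading separator is an assumption made so that position 1 matches path as in the example above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only, not the generic file mapper's code.
final class PositionalPropertySketch {

    // Removes the extension, mirroring the default behaviour for file names.
    static String stripExtension(String name) {
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    // Positive positions count from the left (1 = first part),
    // negative positions count from the right (-1 = last part).
    static Optional<String> partAt(String value, String separator, int position) {
        List<String> parts = Arrays.stream(value.split(Pattern.quote(separator)))
                .filter(part -> !part.isEmpty())
                .collect(Collectors.toList());
        int index = position > 0 ? position - 1 : parts.size() + position;
        if (position == 0 || index < 0 || index >= parts.size()) {
            return Optional.empty();
        }
        return Optional.of(parts.get(index));
    }

    public static void main(String[] args) {
        System.out.println(partAt("/path/to/my/file.txt", "/", 1));
        System.out.println(partAt("/path/to/my/file.txt", "/", -2));
        System.out.println(partAt(stripExtension("invoice_20250982.pdf"), "_", 2));
    }
}
```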

Regular expression properties

Properties can be extracted from the path or file name using a regular expression. The expression can use capturing groups. The number of the group that contains the property can be configured. For example, the regular expression [-,_]([A-Z]{2})$ could be used to parse the language DE as two upper-case characters at the end of the file name EITCO_arveo-secom_Produktportfolio_DE.pdf. The extension is removed from the filename by default before the expression is matched against the file name.
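The example expression can be tried out with plain java.util.regex, as in the sketch below. The class and method names are hypothetical, and the use of Matcher.find() (a partial match rather than a match of the whole name) is an assumption based on the example, in which the expression covers only the end of the file name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only, not the generic file mapper's code.
final class RegexPropertySketch {

    // Returns the content of the given group of the first match, if any.
    static Optional<String> extract(String value, String expression, int group) {
        Matcher matcher = Pattern.compile(expression).matcher(value);
        if (!matcher.find() || group > matcher.groupCount()) {
            return Optional.empty();
        }
        return Optional.ofNullable(matcher.group(group));
    }

    public static void main(String[] args) {
        // File name from the example above, with the extension already removed.
        String name = "EITCO_arveo-secom_Produktportfolio_DE";
        System.out.println(extract(name, "[-,_]([A-Z]{2})$", 1)); // group 1: the language code
        System.out.println(extract(name, "[-,_]([A-Z]{2})$", 0)); // group 0: the entire match
    }
}
```

With group 1, only the two upper-case characters are returned. Group 0 would also include the separator in front of them, which is why the example configuration selects group 1.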

Property mapping

Properties to collect are configured using a map, where the keys are the names of the properties. These names can then be used in the attribute mappings. Each map entry can contain the following settings. Note that only one way to collect a property can be used for each individual property.

Table 2. property mapping options
Option                 Explanation
position-in-path       Position in the path split by path separator char. Positive or negative integer.
position-in-file-name  Position in the file name split by file name separator char. Positive or negative integer.
path-regex             Regular expression matched on the path.
file-name-regex        Regular expression matched on the file name.
remove-extension       Whether to remove the extension from the path or file name. Default is true.

Each regular expression has two settings:

Table 3. regular expression settings
Option      Explanation
expression  The regular expression.
group       The number of the group containing the value (default is 0 for the entire expression).

Predefined properties

The mapper provides some predefined properties that can be used in attribute mappings without additional configuration in the property mappings.

Table 4. predefined properties
Property        Content
_path           Absolute path of the file (string)
_name           File name (string including extension)
_parent         Path of the file’s parent directory (string)
_length         Length of the file (long)
_extension      Extension of the file name only (string)
_last_modified  Last modified timestamp of the file (in ISO-8601 format)
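
As a rough illustration of what the predefined properties contain, the following Java sketch derives comparable values with java.nio.file. It is not the mapper's code: the class and method names are hypothetical, and both the extension without a leading dot and the UTC instant format of the timestamp are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only, not the generic file mapper's code.
final class PredefinedPropertiesSketch {

    // Collects values comparable to the predefined properties for one file.
    static Map<String, Object> collect(Path file) throws IOException {
        Path absolute = file.toAbsolutePath();
        String name = absolute.getFileName().toString();
        int dot = name.lastIndexOf('.');
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("_path", absolute.toString());
        properties.put("_name", name);
        properties.put("_parent", absolute.getParent().toString());
        properties.put("_length", Files.size(absolute));
        // Assumption: the extension is reported without the leading dot.
        properties.put("_extension", dot >= 0 ? name.substring(dot + 1) : "");
        // Instant.toString() yields an ISO-8601 timestamp in UTC.
        properties.put("_last_modified",
                Files.getLastModifiedTime(absolute).toInstant().toString());
        return properties;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("uis-sketch", ".txt");
        Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
        collect(file).forEach((key, value) -> System.out.println(key + " = " + value));
        Files.delete(file);
    }
}
```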

Attribute mapping

The collected properties (and the predefined properties) are mapped to arveo attributes in the attribute mappings. The key of each attribute mapping must be a collected or predefined property name. Each attribute mapping can use the following parameters:

Table 5. attribute mapping parameters
Parameter          Explanation
attribute-name     The name of the arveo attribute (in snake-case).
type               The type of the attribute (SHORT, INTEGER, LONG, DOUBLE, STRING, BOOLEAN, DATE_TIME, DATE or TIME).
array              Whether the attribute is multivalued or not (the default is false).
delimiter          The delimiter of multivalued attributes. Ignored when array is set to false.
date-pattern       The pattern used to parse attributes of type DATE.
time-pattern       The pattern used to parse attributes of type TIME.
date-time-pattern  The pattern used to parse attributes of type DATE_TIME.
zone-id            The time zone ID used when the property value for a DATE_TIME attribute does not contain time zone information.
local-date         The local date for a date-time attribute used if the value does not contain a date. Must be in ISO-8601 format such as '2011-12-03'.
local-time         The local time for a date-time attribute used if the value does not contain a time. Must be in ISO-8601 format such as '10:15' or '10:15:30'.
prefix             An optional prefix added to imported attributes of type STRING.
suffix             An optional suffix added to imported attributes of type STRING.
default-value      The default value used when no value is available for a configured attribute mapping. Must follow the same format rules as regular values of the attribute.

Attributes are parsed using the default Java parsers, e.g. Integer.parseInt() for INTEGER, or using the supplied patterns for date, time or date-time values. Booleans can either be Strings ('true', 'false') or integers (0,1).

The default date-time-pattern used by the importer for date-time attributes ([u-M-d]['T'][H:m:s][X]) allows parsing partial values for date-time attributes. The parts in square brackets (the date, the letter 'T', the time and the zone) are optional. If no value for these parts is contained in the parsed value, it is replaced by the configured default local-date, local-time or zone-id.

Example configuration

The generic file mapper supports individual configurations for each configured import route as shown in the following example:

configuring the generic mapper
universal-import-service:
  file:
    file-import-1: (1)
      uri: "file://..."
  generic-file-mapper:
    settings:
      file-import-1: (2)
        type-definition-name: "files"
1 The ID of the import route for the configured directory
2 The ID of the configuration for the generic file mapper. Must match the ID of the import route.

The following example shows how to configure the generic file mapper. The example collects three properties and maps these and one of the predefined properties to attributes.

example configuration
generic-file-mapper:
  settings:
    file-import-1: (1)
      type-definition-name: "files" (2)
      properties: (3)
        parent-name:
          position-in-path: -2
        drive-name:
          position-in-path: 1
        language:
          file-name-regex:
            expression: "[-,_]([A-Z]{2})$"
            group: 1
      attributes: (4)
        _name:
          attribute-name: file_name
          type: STRING
        parent-name:
          attribute-name: parent_name
          type: STRING
        language:
          attribute-name: language
          type: STRING
        drive-name:
          attribute-name: drive_name
          type: STRING
1 Name of the configuration. Must match the name of the configured import route.
2 Name of the type definition to import into
3 Map of collected properties
4 Mapping from properties to attributes

Extensions

The generic file importer provides an extension mechanism that allows preprocessing of imported files, for example to perform decryption. To use this extension mechanism, a bean of type de.eitco.uis.mappers.common.FilePreprocessor must be registered with a custom Spring Boot Starter. FilePreprocessors can influence the way the actual content of the file is read, as well as properties like file length, name and path. This mechanism can also be used to provide attributes that will be added to the arveo entity created for the file. This makes it possible to implement an import in which the files read from the directory do not contain the actual content to import, but a link to another file containing the actual content and a collection of attributes to import.

A second extension mechanism makes it possible to resolve attributes from arbitrary sources for an imported file. For example, the extension could read a YAML file containing additional attributes. To use this extension mechanism, a bean of type de.eitco.uis.mappers.common.AttributesProvider must be registered. The attributes returned by an AttributesProvider are added after the regular attribute mapping has been performed, so an AttributesProvider can overwrite attributes that were already resolved.

It is possible to register several FilePreprocessor and AttributesProvider beans. The bean used for a specific import route is selected by calling the bean’s usedForRoute method with the ID of the route. The first bean found that returns true in this method will be used.

The required classes for the extensions are contained in the following artifact:

<dependency>
    <groupId>de.eitco.uis</groupId>
    <artifactId>universal-import-mappers-common</artifactId>
    <version>${import.service.version}</version>
</dependency>

To enable custom extensions, the Jar containing the Spring Boot Starter (and any additional Jar, if required) must be placed in the directory configured as loader path of the service using the -Dloader.path parameter.

Writing a custom file mapper

Custom file mappers have to implement the interface FileMapper. Mappers can use the typed or the generic API of arveo. A mapper that uses the generic API has to return true in the isGeneric method implementation and has to implement the mapFileGeneric method. Typed mappers have to return false in the isGeneric method and have to implement the mapFile method.

The custom mapper implementation has to be registered as a Spring bean in a custom Spring Boot starter. To replace the provided default mapper, either the auto-configuration of the custom starter has to run before the auto-configuration class de.eitco.uis.mappers.file.GenericFileMapperAutoConfiguration, or the default mapper bean registration has to be disabled by setting the property universal-import-service.generic-file-mapper.enabled to false.

File mappers return a list of batch operations. The arveo entities created for one file can be created using the respective batch operation(s). The operations will be executed in the order in which they are contained in the list.

The custom mapper can be activated by adding the jar of the custom starter to the service’s libs directory configured by the parameter -Dloader.path.
