Universal Import Service 2.0.0-SNAPSHOT
The Universal Import Service is used to import data into arveo from third party systems. Currently, the service supports reading data from CSV files and from the filesystem.
The service is based on Apache Camel and monitors configurable directories for files to import. It provides plugin interfaces for CsvLineMappers and FileMappers. CsvLineMappers are used to create one or more arveo entities from one line in a CSV file. FileMappers do the same for arbitrary files. A highly configurable default mapper implementation for both use cases is provided, which should be sufficient for most import scenarios.
Configuration
The service offers some generic settings that apply to all mapping configurations. Because the service is based on Apache Camel, several configuration options for the Camel components apply.
The Camel routes used to import files use a URI-parameter to configure the start of the route. This makes it possible to select and configure the Camel component for the start of the route by URI scheme.
A URI must be configured for each directory to import files from. The keys in the configuration maps define the IDs of the Camel routes. In the following example, three CSV import routes and one file import route are configured.
universal-import-service:
  csv:
    csv-import-1:
      uri: "file://${project.build.testOutputDirectory}/csv?antInclude=**/test-Demo.csv&noop=false"
    csv-import-2:
      uri: "file://${project.build.testOutputDirectory}/csv?antInclude=**/test-100.csv&noop=false"
    csv-import-3:
      uri: "file://${project.build.testOutputDirectory}/csv?antInclude=**/test-100_2.csv&noop=false"

universal-import-service:
  file:
    file-import-1:
      uri: "file://${project.build.testOutputDirectory}/content?noop=true&moveFailed=.error"
In the example, the file component of Apache Camel is activated by the file: scheme of the URI. The Camel documentation contains information about the available parameters for the file endpoint as well as the other available endpoints.
The way CSV files are processed is controlled by the CSV data format of Apache Camel. It offers various configuration properties, which are listed in the Camel documentation. The properties can be used in the configuration file for the Universal Import Service as shown below:
camel:
  dataformat:
    csv:
      delimiter: ";"
In the example, the delimiter for the CSV columns is set to a semicolon (;).
The configuration settings for the CSV data format apply to all CSV import routes.
Generic CSV mapper
The generic CSV mapper is the default implementation of the CsvLineMapper interface contained in the service.
The mapper maps each column in the CSV file to an attribute of an arveo entity. Content to import is read from a configurable column, which can contain zero or more file names to import. The file names can either contain a fully qualified path, or just the name of the file. In the latter case, the directory containing the files with the actual content can be configured (see below).
The mapper offers three different modes:
- SIMPLE: This is the default. Each line in the CSV file is mapped to one arveo document, which might contain zero or more content elements. The mapping of files to content elements is fixed and can either map file names to content element names or positions in the list of files to content element names.
- COUNTING: Each line is mapped to one or more arveo documents. Each document contains either one content element or no content at all. A counter can be used as prefix or suffix for any imported attribute to distinguish the documents created for one line.
- REFERENCING: Each line is mapped as a simple record structure consisting of a container entity containing all attributes and zero or more document components referenced by a foreign key, each containing one content element.
The generic CSV mapper supports individual configurations for each configured import route as shown in the following example:
universal-import-service:
  csv:
    csv-import-1: (1)
      uri: "file://..."
  generic-csv-mapper:
    settings:
      csv-import-1: (2)
        type-definition-name: "demo_document"
1 | The ID of the import route for the configured directory |
2 | The ID of the configuration for the generic CSV mapper. Must match the ID of the import route. |
Attribute mapping
The mapping of CSV columns to arveo attributes works the same in each mode. An attribute mapping must be configured for each CSV column that is supposed to be imported. Attribute mappings are configured in a map, the keys being the names of the columns of the CSV file. An attribute mapping consists of the following parameters:
Parameter | Explanation |
---|---|
attribute-name | The name of the arveo attribute (in snake-case). |
type | The type of the attribute (for example STRING, INTEGER, BOOLEAN or DATE_TIME). |
array | Whether the attribute is multivalued or not (the default is false). |
delimiter | The delimiter of multivalued attributes. Ignored when array is false. |
date-pattern | The pattern used to parse attributes of the date type. |
time-pattern | The pattern used to parse attributes of the time type. |
date-time-pattern | The pattern used to parse attributes of type DATE_TIME. |
zone-id | The time zone ID used when the value in the CSV column for a date-time attribute does not contain a zone. |
local-date | The local date for a date-time attribute used if the value does not contain a date. Must be in ISO-8601 format such as '2011-12-03'. |
local-time | The local time for a date-time attribute used if the value does not contain a time. Must be in ISO-8601 format such as '10:15' or '10:15:30'. |
prefix | An optional prefix added to imported attributes of type STRING. |
suffix | An optional suffix added to imported attributes of type STRING. |
default-value | The default value used when the line does not contain a value for a configured attribute mapping. Must follow the same format rules as the other values in the column. |
Attributes are parsed using the default Java parsers, e.g. Integer.parseInt() for INTEGER, or using the supplied patterns for date, time or date-time values. Booleans can either be Strings ('true', 'false') or integers (0,1).
The default date-time-pattern used by the importer for date-time attributes ([u-M-d]['T'][H:m:s][X]) allows parsing partial values for date-time attributes. The parts in square brackets (the date, the letter 'T', the time and the zone) are optional. If no value for these parts is contained in the parsed value, it is replaced by the configured default local-date, local-time or zone-id.
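The fallback to configured defaults can be illustrated with plain java.time. The following is a minimal sketch under stated assumptions, not the importer's actual implementation: the class PartialDateTimeSketch and the method parseWithDefaults are hypothetical names, and only the pattern string is taken from the text above.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.time.temporal.TemporalQueries;

public class PartialDateTimeSketch {

    // The default pattern quoted above; every bracketed section is optional.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("[u-M-d]['T'][H:m:s][X]");

    // Hypothetical helper: parses a partial value and fills missing parts with
    // defaults that play the role of local-date, local-time and zone-id.
    public static ZonedDateTime parseWithDefaults(String value, LocalDate defaultDate,
                                                  LocalTime defaultTime, ZoneId defaultZone) {
        TemporalAccessor parsed = FORMATTER.parse(value);
        // Each query returns null when the corresponding part was not present.
        LocalDate date = parsed.query(TemporalQueries.localDate());
        LocalTime time = parsed.query(TemporalQueries.localTime());
        ZoneId zone = parsed.query(TemporalQueries.zone());
        return ZonedDateTime.of(
                date != null ? date : defaultDate,
                time != null ? time : defaultTime,
                zone != null ? zone : defaultZone);
    }

    public static void main(String[] args) {
        LocalDate defaultDate = LocalDate.of(2000, 1, 1);
        LocalTime defaultTime = LocalTime.of(10, 15);
        ZoneId defaultZone = ZoneId.of("UTC");
        // Date only: time and zone come from the defaults.
        System.out.println(parseWithDefaults("2011-12-03", defaultDate, defaultTime, defaultZone));
        // Time only: date and zone come from the defaults.
        System.out.println(parseWithDefaults("10:15:30", defaultDate, defaultTime, defaultZone));
        // Complete value: no default is used.
        System.out.println(parseWithDefaults("2011-12-03T10:15:30Z", defaultDate, defaultTime, defaultZone));
    }
}
```

The sketch queries the parsed result for each part separately, so a value that contains only a date or only a time still yields a complete date-time.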
Attribute mappings are configured for each mapper mode within the settings of an import route. For example, when the counting mode is used for the route csv-import-2, the attribute mappings would be configured in the setting universal-import-service.generic-csv-mapper.settings.csv-import-2.counting.attributes.
The example below shows a mapping configuration for the import of CSV columns called 'sysrowid', 'systimestamp' and 'ispdf'.
attributes:
  sysrowid:
    attribute-name: "sys_row_id"
    type: STRING
  systimestamp:
    attribute-name: "sys_time_stamp"
    type: DATE_TIME
    date-time-pattern: "u-M-d H:m:s"
    zone-id: "UTC"
  ispdf:
    attribute-name: "pdf"
    type: BOOLEAN
  archive:
    attribute-name: "archive"
    type: STRING
    default-value: "records"
Attributes using a default value do not have to be contained in the CSV file. This makes it possible to add new attributes that were not contained in the original data.
Simple mode
The simple mode is the default operating mode of the generic CSV mapper. In this mode, a 1:1 mapping between file names read from the CSV file and content element names must be configured. The mapping can either be from file name to content element name or from the position of the file name in the list to a content element name. Because the simple mode is the default, it does not have to be explicitly enabled in the configuration.
generic-csv-mapper:
  settings:
    csv-import-1: (1)
      type-definition-name: "demo_document" (2)
      simple:
        content:
          csv-field-name: "filename" (3)
          content-path: "${project.build.testOutputDirectory}/content" (4)
          position-mappings:
            0: "content" (5)
1 | The name of the import configuration. Must match the name of the configured route (see Configuration) |
2 | The name of the type definition that will contain the imported documents |
3 | The name of the field in the CSV file containing the file names |
4 | The path of the directory that contains the files. In this case, the CSV is expected to contain only the file names. |
5 | Mapping by position. The first file will be stored in the content element named "content". |
A complete configuration example can be found in the system-test module in the file src/test/resource-templates/config/universal-import-service.yaml.
Counting mode
In the counting mode, CSV lines containing more than one file name are mapped to multiple independent document entities. Each document entity will contain one content element. If a line in the CSV file does not contain any file names, one document with no content elements will be created. A counter can be added to the suffix or prefix of any string attribute by using the placeholder ${contentElementNumber}.
generic-csv-mapper:
  settings:
    csv-import-2:
      type-definition-name: "document" (1)
      mode: COUNTING (2)
      counting:
        content:
          csv-field-name: "filename" (3)
          delimiter: "," (4)
          content-path: "${project.build.testOutputDirectory}/content" (5)
        attributes:
          xhdoc:
            attribute-name: "xhdoc"
            type: STRING
            suffix: "_${contentElementNumber}" (6)
1 | The name of the type definition that will contain the imported documents |
2 | The counting mode must be enabled explicitly |
3 | The name of the field in the CSV file containing the file names |
4 | The delimiter used to separate file names |
5 | The path of the directory that contains the files. In this case, the CSV is expected to contain only the file names. |
6 | Adds a suffix with the counter (starting at 1) of the file |
A complete configuration example can be found in the system-test module in the file src/test/resource-templates/config/universal-import-service-counting.yaml.
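The effect of the counter placeholder can be pictured with a few lines of Java. This is only an illustrative sketch: the class CountingSuffixSketch and the method expand are hypothetical and not part of the service, and the sketch assumes the placeholder is replaced by simple text substitution with a counter that starts at 1, as described above.

```java
import java.util.ArrayList;
import java.util.List;

public class CountingSuffixSketch {

    // Hypothetical helper: produces the attribute value for each document
    // created from one CSV line, given the number of files on that line.
    public static List<String> expand(String value, String suffixTemplate, int fileCount) {
        List<String> result = new ArrayList<>();
        for (int number = 1; number <= fileCount; number++) {
            // Replace the placeholder with the counter of the current file.
            String suffix = suffixTemplate.replace("${contentElementNumber}", String.valueOf(number));
            result.add(value + suffix);
        }
        return result;
    }

    public static void main(String[] args) {
        // A line with the value "A17" in column xhdoc and two file names.
        System.out.println(expand("A17", "_${contentElementNumber}", 2)); // prints [A17_1, A17_2]
    }
}
```

With the configuration shown above, a line that lists two files would therefore result in two documents whose xhdoc values differ only in the suffix.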
Referencing mode
In the referencing mode, a record container is created for each imported document. This record will contain all attributes, but no content. For each imported file, a document is created that contains only the data of the imported file. The documents are referenced by a foreign key containing the ID of the record container.
The imported documents do not contain any custom attributes, but arveo’s inheritance feature can be used to automatically inherit attributes from the referenced record.
generic-csv-mapper:
  settings:
    csv-import-3:
      type-definition-name: "component" (1)
      mode: REFERENCING (2)
      referencing:
        container-type-definition-name: "container" (3)
        reference-field-name: "container_id" (4)
        content:
          csv-field-name: "filename" (5)
          delimiter: "," (6)
          content-path: "${project.build.testOutputDirectory}/content" (7)
1 | The name of the type definition that will contain the imported documents |
2 | The referencing mode must be enabled explicitly |
3 | The name of the type definition containing the record containers |
4 | The name of the attribute in the documents containing the foreign key |
5 | The name of the field in the CSV file containing the file names |
6 | The delimiter used to separate file names |
7 | The path of the directory that contains the files. In this case, the CSV is expected to contain only the file names. |
A complete configuration example can be found in the system-test module in the file src/test/resource-templates/config/universal-import-service-referencing.yaml.
Configuring the foreign key
The foreign key links the document entities to a container entity. As shown above, the reference field name in the document entities must be configured. The field in the container entity referenced by the foreign key is, by default, the system field id. The value will be retrieved automatically by the batch processing API of the content repository service.
It is possible to change the name of the referenced field using the property reference-target-field-name.
It is also possible to disable the automatic retrieval of the foreign key value by configuring no value for the reference-field-name property. The value must be provided by the attribute mappings in this case.
Document attributes
By default, the document does not have any attributes except the required system fields and the reference field. In some cases, the document type definition might require additional custom fields. Those can be configured by providing additional attribute mappings for the document entities as shown below:
generic-csv-mapper:
  settings:
    csv-import-3:
      mode: REFERENCING
      referencing:
        document-attributes:
          docid:
            attribute-name: "doc_id"
            type: STRING
          contentRep:
            attribute-name: "repository_id"
            type: STRING
          componentName:
            attribute-name: "component_id"
            type: STRING
            default-value: "data_${documentNumber}"
The generic CSV mapper provides one placeholder, ${documentNumber}, that contains the document’s number (starting at 1) and that can be used in a prefix, a suffix or a default value.
Content preprocessing
The generic CSV mapper provides an additional extension mechanism that can be used to preprocess the binary content for each individual line in the CSV file before it is handed over to the mapper. This way it is possible to perform tasks like decryption, conversion or merging of content.
The extension mechanism works by registering a bean of type de.eitco.uis.mappers.common.ContentPreprocessor (for example, by creating a custom Spring Boot starter). Each ContentPreprocessor must implement the preprocess method, which has the following parameters:
- fileNames: A list of file names that was parsed from the current line in the CSV file (never null)
- csvFilePath: The directory that contains the CSV file (never null)
- contentPath: The configured path used to find the actual files (might be null)
The ContentPreprocessor returns a list of PreprocessedContent instances, each of which consists of an InputStream, the file name that will be used in arveo and the position of the file in the original list of file names. How many PreprocessedContent instances are contained in the returned list depends on the preprocessor. For example, an implementation might take all files and merge them into a single PDF file. Other implementations might just wrap the returned streams for decryption.
Custom implementations can extend the class de.eitco.uis.mappers.common.AbstractContentPreprocessor, which contains a utility method to open new FileInputStreams.
It is possible to register several ContentPreprocessor beans. The processor used for a specific import route is selected by calling the processor’s usedForRoute method with the ID of the route. The first processor found that returns true in this method will be used.
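The "first match wins" selection can be sketched as follows. Everything in this example except the method name usedForRoute is an assumption: the interface RouteAware is a hypothetical stand-in and not the real ContentPreprocessor interface, and the order in which the service consults the registered beans is not specified in this document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class RouteSelectionSketch {

    // Hypothetical stand-in for the usedForRoute method of an extension bean.
    public interface RouteAware {
        boolean usedForRoute(String routeId);
    }

    // Returns the first candidate that declares itself responsible for the route.
    public static <T extends RouteAware> Optional<T> select(List<T> candidates, String routeId) {
        return candidates.stream()
                .filter(candidate -> candidate.usedForRoute(routeId))
                .findFirst();
    }

    public static void main(String[] args) {
        RouteAware onlyFirstRoute = routeId -> routeId.equals("csv-import-1");
        RouteAware anyRoute = routeId -> true;
        List<RouteAware> candidates = Arrays.asList(onlyFirstRoute, anyRoute);
        // The specific bean wins for csv-import-1, the catch-all bean for other routes.
        System.out.println(select(candidates, "csv-import-1").get() == onlyFirstRoute); // prints true
        System.out.println(select(candidates, "csv-import-2").get() == anyRoute); // prints true
    }
}
```

Because only the first matching bean is used, a usedForRoute implementation that returns true for every route can shadow more specific beans, so it is usually safer to match on explicit route IDs.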
The interfaces to implement as well as the other classes required to implement a ContentPreprocessor are contained in the following artifact:
<dependency>
  <groupId>de.eitco.uis</groupId>
  <artifactId>universal-import-mappers-common</artifactId>
  <version>${import.service.version}</version>
</dependency>
To enable custom extensions, the Jar containing the Spring Boot Starter (and any additional Jar, if required) must be placed in the directory configured as loader path of the service using the -Dloader.path parameter.
Writing a custom line mapper
Custom line mappers have to implement the interface CsvLineMapper. Mappers can use the typed or the generic API of arveo. A mapper that uses the generic API has to return true in the isGeneric method implementation and has to implement the mapLineGeneric method. Typed mappers have to return false in the isGeneric method and have to implement the mapLine method.
The custom mapper implementation has to be registered as a Spring bean in a custom Spring Boot starter. To replace the provided default mapper, the custom auto starter either has to run before the auto configuration class de.eitco.uis.mappers.csv.GenericCsvMapperAutoConfiguration or the default mapper bean registrations have to be disabled by setting the property universal-import-service.generic-csv-mapper.enabled to false.
Line mappers return a list of batch operations. The arveo entities created for one line can be created using the respective batch operation(s). The operations will be executed in the order in which they are contained in the list.
The custom mapper can be activated by adding the jar of the custom starter to the service’s libs directory configured by the parameter -Dloader.path.
Generic file mapper
The generic file mapper is the default implementation of the FileMapper interface contained in the service. It can import files from a configurable directory into arveo. Properties of the files, for example parts of the path or file name, can be extracted as attributes of the arveo entities.
The mapper operates in two phases:
- Phase 1: Collect properties of the file to import. These can be parts of the path, the length or type of the file.
- Phase 2: Map collected properties to arveo attributes.
Property collection
Which properties to collect is configurable. Properties can be extracted from the path and file name by position or by using a regular expression. The path is split using the system’s path separator. The file name can be split using a configurable separator. The value of each property is a string that can be mapped to an attribute using the attribute mappings described below.
Positional properties
Positional properties are collected from the path or file name, after it is split into an array of strings either by using the path separator or the configured file name separator character. The position can be given as a positive or negative integer. A positive integer (starting at 1) defines the position from left to right. A negative integer (starting at -1) defines the position from right to left. For example, in the path /path/to/my/file.txt, the position 1 would match path and the position -2 would match my. Likewise, when the file name is used to collect a property and the file name separator is configured as _, the position 2 in the file name invoice_20250982.pdf would match 20250982 (the extension is stripped from the file name by default).
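The positional lookup can be sketched in a few lines of Java. This is a simplified illustration and not the mapper's code: the class PositionalPropertySketch and the method segmentAt are hypothetical, and dropping empty segments (such as the one before a leading separator) is an assumption made so that the results agree with the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PositionalPropertySketch {

    // Hypothetical helper: returns the segment at the given position. Positive
    // positions count from the left (starting at 1), negative positions count
    // from the right (starting at -1). Returns null if no such segment exists.
    public static String segmentAt(String value, String separator, int position) {
        List<String> segments = new ArrayList<>();
        for (String segment : value.split(Pattern.quote(separator))) {
            if (!segment.isEmpty()) { // assumption: ignore the empty segment before a leading separator
                segments.add(segment);
            }
        }
        int index = position > 0 ? position - 1 : segments.size() + position;
        if (position == 0 || index < 0 || index >= segments.size()) {
            return null;
        }
        return segments.get(index);
    }

    public static void main(String[] args) {
        System.out.println(segmentAt("/path/to/my/file.txt", "/", 1));  // prints path
        System.out.println(segmentAt("/path/to/my/file.txt", "/", -2)); // prints my
        // The extension is assumed to be stripped before the file name is split.
        System.out.println(segmentAt("invoice_20250982", "_", 2));      // prints 20250982
    }
}
```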
Regular expression properties
Properties can be extracted from the path or file name using a regular expression. The expression can use capturing groups. The number of the group that contains the property can be configured. For example, the regular expression [-,_]([A-Z]{2})$ could be used to parse the language DE as two upper-case characters at the end of the file name EITCO_arveo-secom_Produktportfolio_DE.pdf. The extension is removed from the filename by default before the expression is matched against the file name.
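The example expression can be tried out with java.util.regex. The sketch below is an assumption about how such a lookup might work, not the mapper's implementation: the class RegexPropertySketch and the method extract are hypothetical, and it assumes that the expression only needs to be found somewhere in the name (Matcher.find) rather than match the whole name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPropertySketch {

    // Hypothetical helper: strips the extension, searches for the expression
    // and returns the requested group, or null if there is no match.
    public static String extract(String fileName, String expression, int group) {
        int dot = fileName.lastIndexOf('.');
        String baseName = dot > 0 ? fileName.substring(0, dot) : fileName;
        Matcher matcher = Pattern.compile(expression).matcher(baseName);
        return matcher.find() ? matcher.group(group) : null;
    }

    public static void main(String[] args) {
        String fileName = "EITCO_arveo-secom_Produktportfolio_DE.pdf";
        // Group 1 is the capturing group with the two upper-case characters.
        System.out.println(extract(fileName, "[-,_]([A-Z]{2})$", 1)); // prints DE
        // Group 0 is the entire match, including the separator character.
        System.out.println(extract(fileName, "[-,_]([A-Z]{2})$", 0)); // prints _DE
    }
}
```

The second call shows why the group number matters: with the default group 0 the separator would become part of the property value.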
Property mapping
Properties to collect are configured using a map, where the keys are the names of the properties. These names can then be used in the attribute mappings. Each map entry can contain the following settings. Note that only one way to collect a property can be used for each individual property.
Option | Explanation |
---|---|
position-in-path | Position in the path split by path separator char. Positive or negative integer. |
position-in-file-name | Position in the file name split by file name separator char. Positive or negative integer. |
path-regex | Regular expression matched on the path. |
file-name-regex | Regular expression matched on the file name. |
remove-extension | Whether to remove the extension from the path or file name. Default is true. |
Each regular expression has two settings:
Option | Explanation |
---|---|
expression | The regular expression. |
group | The number of the group containing the value (default is 0 for the entire expression). |
Predefined properties
The mapper provides some predefined properties that can be used in attribute mappings without additional configuration in the property mappings.
Property | Content |
---|---|
_path | Absolute path of the file (string) |
_name | File name (string including extension) |
_parent | Path of the file’s parent directory (string) |
_length | Length of the file (long) |
_extension | Extension of the file name only (string) |
_last_modified | Last modified timestamp of the file (in ISO-8601 format) |
Attribute mapping
The collected properties (and the predefined properties) are mapped to arveo attributes in the attribute mappings. The key of each attribute mapping must be a collected or predefined property name. Each attribute mapping can use the following parameters:
Parameter | Explanation |
---|---|
attribute-name | The name of the arveo attribute (in snake-case). |
type | The type of the attribute (for example STRING, INTEGER, BOOLEAN or DATE_TIME). |
array | Whether the attribute is multivalued or not (the default is false). |
delimiter | The delimiter of multivalued attributes. Ignored when array is false. |
date-pattern | The pattern used to parse attributes of the date type. |
time-pattern | The pattern used to parse attributes of the time type. |
date-time-pattern | The pattern used to parse attributes of type DATE_TIME. |
zone-id | The time zone ID used when the value for a date-time attribute does not contain a zone. |
local-date | The local date for a date-time attribute used if the value does not contain a date. Must be in ISO-8601 format such as '2011-12-03'. |
local-time | The local time for a date-time attribute used if the value does not contain a time. Must be in ISO-8601 format such as '10:15' or '10:15:30'. |
prefix | An optional prefix added to imported attributes of type STRING. |
suffix | An optional suffix added to imported attributes of type STRING. |
default-value | The default value used when no value is available for a configured attribute mapping. Must follow the same format rules as regular values of the attribute. |
Attributes are parsed using the default Java parsers, e.g. Integer.parseInt() for INTEGER, or using the supplied patterns for date, time or date-time values. Booleans can either be Strings ('true', 'false') or integers (0,1).
The default date-time-pattern used by the importer for date-time attributes ([u-M-d]['T'][H:m:s][X]) allows parsing partial values for date-time attributes. The parts in square brackets (the date, the letter 'T', the time and the zone) are optional. If no value for these parts is contained in the parsed value, it is replaced by the configured default local-date, local-time or zone-id.
Example configuration
The generic file mapper supports individual configurations for each configured import route as shown in the following example:
universal-import-service:
  file:
    file-import-1: (1)
      uri: "file://..."
  generic-file-mapper:
    settings:
      file-import-1: (2)
        type-definition-name: "files"
1 | The ID of the import route for the configured directory |
2 | The ID of the configuration for the generic file mapper. Must match the ID of the import route. |
The following example shows how to configure the generic file mapper. The example collects three properties and maps these and one of the predefined properties to attributes.
generic-file-mapper:
  settings:
    file-import-1: (1)
      type-definition-name: "files" (2)
      properties: (3)
        parent-name:
          position-in-path: -2
        drive-name:
          position-in-path: 1
        language:
          file-name-regex:
            expression: "[-,_]([A-Z]{2})$"
            group: 1
      attributes: (4)
        _name:
          attribute-name: file_name
          type: STRING
        parent-name:
          attribute-name: parent_name
          type: STRING
        language:
          attribute-name: language
          type: STRING
        drive-name:
          attribute-name: drive_name
          type: STRING
1 | Name of the configuration. Must match the name of the configured import route. |
2 | Name of the type definition to import into |
3 | Map of collected properties |
4 | Mapping from properties to attributes |
Extensions
The generic file importer provides an extension mechanism that allows preprocessing of imported files, for example to perform decryption. To use this extension mechanism, a bean of type de.eitco.uis.mappers.common.FilePreprocessor must be registered with a custom Spring Boot Starter. FilePreprocessors can influence the way the actual content of the file is read, as well as properties like file length, name and path. This mechanism can also be used to provide attributes that will be added to the arveo entity created for the file. This makes it possible to implement an import that reads files from a directory that do not contain the actual content to import, but a link to another file containing the actual content and a collection of attributes to import.
A second extension mechanism allows resolving attributes from arbitrary sources for an imported file. For example, the extension could read a YAML file containing additional attributes. To use this extension mechanism, a bean of type de.eitco.uis.mappers.common.AttributesProvider must be registered. The attributes returned by an AttributesProvider are added after the regular attribute mapping has been performed, so it is possible to overwrite already resolved attributes using an AttributesProvider.
It is possible to register several FilePreprocessor and AttributesProvider beans. The bean used for a specific import route is selected by calling the bean’s usedForRoute method with the ID of the route. The first bean found that returns true in this method will be used.
The required classes for the extensions are contained in the following artifact:
<dependency>
  <groupId>de.eitco.uis</groupId>
  <artifactId>universal-import-mappers-common</artifactId>
  <version>${import.service.version}</version>
</dependency>
To enable custom extensions, the Jar containing the Spring Boot Starter (and any additional Jar, if required) must be placed in the directory configured as loader path of the service using the -Dloader.path parameter.
Writing a custom file mapper
Custom file mappers have to implement the interface FileMapper. Mappers can use the typed or the generic API of arveo. A mapper that uses the generic API has to return true in the isGeneric method implementation and has to implement the mapFileGeneric method. Typed mappers have to return false in the isGeneric method and have to implement the mapFile method.
The custom mapper implementation has to be registered as a Spring bean in a custom Spring Boot starter. To replace the provided default mapper, the custom auto starter either has to run before the auto configuration class de.eitco.uis.mappers.file.GenericFileMapperAutoConfiguration or the default mapper bean registration has to be disabled by setting the property universal-import-service.generic-file-mapper.enabled to false.
File mappers return a list of batch operations. The arveo entities created for one file can be created using the respective batch operation(s). The operations will be executed in the order in which they are contained in the list.
The custom mapper can be activated by adding the jar of the custom starter to the service’s libs directory configured by the parameter -Dloader.path.