For the system requirements, see the Installation and Administrator Guide of the openBIS Server. The openBIS Application Server needs to be running when installing the openBIS Data Store Server (DSS).

Installation

For installing DSS, the main distribution (naming schema: datastore_server-<version>-r<revision>.zip) and optional plugin distributions (naming schema: datastore_server_plugin-<plugin name>-<version>-r<revision>.zip) are needed. The main distribution contains:

  • datastore_server/datastore_server.sh: Bash script for starting/stopping the server.
  • datastore_server/lib: Folder with the needed libraries.
  • datastore_server/etc: Folder with configuration files and key stores.
  • datastore_server/data: Example data folder configured to run the demo.
  • datastore_server/log: Empty folder which will contain log files.

A plugin distribution contains files to be unpacked relative to the DSS main folder <some folder>/datastore_server.

Installation steps

  1. Create a service user account, i.e. an unprivileged, regular user account. You can use the same user account for running the Application Server and the Data Store Server. Do not run openBIS DSS as root!
  2. Unzip the main distribution on the server machine at its final location.
  3. If plugins are required, go into the folder datastore_server and unzip them there.
  4. Adapt datastore_server/etc/service.properties.
  5. Create a role in openBIS for the etlserver user: Administration -> Authorization -> Roles -> Assign Role.
    Choose 'INSTANCE_ETL_SERVER' as the role and select 'etlserver' as the person.
    If you do not do this and start up the Data Store Server, you will get an error in log/datastore_server_log.txt saying
    Authorization failure: No role assignments could be found for user 'etlserver'.
  6. Start up the server as follows:

    prompt> ./datastore_server.sh start
    

    Have a look at the log files: they should be free of exception stack traces.

Configuration file

The configuration file datastore_server/etc/service.properties is an Extended Properties File. It can contain many properties, most of them defining plugins whose configuration can be extracted into core plugins. For more information see Core Plugins. Nevertheless, here is a typical example which still contains everything:

service.properties
# Unique code of this Data Store Server. Not more than 40 characters.
data-store-server-code = DSS1

# The root directory of the data store
storeroot-dir = data/store

# The directory where the command queue files are located; defaults to storeroot-dir 
commandqueue-dir =

# Comma-separated list of definitions of additional queues for processing plugins.
# Each entry is of the form <queue name>:<regular expression>
# A corresponding persistent queue is created. All processing plugins with a key matching the corresponding
# regular expression are associated with the corresponding queue.
#
# The key of a processing plugin is its core-plugin name which is the name of the folder containing 
# 'plugin.properties'. 
#
# In case archiving is enabled the following three processing plugins are defined:
# 'Archiving', 'Copying data sets to archive', and 'Unarchiving'
#data-set-command-queue-mapping = archiving:Archiving|Copying data sets to archive

# Cache for data set files from other Data Store Servers
# cache-workspace-folder = ../../data/dss-cache
# Maximum cache size in MB
# cache-workspace-max-size = 1024
# cache-workspace-min-keeping-time = 

# Port
port = 8444

# Session timeout in minutes
session-timeout = 720

# Path to the keystore
keystore.path = etc/openBIS.keystore

# Password of the keystore
keystore.password = changeit

# Key password of the keystore
keystore.key-password = changeit

# The check interval (in seconds)
check-interval = 60

# The time-out for clean up work in the shutdown sequence (in seconds).
# Note that that the maximal time for the shutdown sequence to complete can be as large
# as twice this time.
# Remark: On a network file system, it is not recommended to turn this value to something
# lower than 180.
shutdown-timeout = 180

# ---------------------------------------------------------------------------
# Data Set Registration Halt:
#
# In order to prevent the data store from running out of free disk space, a limit (the so-called highwater mark)
# can be specified. If the free disk space of the associated share drops below this value,
# DSS stops registering data sets. A notification log entry and an email will also be produced.
# When the free disk space is above the limit again, registration will continue.

# The value must be specified in kilobytes (1048576 KB = 1024 * 1024 KB = 1 GB). If no highwater mark is
# specified or if the value is negative, the system will not watch the disk space. Two different kinds
# of highwater mark are supported: 'highwater-mark', which checks the space on the store, and
# 'recovery-highwater-mark', which checks the amount of free space for the recovery state (on the local filesystem).
# 
# Core plugins of type drop box and ingestion services (special type of reporting-plugins) can override the
# highwater mark value individually by specifying the property 'incoming-share-minimum-free-space-in-gb' 
# in their plugin.properties. 
highwater-mark = -1
recovery-highwater-mark = -1


# If a data set is successfully registered it sends out an email to the registrator.
# If this property is not specified, no email is sent to the registrator. This property
# does not affect the mails which are sent, when the data set could not be registered.
notify-successful-registration = false

# The URL of the openBIS server
server-url = https://localhost:8443/openbis/openbis

# The username to use when contacting the openBIS server
username = etlserver

# The password to use when contacting the openBIS server
password = etlserver

# The base URL for Web client access.
download-url = https://localhost:8889

# SMTP properties (must start with 'mail' to be considered).
# mail.smtp.host = localhost
# mail.from = datastore_server@localhost
# If this property is set a test e-mail will be sent to the specified address after DSS successfully started-up.
# mail.test.address = test@localhost

# ---------------- Timing parameters for file system operations on remote shares.

# Time (in seconds) to wait for any file system operation to finish. Operations exceeding this
# timeout will be terminated.
timeout = 60
# Number of times that a timed out operation will be tried again (0 means: every file system
# operation will only ever be performed once).
max-retries = 11
# Time (in seconds) to wait after an operation has been timed out before re-trying.
failure-interval = 10

# The period of no write access that needs to pass before an incoming data item is considered
# complete and ready to be processed (in seconds) [default: 300].
# Valid only when the auto-detection method is used to determine if incoming data are ready to be processed.
# quiet-period = <value in seconds>

# Globally used separator character which separates entities in a data set file name
data-set-file-name-entity-separator = _

# ---------------------------------------------------------------------------
# maintenance plugins configuration
# ---------------------------------------------------------------------------

# Comma separated names of maintenance plugins.
# Each plugin should have configuration properties prefixed with its name.
# Mandatory properties for each <plugin> include:
#   <plugin>.class - Fully qualified plugin class name
#   <plugin>.interval - The time between plugin executions (in seconds)
# Optional properties for each <plugin> include:
#   <plugin>.start - Time of the first execution (HH:mm)
#   <plugin>.execute-only-once - If true the task will be executed exactly once,
#                                interval will be ignored. By default set to false.
maintenance-plugins = demo, auto-archiver, archive-cleanup, hierarchy-builder

demo.class = ch.systemsx.cisd.etlserver.plugins.DemoMaintenancePlugin
demo.interval = 60
demo.start = 23:00

# ----- Automatic archiver configuration ------------------------------------
# Class of a task that performs automatic archiving of 'AVAILABLE' data sets based on their properties.
auto-archiver.class = ch.systemsx.cisd.etlserver.plugins.AutoArchiverTask
auto-archiver.interval = 10
auto-archiver.start = 23:00
# following properties are optional
# only data sets of specified type will be archived
auto-archiver.data-set-type = UNKNOWN
# only data sets that are older than specified number of days will be archived (default = 30)
auto-archiver.older-than = 90
# Indicates whether data sets will be removed from the data store upon archiving
# NOTE: You can configure two different auto-archiver tasks - one with 'remove-datasets-from-store'
# set to 'false'  to enable eager archiving and another one with the flag set to 'true' that will
# free space on the datastore server
auto-archiver.remove-datasets-from-store=false
# fully qualified class name of a policy that additionally filters data sets to be archived
auto-archiver.policy.class = ch.systemsx.cisd.etlserver.plugins.DummyAutoArchiverPolicy

# use this policy to archive datasets in batches grouped by experiment and dataset type
# auto-archiver.policy.class = ch.systemsx.cisd.etlserver.plugins.ByExpermientPolicy
# use this policy to archive datasets in batches grouped by space
# auto-archiver.policy.class = ch.systemsx.cisd.etlserver.plugins.BySpacePolicy


# Default archival candidate discoverer, using "older-than" criteria
auto-archiver.archive-candidate-discoverer.class = ch.systemsx.cisd.etlserver.plugins.AgeArchiveCandidateDiscoverer
# use this archival data set candidate discoverer to auto-archive by tags. Please note that "older-than" has no effect with this one
# auto-archiver.archive-candidate-discoverer.class = ch.systemsx.cisd.etlserver.plugins.TagArchiveCandidateDiscoverer
# auto-archiver.archive-candidate-discoverer.tags = /admin/boo, /admin/foo


# ----- Alternative automatic archiver configuration ------------------------------------
# Performs automatic archiving of 'ACTIVE' data sets grouped by experiments based on the experiment's age
# (which is defined as last modification date of the youngest data set of the experiment).
# It iterates over all experiments, ordered by experiment age, and archives all non-locked, non-excluded
# data sets of an experiment. The estimated size of the archived data sets has to be configured via the
# 'estimated-data-set-size-in-KB.*' properties. The iteration over the experiments stops, when the estimated free
# disk space on the monitored directory is larger than 'minimum-free-space-in-MB'.
auto-archiver-exp.class = ch.systemsx.cisd.etlserver.plugins.ExperimentBasedArchivingTask
# The time between subsequent archiving runs (in seconds)
auto-archiver-exp.interval = 86400
# Time of the first execution (HH:mm)
auto-archiver-exp.start = 23:15
# A directory to monitor for free disk space
#auto-archiver-exp.monitored-dir = /some/directory
# The minimum free space on the monitored share to ensure by archiving data sets (optional, default is 1024)
auto-archiver-exp.minimum-free-space-in-MB = 1024
# A comma-separated list of data set type keys to exclude from archiving (optional)
auto-archiver-exp.excluded-data-set-types = TYPE_KEY1, TYPE_KEY2
# Estimated data set size (in KB) for data set of type TYPE_KEY3
auto-archiver-exp.estimated-data-set-size-in-KB.TYPE_KEY3=300
# Default data set size estimation (in KB)
auto-archiver-exp.estimated-data-set-size-in-KB.DEFAULT=1200
# A free space provider class to be used
auto-archiver-exp.free-space-provider.class=ch.systemsx.cisd.openbis.dss.generic.shared.utils.PostgresPlusFileSystemFreeSpaceProvider
# PostgreSQL data source to be checked for free space
auto-archiver-exp.free-space-provider.monitored-data-source=data-source
# Whether a VACUUM command should be executed before the database is asked for the available free space
auto-archiver-exp.free-space-provider.execute-vacuum=true


# A task which cleans up deleted data sets from the archive
archive-cleanup.class = ch.systemsx.cisd.etlserver.plugins.DeleteFromArchiveMaintenanceTask
archive-cleanup.status-filename= ${root-dir}/deletion-event-lastseenid.txt
# keep the archive copies of the data sets for one week after their deletion in openBIS
# the delay is specified in minutes
archive-cleanup.delay-after-user-deletion=10080
# start up time
archive-cleanup.start = 02:00
# run every day (in seconds)
archive-cleanup.interval = 86400

# the plugin which is run periodically to create a mirror structure of the store with the same files
# but with user-readable structure of directories
hierarchy-builder.class = ch.systemsx.cisd.etlserver.plugins.HierarchicalStorageUpdater
# The time between rebuilding the hierarchical store structure (in seconds)
hierarchy-builder.interval = 86400
# The root directory of the hierarchical data store
hierarchy-builder.hierarchy-root-dir = data/hierarchical-store
# The naming strategy for the symbolic links
hierarchy-builder.link-naming-strategy.class = ch.systemsx.cisd.etlserver.plugins.TemplateBasedLinkNamingStrategy
# The exact form of link names produced by TemplateBasedLinkNamingStrategy is configurable
# via the following template. The variables
#   dataSet, dataSetType, experiment, instance, project, sample, space
# will be recognized and replaced in the final link name.
hierarchy-builder.link-naming-strategy.template = ${space}/${project}/${experiment}/${dataSetType}+${sample}+${dataSet}
# When specified for a given <dataset-type> this store subpath will be used as the symbolic link source
hierarchy-builder.link-source-subpath.<dataset-type> = original
# Setting this property to "true" for a given <dataset-type> will treat the first child item (file or folder)
# in the specified location as the symbolic link source. It can be used in conjunction with
# the "link-source-subpath.<dataset-type>" to produce links pointing to a folder with unknown name e.g.
# <data-set-location>/original/UNKNOWN-NAME-20100307-1350
hierarchy-builder.link-from-first-child.<dataset-type> = true


# ---------------------------------------------------------------------------
# (optional) archiver configuration
# ---------------------------------------------------------------------------

# Configuration of an archiver task. All properties are prefixed with 'archiver.'.

# Archiver class specification (together with the list of packages this class belongs to).
archiver.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.demo.DemoArchiver
# indicates if data should be synchronized when local copy is different than one in archive
archiver.synchronize-archive = false

# ---------------------------------------------------------------------------
# reporting and processing plugins configuration
# ---------------------------------------------------------------------------

# Comma separated names of reporting plugins. Each plugin should have configuration properties prefixed with its name.
# If name has 'default-' prefix it will be used by default in data set Data View.
reporting-plugins = demo-reporter

# Label of the plugin which will be shown for the users.
demo-reporter.label = Show Dataset Size
# Comma separated list of dataset type codes which can be handled by this plugin.
demo-reporter.dataset-types = UNKNOWN
# Plugin class specification (together with the list of packages this class belongs to).
demo-reporter.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.demo.DemoReportingPlugin
# The property file. Its content will be passed as a parameter to the plugin.
demo-reporter.properties-file =

# Plugin that allows to show content of Main Data Set as an OpenBIS table
# tsv-viewer.label = TSV View
# tsv-viewer.dataset-types = TSV
# tsv-viewer.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
# tsv-viewer.properties-file =

# ---------------------------------------------------------------------------
# Data Set Validator Definitions
# ---------------------------------------------------------------------------

# Data set validators used to accept or reject data sets to be registered.
# Comma separated list of validator definitions.
data-set-validators = validator

# Definition of data set validator 'validator'
validator.data-set-type = HCS_IMAGE
validator.path-patterns = **/*.txt
validator.columns = id, description, size
validator.id.header-pattern = ID
validator.id.mandatory = true
validator.id.order = 1
validator.id.value-validator = ch.systemsx.cisd.etlserver.validation.HeaderBasedValueValidatorFactory
validator.id.header-types = compound, gene-locus
validator.id.compound.header-pattern = CompoundID
validator.id.compound.value-type = unique
validator.id.compound.value-pattern = CHEBI:[0-9]+
validator.id.gene-locus.header-pattern = GeneLocus
validator.id.gene-locus.value-type = unique
validator.id.gene-locus.value-pattern = BSU[0-9]+
validator.description.header-pattern = Description
validator.description.value-type = string
validator.description.value-pattern = .{0,100}
validator.size.header-pattern = A[0-9]+
validator.size.can-define-multiple-columns = true
validator.size.allow-empty-values = true
validator.size.value-type = numeric
validator.size.value-range = [0,Infinity)


# Comma separated names of processing threads. Each thread should have configuration properties prefixed with its name.
# E.g. 'code-extractor' property for the thread 'my-etl' should be specified as 'my-etl.code-extractor'
inputs=main-thread

# ---------------------------------------------------------------------------
# 'main-thread' thread configuration
# ---------------------------------------------------------------------------
# The directory to watch for incoming data.
main-thread.incoming-dir = data/incoming

# Specifies what should happen if an error occurs during dataset processing.
# By default this flag is set to false and user has to modify the 'faulty paths file'
# each time the faulty dataset should be processed again.
# Set this flag to true if the processing should be repeated after some time without manual intervention.
# Note that this can increase the server load.
main-thread.reprocess-faulty-datasets = false

# If 'true' then unidentified and invalid data sets will be deleted instead of being moved to 'unidentified' folder
# Allowed values:
#  - false   - (default) move unidentified or invalid data sets to 'unidentified' folder
#  - true    - delete unidentified or invalid data sets
# delete-unidentified = true

# Determines when the incoming data should be considered complete and ready to be processed.
# Allowed values:
#  - auto-detection - when no write access has been detected for the specified 'quiet-period'
#  - marker-file    - when an appropriate marker file for the data exists.
# The default value is 'marker-file'.
main-thread.incoming-data-completeness-condition = marker-file

# Path to the script that will be executed before data set registration.
# The script will be called with two parameters: <data-set-code> and <absolute-data-set-path> (in the incoming dropbox).
# NOTE: before starting DSS server make sure the script is accessible and executable.
# main-thread.pre-registration-script = /example/scripts/my-script.sh

# Path to the script that will be executed after successful data set registration.
# The script will be called with two parameters: <data-set-code> and <absolute-data-set-path> (in the data store).
# NOTE: before starting DSS server make sure the script is accessible and executable.
# main-thread.post-registration-script = /example/scripts/my-script.sh

# ---------------- Plugin properties
# The extractor class to use for code extraction

main-thread.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
# Separator used to extract the barcode in the data set file name
main-thread.data-set-info-extractor.entity-separator = ${data-set-file-name-entity-separator}
# The space
main-thread.data-set-info-extractor.space-code = TEST
# Location of file containing data set properties
#main-thread.data-set-info-extractor.data-set-properties-file-name = data-set.properties


# The extractor class to use for type extraction
main-thread.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
main-thread.type-extractor.file-format-type = TIFF
main-thread.type-extractor.locator-type = RELATIVE_LOCATION
main-thread.type-extractor.data-set-type = HCS_IMAGE
main-thread.type-extractor.is-measured = true

# The storage processor (IStorageProcessor implementation)
main-thread.storage-processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor

# ---------------------------------------------------------------------------
# dss-rpc
# ---------------------------------------------------------------------------
# The dss-rpc section configures the RPC put functionality by providing a mapping between data
# set type and input thread parameters.
#
# The default input thread is specified by the dss-rpc.put-default key. If not specified, the first input
# thread will be used.
#
# Mappings are specified by dss-rpc.put.<data-set-type-code> = <thread-name>
#
# If this section is empty, then the first input thread will be used.
#
dss-rpc.put-default = main-thread
#dss-rpc.put.HCS_IMAGE = main-thread

# ---------------------------------------------------------------------------
# Sample dropbox configuration
# ---------------------------------------------------------------------------
# <incoming-dir> will be scanned for tsv files containing samples in standard
#   batch import format. Additionally the file should contain (in the comment)
#   definition of sample type and optionally the default space and registrator.
#   If the 'DEFAULT_SPACE' is defined, codes of the samples will be
#   automatically created, so 'identifier' column is not expected.
#   If 'USER' is defined, he will become a 'Registrator' of the samples.
#
# --- EXAMPLE FILE ---
#! GLOBAL_PROPERTIES_START
#! SAMPLE_TYPE = <sample_type_code>
#! GLOBAL_PROPERTIES_END
# identifier  parent     container    property1    property2
# /SPACE/S1   /SPACE/P1               value11      value21
# /SPACE/S1   /SPACE/P2               value12      value22
# --- END OF FILE ---
#
# --- EXAMPLE FILE (generate codes automatically, registrator specified) ---
#! GLOBAL_PROPERTIES_START
#! SAMPLE_TYPE = <sample_type_code>
#! USER = <user_id>
#! DEFAULT_SPACE = <space_code>
#! GLOBAL_PROPERTIES_END
# parent     container     property1    property2
# /SPACE/P1                value11      value21
# /SPACE/P2                value12      value22
# --- END OF FILE ---
#
# Directory scanned for files with samples
samples.incoming-dir = ${root-dir}/sample-dropbox
#
# Class responsible for handling files with samples definition
samples.dataset-handler = ch.systemsx.cisd.etlserver.SampleRegisteringDropbox
#
# The path to the error logs directory
samples.dataset-handler.error-log-dir = ${root-dir}/error-log
#
# Prefix of samples with automatically created codes. Default value: 'S'.
samples.dataset-handler.sample-code-prefix = AS
#
# Settings not relevant to sample dropbox, but required by DSS
samples.incoming-data-completeness-condition = auto-detection
samples.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
samples.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
samples.storage-processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor

#
# Research Collection - ELN Plugin
#
# rc-exports-api-limit-data-size-megabytes=4000
# rc-exports-api-service-document-url=
# rc-exports-api-user=
# rc-exports-api-password=

#
# Zenodo - ELN Plugin
#
# zenodo-exports-api-limit-data-size-megabytes=4000
# zenodo-exports-api-zenodoUrl=https://zenodo.org/

Because the Data Store Server communicates with the openBIS Application Server, the following properties have to be adapted:

  • server-url: Here only the host name has to be changed.
  • username/password: The Data Store Server is just a user of the openBIS Server. It should have the role SPACE_ETL_SERVER or INSTANCE_ETL_SERVER.
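
For example, a minimal service.properties fragment adapted to a production Application Server might look as follows (the host name openbis.example.org is an illustrative assumption; use the real password of the etlserver user):

server-url = https://openbis.example.org/openbis/openbis
username = etlserver
password = <etlserver password>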

Production setup with Proxy

A typical production scenario is to put a proxy (typically Apache) in front of AS and DSS. This has two main advantages:

  • Users access the same domain and port; browsers are known to be friendlier to this setup.
  • The proxy can provide an HTTPS connector.

When these requirements are met, it is possible to comment out the download-url parameter from service.properties. As a consequence, the URLs provided by the system will be relative to the current domain and port of the application.
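
In that case the relevant fragment of service.properties simply leaves the parameter commented out (a sketch, assuming the proxy terminates HTTPS and forwards to the default AS and DSS ports):

# The base URL for Web client access; not needed behind the proxy, relative URLs are used instead.
# download-url = https://localhost:8889
server-url = https://localhost:8443/openbis/openbis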

Segmented Store

Below the store root directory (specified by the property storeroot-dir) the data is organized in segments called shares. Each share is a subdirectory whose name is a number, the share ID. A share can also be a symbolic link to some shared directory.

On startup the DSS automatically creates share 1. An existing store is migrated by moving all data into share 1. Administrators can create new shares by creating new subfolders, or symbolic links to folders, readable and writable by the user running DSS.

Incoming directories are automatically associated with shares. These shares are called incoming shares. All data sets are stored in the incoming share associated at DSS startup. For each incoming directory DSS tries to move an empty file from the incoming directory to a share. The first share for which this move succeeds becomes the associated incoming share. Each incoming directory has to have an associated incoming share. Several incoming directories can be associated with the same share.

It is possible to force the assignment of an incoming share for a dropbox in its plugin.properties using the optional incoming-share-id property. The incoming share specified in this way does not have to be on the same disk as the incoming folder.

Segmenting the data store allows storage to be added when it is needed. An administrator just has to mount a new hard disk or an NFS folder. After creating a symbolic link to it in the data store root directory the additional storage is available.

Shuffling Data Sets Manually

The script sharemanager.sh (inside <installation directory>/servers/datastore_server) allows data sets to be moved manually to another share:

./sharemanager.sh move-to <share id> <data set code 1> [<data set code 2> <data set code 3> ...] 

The user is asked for a user ID and password.

The following command lists all shares and their free space:

./sharemanager.sh list-shares

Shuffling Data Sets Automatically

To shuffle data sets from full incoming shares, or from shares with the withdraw flag set, to a freshly added share (called an extension share), the SegmentedStoreShufflingTask maintenance task is needed. Here is a typical configuration for this maintenance task:

service.properties
maintenance-plugins = <other maintenance plugins>, store-shuffler
store-shuffler.class = ch.systemsx.cisd.etlserver.plugins.SegmentedStoreShufflingTask
store-shuffler.shuffling.class = ch.systemsx.cisd.etlserver.plugins.SimpleShuffling
# Data will be moved from the incoming share when the amount of free space on it drops below this number:
store-shuffler.shuffling.minimum-free-space-in-MB = 1024
store-shuffler.shuffling.share-finder.class = ch.systemsx.cisd.etlserver.plugins.SimpleShufflingShareFinder
# Nothing is moved to the share if the amount of free space is below that mark.
store-shuffler.shuffling.share-finder.minimum-free-space-in-MB = 1024
store-shuffler.interval = 86400

The maintenance task shuffles data sets from incoming shares which have less than 1 GB of free space to the share with the most free space. The task is executed once a day (every 86400 seconds).

Speed and Speed Hint

Usually different shares are located on different (remotely) mounted disk drives. For that reason the data access speed can differ considerably among shares. Shuffling (and also unarchiving) can be controlled by associating a relative speed value with each share. The speed is a number between 0 and 100. Larger numbers mean faster read access. The speed is an arbitrary number which allows the data access speed of two different shares to be compared. For example, if share 1 has speed 30 and share 2 has speed 50, then data access from share 2 is faster than from share 1.

Share properties explains how to configure the speed of a share. If there is no explicit configuration a speed of 50 is assumed.

During data set registration a data set can be provided with a speed hint which is used during shuffling to find an appropriate share. The speed hint is a number between -100 and +100. A positive value means that shares with that speed or higher are preferred. A negative speed hint means that shares with a speed equal to its absolute value or less are preferred. For example, a data set with speed hint -30 prefers to be shuffled into a slow share of speed 30 or less.

Currently speed hints can be set only in jython drop-boxes. The default speed hint is -50.

Share properties

Each share can be configured via a share.properties file stored in the share's root folder. The following properties are understood:

  • speed (default: 50): The share's speed. Must be a number between 0 and 100.
  • shuffle-priority (default: SPEED): Taken into account by the algorithm of the StandardShareFinder. Valid values are SPEED and MOVE_TO_EXTENSION. The property configures which criterion is more important when considering a data set residing on an incoming share: MOVE_TO_EXTENSION means the data set will always be moved to an extension share, even if the current incoming share is a better speed match for the data set. SPEED instructs the algorithm to find the best share with regard to the data set's speed hint, even if it is an incoming share.
  • withdraw-share (default: false): If set to true, indicates that this share should be emptied by moving all its data sets to other shares. This flag is useful if a share should be withdrawn. The share can be removed once it is empty (the DSS admin will be informed by an e-mail). Note that in the case of incoming shares, new data sets might be added afterwards.
  • ignored-for-shuffling (default: false): If set to true, this share will not be taken into account for shuffling from/to it.
  • unarchiving-scratch-share (default: false): If set to true, this share can only be used as a scratch share for unarchiving with the MultiDataSetArchiver. For more details see Multi data set archiving.
  • experiments (no default value): Comma-separated list of experiment identifiers. This property is only used by ExperimentBasedShareFinder.
  • data-set-types (no default value): Comma-separated list of data set type codes. This property is only used by DataSetTypeBasedShareFinder.
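
For example, the share.properties of a fast incoming share which should be drained to extension shares whenever possible could look like this (a sketch; the concrete values are illustrative assumptions):

speed = 80
shuffle-priority = MOVE_TO_EXTENSION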

Share Finder

For shuffling as well as unarchiving, a strategy is needed to find an appropriate share for a data set. It can consider the speed hint of the data set. The finding strategy is a Java class implementing ch.systemsx.cisd.openbis.dss.generic.shared.IShareFinder. It is specified by the properties starting with <prefix>.share-finder. The concrete finder is specified by the fully-qualified class name defined by the property <prefix>.share-finder.class. The following share finders are available:

ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder

First it searches for shares whose speed matches exactly the absolute value of the speed hint. If nothing is found it repeats the search for shares with a speed above/below this value, depending on the sign of the speed hint. If still nothing is found it performs the search a third time, ignoring the speed hint.

Each search works as follows:
It first tries to find the extension share with the most free space which has enough space for the data set. If no such share is found it does the same for incoming shares.

ch.systemsx.cisd.etlserver.plugins.SimpleShufflingShareFinder

As in SimpleShareFinder it first tries to find a share with matching speed, second a share respecting the speed hint, and third a share ignoring the speed hint.

Each time it searches for the share with the most free space which has at least the amount of space specified by the configuration parameter minimum-free-space-in-MB (default value 1024).

ch.systemsx.cisd.openbis.dss.generic.shared.SpeedOptimizedShareFinder

First it searches for an extension share with matching speed and the most free space. If nothing is found it searches for an extension share with a speed respecting the speed hint. If this isn't successful it falls back to ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder, ignoring the speed hint.

ch.systemsx.cisd.openbis.dss.generic.shared.StandardShareFinder

The search algorithm of this share finder considers all shares with enough free space and with the withdraw flag not set as potential "candidates" (the free space of the data set's "home" share is increased by the data set size). The best candidate is selected by the following rules:

  1. An extension share is preferred over an incoming share.
  2. A share whose speed matches the speed requirements of the data set is preferred. If more than one share matches in the same way, the one with the speed closest to the absolute value of the speed hint is chosen.
  3. If all candidates are equal with respect to (1) and (2), the share with the most free space is chosen.

The priority of rules (1) and (2) is swapped if the current location of the data set is an incoming share with shuffle priority SPEED.

Generally the StandardShareFinder tends to move data sets from incoming to extension shares. A data set can only be moved from an extension to an incoming share by an unarchiving operation, and only if at the time of unarchiving all extension shares (regardless of their speeds) are full.

The configuration parameter incoming-shares-minimum-free-space-in-MB can be specified for the StandardShareFinder. It allows configuring a threshold of free space that should always be kept available on the incoming shares.

ch.systemsx.cisd.openbis.dss.generic.shared.ExperimentBasedShareFinder

This share finder looks for a share which is associated with one or more experiments. If the data set to be moved belongs to one of the specified experiments and the share has enough space, the finder returns this share. The association of a share with experiments is done via the property experiments of the share.properties file. The value is a comma-separated list of experiment identifiers.

ch.systemsx.cisd.openbis.dss.generic.shared.DataSetTypeBasedShareFinder

This share finder looks for a share which is associated with one or more data set types. If the data set to be moved is of one of the specified types and the share has enough space the finder returns this share. The association of a share with some data set type is done by the property data-set-types of the properties file share.properties. The value is a comma-separated list of data set type codes.

ch.systemsx.cisd.openbis.dss.generic.shared.MappingBasedShareFinder

This share finder reads a mapping file specified by the property mapping-file. It is used to get a list of share IDs. The first share from this list which fulfills the following conditions is returned: the share exists and its free space is larger than the size of the data set. For more details on the mapping file see Mapping File for Share Ids and Archiving Folders.

Data Sources

Often a DSS also needs relational databases. Such databases can be internal ones, which are fed when data sets are registered, or external ones providing additional information for data set registration, processing or reporting plugins. Data sources should be defined as core plugins of type data-sources. The following properties are understood:

  • factory-class: Optional fully-qualified name of a class implementing ch.systemsx.cisd.openbis.generic.shared.util.IDataSourceFactory. The properties below are understood if the default factory class ch.systemsx.cisd.openbis.dss.generic.shared.DefaultDataSourceFactory is used.
  • version-holder-class: Optional fully-qualified name of a class implementing ch.systemsx.cisd.openbis.dss.generic.shared.IDatabaseVersionHolder. This property is only used if the data source is an internal database where DSS takes care of creating and migrating the database.
  • databaseEngineCode: Mandatory property specifying the database engine. Currently only postgresql is supported.
  • basicDatabaseName: Mandatory property specifying the first part of the database name.
  • databaseKind: Mandatory property specifying the second part of the database name. The full database name reads <basicDatabaseName>_<databaseKind>.
  • scriptFolder: Folder containing database schema SQL scripts. This property is mandatory for internal databases. For external databases it is ignored.
  • urlHostPart: Optional host part of the database URL. This is the host name with an optional port number. The default value is a standard value for the selected database engine, assuming that the database server and DSS are running on the same machine.
  • owner: Owner of the database <basicDatabaseName>_<databaseKind>. Default: the user who started up DSS.
  • password: Owner password.
  • adminUser: Administrator user of the database server. Default is defined by the selected database engine.
  • adminPassword: Administrator password.

Example:

version-holder-class = ch.systemsx.cisd.openbis.dss.etl.ImagingDatabaseVersionHolder
databaseEngineCode = postgresql
basicDatabaseName = imaging
urlHostPart = ${imaging-database.url-host-part:localhost}
databaseKind = ${imaging-database.kind:prod}
scriptFolder = ${screening-sql-root-folder:}sql/imaging

Simple data sources

If the database used in the Data Store Server is not managed by openBIS and doesn't follow openBIS conventions on versioning etc., it is possible to specify it with SimpleDataSourceFactory as in the following example:

factory-class = ch.systemsx.cisd.openbis.dss.generic.shared.SimpleDataSourceFactory
database-driver = oracle.jdbc.driver.OracleDriver
database-url = jdbc:oracle:thin:@test.test.com:1111:orcl
database-username = test
database-password = test
# database-max-idle-connections = 
# database-max-active-connections =
# database-max-wait-for-connection = # value in milliseconds
# database-active-connections-log-interval = # value in milliseconds
# validation-query = 

Reporting and Processing Plugins

The properties reporting-plugins and processing-plugins are comma-separated lists of reporting/processing plugin names. Each reporting plugin should be able to take a set of data sets chosen by the user and produce a tabular result in a short time. The result table is shown to the user and can be exported to a tab-separated file. A processing plugin is similar, but it does not create data that is immediately presented to the user. It can be used to do some time-consuming processing for the data sets selected by the user. Often, processing plugins inform the user via e-mail after finishing.

The names of all configuration properties for a particular reporting/processing plugin start with <plugin name>.. The following properties are understood:

  • label: Label of the plugin which will be shown to the users. This property is mandatory.
  • dataset-types: Comma-separated list of data set type codes which can be handled by this plugin. This property is mandatory.
  • class: Fully qualified Java class name of the plugin. It has to implement IReportingPluginTask/IProcessingPluginTask. This property is mandatory.
  • properties-file: The property file. Its content will be passed as a parameter to the plugin.
  • servlet.class: Fully qualified Java class name of an optional servlet needed by the plugin.
  • servlet.path: Path pattern relative to the DSS application URL to which the servlet is bound. This property is mandatory if servlet.class is specified.
  • servlets: Comma-separated list of names of servlets needed by the plugin. For each name a set of properties with prefix <name>. is assumed. All of them have the same semantics as the properties starting with servlet..
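
A processing plugin is configured in the same way as the reporting plugin example in the service.properties above; the following is a minimal sketch (the plugin name 'demo-processor' and the demo class name are illustrative assumptions):

# Comma separated names of processing plugins. Each plugin has properties prefixed with its name.
processing-plugins = demo-processor
demo-processor.label = Demo Processing
demo-processor.dataset-types = UNKNOWN
demo-processor.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.demo.DemoProcessingPlugin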

Default Plugin used in Data View

In the detail view of a data set in the openBIS application the Data View section by default shows 'Files (Smart View)'. If there are reporting plugins specified for the data set's type, users can query a reporting plugin whose results will be shown in the Data View. To show reporting plugin results by default for specific data set types (instead of 'Files (Smart View)'), the name of the plugin should start with the default- prefix, e.g. default-tsv-viewer could be specified as the default plugin for data sets of type TSV.

  • If for a certain data set type more than one reporting plugin is specified as default, then the first plugin (ordering plugin labels alphabetically) will be chosen by default.
  • If you have two data set types, e.g. TSV and CSV, and want to use the same plugin implementation for both but choose it by default only for data set type TSV, you need to configure two separate plugins. Both configurations should have the same class property value but different dataset-types; additionally, the plugin configuration for data sets of type TSV should have a name starting with the default- prefix, as in the sketch below.
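
A minimal sketch of such a configuration, assuming both types should be rendered with the TSVViewReportingPlugin and only TSV gets it by default (the plugin names are illustrative):

reporting-plugins = default-tsv-viewer, csv-viewer
default-tsv-viewer.label = TSV View
default-tsv-viewer.dataset-types = TSV
default-tsv-viewer.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
csv-viewer.label = CSV View
csv-viewer.dataset-types = CSV
csv-viewer.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
# values in CSV files are separated by commas instead of tabs
csv-viewer.separator = ,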

Existing reporting plugins

TSV Viewer
Description

Allows the content of the Main Data Set to be shown as an openBIS table. To become a Main Data Set, a file must be either the only file in the data set directory or must match a regular expression specific to the given data set type. Such a regular expression can be configured by the administrator under Data Set -> Types -> Edit -> Main Data Set Pattern. This is one of a few reporting plugins working on files with tabular data such as TSV (tab-separated value) files, CSV (comma-separated value) files or Excel files (supported extensions: XLS, XLSX).

Configuration
# Add plugin id to the reporting-plugins, in our case: tsv-viewer
reporting-plugins = tsv-viewer
# Set the label, that will be visible to the OpenBIS users
tsv-viewer.label = TSV View
# Specify data set types that should be viewable with TSV Viewer
tsv-viewer.dataset-types = TSV
tsv-viewer.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
tsv-viewer.properties-file =
# Optional properties:
# - separator of values in the file, significant for TSV and CSV files; default: tab
#tsv-viewer.separator = ;
# - whether lines beginning with '#' should be ignored by the plugin; default: true
#tsv-viewer.ignore-comments = false
# - excel sheet name or index (0 based) used for the excel file (.xls or .xlsx); default: 0 (first sheet)
#tsv-viewer.excel-sheet = example_sheet_name
Jython Based Reporting Plugin
Description

Creates a report based on a jython script. For more details see Jython-based Reporting and Processing Plugins.

Example
plugin.properties
label = My Report
class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.jython.JythonBasedReportingPlugin
dataset-types = MY_DATA_SET
script-path = script.py


Jython Based Aggregation Reporting Plugin
Description

Creates an aggregation report based on a jython script. Note, that property dataset-types isn't needed and will be ignored. Aggregation reporting plugins can be used only via the Query API. For more details see Jython-based Reporting and Processing Plugins.

Example
plugin.properties
label = My Report
class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.jython.JythonBasedAggregationServiceReportingPlugin
script-path = script.py


Reporting Plugin Decorator
Description

A reporting plugin which modifies the table produced by another reporting plugin. The modification is done by a transformation class which implements ch.systemsx.cisd.openbis.generic.shared.basic.ITableModelTransformation. Currently there is only the transformation ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.EntityLinksDecorator which turns cells of some columns into cells with links to a material or a sample.

Configuration

Here is an example of a configuration which decorates a TSV viewer plugin which produces a table with a material and a sample column:

label = Example with Materials and Samples
dataset-types = MY_DATA_SET_TYPE
class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.DecoratingTableModelReportingPlugin
# The actual reporting plugin which creates a table
reporting-plugin.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
reporting-plugin.separator = ,
# The transformation applied to the table returned by the actual reporting plugin
transformation.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.EntityLinksDecorator
# Link columns is a list of comma-separated column IDs which should be decorated. 
transformation.link-columns = GENE_ID, BARCODE
# Entity kind is either MATERIAL or SAMPLE
transformation.GENE_ID.entity-kind = MATERIAL
# The type of the material. Note, it is assumed that the column contains only the material code.
transformation.GENE_ID.material-type = GENE
transformation.BARCODE.entity-kind = SAMPLE
# Optional default property to be used if the column contains only sample codes instead of full identifiers.
transformation.BARCODE.default-space = DEMO


Archiver Plugin

The archiver is an optional plugin used to archive and unarchive data sets. It receives a set of data sets chosen by the user or by the automatic archiver maintenance task, performs its task on the Data Store Server, and finally changes the status of the data sets in the openBIS database.

The archiver plugin is also needed if freshly registered data sets should immediately be archived (mainly for backup purposes). For more details see Post Registration and ArchivingPostRegistrationTask.

The names of all configuration properties for the archiver plugin start with archiver.. The following properties are understood:

  • archiver.class: The archiver class; see the sections below for examples.
  • archiver.share-finder.class: (optional) A Share Finder strategy selecting a destination share when unarchiving data sets. The default value of this property is ch.systemsx.cisd.openbis.dss.generic.shared.StandardShareFinder. Specific share finder properties must start with archiver.share-finder., e.g. archiver.share-finder.minimum-free-space-in-MB.
  • archiver.synchronize-archive: (optional) Indicates if data should be synchronized when the local copy is different from the one in the archive (default: true).
  • archiver.batch-size-in-bytes: (optional) Data sets will be archived in batches. The data sets will be split into batches of roughly the same size (in bytes), controlled by the value of this property. Default is 1 gigabyte.
  • archiver.pause-file: Path (absolute or relative to the store root) of an empty file. If this file is present, starting archiving/unarchiving will be paused until the file has been removed. This property is useful for archiving media/facilities with maintenance downtimes. Default value: pause-archiving.
  • archiver.pause-file-polling-time: Time interval between two checks whether the pause file still exists. Default value: 10 min.

Rsync Archiver

Rsync Archiver (ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.RsyncArchiver) is an example archiver implementation that stores the archived data in a specified destination folder. The destination folder doesn't have to be on the same file system as the data store. It can also be:

  • a mounted remote folder,
  • a remote folder accessible via SSH (add the <hostname>: prefix to the folder path in the configuration),
  • a remote folder accessible via an rsync server (add the <hostname>:<rsync module name> prefix to the folder path in the configuration).

Apart from standard archiver task properties the following specific properties are understood:

  • destination: Path to the destination folder where archived data sets will be stored.
  • rsync-password-file: (optional) Path to the password file used when an rsync module is specified.
  • find-executable: (optional) Path to the GNU find executable that is used to verify data integrity. If not specified, the default find executable found on the system is used, which should be the right one on most Linux distributions, but not on Mac OS.
  • timeout: (optional) Network I/O timeout in seconds. The default value for this parameter is 15 seconds.
  • only-mark-as-deleted: A flag which tells the archiver whether to delete data sets or only mark them as deleted in the archive. In the second case a marker file (its file name is the data set code) will be added to the folder <destination>/DELETED. Default value for this parameter is true.
  • verify-checksums: (optional) This flag specifies if a CRC32 checksum check should be performed. The default is true.
  • batch-size-in-bytes: (optional) Allows controlling when the archiving status of just-archived data sets is updated: this happens when the sum of data set sizes exceeds the specified threshold. Default value is 1 GB.
  • temp-folder: (optional) Temporary folder to be used in the sanity check of archived data sets on an unmounted remote archive location. Default is the data store itself. In case of the TarArchiver this temporary folder is also used to unpack the TAR archive file for the sanity check.

Example configuration:

archiver.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.RsyncArchiver
archiver.share-finder.class = ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder
archiver.destination = hostname:/path/to/destination/folder
archiver.timeout = 20

Zip Archiver

This archiver (ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.ZipArchiver) archives data sets in ZIP files together with the meta data (including properties) of the data set, the experiment and optionally the sample to which the data set belongs. The meta data of the container data set (if present) are also included (with properties, experiment and sample). The meta data are stored in the tab-separated file meta-data.tsv inside the ZIP file. Each data set will be stored in one ZIP file named <data set code>.zip.

The location of the ZIP file is specified by a mapping file which allows specifying archive folders in accordance with the experiment/project/space identifier of the data set. Note that all archive folders have to be on mounted disks. For more details about the syntax and resolving rules see Mapping File for Share Ids and Archiving Folders.

Note, that this archiver doesn't work properly if HDF5 files are treated as folders.

Apart from standard archiver task properties the following specific properties are understood:

  • only-mark-as-deleted: A flag which tells the archiver whether to delete data sets or only mark them as deleted in the archive. In the second case a marker file (its file name is the data set code) will be added to the folder <archive folder>/DELETED. Default value for this parameter is true.
  • verify-checksums: (optional) This flag specifies if a CRC32 checksum check should be performed. The default is true.
  • default-archive-folder: The path to the archive folder which is used if no mapping has been specified or an appropriate archive folder couldn't be found in the mapping file. This is a mandatory property.
  • default-small-data-sets-archive-folder: The path to the archive folder which is used for small data sets only. Which data sets are considered "small" is controlled by the small-data-sets-size-limit property. When the small data sets folder is defined, the small data sets limit has to be set as well. This folder is used only when the mapping file hasn't been specified or an appropriate archive folder couldn't be found in the mapping file. This is an optional property.
  • small-data-sets-size-limit: Controls which data sets are considered "small". Data sets whose size is smaller than or equal to this value are treated as "small"; data sets whose size is greater are treated as "big". The limit is expressed in kilobytes, e.g. a value of 1024 means that data sets of up to 1 MB are considered "small". The limit is used when choosing between "default-archive-folder" and "default-small-data-sets-archive-folder" or between the "big" and "small" folders defined in the mapping file.
  • mapping-file: Path to the mapping file. This is an optional property. If not specified, the default-archive-folder will be used for all data sets.
  • mapping-file.create-archives: If true, the archive folders specified in the mapping file will be created if they do not exist. Default value is false.
  • compressing: If true, compression is used when creating the archived data set. Otherwise an uncompressed ZIP file is created. Default value is true.
  • with-sharding: If true, the path of the ZIP file is <archive folder>/<data store UUID>/<sharding levels>/<data set code>/<data set code>.zip. Otherwise the path reads <archive folder>/<data set code>.zip. Default value is false.
  • ignore-existing: If true, data sets that already exist in the archive (the ZIP file exists and is not empty) are ignored and not copied to the archive again. If false, data sets are always copied to the archive without checking if they exist. Default value is false.

Note, that the property synchronize-archive will be ignored: An already archived data set will be archived again if archiving has been triggered.

It is recommended to use the MappingBasedShareFinder as the share finder for unarchiving.
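
A possible configuration, sketched under the assumption that the archive is mounted at /mnt/archive and that the mapping file lives in the etc folder (both paths, and the way the share finder picks up its mapping file via the prefixed property, are illustrative assumptions):

archiver.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.ZipArchiver
archiver.default-archive-folder = /mnt/archive
archiver.mapping-file = etc/archive-mapping.txt
archiver.with-sharding = true
archiver.share-finder.class = ch.systemsx.cisd.openbis.dss.generic.shared.MappingBasedShareFinder
archiver.share-finder.mapping-file = etc/archive-mapping.txt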

Tar Archiver

This archiver (ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TarArchiver) archives data sets in TAR files. It is very similar to the ZIP archiver. It accepts all the configuration properties that the ZIP archiver accepts, except for the "compressing" property.
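
Switching from ZIP to TAR is then just a matter of exchanging the archiver class; a minimal sketch (the folder path is an illustrative assumption):

archiver.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TarArchiver
archiver.default-archive-folder = /mnt/archive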

Automatic Archiver

Automatic Archiver (ch.systemsx.cisd.etlserver.plugins.AutoArchiverTask) is a maintenance task scheduled for repeated execution, beginning at the specified start time (optional; by default it is executed without delay after the server starts), with regular intervals between subsequent executions.

Apart from standard maintenance task properties the following specific properties (all optional) are understood:

  • data-set-type: Data set type code of the data sets that will be archived. By default all types will be archived.
  • older-than: Only data sets that are older than the specified number of days will be archived (default = 30).
  • policy.class: Fully qualified class name of a policy that additionally filters the data sets to be archived. A specific policy can use additional properties with the policy prefix.
  • remove-datasets-from-store: Indicates whether data sets will be removed from the data store upon archiving (default = false).

By default all data sets with status 'AVAILABLE' will be archived. Power users can prevent auto archiving of data sets by changing their status to 'LOCKED'.

Automatic Experiment Archiver

This is a maintenance task (ch.systemsx.cisd.etlserver.plugins.ExperimentBasedArchivingTask) which archives whole experiments if the free disk space is below a configured threshold. Archiving a whole experiment means that all data sets of this experiment are archived.

Like any maintenance task, it is scheduled for repeated execution, beginning at the specified start time (optional; by default it is executed without delay after the server starts), with regular intervals between subsequent executions.

The archiver archives one or more experiments, starting with the oldest one. It stops when either a fixed number of experiments have been archived or the free disk space is above the threshold. Experiments are not archived if at least one of their data sets is in the state LOCKED. The age of an experiment is defined by its youngest non-archived data set.

Apart from standard maintenance task properties the following specific properties are understood:

Property name

Description

monitored-dir

This mandatory property specified the directory of the store whom's disk space is monitored. In the case the configured free space provider is PostgresPlusFileSystemFreeSpaceProvider, the monitored directory must be on the same physical hard disk as the monitored database.

minimum-free-space-in-MB

This mandatory property specifies the threshold (in MB) for free disk space on the share defined by monitored-share. If free disk space is below this threshold the archiving task will start archiving.

excluded-data-set-types

A comma-separated list of data set types. Data sets of those types will not be archived. They will not be used for calculating the age of an experiment.

free-space-provider.class

The classname of the free space provider to be used. The default is ch.systemsx.cisd.common.filesystem.SimpleFreeSpaceProvider

free-space-provider.*

Properties for the configured free space provider

estimated-data-set-size-in-KB.<data-set-type>

Provides an estimation of the average size in kilobytes for a given data set type.

estimated-data-set-size-in-KB.DEFAULT

Default estimation of the data set size. If no specific estimated-data-set-size-in-KB.<data-set-type> is present, this property is used to determine a data set size prior to archiving. The property is optional.
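
A configuration sketch for this task; the plugin name experiment-archiver, the interval, the monitored directory, the excluded types and the size values are illustrative:

service.properties
maintenance-plugins = experiment-archiver, <other maintenance plugins>

experiment-archiver.class = ch.systemsx.cisd.etlserver.plugins.ExperimentBasedArchivingTask
experiment-archiver.interval = 86400
experiment-archiver.monitored-dir = data/store
experiment-archiver.minimum-free-space-in-MB = 100000
experiment-archiver.excluded-data-set-types = THUMBNAILS, OVERVIEWS
experiment-archiver.estimated-data-set-size-in-KB.DEFAULT = 500000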

Archive clean-up

Archives are not automatically updated when deleting data sets in openBIS. To purge archives from deleted data sets one has to configure a clean-up maintenance task - ch.systemsx.cisd.etlserver.plugins.DeleteFromArchiveMaintenanceTask.

Apart from the standard maintenance task properties the following specific properties will be accepted by the task:

Property name

Description

delay-after-user-deletion

Only data sets that have been deleted more than delay-after-user-deletion minutes ago will be deleted from the archive. Defaults to 0 (immediate deletion).

status-filename

A mandatory property specifying a file in which the task tracks the last processed deletion. The path should be outside the DSS installation in order to survive DSS update installations.
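
A configuration sketch; the plugin name archive-cleanup, the interval and the paths are illustrative:

service.properties
maintenance-plugins = archive-cleanup, <other maintenance plugins>

archive-cleanup.class = ch.systemsx.cisd.etlserver.plugins.DeleteFromArchiveMaintenanceTask
archive-cleanup.interval = 3600
# wait one day after deletion in openBIS before purging the archive
archive-cleanup.delay-after-user-deletion = 1440
archive-cleanup.status-filename = ../../archive-cleanup-status.txt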

Free space providers

SimpleFreeSpaceProvider

The class ch.systemsx.cisd.common.filesystem.SimpleFreeSpaceProvider returns the free space on the file system of a hard drive. It works similarly to the command

$ df -h

PostgresPlusFileSystemFreeSpaceProvider

The class ch.systemsx.cisd.openbis.dss.generic.shared.utils.PostgresPlusFileSystemFreeSpaceProvider returns the free space on a hard drive as the sum from the file system free space and the free space detected in a PostgreSQL data source. Note that the data source is required to have the PostgreSQL extension pgstattuple installed.

The following configuration properties are available for PostgresPlusFileSystemFreeSpaceProvider:

Property Name

Description

monitored-data-source

The name of the data source to monitor

execute-vacuum

When set to true a PostgreSQL VACUUM command will be executed before the provider tries to calculate the free space.
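
A sketch showing how this provider could be plugged into the Automatic Experiment Archiver described above; the plugin name experiment-archiver and the data source name are illustrative:

service.properties
experiment-archiver.free-space-provider.class = ch.systemsx.cisd.openbis.dss.generic.shared.utils.PostgresPlusFileSystemFreeSpaceProvider
experiment-archiver.free-space-provider.monitored-data-source = path-info-db
experiment-archiver.free-space-provider.execute-vacuum = true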

Data Set Validators

Data Set Validators are used to accept or reject incoming data sets. They are specific for a data set type.

The property data-set-validators is a comma-separated list of validator names. The names of all configuration properties for a particular validator start with <validator name>.. For an example, see the service.properties example mentioned above. The following properties are understood:

Property name

Default value

Description

data-set-type


Mandatory data set type. The validator will be used only for data sets of this type.

validator

ch.systemsx.cisd.etlserver.validation.DataSetValidatorForTSV

Fully-qualified name of a Java class implementing IDataSetValidator. This class must have a public constructor with a Properties object as an argument.
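
A minimal registration sketch for a validator; the validator name tsv-validator and the data set type are illustrative:

service.properties
data-set-validators = tsv-validator

tsv-validator.data-set-type = HCS_IMAGE
tsv-validator.validator = ch.systemsx.cisd.etlserver.validation.DataSetValidatorForTSV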

Data Set Validator for TSV Files

The default validator is DataSetValidatorForTSV. It is able to validate TAB-separated value (TSV) files. It is assumed that the first line of such files contains the column headers. All headers should be unique (i.e. no duplicated column headers). This validator understands the following properties:

Property name

Default value

Description

path-patterns

*

List of comma-separated wild-card patterns for paths of files in the data set to be validated. The characters * and ? are used in file names and parent folder names with the usual meaning (i.e. * = any number of any character, ? = exactly one unspecified character). The character sequence **/ is used to indicate any number of subdirectories. For example, **/*.txt means all files of file type txt anywhere in the data set.

columns


List of comma-separated names of column definitions. These column definitions are used

  • to identify the columns in a TSV file and
  • to validate the values of the columns.

For each name configuration properties are defined. The property names start with <column definition name>..

A column definition understands the following properties:

Property name

Default value

Description

mandatory

false

If true, a column matching this definition has to be present in the TSV file.

order

undefined

Optional property. It specifies a column in the TSV file at a particular position. 1 means first column, 2 means second column etc.

can-define-multiple-columns

false

If true more than one column with this definition can appear in the TSV file. This property will be ignored if order is defined or if mandatory = true.

header-validator

undefined

Fully-qualified name of a Java class implementing ch.systemsx.cisd.etlserver.validation.IColumnHeaderValidator. It is used to either check that the header (in case order is specified) is valid or to find for a header a matching column definition. This class must have a public constructor with a Properties object as an argument.

header-pattern


This property is mandatory if header-validator isn't specified. It specifies a regular expression that the column header has to match (regular expression syntax as specified in http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html).

value-validator

ch.systemsx.cisd.etlserver.validation.DefaultValueValidatorFactory

Fully-qualified name of a Java class implementing ch.systemsx.cisd.etlserver.validation.IValidatorFactory. It generates the validator which is used to validate the values of a column in the TSV file. This class must have a public constructor with a Properties object as an argument.
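
A sketch of column definitions for the tsv-validator introduced above, assuming the default value validator factory (described in the next section) reads its value-* properties from the same column prefix; the column names, patterns and ranges are illustrative:

service.properties
tsv-validator.path-patterns = **/*.tsv
tsv-validator.columns = id, measurement
tsv-validator.id.mandatory = true
tsv-validator.id.order = 1
tsv-validator.id.header-pattern = ID
tsv-validator.id.value-type = unique
tsv-validator.measurement.can-define-multiple-columns = true
tsv-validator.measurement.header-pattern = M_[0-9]+
tsv-validator.measurement.value-type = numeric
tsv-validator.measurement.value-range = [0,Infinity)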

DefaultValueValidatorFactory

The default factory can generate four different types of value validators. It understands the following properties:

Property name

Default value

Description

value-type

any

Type of the value validator. The following values are allowed:

  • any: Any value is allowed.
  • numeric: Value has to be a number.
  • string: Value has to match a regular expression.
  • unique: As string. In addition the column values have to be unique.
  • unique_groups: As string. In addition selected parts of column values have to be unique.

allow-empty-values

false

If true empty cell values are allowed. This property is ignored if value-type is any, unique or unique_groups.

empty-value-synonyms


An optional property of a comma-separated list of strings which are handled as synonyms for an empty value. This property is only used if allow-empty-values is set to true.

value-pattern


Regular expression for a validator of type string, unique, or unique_groups. It is mandatory only for string and unique_groups.

value-range


Optional range definition for validators of type numeric. The syntax is as follows

('('|'[') (-Infinity|<floating point number>) ',' (Infinity|<floating point number>) (']'|')')

Examples:

(0,3.14159]
[0,Infinity)
(-Infinity,1e-18]

groups


Mandatory property for validators of type unique_groups. A part of the value-pattern is a group when it is enclosed in parentheses.

value-type = unique_groups
value-pattern = (a*)(b*)c*(d*)
groups = 1,3

The number of a group equals the number of '(' characters present in the regex up to and including the group's own opening parenthesis.

pattern: (A(B))C(D(E))
available groups:
1 (A(B))
2 (B)
3 (D(E))
4 (E)
HeaderBasedValueValidatorFactory

This value validator factory (fully qualified class name: ch.systemsx.cisd.etlserver.validation.HeaderBasedValueValidatorFactory) is a collection of column-header-specific validator factories. The factory whose regular expression matches the header first is chosen. HeaderBasedValueValidatorFactory understands the following properties:

Property name

Description

header-types

List of comma-separated unique names of value validator factories. For each name configuration properties are defined. The property names start with <factory name>.. For an example see service.properties example above.

<factory name>.header-pattern

Regular expression to be matched by column header.
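
A sketch, assuming the factory is configured as the value-validator of a column definition and reads its properties from that column's prefix; the factory names, patterns and value types are illustrative:

service.properties
tsv-validator.measurement.value-validator = ch.systemsx.cisd.etlserver.validation.HeaderBasedValueValidatorFactory
tsv-validator.measurement.header-types = id-like, numeric-like
tsv-validator.measurement.id-like.header-pattern = ID.*
tsv-validator.measurement.id-like.value-type = unique
tsv-validator.measurement.numeric-like.header-pattern = VALUE.*
tsv-validator.measurement.numeric-like.value-type = numeric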

Remote DSS

More than one DSS may be connected to the same AS. In this case a DSS might request files remotely from another DSS. This can happen in the FTP/SFTP server and in aggregation/ingestion services. The files downloaded from a remote DSS will be cached. By default the cache is located in the data folder of a standard installation (data/dss-cache). Its default maximum size is 1GB. These parameters can be changed by the following properties in service.properties:

Property name

Default value

Description

cache-workspace-folder

../../data/dss-cache

Folder which will contain the cached files.

cache-workspace-max-size

1024

Maximum size of the cache in MB.

cache-workspace-min-keeping-time

1 day

Minimum time a data set is kept in the cache. Can be specified with one of the following time units: ms, msec, s, sec, m, min, h, hours, d, days. Default time unit is sec.

Files are removed from the cache if the downloaded file would exceed the size specified by cache-workspace-max-size. Removal is guided by the following rules:

  • Removal happens after the downloading of a file which isn't in the cache.
  • Either all cached files of a data set are removed or none.
  • The "oldest" data set is removed first. The age of a data set is determined by the last time a cached file in this data set has been requested.
  • Data sets are not removed when they are younger than specified by cache-workspace-min-keeping-time.


ETL Threads

The property inputs is a comma-separated list of ETL threads. Each ETL thread registers incoming data set files/folders at openBIS and stores them in the data store (property storeroot-dir).

The name of all configuration properties for a particular ETL thread starts with <thread name>.. The following properties are understood:

Property name

Default value

Description

incoming-dir


The drop box for incoming data sets.

incoming-data-completeness-condition

marker-file

Condition which determines when an incoming data set should be considered to be complete and ready for processing. Only the following values are allowed:

  • auto-detection: Assumes that the data set is complete if, after some quiet period (global property quiet-period), no writing operation has been detected.
  • marker-file: A marker file has been detected. The marker file should start with .MARKER_is_finished_ followed by the name of the data set file/folder.

delete-unidentified

false

If true incoming data sets are deleted if they couldn't be identified. If false unidentified data sets are moved to the folder unidentified inside the store.

data-set-info-extractor


Extractor which extracts from the data set file/folder the information necessary for registering it in openBIS. For more details, see below.

type-extractor


Extractor which extracts various type information from the data set file/folder. For more details, see below.

storage-processor


Processor which processes the incoming data set file/folder and stores it in the store. For more details, see below.

pre-registration-script


Path to the script that will be executed before data set registration. The script will be called with two parameters: <data-set-code> and <absolute-data-set-path> (in the incoming dropbox).

post-registration-script


Path to the script that will be executed after successful data set registration. The script will be called with two parameters: <data-set-code> and <absolute-data-set-path> (in the data store).

incoming-share-id

Id of the incoming share the dropbox drops into. The share can be on a different disk than the incoming folder.
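
A sketch of a single-thread configuration combining these properties; the thread name main-thread, the paths and the extractor settings are illustrative (the extractor and processor classes are described in the following sections):

service.properties
inputs = main-thread

main-thread.incoming-dir = data/incoming
main-thread.incoming-data-completeness-condition = marker-file
main-thread.delete-unidentified = false
main-thread.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
main-thread.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
main-thread.type-extractor.data-set-type = UNKNOWN
main-thread.type-extractor.file-format-type = PROPRIETARY
main-thread.storage-processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor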

Data Set Info Extractor

Each thread has a property <thread name>.data-set-info-extractor which denotes a Java class implementing IDataSetInfoExtractor.

The most important implementation is ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor which extracts most information from the name of the data set file/folder. It understands the following properties (all with prefix <thread name>.data-set-info-extractor.):

Property name

Default value

Description

space-code


Space code of the sample. If unspecified a shared sample is assumed.

strip-file-extension

false

If true the file extension will be removed before extracting information from the file name.

entity-separator

.

Character which separates entities in the file name. Whitespace characters are not allowed.

sub-entity-separator

&

Character which separates sub entities of an entity. Whitespace characters are not allowed.

index-of-sample-code

-1

Index of the entity which is interpreted as the sample code. Data set belongs to a sample.

index-of-experiment-identifier


Index of the entity which is interpreted as the experiment identifier. It contains the project code, experiment code, and optionally the space code (if different from the property space-code). These codes are concatenated by the sub-entity separator. If index-of-experiment-identifier is specified, index-of-sample-code will be ignored and the data set belongs directly to an experiment.

index-of-parent-data-set-codes


Index of the entity which is interpreted as a sequence of parent data set codes. The codes have to be separated by the sub-entity separator. If not specified, no parent data set codes will be extracted. Parent data set codes will be ignored if index-of-experiment-identifier isn't specified.

index-of-data-producer-code


Index of the entity which is interpreted as the data producer code. If not specified no data producer code will be extracted.

index-of-data-production-date


Index of the entity which is interpreted as the data production date. If not specified no data production date will be extracted.

data-production-date-format

yyyyMMddHHmmss

Format of the data production date. For the correct syntax see SimpleDateFormat

data-set-properties-file-name


Path to a file inside a data set folder which contains data set properties. This file has to be a tab-separated file. The first line of the file should contain the column definitions: property and value. Example:

data-set-properties.tsv
property	value
DESCRIPTION	Description of data set series
SCHEMA_VERSION	21.3
SERIES	2009-04-01

Entity indexes count 0, 1, 2, etc. from left and -1, -2, -3, etc. from right.
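
As an illustration of how the indexes are applied; the thread name, file name, codes and date below are made up, not from the original guide:

service.properties
main-thread.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
main-thread.data-set-info-extractor.entity-separator = .
main-thread.data-set-info-extractor.sub-entity-separator = &
main-thread.data-set-info-extractor.index-of-experiment-identifier = 0
main-thread.data-set-info-extractor.index-of-data-producer-code = 1
main-thread.data-set-info-extractor.index-of-data-production-date = -1

# With this configuration an incoming folder named
#   MY_SPACE&MY_PROJECT&EXP1.PRODUCER1.20230101120000
# would be assigned to experiment /MY_SPACE/MY_PROJECT/EXP1 with
# data producer PRODUCER1 and production date 2023-01-01 12:00:00.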

Type Extractor

Each thread has a property <thread name>.type-extractor which denotes a Java class implementing ITypeExtractor.

ch.systemsx.cisd.etlserver.SimpleTypeExtractor is one of the available type extractors. Actually it doesn't extract anything from the incoming data set. It takes all information from the following properties (all with prefix <thread name>.type-extractor.):

Property name

Default value

Description

data-set-type


A data set type as registered in openBIS.

file-format-type


A file type as registered in openBIS.

is-measured

true

Whether the data set is a measured one or a derived/calculated one.


Storage Processor

Each thread has a property <thread name>.storage-processor which denotes a Java class implementing IStorageProcessor.

The most important storage processors are ch.systemsx.cisd.etlserver.DefaultStorageProcessor and ch.systemsx.cisd.etlserver.imsb.StorageProcessorWithDropbox. The latter extends a storage processor with the additional behavior of copying the original data set file/folder to a configurable folder (drop box). It understands the following properties (all with prefix <thread name>.storage-processor.):

Property name

Default value

Description

processor


Java class of the storage processor which will be extended.

dropbox-dir


Folder to which the original data set file/folder will be copied.

entity-separator

.

Separator character used to build the following file name for the data set file/folder in the drop box:
<original file name><entity-separator><data set code>.<original file type>
where <data set code> is the data set code of the data set after it has been registered in openBIS.
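
A sketch of wrapping the default storage processor with the dropbox variant; the thread name and the dropbox path are illustrative:

service.properties
main-thread.storage-processor = ch.systemsx.cisd.etlserver.imsb.StorageProcessorWithDropbox
main-thread.storage-processor.processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor
main-thread.storage-processor.dropbox-dir = data/export-dropbox
main-thread.storage-processor.entity-separator = .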

Dataset handler

Each thread has a property <thread name>.dataset-handler which denotes a Java class implementing IDataSetHandler.
This property is optional, the default dataset handler is used if it is not specified.

This class makes it possible to decide on a high level how to handle an incoming data set file or directory.
It can delegate its job to the default dataset handler, performing some operations beforehand or afterwards.
It also makes it possible to handle datasets residing in a particular directory structure.

Monitoring Thread Activity

Each thread regularly monitors activity on a directory. In the datastore server installation directory, there is a subdirectory .activity/ which contains an (empty) file for each thread. The file gets 'touched' each time the thread starts a processing round. Thus, by looking at the time stamp of the corresponding file, an administrator can find out when the last processing round of this thread took place. This can help spot long-running data ingestion processes quickly and can also help to check whether a thread is 'hanging', e.g. because of an erroneous dropbox script.

Monitoring User Session Activity

It's possible to monitor the growth of the openBIS user sessions. If the number of sessions exceeds the specified threshold, a notification is sent to the admin. This behavior is controlled by two properties (defined in the openBIS service.properties file); a configuration sketch follows the list below.

  • session-notification-threshold (by default it is 0, which means the feature is off)
  • session-notification-delay-period (how much time should pass between two notifications, expressed in seconds)
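
A sketch of such a configuration in the openBIS AS service.properties; the values are illustrative:

service.properties
# notify the admin once there are more than 200 sessions, at most once per hour
session-notification-threshold = 200
session-notification-delay-period = 3600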

Post Registration

It is possible to perform a sequence of tasks after registration of a data set. This is done by the maintenance task ch.systemsx.cisd.etlserver.postregistration.PostRegistrationMaintenanceTask. This task figures out freshly registered data sets and runs for each data set a sequence of configurable post-registration tasks. It should be configured with a short time interval (e.g. 60 seconds) in order to do post-registration eagerly. It knows the following configuration properties:

Property

Description

interval

Time interval (in seconds) at which post-registration is performed.

cleanup-tasks-folder

Path to a folder where serialized cleanup tasks are stored. If the DSS crashes, the cleanup tasks in this folder will be executed after start up. Default value is clean-up-tasks.

last-seen-data-set-file

Path to a file which will store the technical ID of the last data set being post-registered. The path should be outside the DSS installation in order to survive DSS update installations.

ignore-data-sets-before-date

Optional date; data sets registered before this date are ignored. The format is <4-digit year>-<2-digit month>-<2-digit day>. Example: 2011-03-29

post-registration-tasks

Comma-separated list of names of post registration tasks. The tasks are performed in the order they are specified. Each name is the sub-key of the properties for the tasks. All tasks should have at least the key <task name>.class which denotes the fully-qualified name of a Java class implementing ch.systemsx.cisd.etlserver.postregistration.IPostRegistrationTask.

Example:

service.properties
maintenance-plugins = <other maintenance plugins>, post-registration

post-registration.class = ch.systemsx.cisd.etlserver.postregistration.PostRegistrationMaintenanceTask
post-registration.interval = 60
post-registration.cleanup-tasks-folder = cleanup-tasks
post-registration.ignore-data-sets-before-date = 2011-03-27
post-registration.last-seen-data-set-file = ../../last-seen-data-set
post-registration.post-registration-tasks = eager-shuffling, notifying, eager-archiving
post-registration.eager-shuffling.class = ch.systemsx.cisd.etlserver.postregistration.EagerShufflingTask
post-registration.eager-shuffling.share-finder.class = ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder
post-registration.notifying.class = ch.systemsx.cisd.etlserver.postregistration.NotifyingTask
post-registration.notifying.message-template = data-set = ${data-set-code}\n\
                                               injection-volumn = ${property.inj-vol}\n
post-registration.notifying.destination-path-template = targets/ds-${data-set-code}.properties
post-registration.eager-archiving.class = ch.systemsx.cisd.etlserver.postregistration.ArchivingPostRegistrationTask

Currently three different post-registration tasks are available.

EagerShufflingTask

Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.EagerShufflingTask

This task tries to find a share in the segmented store to which it will move the freshly registered data set.

Following configuration properties are recognized:

Property

Description

share-finder.class

Fully qualified class name of the Java class implementing ch.systemsx.cisd.openbis.dss.generic.shared.IShareFinder. See also section Share Finder for available share finders.

free-space-limit-in-MB-triggering-notification

A free space threshold which leads to a notification of an admin via e-mail if this threshold has been crossed after adding a data set to a share.

stop-on-no-share-found

If true an exception is thrown if no share could be found. This interrupts the post-registration task. That is, subsequent tasks for the same data set as well as all tasks for all younger data sets are not performed.

verify-checksum

If true then checksum verification is performed for the moved data sets (default: true)
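
A sketch extending the eager-shuffling entry of the post-registration example above with the remaining properties; the values are illustrative:

service.properties
post-registration.eager-shuffling.class = ch.systemsx.cisd.etlserver.postregistration.EagerShufflingTask
post-registration.eager-shuffling.share-finder.class = ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder
post-registration.eager-shuffling.free-space-limit-in-MB-triggering-notification = 10000
post-registration.eager-shuffling.stop-on-no-share-found = false
post-registration.eager-shuffling.verify-checksum = true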

NotifyingTask

Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.NotifyingTask

This task writes a file with some information of the freshly registered data set.

Following configuration properties are recognized:

Property

Description

message-template

A template for creating the file content.

destination-path-template

A template for creating the path of the file.

include-dataset-type-patterns

Optional. Contains comma-separated patterns (Java regular expressions). If specified, only data sets whose type matches any of the patterns will be processed.

The templates can have placeholders of the form ${<placeholder name>}. The following placeholder names can be used: data-set-code and property.<data set property type code>, where the data set property type code can be in lower case. For newline/tab characters in the template use \n and \t, respectively.

ArchivingPostRegistrationTask

Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.ArchivingPostRegistrationTask

This task archives the freshly registered data set. The archived data set is still available. This task can be used for creating backups. Note that ArchiverPlugin has to be configured.

RequestArchivingPostRegistrationTask

Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.RequestArchivingPostRegistrationTask

This task sets the archiving-request flag of the freshly registered data set.

SecondCopyPostRegistrationTask

Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.SecondCopyPostRegistrationTask

This task creates a copy of all files of the freshly registered data set.

Following configuration properties are recognized:

Property

Description

destination

Path to the folder which will contain the copy. The files are created in the same folder structure as in a share folder.
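
A sketch, assuming the task is registered under the name second-copy in the post-registration task list; the destination path is illustrative:

service.properties
post-registration.post-registration-tasks = second-copy, <other tasks>
post-registration.second-copy.class = ch.systemsx.cisd.etlserver.postregistration.SecondCopyPostRegistrationTask
post-registration.second-copy.destination = /backup/openbis-second-copy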

Path Info Database

Some data sets consist of a large number of files. Navigating in such a large file set can be slow, especially for NFS-based storage. In order to speed up navigation a so-called path info database can be used. Three or four things have to be configured to run it properly: a data source, a post-registration task, a deletion maintenance task, and an optional migration task. Here is an example configuration:

service.properties
# ---------------------------------------------------------------------------
# Data sources
data-sources = path-info-db, <other data sources>

# Data source for pathinfo database
path-info-db.version-holder-class = ch.systemsx.cisd.openbis.dss.generic.shared.PathInfoDatabaseVersionHolder
path-info-db.databaseEngineCode = postgresql
path-info-db.basicDatabaseName = pathinfo
path-info-db.databaseKind = productive
path-info-db.scriptFolder = datastore_server/sql
# Use this if you need to give the data source db superuser privileges
#path-info-db.owner = postgres

# ---------------------------------------------------------------------------
# Maintenance Tasks
maintenance-plugins = post-registration, path-info-deletion, path-info-feeding, <other maintenance tasks>

# ---------------------------------------------------------------------------
# Post registration task: Feeding path info database with freshly registered data sets.
post-registration.class = ch.systemsx.cisd.etlserver.postregistration.PostRegistrationMaintenanceTask
post-registration.interval = 30
post-registration.cleanup-tasks-folder = ../../cleanup-tasks
post-registration.ignore-data-sets-before-date = 2011-01-27
post-registration.last-seen-data-set-file = ../../last-seen-data-set-for-postregistration.txt
post-registration.post-registration-tasks = pathinfo-feeding
post-registration.pathinfo-feeding.class = ch.systemsx.cisd.etlserver.path.PathInfoDatabaseFeedingTask

# ---------------------------------------------------------------------------
# Deletion task: Remove entries from path info database after data set deletion
path-info-deletion.class = ch.systemsx.cisd.etlserver.plugins.DeleteFromExternalDBMaintenanceTask
path-info-deletion.interval = 120
path-info-deletion.data-source = path-info-db
path-info-deletion.data-set-table-name = data_sets
path-info-deletion.data-set-perm-id = CODE

# ---------------------------------------------------------------------------
# Migration task: Initial feeding of path info database with existing data sets.
path-info-feeding.class = ch.systemsx.cisd.etlserver.path.PathInfoDatabaseFeedingTask
path-info-feeding.execute-only-once = true

Remarks:

  • The name of the data source has to be path-info-db.
  • The migration task can be removed from the configuration file after successful execution.

File system view of the DSS

The DSS can expose its data store contents via a read-only interface served as an FTP/SFTP server. The following chapters describe how to set up FTP/SFTP. The interface exposes a file system view of the DSS which has a basic structure and can be extended using properties.

Basic hierarchy

The structure of the file system view is organized like a classical openBIS hierarchy:
/{SPACE}/{PROJECT}/{EXPERIMENT}/{DATA-SET-CODE}/{data-set-files-and-directories}

Plugins

It's possible to register plugins to the file system view. Such a plugin must be able to resolve paths into a response that is either a listing of directory contents (files and directories) or a file to download.

To create a plugin it's necessary to create a new core plugin of type file-system-plugins with the definition of a custom view. If there are any file system plugins defined, then each plugin's hierarchy is visible behind a separate top-level directory and "DEFAULT" is used for the basic hierarchy.

In the plugin properties it's necessary to specify a class implementing the IResolverPlugin interface, e.g.

plugin.properties
resolver-class = ch.systemsx.cisd.openbis.dss.generic.server.fs.plugins.JythonResolver
code = jython-plugin
script-file = script.py

For a Jython resolver it's required to implement a method resolve(pathItems, context) with the signature defined in IResolver.

script.py
import time

def resolve(subPath, context):
    api = context.getApi()  # the way to access objects from the openBIS V3 API
    session_token = context.getSessionToken()

    if len(subPath) == 0:
        # how to create a directory listing with files and directories
        directory_listing = context.createDirectoryResponse()
        directory_listing.addFile("someFile.txt", 0, time.time())
        directory_listing.addDirectory("someDirectory", time.time())
        return directory_listing

    if len(subPath) == 1 and subPath[0] == "someFile.txt":
        # this is how to handle a request to download a file from an openBIS data set
        content = context.getContentProvider().asContent("DATA_SET_CODE")
        node = content.tryGetNode("original/someFile.txt")
        return context.createFileResponse(node, content)

    # this is how to respond to error paths
    return context.createNonExistingFileResponse("Path not supported")

See http://svncisd.ethz.ch/repos/cisd/datastore_server/trunk/source/core-plugins/file-system-plugin-example/1/dss/file-system-plugins/ for examples on how to configure and implement a more complex plugin.


(Since version 20.10.4) When developing a Jython resolver script it is useful to set the property ftp.resolver-dev-mode in DSS service.properties to true. In this case the script is always reloaded and compiled. The default value is false in order to speed up normal operation where the script is loaded and compiled only once.


Obsolete file-system view

In earlier versions of openBIS the structure and options available were different. To see how to enable and configure the old-style file system view please see this document:

Old-style file system view for FTP

FTP / SFTP Server

The DSS can expose its data store contents via a read-only FTP / SFTP interface. To automatically start the internal FTP server set the configuration property ftp.server.enable to true and set one of the two properties ftp.server.ftp-port / ftp.server.sftp-port (or both) to a positive value. Please refer to the following table for a more detailed description of the possible configuration properties:

Property

Default value

Description

ftp.server.enable

false

When set to 'true' internal FTP and/or SFTP server will be started as part of the DSS bootstrap. If defined all 'keystore.*' properties become mandatory.

ftp.server.sftp-port


Port that the SFTP server will bind to. Needs to be set to a value other than 0 to switch on the SFTP server. SFTP is a protocol which enables secure file transfer through an SSH tunnel. This protocol is more firewall-friendly than FTP over SSL/TLS. A good value for the port is 2222.

ftp.server.ftp-port


Port that the FTP server will bind to. Needs to be set to a value other than 0 to switch on the FTP server. A good value for the port is 2121.

ftp.server.port


Deprecated name for ftp.server.ftp-port, kept for backward-compatibility. Will be ignored if ftp.server.ftp-port is specified. 

ftp.server.use-ssl

true

Enables explicit FTP over SSL/TLS. Users can switch on connection encryption by issuing the FTP "AUTH" command. Similarly to the global 'use-ssl' parameter, when set to 'true' all 'keystore.*' properties become mandatory. (Ignored by SFTP server.)

ftp.server.implicit-ssl

false

Enables implicit FTP over SSL/TLS. While this method ensures no passwords are sent around unencrypted, it is not standardized by an RFC and is therefore poorly supported by clients. (Ignored by SFTP server.)

ftp.server.passivemode.port.range

2130-2140

Specifies a range of ports to be used when the FTP server is working in passive mode. The default value is "2130-2140", meaning only 11 ports will be provided to connecting clients. If there are more than 11 concurrent FTP users the configuration will have to be adjusted accordingly. (Ignored by SFTP server.)

ftp.server.activemode.enable

false

When set to true enables the FTP server to run in "active mode". This means data connections will always go to a single pre-configured port on the server. Such a configuration can ease the server firewall administration (only one additional port has to be opened), but it requires the client machines to be directly visible from the server. (Ignored by SFTP server.)

ftp.server.activemode.port

2122

FTP server active mode port number configuration. (Ignored by SFTP server.)

ftp.server.maxThreads

25

Maximum number of threads used by the internal FTP server. (Ignored by SFTP server.)
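
A sketch enabling both protocols with the port values suggested above; adapt the keystore-related properties mentioned in the table as needed:

service.properties
ftp.server.enable = true
ftp.server.sftp-port = 2222
ftp.server.ftp-port = 2121
ftp.server.use-ssl = true
ftp.server.passivemode.port.range = 2130-2140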

Example FTPS client configuration

If FTP over SSL/TLS is enabled, clients connecting to the FTP server must adjust their connection settings accordingly.

LFTP

In case you work outside the ETH network replace bs-openbis04 with openbis-dsu.ethz.ch

Here are the steps to set up your connection to openBIS via LFTP:

1) Create a config file under the folder .ssh/:

LFTP example
cat .ssh/config
 Host openbis-dsu.ethz.ch
  KexAlgorithms +diffie-hellman-group1-sha1
  HostKeyAlgorithms +ssh-rsa,ssh-dss  

2) Add the following host in the file .ssh/known_hosts:

LFTP example
[bs-openbis04.ethz.ch]:2222 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbAcN/0/rmcez4QTwaDPf5bMVog/LqkuyjcEqlI3RKMl+SkHIyhY/9CQLCYWq2+eITNGASseYVC2ZXJwDlRgvkYmtL/zVHUBxee8/s+DGJnqTR6TEQ2vvGFcfc6eeEktj6JkRyFS0oCfmDUWREoJP/8NzcEEWYk7MdV5xxmoBX9C6xTTP9VfWm+vUoNmozfOIywYX611lrCKMeUpdpE4DUpBX1pY4YMIvQ87SXyxhVEAJ+e8928ZUtkB0DGQ4xMHuwmjYLvO3G0Rqi5Vz0492ICYyMFOvKM4IGxf+hV4fqCRfgIBb/krXSO8WNBpzOzNlPPheM8Tdlw7irkNttp2a3
 

Then connect to openBIS via LFTP: 

LFTP example
> lftp sftp://<username>@bs-openbis04.ethz.ch:2222
Password:
lftp <username>@bs-openbis04.ethz.ch:~> ls
dr--r--r--   1 openBIS  openBIS         0 Jul 19 09:54 BSSE_FLOWCELLS
lftp <username>@bs-openbis04.ethz.ch:~> cd BSSE_FLOWCELLS/FLOWCELLS/2016.07
lftp <username>@bs-openbis04.ethz.ch:/BSSE_FLOWCELLS/FLOWCELLS/2016.07> ls
dr--r--r--   1 openBIS  openBIS         0 Jul 19 09:59 20150113231613997-60446751_FASTQ_GZ
dr--r--r--   1 openBIS  openBIS         0 Jul 19 09:59 20150113231614260-60446752_FASTQ_GZ
dr--r--r--   1 openBIS  openBIS         0 Jul 19 09:59 20150113231614261-60446753_FASTQ_GZ
[..]

# Now we recursively download the data from this level
lftp <username>@bs-openbis04.ethz.ch:/BSSE_FLOWCELLS/FLOWCELLS/2016.07> mirror
Total: 23 directories, 28 files, 0 symlinks
New: 28 files, 0 symlinks
334768712 bytes transferred in 23 seconds (13.94M/s)
To be removed: 64 directories, 45 files, 16 symlinks
lftp <username>@bs-openbis04.ethz.ch:/BSSE_FLOWCELLS/FLOWCELLS/2016.07>

# just pick some files using regex
lftp <username>@bs-openbis04.ethz.ch:/BSSE_FLOWCELLS/FLOWCELLS/2016.07> mirror --include-glob="*6043*"
 


We have observed cases where the files are transferred correctly and then 'vanish' on the file system. This is due to file permissions. Please try to use the -p flag of lftp. If the transfer has already finished and it looks like the files are not there, try to chmod -R 755 the downloaded folder.

FileZilla

FileZilla users should create a new site in the Site Manager:


It might also be necessary to increase the time-out from the default value of 20 seconds to a larger value (e.g. 60 seconds). This is done in the global preferences:

Example SFTP client configuration

If the SFTP port is configured, an SFTP client can be used.

SFTP
>sftp -oPort=2222 <username>@<openbis server address>

Example:
>sftp -oPort=2222 john.doe@gmail.com@openbis-dsu.ethz.ch

Mac OS Finder as well as most Web browsers do not support the SFTP protocol.


Existing Dataset Handlers

SampleAndDataSetRegistrationHandler

The SampleAndDataSetRegistrationHandler can be used to register/update samples and register data sets. It takes control files in a specific format and registers or updates the samples and registers the data sets specified in the file.

DSS Configuration

  • The plugin can be configured to either accept registration and update requests (ACCEPT_ALL), accept only registration requests (IGNORE_EXISTING), or only update requests (REJECT_NONEXISTING)
    • Default : ACCEPT_ALL
    • Properties Key : SAMPLE_REGISTRATION_MODE
  • The plugin uses a regular expression to determine which files should be treated as control files. By default it uses any file with the ".tsv" extension, but this can be configured.
    • Default :

      .*\.[Tt][Ss][Vv]
    • Properties Key : control-file-regex-pattern
  • By default, the uploaded folders are deleted, even after a failure. This can be configured.
    • Default : true
    • Properties Key : always-cleanup-after-processing
  • The plugin treats a folder that is uploaded, but not specified in the control file, as a failure. This behavior can be configured.
    • Default : true
    • Properties Key : unmentioned-subfolder-is-failure
  • The plugin can be configured with a default sample type for registration. This can be overridden in the control file. If no default is specified, it must be specified in the control file.
    • Default : require the sample type to be specified in the control file
    • Properties Key : sample-type
    • Control File Key : SAMPLE_TYPE
  • The plugin can be configured with a default data set type for registration. This can be overridden in the control file. If no default is specified, it must be specified in the control file.
    • Default : require the data set type to be specified in the control file
    • Properties Key : data-set-type
    • Control File Key : DATA_SET_TYPE
  • There are certain situations in which the plugin will need to inform an administrator of a problem encountered during sample / data set registration by sending an email. By default, this email is set to all instance admins of the openBIS instance. It can, however, be configured to email a specific list of people by providing openBIS user IDs or email addresses.
    • Default : send email to all instance admins
    • Properties Key : error-mail-recipients
service.properties
inputs= [...] sample-dset-reg-thread

sample-dset-reg-thread.incoming-dir = ${root-dir}/incoming-sample-dset-reg
sample-dset-reg-thread.incoming-data-completeness-condition = auto-detection
sample-dset-reg-thread.delete-unidentified = true
# The data set handler
sample-dset-reg-thread.dataset-handler = ch.systemsx.cisd.etlserver.entityregistration.SampleAndDataSetRegistrationHandler

# Controls whether samples may be registered and updated (ACCEPT_ALL), registered only (IGNORE_EXISTING), or updated only (REJECT_NONEXISTING). Default is ACCEPT_ALL
sample-dset-reg-thread.dataset-handler.sample-registration-mode = IGNORE_EXISTING

# Controls which sample type is processed by default. Omit this setting to force it to be specified in the control file
sample-dset-reg-thread.dataset-handler.sample-type = MY_SAMPLE_TYPE

# Controls which data set type is processed by default. Omit this setting to force it to be specified in the control file
sample-dset-reg-thread.dataset-handler.data-set-type = HCS_IMAGE

# Controls which files are treated as control files. The default is the extension .tsv. Set to .txt instead
sample-dset-reg-thread.dataset-handler.control-file-regex-pattern = .*\.[Tt][Xx][Tt]

# Data Set Type Information
sample-dset-reg-thread.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
# The file-format-type may be overridden in the control file
sample-dset-reg-thread.type-extractor.file-format-type = TIFF
sample-dset-reg-thread.type-extractor.locator-type = RELATIVE_LOCATION

sample-dset-reg-thread.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
sample-dset-reg-thread.storage-processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor

Control File Format

The control file is a tab-separated-value (TSV) file. Any file in the drop box with the extension ".tsv" is treated as a control file.

The first non-comment line of the TSV file should contain the column headers; the lines following the headers contain the metadata for samples and data sets, with each line containing the metadata for one sample/data set combination.

The headers for columns that contain sample metadata begin with "S_"; the headers for columns that contain data set metadata begin with "D_". All headers should begin with either S_ or D_, except for a column with the header "FOLDER". This specifies the (relative) path to the data to register as a data set.

In addition to the column headers, some information necessary for the registration process is contained in special comment lines that start with "#!" (normal comments start simply with "#").

Required Comments
  • The key for the user on whose behalf the samples/data sets are being registered
    • Control File Key : USERID
Optional Comments (Overrides)

The sample type and the data set type may be provided in the control file. Values given here override those in the DSS configuration. If there are no values in the DSS configuration, then they must be provided here.

  • The sample type for the samples
    • Control File Key : SAMPLE_TYPE
  • The data set type for the data sets
    • Control File Key : DATA_SET_TYPE
Headers
Commonly Used Headers

The following column headers can be used for any type of sample and dataset:

  • S_identifier
  • S_container
  • S_parent
  • S_experiment
  • D_code
  • D_file_type

The headers to specify are dependent on the sample type and data set type.

Optionally, the file type for each data set can be specified by providing the following header:

  1. D_file_type
Required Headers

The following headers are required:

  1. S_identifier
  2. FOLDER

When creating new samples, the following headers are required:

  1. S_experiment

Example

	# Control Parameters
	#! GLOBAL_PROPERTIES_START
	#!   SAMPLE_TYPE = DILUTION_PLATE
	#!   DATA_SET_TYPE = HCS_IMAGE
	#!   USERID = test
	#! GLOBAL_PROPERTIES_END
	# Data
	S_identifier	S_experiment	S_OFFSET	D_COMMENT	D_GENDER	FOLDER
	/CISD/SD-TEST-1	/CISD/DEFAULT/EXP-REUSE	1	Comment 1	MALE	ds1/
	/CISD/SD-TEST-2	/CISD/DEFAULT/EXP-REUSE	2	Comment 3	MALE	ds2/
	/CISD/SD-TEST-3	/CISD/DEFAULT/EXP-REUSE	3	Comment 2	FEMALE	ds3/

To register this file, it should have a *.tsv extension (e.g. control.tsv) and should be put into the drop box along with the folders ds1, ds2, ds3.

Error Conditions

  • If any of the directories for data sets are empty, the registration of that sample/data set combination will fail.

SampleRegisteringDropbox

The SampleRegisteringDropbox can be used to register new samples without registering any data sets. Instead of data set directories, it expects control files in a specific format and registers the samples specified in this control file. The control file is the same format as used for sample batch registration in the Web GUI, augmented with a global header. The dropbox has a mode called "auto id generation". In this mode, it ignores the identifiers given for the individual samples and creates new ones.

DSS Configuration

  • The plugin has to be given a directory for logging errors:
    • Default: none (required property)
    • Properties Key: error-log-dir
  • The plugin can be given a prefix for auto-generating samples:
    • Default: S
    • Properties Key: sample-code-prefix
service.properties
inputs= [...] sample-reg-thread

sample-reg-thread.incoming-dir = ${root-dir}/incoming-sample-reg
sample-reg-thread.incoming-data-completeness-condition = auto-detection
# The data set handler
sample-reg-thread.dataset-handler = ch.systemsx.cisd.etlserver.SampleRegisteringDropbox
sample-reg-thread.dataset-handler.error-log-dir = ${root-dir}/sample-registration-errors
sample-reg-thread.dataset-handler.sample-code-prefix = ABC

Control File Format

The control file is a tab-separated-value (TSV) file with a header, and, optionally, some comments.

The first non-header and non-comment line of the TSV file should contain the column headers; the lines following the headers contain the metadata for the samples, with each line containing the metadata for one sample in the same format as used for the batch sample registration process in the Web GUI.

Required Control File Keys
  • The control file has to specify the sample type of the samples to register. It is an error if this control file key is missing.
    • Control File Key : SAMPLE_TYPE
Optional Control File Keys
  • The control file can specify the space to register the new samples in. Doing so switches on auto id generation.
    • Default : require the full sample identifiers, including the space, to be specified in the control file
    • Control File Key : DEFAULT_SPACE
  • The key for the user on whose behalf the samples/data sets are being registered
    • Default: The system user
    • Control File Key : USER

Example

	# Control Parameters
	#! GLOBAL_PROPERTIES_START
	#!   SAMPLE_TYPE = GENERAL
	#!   USER = testuser
	#! GLOBAL_PROPERTIES_END
	# Data
	identifier	experiment	COMMENT
	/CISD/SD-TEST-1	/CISD/DEFAULT/EXP-REUSE
	/CISD/SD-TEST-2	/CISD/DEFAULT/EXP-REUSE
	/CISD/SD-TEST-3	/CISD/DEFAULT/EXP-REUSE

Logging Errors

If the samples cannot be registered, e.g. because the control file has an invalid format or because one of the samples already exists in the database, a file <filename>_error-log.txt is created in the specified error-log-dir. The control file is not deleted, and an error marker <filename>_delete_me_after_correcting_errors is created in the dropbox directory. If the errors in the control file are fixed, the error marker can be deleted and the control file will be automatically reprocessed.

DSS RPC

DSS RPC makes it possible for clients to view and download data sets without going through the GUI and upload data sets without using drop boxes. One aspect of this service that can be configured is how uploaded data sets are processed.

RPC Dropboxes

To have a data set of a particular data set type handed over to a specific dropbox, specify put parameters with a data set type code at the end of the key and the dropbox key as the value, e.g.:

 dss-rpc.put.HCS_IMAGE = main-thread

The RPC Default Dropbox

All data set types which are not mapped are handed over to the default RPC dropbox. This default RPC dropbox can be specified by setting the dss-rpc.put-default property to the dropbox code you want to be the default. If you don't specify a default RPC dropbox, the first dropbox will become the default RPC dropbox.

If you have more than one Data Store Server registered in an openBIS Application Server, only one DSS can be used to handle RPC Dropboxes. In this case, the property dss-rpc.put.dss-code needs to be set to the appropriate DSS code in the service.properties of the openBIS Application Server. Not doing so leads to ConfigurationFailureException when trying to access an RPC Dropbox.
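
A sketch combining the mapping, the default dropbox and the AS-side DSS code setting; the dropbox names are illustrative:

service.properties
# DSS service.properties: map data set types to dropboxes
dss-rpc.put.HCS_IMAGE = image-dropbox
dss-rpc.put-default = main-thread

# openBIS AS service.properties (only needed with more than one DSS):
# dss-rpc.put.dss-code = <code of the DSS handling RPC dropboxes>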

Dataset Uploader and RPC Dropboxes

The Dataset Uploader uploads its data to one of the RPC Dropboxes, using the decision mechanism described above.

Data Set registration process

DSS Registration Log

The DSS Registration Log is a mechanism that helps with troubleshooting problems that occur during data set registration. For each file or directory that is placed in the incoming folder of a dropbox, a new log file is created with a name comprised of the timestamp when processing begins, the dropbox name, and the file name. This log file, along with its location in the directory structure, tracks the progress of the registration of the data and metadata contained in the file.

Directory structure

The logs are stored in a directory specified by the property 

    dss-registration-log-dir

which defaults to datastore_server/log-registrations.

The main log registration directory contains three subdirectories: in_process, succeeded and failed.
The log file is initially created in the in_process directory and moved to succeeded or failed when registration completes.


The log file


The log file is created in the in_process directory when the data set registration process has started. The file is named with the current timestamp, the thread name and the dataset filename, e.g.:

2012-01-18-13-33-02-837_simple-dropbox_image.png.log

While the data set is being processed, the log file is kept in the in_process directory. When the data set registration has finished, the file is moved to the succeeded or failed directory depending on the result of registration.

Log contents

The log is a concise representation of the processing that happened as a result of something being placed in the dropbox. Each line of the log begins with the date and time at which an event happened. Events that appear in the log include: preparation of data sets, progress of data through the various directories on the file system, and registration of metadata with the openBIS Application Server. In the case of failure, information about the problems that caused the failure is shown in the log as well.

Event

Description

Prepared registration of N data set(s): <data set codes>

The dropbox has defined N data sets to register with openBIS. The codes of the first several data sets are shown.

Data has been moved to the pre-commit directory: <directory>

The data has been processed and prepared for storage. The location of the data is shown.

About to register metadata with AS: registrationId(N)

A call will be made to the Application Server to register the metadata. The call has the id N.

Data has been registered with the openBIS Application Server.

The metadata has been successfully registered.

Storage processors have committed.

All post-processing (e.g., storage of data in a secondary database) has completed.

Data has been moved to the final store.

The data is now in its designated storage location in the store directory of the DSS.

Storage has been confirmed in openBIS Application Server.

The Application Server has been informed that the data is in its designated location.


On successful data registration the log will look like this:

2012-06-22 10:05:56 Prepared registration of 1 data set:
		20120622100556527-13696
2012-06-22 10:05:56 Data has been moved to the pre-commit directory: {directory}
2012-06-22 10:05:56 About to register metadata with AS: registrationId(99)
2012-06-22 10:05:59 Data has been registered with the openBIS Application Server.
2012-06-22 10:05:59 Storage processors have committed.
2012-06-22 10:05:59 Data has been moved to the final store.
2012-06-22 10:05:59 Storage has been confirmed in openBIS Application Server.

In case of failure the log can, for instance, look like this:

2012-06-20 12:14:26 Prepared registration of 1 data set:
        20120620121426786-13398
2012-06-20 12:14:26 Data has been moved to the pre-commit directory: {directory}
2012-06-20 12:14:26 About to register metadata with AS: registrationId(89)
2012-06-20 12:14:26 Error in registrating data in application server
2012-06-20 17:36:43 Starting recovery at checkpoint Precommit
2012-06-20 17:36:44 Responding to error [OPENBIS_REGISTRATION_FAILURE] by performing action LEAVE_UNTOUCHED on {directory}
2012-06-20 17:36:44 File has been left untouched {directory}
2012-06-20 17:36:44 Operations haven't been registered in AS - recovery rollback

DSS Registration Failure Email Notifications

Since 20.10.06 it is also possible to configure in the DSS service.properties a list of email addresses that will be notified when failures occur.

service.properties
# Email addresses of people to get notifications about problems in dataset registrations
mail.addresses.dropbox-errors=admin1@localhost,admin2@localhost

Pre-staging

The registration process can make use of the prestaging directory. When the registration task starts, a hard-link copy of the original input data is created in the prestaging directory, so that throughout the whole registration process the original input data can remain untouched. Then, on successful registration, the original data can be removed depending on the user's configuration.

The property dataset-registration-prestaging-behavior can be used to define whether the registration process should make use of the prestaging directory. It can be set to the following values (a configuration sketch follows the list):

  • default or use_original - don't use the prestaging directory; the original input file is used.
  • delete - use the prestaging copy during the registration process and delete the original input file on successful registration.
  • leave-untouched - use the prestaging copy during the registration process, but leave the original input file untouched on successful registration.
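
A one-line sketch, assuming the property is set per dropbox thread (the thread name main-thread is illustrative; the property may also be supported globally):

service.properties
main-thread.dataset-registration-prestaging-behavior = delete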

Start Server

The server is started as follows:

prompt> cd datastore_server
prompt> ./datastore_server.sh start

If the mail.test.address property is set, a test e-mail will be sent to the specified address after successful start-up of the server.

Stop Server

The server is stopped as follows:

prompt> cd datastore_server
prompt> ./datastore_server.sh stop

Fix "Can't connect to X11 window server" or "Could not initialize class sun.awt.X11GraphicsEnvironment" problem

Some Data Store Server plugins operate on images (e.g. by generating charts or rescaling images before they are displayed or registered).
These operations do not work reliably in headless mode, thus a dummy X11 server like Xvfb needs to be installed and run on the server running the DSS.

Typical symptoms of this problem are messages like

2010-12-10 10:34:07.406::WARN:  Error for /datastore_server/chromatogram
java.lang.NoClassDefFoundError: Could not initialize class sun.awt.X11GraphicsEnvironment
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Unknown Source)
        at java.awt.GraphicsEnvironment.getLocalGraphicsEnvironment(Unknown Source)
        at sun.awt.X11.XToolkit.<clinit>(Unknown Source)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Unknown Source)
        at java.awt.Toolkit$2.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.awt.Toolkit.getDefaultToolkit(Unknown Source)
        at sun.swing.SwingUtilities2$AATextInfo.getAATextInfo(Unknown Source)
        at javax.swing.plaf.metal.MetalLookAndFeel.initComponentDefaults(Unknown Source)
        at javax.swing.plaf.basic.BasicLookAndFeel.getDefaults(Unknown Source)
        at javax.swing.plaf.metal.MetalLookAndFeel.getDefaults(Unknown Source)
        at javax.swing.UIManager.setLookAndFeel(Unknown Source)
        at javax.swing.UIManager.setLookAndFeel(Unknown Source)
        at javax.swing.UIManager.initializeDefaultLAF(Unknown Source)
        at javax.swing.UIManager.initialize(Unknown Source)
        at javax.swing.UIManager.maybeInitialize(Unknown Source)
        at javax.swing.UIManager.getDefaults(Unknown Source)
        at javax.swing.UIManager.getColor(Unknown Source)
        at org.jfree.chart.JFreeChart.<clinit>(JFreeChart.java:261)
        at org.jfree.chart.ChartFactory.createXYLineChart(ChartFactory.java:1748)

in startup_log.txt, or DSS log messages like:

2011-10-11 14:43:44,412 ERROR [hcs_image_raw - Incoming Data Monitor] OPERATION.DataSetStorageAlgorithmRunner - Error during dataset registertion: InternalError: Can't connect to X11 window se
rver using ':1.0' as the value of the DISPLAY variable.
java.lang.InternalError: Can't connect to X11 window server using ':1.0' as the value of the DISPLAY variable.
        at sun.awt.X11GraphicsEnvironment.initDisplay(Native Method)
        at sun.awt.X11GraphicsEnvironment.access$200(X11GraphicsEnvironment.java:62)
        at sun.awt.X11GraphicsEnvironment$1.run(X11GraphicsEnvironment.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.awt.X11GraphicsEnvironment.<clinit>(X11GraphicsEnvironment.java:142)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:186)
        at java.awt.GraphicsEnvironment.getLocalGraphicsEnvironment(GraphicsEnvironment.java:82)
        at java.awt.image.BufferedImage.createGraphics(BufferedImage.java:1152)
        at java.awt.image.BufferedImage.getGraphics(BufferedImage.java:1142)
        at ch.systemsx.cisd.imagereaders.ij.ImageJReaderLibrary.createBufferedImageOfSameType(ImageJReaderLibrary.java:75)
        at ch.systemsx.cisd.imagereaders.ij.ImageJReaderLibrary.access$0(ImageJReaderLibrary.java:70)
        at ch.systemsx.cisd.imagereaders.ij.ImageJReaderLibrary$1.readImage(ImageJReaderLibrary.java:66)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil$TiffImageLoader.loadWithImageJ(ImageUtil.java:119)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil$TiffImageLoader.load(ImageUtil.java:107)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil.loadImageGuessingLibrary(ImageUtil.java:425)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil.loadImageGuessingLibrary(ImageUtil.java:398)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil.loadUnchangedImage(ImageUtil.java:234)
        ...

To fix it, perform these steps (assuming Red Hat Linux):

  1. Install package xorg-x11-server-Xvfb:

    #yum install xorg-x11-server-Xvfb
    in RHEL 6.3 the package is called: xorg-x11-server-Xvfb-1.7.7-29.el6.x86_64
  2. Add the line:

    export DISPLAY=:1.0

    to file /etc/sysconfig/openbis.

  3. Copy the attached file Xvfb to /etc/init.d/Xvfb.
  4. Enable service Xvfb:

    # chkconfig Xvfb on
  5. Start service Xvfb:

    # /etc/init.d/Xvfb start
  6. Restart openbis service:

    # /etc/init.d/openbis restart

If there are still problems add a JVM option:

-Djava.awt.headless=true

to the jetty server configuration (for DSS modify sprint/datastore_server/etc/datastore_server.conf).

Runtime changes to logging

The script <installation directory>/servers/datastore_server/datastore_server.sh can be used to change the logging behavior of the datastore server while the server is running.

The script is used like this: datastore_server.sh [command] [argument]

The table below describes the possible commands and their arguments.

Command

Argument(s)

Default value

Description

log-service-calls

'on', 'off'

'off'

Turns on / off detailed service call logging.

When this feature is enabled, the datastore server will log the start and end of every service call it executes to the file <installation directory>/servers/datastore_server/log/datastore_server_service_calls.txt

log-long-running-invocations

'on', 'off'

'on'

Turns on / off logging of long running invocations.

When this feature is enabled, datastore server will periodically create a report of all service calls that have been in execution more than 15 seconds to file <installation directory>/servers/datastore_server/log/datastore_server_long_running_threads.txt  

debug-db-connections

'on', 'off'

'off'

Turns on / off logging about database connection pool activity.

When this feature is enabled, information about every borrow and return to database connection pool is logged to datastore server main log in file <installation directory>/servers/datastore_server/log/datastore_server_log.txt

log-db-connections

no argument / minimum connection age (in milliseconds)

5000

When this command is executed without an argument, information about every database connection that has been borrowed from the connection pool is written into datastore server main log in file <installation directory>/servers/datastore_server/log/datastore_server_log.txt

If the "minimum connection age" argument is specified, only connections that have been out of the pool longer than the specified time are logged. The minimum connection age value is given in milliseconds.

record-stacktrace-db-connections

'on', 'off'

'off'

Turns on / off logging of stacktraces.

When this feature is enabled AND debug-db-connections is enabled, the full stack trace of the borrowing thread will be recorded with the connection pool activity logs.

log-db-connections-separate-log-file

'on', 'off'

'off'

Turns on / off database connection pool logging to separate file.

When this feature is disabled, the database connection pool activity logging is done only to datastore server main log. When this feature is enabled, the activity logging is done ALSO to file <installation directory>/servers/datastore_server/log/datastore_server_db_connections.txt .
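
For example, the commands from the table can be invoked like this (run from the directory containing the script):

prompt> ./datastore_server.sh log-service-calls on
prompt> ./datastore_server.sh log-db-connections 10000
prompt> ./datastore_server.sh debug-db-connections off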


