For the System Requirements, see the Installation and Administrator Guide of the openBIS Server. The openBIS Application Server needs to be running when installing the openBIS Data Store Server (DSS).
Installation
For installing DSS the main distribution (naming schema: datastore_server-<version>-r<revision>.zip ) and optional plugin distributions (naming schema: datastore_server_plugin-<plugin name>-<version>-r<revision>.zip ) are needed. The main distribution contains:
- datastore_server/datastore_server.sh : Bash script for starting/stopping the server.
- datastore_server/lib : Folder with the needed libraries.
- datastore_server/etc : Folder with configuration files and key stores.
- datastore_server/data : Example data folder configured to run the demo.
- datastore_server/log : Empty folder which will contain log files.
The plugin distributions contain files relative to the DSS main folder <some folder>/datastore_server .
Installation steps
- Create a service user account, i.e. an unprivileged, regular user account. You can use the same user account for running the Application Server and the Data Store Server. Do not run openBIS DSS as root!
- Unzip the main distribution on the server machine at its final location.
- If plugins are required, go into the folder datastore_server and unzip them there.
- Adapt datastore_server/etc/service.properties .
- Create a role in openBIS for the etlserver user: Administration -> Authorization -> Roles -> Assign Role. Choose 'INSTANCE_ETL_SERVER' as the role and select 'etlserver' as the person. If you do not do this and start up the Data Store Server you will get an error in log/datastore_server_log.txt saying Authorization failure: No role assignments could be found for user 'etlserver'.
Start up the server as follows:
prompt> ./datastore_server.sh start
Have a look into the log files: they should be free of exception stack traces.
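For example, a quick check for problems in the main log (the file name is the standard one from the main distribution; adjust the path if your installation differs):
prompt> grep -i exception log/datastore_server_log.txt
prompt> tail -n 50 log/datastore_server_log.txt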
Configuration file
The configuration file datastore_server/etc/service.properties is an Extended Properties File. It can have a lot of properties, most of which define plugins that can be extracted as plugin configurations into core plugins. For more see Core Plugins. Nevertheless, here is a typical example which still contains everything:
# Unique code of this Data Store Server. Not more than 40 characters.
data-store-server-code = DSS1

# The root directory of the data store
storeroot-dir = data/store

# The directory where the command queue files are located; defaults to storeroot-dir
commandqueue-dir =

# Comma-separated list of definitions of additional queues for processing plugins.
# Each entry is of the form <queue name>:<regular expression>
# A corresponding persistent queue is created. All processing plugins with a key matching the corresponding
# regular expression are associated with the corresponding queue.
#
# The key of a processing plugin is its core-plugin name which is the name of the folder containing
# 'plugin.properties'.
#
# In case archiving is enabled the following three processing plugins are defined:
# 'Archiving', 'Copying data sets to archive', and 'Unarchiving'
#data-set-command-queue-mapping = archiving:Archiving|Copying data sets to archive

# Cache for data set files from other Data Store Servers
# cache-workspace-folder = ../../data/dss-cache
# Maximum cache size in MB
# cache-workspace-max-size = 1024
# cache-workspace-min-keeping-time =

# Port
port = 8444

# Session timeout in minutes
session-timeout = 720

# Path to the keystore
keystore.path = etc/openBIS.keystore
# Password of the keystore
keystore.password = changeit
# Key password of the keystore
keystore.key-password = changeit

# The check interval (in seconds)
check-interval = 60

# The time-out for clean up work in the shutdown sequence (in seconds).
# Note that the maximal time for the shutdown sequence to complete can be as large
# as twice this time.
# Remark: On a network file system, it is not recommended to turn this value to something
# lower than 180.
shutdown-timeout = 180

# ===============================
# Data Set Registration Halt:
#
# In order to prevent the data store from having no free disk space a limit (so called highwater mark) can be
# specified. If the free disk space of the associated share goes below this specified value,
# DSS stops registering data sets. Also a notification log and an email will be produced.
# When the free disk space is again above the limit registration will be continued.
# The value must be specified in kilobytes (1048576 = 1024 * 1024 = 1GB). If no high water mark is
# specified or if the value is negative, the system will not be watching. There are 2 different kinds
# of highwater mark supported: 'highwater-mark', which checks the space on the store, and
# 'recovery-highwater-mark', which checks the amount of free space for the recovery state (on the local filesystem).
#
# Core plugins of type drop box and ingestion services (special type of reporting-plugins) can override the
# highwater mark value individually by specifying the property 'incoming-share-minimum-free-space-in-gb'
# in their plugin.properties.
highwater-mark = -1
recovery-highwater-mark = -1

# If a data set is successfully registered an email is sent to the registrator.
# If this property is not specified, no email is sent to the registrator. This property
# does not affect the mails which are sent when the data set could not be registered.
notify-successful-registration = false

# The URL of the openBIS server
server-url = https://localhost:8443/openbis/openbis

# The username to use when contacting the openBIS server
username = etlserver

# The password to use when contacting the openBIS server
password = etlserver

# The base URL for Web client access.
download-url = https://localhost:8889

# SMTP properties (must start with 'mail' to be considered).
# mail.smtp.host = localhost
# mail.from = datastore_server@localhost
# If this property is set a test e-mail will be sent to the specified address after DSS has successfully started up.
# mail.test.address = test@localhost

# ---------------- Timing parameters for file system operations on remote shares.

# Time (in seconds) to wait for any file system operation to finish. Operations exceeding this
# timeout will be terminated.
timeout = 60
# Number of times that a timed out operation will be tried again (0 means: every file system
# operation will only ever be performed once).
max-retries = 11
# Time (in seconds) to wait after an operation has been timed out before re-trying.
failure-interval = 10

# The period of no write access that needs to pass before an incoming data item is considered
# complete and ready to be processed (in seconds) [default: 300].
# Valid only when the auto-detection method is used to determine if incoming data are ready to be processed.
# quiet-period = <value in seconds>

# Globally used separator character which separates entities in a data set file name
data-set-file-name-entity-separator = _

# ---------------------------------------------------------------------------
# maintenance plugins configuration
# ---------------------------------------------------------------------------
# Comma separated names of maintenance plugins.
# Each plugin should have configuration properties prefixed with its name.
# Mandatory properties for each <plugin> include:
#   <plugin>.class - Fully qualified plugin class name
#   <plugin>.interval - The time between plugin executions (in seconds)
# Optional properties for each <plugin> include:
#   <plugin>.start - Time of the first execution (HH:mm)
#   <plugin>.execute-only-once - If true the task will be executed exactly once,
#                                interval will be ignored. By default set to false.
maintenance-plugins = demo-maintenance, auto-archiver, archive-cleanup, hierarchical-storage-updater

demo.class = ch.systemsx.cisd.etlserver.plugins.DemoMaintenancePlugin
demo.interval = 60
demo.start = 23:00

# ----- Automatic archiver configuration ------------------------------------
# Class of a task that performs automatic archiving of 'AVAILABLE' data sets based on their properties.
auto-archiver.class = ch.systemsx.cisd.etlserver.plugins.AutoArchiverTask
auto-archiver.interval = 10
auto-archiver.start = 23:00
# following properties are optional
# only data sets of the specified type will be archived
auto-archiver.data-set-type = UNKNOWN
# only data sets that are older than the specified number of days will be archived (default = 30)
auto-archiver.older-than = 90
# Indicates whether data sets will be removed from the data store upon archiving
# NOTE: You can configure two different auto-archiver tasks - one with 'remove-datasets-from-store'
# set to 'false' to enable eager archiving and another one with the flag set to 'true' that will
# free space on the datastore server
auto-archiver.remove-datasets-from-store = false
# fully qualified class name of a policy that additionally filters data sets to be archived
auto-archiver.policy.class = ch.systemsx.cisd.etlserver.plugins.DummyAutoArchiverPolicy
# use this policy to archive datasets in batches grouped by experiment and dataset type
# auto-archiver.policy.class = ch.systemsx.cisd.etlserver.plugins.ByExpermientPolicy
# use this policy to archive datasets in batches grouped by space
# auto-archiver.policy.class = ch.systemsx.cisd.etlserver.plugins.BySpacePolicy
# Default archival candidate discoverer, using the "older-than" criterion
auto-archiver.archive-candidate-discoverer.class = ch.systemsx.cisd.etlserver.plugins.AgeArchiveCandidateDiscoverer
# use this archival candidate discoverer to auto-archive by tags. Please note that "older-than" has no effect with this one
# auto-archiver.archive-candidate-discoverer.class = ch.systemsx.cisd.etlserver.plugins.TagArchiveCandidateDiscoverer
# auto-archiver.archive-candidate-discoverer.tags = /admin/boo, /admin/foo

# ----- Alternative automatic archiver configuration ------------------------------------
# Performs automatic archiving of 'ACTIVE' data sets grouped by experiments based on the experiment's age
# (which is defined as the last modification date of the youngest data set of the experiment).
# It iterates over all experiments, ordered by experiment age, and archives all non-locked, non-excluded
# data sets of an experiment. The estimated size of the archived data sets has to be configured via the
# 'estimated-data-set-size-in-KB.*' properties. The iteration over the experiments stops when the estimated free
# disk space on the monitored directory is larger than 'minimum-free-space-in-MB'.
auto-archiver-exp.class = ch.systemsx.cisd.etlserver.plugins.ExperimentBasedArchivingTask
# The time between subsequent archiving runs (in seconds)
auto-archiver-exp.interval = 86400
# Time of the first execution (HH:mm)
auto-archiver-exp.start = 23:15
# A directory to monitor for free disk space
#auto-archiver-exp.monitored-dir = /some/directory
# The minimum free space on the monitored share to ensure by archiving data sets (optional, default is 1024)
auto-archiver-exp.minimum-free-space-in-MB = 1024
# A comma-separated list of data set type keys to exclude from archiving (optional)
auto-archiver-exp.excluded-data-set-types = TYPE_KEY1, TYPE_KEY2
# Estimated data set size (in KB) for data sets of type TYPE_KEY3
auto-archiver-exp.estimated-data-set-size-in-KB.TYPE_KEY3 = 300
# Default data set size estimation (in KB)
auto-archiver-exp.estimated-data-set-size-in-KB.DEFAULT = 1200
# A free space provider class to be used
auto-archiver-exp.free-space-provider.class = ch.systemsx.cisd.openbis.dss.generic.shared.utils.PostgresPlusFileSystemFreeSpaceProvider
# PostgreSQL data source to be checked for free space
auto-archiver-exp.free-space-provider.monitored-data-source = data-source
# Whether a VACUUM command should be executed before the database is asked for the available free space
auto-archiver-exp.free-space-provider.execute-vacuum = true

# A task which cleans up deleted data sets from the archive
archive-cleanup.class = ch.systemsx.cisd.etlserver.plugins.DeleteFromArchiveMaintenanceTask
archive-cleanup.status-filename = ${root-dir}/deletion-event-lastseenid.txt
# keep the archive copies of the data set for one week after their deletion in openBIS
# the delay is specified in minutes
archive-cleanup.delay-after-user-deletion = 604800
# start up time
archive-cleanup.start = 02:00
# run every day (in seconds)
archive-cleanup.interval = 86400

# the plugin which is run periodically to create a mirror structure of the store with the same files
# but with a user-readable structure of directories
hierarchy-builder.class = ch.systemsx.cisd.etlserver.plugins.HierarchicalStorageUpdater
# The time between rebuilding the hierarchical store structure (in seconds)
hierarchy-builder.interval = 86400
# The root directory of the hierarchical data store
hierarchy-builder.hierarchy-root-dir = data/hierarchical-store
# The naming strategy for the symbolic links
hierarchy-builder.link-naming-strategy.class = ch.systemsx.cisd.etlserver.plugins.TemplateBasedLinkNamingStrategy
# The exact form of link names produced by TemplateBasedLinkNamingStrategy is configurable
# via the following template. The variables
#   dataSet, dataSetType, experiment, instance, project, sample, space
# will be recognized and replaced in the final link name.
hierarchy-builder.link-naming-strategy.template = ${space}/${project}/${experiment}/${dataSetType}+${sample}+${dataSet}
# When specified for a given <dataset-type> this store subpath will be used as the symbolic link source
hierarchical-storage-updater.link-source-subpath.<dataset-type> = original
# Setting this property to "true" for a given <dataset-type> will treat the first child item (file or folder)
# in the specified location as the symbolic link source. It can be used in conjunction with
# the "link-source-subpath.<dataset-type>" to produce links pointing to a folder with an unknown name, e.g.
# <data-set-location>/original/UNKNOWN-NAME-20100307-1350
hierarchical-storage-updater.link-from-first-child.<dataset-type> = true

# ---------------------------------------------------------------------------
# (optional) archiver configuration
# ---------------------------------------------------------------------------
# Configuration of an archiver task. All properties are prefixed with 'archiver.'.
# Archiver class specification (together with the list of packages this class belongs to).
archiver.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.demo.DemoArchiver
# indicates if data should be synchronized when the local copy is different from the one in the archive
archiver.synchronize-archive = false

# ---------------------------------------------------------------------------
# reporting and processing plugins configuration
# ---------------------------------------------------------------------------
# Comma separated names of reporting plugins. Each plugin should have configuration properties prefixed with its name.
# If the name has a 'default-' prefix it will be used by default in the data set Data View.
reporting-plugins = demo-reporter

# Label of the plugin which will be shown to the users.
demo-reporter.label = Show Dataset Size
# Comma separated list of dataset type codes which can be handled by this plugin.
demo-reporter.dataset-types = UNKNOWN
# Plugin class specification (together with the list of packages this class belongs to).
demo-reporter.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.demo.DemoReportingPlugin
# The property file. Its content will be passed as a parameter to the plugin.
demo-reporter.properties-file =

# Plugin that allows showing the content of the Main Data Set as an openBIS table
# tsv-viewer.label = TSV View
# tsv-viewer.dataset-types = TSV
# tsv-viewer.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
# tsv-viewer.properties-file =

# ---------------------------------------------------------------------------
# Data Set Validator Definitions
# ---------------------------------------------------------------------------
# Data set validators used to accept or reject data sets to be registered.
# Comma separated list of validator definitions.
data-set-validators = validator

# Definition of data set validator 'validator'
validator.data-set-type = HCS_IMAGE
validator.path-patterns = **/*.txt
validator.columns = id, description, size
validator.id.header-pattern = ID
validator.id.mandatory = true
validator.id.order = 1
validator.id.value-validator = ch.systemsx.cisd.etlserver.validation.HeaderBasedValueValidatorFactory
validator.id.header-types = compound, gene-locus
validator.id.compound.header-pattern = CompoundID
validator.id.compound.value-type = unique
validator.id.compound.value-pattern = CHEBI:[0-9]+
validator.id.gene-locus.header-pattern = GeneLocus
validator.id.gene-locus.value-type = unique
validator.id.gene-locus.value-pattern = BSU[0-9]+
validator.description.header-pattern = Description
validator.description.value-type = string
validator.description.value-pattern = .{0,100}
validator.size.header-pattern = A[0-9]+
validator.size.can-define-multiple-columns = true
validator.size.allow-empty-values = true
validator.size.value-type = numeric
validator.size.value-range = [0,Infinity)

# Comma separated names of processing threads. Each thread should have configuration properties prefixed with its name.
# E.g. the 'code-extractor' property for the thread 'my-etl' should be specified as 'my-etl.code-extractor'
inputs = main-thread

# ---------------------------------------------------------------------------
# 'main-thread' thread configuration
# ---------------------------------------------------------------------------
# The directory to watch for incoming data.
main-thread.incoming-dir = data/incoming
# Specifies what should happen if an error occurs during dataset processing.
# By default this flag is set to false and the user has to modify the 'faulty paths file'
# each time the faulty dataset should be processed again.
# Set this flag to true if the processing should be repeated after some time without manual intervention.
# Note that this can increase the server load.
main-thread.reprocess-faulty-datasets = false
# If 'true' then unidentified and invalid data sets will be deleted instead of being moved to the 'unidentified' folder
# Allowed values:
#  - false - (default) move unidentified or invalid data sets to the 'unidentified' folder
#  - true - delete unidentified or invalid data sets
# delete-unidentified = true
# Determines when the incoming data should be considered complete and ready to be processed.
# Allowed values:
#  - auto-detection - when no write access has been detected for a specified 'quiet-period'
#  - marker-file - when an appropriate marker file for the data exists.
# The default value is 'marker-file'.
main-thread.incoming-data-completeness-condition = marker-file
# Path to the script that will be executed before data set registration.
# The script will be called with two parameters: <data-set-code> and <absolute-data-set-path> (in the incoming dropbox).
# NOTE: before starting the DSS server make sure the script is accessible and executable.
# main-thread.pre-registration-script = /example/scripts/my-script.sh
# Path to the script that will be executed after successful data set registration.
# The script will be called with two parameters: <data-set-code> and <absolute-data-set-path> (in the data store).
# NOTE: before starting the DSS server make sure the script is accessible and executable.
# main-thread.post-registration-script = /example/scripts/my-script.sh

# ---------------- Plugin properties
# The extractor class to use for code extraction
main-thread.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
# Separator used to extract the barcode in the data set file name
main-thread.data-set-info-extractor.entity-separator = ${data-set-file-name-entity-separator}
# The space
main-thread.data-set-info-extractor.space-code = TEST
# Location of the file containing data set properties
#main-thread.data-set-info-extractor.data-set-properties-file-name = data-set.properties
# The extractor class to use for type extraction
main-thread.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
main-thread.type-extractor.file-format-type = TIFF
main-thread.type-extractor.locator-type = RELATIVE_LOCATION
main-thread.type-extractor.data-set-type = HCS_IMAGE
main-thread.type-extractor.is-measured = true
# The storage processor (IStorageProcessor implementation)
main-thread.storage-processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor

# ---------------------------------------------------------------------------
# dss-rpc
# ---------------------------------------------------------------------------
# The dss-rpc section configures the RPC put functionality by providing a mapping between data
# set type and input thread parameters.
#
# The default input thread is specified by the dss-rpc.put-default key. If not specified, the first input
# thread will be used.
#
# Mappings are specified by dss-rpc.put.<data-set-type-code> = <thread-name>
#
# If this section is empty, then the first input thread will be used.
#
dss-rpc.put-default = main-thread
#dss-rpc.put.HCS_IMAGE = main-thread

# ---------------------------------------------------------------------------
# Sample dropbox configuration
# ---------------------------------------------------------------------------
# <incoming-dir> will be scanned for tsv files containing samples in standard
# batch import format. Additionally the file should contain (in the comment)
# the definition of the sample type and optionally the default space and registrator.
# If 'DEFAULT_SPACE' is defined, codes of the samples will be
# created automatically, so the 'identifier' column is not expected.
# If 'USER' is defined, this user will become the 'Registrator' of the samples.
#
# --- EXAMPLE FILE ---
#! GLOBAL_PROPERTIES_START
#! SAMPLE_TYPE = <sample_type_code>
#! GLOBAL_PROPERTIES_END
# identifier   parent      container   property1   property2
# /SPACE/S1    /SPACE/P1               value11     value21
# /SPACE/S1    /SPACE/P2               value12     value22
# --- END OF FILE ---
#
# --- EXAMPLE FILE (generate codes automatically, registrator specified) ---
#! GLOBAL_PROPERTIES_START
#! SAMPLE_TYPE = <sample_type_code>
#! USER = <user_id>
#! DEFAULT_SPACE = <space_code>
#! GLOBAL_PROPERTIES_END
# parent      container   property1   property2
# /SPACE/P1               value11     value21
# /SPACE/P2               value12     value22
# --- END OF FILE ---
#
# Directory scanned for files with samples
samples.incoming-dir = ${root-dir}/sample-dropbox
#
# Class responsible for handling files with sample definitions
samples.dataset-handler = ch.systemsx.cisd.etlserver.SampleRegisteringDropbox
#
# The path to the error logs directory
samples.dataset-handler.error-log-dir = ${root-dir}/error-log
#
# Prefix of samples with automatically created codes. Default value: 'S'.
samples.dataset-handler.sample-code-prefix = AS
#
# Settings not relevant to the sample dropbox, but required by DSS
samples.incoming-data-completeness-condition = auto-detection
samples.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
samples.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
samples.storage-processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor
#
# Research Collection - ELN Plugin
#
# rc-exports-api-limit-data-size-megabytes=4000
# rc-exports-api-service-document-url=
# rc-exports-api-user=
# rc-exports-api-password=
#
# Zenodo - ELN Plugin
#
# zenodo-exports-api-limit-data-size-megabytes=4000
# zenodo-exports-api-zenodoUrl=https://zenodo.org/
Because the Data Store Server communicates with the openBIS Application Server, the following properties have to be changed:
- server-url : Here only the host name has to be changed.
- username / password : The Data Store Server is just a user of the openBIS Server. It should have the role SPACE_ETL_SERVER or INSTANCE_ETL_SERVER .
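A minimal sketch of the affected properties (the host name and password are placeholders and have to be adapted):
server-url = https://openbis.example.org:8443/openbis/openbis
username = etlserver
password = <etlserver password>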
Production setup with Proxy
A typical production scenario is when a proxy (typically Apache) is used in front of AS and DSS. This has two main advantages:
- Users will be accessing the same domain and port; browsers are known to be friendlier to this setup.
- The proxy can provide an https connector.
When these requirements are met, it is possible to comment out the download-url parameter in service.properties. As a consequence, the URLs provided by the system will be relative to the current domain and port of the application.
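For example, the relevant service.properties change is simply (a sketch):
# Behind a proxy serving AS and DSS on one domain, leave download-url commented out:
# download-url = https://localhost:8889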
Segmented Store
Below the store root directory (specified by the property storeroot-dir ) the data is organized in segments called shares. Each share is a sub directory whose name is a number; this number is also the share ID. A share can be a symbolic link to some shared directory.
On start up the DSS automatically creates share 1. An existing store is migrated by moving all data into share 1. Administrators can create new shares by creating new sub folders or symbolic links to sub folders readable and writable by the user running DSS.
Incoming directories are automatically associated with shares. These shares are called incoming shares. All data sets are stored in the incoming share that was associated at DSS start up. For each incoming directory DSS tries to move an empty file from the incoming directory to a share; the first share for which this move is successful becomes the associated incoming share. Each incoming directory has to have an associated incoming share. Several incoming directories can have the same share.
It is possible to force the assignment of an incoming share for a dropbox in its plugin.properties using the optional incoming-share-id property. The incoming share specified in this way does not have to be on the same disk as the incoming folder.
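A sketch of a dropbox plugin.properties using this property (the dropbox folder name and the share ID are only examples):
# plugin.properties of a drop box core plugin
incoming-dir = ${root-dir}/incoming-my-data
# force all data sets registered through this drop box into share 3
incoming-share-id = 3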
Segmenting the data store makes it possible to add storage when it is needed. An administrator just has to mount a new hard disk or an NFS folder. After creating a symbolic link in the data store root directory the additional storage is available.
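For example, adding a new share backed by a freshly mounted disk might look like this (paths, share ID and the service user are placeholders):
prompt> cd <installation directory>/servers/datastore_server/data/store
prompt> ln -s /mnt/new-disk/openbis-share-2 2
prompt> chown -R <dss user>:<dss group> /mnt/new-disk/openbis-share-2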
Shuffling Data Sets Manually
The script share-manager.sh (inside <installation directory>/servers/datastore_server ) allows data sets to be moved manually to another share:
./sharemanager.sh move-to <share id> <data set code 1> [<data set code 2> <data set code 3> ...]
The user is asked for user ID and password.
The following command lists all shares and their free space:
./sharemanager.sh list-shares
Shuffling Data Sets Automatically
In order to shuffle data sets from full incoming shares, or from shares with the withdraw flag set, to a freshly added share (which is called an external share), the SegmentedStoreShufflingTask maintenance task is needed. Here is a typical configuration for this maintenance task:
maintenance-plugins = <other maintenance plugins>, store-shuffler

store-shuffler.class = ch.systemsx.cisd.etlserver.plugins.SegmentedStoreShufflingTask
store-shuffler.shuffling.class = ch.systemsx.cisd.etlserver.plugins.SimpleShuffling
# Data will be moved from the incoming share when the amount of free space on it will be lower than this number:
store-shuffler.shuffling.minimum-free-space-in-MB = 1024
store-shuffler.shuffling.share-finder.class = ch.systemsx.cisd.etlserver.plugins.SimpleShufflingShareFinder
# Nothing is moved to the share if the amount of free space is below that mark.
store-shuffler.shuffling.share-finder.minimum-free-space-in-MB = 1024
store-shuffler.interval = 86400
The maintenance task shuffles data sets from incoming shares which have less than 1 GB of free space to the share with the most free space. The task is executed once a day (every 86400 seconds).
Speed and Speed Hint
Usually different shares are located on different (remotely) mounted disk drives. For that reason the data access speed can differ considerably among the shares. Shuffling (and also unarchiving) can be controlled by associating a relative speed value with each share. The speed is a number between 0 and 100. Larger numbers mean faster read access. The speed is an arbitrary number which allows comparing the data access speed of two different shares. For example, if share 1 has speed 30 and share 2 has speed 50, then data access from share 2 is faster than from share 1.
Share properties explains how to configure the speed of a share. If there is no explicit configuration a speed of 50 is assumed.
During data set registration a data set can be provided with a speed hint which can be used during shuffling to find an appropriate share. The speed hint is a number between -100 and +100. A positive value means that shares with that speed or higher are preferred. A negative speed hint means that shares with the same absolute value or less are preferred. For example, a data set with speed hint -30 prefers to be shuffled into a slow share of speed 30 or less.
Currently speed hints can be set only in jython drop-boxes. The default speed hint is -50.
Share properties
Each share can be configured via a share.properties file stored in the share's root folder.
Property Name | Default Value | Meaning |
---|---|---|
speed | 50 | The share's speed. Must be a number between 0 and 100. |
shuffle-priority | SPEED | Taken into account by the algorithm of StandardShareFinder . Valid values are SPEED and MOVE_TO_EXTENSION. |
withdraw-share | false | If set to true the data sets of this share will be moved to other shares by shuffling. |
ignored-for-shuffling | false | If set to true this share will not be taken into account for shuffling from/to this share. |
unarchiving-scratch-share | false | If set to true this share can only be used as a scratch share for unarchiving using MultiDataSetArchiver . For more details see Multi data set archiving. |
experiments | | Comma-separated list of experiment identifiers. This property is only used by ExperimentBasedShareFinder. |
data-set-types | | Comma-separated list of dataset type codes. This property is only used by DataSetTypeBasedShareFinder. |
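A sketch of a share.properties file for an extension share (all values are examples):
speed = 30
shuffle-priority = SPEED
ignored-for-shuffling = false
unarchiving-scratch-share = false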
Share Finder
For shuffling as well as for unarchiving a strategy is needed to find an appropriate share for a data set. It can consider the speed hint of the data set. The finding strategy is a Java class implementing ch.systemsx.cisd.openbis.dss.generic.shared.IShareFinder . It is specified by the properties starting with <prefix>.share-finder . The concrete finder is specified by the fully-qualified class name defined by the property <prefix>.share-finder.class . The following share finders are available:
ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder
First it searches for shares with a speed matching exactly the absolute value of the speed hint. If nothing is found it does the same search for shares with a speed above/below that value, depending on the sign of the speed hint. If nothing is found in this step it does the search a third time, ignoring the speed hint.
The search is the following:
It first tries to find the external share with the most free space which has enough space for the data set. If no such share is found it does the same for incoming shares.
ch.systemsx.cisd.etlserver.plugins.SimpleShufflingShareFinder
As in SimpleShareFinder it first tries to find a share with matching speed, then a share respecting the speed hint, and finally a share ignoring the speed hint.
Each time it searches for the share with the most free space which has at least the amount of free space specified by the configuration parameter minimum-free-space-in-MB (default value 1024).
ch.systemsx.cisd.openbis.dss.generic.shared.SpeedOptimizedShareFinder
First it searches for an extension share with matching speed and the most free space. If nothing is found it searches for an extension share with a speed respecting the speed hint. If this isn't successful it uses ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder but ignoring the speed hint.
ch.systemsx.cisd.openbis.dss.generic.shared.StandardShareFinder
The search algorithm of this share finder considers all shares with enough free space and with the withdraw flag not set as potential "candidates" (the free space of the data set's "home" share is increased by the data set size). The best candidate is elected by the following rules:
- An extension share is preferred over an incoming share.
- A share whose speed matches the speed requirements of the data set is preferred. If more than one share matches in the same way, the one whose speed is closest to the absolute value of the speed hint is chosen.
- If all candidates are equal with respect to (1) and (2), the share with the most free space is chosen.
The priority of rules (1) and (2) can be swapped if the current location of the data set is an incoming share and it has the shuffle priority SPEED.
Generally the StandardShareFinder tends to move data sets from incoming to extension shares. A data set can only be moved from an extension to an incoming share by an unarchiving operation if at the time of unarchiving all extension shares (regardless of their speeds) are full.
The configuration parameter incoming-shares-minimum-free-space-in-MB can be specified for the StandardShareFinder. It allows configuring a threshold of free space that should always be available on the incoming shares.
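A sketch of a shuffling configuration using the StandardShareFinder (the threshold value is only an example):
store-shuffler.shuffling.share-finder.class = ch.systemsx.cisd.openbis.dss.generic.shared.StandardShareFinder
# keep at least 2 GB free on the incoming shares
store-shuffler.shuffling.share-finder.incoming-shares-minimum-free-space-in-MB = 2048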
ch.systemsx.cisd.openbis.dss.generic.shared.ExperimentBasedShareFinder
This share finder looks for a share which is associated with one or more experiments. If the data set to be moved belongs to one of the specified experiments and the share has enough space, the finder returns this share. The association of a share with experiments is done via the property experiments of the properties file share.properties . The value is a comma-separated list of experiment identifiers.
ch.systemsx.cisd.openbis.dss.generic.shared.DataSetTypeBasedShareFinder
This share finder looks for a share which is associated with one or more data set types. If the data set to be moved is of one of the specified types and the share has enough space the finder returns this share. The association of a share with some data set type is done by the property data-set-types
of the properties file share.properties
. The value is a comma-separated list of data set type codes.
ch.systemsx.cisd.openbis.dss.generic.shared.MappingBasedShareFinder
This share finder reads a mapping file specified by the property mapping-file . It is used to get a list of share IDs. The first share from this list which fulfills the following conditions is returned: the share exists and its free space is larger than the size of the data set. For more details on the mapping file see Mapping File for Share Ids and Archiving Folders.
Data Sources
A DSS often also needs relational databases. Such databases can be internal ones, fed when data sets are registered, or external ones providing additional information for data set registration, processing or reporting plugins. Data sources should be defined as core plugins of type data-sources . The following properties are understood:
Property name | Description |
---|---|
factory-class | Optional fully-qualified name of a class implementing ch.systemsx.cisd.openbis.generic.shared.util.IDataSourceFactory . The properties below are understood if the default factory class ch.systemsx.cisd.openbis.dss.generic.shared.DefaultDataSourceFactory is used. |
version-holder-class | Optional fully-qualified name of a class implementing ch.systemsx.cisd.openbis.dss.generic.shared.IDatabaseVersionHolder . This property is only used if this is a data source of an internal database where DSS takes care of creating and migrating the database. |
databaseEngineCode | Mandatory property specifying the database engine. Currently only postgresql is supported. |
basicDatabaseName | Mandatory property specifying the first part of the database name. |
databaseKind | Mandatory property specifying the second part of the database name. The full database name reads <basicDatabaseName>_<databaseKind>. |
scriptFolder | Folder containing database schema SQL scripts. This property is mandatory for internal databases. For external databases it will be ignored. |
urlHostPart | Optional host part of the database URL. This is the host name with an optional port number. The default value is a standard value for the selected database engine, assuming that the database server and DSS are running on the same machine. |
owner | Owner of the database <basicDatabaseName>_<databaseKind> . Default: User who started up DSS |
password | Owner password. |
adminUser | Administrator user of the database server. Default is defined by the selected database engine. |
adminPassword | Administrator password. |
Example:
version-holder-class = ch.systemsx.cisd.openbis.dss.etl.ImagingDatabaseVersionHolder
databaseEngineCode = postgresql
basicDatabaseName = imaging
urlHostPart = ${imaging-database.url-host-part:localhost}
databaseKind = ${imaging-database.kind:prod}
scriptFolder = ${screening-sql-root-folder:}sql/imaging
Simple data sources
If the database used in the Data Store Server is not managed by openBIS and doesn't follow the openBIS conventions on versioning etc., it is possible to specify it with SimpleDataSourceFactory as in the following example:
factory-class = ch.systemsx.cisd.openbis.dss.generic.shared.SimpleDataSourceFactory
database-driver = oracle.jdbc.driver.OracleDriver
database-url = jdbc:oracle:thin:@test.test.com:1111:orcl
database-username = test
database-password = test
# database-max-idle-connections =
# database-max-active-connections =
# database-max-wait-for-connection = # value in milliseconds
# database-active-connections-log-interval = # value in milliseconds
# validation-query =
Reporting and Processing Plugins
The properties reporting-plugins and processing-plugins are comma-separated lists of reporting/processing plugin names. Each reporting plugin should be able to take a set of data sets chosen by the user and produce a tabular result in a short time. The result table is shown to the user and can be exported to a tab-separated file. A processing plugin is similar, but it does not create data that is immediately presented to the user. It can be used to do some time-consuming processing for the data sets selected by the user. Often, processing plugins inform the user via e-mail after finishing.
The name of all configuration properties for a particular reporting/processing plugin starts with <plugin name>.
. The following properties are understood:
Property name | Description |
---|---|
label | Label of the plugin which will be shown to the users. This property is mandatory. |
dataset-types | Comma separated list of data set type codes which can be handled by this plugin. This property is mandatory. |
class | Fully qualified Java class name of the plugin. It has to implement |
properties-file | The property file. Its content will be passed as a parameter to the plugin. |
 | Fully qualified Java class name of an optional servlet needed by the plugin. |
 | Path pattern relative to the DSS application URL to which the servlet is bound. This property is mandatory if |
 | Comma separated list of names of servlets needed by the plugin. For each name a set of properties with prefix |
Default Plugin used in Data View
In the detail view of a dataset in the openBIS application, the Data View section by default shows 'Files (Smart View)' . If there are reporting plugins specified for the dataset's type, users can query a reporting plugin whose results will be shown in the Data View. To show reporting plugin results by default for specific dataset types (instead of 'Files (Smart View)' ), the name of the plugin should start with the default- prefix, e.g. default-tsv-viewer could be specified as the default plugin for datasets of type TSV .
- If for a certain dataset type there is more than one reporting plugin specified as default, then the first plugin (ordering plugin labels alphabetically) will be chosen by default.
- If you have two data set types, e.g. TSV and CSV , and want to use the same plugin implementation for both but choose it by default only for dataset type TSV , you need to configure two separate plugins. Both configurations should have the same class property value but different dataset-types , and additionally the plugin configuration for datasets of type TSV should have a name starting with the default- prefix, as shown in the sketch below.
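A sketch of such a configuration (the plugin names are examples; the class and property names are taken from the TSV Viewer section below):
reporting-plugins = default-tsv-viewer, csv-viewer
# default plugin for data sets of type TSV
default-tsv-viewer.label = TSV View
default-tsv-viewer.dataset-types = TSV
default-tsv-viewer.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
# same implementation for CSV, but not the default there
csv-viewer.label = CSV View
csv-viewer.dataset-types = CSV
csv-viewer.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
csv-viewer.separator = ,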
Existing reporting plugins
TSV Viewer
Description
Allows showing the content of the Main Data Set as an openBIS table. To become the Main Data Set , a file must either be the only file in the data set directory or match a given regular expression specific to the given data set type. Such a regular expression can be configured by the administrator under Data Set -> Types -> Edit -> Main Data Set Pattern. This is one of a few reporting plugins working on files with tabular data such as TSV (tab separated value) files, CSV (comma separated value) files or Excel files (supported extensions: XLS, XLSX).
Configuration
# Add the plugin id to the reporting-plugins, in our case: tsv-viewer
reporting-plugins = tsv-viewer
# Set the label that will be visible to the openBIS users
tsv-viewer.label = TSV View
# Specify data set types that should be viewable with the TSV Viewer
tsv-viewer.dataset-types = TSV
tsv-viewer.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
tsv-viewer.properties-file =
# Optional properties:
# - separator of values in the file, significant for TSV and CSV files; default: tab
#tsv-viewer.separator = ;
# - whether lines beginning with '#' should be ignored by the plugin; default: true
#tsv-viewer.ignore-comments = false
# - excel sheet name or index (0 based) used for the excel file (.xls or .xlsx); default: 0 (first sheet)
#tsv-viewer.excel-sheet = example_sheet_name
Jython Based Reporting Plugin
Description
Creates a report based on a jython script. For more details see Jython-based Reporting and Processing Plugins.
Example
label = My Report
class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.jython.JythonBasedReportingPlugin
dataset-types = MY_DATA_SET
script-path = script.py
Jython Based Aggregation Reporting Plugin
Description
Creates an aggregation report based on a jython script. Note that the property dataset-types isn't needed and will be ignored. Aggregation reporting plugins can be used only via the Query API. For more details see Jython-based Reporting and Processing Plugins.
Example
label = My Report
class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.jython.JythonBasedAggregationServiceReportingPlugin
script-path = script.py
Reporting Plugin Decorator
Description
A reporting plugin which modifies the table produced by another reporting plugin. The modification is done by a transformation class which implements ch.systemsx.cisd.openbis.generic.shared.basic.ITableModelTransformation
. Currently there is only the transformation ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.EntityLinksDecorator
which turns cells of some columns into cells with links to a material or a sample.
Configuration
Here is an example of a configuration which decorates a TSV viewer plugin which produces a table with a material and a sample column:
label = Example with Materials and Samples
dataset-types = MY_DATA_SET_TYPE
class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.DecoratingTableModelReportingPlugin
# The actual reporting plugin which creates a table
reporting-plugin.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TSVViewReportingPlugin
reporting-plugin.separator = ,
# The transformation applied to the table returned by the actual reporting plugin
transformation.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.EntityLinksDecorator
# Link columns is a list of comma-separated column IDs which should be decorated.
transformation.link-columns = GENE_ID, BARCODE
# Entity kind is either MATERIAL or SAMPLE
transformation.GENE_ID.entity-kind = MATERIAL
# The type of the material. Note, it is assumed that the column contains only the material code.
transformation.GENE_ID.material-type = GENE
transformation.BARCODE.entity-kind = SAMPLE
# Optional default space to be used if the column contains only sample codes instead of full identifiers.
transformation.BARCODE.default-space = DEMO
Archiver Plugin
The archiver is an optional plugin used to archive and unarchive data sets. It receives a set of data sets chosen by the user or by the automatic archiver maintenance task, performs its task on the Data Store Server, and finally changes the status of the data sets in the openBIS database.
The archiver plugin is also needed if freshly registered data sets should immediately be archived (mainly for backup purposes). For more details see Post Registration and ArchivingPostRegistrationTask.
The names of all configuration properties for the archiver plugin start with archiver. .
Property name | Description |
---|---|
archiver.class | The archiver class, see the sections below for examples. |
archiver.share-finder.class | (optional) A #Share Finder strategy selecting a destination share when unarchiving datasets. The default value of this property is |
archiver.synchronize-archive | (optional) Indicates if data should be synchronized when the local copy is different from the one in the archive (default: true). |
archiver.batch-size-in-bytes | (optional) Datasets will be archived in batches. The datasets will be split in batches of roughly the same size (in bytes) controlled by the value of this property. Default is 1 gigabyte. |
archiver.pause-file | Path (absolute or relative to store root) of an empty file. If this file is present starting archiving/unarchiving will be paused until this file has been removed. This property is useful for archiving media/facilities with maintenance downtimes. Default value: pause-archiving |
archiver.pause-file-polling-time | Time interval between two checks whether pause file still exists or not. Default value: 10 min |
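For example, archiving can be paused for a maintenance window of the archiving destination by creating the pause file (shown here with the default file name relative to the store root; adjust the path to your installation):
prompt> touch <store root>/pause-archiving
# ... maintenance of the archiving destination ...
prompt> rm <store root>/pause-archiving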
Rsync Archiver
Rsync Archiver (ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.RsyncArchiver
) is an example archiver implementation that stores the archived data in a specified destination folder. The destination folder doesn't have to be on the same file system as the data store. It can also be:
- a mounted remote folder,
- a remote folder accessible via SSH (add
<hostname>:
prefix to folder path in configurarion), - a remote folder accessible via an rsync server (add
<hostname>:<rsync module name>
prefix to folder path in configurarion).
Apart from standard archiver task properties the following specific properties are understood:
Property name | Description |
---|---|
destination | Path to the destination folder where archived data sets will be stored. |
rsync-password-file | (optional) Path to password file used when rsync module is specified. |
find-executable | (optional) Path to the GNU find executable that is used to verify data integrity. If not specified, the default find executable found on the system is used, which should be the right one on most Linux distributions, but not on Mac OS. |
timeout | (optional) Network I/O timeout in seconds. The default value for this parameter is 15 seconds. |
only-mark-as-deleted | This is a flag which tells the archiver whether to delete data sets or only to mark data sets as deleted in the archive. In the second case a marker file (file name is the data set code) will be added to the folder |
verify-checksums | (optional) This flag specifies if CRC32 checksum check should be performed. The default is |
batch-size-in-bytes | (optional) This allows to control when the archiving status of just archived data sets will be updated. This is the case when the sum of data set sizes exceeds the specified threshold. Default value is 1GB. |
temp-folder | (optional) Temporary folder to be used in sanity check of archived data sets on an unmounted remote archive location. Default is just the data store itself. In case of TarArchiver this temporary folder is also used to unpack the TAR archive file for sanity check. |
Example configuration:
archiver.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.RsyncArchiver
archiver.share-finder.class = ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder
archiver.destination = hostname:/path/to/destination/folder
archiver.timeout = 20
Zip Archiver
This archiver ( ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.ZipArchiver ) archives data sets in ZIP files together with the meta data (including properties) of the data set, the experiment and optionally the sample to which the data set belongs. The meta data of the container data set (if present) are also included (with properties, experiment and sample). The meta data are stored in the tab-separated file meta-data.tsv inside the ZIP file. Each data set will be stored in one ZIP file named <data set code>.zip .
The location of the ZIP file is specified by a mapping file which allows specifying archive folders in accordance with the experiment/project/space identifier of the data set. Note that all archive folders have to be on mounted disks. For more details about syntax and resolving rules see Mapping File for Share Ids and Archiving Folders.
Note that this archiver doesn't work properly if HDF5 files are treated as folders.
Apart from standard archiver task properties the following specific properties are understood:
Property name | Description |
---|---|
only-mark-as-deleted | This is a flag which tells the archiver whether to delete data sets or only to mark data sets as deleted in the archive. In the second case a marker file (file name is the data set code) will be added to the folder |
verify-checksums | (optional) This flag specifies if CRC32 checksum check should be performed. The default is true . |
default-archive-folder | This is the path to the archive folder which is used if a mapping file hasn't been specified or an appropriate archive folder couldn't be found in the mapping file. This is a mandatory property. |
default-small-data-sets-archive-folder | This is the path to the archive folder which is used for small data sets only. Which data sets are considered "small" is controlled by the "small-data-sets-size-limit" property. When the small data sets folder is defined then the small data sets limit has to be set as well. This folder is used only when the mapping file hasn't been specified or an appropriate archive folder couldn't be found in the mapping file. This is an optional property. |
small-data-sets-size-limit | Controls which data sets are considered "small". Data sets whose size is smaller than or equal to this property value are treated as "small". Data sets whose size is greater than this property value are treated as "big". The limit is expressed in kilobytes. E.g. a value of 1024 means that data sets whose size is smaller than or equal to 1MB are considered "small". The limit is used when choosing between "default-archive-folder" and "default-small-data-sets-archive-folder" or between "big" and "small" folders defined in the mapping file. |
mapping-file | Path to the mapping file. This is an optional property. If not specified the default-archive-folder will be used for all data sets. |
mapping-file.create-archives | If true the archive folders specified in the mapping file will be created if they do not exist. Default value is false . |
compressing | If true compression is used when creating the archived data set. Otherwise an uncompressed ZIP file is created. Default value is true . |
with-sharding | If true the path of the ZIP file is <archive folder>/<data store UUID>/<sharding levels>/<data set code>/<data set code>.zip . Otherwise the path reads <archive folder>/<data set code>.zip . Default value is false . |
ignore-existing | If true then data sets that already exist in the archive (zip file exists and it is not empty) are ignored and not copied to the archive again. If false then data sets are always copied to the archive without checking if they exist. Default value is false . |
Note that the property synchronize-archive will be ignored: an already archived data set will be archived again if archiving has been triggered.
It is recommended to use MappingBasedShareFinder as the share finder for unarchiving.
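A sketch of a Zip Archiver configuration combining these properties (all paths are placeholders):
archiver.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.ZipArchiver
archiver.default-archive-folder = /mnt/archive/default
archiver.mapping-file = etc/archive-mapping.tsv
archiver.with-sharding = true
# recommended share finder for unarchiving
archiver.share-finder.class = ch.systemsx.cisd.openbis.dss.generic.shared.MappingBasedShareFinder
archiver.share-finder.mapping-file = etc/archive-mapping.tsv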
Tar Archiver
This archiver (ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.TarArchiver
) archives data sets in TAR files. It is very similar to the ZIP archiver. It accepts all the configuration properties that the ZIP archiver accepts, except for the "compressing" property.
Automatic Archiver
The Automatic Archiver ( ch.systemsx.cisd.etlserver.plugins.AutoArchiverTask ) is a maintenance task scheduled for repeated execution, beginning at a specified time (optional - by default it is executed without delay after the server starts), with regular intervals between subsequent executions.
Apart from standard maintenance task properties the following specific properties (all optional) are understood:
Property name | Description |
---|---|
data-set-type | Dataset type code of datasets that will be archived. By default all types will be archived. |
older-than | Only data sets that are older than the specified number of days will be archived (default = 30). |
policy.class | Fully qualified class name of a policy that additionally filters data sets to be archived. |
remove-datasets-from-store | Indicates whether data sets will be removed from the data store upon archiving (default = false). |
By default all data sets with status 'AVAILABLE'
will be archived. Power users can prevent auto archiving of data sets by changing their status to 'LOCKED'
.
Automatic Experiment Archiver
This is a maintenance task ( ch.systemsx.cisd.etlserver.plugins.ExperimentBasedArchivingTask ) which archives whole experiments if the free disk space is below a configured threshold. Archiving a whole experiment means that all data sets of this experiment are archived.
As any maintenance task, it is scheduled for repeated execution beginning at the specified time (optional - by default it is executed without delay after the server starts) with regular intervals between subsequent executions.
The archiver archives one or more experiments starting with the oldest one. It stops when either a fixed number of experiments have been archived or the free disk space is above the threshold. Experiments are not archived if at least one of their data sets is in the state LOCKED . The age of an experiment is defined by its youngest not yet archived data set.
Apart from standard maintenance task properties the following specific properties are understood:
Property name | Description |
---|---|
monitored-dir | This mandatory property specifies the directory of the store whose disk space is monitored. In the case the configured free space provider is |
minimum-free-space-in-MB | This mandatory property specifies the threshold (in MB) for free disk space on the share defined by |
excluded-data-set-types | A comma-separated list of data set types. Data sets of those types will not be archived. They will not be used for calculating the age of an experiment. |
free-space-provider.class | The classname of the free space provider to be used. The default is |
free-space-provider.* | Properties for the configured free space provider |
estimated-data-set-size-in-KB.<data-set-type> | Provides an estimation of the average size in kilobytes for a given data set type. |
estimated-data-set-size-in-KB.DEFAULT | Default estimation of data set size. If no specific |
Archive clean-up
Archives are not automatically updated when data sets are deleted in openBIS. To purge archives of deleted data sets one has to configure a clean-up maintenance task - ch.systemsx.cisd.etlserver.plugins.DeleteFromArchiveMaintenanceTask .
Apart from the standard maintenance task properties the following specific properties will be accepted by the task:
Property name | Description |
---|---|
delay-after-user-deletion | Only data sets that have been deleted before more than |
status-filename | A mandatory property keeping a filename, where the task tracks the last processed deletion. The path should be outside the DSS installation in order to survive DSS update installations. |
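A sketch of the corresponding maintenance task configuration (the status file path is a placeholder; it should point outside the DSS installation):
maintenance-plugins = <other maintenance plugins>, archive-cleanup

archive-cleanup.class = ch.systemsx.cisd.etlserver.plugins.DeleteFromArchiveMaintenanceTask
archive-cleanup.status-filename = /var/openbis/deletion-event-lastseenid.txt
archive-cleanup.interval = 86400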
Free space providers
SimpleFreeSpaceProvider
The class ch.systemsx.cisd.common.filesystem.SimpleFreeSpaceProvider
returns the free space on the file system of a hard drive. It works similarly to the command
$ df -h
PostgresPlusFileSystemFreeSpaceProvider
The class ch.systemsx.cisd.openbis.dss.generic.shared.utils.PostgresPlusFileSystemFreeSpaceProvider returns the free space on a hard drive as the sum of the file system free space and the free space detected in a PostgreSQL data source. Note that the data source is required to have the PostgreSQL extension pgstattuple installed.
The following configuration properties are available for PostgresPlusFileSystemFreeSpaceProvider :
Property Name | Description |
---|---|
monitored-data-source | The name of the data source to monitor. |
execute-vacuum | When set to true a VACUUM command is executed before the database is asked for the available free space. |
Data Set Validators
Data Set Validators are used to accept or reject incoming data sets. They are specific for a data set type.
The property data-set-validators is a comma-separated list of the names of validators for data sets. The names of all configuration properties for a particular validator start with <validator name>. . For an example, see the above-mentioned example of a service.properties . The following properties are understood:
Property name | Default value | Description |
---|---|---|
| Mandatory data set type. The validator will be used only for data sets of this type. | |
|
| Fully-qualified name of a Java class implementing |
Data Set Validator for TSV Files
The default validator is DataSetValidatorForTSV
. It is able to validate tab-separated value (TSV) files. It is assumed that the first line of such files contains the column headers. All headers should be unique (i.e. no duplicated column headers). This validator understands the following properties:
Property name | Default value | Description |
---|---|---|
|
| List of comma-separated wild-card patterns for paths of files in the data set to be validated. The characters |
| List of comma-separated names of column definitions. These column definitions are used
|
A column definition understands the following properties:
Property name | Default value | Description |
---|---|---|
|
| If |
| undefined | Optional property. It specifies a column in the TSV file at a particular position. 1 means first column, 2 means second column etc. |
|
| If |
| undefined | Fully-qualified name of a Java class implementing |
| This property is mandatory if | |
|
| Fully-qualified name of a Java class implementing |
DefaultValueValidatorFactory
The default factory can generate four different types of value validators. It understands the following properties:
Property name | Default value | Description |
---|---|---|
|
| Type of the value validator. The following values are allowed:
|
|
| If |
| An optional property of a comma-separated list of strings which are handled as synonyms for an empty value. This property is only used if | |
| Regular expression for a validator of type | |
| Optional range definition for validators of type ('('|'[') (-Infinity|<floating point number>) ',' (Infinity|<floating point number>) (']'|')'). Examples: (0,3.14159] [0,Infinity) (-Infinity,1e-18] | |
| Mandatory property for validators of type unique_groups. Example: value-type = unique_groups, value-pattern = (a*)(b*)c*(d*), groups = 1,3. The number of a group can be calculated as the number of '(' characters present in the regex before the chosen group. For the pattern (A(B))C(D(E)) the available groups are: 1: (A(B)), 2: (B), 3: (D(E)), 4: (E). |
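Because several table entries above are truncated, the following is only a hedged sketch of a validator configuration. The validator name tsv-validator, the property keys, and all values are assumptions made to illustrate the <validator name>.<property> naming scheme described earlier in this chapter.

data-set-validators = tsv-validator

# data set type and validator class (keys and values assumed)
tsv-validator.data-set-type = HCS_IMAGE
tsv-validator.validator = ch.systemsx.cisd.etlserver.validation.DataSetValidatorForTSV
# wild-card patterns of the files to validate and the column definitions (keys assumed)
tsv-validator.path-patterns = **/*.tsv
tsv-validator.columns = id
tsv-validator.id.mandatory = true
tsv-validator.id.value-type = unique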
HeaderBasedValueValidatorFactory
This value validator factory (fully qualified class name: ch.systemsx.cisd.etlserver.validation.HeaderBasedValueValidatorFactory
) is a collection of column header specific validator factories. The factory is chosen by the first regular expression matching the header. HeaderBasedValueValidatorFactory understands the following properties:
Property name | Description |
---|---|
| List of comma-separated unique names of value validator factories. For each name configuration properties are defined. The property names start with |
| Regular expression to be matched by column header. |
Remote DSS
More than one DSS may be connected to the same AS. In this case a DSS might request files remotely from another DSS. This can happen in FTP/SFTP server and aggregation/ingestion services. The files downloaded from a remote DSS will be cached. By default the cache is located in the data folder of a standard installation (data/dss-cache
). Its default maximum size is 1 GB. These parameters can be changed by the following properties in service.properties
:
Property Name | Default Value | Description |
---|---|---|
cache-workspace-folder | ../../data/dss-cache | Folder which will contain the cached files. |
cache-workspace-max-size | 1024 | Maximum size of cache in MB. |
cache-workspace-min-keeping-time | 1 day | Minimum time a data set is kept in the cache. Can be specified with one of the following time units: |
Files are removed from the cache if adding a downloaded file would make the cache exceed the size specified by cache-workspace-max-size
. Removal is guided by the following rules:
- Removal happens after downloading a file which isn't yet in the cache.
- Either none or all cached files of a data set are removed.
- The "oldest" data set is removed first. The age of a data set is determined by the last time a cached file in this data set has been requested.
- Data sets are not removed when they are younger than specified by
cache-workspace-min-keeping-time
.
ETL Threads
The property inputs
is a comma-separated list of ETL threads. Each ETL thread registers incoming data set files/folders in openBIS and stores them in the data store (property storeroot-dir
).
The names of all configuration properties for a particular ETL thread start with <thread name>.
. The following properties are understood (a minimal example follows the table):
Property name | Default value | Description |
---|---|---|
| The drop box for incoming data sets. | |
|
| Condition which determines when an incoming data set should be considered to be complete and ready for processing. Only the following values are allowed:
|
|
| If |
| Properties determine an extractor which extracts from the data set file/folder information necessary for registering in openBIS. For more details, see below. | |
| Extractor which extracts various type information from the data set file/folder. For more details, see below. | |
| Processor which processes the incoming data set file/folder and stores it in the store. For more details, see below. | |
| Path to the script that will be executed before data set registration. The script will be called with two parameters: | |
| Path to the script that will be executed after successful data set registration. The script will be called with two parameters: | |
incoming-share-id | Id of the incoming share the dropbox drops into. The share can be on a different disk than the incoming folder. |
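A minimal thread configuration might look like the sketch below. The thread name main-thread and the concrete values are illustrative; the property keys appear in the table above or in the configuration examples later in this document.

inputs = main-thread

main-thread.incoming-dir = data/incoming
main-thread.incoming-data-completeness-condition = auto-detection
main-thread.incoming-share-id = 1
main-thread.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
main-thread.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
main-thread.type-extractor.file-format-type = TIFF
main-thread.type-extractor.locator-type = RELATIVE_LOCATION
main-thread.storage-processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor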
Data Set Info Extractor
Each thread has a property <thread name>.data-set-info-extractor
which denotes a Java class implementing IDataSetInfoExtractor
.
The most important implementation is ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
which extracts most information from the name of the data set file/folder. It understands the following properties (all with prefix <thread name>.data-set-info-extractor.
):
Property name | Default value | Description |
---|---|---|
| Space code of the sample. If unspecified a shared sample is assumed. | |
|
| If |
|
| Character which separates entities in the file name. Whitespace characters are not allowed. |
|
| Character which separates sub entities of an entity. Whitespace characters are not allowed. |
|
| Index of the entity which is interpreted as the sample code. Data set belongs to a sample. |
| Index of the entity which is interpreted as the experiment identifier. It contains the project code, experiment code, and optionally the space code (if different from the property | |
| Index of the entity which is interpreted as sequence of parent data set codes. The codes have to be separated by the sub entity separator. If not specified no parent data set code will be extracted. Parent data set codes will be ignored if | |
| Index of the entity which is interpreted as the data producer code. If not specified no data producer code will be extracted. | |
| Index of the entity which is interpreted as the data production date. If not specified no data production date will be extracted. | |
|
| Format of the data production date. For the correct syntax see SimpleDateFormat |
| Path to a file inside a data set folder which contains data set properties. This file has to be a tab-separated file. The first line of the file should contain the column definitions (property and value). Example content of data-set-properties.tsv: a header line 'property value' followed by lines such as 'DESCRIPTION Description of data set series', 'SCHEMA_VERSION 21.3', and 'SERIES 2009-04-01'. |
Entity indexes count 0, 1, 2, etc. from the left and -1, -2, -3, etc. from the right. An illustrative configuration follows.
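The sketch below shows how such an extractor could be configured. The thread name main-thread, the property key names, and the values are assumptions made for illustration, since the key names are not fully legible in the table above.

main-thread.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
# character separating entities in the incoming file name (key name assumed)
main-thread.data-set-info-extractor.entity-separator = .
# first entity is the sample code, second one the experiment identifier (key names assumed)
main-thread.data-set-info-extractor.index-of-sample-code = 0
main-thread.data-set-info-extractor.index-of-experiment-identifier = 1
main-thread.data-set-info-extractor.space-code = MY_SPACE

With these assumed settings, an incoming item name would be split at '.' into entities, the first entity being interpreted as the sample code.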
Type Extractor
Each thread has a property <thread name>.type-extractor
which denotes a Java class implementing ITypeExtractor
.
ch.systemsx.cisd.etlserver.SimpleTypeExtractor
is one of the available type extractors. Actually it doesn't extract anything from the incoming data set; it takes all information from the following properties, illustrated in the sketch after the table (all with prefix <thread name>.type-extractor.
):
Property name | Default value | Description |
---|---|---|
| A data set type as registered in openBIS. | |
| A file type as registered in openBIS. | |
|
| Whether the data set is a measured one or a derived/calculated one. |
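A hedged sketch of a SimpleTypeExtractor configuration. The keys data-set-type and is-measured are assumptions based on the table above; file-format-type and locator-type also appear in the configuration examples later in this document. The thread name main-thread and the values are illustrative.

main-thread.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
main-thread.type-extractor.data-set-type = HCS_IMAGE
main-thread.type-extractor.file-format-type = TIFF
main-thread.type-extractor.locator-type = RELATIVE_LOCATION
main-thread.type-extractor.is-measured = true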
Storage Processor
Each thread has a property <thread name>.storage-processor
which denotes a Java class implementing IStorageProcessor
.
The most important storage processors are ch.systemsx.cisd.etlserver.DefaultStorageProcessor
and ch.systemsx.cisd.etlserver.imsb.StorageProcessorWithDropbox
. The latter extends a storage processor with the additional behavior of copying the original data set file/folder to a configurable folder (drop box). It understands the following properties, sketched after the table (all with prefix <thread name>.storage-processor.
):
Property name | Default value | Description |
---|---|---|
| Java class of the storage processor which will be extended. | |
| Folder to which the original data set file/folder will be copied. | |
|
| Separator character used to build the following file name for the data set file/folder in the drop box: |
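A hedged sketch of a StorageProcessorWithDropbox configuration; the property key names processor and dropbox-dir are assumptions, since the names are not legible in the table above, and the thread name and values are illustrative.

main-thread.storage-processor = ch.systemsx.cisd.etlserver.imsb.StorageProcessorWithDropbox
# storage processor to be extended (key name assumed)
main-thread.storage-processor.processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor
# folder receiving a copy of the original data set file/folder (key name assumed)
main-thread.storage-processor.dropbox-dir = /path/to/outgoing-dropbox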
Dataset handler
Each thread has a property <thread name>.dataset-handler
which denotes a Java class implementing IDataSetHandler
.
This property is optional; the default dataset handler is used if it is not specified.
This class allows deciding at a high level how to handle an incoming data set file or directory.
It can delegate its job to the default dataset handler, performing some operations beforehand or afterwards.
It also makes it possible to handle data sets residing in any particular directory structure.
Monitoring Thread Activity
Each thread regularly monitors activity on a directory. In the datastore server installation directory, there is a subdirectory .activity
/ which contains an (empty) file for each thread. The file gets 'touched' each time the thread starts a processing round. Thus, by looking at the time stamp of the corresponding file, an administrator can find out when the last processing round of this thread took place. This can help spot long-running data ingestion processes quickly and can also help to check whether a thread is 'hanging', e.g. because of an erroneous dropbox script.
Monitoring User Session Activity
It is possible to monitor the growth of the number of openBIS user sessions. If the number of sessions exceeds the specified threshold, a notification is sent to the admin. This behavior is controlled by two properties (defined in the openBIS AS service.properties file; see the example after the list):
- session-notification-threshold (by default it is 0, which means the feature is off)
- session-notification-delay-period (how much time should pass between two notifications, expressed in seconds)
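A hedged example for the openBIS AS service.properties; the threshold and delay values are illustrative.

session-notification-threshold = 500
session-notification-delay-period = 3600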
Post Registration
It is possible to perform a sequence of tasks after registration of a data set. This is done by the maintenance task ch.systemsx.cisd.etlserver.postregistration.PostRegistrationMaintenanceTask
. This task determines which data sets have been freshly registered and runs a sequence of configurable post-registration tasks for each of them. It should be configured with a short time interval (e.g. 60 seconds) in order to do post-registration eagerly. It knows the following configuration properties:
Property | Description |
---|---|
| Time interval in seconds post-registration is performed. |
| Path to a folder where serialized cleanup tasks are stored. If DSS crashes the cleanup tasks in this folder will be executed after start up. Default value is |
| Path to a file which will store the technical ID of the last data set being post-registered. The path should be outside the DSS installation in order to survive DSS update installations. |
| Optional date specifying older data sets to be ignored. The format is |
| Comma-separated list of names of post registration tasks. The tasks are performed in the order they are specified. Each name is the sub-key of the properties for the tasks. All tasks should have at least the key |
Example:
maintenance-plugins = <other maintenance plugins>, post-registration

post-registration.class = ch.systemsx.cisd.etlserver.postregistration.PostRegistrationMaintenanceTask
post-registration.interval = 60
post-registration.cleanup-tasks-folder = cleanup-tasks
post-registration.ignore-data-sets-before-date = 2011-03-27
post-registration.last-seen-data-set-file = ../../last-seen-data-set
post-registration.post-registration-tasks = eager-shuffling, notifying, eager-archiving
post-registration.eager-shuffling.class = ch.systemsx.cisd.etlserver.postregistration.EagerShufflingTask
post-registration.eager-shuffling.share-finder.class = ch.systemsx.cisd.openbis.dss.generic.shared.SimpleShareFinder
post-registration.notifying.class = ch.systemsx.cisd.etlserver.postregistration.NotifyingTask
post-registration.notifying.message-template = data-set = ${data-set-code}\n\
injection-volumn = ${property.inj-vol}\n
post-registration.notifying.destination-path-template = targets/ds-${data-set-code}.properties
post-registration.eager-archiving.class = ch.systemsx.cisd.etlserver.postregistration.ArchivingPostRegistrationTask
Currently the following post-registration tasks are available.
EagerShufflingTask
Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.EagerShufflingTask
This task tries to find a share in the segmented store to which it will move the freshly registered data set.
The following configuration properties are recognized:
Property | Description |
---|---|
| Fully qualified class name of the Java class implementing |
| A free space threshold which leads to a notification of an admin via e-mail if this threshold has been crossed after adding a data set to a share. |
| If |
verify-checksum | If true then checksum verification is performed for the moved data sets (default: true) |
NotifyingTask
Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.NotifyingTask
This task writes a file with some information about the freshly registered data set.
The following configuration properties are recognized:
Property | Description |
---|---|
| A template for creating the file content. |
| A template for creating the path of the file. |
| Optional. Contains comma-separated patterns (Java regular expressions). If specified, only data sets whose type matches any of the patterns will be processed. |
The template can have placeholders of the form ${<place holder name>
}. The following placeholder names can be used: data-set-code
and property.<data set property type code>
, where the data set property type code can be in lower case. For newline/tab characters in the template use \n
and \t
, respectively.
ArchivingPostRegistrationTask
Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.ArchivingPostRegistrationTask
This task archives the freshly registered data set. The archived data set is still available. This task can be used for creating backups. Note that an ArchiverPlugin has to be configured.
RequestArchivingPostRegistrationTask
Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.RequestArchivingPostRegistrationTask
This task sets the archiving-request flag of the freshly registered data set.
SecondCopyPostRegistrationTask
Fully qualified class name: ch.systemsx.cisd.etlserver.postregistration.SecondCopyPostRegistrationTask
This task creates a copy of all files of the freshly registered data set.
The following configuration properties are recognized:
Property | Description |
---|---|
| Path to the folder which will contain the copy. The files are created in the same folder structure as in a share folder. |
Path Info Database
Some data sets consist of a large number of files. Navigating such a large file set can be slow, especially for NFS-based storage. In order to speed up navigation, a so-called path info database can be used. Three or four things have to be configured to run it properly: a data source, a post-registration task, a deletion maintenance task, and an optional migration task. Here is an example configuration:
# ---------------------------------------------------------------------------
# Data sources
data-sources = path-info-db, <other data sources>
# Data source for pathinfo database
path-info-db.version-holder-class = ch.systemsx.cisd.openbis.dss.generic.shared.PathInfoDatabaseVersionHolder
path-info-db.databaseEngineCode = postgresql
path-info-db.basicDatabaseName = pathinfo
path-info-db.databaseKind = productive
path-info-db.scriptFolder = datastore_server/sql
# Use this if you need to give the data source db superuser privileges
#path-info-db.owner = postgres

# ---------------------------------------------------------------------------
# Maintenance Tasks
maintenance-plugins = post-registration, path-info-deletion, path-info-feeding, <other maintenance tasks>

# ---------------------------------------------------------------------------
# Post registration task: Feeding path info database with freshly registered data sets.
post-registration.class = ch.systemsx.cisd.etlserver.postregistration.PostRegistrationMaintenanceTask
post-registration.interval = 30
post-registration.cleanup-tasks-folder = ../../cleanup-tasks
post-registration.ignore-data-sets-before-date = 2011-01-27
post-registration.last-seen-data-set-file = ../../last-seen-data-set-for-postregistration.txt
post-registration.post-registration-tasks = pathinfo-feeding
post-registration.pathinfo-feeding.class = ch.systemsx.cisd.etlserver.path.PathInfoDatabaseFeedingTask

# ---------------------------------------------------------------------------
# Deletion task: Remove entries from path info database after data set deletion
path-info-deletion.class = ch.systemsx.cisd.etlserver.plugins.DeleteFromExternalDBMaintenanceTask
path-info-deletion.interval = 120
path-info-deletion.data-source = path-info-db
path-info-deletion.data-set-table-name = data_sets
path-info-deletion.data-set-perm-id = CODE

# ---------------------------------------------------------------------------
# Migration task: Initial feeding of path info database with existing data sets.
path-info-feeding.class = ch.systemsx.cisd.etlserver.path.PathInfoDatabaseFeedingTask
path-info-feeding.execute-only-once = true
Remarks:
- The name of the data source has to be path-info-db.
- The migration task can be removed from the configuration file after successful execution.
File system view of the DSS
The DSS can expose its data store contents via a read-only interface that can be made available as an FTP/SFTP server. The following chapters describe how to set up FTP/SFTP. The interface exposes a file system view of the DSS, which has a basic structure and can be extended using properties.
Basic hierarchy
The structure of the file system view is organized like the classical openBIS hierarchy:
/{SPACE}/{PROJECT}/{EXPERIMENT}/{DATA-SET-CODE}/{data-set-files-and-directories}
Plugins
It's possible to register plugins to the file system view. Such a plugin must be able to resolve paths into a response that is either a listing of a directory contents (files and directories) or a file to download.
To create a plugin it is necessary to create a new core plugin of type file-system-plugins
with the definition of a custom view. If any file system plugins are defined, then each plugin's hierarchy is visible behind a separate top-level directory and "DEFAULT" is used for the basic hierarchy.
In the plugin properties it is necessary to specify a class implementing the IResolverPlugin interface, e.g.:
resolver-class = ch.systemsx.cisd.openbis.dss.generic.server.fs.plugins.JythonResolver
code = jython-plugin
script-file = script.py
For a Jython resolver it is required to implement a method resolve(pathItems, context)
with the signature defined in IResolver:
import time

def resolve(subPath, context):
    api = context.getApi()  # the way to access objects from openbis v3 api
    session_token = context.getSessionToken()
    if len(subPath) == 0:
        # how to create a directory listing with files and directories
        directory_listing = context.createDirectoryResponse()
        directory_listing.addFile("someFile.txt", 0, time.time())
        directory_listing.addDirectory("someDirectory", time.time())
        return directory_listing
    if len(subPath) == 1 and subPath[0] == "someFile.txt":
        # this is how to handle a request to download a file from an openbis data set
        content = context.getContentProvider().asContent("DATA_SET_CODE")
        node = content.tryGetNode("original/someFile.txt")
        return context.createFileResponse(node, content)
    # this is how to respond to error paths
    return context.createNonExistingFileResponse("Path not supported")
See http://svncisd.ethz.ch/repos/cisd/datastore_server/trunk/source/core-plugins/file-system-plugin-example/1/dss/file-system-plugins/ for examples on how to configure and implement a more complex plugin.
(Since version 20.10.4) When developing a Jython resolver script it is useful to set the property ftp.resolver-dev-mode
in DSS service.properties
to true
. In this case the script is always reloaded and compiled. The default value is false
in order to speed up normal operation where the script is loaded and compiled only once.
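For example, in the DSS service.properties during development:

ftp.resolver-dev-mode = true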
Obsolete file-system view
In earlier versions of openBIS the structure and the available options were different. To see how to enable and configure the old-style file system view, please see this document:
Old-style file system view for FTP
FTP / SFTP Server
The DSS can expose its data store contents via a read-only FTP / SFTP interface. To automatically start the internal FTP server, set the configuration property ftp.server.enable
to true
and set one of the two properties ftp.server.ftp-port
/ ftp.server.sftp-port
(or both) to a positive value. Please refer to the following table for a more detailed description of the possible configuration properties (see also the example configuration after the table):
Property | Default value | Description |
---|---|---|
ftp.server.enable | false | When set to 'true' internal FTP and/or SFTP server will be started as part of the DSS bootstrap. If defined all 'keystore.*' properties become mandatory. |
ftp.server.sftp-port | Port that the SFTP server will bind to. Needs to be set to a value other than 0 to switch on the SFTP server. SFTP is a protocol which enables secure file transfer through an SSH tunnel. This protocol is more firewall-friendly than FTP over SSL/TLS. A good value for the port is 2222. | |
ftp.server.ftp-port | Port that the FTP server will bind to. Needs to be set to a value other than 0 to switch on the FTP server. A good value for the port is 2121. | |
ftp.server.port | Deprecated name for ftp.server.ftp-port , kept for backward-compatibility. Will be ignored if ftp.server.ftp-port is specified. | |
ftp.server.use-ssl | true | Enables explicit FTP over SSL/TLS. Users can switch on connection encryption by issuing the FTP "AUTH" command. Similarly to the global 'use-ssl' parameter, when set to 'true' all 'keystore.*' properties become mandatory. (Ignored by SFTP server.) |
ftp.server.implicit-ssl | false | Allows to enable implicit FTP over SSL/TLS. While this method ensures no passwords are send around unencrypted, it is not standardized by an RFC and is therefore poorly supported by clients. (Ignored by SFTP server.) |
ftp.server.passivemode.port.range | 2130-2140 | Specifies a range of ports to be used when the FTP server is working in passive mode. The default value is "2130-2140", meaning only 11 ports will be provided to connecting clients. If there are more than 11 concurrent FTP users the configuration will have to be adjusted accordingly. (Ignored by SFTP server.) |
ftp.server.activemode.enable | false | When set to true, enables the FTP server to run in "active mode". This means data connections will always go to a single pre-configured port on the server. Such a configuration can ease the server firewall administration (only one additional port has to be opened), but it requires the client machines to be directly visible from the server. (Ignored by SFTP server.) |
ftp.server.activemode.port | 2122 | FTP server active mode port number configuration. (Ignored by SFTP server.) |
ftp.server.maxThreads | 25 | Maximum number of threads used by the internal FTP server. (Ignored by SFTP server.) |
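A hedged example enabling both protocols in the DSS service.properties; the port values follow the suggestions in the table, and the keystore properties from the beginning of this guide must also be set.

ftp.server.enable = true
ftp.server.sftp-port = 2222
ftp.server.ftp-port = 2121
ftp.server.use-ssl = true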
Example FTPS client configuration
If FTP over SSL/TLS is enabled, clients connecting to the FTP server must adjust their connection settings accordingly.
LFTP
In case you work outside the ETH network replace bs-openbis04 with openbis-dsu.ethz.ch
Here are the steps to set up your connection to openBIS via LFTP:
1) Create a config file under the folder .ssh/:
cat .ssh/config
Host openbis-dsu.ethz.ch
    KexAlgorithms +diffie-hellman-group1-sha1
    HostKeyAlgorithms +ssh-rsa,ssh-dss
2) Add the following host key to the file .ssh/known_hosts:
[bs-openbis04.ethz.ch]:2222 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbAcN/0/rmcez4QTwaDPf5bMVog/LqkuyjcEqlI3RKMl+SkHIyhY/9CQLCYWq2+eITNGASseYVC2ZXJwDlRgvkYmtL/zVHUBxee8/s+DGJnqTR6TEQ2vvGFcfc6eeEktj6JkRyFS0oCfmDUWREoJP/8NzcEEWYk7MdV5xxmoBX9C6xTTP9VfWm+vUoNmozfOIywYX611lrCKMeUpdpE4DUpBX1pY4YMIvQ87SXyxhVEAJ+e8928ZUtkB0DGQ4xMHuwmjYLvO3G0Rqi5Vz0492ICYyMFOvKM4IGxf+hV4fqCRfgIBb/krXSO8WNBpzOzNlPPheM8Tdlw7irkNttp2a3
Then connect to openBIS via LFTP:
> lftp sftp://<username>@bs-openbis04.ethz.ch:2222
Password:
lftp <username>@bs-openbis04.ethz.ch:~> ls
dr--r--r-- 1 openBIS openBIS 0 Jul 19 09:54 BSSE_FLOWCELLS
lftp <username>@bs-openbis04.ethz.ch:~> cd BSSE_FLOWCELLS/FLOWCELLS/2016.07
lftp <username>@bs-openbis04.ethz.ch:/BSSE_FLOWCELLS/FLOWCELLS/2016.07> ls
dr--r--r-- 1 openBIS openBIS 0 Jul 19 09:59 20150113231613997-60446751_FASTQ_GZ
dr--r--r-- 1 openBIS openBIS 0 Jul 19 09:59 20150113231614260-60446752_FASTQ_GZ
dr--r--r-- 1 openBIS openBIS 0 Jul 19 09:59 20150113231614261-60446753_FASTQ_GZ
[..]
# Now we recursively download the data from this level
lftp <username>@bs-openbis04.ethz.ch:/BSSE_FLOWCELLS/FLOWCELLS/2016.07> mirror
Total: 23 directories, 28 files, 0 symlinks
New: 28 files, 0 symlinks
334768712 bytes transferred in 23 seconds (13.94M/s)
To be removed: 64 directories, 45 files, 16 symlinks
lftp <username>@bs-openbis04.ethz.ch:/BSSE_FLOWCELLS/FLOWCELLS/2016.07>
# just pick some files using regex
lftp <username>@bs-openbis04.ethz.ch:/BSSE_FLOWCELLS/FLOWCELLS/2016.07> mirror --include-glob="*6043*"
We have observed cases where the files are transferred correctly and then 'vanish' on the file system. This is due to file permissions. Please try to use the -p flag of lftp. If you have already finished and it looks like the files are not there, please try to chmod -R 755
the downloaded folder.
FileZilla
FileZilla users should create a new site in the Site Manager:
It might also be necessary to increase the time-out from the default value of 20 seconds to a larger value (e.g. 60 seconds). This is done in the global preferences:
Example SFTP client configuration
If the SFTP port is configured, an SFTP client can be used.
SFTP
> sftp -oPort=2222 <username>@<openbis server address>

Example:

> sftp -oPort=2222 john.doe@gmail.com@openbis-dsu.ethz.ch
Mac OS Finder as well as most Web browsers do not support the SFTP protocol.
Existing Dataset Handlers
SampleAndDataSetRegistrationHandler
The SampleAndDataSetRegistrationHandler can be used to register/update samples and register data sets. It takes control files in a specific format and registers or updates the samples and registers the data sets specified in the file.
DSS Configuration
- The plugin can be configured to either accept registration and update requests (ACCEPT_ALL), accept only registration requests (IGNORE_EXISTING), or only update requests (REJECT_NONEXISTING)
- Default : ACCEPT_ALL
- Properties Key : SAMPLE_REGISTRATION_MODE
- The plugin uses a regular expression to determine which files should be treated as control files. By default it uses any file with the ".tsv" extension, but this can be configured.
- Default : .*\.[Tt][Ss][Vv]
- Properties Key : control-file-regex-pattern
- By default, the uploaded folders are deleted, even after a failure. This can be configured.
- Default : true
- Properties Key : always-cleanup-after-processing
- The plugin treats a folder that is uploaded, but not specified in the control file, as a failure. This behavior can be configured.
- Default : true
- Properties Key : unmentioned-subfolder-is-failure
- The plugin can be configured with a default sample type for registration. This can be overridden in the control file. If no default is specified, it must be specified in the control file.
- Default : require the sample type to be specified in the control file
- Properties Key : sample-type
- Control File Key : SAMPLE_TYPE
- The plugin can be configured with a default data set type for registration. This can be overridden in the control file. If no default is specified, it must be specified in the control file.
- Default : require the data set type to be specified in the control file
- Properties Key : data-set-type
- Control File Key : DATA_SET_TYPE
- There are certain situations in which the plugin will need to inform an administrator of a problem encountered during sample / data set registration by sending an email. By default, this email is sent to all instance admins of the openBIS instance. It can, however, be configured to email a specific list of people by providing openBIS user IDs or email addresses.
- Default : send email to all instance admins
- Properties Key : error-mail-recipients
inputs= [...] sample-dset-reg-thread

sample-dset-reg-thread.incoming-dir = ${root-dir}/incoming-sample-dset-reg
sample-dset-reg-thread.incoming-data-completeness-condition = auto-detection
sample-dset-reg-thread.delete-unidentified = true

# The data set handler
sample-dset-reg-thread.dataset-handler = ch.systemsx.cisd.etlserver.entityregistration.SampleAndDataSetRegistrationHandler

# Controls whether samples may be registered and updated (ACCEPT_ALL), registered only (IGNORE_EXISTING), or updated only (REJECT_NONEXISTING). Default is ACCEPT_ALL
sample-dset-reg-thread.dataset-handler.sample-registration-mode = IGNORE_EXISTING

# Controls which sample type is processed by default. Omit this setting to force it to be specified in the control file
sample-dset-reg-thread.dataset-handler.sample-type = MY_SAMPLE_TYPE

# Controls which data set type is processed by default. Omit this setting to force it to be specified in the control file
sample-dset-reg-thread.dataset-handler.data-set-type = HCS_IMAGE

# Controls which files are treated as control files. The default is the extension .tsv. Set to .txt instead
sample-dset-reg-thread.dataset-handler.control-file-regex-pattern = .*\.[Tt][Xx][Tt]

# Data Set Type Information
sample-dset-reg-thread.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
# The file-format-type may be overridden in the control file
sample-dset-reg-thread.type-extractor.file-format-type = TIFF
sample-dset-reg-thread.type-extractor.locator-type = RELATIVE_LOCATION

sample-dset-reg-thread.data-set-info-extractor = ch.systemsx.cisd.etlserver.DefaultDataSetInfoExtractor
sample-dset-reg-thread.storage-processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor
Control File Format
The control file is a tab-separated-value (TSV) file. Any file in the drop box with the extension ".tsv" is treated as a control file.
The first non-comment line of the TSV file should contain the column headers; the lines following the headers contain the metadata for samples and data sets, with each line containing the metadata for one sample/data set combination.
The headers for columns that contain sample metadata begin with "S_"; the headers for columns that contain data set metadata begin with "D_". All headers should begin with either S_ or D_, except for a column with the header "FOLDER". This specifies the (relative) path to the data to register as a data set.
In addition to the column headers, some information necessary for the registration process is contained in special comment lines that start with "#!" (normal comments start simply with "#").
Required Comments
- The key for the user on whose behalf the samples/data sets are being registered
- Control File Key : USERID
Optional Comments (Overrides)
The sample type and the data set type may be provided in the control file. Values given here override those in the DSS configuration. If there are no values in the DSS configuration, then they must be provided here.
- The sample type for the samples
- Control File Key : SAMPLE_TYPE
- The data set type for the data sets
- Control File Key : DATA_SET_TYPE
Headers
Commonly Used Headers
The following column headers can be used for any type of sample and dataset:
- S_identifier
- S_container
- S_parent
- S_experiment
- D_code
- D_file_type
The headers to specify are dependent on the sample type and data set type.
Optionally, the file type for each data set can be specified by providing the following header:
- D_file_type
Required Headers
The following headers are required:
- S_identifier
- FOLDER
When creating new samples, the following headers are required:
- S_experiment
Example
# Control Parameters
#! GLOBAL_PROPERTIES_START
#! SAMPLE_TYPE = DILUTION_PLATE
#! DATA_SET_TYPE = HCS_IMAGE
#! USERID = test
#! GLOBAL_PROPERTIES_END
# Data
S_identifier    S_experiment    S_OFFSET    D_COMMENT    D_GENDER    FOLDER
/CISD/SD-TEST-1    /CISD/DEFAULT/EXP-REUSE    1    Comment 1    MALE    ds1/
/CISD/SD-TEST-2    /CISD/DEFAULT/EXP-REUSE    2    Comment 3    MALE    ds2/
/CISD/SD-TEST-3    /CISD/DEFAULT/EXP-REUSE    3    Comment 2    FEMALE    ds3/
To register this file, it should have a *.tsv extension (e.g. control.tsv) and should be put into the drop box along with the folders ds1, ds2, ds3.
Error Conditions
- If any of the directories for data sets are empty, the registration of that sample/data set combination will fail.
SampleRegisteringDropbox
The SampleRegisteringDropbox can be used to register new samples without registering any data sets. Instead of data set directories, it expects control files in a specific format and registers the samples specified in this control file. The control file has the same format as the one used for sample batch registration in the Web GUI, augmented with a global header. The dropbox has a mode called "auto id generation". In this mode, it ignores the identifiers given for the individual samples and creates new ones.
DSS Configuration
- The plugin has to be given a directory for logging errors:
- Default: none (required property)
- Properties Key: error-log-dir
- The plugin can be given a prefix for auto-generating samples:
- Default: S
- Properties Key: sample-code-prefix
inputs= [...] sample-reg-thread

sample-reg-thread.incoming-dir = ${root-dir}/incoming-sample-reg
sample-reg-thread.incoming-data-completeness-condition = auto-detection

# The data set handler
sample-reg-thread.dataset-handler = ch.systemsx.cisd.etlserver.SampleRegisteringDropbox
sample-reg-thread.dataset-handler.error-log-dir = ${root-dir}/sample-registration-errors
sample-reg-thread.dataset-handler.sample-code-prefix = ABC
Control File Format
The control file is a tab-separated-value (TSV) file with a header, and, optionally, some comments.
The first non-header and non-comment line of the TSV file should contain the column headers; the lines following the headers contain the metadata for samples, with each line containing the metadata for one sample in the same format as used for the batch sample registration process in the Web GUI.
Required Control File Keys
- The control file has to specify the sample type of the samples to register. It is an error if this control file key is missing.
- Control File Key : SAMPLE_TYPE
Optional Control File Keys
- The control file can specify the space to register the new samples in. Doing so switches on auto id generation.
- Default : require the full sample identifiers, including the space, to be specified in the control file
- Control File Key : DEFAULT_SPACE
- The key for the user on whose behalf the samples/data sets are being registered
- Default: The system user
- Control File Key : USER
Example
# Control Parameters
#! GLOBAL_PROPERTIES_START
#! SAMPLE_TYPE = GENERAL
#! USER = testuser
#! GLOBAL_PROPERTIES_END
# Data
identifier    experiment    COMMENT
/CISD/SD-TEST-1    /CISD/DEFAULT/EXP-REUSE
/CISD/SD-TEST-2    /CISD/DEFAULT/EXP-REUSE
/CISD/SD-TEST-3    /CISD/DEFAULT/EXP-REUSE
Logging Errors
If the samples cannot be registered, e.g. because the control file has an invalid format or because one of the samples already exists in the database, a file <filename>_error-log.txt
is created in the specified error-log-dir
. The control file is not deleted, and an error marker <filename>_delete_me_after_correcting_errors
is created in the dropbox directory. If the errors in the control file are fixed, the error marker can be deleted and the control file will be automatically reprocessed.
DSS RPC
DSS RPC makes it possible for clients to view and download data sets without going through the GUI and upload data sets without using drop boxes. One aspect of this service that can be configured is how uploaded data sets are processed.
RPC Dropboxes
To have a data set of a particular data set type handed over to a specific dropbox, specify put parameters with the data set type code at the end of the key and the dropbox key as the value, e.g.:
dss-rpc.put.HCS_IMAGE = main-thread
The RPC Default Dropbox
All data set types which are not mapped are handed over to the default RPC dropbox. This default RPC dropbox can be specified by setting the dss-rpc.put-default
property to the dropbox code you want to be the default. If you don't specify a default RPC dropbox, the first dropbox will become the default RPC dropbox.
If you have more than one Data Store Server registered in an openBIS Application Server, only one DSS can be used to handle RPC Dropboxes. In this case, the property dss-rpc.put.dss-code
needs to be set to the appropriate DSS code in the service.properties
of the openBIS Application Server. Not doing so leads to a ConfigurationFailureException
when trying to access an RPC Dropbox.
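A hedged sketch; the dropbox names and the DSS code are illustrative. In the DSS service.properties:

dss-rpc.put.HCS_IMAGE = main-thread
dss-rpc.put-default = main-thread

And, for installations with more than one DSS, in the openBIS AS service.properties:

dss-rpc.put.dss-code = DSS1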
Dataset Uploader and RPC Dropboxes
The Dataset Uploader uploads its data to one of the RPC Dropboxes, using the decision mechanism described above.
Data Set registration process
DSS Registration Log
DSS Registration Log is a mechanism to help troubleshoot problems that occur during data set registration. For each file or directory that is placed in the incoming folder of a dropbox, a new log file is created with a name composed of the timestamp when processing begins, the dropbox name, and the file name. This log file, along with its location in the directory structure, tracks the progress of the registration of the data and metadata contained in the file.
Directory structure
The logs are stored in a directory specified by the property
dss-registration-log-dir
which defaults to datastore_server/log-registrations.
The main registration log directory contains 3 subdirectories: in_process, succeeded, and failed.
The log file is initially created in the in_process directory and moved to succeeded or failed when registration completes.
The log file
The log file is created in the in_process directory when the data set registration process has started. The file is named with the current timestamp, the thread name and the data set filename, e.g.:
2012-01-18-13-33-02-837_simple-dropbox_image.png.log
While the data set is being processed, the log file is kept in the in_process directory. When the data set registration has finished, the file is moved to the succeeded or failed directory depending on the result of registration.
Log contents
The log is a concise representation of the processing that happened as a result of something being placed in the dropbox. Each line of the log begins with the date and time at which an event happened. Events that appear in the log include: preparation of data sets, progress of data through the various directories on the file system, and registration of metadata with the openBIS application server. In the case of failure, information about the problems that caused the failure is shown in the log as well.
Event | Description |
---|---|
Prepared registration of N data set(s): Data set codes | The dropbox has defined N data sets to register with openBIS. The codes of the first several data sets are shown. |
Data has been moved to the pre-commit directory: Dir | The data has been processed and prepared for storage. The location of the data is shown. |
About to register metadata with AS: registrationId(N ) | A call will be made to the application server to register the metadata. The call has the id N. |
Data has been registered with the openBIS Application Server. | The metadata has been successfully registered. |
Storage processors have committed. | All post-processing (e.g., storage of data in a secondary database) has completed. |
Data has been moved to the final store. | Data is now in its designated storage location in the store directory of the DSS. |
Storage has been confirmed in openBIS Application Server. | The application server has been informed that the data is in its designated location. |
On successful data registration the log will look like this:
2012-06-22 10:05:56 Prepared registration of 1 data set: 20120622100556527-13696
2012-06-22 10:05:56 Data has been moved to the pre-commit directory: {directory}
2012-06-22 10:05:56 About to register metadata with AS: registrationId(99)
2012-06-22 10:05:59 Data has been registered with the openBIS Application Server.
2012-06-22 10:05:59 Storage processors have committed.
2012-06-22 10:05:59 Data has been moved to the final store.
2012-06-22 10:05:59 Storage has been confirmed in openBIS Application Server.
In case of failure the log can look for instance like this:
2012-06-20 12:14:26 Prepared registration of 1 data set: 20120620121426786-13398
2012-06-20 12:14:26 Data has been moved to the pre-commit directory: {directory}
2012-06-20 12:14:26 About to register metadata with AS: registrationId(89)
2012-06-20 12:14:26 Error in registrating data in application server
2012-06-20 17:36:43 Starting recovery at checkpoint Precommit
2012-06-20 17:36:44 Responding to error [OPENBIS_REGISTRATION_FAILURE] by performing action LEAVE_UNTOUCHED on {directory}
2012-06-20 17:36:44 File has been left untouched {directory}
2012-06-20 17:36:44 Operations haven't been registered in AS - recovery rollback
DSS Registration Failure Email Notifications
Since 20.10.06 it is also possible to configure, in the DSS service.properties, a list of email addresses that will be notified when failures occur.
# Email addresses of people to get notifications about problems in dataset registrations
mail.addresses.dropbox-errors=admin1@localhost,admin2@localhost
Pre-staging
The registration process can make use of a prestaging directory. When the registration task starts, a hardlink copy of the original input data is created in the prestaging directory, so that throughout the whole registration process the original input data can remain untouched. On successful registration the original data can then be removed, depending on the configuration.
Property dataset-registration-prestaging-behavior
can be used to define whether the registration process should make use of the prestaging directory (see the example after the list). It can be set to the following values:
- default or use_original - don't use the prestaging directory; the original input file is used.
- delete - use the prestaging file during the registration process, and delete the original input file on successful registration.
- leave-untouched - use the prestaging file during the registration process, but leave the original input file untouched on successful registration.
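For example, in the DSS service.properties:

dataset-registration-prestaging-behavior = delete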
Start Server
The server is started as follows:
prompt> cd datastore_server
prompt> ./datastore_server.sh start
If mail.test.address
property is set, a test e-mail will be sent to the specified address after successful start-up of the server.
Stop Server
The server is stopped as follows:
prompt> cd datastore_server
prompt> ./datastore_server.sh stop
Fix "Can't connect to X11 window server" or "Could not initialize class sun.awt.X11GraphicsEnvironment" problem
Some Data Store Server plugins operate on images (e.g. by generating charts or rescaling images before they are displayed or registered).
These operations do not work reliably in headless mode, thus a dummy X11 server like Xvfb needs to be installed and run on the server running the DSS.
Typical symptoms of this problem are messages like
2010-12-10 10:34:07.406::WARN: Error for /datastore_server/chromatogram
java.lang.NoClassDefFoundError: Could not initialize class sun.awt.X11GraphicsEnvironment
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Unknown Source)
        at java.awt.GraphicsEnvironment.getLocalGraphicsEnvironment(Unknown Source)
        at sun.awt.X11.XToolkit.<clinit>(Unknown Source)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Unknown Source)
        at java.awt.Toolkit$2.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.awt.Toolkit.getDefaultToolkit(Unknown Source)
        at sun.swing.SwingUtilities2$AATextInfo.getAATextInfo(Unknown Source)
        at javax.swing.plaf.metal.MetalLookAndFeel.initComponentDefaults(Unknown Source)
        at javax.swing.plaf.basic.BasicLookAndFeel.getDefaults(Unknown Source)
        at javax.swing.plaf.metal.MetalLookAndFeel.getDefaults(Unknown Source)
        at javax.swing.UIManager.setLookAndFeel(Unknown Source)
        at javax.swing.UIManager.setLookAndFeel(Unknown Source)
        at javax.swing.UIManager.initializeDefaultLAF(Unknown Source)
        at javax.swing.UIManager.initialize(Unknown Source)
        at javax.swing.UIManager.maybeInitialize(Unknown Source)
        at javax.swing.UIManager.getDefaults(Unknown Source)
        at javax.swing.UIManager.getColor(Unknown Source)
        at org.jfree.chart.JFreeChart.<clinit>(JFreeChart.java:261)
        at org.jfree.chart.ChartFactory.createXYLineChart(ChartFactory.java:1748)
in startup_log.txt
.
or in DSS log messages like:
2011-10-11 14:43:44,412 ERROR [hcs_image_raw - Incoming Data Monitor] OPERATION.DataSetStorageAlgorithmRunner - Error during dataset registertion: InternalError: Can't connect to X11 window server using ':1.0' as the value of the DISPLAY variable.
java.lang.InternalError: Can't connect to X11 window server using ':1.0' as the value of the DISPLAY variable.
        at sun.awt.X11GraphicsEnvironment.initDisplay(Native Method)
        at sun.awt.X11GraphicsEnvironment.access$200(X11GraphicsEnvironment.java:62)
        at sun.awt.X11GraphicsEnvironment$1.run(X11GraphicsEnvironment.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.awt.X11GraphicsEnvironment.<clinit>(X11GraphicsEnvironment.java:142)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:186)
        at java.awt.GraphicsEnvironment.getLocalGraphicsEnvironment(GraphicsEnvironment.java:82)
        at java.awt.image.BufferedImage.createGraphics(BufferedImage.java:1152)
        at java.awt.image.BufferedImage.getGraphics(BufferedImage.java:1142)
        at ch.systemsx.cisd.imagereaders.ij.ImageJReaderLibrary.createBufferedImageOfSameType(ImageJReaderLibrary.java:75)
        at ch.systemsx.cisd.imagereaders.ij.ImageJReaderLibrary.access$0(ImageJReaderLibrary.java:70)
        at ch.systemsx.cisd.imagereaders.ij.ImageJReaderLibrary$1.readImage(ImageJReaderLibrary.java:66)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil$TiffImageLoader.loadWithImageJ(ImageUtil.java:119)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil$TiffImageLoader.load(ImageUtil.java:107)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil.loadImageGuessingLibrary(ImageUtil.java:425)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil.loadImageGuessingLibrary(ImageUtil.java:398)
        at ch.systemsx.cisd.openbis.dss.generic.shared.utils.ImageUtil.loadUnchangedImage(ImageUtil.java:234)
        ...
To fix it, perform these steps (assuming Red Hat Linux):
Install package xorg-x11-server-Xvfb:
# yum install xorg-x11-server-Xvfb
In RHEL 6.3 the package is called: xorg-x11-server-Xvfb-1.7.7-29.el6.x86_64
Add the line:
export DISPLAY=:1.0
to file
/etc/sysconfig/openbis
.- Copy the attached file
Xvfb
to/etc/init.d/Xvfb
. Enable service Xvfb:
# chkconfig Xvfb on
Start service Xvfb:
# /etc/init.d/Xvfb start
Restart openbis service:
# /etc/init.d/openbis restart
If there are still problems add a JVM option:
-Djava.awt.headless=true
to the jetty server configuration (for DSS modify sprint/datastore_server/etc/datastore_server.conf
).
Runtime changes to logging
The script <installation directory>/servers/datastore-server/datastore_server.sh
can be used to change the logging behavior of the datastore server while the server is running.
The script is used like this: datastore_server.sh [command] [argument]
The table below describes the possible commands and their arguments.
Command | Argument(s) | Default Value | Description |
---|---|---|---|
log-service-calls | 'on', 'off' | 'off' | Turns on / off detailed service call logging. When this feature is enabled, the datastore server will log the start and end of every service call it executes to the file <installation directory>/servers/datastore_server/log/datastore_server_service_calls.txt
log-long-running-invocations | 'on', 'off' | 'on' | Turns on / off logging of long-running invocations. When this feature is enabled, the datastore server will periodically create a report of all service calls that have been in execution for more than 15 seconds to the file <installation directory>/servers/datastore_server/log/datastore_server_long_running_threads.txt
debug-db-connections | 'on', 'off' | 'off' | Turns on / off logging about database connection pool activity. When this feature is enabled, information about every borrow and return to database connection pool is logged to datastore server main log in file <installation directory>/servers/datastore_server/log/datastore_server_log.txt |
log-db-connections | no argument / minimum connection age (in milliseconds) | 5000 | When this command is executed without an argument, information about every database connection that has been borrowed from the connection pool is written into the datastore server main log in the file <installation directory>/servers/datastore_server/log/datastore_server_log.txt. If the "minimum connection age" argument is specified, only connections that have been out of the pool longer than the specified time are logged. The minimum connection age value is given in milliseconds. |
record-stacktrace-db-connections | 'on', 'off' | 'off' | Turns on / off logging of stacktraces. When this feature is enabled AND debug-db-connections is enabled, the full stack trace of the borrowing thread will be recorded with the connection pool activity logs. |
log-db-connections-separate-log-file | 'on', 'off' | 'off' | Turns on / off database connection pool logging to separate file. When this feature is disabled, the database connection pool activity logging is done only to datastore server main log. When this feature is enabled, the activity logging is done ALSO to file <installation directory>/servers/datastore_server/log/datastore_server_db_connections.txt . |