openBIS for Proteomics

This is a customized version of openBIS that shows the proteins found by an analysis pipeline which identifies proteins in proteomics mass spectra.

With a special drop box, such identified proteins can be registered from a prot.xml file. The proteins and the peptides which identified them are stored in a special database. openBIS for Proteomics has a protein viewer which shows the proteins. The protein details view also shows additional information such as the peptides which led to the protein identification. The database schema is documented separately; it allows customized SQL queries to be defined.

Predefined Types

To register such data sets, the experiment type MS_SEARCH and the data set type PROT_RESULT have to be specified. These types are defined automatically when an openBIS instance is set up with the proteomics installer.

Customized prot.xml

To register not only identified proteins and peptides but also amino-acid sequences of identified proteins and quantification data, openBIS interprets some parts of the prot.xml file in special ways.

Probability to False Discovery Rate Mapping

If the element <proteinprophet_details> appears in the header section of the prot.xml file, all its child elements of type <protein_summary_data_filter> are used to map the probability (i.e. the attribute probability of element <protein>) to a false discovery rate (FDR). If this mapping isn't provided in the prot.xml file, the FDR is undefined.

Identified proteins will not be added to the database if FDR > 0.1. Proteins with an undefined FDR will be added.
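
The filtering rule can be written down in a few lines. The following Python sketch is only an illustration of the rule stated above, not openBIS code; the names is_registered and FDR_THRESHOLD are hypothetical, and None stands for an undefined FDR.

```python
from typing import Optional

# Threshold stated in the text: proteins with FDR > 0.1 are not added.
FDR_THRESHOLD = 0.1


def is_registered(fdr: Optional[float]) -> bool:
    """Return True if a protein with this FDR would be added to the database.

    None represents an undefined FDR (no probability-to-FDR mapping in the
    prot.xml file); such proteins are always added.
    """
    if fdr is None:
        return True
    return fdr <= FDR_THRESHOLD
```

A protein with an FDR of exactly 0.1 passes, because only values greater than 0.1 are rejected.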

Accession Number and Amino-Acid Sequence

The attribute protein_description of element <annotation> contains accession number, description, and amino-acid sequence encoded as follows: <accession number> \DE=<description> \SEQ=<amino-acid sequence>

If this attribute isn't of this form, the whole description is taken as the protein description, and the accession number is taken from one of the following attributes of the <annotation> element:



[Table of <annotation> attributes with a URL Template column; rows not preserved]

If more than one attribute is specified, the first in this table defines the accession number. In the database, a prefix is added to the extracted accession number in order to link the accession number in the GUI with a public database. Note that the prefix is not shown in the GUI.


<annotation protein_description="Ribosomal protein L5" ipi_name="IPI00640037" ensembl_name="ENSP00000359338" trembl_name="Q5T7N0"/>
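
As an illustration, the interpretation of the <annotation> element might look like the Python sketch below. It is not the openBIS implementation: the function name parse_annotation is hypothetical, the order in which fallback attributes are tried is passed in by the caller (it is defined by the table of attributes and not assumed here), and the database prefix is not modelled.

```python
import re
from typing import Mapping, Optional, Sequence, Tuple

# Encoded form: <accession number> \DE=<description> \SEQ=<amino-acid sequence>
_ENCODED = re.compile(r"^\s*(\S+)\s+\\DE=(.*?)\s+\\SEQ=(\S*)\s*$", re.DOTALL)


def parse_annotation(
    attributes: Mapping[str, str],
    fallback_order: Sequence[str],
) -> Tuple[Optional[str], str, Optional[str]]:
    """Return (accession number, description, sequence) for one <annotation>.

    fallback_order lists the attribute names to try, in order, when
    protein_description is not in the encoded form.
    """
    description = attributes.get("protein_description", "")
    match = _ENCODED.match(description)
    if match:
        accession, text, sequence = match.groups()
        return accession, text, sequence
    # Not encoded: the whole description is kept and the accession number
    # comes from the first fallback attribute that is present.
    for name in fallback_order:
        if name in attributes:
            return attributes[name], description, None
    return None, description, None
```

For the example element above, calling the function with a fallback order that starts with ipi_name (an order chosen here purely for illustration) yields IPI00640037 as accession number and "Ribosomal protein L5" as description.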

Protein Abundances

Abundance data for an identified protein can be stored in <parameter> elements of the corresponding <protein> element. The attributes of a <parameter> element have the following meaning:




  • type: The parameter type. It should be abundance; otherwise the parameter is ignored by openBIS.
  • name: Sample identifier/code or property uniquely identifying a sample with spectra data. If the value contains the delimiter specified by the property delimiter_for_sample_resolving, only the part to the left of the delimiter is used for identification. Two ways of identification are tried in succession until one succeeds:
    • A sample with the specified identifier exists. If only the code is present, the default space is MS_DATA.
    • Exactly one sample exists in the space specified by the search experiment with a property MZXML_FILENAME whose value is given by this attribute.
    If the property restricted_sample_resolving is true, the second way of identification is disabled. Registration of the data set is interrupted if neither identification method succeeds.
  • value: A number specifying the protein abundance.
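
The first step of sample identification (cutting at the delimiter and applying the default space) can be sketched in Python as follows. This is a hedged illustration only: the function names are hypothetical, the identifier syntax '/SPACE/CODE' is an assumption, and the default delimiter '~' is taken from the DSS configuration comments. The lookup by the MZXML_FILENAME property needs access to the openBIS server and is not shown.

```python
from typing import Tuple

DEFAULT_SPACE = "MS_DATA"  # default space if only the sample code is given
DEFAULT_DELIMITER = "~"    # default of delimiter_for_sample_resolving


def sample_key(raw: str, delimiter: str = DEFAULT_DELIMITER) -> str:
    """Keep only the part of the attribute value left of the delimiter."""
    return raw.split(delimiter, 1)[0]


def as_identifier(key: str) -> Tuple[str, str]:
    """Interpret key as (space code, sample code).

    Assumes identifiers are written as '/SPACE/CODE' or 'SPACE/CODE';
    a bare sample code is placed in DEFAULT_SPACE.
    """
    parts = [part for part in key.split("/") if part]
    if len(parts) == 1:
        return DEFAULT_SPACE, parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"not a sample identifier: {key!r}")
```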

Modification Fractions

Similar to protein abundances, amino-acid modifications per sample can be specified in <parameter> elements inside <peptide> elements.




  • type: The parameter type. It should be modification; otherwise the parameter is ignored by openBIS.
  • name: Code or property uniquely identifying a sample with spectra data. Identification works as for protein abundance data.
  • value: Position inside the peptide sequence, modification mass, and fraction. The three numbers are separated by ':'.
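
A minimal Python sketch of how the colon-separated triple of position, modification mass, and fraction could be decoded is shown below. The names ModificationFraction and parse_modification are hypothetical, and treating the position as an integer is an assumption; the text only states that all three parts are numbers.

```python
from typing import NamedTuple


class ModificationFraction(NamedTuple):
    position: int    # position inside the peptide sequence (assumed integer)
    mass: float      # modification mass
    fraction: float  # fraction


def parse_modification(value: str) -> ModificationFraction:
    """Split '<position>:<mass>:<fraction>' into its three numbers."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three ':'-separated numbers: {value!r}")
    position, mass, fraction = parts
    return ModificationFraction(int(position), float(mass), float(fraction))
```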

Configuring Proteomics openBIS

To run the proteomics version of openBIS, the configuration files of AS and DSS have to be extended. If the installation is done with the openBIS installer for proteomics, the configuration described below is already in place.

Configuration for AS

The following section is needed in the AS configuration:
# Configuration of database containing identified proteins
proteomics.database.engine = postgresql
proteomics.database.create-from-scratch = false
proteomics.database.script-single-step-mode = false
proteomics.database.url-host-part =
proteomics.database.basic-name = proteomics
proteomics.database.kind = productive
proteomics.database.owner =
proteomics.database.owner-password =
proteomics.database.admin-user = 
proteomics.database.admin-password =
proteomics.script-folder = proteomics

# Core plugins folder.

Usually only proteomics.database.kind will be configured differently.

A further section is also needed:
technologies = proteomics

# Relative path of cache. Default value is 'cache'.
proteomics.cache-folder = ../../../web-client-data-cache
# Minimum free disk space needed for the cache. Default value is 1 GB.
#proteomics.minimum-free-disk-space-in-MB = 1024
# Maximum retention time. Data older than this time will be removed from cache. Default value is a week.
#proteomics.maximum-retention-time-in-days = 7

Configuration for DSS

DSS has to be configured to allow registration of data sets with prot.xml files:
root-dir = <root directory of store and drop boxes>
storeroot-dir = ${root-dir}/store

# ---------------------------------------------------------------------------
# Data sources

data-sources = data-source
data-source.databaseEngineCode = postgresql
data-source.basicDatabaseName = proteomics
data-source.databaseKind = productive

# ---------------------------------------------------------------------------
# ETL processing threads (aka 'Drop Boxes')
# ---------------------------------------------------------------------------

inputs = ms-search

# ---------------------------------------------------------------------------
# 'ms-search' drop box for spectra data
# ---------------------------------------------------------------------------
# The directory to watch for incoming data.
ms-search.incoming-dir = ${root-dir}/incoming-ms-search

# Determines when the incoming data should be considered complete and ready to be processed.
# Allowed values: 
#  - auto-detection - when no write access will be detected for a specified 'quiet-period'
#  - marker-file    - when an appropriate marker file for the data exists.
# The default value is 'marker-file'.
ms-search.incoming-data-completeness-condition = auto-detection

# Extracts meta data and creates an experiment on the fly = ch.systemsx.cisd.openbis.etlserver.proteomics.DataSetInfoExtractorForProteinResults
# Separator character between space code and project code to be extracted from folder name = +
# Type of the experiment to be created. Default: MS_SEARCH
# Name of the properties file with properties of the experiment to be created. Default:
# =
# Threshold of prot.xml files not being processed
# = 256

ms-search.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
ms-search.type-extractor.file-format-type = XML
ms-search.type-extractor.locator-type = RELATIVE_LOCATION = PROT_RESULT = false

# Storage processor which also uploads data to the database with identified proteins = ch.systemsx.cisd.openbis.etlserver.proteomics.StorageProcessorWithResultDataSetUploader = ch.systemsx.cisd.etlserver.DefaultStorageProcessor
# If this flag is 'true' amino-acid sequence and description is expected as described above. Default: false = false
# Defines the delimiter for sample resolving of parameter elements in the protXML file. 
# Everything after the delimiter will be ignored if present. The identifier is either a full identifier 
# (space code and sample code) or only the sample code for space MS_DATA. Default: ~ = ~
# If this flag is 'false' samples in protXML parameters are tried to resolve via property MZXML_FILENAME 
# in same space as the experiment if not found by identifier. Default: true = true
# If this flag is 'true' the prot.xml file will be validated in accordance to the prot.xml schema. Default: false = false
# Database credentials = ${data-source.basicDatabaseName} = ${data-source.databaseKind} = = 

# ---------------------------------------------------------------------------
# maintenance plugins configuration
# ---------------------------------------------------------------------------

maintenance-plugins = data-set-clean-up

# Maintenance plugin which deletes stuff in the database of identified proteins after a data set has been deleted
data-set-clean-up.class = ch.systemsx.cisd.etlserver.plugins.DeleteFromExternalDBMaintenanceTask
data-set-clean-up.interval = 300 = data-source = data_sets 
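
The data source of DSS has to point to the same database as the proteomics.database.* settings of AS. A small Python sketch for checking this is given below. It is an illustration under assumptions, not an openBIS tool: the function names are hypothetical, and the parser handles only simple 'key = value' lines, not the full Java properties syntax (no line continuations, no ${...} substitution).

```python
from typing import Dict, List

# DSS key -> AS key that must hold the same value.
PAIRS = {
    "data-source.basicDatabaseName": "proteomics.database.basic-name",
    "data-source.databaseKind": "proteomics.database.kind",
}


def parse_properties(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def mismatches(as_props: Dict[str, str], dss_props: Dict[str, str]) -> List[str]:
    """Return the DSS keys that are missing or differ from the AS value."""
    problems = []
    for dss_key, as_key in PAIRS.items():
        value = dss_props.get(dss_key)
        if value is None or value != as_props.get(as_key):
            problems.append(dss_key)
    return problems
```

An empty result means that both settings agree.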

Some remarks:

  • The values of the properties data-source.basicDatabaseName and data-source.databaseKind must be the same as the values of the properties proteomics.database.basic-name and proteomics.database.kind, respectively, in the configuration of AS.
  • To register a data set with a prot.xml file, a folder named <space code>+<project code> has to be dropped into <root directory>/incoming-ms-search. It has to contain
    • A prot.xml file. The file name has to end with prot.xml.
    • A properties file with the name configured for the drop box. It contains properties of the freshly created experiment of type MS_SEARCH in the project specified by the folder name <space code>+<project code>. Unknown properties are ignored. The following properties have special meanings:
      • experiment-code: Optional property specifying the experiment code. If it is missing, the experiment code is generated automatically. Note that the experiment code has to be unique per project.
      • base-experiment: Identifier of an experiment. All data sets of this experiment will become parent data sets.
      • parent-data-set-codes: Comma and/or space separated list of codes of data sets which will become parent data sets. If this property is present base-experiment will be ignored.
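
The naming conventions for the drop box folder and the parent-data-set-codes property can be illustrated with the following Python sketch. The function names are hypothetical, and the code only mirrors the conventions described in the remarks; it does not reproduce how openBIS parses them internally.

```python
import re
from typing import List, Tuple


def split_folder_name(folder_name: str, separator: str = "+") -> Tuple[str, str]:
    """Split '<space code>+<project code>' into (space code, project code)."""
    space, found, project = folder_name.partition(separator)
    if not found or not space or not project:
        raise ValueError(f"expected <space code>{separator}<project code>: {folder_name!r}")
    return space, project


def parent_codes(value: str) -> List[str]:
    """Split a comma and/or space separated list of data set codes."""
    return [code for code in re.split(r"[,\s]+", value) if code]
```

The separator argument corresponds to the separator character configured for the drop box, which is + in the configuration shown above.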