openBIS for Proteomics
This is a customized version of openBIS that shows the proteins found by an analysis pipeline which identifies proteins in proteomics mass spectra.
Proteins identified in this way can be registered from a prot.xml file through a special drop box. The proteins and the peptides which identified them are stored in a special database. openBIS for Proteomics has a protein viewer which lists the proteins. The protein details view also shows additional information, such as the peptides which led to the protein identification. The database schema is here. It allows customized SQL queries to be defined.
Predefined Types
In order to register such data sets, the experiment type MS_SEARCH and the data set type PROT_RESULT have to be specified. These types are defined automatically when an openBIS instance is set up with the proteomics installer.
Customized prot.xml
In order to register not only identified proteins and peptides but also the amino-acid sequences of identified proteins and quantification data, openBIS interprets some parts of the prot.xml file in special ways.
Probability to False Discovery Rate Mapping
If the element <proteinprophet_details> appears in the header section of the prot.xml file, all its child elements of type <protein_summary_data_filter> are used to map the probability (i.e. the attribute probability of element <protein>) to a false discovery rate (FDR). If this mapping isn't provided in the prot.xml file, the FDR is undefined.
Identified proteins will not be added to the database if FDR > 0.1. Proteins with an undefined FDR will be added.
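The mapping and the 0.1 cut-off described above can be sketched as follows. This is a minimal illustration, not the openBIS implementation. The attribute names min_probability and false_positive_error_rate come from the protXML format. The lookup rule (use the entry with the largest min_probability that does not exceed the protein's probability) is an assumption, since the text does not say how probabilities between two entries are handled. The sample header has no XML namespace, which real prot.xml files usually declare.

```python
import xml.etree.ElementTree as ET

FDR_THRESHOLD = 0.1  # proteins with an FDR above this value are not stored


def read_fdr_table(header_xml):
    """Return sorted (min_probability, fdr) pairs found under <proteinprophet_details>."""
    root = ET.fromstring(header_xml)
    if root.tag == "proteinprophet_details":
        details = root
    else:
        details = root.find(".//proteinprophet_details")
    if details is None:
        return []  # no mapping available: the FDR stays undefined
    rows = [
        (float(e.get("min_probability")), float(e.get("false_positive_error_rate")))
        for e in details.findall("protein_summary_data_filter")
    ]
    return sorted(rows)


def fdr_for(probability, table):
    """Assumed rule: use the entry with the largest min_probability <= probability."""
    fdr = None
    for min_probability, rate in table:
        if min_probability <= probability:
            fdr = rate
    return fdr  # None stands for an undefined FDR


def is_stored(fdr):
    """Proteins with an undefined FDR are stored; others only if FDR <= 0.1."""
    return fdr is None or fdr <= FDR_THRESHOLD


# Hypothetical header fragment used only for illustration.
HEADER = """<protein_summary_header>
  <program_details>
    <proteinprophet_details>
      <protein_summary_data_filter min_probability="0.50" false_positive_error_rate="0.200"/>
      <protein_summary_data_filter min_probability="0.90" false_positive_error_rate="0.050"/>
    </proteinprophet_details>
  </program_details>
</protein_summary_header>"""

table = read_fdr_table(HEADER)
for probability in (0.95, 0.60):
    fdr = fdr_for(probability, table)
    print(probability, fdr, is_stored(fdr))
```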
Accession Number and Amino-Acid Sequence
The attribute protein_description of element <annotation> contains accession number, description, and amino-acid sequence encoded as follows: <accession number> \DE=<description> \SEQ=<amino-acid sequence>
If this attribute isn't of this form, the whole value will be taken as the protein description and the accession number will be taken from one of the following attributes of the <annotation> element:
Attribute | Prefix | URL Template |
---|---|---|
swissprot_name | | http://www.uniprot.org/uniprot/$id |
trembl_name | | http://www.uniprot.org/uniprot/$id |
ipi_name | | http://www.uniprot.org/uniprot/?query=$id |
ensembl_name | | http://www.uniprot.org/uniprot/?query=$id |
refseq_name | | http://www.uniprot.org/uniprot/?query=$id |
locus_link_name | | http://www.uniprot.org/uniprot/?query=$id |
flybase | | http://www.uniprot.org/uniprot/?query=$id |
If more than one attribute is specified, the first one in this table defines the accession number. In the database, a prefix is added to the extracted accession number so that the GUI can link the accession number to a public database. Note that the prefix is not shown in the GUI.
Example:
<annotation protein_description="Ribosomal protein L5" ipi_name="IPI00640037" ensembl_name="ENSP00000359338" trembl_name="Q5T7N0"/>
- Protein description: Ribosomal protein L5
- Accession number: Q5T7N0
- Accession number in database: tr|Q5T7N0
- URL: http://www.uniprot.org/uniprot/Q5T7N0
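The two cases described in this section (the extended \DE=/\SEQ= form and the fallback to the attributes in the table) can be sketched as follows. This is an illustration under stated assumptions, not the openBIS code: the function name parse_annotation and the returned dictionary are hypothetical, whitespace is assumed to separate the three parts of the extended form, and the database prefix is not modelled. No URL is derived for the extended form because the text does not say how it would be linked.

```python
import re

# Attribute order and URL templates as listed in the table above.
URL_TEMPLATES = [
    ("swissprot_name", "http://www.uniprot.org/uniprot/$id"),
    ("trembl_name", "http://www.uniprot.org/uniprot/$id"),
    ("ipi_name", "http://www.uniprot.org/uniprot/?query=$id"),
    ("ensembl_name", "http://www.uniprot.org/uniprot/?query=$id"),
    ("refseq_name", "http://www.uniprot.org/uniprot/?query=$id"),
    ("locus_link_name", "http://www.uniprot.org/uniprot/?query=$id"),
    ("flybase", "http://www.uniprot.org/uniprot/?query=$id"),
]

# <accession number> \DE=<description> \SEQ=<amino-acid sequence>
EXTENDED_FORM = re.compile(
    r"^(?P<accession>\S+)\s+\\DE=(?P<description>.*?)\s+\\SEQ=(?P<sequence>\S+)\s*$",
    re.DOTALL,
)


def parse_annotation(attributes):
    """Extract accession number, description, sequence and URL from <annotation> attributes."""
    description = attributes.get("protein_description", "")
    match = EXTENDED_FORM.match(description)
    if match:
        return {
            "accession": match.group("accession"),
            "description": match.group("description"),
            "sequence": match.group("sequence"),
            "url": None,  # linking of the extended form is not described in the text
        }
    # Fallback: the first attribute in table order defines the accession number.
    for name, template in URL_TEMPLATES:
        if name in attributes:
            accession = attributes[name]
            return {
                "accession": accession,
                "description": description,
                "sequence": None,
                "url": template.replace("$id", accession),
            }
    return {"accession": None, "description": description, "sequence": None, "url": None}


# The example from the text: trembl_name comes before ipi_name and ensembl_name.
example = {
    "protein_description": "Ribosomal protein L5",
    "ipi_name": "IPI00640037",
    "ensembl_name": "ENSP00000359338",
    "trembl_name": "Q5T7N0",
}
print(parse_annotation(example))
```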
Protein Abundances
Abundance data for an identified protein can be stored in <parameter> elements of the corresponding <protein> element. The attributes of a <parameter> element have the following meaning:
Attribute | Meaning |
---|---|
type | The parameter type. Should be |
name | Sample identifier/code or property uniquely identifying a sample with spectra data. If the name contains the delimiter specified by the property delimiter_for_sample_resolving, only the part to the left of the delimiter is used for identification. Two ways of identification are tried in succession until one succeeds:
If the property |
value | A number specifying the protein abundance. |
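The delimiter rule for the name attribute can be sketched as below. The function name sample_key is hypothetical, and the default delimiter "~" is taken from the DSS configuration comments later in this document. The actual sample lookup in openBIS is not shown.

```python
def sample_key(name, delimiter="~"):
    """Return the part of a <parameter> name that is used to identify the sample.

    Only the text to the left of the first delimiter is kept; a name without
    the delimiter is returned unchanged.
    """
    return name.partition(delimiter)[0]


# Hypothetical parameter names used only for illustration.
print(sample_key("SAMPLE_1~replicate_a"))
print(sample_key("SAMPLE_2"))
```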
Modification Fractions
Similar to protein abundances, amino-acid modifications per sample can be specified in <parameter> elements inside <peptide> elements.
Attribute | Meaning |
---|---|
type | The parameter type. Should be |
name | Code or property uniquely identifying a sample with spectra data. Identification is as for protein abundance data. |
value | Position inside the peptide sequence, modification mass, and fraction. All three numbers are separated by ':'. |
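A value of this form can be split into its three numbers as sketched below. The names ModificationFraction and parse_modification_value are hypothetical. Treating the position as an integer is an assumption, and the text does not say whether positions start at 0 or 1.

```python
from typing import NamedTuple


class ModificationFraction(NamedTuple):
    position: int    # position inside the peptide sequence (assumed to be an integer)
    mass: float      # modification mass
    fraction: float  # fraction of the peptide carrying the modification


def parse_modification_value(value):
    """Parse '<position>:<mass>:<fraction>' from the value attribute."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three ':'-separated numbers, got {value!r}")
    position, mass, fraction = parts
    return ModificationFraction(int(position), float(mass), float(fraction))


# Hypothetical value used only for illustration.
print(parse_modification_value("3:15.9949:0.25"))
```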
Configuring Proteomics openBIS
In order to run the proteomics version of openBIS, the configuration files (i.e. service.properties) of AS and DSS have to be extended. If the installation is done with the openBIS installer for proteomics, the configuration described below is already in place.
Configuration for AS
The following sections are needed in service.properties:
    # Configuration of database containing identified proteins
    proteomics.database.engine = postgresql
    proteomics.database.create-from-scratch = false
    proteomics.database.script-single-step-mode = false
    proteomics.database.url-host-part =
    proteomics.database.basic-name = proteomics
    proteomics.database.kind = productive
    proteomics.database.owner =
    proteomics.database.owner-password =
    proteomics.database.admin-user =
    proteomics.database.admin-password =
    proteomics.script-folder = proteomics

    # Core plugins folder.
    core-plugins-folder=./webapps/openbis/core-plugins
Usually only proteomics.database.kind will be configured differently.
Also web-client.properties needs a new section:
    technologies = proteomics

    # Relative path of cache. Default value is 'cache'.
    proteomics.cache-folder = ../../../web-client-data-cache
    # Minimum free disk space needed for the cache. Default value is 1 GB.
    #proteomics.minimum-free-disk-space-in-MB = 1024
    # Maximum retention time. Data older than this time will be removed from cache. Default value is a week.
    #proteomics.maximum-retention-time-in-days = 7
Configuration for DSS
DSS has to be configured in a way to allow registration of data sets with prot.xml files.
    root-dir = <root directory of store and drop boxes>
    storeroot-dir = ${root-dir}/store

    # ---------------------------------------------------------------------------
    # Data sources
    data-sources = data-source
    data-source.databaseEngineCode = postgresql
    data-source.basicDatabaseName = proteomics
    data-source.databaseKind = productive

    # ---------------------------------------------------------------------------
    # ETL processing threads (aka 'Drop Boxes')
    # ---------------------------------------------------------------------------
    inputs = ms-search

    # ---------------------------------------------------------------------------
    # 'ms-search' drop box for spectra data
    # ---------------------------------------------------------------------------
    # The directory to watch for incoming data.
    ms-search.incoming-dir = ${root-dir}/incoming-ms-search

    # Determines when the incoming data should be considered complete and ready to be processed.
    # Allowed values:
    # - auto-detection - when no write access will be detected for a specified 'quiet-period'
    # - marker-file - when an appropriate marker file for the data exists.
    # The default value is 'marker-file'.
    ms-search.incoming-data-completeness-condition = auto-detection

    # Extracts meta data and creates an experiment on the fly
    ms-search.data-set-info-extractor = ch.systemsx.cisd.openbis.etlserver.proteomics.DataSetInfoExtractorForProteinResults
    # Separator character between space code and project code to be extracted from folder name
    ms-search.data-set-info-extractor.separator = +
    # Type of the experiment to be created. Default: MS_SEARCH
    # ms-search.data-set-info-extractor.experiment-type-code = MS_SEARCH
    # Name of the properties file with properties of the experiment to be created. Default: search.properties
    # ms-search.data-set-info-extractor.experiment-properties-file-name = search.properties
    # Threshold of prot.xml files not being processed
    # ms-search.data-set-info-extractor.prot-xml-size-threshold-in-MB = 256

    ms-search.type-extractor = ch.systemsx.cisd.etlserver.SimpleTypeExtractor
    ms-search.type-extractor.file-format-type = XML
    ms-search.type-extractor.locator-type = RELATIVE_LOCATION
    ms-search.type-extractor.data-set-type = PROT_RESULT
    ms-search.type-extractor.is-measured = false

    # Storage processor which also uploads data to the database with identified proteins
    ms-search.storage-processor = ch.systemsx.cisd.openbis.etlserver.proteomics.StorageProcessorWithResultDataSetUploader
    ms-search.storage-processor.processor = ch.systemsx.cisd.etlserver.DefaultStorageProcessor
    # If this flag is 'true' amino-acid sequence and description is expected as described above. Default: false
    #ms-search.storage-processor.assuming-extended-prot-xml = false
    # Defines the delimiter for sample resolving of parameter elements in protXML file.
    # Everything after the delimiter will be ignored if present. The identifier is either a full identifier
    # (space code and sample code) or only the sample code for space MS_DATA. Default: ~
    #ms-search.storage-processor.delimiter_for_sample_resolving = ~
    # If this flag is 'false' samples in protXML parameters are tried to resolve via property MZXML_FILENAME
    # in same space as the experiment if not found by identifier. Default: true
    #ms-search.storage-processor.restricted_sample_resolving = true
    # If this flag is 'true' the prot.xml file will be validated in accordance to the prot.xml schema. Default: false
    #ms-search.storage-processor.validating-xml = false
    # Database credentials
    ms-search.storage-processor.database.basic-name = ${data-source.basicDatabaseName}
    ms-search.storage-processor.database.kind = ${data-source.databaseKind}
    ms-search.storage-processor.database.owner =
    ms-search.storage-processor.database.password =

    # ---------------------------------------------------------------------------
    # maintenance plugins configuration
    # ---------------------------------------------------------------------------
    maintenance-plugins = data-set-clean-up

    # Maintenance plugin which deletes stuff in the database of identified proteins after a data set has been deleted
    data-set-clean-up.class = ch.systemsx.cisd.etlserver.plugins.DeleteFromExternalDBMaintenanceTask
    data-set-clean-up.interval = 300
    data-set-clean-up.data-source = data-source
    data-set-clean-up.data-set-table-name = data_sets
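Because the DSS data source and the AS proteomics database settings refer to the same database, a quick consistency check of the two service.properties files can catch mismatches early. The sketch below is a hypothetical helper, not part of openBIS. Its parser is deliberately simplified: it reads plain `key = value` lines only and does not resolve `${...}` references or line continuations.

```python
def load_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


# (DSS property, AS property) pairs that have to carry the same value.
PAIRS = [
    ("data-source.basicDatabaseName", "proteomics.database.basic-name"),
    ("data-source.databaseKind", "proteomics.database.kind"),
]


def mismatches(dss_properties, as_properties):
    """Return the property pairs whose values differ between DSS and AS."""
    return [
        (dss_key, as_key)
        for dss_key, as_key in PAIRS
        if dss_properties.get(dss_key) != as_properties.get(as_key)
    ]


# Shortened excerpts of the two files, for illustration only.
DSS_TEXT = """
# Data sources
data-source.basicDatabaseName = proteomics
data-source.databaseKind = productive
"""
AS_TEXT = """
proteomics.database.basic-name = proteomics
proteomics.database.kind = productive
"""

print(mismatches(load_properties(DSS_TEXT), load_properties(AS_TEXT)))
```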
Some remarks:
- The values of the properties data-source.basicDatabaseName and data-source.databaseKind must be the same as the values of the properties proteomics.database.basic-name and proteomics.database.kind, respectively, in service.properties of AS.
- In order to register a data set with a prot.xml file, a folder named <space code>+<project code> has to be dropped into <root directory>/incoming-ms-search. It has to contain:
  - A prot.xml file. The file name has to end with prot.xml.
  - A properties file named search.properties. It contains properties of the freshly created experiment of type MS_SEARCH in the project specified by the folder name <space code>+<project code>. Unknown properties are ignored. The following properties have special meanings:
    - experiment-code: Optional property specifying the experiment code. If it is missing, the code of the experiment will be created automatically. Note that the experiment code has to be unique per project.
    - base-experiment: Identifier of an experiment. All data sets of this experiment will become parent data sets.
    - parent-data-set-codes: Comma and/or space separated list of codes of data sets which will become parent data sets. If this property is present, base-experiment will be ignored.
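Preparing such a folder can be sketched as follows. The helper prepare_drop_folder and all codes in the usage part are hypothetical. The folder is built in a staging directory so that it can afterwards be moved into incoming-ms-search in one step. That is a design choice of this sketch, not a requirement stated above.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_drop_folder(staging_dir, space_code, project_code, prot_xml_path, experiment_properties):
    """Create '<space code>+<project code>' with a prot.xml file and search.properties."""
    prot_xml_path = Path(prot_xml_path)
    if not prot_xml_path.name.endswith("prot.xml"):
        raise ValueError("the file name has to end with prot.xml")

    folder = Path(staging_dir) / f"{space_code}+{project_code}"
    folder.mkdir(parents=True)
    shutil.copy(prot_xml_path, folder / prot_xml_path.name)

    # One 'key = value' line per experiment property.
    lines = [f"{key} = {value}" for key, value in experiment_properties.items()]
    (folder / "search.properties").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return folder


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as staging:
        source = Path(staging) / "result.prot.xml"
        source.write_text("<protein_summary/>", encoding="utf-8")  # placeholder content
        folder = prepare_drop_folder(
            staging, "MY_SPACE", "MY_PROJECT", source, {"experiment-code": "SEARCH_1"}
        )
        print(folder.name, sorted(p.name for p in folder.iterdir()))
```

The resulting folder would then be moved into <root directory>/incoming-ms-search for the drop box to pick up.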