Abstract

SCOT in a Nutshell

SCOT is a novel multipurpose software tool that combines the benefits of several approaches for the classification of helices, strands, and turns in proteins. To our knowledge, it is the first method that captures, in a single step, not only a variety of rare and basic secondary structure elements (right- and left-handed α-, 3₁₀-, and 2.2₇-helices, right-handed π-helices, PPII helices, and β-sheets) in protein structures but also their irregularities, and that provides suitable output and visualization options.

SCOT combines the benefits of geometry-based and hydrogen bond-based methods by using both hydrogen bond and geometric information to gain insights into the structural space of proteins. This dual character enables a robust classification of secondary structure elements without a major loss of geometric regularity in the assigned elements. Consequently, SCOT is well suited to automatically assigning secondary structure elements for subsequent helix- and strand-based protein alignments with methods such as LOCK2 [1], a use that is further supported by its kink detection. These benefits are demonstrated in the accompanying publication (see References). Together with the easy-to-use visualization of assignments by means of PyMOL [2] scripts, SCOT enables a comprehensive analysis of regular backbone geometries in protein structures.

Key Features

Underlying Methodology

  • Dihedral angles
  • Hydrogen bonds
  • Geometry

Helices

  • Right-handed α-, 3₁₀-, 2.2₇-, and π-helices
  • Left-handed α-, 3₁₀-, and 2.2₇-helices
  • Polyproline II helices
  • Helix class purity
  • Helix kinks

β-sheets

  • Sheet assignment
  • Strand kinks

Turns

  • Normal, reverse, and open turns
  • Hydrogen bond energy (normal, reverse) and Cα-distance (open) output

Input

  • PDB file [3]

Output

  • PDB file
  • Secondary structure elements only (in PDB file format)
  • PyMOL [2] visualization script

Methodology

SCOT is a secondary structure assignment method (SSAM) that uses hydrogen bonds, geometric properties (Cα–Cα distances), and dihedral angles (via turn clustering).

It supports the classification of helices, β-strands and turns. It reads and writes files in the well-established PDB file format [3].

Parsing and Hydrogen Atom Assignment

SCOT requires standard PDB files as input. All options are set via command line arguments; a list of all supported arguments with a short documentation is displayed with --help. In addition, trained ESOM files are required for the dihedral angle-based classification of turns.

Our PDB file parsing procedure relies on the information given in the lines with the following prefixes: REMARK 465 (for the support of missing residues), SEQRES (for the support of modified residues), ATOM, HETATM (modified residues), TER, and ENDMDL in NMR structures. We parse the residues according to the ATOM and HETATM lines in the order of their appearance.
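The record handling described above can be sketched in Python as follows. This is a minimal illustration based on the fixed-column layout of the PDB format, not SCOT's actual parser: the function name parse_residues, the returned data layout, and the decision to stop at the first ENDMDL (that is, to read only the first NMR model) are assumptions. REMARK 465 and SEQRES handling is omitted.

```python
def parse_residues(pdb_lines):
    """Collect residues from ATOM/HETATM records in their order of appearance.

    Returns a dict mapping (chain, resSeq, iCode) to
    {"name": resName, "atoms": {atom_name: (x, y, z)}}.
    Assumption: only the first model of an NMR ensemble is read.
    """
    residues = {}  # dicts preserve insertion order (Python 3.7+)
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # assumed behaviour: keep the first model only
        if record not in ("ATOM", "HETATM"):
            continue  # TER and all other records carry no coordinates
        # Fixed columns of the PDB coordinate section (1-based):
        # 13-16 atom name, 18-20 residue name, 22 chain, 23-26 resSeq, 27 iCode
        key = (line[21], int(line[22:26]), line[26].strip())
        residue = residues.setdefault(
            key, {"name": line[17:20].strip(), "atoms": {}}
        )
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        # Keep the first occurrence of an atom (first alternate location).
        residue["atoms"].setdefault(line[12:16].strip(), xyz)
    return residues
```

Keying residues by chain, sequence number, and insertion code keeps modified residues given as HETATM records in sequence with their ATOM neighbours.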

SCOT utilizes a hierarchical assignment of protein structural elements starting with the assignment of turns. Since most of the publicly available protein structures in the PDB do not contain information on hydrogen atoms, we use the algorithm by McDonald and Thornton [4] to assign them artificially.

Hydrogen Atom Placement

Visualization of the hydrogen atom (H) placement for the right residue according to McDonald and Thornton [4].
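As an illustration of such a placement, a common geometric construction puts the backbone amide hydrogen in the plane of the peptide group, on the bisector of the C(i−1)–N–Cα angle and pointing away from both neighbours. The sketch below uses this construction with an N–H bond length of 1.0 Å. Both the construction and the bond length are assumptions made for illustration; the exact procedure of McDonald and Thornton [4] as used in SCOT may differ.

```python
import math

def place_amide_hydrogen(c_prev, n, ca, bond_length=1.0):
    """Place the amide H of a residue from C(i-1), N(i), and CA(i) coordinates.

    Assumption: H lies opposite to the bisector of the C(i-1)-N-CA angle,
    bond_length angstroms away from N. Collinear input is not handled.
    """
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return tuple(c / norm for c in v)

    to_c = unit(tuple(a - b for a, b in zip(c_prev, n)))   # N -> C(i-1)
    to_ca = unit(tuple(a - b for a, b in zip(ca, n)))      # N -> CA
    # Point away from both bonded heavy atoms.
    direction = unit(tuple(-(a + b) for a, b in zip(to_c, to_ca)))
    return tuple(p + bond_length * d for p, d in zip(n, direction))
```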

Turns

To determine the hydrogen-bonded normal and reverse turns, we use the DREIDING [5] hydrogen bonding criterion instead of the established DSSP (Define Secondary Structure of Proteins) [6] criterion. For the open turns, we determine the Cα–Cα distance between the first and the last residue of the turn. The detected turns are then clustered according to their dihedral angles.
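For orientation, the DREIDING force field [5] describes a hydrogen bond with a 12-10 potential that is modulated by the donor–hydrogen–acceptor angle: E = D_hb [5 (R_hb/R_DA)^12 − 6 (R_hb/R_DA)^10] cos^4(θ_DHA). The sketch below evaluates this functional form. The default values of d_hb and r_hb are illustrative placeholders, and treating angles of 90° or less as non-bonded is an assumption; neither is taken from SCOT, and no acceptance cutoff is implied.

```python
import math

def dreiding_hbond_energy(donor, hydrogen, acceptor, d_hb=7.0, r_hb=2.75):
    """Evaluate the DREIDING 12-10 hydrogen bond term in units of d_hb.

    donor, hydrogen, acceptor: (x, y, z) coordinates in angstroms.
    d_hb (well depth) and r_hb (equilibrium donor-acceptor distance) are
    placeholder values for illustration only.
    """
    r_da = math.dist(donor, acceptor)
    hd = [d - h for d, h in zip(donor, hydrogen)]      # H -> donor
    ha = [a - h for a, h in zip(acceptor, hydrogen)]   # H -> acceptor
    dot = sum(x * y for x, y in zip(hd, ha))
    cos_theta = dot / (math.hypot(*hd) * math.hypot(*ha))
    if cos_theta >= 0.0:
        return 0.0  # assumption: angles <= 90 degrees do not form a bond
    ratio = r_hb / r_da
    return d_hb * (5.0 * ratio ** 12 - 6.0 * ratio ** 10) * cos_theta ** 4
```

For a linear arrangement with the donor–acceptor distance equal to r_hb, the term evaluates to −d_hb, the minimum of the potential.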

We use a dataset of more than 3,500 protein structures from the PDB with distinct sequences, selected with the PISCES sequence culling server [7]. The dihedral angles of the classified turns of these proteins are transformed using our Jigsaw transformation, which consists of two functions and addresses the challenges of a metric distance measure in angular space. The transformed dihedral angles, i.e., two transformed values for each input angle, are then clustered by emergent self-organizing maps (ESOMs) [8]. The largest map for a single turn class comprises more than 1,000,000 neurons. The result is a variety of distinct turn clusters of similar backbone conformations.

Turn Categories, Hydrogen Bonds, and Cα–Cα Distances

Normal turns, reverse turns, and open turns based on a Cα–Cα atom distance between 4 Å and 8 Å.
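The distance criterion for open turns can be sketched as follows. The function name open_turns, the inclusive bounds, and the convention that a turn of a given length spans residues i to i + length − 1 are assumptions; the sketch also ignores whether a hydrogen bond is present, which SCOT uses to separate normal and reverse turns from open ones.

```python
import math

def open_turns(ca_coords, length, min_dist=4.0, max_dist=8.0):
    """Return start indices of windows whose end-to-end CA-CA distance
    lies between min_dist and max_dist angstroms (bounds assumed inclusive).

    ca_coords: list of (x, y, z) CA coordinates in sequence order.
    length: number of residues spanned by the turn.
    """
    hits = []
    for i in range(len(ca_coords) - length + 1):
        distance = math.dist(ca_coords[i], ca_coords[i + length - 1])
        if min_dist <= distance <= max_dist:
            hits.append(i)
    return hits
```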

Jigsaw Transformation

Visualization of the two Jigsaw transformation functions f1 and f2.
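Why each angle is mapped to two values can be illustrated with the standard sine/cosine embedding. This is a stand-in and not the Jigsaw transformation: the definitions of f1 and f2 are not reproduced here, and the function names below are hypothetical. The sketch only shows that a two-value representation removes the artificial jump at ±180° that a plain Euclidean distance on raw angles would see.

```python
import math

def embed_angle(degrees):
    """Map one angle to two values (stand-in for f1 and f2, NOT Jigsaw)."""
    radians = math.radians(degrees)
    return (math.cos(radians), math.sin(radians))

def embed_dihedrals(angles_deg):
    """Map n dihedral angles to a 2n-dimensional vector for clustering."""
    vector = []
    for angle in angles_deg:
        vector.extend(embed_angle(angle))
    return vector

def embedded_distance(angles_a, angles_b):
    """Euclidean distance between two conformations in the embedded space."""
    return math.dist(embed_dihedrals(angles_a), embed_dihedrals(angles_b))
```

With raw angles, −179° and 179° differ by 358°, although the conformations are nearly identical; in the embedded space their distance is close to zero.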

Strands

The next layer of the hierarchical assignment of secondary structure elements is dedicated to the classification of sheets and strands. We have developed three algorithms to assign sheets and strands. The final one determines the hydrogen bond contacts for all residues of an input protein structure. Using these contacts, we build a strand graph whose vertices are sequence regions of consecutive parallel or anti-parallel hydrogen bonding patterns. The edges connect different strands and are labeled with the hydrogen bonds they represent. Thus, each strand, and its length in particular, is implicitly defined by the hydrogen bond information stored in the labels of its vertex's incident edges.

We then determine a merge-blocking fingerprint based on specific turns that are usually located between succeeding strands within the same sheet. Using this fingerprint, we merge consecutive strands whose gap is not marked as blocked. Each connected component of the graph represents a sheet, and each sheet consists of at least two strands.
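The step from strand graph to sheets can be sketched as a connected-component search. The data layout (strand identifiers as vertices, edges as triples carrying a hydrogen bond label) and the function name are assumptions for illustration; the merge-blocking fingerprint is not modelled, and sorting is used here only as a simple way to obtain a deterministic result.

```python
from collections import defaultdict

def sheets_from_strand_graph(strands, edges):
    """Group strands into sheets via connected components.

    strands: iterable of sortable strand identifiers (graph vertices).
    edges: iterable of (strand_u, strand_v, hbonds), where hbonds is the
           label holding the hydrogen bonds between the two strands.
    Returns a list of sheets, each a sorted list of at least two strands.
    """
    adjacency = defaultdict(set)
    for u, v, _hbonds in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, sheets = set(), []
    for start in sorted(strands):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal; cycles (barrels) are harmless
            node = stack.pop()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) >= 2:  # a sheet needs at least two strands
            sheets.append(sorted(component))
    return sheets
```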

To cope with the circularity of β-barrels and to guarantee a deterministic assignment of sheets and strands, we use a priority queue to extract the sheet and strand information out of the graph. During this step, we also determine kinks based on the Cα–Cα distances in segments of length 4 in a strand. If this distance falls below a pre-defined threshold, a kink is defined.

The additional information about kinks is added to the REMARK section of the output PDB file to conform to the PDB file format.
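The distance-based kink test for strands can be sketched as a sliding window. Interpreting "segments of length 4" as the distance between the Cα atoms of residues i and i + 3 is an assumption, and the threshold is left as a required argument because its value is not given here.

```python
import math

def find_strand_kinks(ca_coords, threshold, segment_length=4):
    """Return start indices of segments whose end-to-end CA-CA distance
    falls below threshold (angstroms).

    ca_coords: (x, y, z) CA coordinates of one strand in sequence order.
    threshold: hypothetical cutoff; SCOT's pre-defined value is not stated.
    """
    kinks = []
    for i in range(len(ca_coords) - segment_length + 1):
        end = i + segment_length - 1
        if math.dist(ca_coords[i], ca_coords[end]) < threshold:
            kinks.append(i)
    return kinks
```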

Strand Scheme

Visualization of the parallel and anti-parallel sheet hydrogen bonding patterns.

Helices

The final layer of the hierarchical assignment of secondary structure elements deals with the classification of helices. We have developed five different algorithms for this purpose. The final one classifies right-handed (α, 3₁₀, π), left-handed (α, 3₁₀), and ribbon (polyproline II, 2.2₇) helices. Each of these three groups is processed separately. Within each group and for each helix class (e.g., α), we determine the turn overlaps of the corresponding turn type (i.e., normal turns of length 5 and class 1 for α) at all sequence positions. In addition, we determine the turn overlaps of the corresponding open turns for all helix classes within one group. These are used for the extension of our helices.

Based on the class-specific and extension overlaps, we define three-layered helices consisting of a core, a hull, and an extension. Such a helix is created whenever we detect a segment of succeeding helix-specific turn overlaps that reaches a minimum number of overlaps and a minimum segment length. This segment is the core of the helix. The hull is defined as all neighboring residues with an overlap of at least 1. The extension is defined in the same way as the core but based on the extension overlaps.
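The core and hull definition can be sketched from per-residue turn overlap counts. The representation of a turn as (start, length) with 0-based indices, the function names, and the parameters min_overlap and min_length are assumptions; SCOT's actual minimum values are not stated here, and the extension layer is omitted.

```python
def turn_overlaps(turns, n_residues):
    """Count, for each residue, how many turns cover it.

    turns: list of (start, length) with 0-based start index.
    """
    counts = [0] * n_residues
    for start, length in turns:
        for i in range(start, min(start + length, n_residues)):
            counts[i] += 1
    return counts

def core_and_hull(overlaps, min_overlap, min_length):
    """Find cores (runs with overlap >= min_overlap of at least min_length
    residues) and grow each to its hull (neighbours with overlap >= 1)."""
    helices, i, n = [], 0, len(overlaps)
    while i < n:
        if overlaps[i] < min_overlap:
            i += 1
            continue
        j = i
        while j + 1 < n and overlaps[j + 1] >= min_overlap:
            j += 1
        if j - i + 1 >= min_length:
            lo, hi = i, j
            while lo > 0 and overlaps[lo - 1] >= 1:
                lo -= 1
            while hi + 1 < n and overlaps[hi + 1] >= 1:
                hi += 1
            helices.append({"core": (i, j), "hull": (lo, hi)})
        i = j + 1
    return helices
```

For example, three turns of length 5 starting at residues 2, 3, and 4 give overlaps of at least 2 on residues 3 to 7, which form the core, while the hull extends to residues 2 to 8.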

We then split and block these helices wherever the Cα–Cα distance of a sequence segment of length 4 within a helix exceeds a predefined threshold. After that, we merge consecutive and overlapping helices and determine their classification based on the sequence coverage and turn overlaps of each involved helix class. The dominant class is taken as the final helix class. We also determine a helix class purity based on these overlaps to reflect how strongly this class dominates the helix.

We finally assign kinks within cores and hulls based on minima in the corresponding turn overlaps. We also assign classes to kinks to reflect the different geometrical regions a helix can consist of (e.g., 15 for a kink between an α (1) and a 3₁₀ (5) core).

The information about kinks and class purity is added to the REMARK section.

Core Helix Layers

Visualization of the core helix layer definition based on a generic α-core helix at residues 7–24. The core is based on three normal-5 1 turns which lead to a helix turn overlap of at least 2 from residue 16 to 21. The hull requires at least one normal-5 1 turn. The extension shown in this example is based on open-5 1 and open-6 4 turns. The required turn overlap of at least 3 is given from residue 15 to 23.

Core Helix Splitting

Generic example of the splitting of a core helix h. There are four splits based on Cα–Cα distances above the threshold between residues 15–18, 18–21, 20–23, and 25–28, leading to 5 segments. The first segment from residue 14 to 16 does not contain a part of the core and is, therefore, dropped. The segment from residue 20 to 21 is too short and is, therefore, also dropped. The segment from residue 17 to 19 contains a part of the core, is at least 3 residues long, and, thus, leads to the new core helix h1. The same holds true for the segments 22–26 and 27–29, leading to core helices h2 and h3.
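The example can be reproduced with a short sketch. In the example, each split falls between the second and third residue of the offending four-residue segment (15–18 yields a cut between residues 16 and 17), and the sketch adopts this placement as an assumption. The function name and arguments are hypothetical, and the core residues passed in the usage note are assumed because the caption does not list them.

```python
def split_helix(start, end, split_segments, core_residues, min_length=3):
    """Split a helix spanning residues start..end (inclusive).

    split_segments: (i, j) residue pairs with j == i + 3 whose CA-CA
                    distance exceeds the threshold.
    core_residues: residue numbers belonging to the helix core.
    Returns the kept pieces as (first, last) residue pairs.
    """
    core = set(core_residues)
    # Assumption: cut after the second residue of each offending segment.
    cuts = sorted({i + 1 for i, _j in split_segments if start <= i + 1 < end})
    pieces, piece_start = [], start
    for cut in cuts + [end]:
        pieces.append((piece_start, cut))
        piece_start = cut + 1
    # Keep pieces that are long enough and contain part of the core.
    return [
        (first, last)
        for first, last in pieces
        if last - first + 1 >= min_length
        and core.intersection(range(first, last + 1))
    ]
```

Calling split_helix(14, 29, [(15, 18), (18, 21), (20, 23), (25, 28)], range(17, 30)) returns the segments 17–19, 22–26, and 27–29, matching h1, h2, and h3.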

Output

For each PDB input file, we write a PDB output file containing the SCOT secondary structure element assignment. On request, via the command line option --write-pymol, we also write a PyMOL [2] script.

The PDB output file contains all lines from the PDB input file except for the HELIX, SHEET, TURN, REMARK 650, REMARK 700, and REMARK 750 lines. The HELIX and SHEET lines are given in PDB format. The REMARK lines contain information about kinks for helices and strands, helix class purities, and a format description for the TURN lines.
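The removal of existing assignment records can be sketched as a line filter. The function name is hypothetical, and the sketch covers only the stripping step; writing the new HELIX, SHEET, TURN, and REMARK lines is not shown.

```python
DROPPED_RECORDS = ("HELIX", "SHEET", "TURN")
DROPPED_REMARKS = ("650", "700", "750")

def strip_existing_assignments(pdb_lines):
    """Return the input lines without prior secondary structure records."""
    kept = []
    for line in pdb_lines:
        record = line[:6].strip()
        if record in DROPPED_RECORDS:
            continue
        # The remark number occupies columns 8-10 of a REMARK record.
        if record == "REMARK" and line[7:10] in DROPPED_REMARKS:
            continue
        kept.append(line)
    return kept
```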

Example PyMOL Script Visualization

Visualization of the SCOT assigned secondary structure elements of chain D of protein 1rd4@pdb using SCOT's PyMOL [2] visualization script.

Download

The use of this software is free to the scientific community as long as the authors are credited.

Please report bugs.

The executable contains all required libraries and therefore runs without any installation.

Version 1.0.2

Quick Use

Preparation

  1. Download the latest version of SCOT from the download section.
  2. Extract the archive using the command
    tar -xzvf scot+esoms.tar.gz
  3. If necessary, grant execute permission
    chmod u+x scot

Usage

  • Call the following command to view the complete usage
    ./scot --help
  • Call ./scot 1gos.pdb 1 to classify (single thread) the secondary structure elements for the PDB [3] file 1gos.pdb and add them to this file. Existing secondary structure element information in this file is overwritten.
  • Call ./scot 1gos.pdb 1gos_scot.pdb 1 to classify (single thread) the secondary structure elements for the PDB file 1gos.pdb and write a new file with name 1gos_scot.pdb.
  • Call ./scot input/ output/ 5 to classify in parallel (5 threads) the secondary structure elements for all *.pdb files in the input directory and write PDB files containing the classified secondary structure elements to the output directory.
  • Use the option --write-pymol to write a PyMOL [2] visualization script for each input PDB file.

References

SCOT

Tobias Brinkjost, Christiane Ehrt, Oliver Koch, and Petra Mutzel "SCOT: Rethinking the Classification of Secondary Structure Elements", Bioinformatics, 2019, DOI: 10.1093/bioinformatics/btz826

Bibliography

  1. J. Shapiro and D. Brutlag. FoldMiner and LOCK 2: protein structure comparison and motif discovery on the web. In: Nucleic Acids Research 32 (2004), W536–W541. DOI: 10.1093/nar/gkh389.
  2. Schrödinger, LLC. The PyMOL molecular graphics system, version 1.8. 2015.
  3. H. Berman, K. Henrick, and H. Nakamura. Announcing the worldwide protein data bank. In: Nature Structural and Molecular Biology 10 (2003), p. 980. DOI: 10.1038/nsb1203-980.
  4. I. K. McDonald and J. M. Thornton. Satisfying hydrogen bonding potential in proteins. In: Journal of Molecular Biology 238.5 (1994), pp. 777–793. DOI: 10.1006/jmbi.1994.1334.
  5. S. L. Mayo, B. D. Olafson, and W. A. Goddard. DREIDING: a generic force field for molecular simulations. In: The Journal of Physical Chemistry 94.26 (1990), pp. 8897–8909. DOI: 10.1021/j100389a010.
  6. W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. In: Biopolymers 22.12 (1983), pp. 2577–2637. DOI: 10.1002/bip.360221211.
  7. G. Wang and R. L. Dunbrack Jr. PISCES: a protein sequence culling server. In: Bioinformatics 19.12 (2003), pp. 1589–1591. DOI: 10.1093/bioinformatics/btg224.
  8. A. Ultsch and F. Mörchen. ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM. Tech. rep. 46. University of Marburg, 2005.