SgedTools
jydu.github.io/sgedtools
About SgedTools

The Site/Group Extended Data (SGED) format is a simple text format for storing annotations of sequence alignment sites or groups of sites. SGED files are CSV or TSV files in which one column holds the site index. The content of this column has a special format: positions are separated by semicolons and surrounded by square brackets, as in [3;12;987]. SGED files start with a header line containing the column names. The special index column is usually the first one and is usually named "Group" or "Site", but neither of these is a strict requirement.
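The format described above can be read with a few lines of standard Python. The sketch below is only an illustration, not the SgedTools implementation: the helper names `parse_group` and `read_sged`, the column name "Score" and the example rows are all hypothetical. Positions are kept as strings, since an index column may also hold non-numeric coordinates.

```python
import csv
import io


def parse_group(text):
    """Split an SGED index such as '[3;12;987]' into a list of position strings."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"not an SGED index: {text!r}")
    return [p.strip() for p in inner[1:-1].split(";") if p.strip()]


def read_sged(handle, index_column="Group", delimiter="\t"):
    """Read an SGED table; the index column is parsed, other columns stay as text."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row[index_column] = parse_group(row[index_column])
        rows.append(row)
    return rows


# Hypothetical tab-separated SGED content: header line first, then one group per line.
example = "Group\tScore\n[3;12;987]\t0.91\n[5]\t0.42\n"

rows = read_sged(io.StringIO(example))
print(rows[0]["Group"])  # ['3', '12', '987']
```

For alignment positions, the strings can be converted with `int`. A comma-separated file would be read by passing `delimiter=","`.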

SgedTools is a set of programs to manipulate SGED files. Each program performs a specific task, such as translating coordinates, merging files, or adding annotations. Complex manipulations and analyses can be performed by combining the various tools.

The SgedTools currently include the following programs:

Translate coordinates

sged-concatenate-alignments.py
Concatenate two or more alignments and create indexes giving the position of each original alignment column in the concatenated alignment.
sged-create-sequence-index.py
Create an index file from a sequence alignment. Allows conversion between alignment positions and sequence specific positions.
sged-create-structure-index.py
Match sequences in a sequence alignment with one or more three-dimensional structures. Creates an index to translate between alignment positions and structural coordinates.
sged-liftover-index.py
Translate one index using a second one, so that A->B + B->C = A->C.
sged-merge-indexes.py
Combine compatible indexes (with unique sets of positions).
sged-translate-coords.py
Convert the coordinates of the entries in a SGED file from one reference to another (for instance from alignment positions to PDB coordinates). Requires a previously computed index file.
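The idea behind lifting over an index and translating coordinates can be sketched with plain dictionaries. This is a conceptual illustration only: the function names, the example positions and the "A:10"-style structure labels are hypothetical, and the actual index file format used by SgedTools is not shown here.

```python
def liftover(a_to_b, b_to_c):
    """Compose two position maps: A->B followed by B->C gives A->C.

    Positions of A whose image in B has no entry in the second map are dropped.
    """
    return {a: b_to_c[b] for a, b in a_to_b.items() if b in b_to_c}


def translate_group(group, index):
    """Translate each position of a group; unmapped positions become None."""
    return [index.get(position) for position in group]


# Hypothetical indexes: alignment column 3 is a gap in the sequence, so it is absent.
aln_to_seq = {1: 1, 2: 2, 4: 3}
seq_to_pdb = {1: "A:10", 2: "A:11", 3: "A:12"}

aln_to_pdb = liftover(aln_to_seq, seq_to_pdb)
print(aln_to_pdb)                           # {1: 'A:10', 2: 'A:11', 4: 'A:12'}
print(translate_group([1, 3, 4], aln_to_pdb))  # ['A:10', None, 'A:12']
```

How unmapped positions are reported (dropped, or flagged with a placeholder) is a design choice; the sketch uses `None` so that the output group keeps the same length as the input.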

Manipulate groups

sged-get-all-pairs.py
Take all N groups in a SGED file and combine them in pairs, resulting in N*(N-1)/2 groups.
sged-group.py
Aggregate single sites or groups into (super) groups, possibly according to a given column.
sged-group-test-inclusion.py
Test whether the groups in one file are included in a second file.
sged-merge.py
Merge two SGED files. Several join operations are supported.
sged-randomize-groups.py
Generate a list of groups with characteristics similar to a given set of groups. The output groups have the same size and similar site properties, but sites are taken randomly.
sged-ungroup.py
Convert groups of positions into a list of single positions. For instance, [1;2;3] becomes [1] [2] [3] (1 line -> 3 lines).
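Two of the group operations above are easy to picture on lists of positions. The sketch below is an illustration under stated assumptions, not the code of the tools: in particular, representing a pair of groups by concatenating their position lists is an assumption, and the example groups are made up.

```python
from itertools import combinations


def all_pairs(groups):
    """Combine N groups two by two, giving N*(N-1)/2 paired groups.

    Assumption: a pair is represented by the concatenation of both position lists.
    """
    return [first + second for first, second in combinations(groups, 2)]


def ungroup(groups):
    """Split every group into single-site groups: [1;2;3] -> [1] [2] [3]."""
    return [[site] for group in groups for site in group]


groups = [[1, 2], [3], [4, 5, 6]]  # hypothetical groups

pairs = all_pairs(groups)
print(pairs)              # [[1, 2, 3], [1, 2, 4, 5, 6], [3, 4, 5, 6]]
print(ungroup([[1, 2, 3]]))  # [[1], [2], [3]]
```

With N = 3 input groups, `all_pairs` returns 3 * 2 / 2 = 3 groups, matching the formula given for sged-get-all-pairs.py.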

Structure analysis

sged-structure-list.py
List all positions in a structure file and output them in SGED format.
sged-structure-infos.py
Compute several structural statistics of sites or groups from a three-dimensional structure (e.g., 3D distance, solvent accessibility, secondary structure).
sged-sged2defattr.py
Export data from a SGED file to an attribute file importable in Chimera/ChimeraX.

Statistics

sged-summary.py
Compute summary statistics for all sites in each group, based on single site properties.
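As a rough picture of what a per-group summary involves, the sketch below averages a single-site property over the sites of each group. The property name, the values and the choice of statistics are hypothetical; the statistics actually offered by sged-summary.py are not listed in this document.

```python
from statistics import mean


def summarize(group, site_property):
    """Summarize a per-site property over the sites of one group.

    Sites without a value are skipped; an empty selection yields None.
    """
    values = [site_property[site] for site in group if site in site_property]
    if not values:
        return None
    return {"n": len(values), "mean": mean(values), "min": min(values), "max": max(values)}


# Hypothetical single-site property (for instance an evolutionary rate per site).
site_rate = {1: 0.25, 2: 0.75, 3: 1.5}
groups = [[1, 2], [2, 3], [7]]

summaries = [summarize(group, site_rate) for group in groups]
print(summaries[0])  # {'n': 2, 'mean': 0.5, 'min': 0.25, 'max': 0.75}
```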

Convert from other file formats

sged-disembl2sged.py
Convert the output of the DisEMBL program (for predicting intrinsically disordered regions in a protein) to SGED format.
sged-paml2sged.py
Convert the output of PAML site models (sites under positive selection) to SGED format.
sged-raser2sged.py
Convert the output of RASER (rate shift estimator) to SGED format.

Installing SgedTools

The SgedTools are standalone Python (version 3) scripts that can simply be copied and executed. They are located in the src sub-directory of the distribution. The programs can be obtained from GitHub.com; stable releases are available for download.

The scripts depend on the following Python packages, some of which in turn require external software to be installed:

  • pandas, for spreadsheet data I/O and manipulation,
  • Biopython, for sequence data I/O,
  • NumPy and SciPy, for calculations and numerics,
  • progress, for displaying progress bars.
These packages can all be installed using the pip manager. To avoid any version conflict, it is recommended to use a conda environment:

    conda create -n sgedtools-env python=3
    conda activate sgedtools-env
    pip install progress
    pip install pandas
    pip install numpy
    pip install scipy
    pip install biopython
    
Note that Biopython only provides wrappers for the DSSP and MSMS programs, which need to be installed separately in order to compute secondary structures, solvent accessibility and residue depth (see the Bio.PDB.DSSP and Bio.PDB.ResidueDepth module descriptions for more information).
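A quick way to confirm that the environment is complete is to look for the Python modules and the external executables before running any tool. The sketch below uses only the standard library. The executable names "mkdssp" and "msms" are assumptions (they may differ between installations), and the helper names are hypothetical.

```python
import importlib.util
import shutil

# Import names of the packages listed above (Biopython is imported as "Bio").
PYTHON_MODULES = ["pandas", "Bio", "numpy", "scipy", "progress"]
# Assumed executable names for DSSP and MSMS; adjust to the local installation.
EXTERNAL_PROGRAMS = ["mkdssp", "msms"]


def missing_modules(names):
    """Return the modules that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_programs(names):
    """Return the executables that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


print("Missing Python modules:", missing_modules(PYTHON_MODULES))
print("Missing external programs:", missing_programs(EXTERNAL_PROGRAMS))
```

Two empty lists indicate that everything was found. A missing DSSP or MSMS executable only matters for the analyses that rely on secondary structure, solvent accessibility or residue depth.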

A Makefile is provided for easier installation. You can run it using, for instance:


    cd sgedtools
    make install PREFIX=$HOME/.local/bin
    

Using SgedTools

Each program in the SgedTools package takes one or several arguments as input, which can be listed by running the program with the -h option:


>python3 sged-create-structure-index.py -h

sged-create-structure-index

    Create a structure index for an alignment. Align each sequence to all chains of one
    or more input structures and find the best match.

Available arguments:
    --pdb (-p): Input protein data bank file (required).
        Can be used multiple times to selected several entries.
        File globs can be used to select multiple files.
    --pdb-format (-f): Format of the protein data bank file (default: PDB).
        Either PDB or mmCif is supported. In addition, remote:PDB or remote:mmCif
        allow to directly download the structure file from the Protein Data Bank.
        In this case, --pdb-id indicates the PDB id.
    --pdb-id (-i): Specify the id of the PDB file to retrieve remotely.
    --alignment (-a): Input alignment file (required);
    --alignment-format (-g): Input alignment format (default: fasta).
        Any format recognized by Bio::AlignIO (see https://biopython.org/wiki/AlignIO).
    --output (-o): Output index file (required).
    --exclude-incomplete (-x): Exclude incomplete chains from scan (default: false).
    --help (-h): Print this message.
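
When the tools are chained from a Python pipeline, the command line can be assembled from the options listed in the help message above. The sketch below is a hypothetical helper, not part of SgedTools: the function name and all file names are placeholders, and only the required options (--pdb, --alignment, --output) are used.

```python
import sys


def structure_index_command(alignment, pdb_files, output,
                            script="sged-create-structure-index.py"):
    """Build the argument list for sged-create-structure-index.py.

    --pdb is repeated once per structure file, as allowed by the help message.
    """
    command = [sys.executable, script, "--alignment", alignment, "--output", output]
    for pdb_file in pdb_files:
        command += ["--pdb", pdb_file]
    return command


# Placeholder file names, for illustration only.
command = structure_index_command(
    "alignment.fasta", ["structure1.pdb", "structure2.pdb"], "alignment_PdbIndex.txt"
)
print(" ".join(command))
# Once the input files exist, the list can be passed to subprocess.run(command, check=True).
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.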
    

The examples directory contains several example pipelines demonstrating concrete use of the SgedTools:

alignment_confidence
Uses the bppAlnScore program from the Bio++ Program Suite to compare two alignments and compute alignment column scores. The scores are then visualized on a three-dimensional structure.
concatenate_alignments
Demonstrates how alignments can be combined while keeping track of the coordinates of each alignment. The three mitochondrial subunits of cytochrome oxidase are used as an example. Inter-subunit coevolving sites are predicted and mapped onto the 3D structure.
structure_to_sged
Shows how to compute simple statistics from a protein structure.
structure_statistics
Performs advanced structural analyses of candidate coevolving positions. Includes conditional Monte Carlo sampling.
intrinsic_disordered
Examines how fast intrinsically disordered regions evolve.
raser_to_structure
Example of RASER results analysis.
paml_to_structure
Example of PAML results analysis.
rate_to_structure
Performs a ConSurf-like analysis: estimate evolutionary rates and map the results onto a three-dimensional structure.