The Site/Group Extended Data (SGED) format is a simple text format for storing annotations of (groups of) sequence alignment sites. SGED files are essentially CSV or TSV files with a column representing the site index. The content of this column has a special format: positions are separated by semicolons and surrounded by square brackets, for example [3;12;987].
SGED files start with a header line, containing the column names. The special index column is usually the first and named "Group" or "Site", but none of these are strict requirements.
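The layout described above can be read with a few lines of Python. The following is only a minimal sketch, not the parser used by the SgedTools: the function names `parse_group` and `read_sged`, the tab delimiter, and the `Score` column are assumptions made for illustration. Positions are kept as strings, since the sketch makes no assumption about what a position looks like.

```python
import csv
import io


def parse_group(cell):
    """Turn an index cell such as '[3;12;987]' into ['3', '12', '987']."""
    text = cell.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"not a bracketed index: {cell!r}")
    inner = text[1:-1]
    # Split on semicolons and drop empty tokens (e.g. for an empty group '[]').
    return [p.strip() for p in inner.split(";") if p.strip()]


def read_sged(handle, group_col="Group", delimiter="\t"):
    """Read an SGED-like table; the index column is parsed into a list."""
    rows = []
    for record in csv.DictReader(handle, delimiter=delimiter):
        record[group_col] = parse_group(record[group_col])
        rows.append(record)
    return rows


# Hypothetical two-line file with a header; 'Score' is an invented column.
example = "Group\tScore\n[3;12;987]\t0.8\n[5]\t0.1\n"
rows = read_sged(io.StringIO(example))
print(rows[0]["Group"])
```

Because the index column is not required to be first or to carry a fixed name, the column name is passed as a parameter rather than hard-coded.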
SgedTools is a set of programs to manipulate SGED files. Each program performs a specific task, such as translating coordinates, merging files, or adding annotations. Complex manipulations and analyses can be performed by combining the various tools.
The SgedTools currently include the following programs:
sged-concatenate-alignments.py
sged-create-sequence-index.py
sged-create-structure-index.py
sged-liftover-index.py
sged-merge-indexes.py
sged-translate-coords.py
sged-get-all-pairs.py
sged-group.py
sged-group-test-inclusion.py
sged-merge.py
sged-randomize-groups.py
sged-ungroup.py: for example, [1;2;3] becomes [1] [2] [3] (1 line -> 3 lines).
sged-structure-list.py
sged-structure-infos.py
sged-sged2defattr.py
sged-summary.py
sged-disembl2sged.py: converts the output of the disembl program (for predicting intrinsically disordered regions in a protein structure) to SGED format.
sged-paml2sged.py: converts the output of PAML site models (sites under positive selection) to SGED format.
sged-raser2sged.py
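To make the ungrouping example from the program list concrete ([1;2;3] becomes [1] [2] [3]), here is a minimal sketch of the transformation in Python. It is not the code of sged-ungroup.py: the function name `ungroup`, the `Score` column, and the choice to copy the other columns unchanged onto each new line are assumptions made for illustration.

```python
def ungroup(rows, group_col="Group"):
    """Split each multi-position group into one row per position."""
    out = []
    for row in rows:
        # Strip the surrounding brackets, then split on semicolons.
        positions = row[group_col].strip()[1:-1].split(";")
        for pos in positions:
            new_row = dict(row)  # assumption: other columns are copied as-is
            new_row[group_col] = f"[{pos.strip()}]"
            out.append(new_row)
    return out


# One input line with a group of three sites yields three output lines.
grouped = [{"Group": "[1;2;3]", "Score": "0.8"}]
for r in ungroup(grouped):
    print(r["Group"], r["Score"])
```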
The SgedTools are standalone Python (version 3) scripts that can simply be copied and executed. They are located in the src sub-directory of the distribution. The programs can be obtained from GitHub.com; stable releases are available for download.
Some Python packages are needed for the scripts to work properly, and some of these packages require external software to be installed:
conda create -n sgedtools-env python=3
conda activate sgedtools-env
pip install progress
pip install pandas
pip install numpy
pip install scipy
pip install biopython
Note that Biopython only provides wrappers to the DSSP and MSMS programs, which need to be installed separately in order to compute secondary structures, solvent accessibility and residue depth (see the Bio.PDB.DSSP and Bio.PDB.ResidueDepth module descriptions for more info).
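A quick way to verify the environment is to check that the Python packages can be found and that the external programs are on the PATH. The sketch below is not part of the SgedTools; the executable names `mkdssp` and `msms` are assumptions and may differ depending on how DSSP and MSMS were installed on your system.

```python
import importlib.util
import shutil

# Import names of the packages installed above (biopython is imported as 'Bio').
PYTHON_MODULES = ["progress", "pandas", "numpy", "scipy", "Bio"]
# Assumed executable names; adjust them to match your installation.
EXTERNAL_PROGRAMS = ["mkdssp", "msms"]


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    for name in PYTHON_MODULES:
        status[name] = importlib.util.find_spec(name) is not None
    for name in EXTERNAL_PROGRAMS:
        status[name] = shutil.which(name) is not None
    return status


for name, found in check_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A missing external program only matters for the tools that compute secondary structures, solvent accessibility or residue depth.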
A Makefile is provided for easier installation. You can run it using, for instance:
cd sgedtools
make install PREFIX=$HOME/.local/bin
Each program in the SgedTools package takes one or several arguments as input, which can be listed by running the program with the -h option:
>python3 sged-create-structure-index.py -h
sged-create-structure-index
Create a structure index for an alignment. Align each sequence to all chains of one
or more input structures and find the best match.
Available arguments:
--pdb (-p): Input protein data bank file (required).
Can be used multiple times to selected several entries.
File globs can be used to select multiple files.
--pdb-format (-f): Format of the protein data bank file (default: PDB).
Either PDB or mmCif is supported. In addition, remote:PDB or remote:mmCif
allow to directly download the structure file from the Protein Data Bank.
In this case, --pdb-id indicates the PDB id.
--pdb-id (-i): Specify the id of the PDB file to retrieve remotely.
--alignment (-a): Input alignment file (required);
--alignment-format (-g): Input alignment format (default: fasta).
Any format recognized by Bio::AlignIO (see https://biopython.org/wiki/AlignIO).
--output (-o): Output index file (required).
--exclude-incomplete (-x): Exclude incomplete chains from scan (default: false).
--help (-h): Print this message.
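Based on the arguments listed above, a call to the program can be assembled and launched from Python, which is convenient when chaining several tools in a pipeline. This is a sketch only: the helper `build_command` and the file names `structure.pdb`, `alignment.fasta` and `structure_index.txt` are hypothetical, and only options shown in the help message are used.

```python
import shlex


def build_command(pdb_files, alignment, output,
                  pdb_format="PDB", alignment_format="fasta",
                  script="sged-create-structure-index.py"):
    """Assemble the argument list for sged-create-structure-index.py."""
    cmd = ["python3", script]
    for pdb in pdb_files:
        # --pdb can be given several times to select several entries.
        cmd += ["--pdb", pdb]
    cmd += [
        "--pdb-format", pdb_format,
        "--alignment", alignment,
        "--alignment-format", alignment_format,
        "--output", output,
    ]
    return cmd


# Hypothetical input and output file names.
cmd = build_command(["structure.pdb"], "alignment.fasta", "structure_index.txt")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.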
The examples directory contains several example pipelines demonstrating concrete use of the SgedTools: