Phineus is powered by Postgres

 Improving Accuracy - Sequence Trimming

At present, STARS uses the Staden vector-clipping module to trim sequences by identifying the start and end positions of the sequence used to define alleles. The module relies on a 'vector sequence file' containing the sequence relating to one gene. This approach is limited in several ways that can impair the speed and accuracy with which STARS is able to act. The two primary limitations are:

1. The trimming position is located using sequence before (or after) the region used to define alleles. This flanking sequence is potentially of low quality because it lies near the start (or end) of the read. Searching instead for sequence at the start of the region used to define the allele provides higher-quality sequence to guide trimming, and thus fewer cases where the trimming positions cannot be identified.

2. Only one gene can be analysed at a time, as Staden does not support the use of multiple vector files in a single run.

Therefore, a new 'clipping module' was designed, which uses a Smith-Waterman algorithm to determine trimming points.
Because Phineus will be used in a variety of contexts, the clipping module has several modes, which allow it to determine trimming points in different ways.
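
As an illustration of how a Smith-Waterman local alignment can locate a short guide sequence within a read, the following is a minimal sketch. It is not the Phineus implementation: the function name `smith_waterman`, the linear gap penalty and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are all assumptions chosen for the example.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment of `query` within `target`.

    Returns (score, start, end) so that target[start:end] is the aligned
    region of the target. Scoring values are illustrative assumptions.
    """
    rows, cols = len(query) + 1, len(target) + 1
    # H[i][j] holds the best local alignment score ending at query[i-1], target[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + step,  # match or mismatch
                          H[i - 1][j] + gap,       # gap in target
                          H[i][j - 1] + gap)       # gap in query
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if query[i - 1] == target[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


score, start, end = smith_waterman("ACGT", "TTTACGTTT")
print(score, start, end)
```

The returned `start` and `end` positions in the read are the kind of coordinates from which trimming points can then be derived.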

Mode 1 uses two sequences, one located at the 5' end of the correctly trimmed sequence, and one at the 3' end.
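
A minimal sketch of the Mode 1 coordinate logic is given below. For brevity, exact string matching (`str.find`) stands in for the local alignment step the module actually uses, and the function name `trim_mode1` and its parameters are hypothetical.

```python
def trim_mode1(read, anchor5, anchor3):
    """Trim `read` between a 5' anchor and a 3' anchor (both kept).

    Exact matching is a simplifying assumption; the real module would
    tolerate mismatches by aligning the anchors instead.
    """
    start = read.find(anchor5)
    if start < 0:
        return None  # 5' trimming point not found
    hit3 = read.find(anchor3, start)
    if hit3 < 0:
        return None  # 3' trimming point not found
    end = hit3 + len(anchor3)
    return read[start:end]


print(trim_mode1("NNNNACGTTTGGCCAATTNNN", "ACG", "ATT"))
```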

Mode 2 makes use of the fact that the sequences that define the alleles at a locus in MLST are almost invariant in length, and so uses one sequence, located at either the 5' or the 3' end of the correctly trimmed sequence, along with an offset value (the trimmed sequence length) which enables it to determine the second trimming point.
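
The arithmetic behind Mode 2 can be sketched as follows. Again, exact matching stands in for the alignment step, and the names `trim_mode2`, `length` and `anchor_end` are assumptions made for this example rather than names taken from the module.

```python
def trim_mode2(read, anchor, length, anchor_end="5"):
    """Trim `read` using one anchor plus the expected trimmed length.

    anchor_end is "5" if the anchor sits at the 5' end of the trimmed
    sequence, or "3" if it sits at the 3' end (hypothetical parameter).
    """
    hit = read.find(anchor)
    if hit < 0:
        return None  # anchor not found
    if anchor_end == "5":
        start = hit
        end = start + length       # second point derived from the offset
    else:
        end = hit + len(anchor)
        start = end - length       # second point derived from the offset
    if start < 0 or end > len(read):
        return None  # read too short to contain the full fragment
    return read[start:end]


read = "NNNNACGTTTGGCCAATTNNN"
print(trim_mode2(read, "ACG", 14, anchor_end="5"))
print(trim_mode2(read, "ATT", 14, anchor_end="3"))
```

Both calls should recover the same 14-base fragment, since the fragment length fixes the second trimming point once either end has been located.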

Mode 3 uses a sequence corresponding to the entire length of the MLST locus to identify the trimming points. It is intended that this mode will be further improved by the use of a Hidden Markov Model for defining the correct/desired sequence trimming points.

Mode 4 utilises a sequence located within a conserved (or relatively conserved) internal region of the gene fragment, and uses a pair of offsets to determine the correct trimming points.
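
A sketch of how a pair of offsets could be applied in Mode 4 is shown below. The source does not state how the offsets are measured, so this example assumes `upstream` counts the bases between the 5' trimming point and the start of the internal anchor, and `downstream` counts the bases between the end of the anchor and the 3' trimming point. The function name `trim_mode4` is hypothetical, and exact matching stands in for the alignment step.

```python
def trim_mode4(read, anchor, upstream, downstream):
    """Trim `read` around an internal anchor using two offsets.

    upstream:   bases from the 5' trimming point to the anchor start (assumed)
    downstream: bases from the anchor end to the 3' trimming point (assumed)
    """
    hit = read.find(anchor)
    if hit < 0:
        return None  # internal anchor not found
    start = hit - upstream
    end = hit + len(anchor) + downstream
    if start < 0 or end > len(read):
        return None  # read does not span the whole fragment
    return read[start:end]


print(trim_mode4("NNNNACGTTTGGCCAATTNNN", "GGCC", 6, 4))
```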


How the four modes operate is summarised in a schematic diagram here

The differences between the Clipping module and the Staden vector_clip module are shown here

