tffm Module

Module implementing the TFFMs.

platform:Unix
synopsis:Define the class representing the Transcription Factor Flexible Models and the necessary functions to manipulate them.
todo:Allow the construction of TFFMs using a different de novo motif finding tool than MEME.
class tffm_module.TFFM(emission_domain, distribution, cmodel, kind, name='TFFM')[source]

Bases: ghmm.DiscreteEmissionHMM

Define the Transcription Factor Flexible Models.

Note

Instances of this class have to be created through the functions tffm_from_xml() or tffm_from_meme().

__del__()[source]

Delete the underlying C structures.

Note:The destruction is made using the ghmm.DiscreteEmissionHMM destructor.
__init__(emission_domain, distribution, cmodel, kind, name='TFFM')[source]

Construct an instance of the TFFM class.

Parameters:
  • emission_domain (ghmm.EmissionDomain) – The emission domain of the underlying ghmm.HMM.
  • distribution (ghmm.Distribution) – The distribution over the emission domain.
  • cmodel – The cmodel (HMM itself implemented in C) of the underlying ghmm.HMM.
  • kind (Enum) – The kind of TFFM: 0-order, 1st-order, or detailed. Use TFFM_KIND.ZERO_ORDER, TFFM_KIND.FIRST_ORDER, or TFFM_KIND.DETAILED respectively.
  • name (str) – The name of the TFFM (default: ‘TFFM’).
Raises:

exceptions.TFFMKindError when the given kind is neither ‘1st-order’ nor ‘detailed’.

__len__()[source]

Give the length of the TFFM, i.e. the number of nucleotides in the model excluding the background.

__module__ = 'tffm_module'
_construct_hits(posterior_proba, seq_record, threshold, negative)[source]

Compute the TFBS hits on a sequence given the posterior probabilities and construct the corresponding instances of HIT.

Parameters:
  • posterior_proba (list of list of float) – The posterior probabilities at each position of the sequence computed given the tffm.
  • seq_record (Bio.SeqRecord) – The sequence on which to predict TFBS hits.
  • threshold (float) – The minimal probability to predict a position as a TFBS hit.
  • negative (bool) – A boolean stating if the TFBS hits are to be predicted on the positive or the negative strand of the sequence. Set to True when on the negative strand.
Returns:

The list of TFBS hits predicted on the sequence strand.

Return type:

list of HIT
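
As a rough illustration of how a probability threshold turns per-position probabilities into hits, here is a minimal sketch. It is not the module's implementation: `SketchHit`, `hits_from_probabilities`, and the flat `end_probas` list are hypothetical, and the sketch assumes that `end_probas[i]` is the probability that a binding site ends at 0-based position `i`. Coordinates are not remapped for the negative strand.

```python
from collections import namedtuple

# Hypothetical stand-in for the HIT class, for illustration only.
SketchHit = namedtuple("SketchHit", ["start", "end", "strand", "score"])


def hits_from_probabilities(end_probas, motif_length, threshold, negative=False):
    """Report a hit wherever the probability reaches the threshold.

    Assumption: end_probas[i] is the probability that a site ends at
    0-based position i, so the site starts at i - motif_length + 1.
    """
    strand = "-" if negative else "+"
    hits = []
    for end, proba in enumerate(end_probas):
        start = end - motif_length + 1
        # Skip positions too close to the sequence start to hold a full site.
        if start >= 0 and proba >= threshold:
            hits.append(SketchHit(start, end, strand, proba))
    return hits


example_hits = hits_from_probabilities([0.0, 0.1, 0.9, 0.2, 0.8], 3, 0.5)
```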

_get_hits(seq_record, threshold, negative=False)[source]

Predict TFBS hits in the sequence given the TFFM.

Parameters:
  • seq_record (Bio.SeqRecord) – The sequence on which to predict TFBS hits.
  • threshold (float) – The minimal probability to predict a position as a TFBS hit.
  • negative (bool) – A boolean stating if the TFBS hits are to be predicted on the positive or the negative strand of the sequence. Set to True when on the negative strand (default: False).
Returns:

The list of TFBS hits predicted on the sequence strand.

Return type:

list of HIT

_get_posterior_proba(sequence_split)[source]

Get the posterior probabilities at each nucleotide position given the TFFM.

Parameters:sequence_split (list) – The sequence split into subsequences so that non-ACGT nucleotides are not considered.
Returns:The posterior probabilities at each position of the sequence.
Return type:list of list
Note:One example of a sequence_split is [“ACT”, “N”, “ATC”].
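
A split of this shape could be produced as follows. This is only a sketch: the helper name `split_on_non_acgt` is hypothetical, and how the module itself groups consecutive non-ACGT characters is not stated here, so keeping them together in one run is an assumption.

```python
import re


def split_on_non_acgt(sequence):
    """Split a DNA string into alternating runs of ACGT and non-ACGT letters.

    Assumption: consecutive non-ACGT characters are kept together in one run.
    """
    return re.findall(r"[ACGT]+|[^ACGT]+", sequence.upper())


parts = split_on_non_acgt("ACTNATC")  # ['ACT', 'N', 'ATC']
```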
_get_trimmed_hmm(first, last)[source]

Return the new trimmed HMM.

Parameters:
  • first (int) – Position of the new first matching position.
  • last (int) – Position of the new last matching position.
Returns:

The new trimmed HMM.

Return type:

ghmm.DiscreteEmissionHMM

Todo:

Raise an error rather than a sys.exit() when the trimmed HMM becomes empty.

background_emission_proba()[source]

Return the emission probabilities of the nucleotides in the background state.

Returns:A dictionary with characters ‘A’, ‘C’, ‘G’, and ‘T’ as keys and the corresponding probabilities as values.
Return type:dict
final_states()[source]

Give the list of final states in the HMM (i.e. corresponding to the last matching position in the TFFM).

Returns:A list of final states as int.
Return type:list
get_emission_update_pos_proba(position_proba, position, previous_position_proba, index)[source]

Get the emission probabilities of ACGT at position position and update the emission probabilities in position_proba given the emission probabilities at the previous position (previous_position_proba).

Note:

This function is applied state by state. In a detailed TFFM, several states represent the same position, which is why the probabilities listed in position_proba need to be updated.

Parameters:
  • position_proba (dict) – Probabilities of getting ACGT at the current position that need to be updated.
  • position (int) – Current position in the motif.
  • previous_position_proba (dict) – Probabilities of getting ACGT at the previous position.
  • index – Represents the index of the state of the TFFM to be analyzed at the current position.
Returns:

The emission probabilities of ACGT by the state indexed by index at position position in the TFFM.

Return type:

list

get_information_content()[source]

Give the information content of the whole TFFM.

Returns:A float corresponding to the information content of the TFFM.
Return type:float
get_position_start()[source]

Give the position of the first matching state.

Returns:The position of the first matching state of the TFFM.
Return type:float
Warning:The position is given 0-based.
get_positions_ic()[source]

Give the information content for every position of the motif modeled by the TFFM.

Returns:A list of floats giving the information contents of the positions.
Return type:list
Note:The output is an ordered list following the order of the positions within the motif.
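
For intuition, the per-position information content of a DNA motif is commonly defined as 2 bits minus the Shannon entropy of the emission probabilities at that position. The sketch below uses that common definition on independent columns. Whether the module uses exactly this formula is an assumption, and the function names are hypothetical. In a 1st-order or detailed TFFM the emissions depend on the previous nucleotide, so the module's computation is necessarily more involved.

```python
import math


def position_ic(emissions):
    """Information content (bits) of one column given in A, C, G, T order."""
    entropy = -sum(p * math.log2(p) for p in emissions if p > 0.0)
    return 2.0 - entropy


def positions_ic(columns):
    """Information content of every column, in motif order."""
    return [position_ic(column) for column in columns]


# A uniform column carries no information; a fixed nucleotide carries 2 bits.
ics = positions_ic([[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]])
total_ic = sum(ics)
```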
get_significant_positions(threshold)[source]

Get the first and last significant positions of the TFFM, where the insignificant positions are the ones on the edges with low information content.

Parameters:threshold (float) – The minimal information content to consider a position to be significant.
Returns:The first and last positions that are to be considered significant (given in this order).
Return type:tuple
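
The edge-trimming idea can be sketched on a plain list of information content values. The helper below is hypothetical, assumes 0-based inclusive positions and a "greater than or equal" comparison, and returns None when no position qualifies. None of these details are confirmed by the text above.

```python
def significant_positions(ics, threshold):
    """Return (first, last) indices whose information content >= threshold.

    Only the edges are trimmed: low-information positions lying between
    two significant positions stay inside the returned range.
    """
    keep = [index for index, ic in enumerate(ics) if ic >= threshold]
    if not keep:
        return None
    return keep[0], keep[-1]


bounds = significant_positions([0.1, 0.2, 1.5, 0.3, 1.8, 0.05], 1.0)  # (2, 4)
```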
get_trimmed(threshold, new_name='TFFM')[source]

Trim the current TFFM by removing edges with low information content.

Parameters:
  • threshold (float) – The minimal information content value for an edge TFFM match position to be kept.
  • new_name (str) – Name of the new TFFM to create (default: ‘TFFM’).
Returns:

A TFFM corresponding to the current TFFM trimmed.

Return type:

TFFM

See also:

trim_in_place()

pocc_sequences(seq_file, threshold=0.0)[source]

Apply the TFFM on the fasta sequences and return the Pocc value (probability of occupancy) for each sequence.

Parameters:
  • seq_file (str) – Fasta file giving the DNA sequences to apply the TFFM on.
  • threshold (float) – The threshold used to predict hits that will be used to compute the Pocc (default: 0.0).
Returns:

Pocc values through a generator.

Return type:

Generator of HIT

Note:

0.0 <= threshold <= 1.0
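
A probability of occupancy is usually defined as the probability that at least one site in the sequence is bound, that is 1 minus the product of (1 - p) over the hit probabilities p. The sketch below implements that usual definition. That pocc_sequences() computes exactly this quantity is an assumption, and the function name is hypothetical.

```python
def probability_of_occupancy(hit_probabilities):
    """Probability that at least one of the hits is bound.

    Assumption: hits are treated as independent, so the probability that
    none is bound is the product of (1 - p) over all hits.
    """
    unbound = 1.0
    for proba in hit_probabilities:
        unbound *= 1.0 - proba
    return 1.0 - unbound


pocc = probability_of_occupancy([0.5, 0.5])  # 1 - 0.5 * 0.5 = 0.75
```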

Print the svg code of the corresponding dense logo (i.e. displaying the dinucleotide dependencies captured by the TFFM).

Parameters:output (file) – Stream where to output the svg (default: sys.stdout).
Note:The output argument is not a file name but it is an already open file stream.

Print the svg code of the corresponding summary logo (i.e. similar to a regular sequence logo).

Parameters:output (file) – Stream where to output the svg (default: sys.stdout).
Note:The output argument is not a file name but it is an already open file stream.
scan_sequence(sequence, threshold=0.0, only_best=False)[source]

Apply the TFFM on the fasta sequence and return the TFBS hits.

Parameters:
  • sequence (Bio.SeqRecord) – DNA sequence to apply the TFFM on.
  • threshold (float) – The threshold used to predict a hit (i.e. the minimal probability value for a position to be considered a TFBS hit) (default: 0.0).
  • only_best (bool) – Argument to be set to True if only the best TFBS hit per sequence is to be reported (default: False)
Returns:

TFBS hits.

Return type:

list of HIT

Note:

0.0 <= threshold <= 1.0

scan_sequences(seq_file, threshold=0.0, only_best=False)[source]

Apply the TFFM on the fasta sequences and return the TFBS hits.

Parameters:
  • seq_file (str) – Fasta file giving the DNA sequences to apply the TFFM on.
  • threshold (float) – The threshold used to predict a hit (i.e. the minimal probability value for a position to be considered a TFBS hit) (default: 0.0).
  • only_best (bool) – Argument to be set to True if only the best TFBS hit per sequence is to be reported (default: False)
Returns:

TFBS hits through a generator.

Return type:

Generator of HIT

Note:

0.0 <= threshold <= 1.0

train(training_file, epsilon=0.0001, max_iter=500)[source]

Train the TFFM using the fasta sequences to learn emission and transition probabilities.

Note:

The training of the underlying HMM is made using the Baum-Welch algorithm.

Parameters:
  • training_file (str) – The fasta file of the sequences to train the TFFM on.
  • epsilon (float) – The cut-off on the relative improvement in likelihood compared to the previous iteration of the Baum-Welch algorithm (default: 0.0001).
  • max_iter (int) – The maximum number of iterations of the Baum-Welch algorithm used to re-estimate the probabilities (default: 500).
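
Putting the documented calls together, a typical build, train, and scan sequence might look like the sketch below. It only uses signatures listed on this page. The wrapper name `train_and_scan` and all file paths are hypothetical, and the value to pass as `kind` is left to the caller because this page describes it both as a TFFM_KIND enum member and as a string. The import is deferred so that the function can be defined without the TFFM package and ghmm installed.

```python
def train_and_scan(meme_output, training_fasta, scan_fasta, kind, threshold=0.5):
    """Initialize a TFFM from MEME output, train it, and scan sequences.

    Assumption: tffm_module and its ghmm dependency are importable.
    """
    import tffm_module  # deferred import, needs the TFFM package and ghmm

    tffm = tffm_module.tffm_from_meme(meme_output, kind)
    # Baum-Welch training with the documented default settings.
    tffm.train(training_fasta, epsilon=0.0001, max_iter=500)
    # scan_sequences() yields hits through a generator; collect them in a list.
    return list(tffm.scan_sequences(scan_fasta, threshold=threshold, only_best=True))
```

A call would then take the path to a MEME output file, two fasta files, and the chosen kind.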
trim_in_place(threshold)[source]

Trim the current TFFM by removing edges with low information content.

Parameters:threshold (float) – The minimal information content value for an edge TFFM match position to be kept.
Warning:Trims the TFFM in place. To preserve the TFFM, use the get_trimmed() method which returns a trimmed copy of the TFFM but does not alter this TFFM.
See also:get_trimmed()
tffm_module.background_emission_proba_1storder(tffm)[source]
tffm_module.background_emission_proba_detailed(tffm)[source]
tffm_module.best_hit(hits_positive_strand, hits_negative_strand)[source]

Give the best hit in a sequence by considering both positive and negative strands.

Parameters:
  • hits_positive_strand (list of HIT) – The list of hits on the positive strand.
  • hits_negative_strand (list of HIT) – The list of hits on the negative strand.
Returns:

The best hit (None if no hit).

Return type:

HIT
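
The selection can be pictured as taking the highest-scoring hit across both lists. In this sketch, `SketchHit` and its `score` field are hypothetical stand-ins for the HIT class, and skipping None entries is an assumption based on the note on merge_hits().

```python
from collections import namedtuple

# Hypothetical stand-in for the HIT class, for illustration only.
SketchHit = namedtuple("SketchHit", ["position", "strand", "score"])


def best_of_both_strands(hits_positive, hits_negative):
    """Return the highest-scoring hit over both strands, or None if no hit."""
    candidates = [
        hit for hit in list(hits_positive) + list(hits_negative) if hit is not None
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda hit: hit.score)


best = best_of_both_strands(
    [SketchHit(3, "+", 0.4)], [SketchHit(7, "-", 0.8), None]
)
```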

tffm_module.compute_entropy(emissions)[source]

Compute the entropy given the emission probabilities of the ACGT nucleotides.

Parameters:emissions (list of float) – Emission probabilities of the ACGT nucleotides.
Returns:The computed entropy.
Return type:float
Warning:The list gives the probabilities corresponding to A, C, G, and T in this order.
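
For reference, the Shannon entropy of four emission probabilities can be computed as below. This is a sketch of the standard formula in bits (base-2 logarithm). The base used by the module is not stated here, so it is an assumption, as is the function name.

```python
import math


def shannon_entropy(emissions):
    """Entropy in bits of probabilities given in A, C, G, T order.

    Zero probabilities contribute nothing, following the usual convention
    that 0 * log(0) is taken to be 0.
    """
    return -sum(p * math.log2(p) for p in emissions if p > 0.0)


uniform_entropy = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # maximal: 2 bits
```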
tffm_module.create_0order_hmm(nb_seq, nb_residues, first_letters, motif)[source]

Create a 0-order HMM initialized from MEME result

Parameters:
  • nb_seq (int) – Number of sequences used by MEME
  • nb_residues (int) – Number of residues used by MEME
  • first_letters (dict of str->int) – Number of occurrences of ACGT at the beginning of sequences used by MEME
  • motif (Bio.motifs) – PFM as a Biopython motif to be used to initialize the TFFM
Returns:

The constructed HMM

Return type:

ghmm.DiscreteEmissionHMM

tffm_module.create_1storder_hmm(nb_seq, nb_residues, first_letters, motif)[source]

Create a 1st-order HMM initialized from MEME result

Parameters:
  • nb_seq (int) – Number of sequences used by MEME
  • nb_residues (int) – Number of residues used by MEME
  • first_letters (dict of str->int) – Number of occurrences of ACGT at the beginning of sequences used by MEME
  • motif (Bio.motifs) – PFM as a Biopython motif to be used to initialize the TFFM
Returns:

The constructed HMM

Return type:

ghmm.DiscreteEmissionHMM

tffm_module.create_detailed_hmm(nb_seq, nb_residues, first_letters, motif)[source]

Create a detailed HMM initialized from MEME result

Parameters:
  • nb_seq (int) – Number of sequences used by MEME
  • nb_residues (int) – Number of residues used by MEME
  • first_letters (dict of str->int) – Number of occurrences of ACGT at the beginning of sequences used by MEME
  • motif (Bio.motifs) – PFM as a Biopython motif to be used to initialize the TFFM
Returns:

The constructed HMM

Return type:

ghmm.DiscreteEmissionHMM

tffm_module.merge_hits(hits_positive_strand, hits_negative_strand, only_best)[source]

Merge the hits from both strands.

Parameters:
  • hits_positive_strand (list of HIT) – The list of hits on the positive strand.
  • hits_negative_strand (list of HIT) – The list of hits on the negative strand.
  • only_best (bool) – Boolean set to True only if the best TFBS hit in the sequence is to be kept.
Returns:

A list containing the TFBS hits (empty if no hit).

Return type:

list

Note:

The two input lists are required to be ordered following the positions on the sequence. The best hit per position is given. When no hit has been found at a position, the constant None is used.
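
One way to read this note is that both lists are indexed by position and hold None where no hit was found. Under that assumption, which is not confirmed by the text, the merge can be sketched as follows. `SketchHit` and `merge_by_position` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for the HIT class, for illustration only.
SketchHit = namedtuple("SketchHit", ["position", "strand", "score"])


def merge_by_position(hits_positive, hits_negative, only_best=False):
    """Keep the better hit at each position; None marks 'no hit here'."""

    def better(first, second):
        if first is None:
            return second
        if second is None:
            return first
        return first if first.score >= second.score else second

    merged = [better(pos, neg) for pos, neg in zip(hits_positive, hits_negative)]
    if only_best:
        found = [hit for hit in merged if hit is not None]
        return [max(found, key=lambda hit: hit.score)] if found else []
    return merged


positive = [SketchHit(0, "+", 0.2), None, SketchHit(2, "+", 0.9)]
negative = [SketchHit(0, "-", 0.6), None, None]
merged_hits = merge_by_position(positive, negative)
```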

tffm_module.tffm_from_meme(meme_output, kind, name='TFFM')[source]

Construct a TFFM from the output of MEME on ChIP-seq data.

Parameters:
  • meme_output (str) – File containing the output of MEME.
  • kind (str) – Type of TFFM to construct between ‘1st-order’ and ‘detailed’.
  • name (str) – Name of the TFFM (default: “TFFM”)
Returns:

The TFFM initialized from MEME results.

Return type:

TFFM

Note:

As the PFM is used to initialize the TFFM, a pseudocount of 1 is added to all the values in the PFM
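
The effect of the pseudocount can be illustrated on a small count matrix. This sketch represents the PFM as a plain dict of per-position counts rather than a Biopython motif, and the helper name is hypothetical. It shows only the arithmetic of adding 1 to every count before normalizing each position.

```python
def pfm_to_probabilities(pfm, pseudocount=1.0):
    """Add a pseudocount to every count, then normalize each position.

    pfm maps each of 'A', 'C', 'G', 'T' to a list of counts per position.
    """
    letters = "ACGT"
    length = len(pfm["A"])
    probabilities = {letter: [] for letter in letters}
    for index in range(length):
        total = sum(pfm[letter][index] + pseudocount for letter in letters)
        for letter in letters:
            probabilities[letter].append((pfm[letter][index] + pseudocount) / total)
    return probabilities


counts = {"A": [3, 0], "C": [0, 0], "G": [0, 4], "T": [1, 0]}
probas = pfm_to_probabilities(counts)
```

With the pseudocount, nucleotides never observed at a position still get a non-zero initial probability (1/8 for C at the first position here), so training is not locked out of them.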

tffm_module.tffm_from_motif(motif, kind, name='TFFM', nb_res=None, first_letters=None)[source]

Construct an initialized TFFM from a PFM.

Parameters:
  • motif (Bio.motifs) – PFM as a Biopython motif to be used to initialize the TFFM
  • kind (str) – Type of TFFM to construct between ‘1st-order’ and ‘detailed’
  • name (str) – Name of the TFFM (default: “TFFM”)
  • nb_res – Number of residues to be used to compute the background->foreground transition probabilities (default: 0 meaning that we will assume a 100nt length for the ChIP-seq used to derive the PFM)
  • first_letters (dict of str->int) – Number of occurrences of ACGT at the beginning of sequences in the background (default: values giving equiprobabilities for A, C, G, and T)
Returns:

The TFFM initialized from the PFM

Return type:

TFFM

See also:

tffm_from_meme()

Note:

As the PFM is used to initialize the TFFM, a pseudocount of 1 is added to all the values in the PFM

tffm_module.tffm_from_xml(xml, kind)[source]

Construct a TFFM described in an XML file.

Parameters:
  • xml (str) – File containing the TFFM description in XML format.
  • kind (str) – Type of TFFM to construct between ‘1st-order’ and ‘detailed’.
Returns:

The TFFM described in the XML file.

Return type:

TFFM

Module author: Anthony Mathelier <amathelier@cmmt.ubc.ca>
