FAQ
Frequently Asked Questions (and Answers)
Q: What is the architecture of the Basenji2 models?
A: Basenji2 is a deep learning model designed for genomic sequence analysis. It is primarily used to interpret noncoding genetic variants and to predict their impact on gene expression levels.
Here, we modified its architecture to build gene expression prediction models for plant species and to advance the development of plant gene expression prediction.
The model consists of seven ConvBlocks, several dilated residual blocks, a convolution layer, and a fully connected layer with one node. The number of blocks and the pooling sizes were chosen to give a 192-bp bin size, which can cover two nucleosome core particles.
The dilated residual blocks spread information across the sequence and model long-range interactions.
The number of dilated residual blocks is specific to each species. For structural details, please refer to
https://github.com/liulifenyf/PlantCRE.
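To illustrate why dilated blocks help with long-range interactions, the following minimal sketch computes the receptive field of a stack of stride-1 dilated convolutions. The kernel width (3) and the doubling dilation schedule are hypothetical values chosen for illustration and are not the PlantCRE settings. The conversion to base pairs assumes the dilated blocks operate on the 192-bp bins.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in input positions, of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation positions.
        rf += (kernel_size - 1) * d
    return rf


# Hypothetical settings: kernel width 3, dilation doubling in each block.
dilations = [2 ** i for i in range(6)]      # 1, 2, 4, 8, 16, 32
bins_seen = receptive_field(3, dilations)   # 1 + 2 * (1 + 2 + ... + 32) = 127
bin_size_bp = 192                           # bin size stated in the answer above
print(bins_seen, bins_seen * bin_size_bp)   # bins covered, base pairs covered
```

With these assumed values, six blocks already cover 127 bins, whereas six undilated layers of the same width would cover only 13.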
Q: What is the input of the Basenji2 models?
A: For each gene, we first extract a segment of genomic sequence around its transcription start site (TSS) and then one-hot encode it. The encoded sequence serves as the input to the model. The length of the genomic segment varies by species; for details, please refer to the introduction of each model.
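The input preparation described above can be sketched as follows. This is a minimal illustration, not the PlantCRE preprocessing code: the helper names `extract_window` and `one_hot_encode`, the symmetric `flank` parameter, and the N-padding at chromosome ends are assumptions, and strand handling is omitted.

```python
# One-hot channels in the order A, C, G, T; any other base (e.g. N) is all zeros.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def extract_window(chrom_seq, tss, flank):
    """Return the 2 * flank bases centred on a 0-based TSS, padding with N off the ends."""
    return "".join(
        chrom_seq[i] if 0 <= i < len(chrom_seq) else "N"
        for i in range(tss - flank, tss + flank)
    )


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows."""
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in seq.upper()]


window = extract_window("ACGTACGT", tss=1, flank=3)
encoded = one_hot_encode(window)
print(window, len(encoded))
```

In practice the flank length would be set per species, as noted in the answer.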
Q: What is the output of the Basenji2 models?
A: For Zea mays, we used the maximum TPM across multiple RNA-seq experiments from different tissues as the target gene expression level. For the other three species, we used the median TPM across multiple RNA-seq experiments as the gene expression level. All RNA-seq experiments used to derive the outputs are listed at
https://github.com/liulifenyf/PlantCRE.
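As a small illustration of how a per-gene target could be derived from several experiments, the sketch below collapses a list of TPM values with either the maximum or the median. It takes the statistic for the non-maize species to be the median, which is an interpretation of the answer above. The function name, the `statistic` argument, and the TPM values are hypothetical, and any further transformation of the target is not covered here.

```python
from statistics import median


def expression_target(tpm_values, statistic):
    """Collapse one gene's TPM values across experiments into a single target."""
    if statistic == "max":
        return max(tpm_values)
    if statistic == "median":
        return median(tpm_values)
    raise ValueError(f"unknown statistic: {statistic}")


# Made-up TPM values for one gene across five RNA-seq experiments.
tpm = [0.5, 12.0, 3.2, 7.1, 1.0]
print(expression_target(tpm, "max"))     # maximum across experiments
print(expression_target(tpm, "median"))  # median across experiments
```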
Q: How are contribution scores obtained for a gene?
A: We use the gradient × input interpretability algorithm to calculate contribution scores. Gradient × input is a gradient-based method that estimates contribution scores by back-propagation through the network. Specifically, given a one-hot encoded input sequence (A = [1,0,0,0], C = [0,1,0,0], G = [0,0,1,0], T = [0,0,0,1], N = [0,0,0,0]), we first calculate the gradient of the model output with respect to the input and then take the element-wise product of the gradient and the input. We then average the resulting contribution scores over the four base channels at each position. This yields one contribution score per base, so the score vector has the same length as the input sequence.
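The steps above can be sketched on a toy example. The `toy_model` function below is a hypothetical stand-in for the trained network, and the gradient is approximated by finite differences so that the sketch has no dependencies. With a real model, the gradient would come from the deep learning framework's automatic differentiation instead.

```python
import math


def toy_model(x, w):
    """Hypothetical stand-in for the trained network: tanh of a weighted sum."""
    return math.tanh(sum(w[i][j] * x[i][j] for i in range(len(x)) for j in range(4)))


def gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x, with the same shape as x."""
    grad = [[0.0] * 4 for _ in x]
    for i in range(len(x)):
        for j in range(4):
            original = x[i][j]
            x[i][j] = original + eps
            up = f(x)
            x[i][j] = original - eps
            down = f(x)
            x[i][j] = original  # restore the input
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def gradient_x_input(f, x):
    """Element-wise gradient * input, averaged over the four base channels."""
    grad = gradient(f, x)
    return [sum(g * v for g, v in zip(grad[i], x[i])) / 4 for i in range(len(x))]


# One-hot input for the sequence "ACN" and made-up weights.
x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
w = [[0.5, -0.2, 0.1, 0.0], [0.3, 0.4, -0.1, 0.2], [0.1, 0.1, 0.1, 0.1]]
scores = gradient_x_input(lambda inp: toy_model(inp, w), x)
print(scores)  # one score per position; the N position scores zero
```

Because the input is one-hot, only the channel of the observed base contributes at each position, and an N position always receives a score of zero.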
Q: How are candidate CREs identified from the contribution scores?
A: To identify CREs for each gene, we developed a peak-calling algorithm based on per-base contribution scores. The detailed code used for peak calling can be found at
https://github.com/liulifenyf/PlantCRE.
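For readers unfamiliar with peak calling on a per-base score track, here is a generic sketch of the idea. It is not the PlantCRE algorithm, which lives in the repository above. The thresholding rule, the `min_length` and `max_gap` parameters, and the example scores are all assumptions made for illustration.

```python
def call_peaks(scores, threshold, min_length=1, max_gap=0):
    """Return half-open (start, end) intervals of positions scoring >= threshold.

    Above-threshold positions separated by at most max_gap positions are merged,
    and intervals shorter than min_length are discarded.
    """
    peaks = []
    start = last = None
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        if start is not None and i - last - 1 <= max_gap:
            last = i  # extend the current interval
        else:
            if start is not None and last - start + 1 >= min_length:
                peaks.append((start, last + 1))
            start = last = i  # open a new interval
    if start is not None and last - start + 1 >= min_length:
        peaks.append((start, last + 1))
    return peaks


# Made-up contribution scores for eight positions.
scores = [0.0, 0.9, 0.8, 0.1, 0.7, 0.0, 0.0, 0.6]
print(call_peaks(scores, threshold=0.5, min_length=2, max_gap=1))
```

With these settings, positions 1 to 4 merge into a single interval and the isolated position 7 is dropped for being too short.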
Q: How can I obtain all candidate CREs identified by PlantCRE for a specific species?