Application of Signal Processing Combined with the k-mer Method to Biological Sequence Similarity (Undergraduate Thesis)
2020-04-21 17:12:38
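Abstract
Computational molecular biology is an emerging interdisciplinary field. Using computers and networks as tools, it applies knowledge from mathematics, information science, and related disciplines to study and analyze biological macromolecules such as nucleic acids and proteins. Biological sequence alignment is one of the main research directions in computational molecular biology and an important tool for studying homology between species. This thesis combines methods from signal processing with biological principles and uses features of DNA to analyze DNA sequence similarity.
Common approaches to DNA sequence similarity analysis include geometric (graphical) representation and k-mer frequency statistics. Geometric representation mainly converts a biological sequence into a visualizable curve in three-dimensional space, whereas k-mer frequency statistics mainly reveal regularities in the subsequences of a biological sequence. In this thesis, DNA sequences are mapped to numerical sequences using the k-mer method and then transformed with the discrete Fourier transform (DFT) and the fast Fourier transform (FFT) respectively. A pattern feature vector for each DNA sequence is constructed from the pattern feature frequencies, the cophenetic correlation coefficient is used to select the inter-class distance, and the pattern feature vectors are clustered to build an evolutionary tree. The thesis discusses k-mer frequency statistics, the DFT and FFT, and the construction of DNA evolutionary trees, and concludes that reverse-order dinucleotides based on 2-mers extract the biological features of DNA sequences more effectively and that the FFT effectively reduces computational complexity.
The contributions of this thesis are: a review of the history and methods of sequence alignment; a DNA numerical mapping that takes nucleotide molecular weight into account; the use of the FFT to construct pattern feature vectors that extract DNA features; and a best-practice flow chart for DNA sequence similarity analysis. We also summarize the problems this thesis leaves unresolved and outline the work planned for future research.
Keywords: DNA sequence similarity analysis; k-mer method; Fourier transform; pattern feature vector; cluster analysis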
The Application of Signal Processing Combined with k-mer Method in Biological Sequence Similarity
Abstract
Computational molecular biology is an emerging interdisciplinary field. It uses computers and networks as tools and applies theories and methods from mathematics and information science to study and analyze biological macromolecules such as nucleic acids and proteins. Biological sequence similarity analysis is one of the most important and fundamental research directions in computational molecular biology, and an important means of studying homology between species. In this paper, methods from signal processing are combined with biological principles, and features of DNA are used to analyze the similarity of DNA sequences.
Common methods for DNA sequence similarity analysis include geometric representation and k-mer frequency statistics. Geometric representation mainly converts a biological sequence into a visualizable curve in three-dimensional space, while k-mer frequency statistics mainly reveal regularities in the subsequences of a biological sequence. In this paper, k-mer frequency statistics are used to map DNA sequences to numerical sequences, and the discrete Fourier transform (DFT) and the fast Fourier transform (FFT) are then applied respectively. The pattern feature vector of each DNA sequence is constructed from the pattern feature frequencies. The cophenetic correlation coefficient is used to select the inter-class distance, after which the pattern feature vectors are clustered and an evolutionary tree is built. We discuss k-mer frequency statistics, the DFT and FFT, and the construction of DNA evolutionary trees. We conclude that reverse-order dinucleotides based on 2-mers extract the biological features of DNA sequences more effectively, and that the FFT effectively reduces computational complexity.
The contributions of this paper are to review the history and methods of sequence alignment, to take the molecular weight of nucleotides into account in the DNA numerical mapping, to introduce the FFT for constructing pattern feature vectors that extract DNA features, and to give a best-practice flow chart for DNA sequence similarity analysis. We also summarize the problems that remain unsolved in this paper and outline the work we plan for future research.
Keywords: DNA sequence similarity analysis; k-mer method; Fourier transform; pattern feature vector; cluster analysis
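The pipeline summarized in the abstract (k-mer frequency statistics, numerical mapping, then a Fourier transform) can be illustrated with a minimal Python sketch. The thesis itself works in MATLAB and its mapping accounts for nucleotide molecular weight; neither is reproduced here. The function names and the mapping in `kmer_signal` (each position replaced by the relative frequency of the k-mer starting there) are assumptions made only for illustration, and the transform is a naive O(N^2) DFT, not an FFT.

```python
import cmath
from collections import Counter


def kmer_counts(seq, k=2):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_signal(seq, k=2):
    """Hypothetical mapping (not the thesis's): replace each position by
    the relative frequency of the k-mer that starts at that position."""
    seq = seq.upper()
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    return [counts[seq[i:i + k]] / total for i in range(len(seq) - k + 1)]


def dft_power_spectrum(signal):
    """Naive O(N^2) DFT; returns |X[f]|^2 for f = 0..N-1."""
    n = len(signal)
    spectrum = []
    for f in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


if __name__ == "__main__":
    signal = kmer_signal("ACGTACGTTGCA", k=2)
    print(dft_power_spectrum(signal))
```

An FFT computes the same spectrum in O(N log N) operations, which is the reduction in computational complexity the abstract refers to.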
Contents
Abstract
ABSTRACT
Chapter 1 Introduction
1.1 Research Background
1.2 Literature Review
1.2.1 Overview of Alignment-Based Models
1.2.2 Overview of Alignment-Free Models
1.3 Research Methods and Contributions of This Thesis
Chapter 2 Numerical Representation of DNA Sequences
2.1 Background on DNA
2.2 Converting Character Sequences into Numerical Sequences
2.2.1 Frequency Extraction with the k-mer Method
2.2.2 Numerical Mapping
Chapter 3 Analysis of DNA Sequences Based on Signal Processing Methods
3.1 Fourier Transform
3.1.1 Discrete Fourier Transform
3.1.2 Fast Fourier Transform
3.2 Power Spectrum and Signal-to-Noise Ratio
3.3 Construction of the Pattern Feature Vector
3.4 Cluster Analysis
Chapter 4 Empirical Study Based on MATLAB
4.1 Study of the Fourier Transform
4.2 Study of the Evolutionary Tree of a DNA Sequence Group
Chapter 5 Summary and Outlook
5.1 Summary
5.2 Outlook
References
Appendix
Acknowledgements
Papers Published During Undergraduate Study
Chapter 1 Introduction
1.1 Research Background