Screening for Pathogenic Genes in Biological Sequences (Graduation Thesis)

2022-02-02 22:16:35

Total thesis length: 15,176 characters

Abstract

The specific position of a gene on a chromosome is called a locus. Studying how loci relate to genetic diseases and traits helps researchers understand the mechanisms of those diseases more deeply, and may even help people prevent them at the source. This thesis uses statistical methods to examine the association between loci and genetic disease. The first part applies one-way analysis of variance, contingency table analysis, and hypothesis testing to identify single pathogenic loci; the results of the three methods are compared and then checked with a chi-square test. The second part uses Logistic regression and Lasso regression to identify multiple pathogenic loci and pathogenic genes. The third part uses canonical correlation analysis to examine the correlation between multiple traits and loci. The models were built with software including Matlab and SAS. The analysis proceeds step by step, from the association between a single locus and a trait, to the association between genes and a trait, to the correlation between multiple traits and loci. In each part several methods are compared and the results are statistically tested, which supports the objectivity and accuracy of the findings.

Keywords: genetic locus (SNP); one-way analysis of variance; contingency table analysis; hypothesis testing; chi-square test; Logistic regression; Lasso algorithm; canonical correlation analysis

Statement of Integrity

Abstract (Chinese)

Abstract (English)

Chapter 1 Introduction

Chapter 2 Sample Preprocessing

2.1 Hardy-Weinberg equilibrium (HWE) test of the samples

2.2 Sample conversion

Chapter 3 Identifying Single Pathogenic Loci

3.1 One-way analysis of variance

3.2 Hypothesis testing

3.3 Contingency table analysis

3.4 Chi-square test

3.5 Chapter summary

Chapter 4 Joint Analysis of Multiple Pathogenic Loci

4.1 Model selection

4.1.1 Logistic regression

4.1.2 Stepwise regression

4.2 Model construction

4.3 Lasso algorithm

4.3.1 Introduction to the Lasso algorithm

4.3.2 Implementing Lasso regression

4.4 Variable selection

Chapter 5 Identifying Pathogenic Genes

5.1 Model construction

5.2 Experimental results

Chapter 6 Loci Associated with Multiple Traits

6.1 Canonical correlation analysis

6.1.1 Computing canonical correlation coefficients and testing their significance

6.2 Model construction

6.3 Experimental results

Chapter 7 Conclusion

References

Acknowledgements

Introduction

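DNA, or deoxyribonucleic acid, is found in human chromosomes and is the main carrier of genetic information; genes are the segments of DNA that store that information. A DNA molecule is a chain with a double-helix structure, built from linked deoxyribonucleotides. These nucleotides carry one of four bases, A, G, C, or T, and a complete DNA chain contains hundreds of millions of base pairs. Among all these base pairs, the nucleotides at certain specific positions vary frequently, and this variation is the direct cause of DNA diversity. Such nucleotides are called loci (single nucleotide polymorphisms, SNPs). The structural relationship among chromosomes, genes, and loci is shown in the figure below. A growing body of research shows that some loci, or genes containing several loci, are closely linked to hereditary diseases and traits. Studying the association between loci and genetic diseases and traits therefore helps researchers understand the mechanisms of genetic disease more deeply, and may even help people prevent such diseases at the source.

As science and technology have advanced, more and more methods have become available for identifying pathogenic loci or pathogenic genes. The most widely used is the genome-wide approach. A large number of samples is collected from the population, including both people with a given genetic disease and healthy people. The two groups are distinguished by different codes, for example 0 and 1 for affected and healthy samples. For each sample, a locus is usually recorded as a pair of bases, such as AA, AG, or GG, because each person carries two copies of each chromosome and therefore two bases at each locus. Comparing the affected and healthy samples can reveal pathogenic loci and thereby clarify how the disease is inherited.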
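To make the genotype notation concrete, the sketch below turns the base-pair strings at one locus into numbers that statistical methods can work with. It is an illustration only: it assumes an additive coding (the count of the less frequent base, 0, 1, or 2), which is a common convention but is not confirmed here as the conversion this thesis uses in Section 2.2. The function name `encode_locus` and the sample genotypes are hypothetical.

```python
from collections import Counter


def encode_locus(genotypes):
    """Code the genotypes at one locus as counts of the minor allele.

    `genotypes` is a list of two-letter strings such as "AA", "AG", "GG".
    Additive 0/1/2 coding is an assumption made for this sketch.
    """
    # Count how often each base occurs across all samples at this locus.
    allele_counts = Counter("".join(genotypes))
    if len(allele_counts) > 2:
        raise ValueError("expected at most two distinct bases at one locus")
    if len(allele_counts) < 2:
        # Only one base observed: no variation, so every sample is coded 0.
        return [0] * len(genotypes)
    # Minor allele = the less frequent base; ties are broken alphabetically
    # so that the result is deterministic.
    minor = min(allele_counts, key=lambda base: (allele_counts[base], base))
    return [g.count(minor) for g in genotypes]


# Hypothetical genotypes for five samples at a single locus.
example_genotypes = ["AA", "AG", "GG", "AA", "AG"]
example_codes = encode_locus(example_genotypes)
print(example_codes)
```

In the example, G is the less frequent base, so each code is the number of G bases in that sample's genotype. Other codings (for instance, treating the three genotypes as unordered categories) are equally possible and may suit methods such as contingency table analysis better.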

Sample Preprocessing

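This thesis is based on 1000 samples, each with information on a certain genetic disease (or trait) and the coded information for 9445 loci. The first 500 samples are affected and the last 500 are healthy [1].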
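The data layout just described can be sketched in a few lines: a status label for each of the 1000 samples, and, for any one locus, a table of genotype counts by status, which is the input to a contingency table analysis and chi-square test. This is a minimal sketch under stated assumptions, not the thesis's implementation. The direction of the 0/1 coding, the function names, and the genotype counts in the example are all hypothetical, and the genotypes are synthetic rather than taken from the real data.

```python
from collections import Counter

N_CASES, N_CONTROLS = 500, 500

# Status labels in sample order: first 500 affected, last 500 healthy.
# Using 1 for affected and 0 for healthy is an assumption of this sketch.
labels = [1] * N_CASES + [0] * N_CONTROLS


def contingency_table(genotypes, labels):
    """Count genotypes by status; rows are (affected, healthy)."""
    counts = Counter(zip(labels, genotypes))
    columns = sorted(set(genotypes))
    table = [[counts[(status, g)] for g in columns] for status in (1, 0)]
    return columns, table


def chi_square_statistic(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Synthetic genotypes for one hypothetical locus (not real data).
toy_genotypes = (["AA"] * 300 + ["AG"] * 150 + ["GG"] * 50      # affected
                 + ["AA"] * 200 + ["AG"] * 200 + ["GG"] * 100)  # healthy

columns, table = contingency_table(toy_genotypes, labels)
statistic = chi_square_statistic(table)
print(columns, table, round(statistic, 4))
```

With three genotypes and two status groups the statistic has 2 degrees of freedom, so it would be compared against the chi-square critical value of 5.991 at the 0.05 level. Repeating this for each of the 9445 loci would also call for a multiple-testing correction, which the sketch leaves out.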
