用SeqinR包在NCBI获取基因组序列并分析
这里是网页版获取DNA序列,下载保存后可以用read.fasta
打开
##########################
用SeqinR包获取序列并进行统计
##########################
比如,在NCBI获取NC_001477登革病毒的基因组序列,
安装加载seqinr
包
install.packages("seqinr")
library(seqinr)
choosebank()#list all the sub-database in ACNUC
> choosebank()
[1] "genbank""embl""emblwgs""swissprot"
[5] "ensembl""hogenom7""hogenom""hogenomdna"
[9] "hovergendna""hovergen""hogenom5""hogenom5dna"
[13] "hogenom4""hogenom4dna""homolens""homolensdna"
[17] "hobacnucl""hobacprot""phever2""phever2dna"
[21] "refseq""refseq16s""greviews""bacterial"
[25] "archaeal""protozoan""ensprotists""ensfungi"
[29] "ensmetazoa""ensplants""ensemblbacteria" "mito"
[33] "polymorphix""emglib""refseqViruses""ribodb"
[37] "taxodb"
"genebank"包含DNA和RNA序列数据库比如要获取DEN-1登革病毒基因组序列,accesion number NC_001477
“refseq”包含"Refseq”中DNA和RNA
"refseqViruses”包含Refseq中病毒的DNA,RNA和蛋白序列
更详细的见http://doua.prabi.fr/databases/acnuc
1 构造一个函数,由Accession number直接下载所需要的序列
getncbiseq <- function(accession)
{
require("seqinr") # this function requires the SeqinR R package
# first find which ACNUC database the accession is stored in:
dbs <- c("genbank","refseq","refseqViruses","bacterial")
numdbs <- length(dbs)
for (i in 1:numdbs)
{
db <- dbs[i]
choosebank(db)
# check if the sequence is in ACNUC database 'db':
resquery <- try(query(".tmpquery", paste("AC=", accession)), silent = TRUE)
if (!(inherits(resquery, "try-error")))
{
queryname <- "query2"
thequery <- paste("AC=",accession,sep="")
query2 <- query(queryname, thequery)
# see if a sequence was retrieved:
seq <- getSequence(query2$req[[1]])
closebank()
return(seq)
}
closebank()
}
print(paste("ERROR: accession",accession,"was not found"))
}
2.根据accession number下载序列
dengueseq <- getncbiseq("NC_001477")
> dengueseq
[1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g" "a"
[20] "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c" "g" "a"
[39] "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a" "a" "c" "g"
[58] "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t" "t" "t" "t" "a"
[77] "t" "t" "a" "g" "a" "g" "a" "g" "c" "a" "g" "a" "t" "c" "t" "c" "t" "g" "a"
[96] "t" "g" "a" "a" "c" "a" "a" "c" "c" "a" "a" "c" "g" "g" "a" "a" "a" "a" "a"
3 输出fasta格式文件
write.fasta(names="DEN-1", sequences=dengueseq, file.out="den1.fasta")
4读入,如果通过网页直接下载序列,可以直接从这一步开始进行导入
dengue <- read.fasta(file = "den1.fasta")
dengueseq <- dengue[[1]]
5.1DNA length
> length(dengueseq)
[1] 10735
5.2Base composition
> table(dengueseq)
dengueseq
acgt
3426 2240 2770 2299
【用SeqinR包在NCBI获取基因组序列并分析】等同于
count(dengueseq,1)acgt
3426 2240 2770 2299
5.3GC Content
> GC(dengueseq)
[1] 0.4666977
5.4 DNA words
> count(dengueseq,2)aaacagatcacccgctgagcgggttatctgtt
1108720890708901523261555976500787507440497832529
5.5写一个函数计算AT含量
AT <- function(inputseq)
{
mytable <- count(inputseq,1)#make a table with the count of As,Cs,Gs,Ts
mylength <- length(inputseq)#find the length of the whole sequence
myAs <- mytable[[1]]#number of As in the sequence
myTs <- mytable[[4]]#nmmber of Ts in the sequence
myAT <- (myAs+myTs)/mylength
return(myAT)
}
计算denggueseq的AT含量
> AT(dengueseq)
[1] 0.5333023
5.6统计三联体密码一共多少个
> count(dengueseq,3) %>% sum
[1] 10733
参考文章:A Little Book of R For Bioinformatics, Release 0.1
推荐阅读
- Docker应用:容器间通信与Mariadb数据库主从复制
- JS中的各种宽高度定义及其应用
- 由浅入深理解AOP
- 【译】20个更有效地使用谷歌搜索的技巧
- 涉毒患者(新诗)
- 参保人员因患病来不及到指定的医疗机构就医,能否报销医疗费用()
- mybatisplus如何在xml的连表查询中使用queryWrapper
- MybatisPlus|MybatisPlus LambdaQueryWrapper使用int默认值的坑及解决
- MybatisPlus使用queryWrapper如何实现复杂查询
- 标签、语法规范、内联框架、超链接、CSS的编写位置、CSS语法、开发工具、块和内联、常用选择器、后代元素选择器、伪类、伪元素。