怎么理解基因组注释文件(gtf)的内容?
1.输入文件:GRCh38.p13.gtf
- 905M,1995529行=1,995,529行。
$ ls -lth *gtf
-rw-r--r--. 1 wangjl jinlab 905M May 1 2023 GRCh38.p13.gtf
$ wc GRCh38.p13.gtf
1995529 82989334 948602107 GRCh38.p13.gtf
$ head GRCh38.p13.gtf
##description: evidence-based annotation of the human genome (GRCh38), version 42 (Ensembl 108)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2022-07-20
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
2. 第三列是基因组位置: exon/CDS/UTR等
(1) 按频率统计location
$ grep -v '^#' GRCh38.p13.gtf | awk '{print $3}' | sort | uniq -c | sort -k 1nr
852230 exon
653960 CDS
180584 UTR
117681 transcript
64207 start_codon
64065 stop_codon
62696 gene
101 Selenocysteine
(2) 查看文件第一个基因
$ grep -v '^#' GRCh38.p13.gtf | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | head -n 20
chr1 11869 14409 + gene "DDX11L2"; 2;
chr1 11869 14409 + transcript "lncRNA"; "DDX11L2";
chr1 11869 12227 + exon "lncRNA"; "DDX11L2";
chr1 12613 12721 + exon "lncRNA"; "DDX11L2";
chr1 13221 14409 + exon "lncRNA"; "DDX11L2";
chr1 12010 13670 + gene "DDX11L1"; 2;
chr1 12010 13670 + transcript "transcribed_unprocessed_pseudogene"; "DDX11L1";
chr1 12010 12057 + exon "transcribed_unprocessed_pseudogene"; "DDX11L1";
chr1 12179 12227 + exon "transcribed_unprocessed_pseudogene"; "DDX11L1";
chr1 12613 12697 + exon "transcribed_unprocessed_pseudogene"; "DDX11L1";
chr1 12975 13052 + exon "transcribed_unprocessed_pseudogene"; "DDX11L1";
chr1 13221 13374 + exon "transcribed_unprocessed_pseudogene"; "DDX11L1";
chr1 13453 13670 + exon "transcribed_unprocessed_pseudogene"; "DDX11L1";
chr1 14404 29570 - gene "WASH7P"; 2;
chr1 14404 29570 - transcript "unprocessed_pseudogene"; "WASH7P";
chr1 29534 29570 - exon "unprocessed_pseudogene"; "WASH7P";
chr1 24738 24891 - exon "unprocessed_pseudogene"; "WASH7P";
chr1 18268 18366 - exon "unprocessed_pseudogene"; "WASH7P";
chr1 17915 18061 - exon "unprocessed_pseudogene"; "WASH7P";
chr1 17606 17742 - exon "unprocessed_pseudogene"; "WASH7P";
可见,对于基因 DDX11L1
- 第一级是 gene
- 第二级是它的两个转录本 transcript
- 第三级是该转录本的外显子
(3) 找一个蛋白编码基因:+链上的 CD5
$ grep -v '^#' GRCh38.p13.gtf | grep -P '\"CD5\"' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}'
chr11 61102489 61127852 + gene "CD5"; 2;
chr11 61102489 61127852 + transcript "protein_coding"; "CD5";
chr11 61102489 61102615 + exon "protein_coding"; "CD5";
chr11 61102561 61102615 + CDS "protein_coding"; "CD5";
chr11 61102561 61102563 + start_codon "protein_coding"; "CD5";
chr11 61115056 61115094 + exon "protein_coding"; "CD5";
chr11 61115056 61115094 + CDS "protein_coding"; "CD5";
chr11 61118175 61118480 + exon "protein_coding"; "CD5";
chr11 61118175 61118480 + CDS "protein_coding"; "CD5";
chr11 61118915 61118977 + exon "protein_coding"; "CD5";
chr11 61118915 61118977 + CDS "protein_coding"; "CD5";
chr11 61119234 61119575 + exon "protein_coding"; "CD5";
chr11 61119234 61119575 + CDS "protein_coding"; "CD5";
chr11 61121611 61121904 + exon "protein_coding"; "CD5";
chr11 61121611 61121904 + CDS "protein_coding"; "CD5";
chr11 61122907 61123032 + exon "protein_coding"; "CD5";
chr11 61122907 61123032 + CDS "protein_coding"; "CD5";
chr11 61123884 61123937 + exon "protein_coding"; "CD5";
chr11 61123884 61123937 + CDS "protein_coding"; "CD5";
chr11 61125032 61125151 + exon "protein_coding"; "CD5";
chr11 61125032 61125151 + CDS "protein_coding"; "CD5";
chr11 61125751 61125841 + exon "protein_coding"; "CD5";
chr11 61125751 61125836 + CDS "protein_coding"; "CD5";
chr11 61125837 61125839 + stop_codon "protein_coding"; "CD5";
chr11 61126288 61127852 + exon "protein_coding"; "CD5";
chr11 61102489 61102560 + UTR "protein_coding"; "CD5";
chr11 61125837 61125841 + UTR "protein_coding"; "CD5";
chr11 61126288 61127852 + UTR "protein_coding"; "CD5";
经过和IGV图对照,发现从上到下,分别对应着基因的5-3方向。
可以忽略其中的CDS位置。
$ grep -v '^#' GRCh38.p13.gtf | grep -P '\"CD5\"' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | grep -v "CDS"
chr11 61102489 61127852 + gene "CD5"; 2;
chr11 61102489 61127852 + transcript "protein_coding"; "CD5";
chr11 61102489 61102615 + exon "protein_coding"; "CD5";
chr11 61102561 61102563 + start_codon "protein_coding"; "CD5";
chr11 61115056 61115094 + exon "protein_coding"; "CD5";
chr11 61118175 61118480 + exon "protein_coding"; "CD5";
chr11 61118915 61118977 + exon "protein_coding"; "CD5";
chr11 61119234 61119575 + exon "protein_coding"; "CD5";
chr11 61121611 61121904 + exon "protein_coding"; "CD5";
chr11 61122907 61123032 + exon "protein_coding"; "CD5";
chr11 61123884 61123937 + exon "protein_coding"; "CD5";
chr11 61125032 61125151 + exon "protein_coding"; "CD5";
chr11 61125751 61125841 + exon "protein_coding"; "CD5";
chr11 61125837 61125839 + stop_codon "protein_coding"; "CD5";
chr11 61126288 61127852 + exon "protein_coding"; "CD5";
chr11 61102489 61102560 + UTR "protein_coding"; "CD5";
chr11 61125837 61125841 + UTR "protein_coding"; "CD5";
chr11 61126288 61127852 + UTR "protein_coding"; "CD5";
对比IGV图,可知:
- 层级结构是: 1) gene, 2)转录本,3)外显子
-
- UTR,其实第一个是5UTR,后两个是3UTR,结果都排在了exon后面。
- stop_coden在倒数第二个外显子上,所以3UTR分成了两段:exon倒数第二最后一段,和最后一个exon。
(4) 找一个蛋白编码基因:-链上的 CD7
直接过滤掉CDS行:
$ grep -v '^#' GRCh38.p13.gtf | grep -P '\"CD7\"' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | grep -v "CDS"
chr17 82314868 82317608 - gene "CD7"; 1;
chr17 82314868 82317577 - transcript "protein_coding"; "CD7";
chr17 82317414 82317577 - exon "protein_coding"; "CD7";
chr17 82317493 82317495 - start_codon "protein_coding"; "CD7";
chr17 82316667 82316981 - exon "protein_coding"; "CD7";
chr17 82314868 82316409 - exon "protein_coding"; "CD7";
chr17 82316141 82316143 - stop_codon "protein_coding"; "CD7";
chr17 82317496 82317577 - UTR "protein_coding"; "CD7";
chr17 82314868 82316143 - UTR "protein_coding"; "CD7";
chr17 82314873 82317608 - transcript "protein_coding"; "CD7";
chr17 82317414 82317608 - exon "protein_coding"; "CD7";
chr17 82317493 82317495 - start_codon "protein_coding"; "CD7";
chr17 82316667 82316981 - exon "protein_coding"; "CD7";
chr17 82316195 82316409 - exon "protein_coding"; "CD7";
chr17 82314873 82315431 - exon "protein_coding"; "CD7";
chr17 82315321 82315323 - stop_codon "protein_coding"; "CD7";
chr17 82317496 82317608 - UTR "protein_coding"; "CD7";
chr17 82314873 82315323 - UTR "protein_coding"; "CD7";
chr17 82314874 82317552 - transcript "protein_coding"; "CD7";
chr17 82316667 82317552 - exon "protein_coding"; "CD7";
chr17 82316761 82316763 - start_codon "protein_coding"; "CD7";
chr17 82316195 82316409 - exon "protein_coding"; "CD7";
chr17 82314874 82315431 - exon "protein_coding"; "CD7";
chr17 82315321 82315323 - stop_codon "protein_coding"; "CD7";
chr17 82316764 82317552 - UTR "protein_coding"; "CD7";
chr17 82314874 82315323 - UTR "protein_coding"; "CD7";
chr17 82315975 82317558 - transcript "protein_coding"; "CD7";
chr17 82317414 82317558 - exon "protein_coding"; "CD7";
chr17 82316667 82317011 - exon "protein_coding"; "CD7";
chr17 82316761 82316763 - start_codon "protein_coding"; "CD7";
chr17 82315975 82316409 - exon "protein_coding"; "CD7";
chr17 82316141 82316143 - stop_codon "protein_coding"; "CD7";
chr17 82317414 82317558 - UTR "protein_coding"; "CD7";
chr17 82316764 82317011 - UTR "protein_coding"; "CD7";
chr17 82315975 82316143 - UTR "protein_coding"; "CD7";
- 还是4个层级结构:1)gene;2)transcript 4个;3) exon; 4) UTR跟在最后
- 如果有start_coden和stop_coden,会在exon之间显示。
- 从上到下,对应着RNA中基因的5-3,也就是从IGV的右侧到左侧(-链上的基因)
- 第一个UTR是5UTR,第二个是3UTR.
3. 获取基因范围
$ grep -v '^#' GRCh38.p13.gtf | grep "protein_coding" | awk '{if($3=="gene")print $1"\t"$4"\t"$5"\t"$7"\t"$14}' |sed -e 's/;//' -e 's/\"//g' | head
chr1 65419 71585 + OR4F5
chr1 450740 451678 - OR4F29
chr1 685716 686654 - OR4F16
chr1 923923 944575 + SAMD11
chr1 944203 959309 - NOC2L
chr1 960584 965719 + KLHL17
chr1 966482 975865 + PLEKHN1
chr1 975198 982117 - PERM1
chr1 998962 1000172 - HES4
chr1 1001138 1014540 + ISG15
4. 获取3UTR区域的方法
- shell: gtf 文件,保留每个转录本的stop和UTR,strand
- Python: 对于+,UTR>=stop的保留;对于-,UTR<=stop的保留
(1) shell 获取 bed 文件
$ grep -v '^#' GRCh38.p13.gtf | grep -P '\"CD7\"' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | awk '{if($5=="UTR" || $5=="stop_codon") print $0}' | head
chr17 82316141 82316143 - stop_codon "protein_coding"; "CD7";
chr17 82317496 82317577 - UTR "protein_coding"; "CD7";
chr17 82314868 82316143 - UTR "protein_coding"; "CD7";
chr17 82315321 82315323 - stop_codon "protein_coding"; "CD7";
chr17 82317496 82317608 - UTR "protein_coding"; "CD7";
chr17 82314873 82315323 - UTR "protein_coding"; "CD7";
chr17 82315321 82315323 - stop_codon "protein_coding"; "CD7";
chr17 82316764 82317552 - UTR "protein_coding"; "CD7";
chr17 82314874 82315323 - UTR "protein_coding"; "CD7";
chr17 82316141 82316143 - stop_codon "protein_coding"; "CD7";
格式化:去掉"和;
$ grep -v '^#' GRCh38.p13.gtf | grep "protein_coding" | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | awk '{if($5=="UTR" || $5=="stop_codon") print $0}' | sed -e 's/;//g' -e 's/\"//g' | head
chr1 70006 70008 + stop_codon protein_coding OR4F5
chr1 65419 65433 + UTR protein_coding OR4F5
chr1 65520 65564 + UTR protein_coding OR4F5
chr1 70006 71585 + UTR protein_coding OR4F5
chr1 450740 450742 - stop_codon protein_coding OR4F29
chr1 450740 450742 - UTR protein_coding OR4F29
chr1 685716 685718 - stop_codon protein_coding OR4F16
chr1 685716 685718 - UTR protein_coding OR4F16
chr1 944151 944153 + stop_codon protein_coding SAMD11
chr1 923923 924431 + UTR protein_coding SAMD11
(2) 步骤2:Python 获取UTR区域
一个转录本只有一个stop,接着怎么判断紧邻的UTR是否是UTR3?
伪代码如下:
如果是 stop,则记录 stop_codon=当前行
如果不是stop(就只能是 UTR了),则要求基因名和上一个stop行一致:stop_codon[6] == arr[6]:
if +:
if UTR[1]>=stop_codon[1]
save UTR3
else:
save UTR5
if -:
UTR[2]<=stop_codon[2]
save UTR3
else:
save UTR5
如果基因名和上一个stop不一致,则该行只能是UTR5.
注意 Note:
- start codon 3个碱基属于cds区域,不属于UTR5区;
- 而 stop codon 3个碱基属于UTR3区。