Last active
October 20, 2023 13:12
Revisions
Chubek revised this gist
Oct 20, 2023. 1 changed file with 10 additions and 2 deletions.
@@ -1,4 +1,12 @@

The file `triplet.awk` is a nucleotide triplet counter written in AWK. It is very simple and was generated by a Python script. For the uninitiated, AWK is a language designed by Aho, Weinberger, and Kernighan at Bell Labs. Today there are many AWK interpreters, and I am writing my own AWK interpreter; you can view the project [here](https://github.com/Chubek/Squawk).

First, download the script file:

```
wget https://gist.githubusercontent.com/Chubek/db779e5d92aadccbca1a8a25b4a55e20/raw/f3256185f4a43920b031af29dc9e45cdc1da9871/triplet.awk
```

`triplet.awk` does not care whether your file is FASTA, FASTQ, or anything else. For each of the 64 triplets, it counts the input lines that contain it and prints the totals at the end.

Imagine we have this FASTA file:

@@ -106,6 +114,6 @@ You could also:

```
awk -f triplet.awk seq.fa [..more files can be added here]
```

I wrote this script as a gift for my friend Kevin, to whom I owe a lot, but I will post it on r/bioinformatics as well. The script is released under the Unlicense, so you can do whatever you want with it! I plan to make more biotech tools in the future. I already have four projects brewing; you can find them in my list of repositories.

Thanks, Chubak.
Chubek created this gist
Oct 20, 2023.
@@ -0,0 +1,111 @@

The file `triplet.awk` is a nucleotide triplet counter written in AWK. For the uninitiated, AWK is a language designed by Aho, Weinberger, and Kernighan at Bell Labs. Today there are many AWK interpreters, and I am writing my own; you can view the project [here](https://github.com/Chubek/Squawk).

`triplet.awk` does not care whether your file is FASTA, FASTQ, or anything else. For each of the 64 triplets, it counts the input lines that contain it and prints the totals at the end.

Imagine we have this FASTA file:

```
>ref|NC_005213.1|:c879-490883
ATGCGATTGCTATTAGAACTTAAAGCCCTAAATAGCATAGATAAAAAACAATTATCTAACTATCTAATAC
AAGGTTTCATTTATAATATATTAAAAAATACCGAATACTCTTGGTTACATAATTGGAAGAAAGAGAAATA
TTTTAATTTTACCTTAATCCCAAAAAAAGACATAATAGAGAATAAGAGGTATTATTTAATCATATCTTCG
CCCGATAAAAGGTTTATAGAGGTTTTGCATAATAAAATAAAAGATTTAGATATAATAACTATTGGTTTGG
CTCAATTCCAATTAAGGAAAACAAAAAAATTCGATCCAAAATTAAGATTTCCTTGGGTTACTATAACTCC
TATAGTATTAAGGGAAGGCAAAATAGTAATATTAAAAGGGGACAAATACTATAAGGTATTTGTTAAGCGA
TTGGAAGAATTGAAAAAGTATAATCTAATAAAAAAGAAAGAGCCCATTTTAGAAGAACCCATAGAGATTA
GTTTAAACCAAATCAAAGATGGATGGAAAATTATAGATGTAAAAGATAGGTATTACGATTTTAGAAATAA
GAGTTTTAGCGCTTTTTCTAATTGGTTGCGAGACCTAAAAGAGCAAAGCTTAAGAAAATATAATAATTTC
TGTGGGAAGAACTTTTATTTTGAAGAAGCAATATTCGAAGGTTTTACTTTCTATAAAACAGTATCAATAA
GAATAAGAATTAACAGAGGGGAAGCAGTATATATAGGCACATTATGGAAAGAGTTAAATGTTTATAGAAA
ATTAGACAAAGAAGAAAGAGAATTTTACAAATTTTTGTACGATTGCGGTTTGGGTTCATTAAATTCTATG
GGTTTTGGGTTTGTTAATACAAAAAAGAACTCTGCGAGATAA
>ref|NC_005213.1|:883-2691
ATGAAAAAGCCCCAACCCTATAAAGATGAAGAGATATATTCTATTTTAGAAGAGCCCGTAAAACAATGGT
TTAAAGAGAAATACAAAACATTCACTCCCCCACAAAGGTATGCAATAATGGAAATACATAAAAGGAACAA
TGTTTTAATTTCTTCCCCCACAGGTTCGGGAAAAACGTTAGCAGCGTTTTTAGCTATAATAAATGAATTA
ATAAAGTTATCTCATAAAGGAAAATTAGAAAATAGAGTTTATGCCATTTATGTTTCTCCATTAAGAAGTT
TAAATAACGATGTAAAGAAAAACTTAGAAACTCCATTAAAAGAAATAAAAGAAAAAGCGAAAGAGCTTAA
```

We can save this in a file called `seq.fa` and then:

```
cat seq.fa | awk -f triplet.awk
```

We will see:

```
AAA = 18
AAC = 12
AAT = 18
AAG = 17
ACA = 12
ACC = 6
ACT = 9
ACG = 4
ATA = 17
ATC = 7
ATT = 17
ATG = 9
AGA = 16
AGC = 9
AGT = 6
AGG = 11
CAA = 11
CAC = 3
CAT = 10
CAG = 3
CCA = 9
CCC = 7
CCT = 5
CCG = 3
CTA = 10
CTC = 6
CTT = 8
CTG = 1
CGA = 11
CGC = 1
CGT = 2
CGG = 2
TAA = 18
TAC = 10
TAT = 16
TAG = 13
TCA = 8
TCC = 6
TCT = 11
TCG = 4
TTA = 18
TTC = 10
TTT = 15
TTG = 10
TGA = 4
TGC = 7
TGT = 9
TGG = 12
GAA = 17
GAC = 4
GAT = 9
GAG = 12
GCA = 8
GCC = 4
GCT = 4
GCG = 7
GTA = 10
GTC = 0
GTT = 12
GTG = 1
GGA = 10
GGC = 2
GGT = 13
GGG = 7
```

You can also pass the files as arguments:

```
awk -f triplet.awk seq.fa [..more files can be added here]
```

I wrote this script as a gift for my friend Kevin, to whom I owe a lot, but I will post it on r/bioinformatics as well. The script is released under the Unlicense, so you can do whatever you want with it!

Thanks, Chubak.
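A note on what the numbers mean: each `/XYZ/` rule in `triplet.awk` fires at most once per input line, so the script counts the lines that contain a triplet rather than every occurrence of it. That is consistent with the sample output, where no count exceeds 18, the number of sequence lines in `seq.fa`. The Python sketch below is only an illustration and not part of the gist; the function names and the `ACTG` base order are assumptions made for the example. It mirrors the per-line behaviour and, for comparison, counts every overlapping occurrence within a line.

```python
from itertools import product

BASES = "ACTG"  # assumed order; it matches the order of the sample output
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]


def count_matching_lines(lines):
    """Mirror triplet.awk: for each triplet, count the lines containing it at least once."""
    counts = dict.fromkeys(TRIPLETS, 0)
    for line in lines:
        for triplet in TRIPLETS:
            if triplet in line:  # case-sensitive substring test, like /AAA/ in AWK
                counts[triplet] += 1
    return counts


def count_occurrences(lines):
    """Alternative: count every overlapping occurrence of each triplet within each line."""
    counts = dict.fromkeys(TRIPLETS, 0)
    for line in lines:
        for i in range(len(line) - 2):
            window = line[i:i + 3]
            if window in counts:  # skips windows with other characters (headers, N, lowercase)
                counts[window] += 1
    return counts


sample = ["AAAAC", "GGAAA"]
print(count_matching_lines(sample)["AAA"])  # 2: both lines contain AAA
print(count_occurrences(sample)["AAA"])     # 3: two windows in the first line, one in the second
```

Neither function sees triplets that straddle a line break, which is also true of the AWK script; joining the lines of each record first would be needed for that.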
@@ -0,0 +1,128 @@

/AAA/{ aaa++; }
/AAC/{ aac++; }
/AAT/{ aat++; }
/AAG/{ aag++; }
/ACA/{ aca++; }
/ACC/{ acc++; }
/ACT/{ act++; }
/ACG/{ acg++; }
/ATA/{ ata++; }
/ATC/{ atc++; }
/ATT/{ att++; }
/ATG/{ atg++; }
/AGA/{ aga++; }
/AGC/{ agc++; }
/AGT/{ agt++; }
/AGG/{ agg++; }
/CAA/{ caa++; }
/CAC/{ cac++; }
/CAT/{ cat++; }
/CAG/{ cag++; }
/CCA/{ cca++; }
/CCC/{ ccc++; }
/CCT/{ cct++; }
/CCG/{ ccg++; }
/CTA/{ cta++; }
/CTC/{ ctc++; }
/CTT/{ ctt++; }
/CTG/{ ctg++; }
/CGA/{ cga++; }
/CGC/{ cgc++; }
/CGT/{ cgt++; }
/CGG/{ cgg++; }
/TAA/{ taa++; }
/TAC/{ tac++; }
/TAT/{ tat++; }
/TAG/{ tag++; }
/TCA/{ tca++; }
/TCC/{ tcc++; }
/TCT/{ tct++; }
/TCG/{ tcg++; }
/TTA/{ tta++; }
/TTC/{ ttc++; }
/TTT/{ ttt++; }
/TTG/{ ttg++; }
/TGA/{ tga++; }
/TGC/{ tgc++; }
/TGT/{ tgt++; }
/TGG/{ tgg++; }
/GAA/{ gaa++; }
/GAC/{ gac++; }
/GAT/{ gat++; }
/GAG/{ gag++; }
/GCA/{ gca++; }
/GCC/{ gcc++; }
/GCT/{ gct++; }
/GCG/{ gcg++; }
/GTA/{ gta++; }
/GTC/{ gtc++; }
/GTT/{ gtt++; }
/GTG/{ gtg++; }
/GGA/{ gga++; }
/GGC/{ ggc++; }
/GGT/{ ggt++; }
/GGG/{ ggg++; }
END { printf "AAA = %d\n",aaa; }
END { printf "AAC = %d\n",aac; }
END { printf "AAT = %d\n",aat; }
END { printf "AAG = %d\n",aag; }
END { printf "ACA = %d\n",aca; }
END { printf "ACC = %d\n",acc; }
END { printf "ACT = %d\n",act; }
END { printf "ACG = %d\n",acg; }
END { printf "ATA = %d\n",ata; }
END { printf "ATC = %d\n",atc; }
END { printf "ATT = %d\n",att; }
END { printf "ATG = %d\n",atg; }
END { printf "AGA = %d\n",aga; }
END { printf "AGC = %d\n",agc; }
END { printf "AGT = %d\n",agt; }
END { printf "AGG = %d\n",agg; }
END { printf "CAA = %d\n",caa; }
END { printf "CAC = %d\n",cac; }
END { printf "CAT = %d\n",cat; }
END { printf "CAG = %d\n",cag; }
END { printf "CCA = %d\n",cca; }
END { printf "CCC = %d\n",ccc; }
END { printf "CCT = %d\n",cct; }
END { printf "CCG = %d\n",ccg; }
END { printf "CTA = %d\n",cta; }
END { printf "CTC = %d\n",ctc; }
END { printf "CTT = %d\n",ctt; }
END { printf "CTG = %d\n",ctg; }
END { printf "CGA = %d\n",cga; }
END { printf "CGC = %d\n",cgc; }
END { printf "CGT = %d\n",cgt; }
END { printf "CGG = %d\n",cgg; }
END { printf "TAA = %d\n",taa; }
END { printf "TAC = %d\n",tac; }
END { printf "TAT = %d\n",tat; }
END { printf "TAG = %d\n",tag; }
END { printf "TCA = %d\n",tca; }
END { printf "TCC = %d\n",tcc; }
END { printf "TCT = %d\n",tct; }
END { printf "TCG = %d\n",tcg; }
END { printf "TTA = %d\n",tta; }
END { printf "TTC = %d\n",ttc; }
END { printf "TTT = %d\n",ttt; }
END { printf "TTG = %d\n",ttg; }
END { printf "TGA = %d\n",tga; }
END { printf "TGC = %d\n",tgc; }
END { printf "TGT = %d\n",tgt; }
END { printf "TGG = %d\n",tgg; }
END { printf "GAA = %d\n",gaa; }
END { printf "GAC = %d\n",gac; }
END { printf "GAT = %d\n",gat; }
END { printf "GAG = %d\n",gag; }
END { printf "GCA = %d\n",gca; }
END { printf "GCC = %d\n",gcc; }
END { printf "GCT = %d\n",gct; }
END { printf "GCG = %d\n",gcg; }
END { printf "GTA = %d\n",gta; }
END { printf "GTC = %d\n",gtc; }
END { printf "GTT = %d\n",gtt; }
END { printf "GTG = %d\n",gtg; }
END { printf "GGA = %d\n",gga; }
END { printf "GGC = %d\n",ggc; }
END { printf "GGT = %d\n",ggt; }
END { printf "GGG = %d\n",ggg; }
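The README mentions that this script was generated by Python, but the generator itself is not included in the gist. The following is a guess at what such a generator could look like, not the author's actual code: the function name `generate_triplet_awk` and the `ACTG` base order are assumptions, chosen so that the output reproduces the 128 rules above in the same order.

```python
from itertools import product

BASES = "ACTG"  # assumed order; it reproduces the rule order seen in triplet.awk


def generate_triplet_awk() -> str:
    """Return the text of an AWK script with one counter rule and one END rule per triplet."""
    triplets = ["".join(p) for p in product(BASES, repeat=3)]
    # One pattern-action rule per triplet: bump its counter on every line that matches.
    rules = [f"/{t}/{{ {t.lower()}++; }}" for t in triplets]
    # One END rule per triplet: print its counter (a never-set AWK variable prints as 0 with %d).
    reports = [f'END {{ printf "{t} = %d\\n",{t.lower()}; }}' for t in triplets]
    return "\n".join(rules + reports) + "\n"
```

Writing the returned string to a file named `triplet.awk` should give a script that can be run with `awk -f triplet.awk seq.fa`.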