@Chubek
Last active October 20, 2023 13:12
Counting nucleotide triplets in AWK

The file triplet.awk is a nucleotide triplet counter written in AWK. It is very simple, and it was generated by a Python script. For the uninitiated, AWK is a language designed by Aho, Weinberger, and Kernighan at Bell Labs. Today there are many AWK interpreters, and I am writing my own; you can view the project here.

First:

wget https://gist.githubusercontent.com/Chubek/db779e5d92aadccbca1a8a25b4a55e20/raw/f3256185f4a43920b031af29dc9e45cdc1da9871/triplet.awk

... to download the script file.

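triplet.awk does not care if your file is FASTA, FASTQ, or anything else. It tests every input line against each of the 64 triplets, counts the lines in which each triplet appears, and prints the totals at the end. A triplet that occurs several times on one line is counted once for that line, and header lines are tested like any other line. Imagine we have this FASTA file: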

>ref|NC_005213.1|:c879-490883
ATGCGATTGCTATTAGAACTTAAAGCCCTAAATAGCATAGATAAAAAACAATTATCTAACTATCTAATAC
AAGGTTTCATTTATAATATATTAAAAAATACCGAATACTCTTGGTTACATAATTGGAAGAAAGAGAAATA
TTTTAATTTTACCTTAATCCCAAAAAAAGACATAATAGAGAATAAGAGGTATTATTTAATCATATCTTCG
CCCGATAAAAGGTTTATAGAGGTTTTGCATAATAAAATAAAAGATTTAGATATAATAACTATTGGTTTGG
CTCAATTCCAATTAAGGAAAACAAAAAAATTCGATCCAAAATTAAGATTTCCTTGGGTTACTATAACTCC
TATAGTATTAAGGGAAGGCAAAATAGTAATATTAAAAGGGGACAAATACTATAAGGTATTTGTTAAGCGA
TTGGAAGAATTGAAAAAGTATAATCTAATAAAAAAGAAAGAGCCCATTTTAGAAGAACCCATAGAGATTA
GTTTAAACCAAATCAAAGATGGATGGAAAATTATAGATGTAAAAGATAGGTATTACGATTTTAGAAATAA
GAGTTTTAGCGCTTTTTCTAATTGGTTGCGAGACCTAAAAGAGCAAAGCTTAAGAAAATATAATAATTTC
TGTGGGAAGAACTTTTATTTTGAAGAAGCAATATTCGAAGGTTTTACTTTCTATAAAACAGTATCAATAA
GAATAAGAATTAACAGAGGGGAAGCAGTATATATAGGCACATTATGGAAAGAGTTAAATGTTTATAGAAA
ATTAGACAAAGAAGAAAGAGAATTTTACAAATTTTTGTACGATTGCGGTTTGGGTTCATTAAATTCTATG
GGTTTTGGGTTTGTTAATACAAAAAAGAACTCTGCGAGATAA
>ref|NC_005213.1|:883-2691
ATGAAAAAGCCCCAACCCTATAAAGATGAAGAGATATATTCTATTTTAGAAGAGCCCGTAAAACAATGGT
TTAAAGAGAAATACAAAACATTCACTCCCCCACAAAGGTATGCAATAATGGAAATACATAAAAGGAACAA
TGTTTTAATTTCTTCCCCCACAGGTTCGGGAAAAACGTTAGCAGCGTTTTTAGCTATAATAAATGAATTA
ATAAAGTTATCTCATAAAGGAAAATTAGAAAATAGAGTTTATGCCATTTATGTTTCTCCATTAAGAAGTT
TAAATAACGATGTAAAGAAAAACTTAGAAACTCCATTAAAAGAAATAAAAGAAAAAGCGAAAGAGCTTAA 

We can save this in a file called seq.fa and then:

cat seq.fa | awk -f triplet.awk

We will see:

AAA = 18
AAC = 12
AAT = 18
AAG = 17
ACA = 12
ACC = 6
ACT = 9
ACG = 4
ATA = 17
ATC = 7
ATT = 17
ATG = 9
AGA = 16
AGC = 9
AGT = 6
AGG = 11
CAA = 11
CAC = 3
CAT = 10
CAG = 3
CCA = 9
CCC = 7
CCT = 5
CCG = 3
CTA = 10
CTC = 6
CTT = 8
CTG = 1
CGA = 11
CGC = 1
CGT = 2
CGG = 2
TAA = 18
TAC = 10
TAT = 16
TAG = 13
TCA = 8
TCC = 6
TCT = 11
TCG = 4
TTA = 18
TTC = 10
TTT = 15
TTG = 10
TGA = 4
TGC = 7
TGT = 9
TGG = 12
GAA = 17
GAC = 4
GAT = 9
GAG = 12
GCA = 8
GCC = 4
GCT = 4
GCG = 7
GTA = 10
GTC = 0
GTT = 12
GTG = 1
GGA = 10
GGC = 2
GGT = 13
GGG = 7
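
Each rule in triplet.awk has the form /AAA/{ aaa++; }, and AWK runs such a rule once per matching line. The figures above are therefore line counts, which is why none of them exceeds 18, the number of sequence lines in seq.fa. If you want every overlapping occurrence instead, a sliding window over each line is one way to get it. The following is only a sketch and is not part of triplet.awk: the function name count_triplets is made up for illustration, it assumes header lines start with ">", it does not join triplets that span a line break, and it prints only the triplets it actually saw.

```sh
# Hypothetical helper, not part of triplet.awk: count every overlapping
# occurrence of each triplet, skipping FASTA header lines.
count_triplets() {
    awk '
        /^>/ { next }                  # skip FASTA header lines
        {
            seq = toupper($0)
            # slide a 3-character window along the line
            for (i = 1; i + 2 <= length(seq); i++) {
                t = substr(seq, i, 3)
                if (t ~ /^[ACGT][ACGT][ACGT]$/)
                    count[t]++
            }
        }
        END {
            for (t in count)
                printf "%s = %d\n", t, count[t]
        }
    ' "$@" | sort
}
```

For example, a line reading AAAA adds 2 to AAA here, while triplet.awk adds 1 for the same line. Usage would be count_triplets seq.fa, or cat seq.fa | count_triplets.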

You could also:

awk -f triplet.awk seq.fa [..more files can be added here]
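
Since every output line has the form XXX = N, ordinary text tools can post-process it. As a small sketch (the function name top_triplets is hypothetical and not part of the gist), this ranks the triplets by count, breaking ties alphabetically:

```sh
# Hypothetical helper: show the N most frequent triplets from
# "XXX = N" lines read on standard input (default: 10).
top_triplets() {
    # field 3 is the count; sort it numerically, highest first
    sort -k3,3nr -k1,1 | head -n "${1:-10}"
}
```

It could be used as awk -f triplet.awk seq.fa | top_triplets 5.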

This script was written as a gift for my friend Kevin, to whom I owe a lot, but I will post it on r/bioinformatics as well. The script is released under the Unlicense, so you can do whatever you want with it! In the future I will make more biotech stuff. I already have 4 projects brewing; you can view them in the list of my repositories.

Thanks, Chubak.

# triplet.awk: for each triplet, count the input lines that contain it.
/AAA/{ aaa++; }
/AAC/{ aac++; }
/AAT/{ aat++; }
/AAG/{ aag++; }
/ACA/{ aca++; }
/ACC/{ acc++; }
/ACT/{ act++; }
/ACG/{ acg++; }
/ATA/{ ata++; }
/ATC/{ atc++; }
/ATT/{ att++; }
/ATG/{ atg++; }
/AGA/{ aga++; }
/AGC/{ agc++; }
/AGT/{ agt++; }
/AGG/{ agg++; }
/CAA/{ caa++; }
/CAC/{ cac++; }
/CAT/{ cat++; }
/CAG/{ cag++; }
/CCA/{ cca++; }
/CCC/{ ccc++; }
/CCT/{ cct++; }
/CCG/{ ccg++; }
/CTA/{ cta++; }
/CTC/{ ctc++; }
/CTT/{ ctt++; }
/CTG/{ ctg++; }
/CGA/{ cga++; }
/CGC/{ cgc++; }
/CGT/{ cgt++; }
/CGG/{ cgg++; }
/TAA/{ taa++; }
/TAC/{ tac++; }
/TAT/{ tat++; }
/TAG/{ tag++; }
/TCA/{ tca++; }
/TCC/{ tcc++; }
/TCT/{ tct++; }
/TCG/{ tcg++; }
/TTA/{ tta++; }
/TTC/{ ttc++; }
/TTT/{ ttt++; }
/TTG/{ ttg++; }
/TGA/{ tga++; }
/TGC/{ tgc++; }
/TGT/{ tgt++; }
/TGG/{ tgg++; }
/GAA/{ gaa++; }
/GAC/{ gac++; }
/GAT/{ gat++; }
/GAG/{ gag++; }
/GCA/{ gca++; }
/GCC/{ gcc++; }
/GCT/{ gct++; }
/GCG/{ gcg++; }
/GTA/{ gta++; }
/GTC/{ gtc++; }
/GTT/{ gtt++; }
/GTG/{ gtg++; }
/GGA/{ gga++; }
/GGC/{ ggc++; }
/GGT/{ ggt++; }
/GGG/{ ggg++; }
# Print the totals once all input has been read.
END { printf "AAA = %d\n",aaa; }
END { printf "AAC = %d\n",aac; }
END { printf "AAT = %d\n",aat; }
END { printf "AAG = %d\n",aag; }
END { printf "ACA = %d\n",aca; }
END { printf "ACC = %d\n",acc; }
END { printf "ACT = %d\n",act; }
END { printf "ACG = %d\n",acg; }
END { printf "ATA = %d\n",ata; }
END { printf "ATC = %d\n",atc; }
END { printf "ATT = %d\n",att; }
END { printf "ATG = %d\n",atg; }
END { printf "AGA = %d\n",aga; }
END { printf "AGC = %d\n",agc; }
END { printf "AGT = %d\n",agt; }
END { printf "AGG = %d\n",agg; }
END { printf "CAA = %d\n",caa; }
END { printf "CAC = %d\n",cac; }
END { printf "CAT = %d\n",cat; }
END { printf "CAG = %d\n",cag; }
END { printf "CCA = %d\n",cca; }
END { printf "CCC = %d\n",ccc; }
END { printf "CCT = %d\n",cct; }
END { printf "CCG = %d\n",ccg; }
END { printf "CTA = %d\n",cta; }
END { printf "CTC = %d\n",ctc; }
END { printf "CTT = %d\n",ctt; }
END { printf "CTG = %d\n",ctg; }
END { printf "CGA = %d\n",cga; }
END { printf "CGC = %d\n",cgc; }
END { printf "CGT = %d\n",cgt; }
END { printf "CGG = %d\n",cgg; }
END { printf "TAA = %d\n",taa; }
END { printf "TAC = %d\n",tac; }
END { printf "TAT = %d\n",tat; }
END { printf "TAG = %d\n",tag; }
END { printf "TCA = %d\n",tca; }
END { printf "TCC = %d\n",tcc; }
END { printf "TCT = %d\n",tct; }
END { printf "TCG = %d\n",tcg; }
END { printf "TTA = %d\n",tta; }
END { printf "TTC = %d\n",ttc; }
END { printf "TTT = %d\n",ttt; }
END { printf "TTG = %d\n",ttg; }
END { printf "TGA = %d\n",tga; }
END { printf "TGC = %d\n",tgc; }
END { printf "TGT = %d\n",tgt; }
END { printf "TGG = %d\n",tgg; }
END { printf "GAA = %d\n",gaa; }
END { printf "GAC = %d\n",gac; }
END { printf "GAT = %d\n",gat; }
END { printf "GAG = %d\n",gag; }
END { printf "GCA = %d\n",gca; }
END { printf "GCC = %d\n",gcc; }
END { printf "GCT = %d\n",gct; }
END { printf "GCG = %d\n",gcg; }
END { printf "GTA = %d\n",gta; }
END { printf "GTC = %d\n",gtc; }
END { printf "GTT = %d\n",gtt; }
END { printf "GTG = %d\n",gtg; }
END { printf "GGA = %d\n",gga; }
END { printf "GGC = %d\n",ggc; }
END { printf "GGT = %d\n",ggt; }
END { printf "GGG = %d\n",ggg; }
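
The README says the script was generated by Python, but the generator is not shown. Because the listing is so regular, the same 128 rules can be produced with three nested loops. Here is a minimal sketch of that idea written in AWK itself rather than Python; the function name generate_triplet_awk is hypothetical, and the base order A, C, T, G is chosen only to match the order of the listing above.

```sh
# Hypothetical generator: print a triplet.awk-style script to standard output.
generate_triplet_awk() {
    awk 'BEGIN {
        n = split("A C T G", base, " ")
        # One pattern-action rule per triplet, e.g. /AAA/{ aaa++; }
        for (i = 1; i <= n; i++)
            for (j = 1; j <= n; j++)
                for (k = 1; k <= n; k++) {
                    t = base[i] base[j] base[k]
                    printf "/%s/{ %s++; }\n", t, tolower(t)
                }
        # One END rule per triplet, e.g. END { printf "AAA = %d\n",aaa; }
        for (i = 1; i <= n; i++)
            for (j = 1; j <= n; j++)
                for (k = 1; k <= n; k++) {
                    t = base[i] base[j] base[k]
                    printf "END { printf \"%s = %%d\\n\",%s; }\n", t, tolower(t)
                }
    }'
}
```

Running generate_triplet_awk > my_triplet.awk writes the rules to a file (the file name is just an example) that can then be used with awk -f.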