The file triplet.awk is a nucleotide triplet counter written in AWK. It is very simple, and it was generated with Python. For the uninitiated, AWK is a language designed by Aho, Weinberger, and Kernighan at Bell Labs. Today there are many AWK interpreters, and I am writing my own; you can view the project here.
First, download the script file:
wget https://gist.githubusercontent.com/Chubek/db779e5d92aadccbca1a8a25b4a55e20/raw/f3256185f4a43920b031af29dc9e45cdc1da9871/triplet.awk
triplet.awk does not care whether your file is FASTA, FASTQ, or anything else. It just sees triplets, counts them, and displays the counts at the end. Imagine we have this FASTA file:
>ref|NC_005213.1|:c879-490883
ATGCGATTGCTATTAGAACTTAAAGCCCTAAATAGCATAGATAAAAAACAATTATCTAACTATCTAATAC
AAGGTTTCATTTATAATATATTAAAAAATACCGAATACTCTTGGTTACATAATTGGAAGAAAGAGAAATA
TTTTAATTTTACCTTAATCCCAAAAAAAGACATAATAGAGAATAAGAGGTATTATTTAATCATATCTTCG
CCCGATAAAAGGTTTATAGAGGTTTTGCATAATAAAATAAAAGATTTAGATATAATAACTATTGGTTTGG
CTCAATTCCAATTAAGGAAAACAAAAAAATTCGATCCAAAATTAAGATTTCCTTGGGTTACTATAACTCC
TATAGTATTAAGGGAAGGCAAAATAGTAATATTAAAAGGGGACAAATACTATAAGGTATTTGTTAAGCGA
TTGGAAGAATTGAAAAAGTATAATCTAATAAAAAAGAAAGAGCCCATTTTAGAAGAACCCATAGAGATTA
GTTTAAACCAAATCAAAGATGGATGGAAAATTATAGATGTAAAAGATAGGTATTACGATTTTAGAAATAA
GAGTTTTAGCGCTTTTTCTAATTGGTTGCGAGACCTAAAAGAGCAAAGCTTAAGAAAATATAATAATTTC
TGTGGGAAGAACTTTTATTTTGAAGAAGCAATATTCGAAGGTTTTACTTTCTATAAAACAGTATCAATAA
GAATAAGAATTAACAGAGGGGAAGCAGTATATATAGGCACATTATGGAAAGAGTTAAATGTTTATAGAAA
ATTAGACAAAGAAGAAAGAGAATTTTACAAATTTTTGTACGATTGCGGTTTGGGTTCATTAAATTCTATG
GGTTTTGGGTTTGTTAATACAAAAAAGAACTCTGCGAGATAA
>ref|NC_005213.1|:883-2691
ATGAAAAAGCCCCAACCCTATAAAGATGAAGAGATATATTCTATTTTAGAAGAGCCCGTAAAACAATGGT
TTAAAGAGAAATACAAAACATTCACTCCCCCACAAAGGTATGCAATAATGGAAATACATAAAAGGAACAA
TGTTTTAATTTCTTCCCCCACAGGTTCGGGAAAAACGTTAGCAGCGTTTTTAGCTATAATAAATGAATTA
ATAAAGTTATCTCATAAAGGAAAATTAGAAAATAGAGTTTATGCCATTTATGTTTCTCCATTAAGAAGTT
TAAATAACGATGTAAAGAAAAACTTAGAAACTCCATTAAAAGAAATAAAAGAAAAAGCGAAAGAGCTTAA
We can save this in a file called seq.fa and then run:
cat seq.fa | awk -f triplet.awk
We will see:
AAA = 18
AAC = 12
AAT = 18
AAG = 17
ACA = 12
ACC = 6
ACT = 9
ACG = 4
ATA = 17
ATC = 7
ATT = 17
ATG = 9
AGA = 16
AGC = 9
AGT = 6
AGG = 11
CAA = 11
CAC = 3
CAT = 10
CAG = 3
CCA = 9
CCC = 7
CCT = 5
CCG = 3
CTA = 10
CTC = 6
CTT = 8
CTG = 1
CGA = 11
CGC = 1
CGT = 2
CGG = 2
TAA = 18
TAC = 10
TAT = 16
TAG = 13
TCA = 8
TCC = 6
TCT = 11
TCG = 4
TTA = 18
TTC = 10
TTT = 15
TTG = 10
TGA = 4
TGC = 7
TGT = 9
TGG = 12
GAA = 17
GAC = 4
GAT = 9
GAG = 12
GCA = 8
GCC = 4
GCT = 4
GCG = 7
GTA = 10
GTC = 0
GTT = 12
GTG = 1
GGA = 10
GGC = 2
GGT = 13
GGG = 7
You can also skip cat and pass the file name to awk directly:
awk -f triplet.awk seq.fa [..more files can be added here]
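If you are curious how a counter like this can work, here is a minimal sketch. To be clear, this is not the contents of triplet.awk, just my illustration of the general idea. It assumes that header lines starting with ">" are skipped and that each sequence line is cut into consecutive, non-overlapping triplets; the real script may do either of these differently. The function name count_triplets is made up for this example. Unlike the output above, the sketch only prints triplets it has actually seen (so no "GTC = 0" line), and AWK does not guarantee the order of the results, which is why the demo pipes them through sort.

```shell
# Hypothetical re-implementation for illustration, NOT the real triplet.awk.
# Assumptions: ">" header lines are skipped; each sequence line is split
# into consecutive, non-overlapping triplets; leftover bases are ignored.
count_triplets() {
    awk '
    /^>/ { next }                  # skip FASTA header lines (assumption)
    {
        seq = toupper($0)
        # walk the line three characters at a time
        for (i = 1; i + 2 <= length(seq); i += 3)
            count[substr(seq, i, 3)]++
    }
    END {
        # "for (k in count)" visits keys in an unspecified order
        for (k in count)
            printf "%s = %d\n", k, count[k]
    }' "$@"
}

# Reads stdin when no file names are given, just like the commands above.
printf '>demo\nATGATGCCC\n' | count_triplets | sort
```

For the tiny demo input this prints "ATG = 2" and "CCC = 1". Because the function forwards its arguments to awk, it accepts file names as well as piped input.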
I wrote this script as a gift for my friend Kevin, to whom I owe a lot, but I will post it on r/bioinformatics as well. The script is released under the Unlicense, so you can do whatever you want with it! In the future I will make more biotech stuff. I already have 4 projects brewing; you can view them in the list of my repositories.
Thanks, Chubak.