@Chubek
Last active October 20, 2023 13:12
Revisions

  1. Chubek revised this gist Oct 20, 2023. 1 changed file with 10 additions and 2 deletions.
    12 changes: 10 additions & 2 deletions README.md
    @@ -1,4 +1,12 @@
    The file `triplet.awk` is a nucleotide triplet counter written in AWK. For the uninitiated, AWK is a language designed by Aho, Weinberger, and Kernighan at Bell Labs. Today there are many AWK interpreters, and I am making my own; you can view the project [here](https://github.com/Chubek/Squawk).
    The file `triplet.awk` is a nucleotide triplet counter written in AWK. It's very simple, and was generated with Python. For the uninitiated, AWK is a language designed by Aho, Weinberger, and Kernighan at Bell Labs. Today there are many AWK interpreters, and I am making my own AWK interpreter; you can view the project [here](https://github.com/Chubek/Squawk).

    First:

    ```
    wget https://gist.githubusercontent.com/Chubek/db779e5d92aadccbca1a8a25b4a55e20/raw/f3256185f4a43920b031af29dc9e45cdc1da9871/triplet.awk
    ```

    ... to download the script file.

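    `triplet.awk` does not care whether your file is FASTA, FASTQ, or anything else. It tests every input line against each of the 64 triplets, counts the lines in which each triplet appears (a triplet that occurs several times on one line is counted once), and prints the totals at the end. Imagine we have this FASTA file: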

    @@ -106,6 +114,6 @@ You could also:
    awk -f triplet.awk seq.fa [..more files can be added here]
    ```

    This script was written as a gift for my friend Kevin, to whom I owe a lot, but I will post it on r/bioinformatics as well. The script is released under the Unlicense, so you can do whatever you want with it!
    This script was written as a gift for my friend Kevin, to whom I owe a lot, but I will post it on r/bioinformatics as well. The script is released under the Unlicense, so you can do whatever you want with it! In the future I will make more biotech tools. I already have 4 projects brewing; you can view them in the list of my repositories.

    Thanks, Chubak.
  2. Chubek created this gist Oct 20, 2023.
    111 changes: 111 additions & 0 deletions README.md
    @@ -0,0 +1,111 @@
    The file `triplet.awk` is a nucleotide triplet counter written in AWK. For the uninitiated, AWK is a language designed by Aho, Weinberger, and Kernighan at Bell Labs. Today there are many AWK interpreters, and I am making my own; you can view the project [here](https://github.com/Chubek/Squawk).

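    `triplet.awk` does not care whether your file is FASTA, FASTQ, or anything else. It tests every input line against each of the 64 triplets, counts the lines in which each triplet appears (a triplet that occurs several times on one line is counted once), and prints the totals at the end. Imagine we have this FASTA file: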

    ```
    >ref|NC_005213.1|:c879-490883
    ATGCGATTGCTATTAGAACTTAAAGCCCTAAATAGCATAGATAAAAAACAATTATCTAACTATCTAATAC
    AAGGTTTCATTTATAATATATTAAAAAATACCGAATACTCTTGGTTACATAATTGGAAGAAAGAGAAATA
    TTTTAATTTTACCTTAATCCCAAAAAAAGACATAATAGAGAATAAGAGGTATTATTTAATCATATCTTCG
    CCCGATAAAAGGTTTATAGAGGTTTTGCATAATAAAATAAAAGATTTAGATATAATAACTATTGGTTTGG
    CTCAATTCCAATTAAGGAAAACAAAAAAATTCGATCCAAAATTAAGATTTCCTTGGGTTACTATAACTCC
    TATAGTATTAAGGGAAGGCAAAATAGTAATATTAAAAGGGGACAAATACTATAAGGTATTTGTTAAGCGA
    TTGGAAGAATTGAAAAAGTATAATCTAATAAAAAAGAAAGAGCCCATTTTAGAAGAACCCATAGAGATTA
    GTTTAAACCAAATCAAAGATGGATGGAAAATTATAGATGTAAAAGATAGGTATTACGATTTTAGAAATAA
    GAGTTTTAGCGCTTTTTCTAATTGGTTGCGAGACCTAAAAGAGCAAAGCTTAAGAAAATATAATAATTTC
    TGTGGGAAGAACTTTTATTTTGAAGAAGCAATATTCGAAGGTTTTACTTTCTATAAAACAGTATCAATAA
    GAATAAGAATTAACAGAGGGGAAGCAGTATATATAGGCACATTATGGAAAGAGTTAAATGTTTATAGAAA
    ATTAGACAAAGAAGAAAGAGAATTTTACAAATTTTTGTACGATTGCGGTTTGGGTTCATTAAATTCTATG
    GGTTTTGGGTTTGTTAATACAAAAAAGAACTCTGCGAGATAA
    >ref|NC_005213.1|:883-2691
    ATGAAAAAGCCCCAACCCTATAAAGATGAAGAGATATATTCTATTTTAGAAGAGCCCGTAAAACAATGGT
    TTAAAGAGAAATACAAAACATTCACTCCCCCACAAAGGTATGCAATAATGGAAATACATAAAAGGAACAA
    TGTTTTAATTTCTTCCCCCACAGGTTCGGGAAAAACGTTAGCAGCGTTTTTAGCTATAATAAATGAATTA
    ATAAAGTTATCTCATAAAGGAAAATTAGAAAATAGAGTTTATGCCATTTATGTTTCTCCATTAAGAAGTT
    TAAATAACGATGTAAAGAAAAACTTAGAAACTCCATTAAAAGAAATAAAAGAAAAAGCGAAAGAGCTTAA
    ```

    We can save this in a file called `seq.fa` and then:

    ```
    cat seq.fa | awk -f triplet.awk
    ```

    We will see:

    ```
    AAA = 18
    AAC = 12
    AAT = 18
    AAG = 17
    ACA = 12
    ACC = 6
    ACT = 9
    ACG = 4
    ATA = 17
    ATC = 7
    ATT = 17
    ATG = 9
    AGA = 16
    AGC = 9
    AGT = 6
    AGG = 11
    CAA = 11
    CAC = 3
    CAT = 10
    CAG = 3
    CCA = 9
    CCC = 7
    CCT = 5
    CCG = 3
    CTA = 10
    CTC = 6
    CTT = 8
    CTG = 1
    CGA = 11
    CGC = 1
    CGT = 2
    CGG = 2
    TAA = 18
    TAC = 10
    TAT = 16
    TAG = 13
    TCA = 8
    TCC = 6
    TCT = 11
    TCG = 4
    TTA = 18
    TTC = 10
    TTT = 15
    TTG = 10
    TGA = 4
    TGC = 7
    TGT = 9
    TGG = 12
    GAA = 17
    GAC = 4
    GAT = 9
    GAG = 12
    GCA = 8
    GCC = 4
    GCT = 4
    GCG = 7
    GTA = 10
    GTC = 0
    GTT = 12
    GTG = 1
    GGA = 10
    GGC = 2
    GGT = 13
    GGG = 7
    ```
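
Each rule in `triplet.awk` fires at most once per input line, which is why no count above exceeds 18, the number of sequence lines in the example. If you need every overlapping occurrence instead, a minimal sketch could look like the following. It is not part of the gist: the function name `count_triplets_sliding`, the choice to skip `>` header lines, and the sorted output order are all assumptions made for illustration.

```sh
# Hypothetical helper, not part of triplet.awk: counts every overlapping
# triplet occurrence on each line instead of counting matching lines.
count_triplets_sliding() {
    awk '
        /^>/ { next }                          # assumption: skip FASTA header lines
        {
            seq = toupper($0)
            # slide a 3-character window across the line
            for (i = 1; i + 2 <= length(seq); i++) {
                t = substr(seq, i, 3)
                if (t ~ /^[ACGT][ACGT][ACGT]$/)    # ignore N, gaps, and other symbols
                    count[t]++
            }
        }
        END {
            # only triplets that were actually seen are printed
            for (t in count)
                printf "%s = %d\n", t, count[t]
        }
    ' "$@" | sort
}
```

It can be called as `count_triplets_sliding seq.fa`, or fed from a pipe. Like the original script, this sketch works line by line, so a triplet that straddles a line break is not counted.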

    You could also:

    ```
    awk -f triplet.awk seq.fa [..more files can be added here]
    ```
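
Because the script scans every line, FASTA header lines and FASTQ quality lines are tested too, so a header or quality string that happens to contain, say, `GAT` adds to that count. If that matters for your data, one option is to pass only the sequence lines to the script. The sketch below assumes FASTA headers start with `>` and that FASTQ records are exactly four lines each (no wrapped sequences); the helper names `fasta_seq_lines` and `fastq_seq_lines` are hypothetical.

```sh
# Hypothetical filters, not part of the gist.

# Print every line that is not a FASTA header.
fasta_seq_lines() {
    awk '!/^>/' "$@"
}

# Print line 2 of every 4-line FASTQ record (the sequence line).
fastq_seq_lines() {
    awk 'NR % 4 == 2' "$@"
}
```

For example, `fasta_seq_lines seq.fa | awk -f triplet.awk` counts only the sequence lines of `seq.fa`.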

    This script was written as a gift for my friend Kevin, to whom I owe a lot, but I will post it on r/bioinformatics as well. The script is released under the Unlicense, so you can do whatever you want with it!

    Thanks, Chubak.
    128 changes: 128 additions & 0 deletions triplet.awk
    @@ -0,0 +1,128 @@
    /AAA/{ aaa++; }
    /AAC/{ aac++; }
    /AAT/{ aat++; }
    /AAG/{ aag++; }
    /ACA/{ aca++; }
    /ACC/{ acc++; }
    /ACT/{ act++; }
    /ACG/{ acg++; }
    /ATA/{ ata++; }
    /ATC/{ atc++; }
    /ATT/{ att++; }
    /ATG/{ atg++; }
    /AGA/{ aga++; }
    /AGC/{ agc++; }
    /AGT/{ agt++; }
    /AGG/{ agg++; }
    /CAA/{ caa++; }
    /CAC/{ cac++; }
    /CAT/{ cat++; }
    /CAG/{ cag++; }
    /CCA/{ cca++; }
    /CCC/{ ccc++; }
    /CCT/{ cct++; }
    /CCG/{ ccg++; }
    /CTA/{ cta++; }
    /CTC/{ ctc++; }
    /CTT/{ ctt++; }
    /CTG/{ ctg++; }
    /CGA/{ cga++; }
    /CGC/{ cgc++; }
    /CGT/{ cgt++; }
    /CGG/{ cgg++; }
    /TAA/{ taa++; }
    /TAC/{ tac++; }
    /TAT/{ tat++; }
    /TAG/{ tag++; }
    /TCA/{ tca++; }
    /TCC/{ tcc++; }
    /TCT/{ tct++; }
    /TCG/{ tcg++; }
    /TTA/{ tta++; }
    /TTC/{ ttc++; }
    /TTT/{ ttt++; }
    /TTG/{ ttg++; }
    /TGA/{ tga++; }
    /TGC/{ tgc++; }
    /TGT/{ tgt++; }
    /TGG/{ tgg++; }
    /GAA/{ gaa++; }
    /GAC/{ gac++; }
    /GAT/{ gat++; }
    /GAG/{ gag++; }
    /GCA/{ gca++; }
    /GCC/{ gcc++; }
    /GCT/{ gct++; }
    /GCG/{ gcg++; }
    /GTA/{ gta++; }
    /GTC/{ gtc++; }
    /GTT/{ gtt++; }
    /GTG/{ gtg++; }
    /GGA/{ gga++; }
    /GGC/{ ggc++; }
    /GGT/{ ggt++; }
    /GGG/{ ggg++; }
    END { printf "AAA = %d\n",aaa; }
    END { printf "AAC = %d\n",aac; }
    END { printf "AAT = %d\n",aat; }
    END { printf "AAG = %d\n",aag; }
    END { printf "ACA = %d\n",aca; }
    END { printf "ACC = %d\n",acc; }
    END { printf "ACT = %d\n",act; }
    END { printf "ACG = %d\n",acg; }
    END { printf "ATA = %d\n",ata; }
    END { printf "ATC = %d\n",atc; }
    END { printf "ATT = %d\n",att; }
    END { printf "ATG = %d\n",atg; }
    END { printf "AGA = %d\n",aga; }
    END { printf "AGC = %d\n",agc; }
    END { printf "AGT = %d\n",agt; }
    END { printf "AGG = %d\n",agg; }
    END { printf "CAA = %d\n",caa; }
    END { printf "CAC = %d\n",cac; }
    END { printf "CAT = %d\n",cat; }
    END { printf "CAG = %d\n",cag; }
    END { printf "CCA = %d\n",cca; }
    END { printf "CCC = %d\n",ccc; }
    END { printf "CCT = %d\n",cct; }
    END { printf "CCG = %d\n",ccg; }
    END { printf "CTA = %d\n",cta; }
    END { printf "CTC = %d\n",ctc; }
    END { printf "CTT = %d\n",ctt; }
    END { printf "CTG = %d\n",ctg; }
    END { printf "CGA = %d\n",cga; }
    END { printf "CGC = %d\n",cgc; }
    END { printf "CGT = %d\n",cgt; }
    END { printf "CGG = %d\n",cgg; }
    END { printf "TAA = %d\n",taa; }
    END { printf "TAC = %d\n",tac; }
    END { printf "TAT = %d\n",tat; }
    END { printf "TAG = %d\n",tag; }
    END { printf "TCA = %d\n",tca; }
    END { printf "TCC = %d\n",tcc; }
    END { printf "TCT = %d\n",tct; }
    END { printf "TCG = %d\n",tcg; }
    END { printf "TTA = %d\n",tta; }
    END { printf "TTC = %d\n",ttc; }
    END { printf "TTT = %d\n",ttt; }
    END { printf "TTG = %d\n",ttg; }
    END { printf "TGA = %d\n",tga; }
    END { printf "TGC = %d\n",tgc; }
    END { printf "TGT = %d\n",tgt; }
    END { printf "TGG = %d\n",tgg; }
    END { printf "GAA = %d\n",gaa; }
    END { printf "GAC = %d\n",gac; }
    END { printf "GAT = %d\n",gat; }
    END { printf "GAG = %d\n",gag; }
    END { printf "GCA = %d\n",gca; }
    END { printf "GCC = %d\n",gcc; }
    END { printf "GCT = %d\n",gct; }
    END { printf "GCG = %d\n",gcg; }
    END { printf "GTA = %d\n",gta; }
    END { printf "GTC = %d\n",gtc; }
    END { printf "GTT = %d\n",gtt; }
    END { printf "GTG = %d\n",gtg; }
    END { printf "GGA = %d\n",gga; }
    END { printf "GGC = %d\n",ggc; }
    END { printf "GGT = %d\n",ggt; }
    END { printf "GGG = %d\n",ggg; }
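
The 128 generated rules above can also be expressed with a loop. The following is only a sketch of an equivalent, not the author's code: it keeps the same per-line semantics (each triplet is counted once per line that contains it) and the same A, C, T, G output order, and the function name `count_triplet_lines` is a hypothetical wrapper.

```sh
# Hypothetical compact equivalent of triplet.awk.
count_triplet_lines() {
    awk '
        BEGIN {
            # build the 64 triplets in the same A, C, T, G order as triplet.awk
            split("A C T G", base, " ")
            for (i = 1; i <= 4; i++)
                for (j = 1; j <= 4; j++)
                    for (k = 1; k <= 4; k++)
                        triplet[++n] = base[i] base[j] base[k]
        }
        {
            # index() looks for a literal substring; each line adds at most 1 per triplet
            for (m = 1; m <= n; m++)
                if (index($0, triplet[m]))
                    count[triplet[m]]++
        }
        END {
            # unseen triplets print as 0, as in the original
            for (m = 1; m <= n; m++)
                printf "%s = %d\n", triplet[m], count[triplet[m]]
        }
    ' "$@"
}
```

Usage mirrors the original: `count_triplet_lines seq.fa` or `cat seq.fa | count_triplet_lines`.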