awk1.knit

2021-08-29

Brainstorming

I have a FASTA file with the following sequence headers (see my previous post).


>NC_030986.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 1, whole genome shotgun sequence
>NC_030987.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 2, whole genome shotgun sequence
>NC_030988.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 3, whole genome shotgun sequence
>NC_030989.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 4, whole genome shotgun sequence
>NC_030990.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 5, whole genome shotgun sequence
>NC_030991.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 6, whole genome shotgun sequence
>NC_030992.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 7, whole genome shotgun sequence
>NC_030993.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 8, whole genome shotgun sequence
>NC_030994.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 9, whole genome shotgun sequence
>NC_030995.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 10, whole genome shotgun sequence
>NC_030996.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 11, whole genome shotgun sequence
>NC_030997.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 12, whole genome shotgun sequence
>NC_030998.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 13, whole genome shotgun sequence
>NC_030999.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 14, whole genome shotgun sequence
>NC_031000.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 15, whole genome shotgun sequence

I want to clean the sequence headers by removing everything after the accession number (NC_xxxxxx.1).

Solution


awk -F' ' '{print $1}' Fusarium_oxysporum_fsp_lycopersici_15_chromosomes.fasta > Fusarium_oxysporum_fsp_lycopersici_15_chromosomes_header_cleaned.fasta
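The command above prints the first field of every line, sequence lines included. That is harmless as long as the sequence lines contain no whitespace. A slightly stricter variant, sketched here on a tiny made-up file (the file names and sequences are only for illustration), truncates the header lines and passes everything else through untouched:

```shell
# Build a tiny two-record FASTA file (made-up sequences, for illustration only).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>NC_030986.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 1, whole genome shotgun sequence' \
  'ACGTACGTAC' \
  '>NC_030987.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 2, whole genome shotgun sequence' \
  'TTGACCA' > "$tmpdir/demo.fasta"

# Header lines (starting with ">"): keep only the first whitespace-delimited field.
# All other lines: print as they are.
awk '/^>/ {print $1; next} {print}' "$tmpdir/demo.fasta" > "$tmpdir/demo_cleaned.fasta"

cat "$tmpdir/demo_cleaned.fasta"
```

For a well-formed FASTA file both versions should give the same result; the variant only makes the intent explicit.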

Let’s check


$ grep "^>" Fusarium_oxysporum_fsp_lycopersici_15_chromosomes_header_cleaned.fasta

The result is:


>NC_030986.1
>NC_030987.1
>NC_030988.1
>NC_030989.1
>NC_030990.1
>NC_030991.1
>NC_030992.1
>NC_030993.1
>NC_030994.1
>NC_030995.1
>NC_030996.1
>NC_030997.1
>NC_030998.1
>NC_030999.1
>NC_031000.1

Great!
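Looking at the headers is enough here, but a few automatic checks can be reassuring on bigger files. The sketch below uses a hypothetical helper, `check_cleaned` (my own name, not a standard tool), that compares an original file with its cleaned copy. It is demonstrated on a small made-up file:

```shell
# Hypothetical helper: compare an original FASTA file with its header-cleaned copy.
check_cleaned() {
  orig=$1
  cleaned=$2
  # 1. Same number of records before and after.
  [ "$(grep -c '^>' "$orig")" -eq "$(grep -c '^>' "$cleaned")" ] || return 1
  # 2. No header in the cleaned file still contains a space.
  if grep '^>' "$cleaned" | grep -q ' '; then
    return 1
  fi
  # 3. The sequence lines are identical in both files.
  [ "$(grep -v '^>' "$orig" | cksum)" = "$(grep -v '^>' "$cleaned" | cksum)" ]
}

# Demo on a small made-up file.
tmpdir=$(mktemp -d)
printf '%s\n' '>seq1 some description' 'ACGT' '>seq2 another description' 'GGCC' > "$tmpdir/in.fasta"
awk '{print $1}' "$tmpdir/in.fasta" > "$tmpdir/out.fasta"

check_cleaned "$tmpdir/in.fasta" "$tmpdir/out.fasta" && echo "headers cleaned, sequences untouched"
```

The same function can be pointed at the real input and output files from the Solution step.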

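In summary, the syntax is:


$ awk -F'separator' '{print $1}' your.file > your_output

This prints the first field of every line, that is, the text before the first occurrence of the separator; everything after it is dropped. With -F' ' (awk's default), fields are split on runs of spaces and tabs. Lines that do not contain the separator, such as the sequence lines here, are printed unchanged.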
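As a small illustration of the same pattern with a different separator, here is a sketch using made-up input strings (the variable names are mine). The first call cuts at the first comma; the second relies on awk's default whitespace splitting, which also ignores leading blanks:

```shell
# Comma as separator: keep the text before the first comma.
before_comma=$(printf '%s\n' 'chromosome 1, whole genome shotgun sequence' | awk -F',' '{print $1}')
echo "$before_comma"   # chromosome 1

# Default separator: $1 is the first word, leading spaces are skipped.
first_word=$(printf '%s\n' '   NC_030986.1   Fusarium oxysporum' | awk '{print $1}')
echo "$first_word"     # NC_030986.1
```

Changing `$1` to `$2` would print the second field instead of the first.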