Brainstorming
I have a FASTA file with the following sequence headers (see my previous post):
>NC_030986.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 1, whole genome shotgun sequence
>NC_030987.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 2, whole genome shotgun sequence
>NC_030988.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 3, whole genome shotgun sequence
>NC_030989.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 4, whole genome shotgun sequence
>NC_030990.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 5, whole genome shotgun sequence
>NC_030991.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 6, whole genome shotgun sequence
>NC_030992.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 7, whole genome shotgun sequence
>NC_030993.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 8, whole genome shotgun sequence
>NC_030994.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 9, whole genome shotgun sequence
>NC_030995.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 10, whole genome shotgun sequence
>NC_030996.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 11, whole genome shotgun sequence
>NC_030997.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 12, whole genome shotgun sequence
>NC_030998.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 13, whole genome shotgun sequence
>NC_030999.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 14, whole genome shotgun sequence
>NC_031000.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 15, whole genome shotgun sequence
I want to clean each sequence header by removing everything after the accession number (NC_xxxxxx.1).
Solution
$ awk -F' ' '{print $1}' Fusarium_oxysporum_fsp_lycopersici_15_chromosomes.fasta > Fusarium_oxysporum_fsp_lycopersici_15_chromosomes_header_cleaned.fasta
Let’s check
$ grep "^>" Fusarium_oxysporum_fsp_lycopersici_15_chromosomes_header_cleaned.fasta
The result is:
>NC_030986.1
>NC_030987.1
>NC_030988.1
>NC_030989.1
>NC_030990.1
>NC_030991.1
>NC_030992.1
>NC_030993.1
>NC_030994.1
>NC_030995.1
>NC_030996.1
>NC_030997.1
>NC_030998.1
>NC_030999.1
>NC_031000.1
Great!
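Note that the awk command keeps the first field of every line, not only of the header lines. That is harmless here because the sequence lines contain no spaces, but a more defensive variant would touch only lines starting with ">". Here is a minimal sketch of that idea; the file names demo.fasta and demo_cleaned.fasta and the two short sequences are made up for illustration and are not the real genome data.

```shell
# Build a tiny two-record FASTA file (hypothetical demo data).
printf '%s\n' \
  '>NC_030986.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 1, whole genome shotgun sequence' \
  'ACGTACGT' \
  '>NC_030987.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 2, whole genome shotgun sequence' \
  'TTGACCAA' > demo.fasta

# Header lines (starting with ">"): print only the first whitespace-separated field.
# All other lines: print unchanged.
awk '/^>/ {print $1; next} {print}' demo.fasta > demo_cleaned.fasta

cat demo_cleaned.fasta
```

With this form, a sequence or comment line that happened to contain a space would be left intact.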
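In summary, the syntax is:
$ awk -F'the_character_of_your_interest' '{print $1}' your.file > your_output
For every line, this prints only the text before the first occurrence of the specified character and drops everything from that character onward. Lines that do not contain the character are printed unchanged.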
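As a quick illustration of using a different character, the sketch below cuts one of the headers at the comma instead of at the first space, which keeps the organism and chromosome name. The variable names header and trimmed are only for this example.

```shell
# One of the original headers, stored in a variable for the demo.
header='>NC_030986.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 1, whole genome shotgun sequence'

# Use the comma as the field separator and keep the first field.
trimmed=$(printf '%s\n' "$header" | awk -F',' '{print $1}')

echo "$trimmed"
# prints: >NC_030986.1 Fusarium oxysporum f. sp. lycopersici 4287 chromosome 1
```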