<![CDATA[BOL: Python script to extract a protein sequence from a genome using a General Feature Format (GFF) file !]]>

<![CDATA[BOL: Python script to extract a protein sequence from a genome using a General Feature Format (GFF) file !]]> https://bioinformaticsonline.com/snippets/view/44448/python-script-to-extract-a-protein-sequence-from-a-genome-using-a-general-feature-format-gff-file? https://bioinformaticsonline.com/snippets/view/44448/python-script-to-extract-a-protein-sequence-from-a-genome-using-a-general-feature-format-gff-file Thu, 01 Feb 2024 01:48:49 -0600 https://bioinformaticsonline.com/snippets/view/44448/python-script-to-extract-a-protein-sequence-from-a-genome-using-a-general-feature-format-gff-file <![CDATA[Python script to extract a protein sequence from a genome using a General Feature Format (GFF) file !]]> #You typically need the corresponding genome sequence file in FASTA format. The GFF file contains information about the #features (such as genes) in the genome, including their locations and annotations. #The outline of the steps : #Parse the GFF file to extract information about the gene locations. #Use the gene locations to extract the corresponding DNA sequences from the genome in FASTA format. #Translate the DNA sequences into protein sequences. #Simple example using Python and Biopython from Bio import SeqIO from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq def extract_protein_sequence(gff_file, genome_fasta, gene_id): # Step 1: Parse the GFF file gene_locations = {} with open(gff_file, 'r') as gff: for line in gff: if not line.startswith('#'): fields = line.strip().split('\t') if fields[2] == 'gene': gene_id = fields[8].split(';')[0].split('=')[1] gene_locations[gene_id] = (int(fields[3]), int(fields[4])) # Step 2: Extract DNA sequence from the genome genome_record = SeqIO.read(genome_fasta, 'fasta') gene_start, gene_end = gene_locations[gene_id] gene_dna_sequence = genome_record.seq[gene_start - 1:gene_end] # Step 3: Translate DNA sequence into protein sequence gene_protein_sequence = gene_dna_sequence.translate() return gene_protein_sequence # Example usage gff_file = 'path/to/your/file.gff' genome_fasta = 'path/to/your/genome.fasta' gene_id_to_extract = 'your_gene_id' protein_sequence = extract_protein_sequence(gff_file, genome_fasta, gene_id_to_extract) print(protein_sequence)]]> LEGE