BOL: Python script to extract a protein sequence from a genome using a General Feature Format (GFF) file !

BioScripts
LEGE
Python script to extract a protein sequence from a genome using a General Feature Format (GFF) file...

Python script to extract a protein sequence from a genome using a General Feature Format (GFF) file !

By LEGE 435 days ago

#You typically need the corresponding genome sequence file in FASTA format. The GFF file contains information about the #features (such as genes) in the genome, including their locations and annotations. #The outline of the steps : #Parse the GFF file to extract information about the gene locations. #Use the gene locations to extract the corresponding DNA sequences from the genome in FASTA format. #Translate the DNA sequences into protein sequences. #Simple example using Python and Biopython from Bio import SeqIO from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq def extract_protein_sequence(gff_file, genome_fasta, gene_id): # Step 1: Parse the GFF file gene_locations = {} with open(gff_file, 'r') as gff: for line in gff: if not line.startswith('#'): fields = line.strip().split('\t') if fields[2] == 'gene': gene_id = fields[8].split(';')[0].split('=')[1] gene_locations[gene_id] = (int(fields[3]), int(fields[4])) # Step 2: Extract DNA sequence from the genome genome_record = SeqIO.read(genome_fasta, 'fasta') gene_start, gene_end = gene_locations[gene_id] gene_dna_sequence = genome_record.seq[gene_start - 1:gene_end] # Step 3: Translate DNA sequence into protein sequence gene_protein_sequence = gene_dna_sequence.translate() return gene_protein_sequence # Example usage gff_file = 'path/to/your/file.gff' genome_fasta = 'path/to/your/genome.fasta' gene_id_to_extract = 'your_gene_id' protein_sequence = extract_protein_sequence(gff_file, genome_fasta, gene_id_to_extract) print(protein_sequence)

BOL

LEGE

Our Sponsors

Python script to extract a protein sequence from a genome using a General Feature Format (GFF) file !