Online Resources

These classes provide access to online resources. Data is usually accessed during the creation of extended models. Organism-specific parameters are defined in the XBA configuration file, sheet ‘general’. Data is read from the specified organism directory and downloaded from the online resource only if it cannot be found locally. Access to BioCyc online resources may require a subscription.
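The local-first behaviour described above can be sketched as follows. This is a minimal illustration of the pattern, not the f2xba implementation: the helper name load_local_or_fetch and the fetch callable are hypothetical.

```python
from pathlib import Path
from typing import Callable


def load_local_or_fetch(path: Path, fetch: Callable[[], str]) -> str:
    """Return the content of `path`; call `fetch` only if the file is missing.

    `fetch` is a hypothetical stand-in for a download from an online resource.
    """
    if path.exists():
        # Local copy found: no online access is needed.
        return path.read_text(encoding="utf-8")
    # No local copy: retrieve the data once and store it for later runs.
    content = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return content
```

With this pattern, a second call for the same path never triggers a download, which is why deleting the local file is the way to obtain fresh data.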

Data resources can also be accessed outside a model generation workflow.

UniProt

Access to protein-related information available in the UniProt knowledgebase.

class f2xba.uniprot.uniprot_data.UniprotData(organism_id, organism_dir)[source]

Access to data of UniProt online resource.

Enzyme constraint and resource balance constraint models require access to protein information during model construction. Protein related information can be retrieved from the UniProt online database (uniprot.org) using the taxonomic id (parameter organism_id) as reference.

If UniProt holds no data for the modelled organism, protein-related data must be compiled from other sources, using the data fields and format of a UniProt download, and stored in a file named uniprot_organism_<organism_id>.tsv under organism_dir.

Use the XBA configuration file, sheet general, to configure organism_id (the taxonomic id) and organism_dir (where downloaded data is stored locally). Delete the locally stored UniProt data to force retrieval from the online database.
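Forcing a fresh retrieval amounts to removing the local file before the next model construction. The sketch below assumes that the downloaded file uses the same name as the manually compiled file mentioned above (uniprot_organism_<organism_id>.tsv); both helper functions are hypothetical and not part of f2xba.

```python
from pathlib import Path


def uniprot_cache_path(organism_id, organism_dir):
    """Build the path of the locally stored UniProt table (assumed file name)."""
    return Path(organism_dir) / f"uniprot_organism_{organism_id}.tsv"


def force_refresh(organism_id, organism_dir):
    """Delete the local UniProt file, if present.

    Returns True if a file was removed, False if there was nothing to delete.
    """
    path = uniprot_cache_path(organism_id, organism_dir)
    if path.is_file():
        path.unlink()
        return True
    return False
```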

Example: Access UniProt data for E. coli K-12 MG1655 strain (taxonomic id: 83333).

from f2xba.uniprot.uniprot_data import UniprotData

uniprot_data = UniprotData(83333, 'data_ref')

gene = 'b0928'
uid = uniprot_data.locus2uid[gene]
uniprot_data.proteins[uid].__dict__
Parameters:
  • organism_id (int or str) – taxonomic identifier of modelled organism

  • organism_dir (str) – directory where UniProt data is stored

proteins

Protein related information extracted from UniProt.

locus2uid

Map gene locus to UniProt identifier of related protein.

NCBI

Access to genome information available in the NCBI database.

class f2xba.ncbi.ncbi_data.NcbiData(chromosome2accids, organism_dir)[source]

Access to data from NCBI nucleotide online resource.

Resource balance constraint models require access to genome data during model construction. Genome data can be referenced by GenBank or RefSeq accession identifiers. Select genome data sets that can be mapped to the gene identifiers used in the model under construction. As model genes may be located on different chromosomes, access to several chromosomes is supported.

Use the XBA configuration file, sheet general, to set chromosome2accids, which maps arbitrary chromosome ids to accession ids, and organism_dir, where downloaded data is stored locally. Delete locally stored NCBI data to force retrieval from the online database.

Example: Access chromosome data for E. coli K-12 MG1655 strain (accession id: U00096.3).

from f2xba.ncbi.ncbi_data import NcbiData

ncbi_data = NcbiData({'chromosome':'U00096.3'}, 'data_refs/ncbi')

gene = 'b0928'
ncbi_data.locus2record[gene].__dict__
Parameters:
  • chromosome2accids (dict(str, str)) – Map chromosome ids to accession ids

  • organism_dir (str) – directory where NCBI exports are stored

chromosomes

Chromosome related information.

locus2record

Map gene identifier to NCBI feature record.

locus2protein

Map gene identifier to NCBI protein sequence information.

label2locus

Map gene label to NCBI locus identifiers.

modify_attributes(df_modify_attrs)[source]

Modify attribute values of NCBI feature records, e.g. update ‘locus’ or ‘old_locus’ attributes to improve mapping with model loci.

Parameters:

df_modify_attrs (pandas.DataFrame) – table with ‘attribute’, ‘value’ columns and index set to gene locus
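A table of the described shape could be built as shown below. This is only a sketch: the locus tags and values are placeholders, and the index name gene is an assumption, since the text only requires the index to hold the gene locus.

```python
import pandas as pd

# Placeholder corrections; locus tags and values are illustrative only.
rows = [
    {"gene": "locus_A", "attribute": "old_locus", "value": "model_locus_A"},
    {"gene": "locus_B", "attribute": "locus", "value": "model_locus_B"},
]

# Index is the gene locus; the columns are 'attribute' and 'value'.
df_modify_attrs = pd.DataFrame(rows).set_index("gene")
```

The resulting table would then be passed as ncbi_data.modify_attributes(df_modify_attrs).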

get_gc_content(chromosome_id=None)[source]

Retrieve GC content across all chromosomes or one specific chromosome.

Parameters:

chromosome_id (str) – (optional) specific chromosome id

Returns:

GC content

Return type:

float
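For orientation, GC content is the share of G and C bases in a nucleotide sequence. The sketch below mirrors the optional chromosome_id argument but is not the f2xba code: it assumes sequences are given as plain strings, pools base counts over all selected chromosomes, and returns a fraction between 0 and 1. Whether the library pools in the same way or reports a fraction rather than a percentage is not stated here.

```python
def gc_content(chromosomes, chromosome_id=None):
    """Return the pooled GC fraction of one or all chromosome sequences.

    `chromosomes` maps chromosome ids to DNA sequences (hypothetical input).
    """
    if chromosome_id is None:
        selected = list(chromosomes.values())
    else:
        selected = [chromosomes[chromosome_id]]

    gc = at = 0
    for sequence in selected:
        seq = sequence.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no A, C, G or T bases found")
    # Ambiguous symbols such as N are ignored in numerator and denominator.
    return gc / (gc + at)
```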

get_mrna_avg_composition(chromosome_id=None)[source]

Retrieve average mRNA composition across all chromosomes or one specific chromosome.

Parameters:

chromosome_id (str) – (optional) specific chromosome id

Returns:

relative mRNA nucleotide composition

Return type:

dict(str,float)
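A relative nucleotide composition of this kind can be illustrated as follows. The sketch assumes coding-strand DNA sequences as input and pools nucleotide counts over all transcripts; how f2xba weights individual transcripts is not documented here, so treat this as an illustration of the returned data shape only.

```python
from collections import Counter


def mrna_avg_composition(coding_sequences):
    """Return relative frequencies of A, C, G and U over all given sequences.

    `coding_sequences` is an iterable of coding-strand DNA strings
    (hypothetical input); T is transcribed to U.
    """
    counts = Counter()
    for dna in coding_sequences:
        mrna = dna.upper().replace("T", "U")
        counts.update(nt for nt in mrna if nt in "ACGU")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no nucleotides found")
    return {nt: counts[nt] / total for nt in "ACGU"}
```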

BioCyc

Access to enzyme information available in the BioCyc database. A subscription may be required to access information for certain organisms. Access to BioCyc organism-specific information should only be configured in the XBA configuration file if the data is of high quality for the organism in question.

class f2xba.biocyc.biocyc_data.BiocycData(org_prefix, biocyc_dir)[source]

Access to enzyme related data from BioCyc online database.

Access BioCyc data via BioVelo query, see https://biocyc.org/web-services.shtml

Enzyme constraint and resource balance constraint models require information on enzyme composition during model construction. By default, enzymes are composed of one copy per gene product, derived from the gene product reaction rule of each reaction in the SBML model. Alternatively, enzyme composition can be retrieved from the BioCyc online resource or loaded from an enzyme composition configuration file.
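The default rule of one copy per gene product can be sketched as follows. This is a simplified illustration, not the f2xba parser: it assumes the rule is written as 'or' alternatives of 'and' groups without nested parentheses, and it treats each 'or' alternative as a separate isoenzyme, which is an assumption not stated above.

```python
import re


def default_enzyme_compositions(gpr):
    """Return one composition dict per 'or' alternative of a gene rule.

    Every gene product in an 'and' group gets a stoichiometry of 1.0.
    Rules such as '(a or b) and c' are not supported by this sketch.
    """
    enzymes = []
    for alternative in re.split(r"\s+or\s+", gpr.strip(), flags=re.IGNORECASE):
        genes = re.split(r"\s+and\s+", alternative.strip("() "), flags=re.IGNORECASE)
        composition = {g.strip("() "): 1.0 for g in genes if g.strip("() ")}
        if composition:
            enzymes.append(composition)
    return enzymes
```

For example, the rule '(b0001 and b0002) or b0003' yields two enzymes, a two-subunit complex and a single-gene enzyme. Real subunit stoichiometries, such as homodimers, are not recoverable from the rule alone, which is the gap that BioCyc data or a composition file fills.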

Enzyme composition derived from BioCyc is suitable for highly curated organism databases, such as E. coli K-12. Depending on the organism, access to BioCyc resources requires a paid BioCyc subscription.

Initially, enzyme composition data can be retrieved from BioCyc and exported to file using XbaModel.export_enz_composition(). The exported enzyme composition can then be manually adjusted and reused for subsequent model creations.

The organism in the BioCyc database is referenced by org_prefix. To use enzyme composition data from BioCyc, set the corresponding parameters in the XBA configuration file, sheet general. The parameter biocyc_org_prefix references the organism in BioCyc, the parameter organism_dir specifies the local download directory. Delete locally stored BioCyc files to force retrieval from the online database. Enzyme composition can be loaded from file by configuring the parameter enzyme_comp_fname in the XBA parameter file, sheet general.

Example: Access enzyme configuration for E. coli K-12 MG1655 strain (org_prefix: ecoli).

from f2xba.biocyc.biocyc_data import BiocycData

biocyc_data = BiocycData('ecoli', 'data_refs/biocyc')

gene = 'b0928'
bc_gene = biocyc_data.locus2gene[gene]
bc_protein = biocyc_data.genes[bc_gene].proteins[0]
biocyc_data.proteins[bc_protein].__dict__
Parameters:
  • org_prefix (str) – BioCyc organism reference

  • biocyc_dir (str) – directory name, where downloads of BioCyc are stored

genes

BioCyc gene related data, referenced by BioCyc gene id.

locus2gene

Map gene id (locus) to BioCyc gene id.

proteins

BioCyc protein related data, referenced by BioCyc protein id.

rnas

BioCyc RNA related data, referenced by BioCyc RNA id.

get_gene_composition(protein_id)[source]

Retrieve gene composition of an enzyme (BioCyc protein).

biocyc_data.get_gene_composition('ASPAMINOTRANS-MONOMER')
Parameters:

protein_id (str) – BioCyc protein identifier

Returns:

gene composition of enzyme

Return type:

dict(str, float)
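One possible use of such a result is to estimate the mass of an enzyme complex. The sketch assumes the returned dict maps gene loci to subunit copy numbers; the helper enzyme_mass and the per-protein masses are hypothetical and would in practice come from another source such as UniProt.

```python
def enzyme_mass(gene_composition, protein_mass_kda):
    """Sum subunit masses weighted by copy number.

    `gene_composition` maps gene locus to copy number (assumed meaning),
    `protein_mass_kda` maps gene locus to protein mass in kDa (hypothetical).
    """
    return sum(
        copies * protein_mass_kda[locus]
        for locus, copies in gene_composition.items()
    )


# Illustrative values only, not taken from BioCyc or UniProt.
composition = {"gene_1": 2.0, "gene_2": 1.0}
masses = {"gene_1": 10.0, "gene_2": 5.5}
total_mass = enzyme_mass(composition, masses)
```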

export_enzyme_composition(fname)[source]

Export BioCyc enzyme composition to Excel spreadsheet.

biocyc_data.export_enzyme_composition('BioCyc_enz_composition.xlsx')
Parameters:

fname (str) – file name of export file with extension ‘.xlsx’