Example on E. coli Core Model

This page shows a step-by-step example of the WILDkCAT pipeline on the E. coli core model.

Info

The parameter output_folder specifies the directory where all files generated by the pipeline are stored.
This design keeps all results in a single location, with files added progressively as each step runs.
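As a quick way to see how far a run has progressed, the sketch below checks which of the files named in this tutorial already exist in the output folder. The helper pipeline_progress is hypothetical and not part of WILDkCAT, and the assumption that kcat_retrieved.tsv and kcat_full.tsv sit directly in the output folder is ours.

```python
from pathlib import Path

# File names are taken from this tutorial; their exact locations inside
# the output folder are an assumption.
EXPECTED_FILES = [
    "kcat.tsv",
    "kcat_retrieved.tsv",
    "machine_learning/catapro_input.csv",
    "machine_learning/catapro_input_substrates_to_smiles.tsv",
    "kcat_full.tsv",
]


def pipeline_progress(output_folder):
    """Return {relative file name: True/False} for each expected file."""
    root = Path(output_folder)
    return {name: (root / name).is_file() for name in EXPECTED_FILES}
```

Calling pipeline_progress("output") after each step should show one or two more entries switching to True.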


Prerequisites

Note

All the files used and created in this tutorial are available in the output folder of the WILDkCAT repository.


1 — Extract kcat values from E. coli core model

First, create a TSV file that lists every combination of reaction, enzyme, and substrate(s) in the model.

Each row corresponds to a unique combination of reaction, enzyme, and substrate(s) and will be used to retrieve experimental kcat values from BRENDA and SABIO-RK in the next step. The output file is named kcat.tsv and is saved in the specified output folder.

from wildkcat import run_extraction

run_extraction(
    model_path="model/e_coli_core.json",
    output_folder="output"
)

Example of the output file kcat.tsv:

| rxn | rxn_kegg | ec_code | direction | substrates_name | substrates_kegg | products_name | products_kegg | genes | uniprot | catalytic_enzyme | warning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PFK | | 2.7.1.11 | forward | ATP C10H12N5O13P3;D-Fructose 6-phosphate | C00002;C05345 | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | C00008;C00354;C00080 | b3916 | P0A796 | P0A796 | |
| ALCD2x | R00754 | 1.1.1.71 | forward | Ethanol;Nicotinamide adenine dinucleotide | C00469;C00003 | Acetaldehyde;H+;Nicotinamide adenine dinucleotide - reduced | C00084;C00080;C00004 | b0356 | P25437 | P25437 | |
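To inspect the extraction output programmatically, a minimal sketch using only the Python standard library could look like the following. It assumes that kcat.tsv is tab-separated with the header row shown above and that multi-valued fields use ";" as separator; the helpers load_kcat_table and split_field are hypothetical, not part of WILDkCAT.

```python
import csv


def load_kcat_table(path):
    """Read a tab-separated kcat table into a list of dicts (one per row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def split_field(value):
    """Split a ';'-separated field into a list, dropping empty items."""
    return [item.strip() for item in (value or "").split(";") if item.strip()]
```

For example, load_kcat_table("output/kcat.tsv") returns one dictionary per row, and split_field(row["substrates_kegg"]) gives the list of KEGG compound identifiers for that row.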

View the generated report


2 — Retrieve experimental kcat values from BRENDA and SABIO-RK

This step searches the BRENDA and/or SABIO-RK databases for experimentally measured turnover numbers (kcat values) matching the entries extracted in the previous step. The retrieved values are filtered by organism, temperature, and pH, and the closest matching kcat value for each entry is saved to the output file.

from wildkcat import run_retrieval

run_retrieval(
    output_folder="output",
    organism="Escherichia coli",
    temperature_range=(20, 40),
    pH_range=(6.5, 7.5),
    database='both'
    )

Example of the output file kcat_retrieved.tsv:

| rxn | rxn_kegg | ec_code | direction | substrates_name | substrates_kegg | products_name | products_kegg | genes | uniprot | catalytic_enzyme | warning | kcat | matching_score | kcat_substrate | kcat_organism | kcat_enzyme | kcat_temperature | kcat_ph | kcat_variant | kcat_db | kcat_id_percent | kcat_organism_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PFK | | 2.7.1.11 | forward | ATP C10H12N5O13P3;D-Fructose 6-phosphate | C00002;C05345 | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | C00008;C00354;C00080 | b3916 | P0A796 | P0A796 | | 0.016 | 1 | D-fructose 6-phosphate | Escherichia coli | P0A796 | 30.0 | 7.2 | | brenda | 100.0 | 0.0 |
| ALCD2x | R00754 | 1.1.1.71 | forward | Ethanol;Nicotinamide adenine dinucleotide | C00469;C00003 | Acetaldehyde;H+;Nicotinamide adenine dinucleotide - reduced | C00084;C00080;C00004 | b0356 | P25437 | P25437 | | 13.9 | 7 | ethanol | Acinetobacter calcoaceticus | | | | | brenda | | 4.0 |
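The condition filter described above can be pictured with a small sketch. This is an illustration only, not WILDkCAT's implementation: the helper names are hypothetical, treating the range bounds as inclusive is an assumption, and how the pipeline scores entries with missing temperature or pH is not covered here.

```python
def in_range(value, bounds):
    """True if value is known and lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return value is not None and low <= value <= high


def conditions_match(temperature, ph,
                     temperature_range=(20, 40), pH_range=(6.5, 7.5)):
    """True if both measured conditions fall inside the requested ranges."""
    return in_range(temperature, temperature_range) and in_range(ph, pH_range)
```

With the ranges used in this tutorial, the PFK measurement (30.0 °C, pH 7.2) satisfies both conditions, which is in line with its matching score of 1.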

View the generated report


3 — Predict missing kcat values using machine learning

3.1 - Prepare input file for CataPro

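Prepare the input file for CataPro by selecting the entries for which no experimental kcat was retrieved in the previous step, or whose match is too distant according to the limit score (limit_matching_score). In the examples on this page a lower matching score indicates a closer match: with a limit of 6, PFK (score 1) keeps its BRENDA value while ALCD2x (score 7) is sent to CataPro. The resulting file is used to predict the missing kcat values with machine learning.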

The function generates the files named catapro_input.csv and catapro_input_substrates_to_smiles.tsv in the subfolder machine_learning.

Note

The file catapro_input_substrates_to_smiles.tsv that maps substrate names to their corresponding SMILES will be used to match back the predicted kcat values to the original kcat entries after running CataPro.

from wildkcat import run_prediction_part1

run_prediction_part1(
    output_folder="output",
    limit_matching_score=6
    )

The output file catapro_input.csv is formatted according to the requirements of CataPro, meaning it can be directly used as input for kcat prediction.
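The selection rule can be sketched as follows. This is a hypothetical helper, not the package's code. Judging from the example rows in this tutorial (score 1 kept, score 7 replaced with a limit of 6), it assumes that a lower matching score is better and that an entry is sent to CataPro when its score is strictly above the limit; whether the real comparison is strict is an assumption.

```python
def needs_prediction(row, limit_matching_score=6):
    """Decide whether a kcat_retrieved.tsv row should be predicted.

    `row` is a dict of strings, e.g. as produced by csv.DictReader.
    Assumption: lower matching_score means a closer match.
    """
    kcat = (row.get("kcat") or "").strip()
    score = (row.get("matching_score") or "").strip()
    if not kcat or not score:
        return True  # nothing was retrieved for this entry
    return float(score) > limit_matching_score
```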

Note

Before running predictions, make sure you have installed CataPro by following the installation instructions provided in their GitHub repository.

Once installed, you can run CataPro with the following command:

python predict.py \
        -inp_fpath output/machine_learning/catapro_input.csv \
        -model_dpath models \
        -batch_size 64 \
        -device cuda:0 \
        -out_fpath output/machine_learning/catapro_output.csv
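If you prefer to launch CataPro from a Python script, the same command can be assembled and run with subprocess. This is a sketch under assumptions: the helper names are hypothetical, only the flags shown in the command above are used, and catapro_dir is assumed to be the folder that contains predict.py.

```python
import subprocess
import sys


def build_catapro_command(input_csv, output_csv, model_dir="models",
                          batch_size=64, device="cuda:0"):
    """Assemble the predict.py call shown above as an argument list."""
    return [
        sys.executable, "predict.py",
        "-inp_fpath", str(input_csv),
        "-model_dpath", str(model_dir),
        "-batch_size", str(batch_size),
        "-device", device,
        "-out_fpath", str(output_csv),
    ]


def run_catapro(command, catapro_dir):
    """Run the command from catapro_dir; raises CalledProcessError on failure."""
    subprocess.run(command, cwd=catapro_dir, check=True)
```

Because run_catapro changes the working directory, passing absolute paths for the input and output files avoids surprises with relative paths.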

View the generated report

3.2 - Integrate CataPro predictions

After running CataPro with the prepared input file, integrate the predicted kcat values back into the original kcat entries. The function matches the predicted values to the original entries using the substrate names and SMILES mapping file generated in the previous step.

from wildkcat import run_prediction_part2

run_prediction_part2(
    output_folder="output", 
    catapro_predictions_path="output/machine_learning/catapro_output.csv", 
    limit_matching_score=6
    )

Example of the output file kcat_full.tsv:

| rxn | rxn_kegg | ec_code | direction | substrates_name | substrates_kegg | products_name | products_kegg | genes | uniprot | catalytic_enzyme | warning | kcat | db | matching_score | kcat_substrate | kcat_organism | kcat_enzyme | kcat_temperature | kcat_ph | kcat_variant | kcat_id_percent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PFK | | 2.7.1.11 | forward | ATP C10H12N5O13P3;D-Fructose 6-phosphate | C00002;C05345 | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | C00008;C00354;C00080 | b3916 | P0A796 | P0A796 | | 0.016 | brenda | 1 | D-fructose 6-phosphate | Escherichia coli | P0A796 | 30.0 | 7.2 | | 100.0 |
| ALCD2x | R00754 | 1.1.1.71 | forward | Ethanol;Nicotinamide adenine dinucleotide | C00469;C00003 | Acetaldehyde;H+;Nicotinamide adenine dinucleotide - reduced | C00084;C00080;C00004 | b0356 | P25437 | P25437 | | 16.0905 | catapro | | | | | | | | |
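A quick check of where the final values come from can be done by counting the db column. The sketch below assumes kcat_full.tsv is tab-separated with the header shown above; count_kcat_sources is a hypothetical helper, and rows with an empty db field are reported under the label "none".

```python
import csv
from collections import Counter


def count_kcat_sources(path):
    """Count rows of a kcat_full.tsv-style file per value of the db column."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter((row.get("db") or "").strip() or "none"
                       for row in reader)
```

For the two example rows above, the result would contain one brenda entry and one catapro entry.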

4 — Generate summary report

The final output file kcat_full.tsv contains both experimentally retrieved and machine learning predicted kcat values for each combination of reaction, enzyme, and substrate(s) in the E. coli core model. This file can be used for integration into enzyme-constrained metabolic models.
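As a starting point for such an integration, the values can be grouped per reaction and direction. This is a sketch only: kcats_by_reaction is a hypothetical helper, the file is assumed to be tab-separated with the rxn, direction, and kcat columns shown earlier, and how to combine several values for one reaction (for example across enzymes) is left to the modelling framework you use.

```python
import csv
from collections import defaultdict


def kcats_by_reaction(path):
    """Group kcat values by (rxn, direction); rows without a kcat are skipped."""
    grouped = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            value = (row.get("kcat") or "").strip()
            if value:
                grouped[(row["rxn"], row["direction"])].append(float(value))
    return dict(grouped)
```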

The result can be visualized and summarized using the function generate_summary_report:

from wildkcat.visualization import generate_summary_report

generate_summary_report(
    model_path="model/e_coli_core.json", 
    output_folder="output"
    )

View the generated report