Example on E. coli Core Model

This page shows a step-by-step example of the WILDkCAT pipeline on the E. coli core model.

Info

The parameter output_folder specifies the directory where all files generated by the pipeline will be stored.
This design keeps all results in a single location: each step adds its files to the folder as it runs.


Prerequisites (cf. installation instructions)

mkdir model
curl -o model/e_coli_core.json http://bigg.ucsd.edu/static/models/e_coli_core.json

Your working directory should contain the following folders:

  • venv/ - Folder containing the Python virtual environment
  • model/ - Folder containing the E. coli core model (e_coli_core.json)
  • (Optional) CataPro/ - Folder containing the CataPro repository
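
Before starting, it can help to confirm that the layout above is in place. The following is a minimal sketch and not part of WILDkCAT; the helper name `missing_entries` is hypothetical, and the optional CataPro/ folder is deliberately not checked.

```python
from pathlib import Path


def missing_entries(root="."):
    """Return the required paths from the list above that do not exist under root."""
    root = Path(root)
    # Required entries taken from the tutorial; CataPro/ is optional and skipped.
    required = ["venv", "model/e_coli_core.json"]
    return [entry for entry in required if not (root / entry).exists()]


# An empty list means the working directory is ready for step 1.
print(missing_entries("."))
```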

Note

All the files used and created in this tutorial are available in the output folder of the WILDkCAT repository.


1 — Extract kcat values from E. coli core model

Time: ~3-5 min

First, create a TSV file that lists every combination of reaction, enzyme, and substrate(s) in the model.

Each row corresponds to a unique combination of reaction, enzyme, and substrate(s) and will be used to retrieve experimental kcat values from BRENDA and SABIO-RK in the next step. The output file is named kcat.tsv and is saved in the specified output folder.

Python:

from wildkcat import run_extraction

run_extraction(
    model_path="model/e_coli_core.json",
    output_folder="output"
)

CLI:

wildkcat extraction model/e_coli_core.json output

Example of the output file kcat.tsv:

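Columns are listed as rows here for readability; the file itself has one row per entry.

column | PFK | GLUt2r
rxn | PFK | GLUt2r
rxn_kegg | (empty) | (empty)
ec_code | 2.7.1.11 | (empty)
direction | forward | forward
substrates_name | ATP C10H12N5O13P3;D-Fructose 6-phosphate | L-Glutamate;H+
substrates_kegg | C00002;C05345 | C00025;C00080
products_name | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | L-Glutamate;H+
products_kegg | C00008;C00354;C00080 | C00025;C00080
genes | b3916 | b4077
uniprot | P0A796 | P21345
catalytic_enzyme | P0A796 | P21345
warning_ec | (empty) | missing
warning_enz | (empty) | (empty)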

View the generated report
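
To inspect kcat.tsv programmatically, the file can be read as ordinary tab-separated text. The sketch below uses only the standard library and an in-memory sample with a reduced set of columns. The sample rows are abbreviated from the example above, and placing the value "missing" under warning_ec is an assumption about how the flattened example maps onto the columns.

```python
import csv
import io


def load_rows(handle):
    """Read a tab-separated file into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Abbreviated stand-in for output/kcat.tsv (only four of the columns).
sample = (
    "rxn\tec_code\tuniprot\twarning_ec\n"
    "PFK\t2.7.1.11\tP0A796\t\n"
    "GLUt2r\t\tP21345\tmissing\n"
)

rows = load_rows(io.StringIO(sample))
# Reactions carrying an EC-code warning are the ones to double-check by hand.
flagged = [row["rxn"] for row in rows if row["warning_ec"]]
print(flagged)
```

For the real file, replace the in-memory sample with `open("output/kcat.tsv", newline="")`.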


2 — Retrieve experimental kcat values from BRENDA and/or SABIO-RK

Time: ~7-10 min

This function searches the BRENDA and/or SABIO-RK databases for experimentally measured turnover numbers (kcat values) that match the entries listed in kcat.tsv. The retrieved values are filtered by organism, temperature, and pH, and the closest-matching kcat values are saved to kcat_retrieved.tsv.

Python:

from wildkcat import run_retrieval

run_retrieval(
    output_folder="output",
    organism="Escherichia coli",
    temperature_range=(20, 45),
    pH_range=(7, 8),
    database="both"
)

CLI:

wildkcat retrieval output 'Escherichia coli' 20 45 7 8
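
To make the idea of a "closest match" concrete, here is a toy sketch that ranks candidate measurements by how many of the requested conditions they miss. It is not WILDkCAT's actual scoring (the penalty_score column is computed by the package itself), and the candidate records, the field names, and the helper `mismatch_count` are all hypothetical.

```python
def mismatch_count(entry, organism, temperature_range, ph_range):
    """Count how many requested conditions a candidate measurement fails."""
    t_low, t_high = temperature_range
    ph_low, ph_high = ph_range
    misses = 0
    if entry["organism"] != organism:
        misses += 1
    if not t_low <= entry["temperature"] <= t_high:
        misses += 1
    if not ph_low <= entry["ph"] <= ph_high:
        misses += 1
    return misses


# Made-up candidates for one reaction; only the first mirrors the PFK example.
candidates = [
    {"kcat": 88.0, "organism": "Escherichia coli", "temperature": 30.0, "ph": 7.2},
    {"kcat": 12.0, "organism": "Homo sapiens", "temperature": 37.0, "ph": 7.4},
    {"kcat": 50.0, "organism": "Escherichia coli", "temperature": 60.0, "ph": 7.0},
]

# Keep the candidate that violates the fewest conditions.
best = min(
    candidates,
    key=lambda e: mismatch_count(e, "Escherichia coli", (20, 45), (7, 8)),
)
print(best["kcat"])
```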

Example of the output file kcat_retrieved.tsv:

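Columns are listed as rows here for readability; the file itself has one row per entry.

column | PFK | GLUt2r
rxn | PFK | GLUt2r
rxn_kegg | (empty) | (empty)
ec_code | 2.7.1.11 | (empty)
ec_codes | 2.7.1.11 | (empty)
direction | forward | forward
substrates_name | ATP C10H12N5O13P3;D-Fructose 6-phosphate | L-Glutamate;H+
substrates_kegg | C00002;C05345 | C00025;C00080
products_name | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | L-Glutamate;H+
products_kegg | C00008;C00354;C00080 | C00025;C00080
genes | b3916 | b4077
uniprot | P0A796 | P21345
catalytic_enzyme | P0A796 | P21345
warning_ec | (empty) | missing
warning_enz | (empty) | (empty)
kcat | 88.0 | (empty)
db | brenda | (empty)
penalty_score | 1 | 16
kcat_substrate | fructose 6-phosphate | (empty)
kcat_organism | Escherichia coli | (empty)
kcat_enzyme | P0A796 | (empty)
kcat_temperature | 30.0 | (empty)
kcat_ph | 7.2 | (empty)
kcat_variant | (empty) | (empty)
kcat_id_percent | 100.0 | (empty)
kcat_organism_score | 0 | (empty)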

View the generated report

Note

BRENDA requires a user account to access its data, whereas SABIO-RK is openly accessible and does not require registration. To use only SABIO-RK, set the parameter database='sabio_rk' in the function or pass the flag --database sabio_rk in the CLI command.
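
When the CLI is driven from a script, building the command as an argument list avoids quoting problems with the organism name. The sketch below only assembles and prints the command; the helper `retrieval_command` is hypothetical, and appending --database after the positional arguments is an assumption about the CLI's argument order.

```python
import shlex


def retrieval_command(output_folder, organism, temperature_range, ph_range,
                      database=None):
    """Assemble the `wildkcat retrieval` call as a list of arguments."""
    args = [
        "wildkcat", "retrieval", output_folder, organism,
        str(temperature_range[0]), str(temperature_range[1]),
        str(ph_range[0]), str(ph_range[1]),
    ]
    if database is not None:
        # Flag name taken from the note above; its position is assumed.
        args += ["--database", database]
    return args


cmd = retrieval_command("output", "Escherichia coli", (20, 45), (7, 8),
                        database="sabio_rk")
# shlex.join adds the shell quoting needed for the space in the organism name.
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` without going through a shell.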


3 — (Optional) Predict missing kcat values using machine learning

3.1 - Prepare input file for CataPro

Time: ~3-5 min

Prepare the input file for CataPro by filtering out the kcat entries that were not found in the previous step and below a limit score (limit_penalty_score). The resulting file will be used to predict missing kcat values using machine learning.

The function generates the files named catapro_input.csv and catapro_input_substrates_to_smiles.tsv in the subfolder machine_learning.

Note

The file catapro_input_substrates_to_smiles.tsv maps substrate names to their corresponding SMILES. After running CataPro, it is used to match the predicted kcat values back to the original kcat entries.

Python:

from wildkcat import run_prediction_part1

run_prediction_part1(
    output_folder="output",
    limit_penalty_score=9
)

CLI:

wildkcat prediction-part1 output 9

The output file catapro_input.csv is formatted according to the requirements of CataPro, meaning it can be directly used as input for kcat prediction.

You can run CataPro with the following command:

python CataPro/inference/predict.py \
        -inp_fpath output/machine_learning/catapro_input.csv \
        -model_dpath CataPro/models \
        -batch_size 64 \
        -device cuda:0 \
        -out_fpath output/machine_learning/catapro_output.csv

View the generated report

3.2 - Integrate CataPro predictions

Time: ~2-5 sec

After running CataPro with the prepared input file, integrate the predicted kcat values back into the original kcat entries. The function matches the predicted values to the original entries using the substrate names and SMILES mapping file generated in the previous step.

Python:

from wildkcat import run_prediction_part2

run_prediction_part2(
    output_folder="output",
    catapro_predictions_path="output/machine_learning/catapro_output.csv",
    limit_penalty_score=9
)

CLI:

wildkcat prediction-part2 output output/machine_learning/catapro_output.csv 9
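
The matching step can be pictured as a two-stage lookup: substrate name to SMILES, then (enzyme, SMILES) to the predicted value. The sketch below illustrates that idea with plain dictionaries. It is not the package's implementation, and the field names, the helper `attach_predictions`, and the layout of the prediction table are assumptions; the kcat value is the one shown for GLUt2r in this tutorial.

```python
def attach_predictions(entries, name_to_smiles, predictions):
    """Fill in kcat for each entry whose (enzyme, SMILES) pair has a prediction."""
    merged = []
    for entry in entries:
        smiles = name_to_smiles.get(entry["substrate"])
        kcat = predictions.get((entry["uniprot"], smiles))
        merged.append({
            **entry,
            "kcat": kcat,
            # Mark the source only when a prediction was actually found.
            "db": "catapro" if kcat is not None else None,
        })
    return merged


# Hypothetical stand-ins for the mapping file and the CataPro output.
name_to_smiles = {"L-Glutamate": "C(CC(=O)O)C(C(=O)O)N"}
predictions = {("P21345", "C(CC(=O)O)C(C(=O)O)N"): 23.3748}
entries = [
    {"rxn": "GLUt2r", "uniprot": "P21345", "substrate": "L-Glutamate"},
    {"rxn": "GLUt2r", "uniprot": "P21345", "substrate": "H+"},
]

merged = attach_predictions(entries, name_to_smiles, predictions)
print(merged[0]["kcat"], merged[0]["db"])
```

Entries without a SMILES mapping or without a prediction keep an empty kcat, so they remain distinguishable from predicted ones.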

Example of the output file kcat_full.tsv:

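Columns are listed as rows here for readability; the file itself has one row per entry.

column | PFK | GLUt2r
rxn | PFK | GLUt2r
rxn_kegg | (empty) | (empty)
ec_code | 2.7.1.11 | (empty)
ec_codes | 2.7.1.11 | (empty)
direction | forward | forward
substrates_name | ATP C10H12N5O13P3;D-Fructose 6-phosphate | L-Glutamate;H+
substrates_kegg | C00002;C05345 | C00025;C00080
products_name | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | L-Glutamate;H+
products_kegg | C00008;C00354;C00080 | C00025;C00080
genes | b3916 | b4077
uniprot | P0A796 | P21345
catalytic_enzyme | P0A796 | P21345
warning_ec | (empty) | missing
warning_enz | (empty) | (empty)
kcat | 88.0 | 23.3748
db | brenda | catapro
penalty_score | 1 | (empty)
kcat_substrate | fructose 6-phosphate | (empty)
kcat_organism | Escherichia coli | (empty)
kcat_enzyme | P0A796 | (empty)
kcat_temperature | 30.0 | (empty)
kcat_ph | 7.2 | (empty)
kcat_variant | (empty) | (empty)
kcat_id_percent | 100.0 | (empty)
kcat_organism_score | 0.0 | (empty)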

4 — Generate summary report

Time: ~2-5 sec

The output files kcat_retrieved.tsv (containing only values retrieved from databases) or kcat_full.tsv (including both retrieved and predicted values) can be used for integration into enzyme-constrained metabolic models.
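
Before integrating the values, a quick count of where each kcat came from can be useful. The sketch below picks kcat_full.tsv when it exists and falls back to kcat_retrieved.tsv, then tallies the db column. It assumes both files sit directly in the output folder; the helper names are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def pick_kcat_file(output_folder):
    """Prefer the file with predictions; fall back to the retrieved-only file."""
    folder = Path(output_folder)
    full = folder / "kcat_full.tsv"
    return full if full.exists() else folder / "kcat_retrieved.tsv"


def count_by_source(path):
    """Tally entries per value of the db column; empty cells count as 'none'."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["db"] or "none" for row in reader)
```

For example, `count_by_source(pick_kcat_file("output"))` returns a Counter keyed by source name, with entries that have no source grouped under "none".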

The results can be visualized and summarized using the function generate_summary_report:

Python:

from wildkcat import generate_summary_report

generate_summary_report(
    model_path="model/e_coli_core.json",
    output_folder="output"
)

CLI:

wildkcat report model/e_coli_core.json output

View the generated report