Example on E. coli Core Model

This page shows a step-by-step example of the WILDkCAT pipeline on the E. coli core model.

Info

The output_folder parameter specifies the directory where every file generated by the pipeline is stored.
This keeps all results in a single location: each step adds its files to the same folder as it runs.
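Because every step writes into the same folder, the folder's contents show how far the pipeline has progressed. The following is a minimal sketch, not part of WILDkCAT: the helper name pipeline_progress is hypothetical, and the file list uses the file names mentioned further down this page, assuming they are written directly under output_folder (with the CataPro files in the machine_learning subfolder).

```python
from pathlib import Path

# File names as they appear later on this page. Their exact location inside
# output_folder is an assumption of this sketch.
EXPECTED_FILES = [
    "kcat.tsv",
    "kcat_retrieved.tsv",
    "machine_learning/catapro_input.csv",
    "machine_learning/catapro_input_substrates_to_smiles.tsv",
    "kcat_full.tsv",
]


def pipeline_progress(output_folder):
    """Return {relative_path: exists} for each expected pipeline file."""
    root = Path(output_folder)
    return {name: (root / name).is_file() for name in EXPECTED_FILES}


# Print one line per expected file for the folder used in this tutorial.
for name, present in pipeline_progress("output").items():
    print(f"{'found  ' if present else 'missing'}  {name}")
```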

Prerequisites

Install WILDkCAT from PyPI
Install CataPro to predict kcat values using machine learning
Download the E. coli core model

Note

All files used and created in this tutorial are available in the output folder of the WILDkCAT repository.

1 — Extract kcat values from E. coli core model

First, create a TSV file that lists every combination of reaction, enzyme, and substrate(s) in the model.

Each row corresponds to one unique combination and is used in the next step to retrieve experimental kcat values from BRENDA and SABIO-RK. The output file is named kcat.tsv and is saved in the specified output folder.

from wildkcat import run_extraction

run_extraction(
    model_path="model/e_coli_core.json",
    output_folder="output"
)

Example of the output file kcat.tsv:

rxn	rxn_kegg	ec_code	direction	substrates_name	substrates_kegg	products_name	products_kegg	genes	uniprot	catalytic_enzyme	warning
PFK		2.7.1.11	forward	ATP C10H12N5O13P3;D-Fructose 6-phosphate	C00002;C05345	ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+	C00008;C00354;C00080	b3916	P0A796	P0A796
ALCD2x	R00754	1.1.1.71	forward	Ethanol;Nicotinamide adenine dinucleotide	C00469;C00003	Acetaldehyde;H+;Nicotinamide adenine dinucleotide - reduced	C00084;C00080;C00004	b0356	P25437	P25437
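The table is tab-separated, and cells such as substrates_kegg hold several values joined by semicolons. As a minimal sketch using only the Python standard library (the helper names read_kcat_rows and split_field are hypothetical, and the in-memory sample keeps only a subset of the columns shown above), the file could be inspected like this:

```python
import csv
import io


def read_kcat_rows(handle):
    """Parse a tab-separated kcat table into a list of dicts keyed by column."""
    return list(csv.DictReader(handle, delimiter="\t"))


def split_field(value):
    """Split a semicolon-separated cell into a list; empty cells give []."""
    return [item.strip() for item in (value or "").split(";") if item.strip()]


# In-memory sample with a subset of the columns from the example above.
sample = io.StringIO(
    "rxn\tec_code\tsubstrates_kegg\tuniprot\n"
    "PFK\t2.7.1.11\tC00002;C05345\tP0A796\n"
    "ALCD2x\t1.1.1.71\tC00469;C00003\tP25437\n"
)
rows = read_kcat_rows(sample)
for row in rows:
    print(row["rxn"], row["ec_code"], split_field(row["substrates_kegg"]))
```

To read the real file instead, open it with open("output/kcat.tsv", newline="") and pass the handle to read_kcat_rows. Rows that end before the last column (for example when the warning cell is empty) are handled because split_field treats a missing cell as empty.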

View the generated report

2 — Retrieve experimental kcat values from BRENDA and SABIO-RK

This function searches the BRENDA and/or SABIO-RK databases for experimentally measured turnover numbers (kcat values) matching each entry in the kcat.tsv file produced in the previous step. The retrieved values are filtered based on organism, temperature, and pH conditions, and the closest matching kcat value for each entry is saved to the output file.

from wildkcat import run_retrieval

run_retrieval(
    output_folder="output",
    organism="Escherichia coli",
    temperature_range=(20, 40),
    pH_range=(6.5, 7.5),
    database="both"
)

Example of the output file kcat_retrieved.tsv:

rxn	rxn_kegg	ec_code	direction	substrates_name	substrates_kegg	products_name	products_kegg	genes	uniprot	catalytic_enzyme	warning	kcat	matching_score	kcat_substrate	kcat_organism	kcat_enzyme	kcat_temperature	kcat_ph	kcat_variant	kcat_db	kcat_id_percent	kcat_organism_score
PFK		2.7.1.11	forward	ATP C10H12N5O13P3;D-Fructose 6-phosphate	C00002;C05345	ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+	C00008;C00354;C00080	b3916	P0A796	P0A796		0.016	1	D-fructose 6-phosphate	Escherichia coli	P0A796	30.0	7.2		brenda	100.0	0.0
ALCD2x	R00754	1.1.1.71	forward	Ethanol;Nicotinamide adenine dinucleotide	C00469;C00003	Acetaldehyde;H+;Nicotinamide adenine dinucleotide - reduced	C00084;C00080;C00004	b0356	P25437	P25437		13.9	7	ethanol	Acinetobacter calcoaceticus					brenda		4.0
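A quick way to judge the retrieval step is to count how many entries received a kcat and how the matching scores are distributed. The sketch below is an illustration rather than a WILDkCAT function: summarize_retrieval is a hypothetical helper, and it relies only on the kcat and matching_score columns shown in the example above.

```python
import csv
import io
from collections import Counter


def summarize_retrieval(handle):
    """Return (total rows, rows with a kcat, Counter of matching scores)."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    # A row counts as "found" when its kcat cell is present and non-empty.
    found = [r for r in rows if (r.get("kcat") or "").strip()]
    scores = Counter((r.get("matching_score") or "").strip() for r in found)
    return len(rows), len(found), scores


# In-memory sample with a subset of the columns from the example above.
sample = io.StringIO(
    "rxn\tkcat\tmatching_score\tkcat_organism\tkcat_db\n"
    "PFK\t0.016\t1\tEscherichia coli\tbrenda\n"
    "ALCD2x\t13.9\t7\tAcinetobacter calcoaceticus\tbrenda\n"
)
total, n_found, score_counts = summarize_retrieval(sample)
print(f"{n_found} of {total} entries have a retrieved kcat")
for score, count in sorted(score_counts.items()):
    print(f"matching_score {score}: {count}")
```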

View the generated report

3 — Predict missing kcat values using machine learning

3.1 - Prepare input file for CataPro

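Prepare the input file for CataPro by selecting the kcat entries for which no experimental value was found in the previous step, or whose matching score is worse than the limit set by limit_matching_score. In this example, ALCD2x has a matching score of 7, which exceeds the limit of 6, so it is passed to CataPro. The resulting file is used to predict the missing kcat values with machine learning.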

The function generates the files named catapro_input.csv and catapro_input_substrates_to_smiles.tsv in the subfolder machine_learning.

Note

The file catapro_input_substrates_to_smiles.tsv that maps substrate names to their corresponding SMILES will be used to match back the predicted kcat values to the original kcat entries after running CataPro.

from wildkcat import run_prediction_part1

run_prediction_part1(
    output_folder="output",
    limit_matching_score=6
)

The output file catapro_input.csv is formatted according to the requirements of CataPro, meaning it can be directly used as input for kcat prediction.
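To make the role of limit_matching_score concrete, here is a minimal sketch of the selection rule as it can be inferred from the example tables on this page: PFK (score 1) keeps its BRENDA value, while ALCD2x (score 7) ends up with a CataPro prediction when the limit is 6. The function needs_prediction is hypothetical, and the strict "greater than" comparison is an assumption; the actual WILDkCAT implementation may treat a score equal to the limit differently.

```python
def needs_prediction(kcat, matching_score, limit_matching_score):
    """Assumed rule: predict when no kcat was retrieved or the score is too high.

    Lower matching scores indicate closer matches in the examples on this page.
    """
    # No experimental value at all: always send the entry to the predictor.
    if kcat is None or str(kcat).strip() == "":
        return True
    # Assumption: scores strictly above the limit are considered too distant.
    return float(matching_score) > limit_matching_score


# Values taken from the kcat_retrieved.tsv example, with the limit used above.
print("PFK   ", needs_prediction("0.016", "1", 6))
print("ALCD2x", needs_prediction("13.9", "7", 6))
```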

Note

Before running predictions, make sure CataPro is installed by following the installation instructions in its GitHub repository.

Once installed, you can run CataPro with the following command:

python predict.py \
    -inp_fpath output/machine_learning/catapro_input.csv \
    -model_dpath models \
    -batch_size 64 \
    -device cuda:0 \
    -out_fpath output/machine_learning/catapro_output.csv
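When the pipeline is driven from a Python script, the same command can be assembled as an argument list and handed to subprocess. This is a sketch under assumptions: build_catapro_command is a hypothetical helper, the flags are copied from the command above, and the example paths follow the file names used by run_prediction_part1 and run_prediction_part2 on this page. Adjust them to wherever your CataPro checkout and files actually live.

```python
import shlex


def build_catapro_command(input_csv, output_csv, model_dir="models",
                          batch_size=64, device="cuda:0"):
    """Build the CataPro predict.py call as a list of arguments."""
    return [
        "python", "predict.py",
        "-inp_fpath", str(input_csv),
        "-model_dpath", str(model_dir),
        "-batch_size", str(batch_size),
        "-device", str(device),
        "-out_fpath", str(output_csv),
    ]


cmd = build_catapro_command(
    "output/machine_learning/catapro_input.csv",
    "output/machine_learning/catapro_output.csv",
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
# To execute it from the directory that contains predict.py:
#     import subprocess
#     subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids shell quoting problems when a path contains spaces.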

View the generated report

3.2 - Integrate CataPro predictions

After running CataPro with the prepared input file, integrate the predicted kcat values back into the original kcat entries. The function matches the predicted values to the original entries using the substrate names and SMILES mapping file generated in the previous step.

from wildkcat import run_prediction_part2

run_prediction_part2(
    output_folder="output",
    catapro_predictions_path="output/machine_learning/catapro_output.csv",
    limit_matching_score=6
)

Example of the output file kcat_full.tsv:

rxn	rxn_kegg	ec_code	direction	substrates_name	substrates_kegg	products_name	products_kegg	genes	uniprot	catalytic_enzyme	warning	kcat	db	matching_score	kcat_substrate	kcat_organism	kcat_enzyme	kcat_temperature	kcat_ph	kcat_variant	kcat_id_percent
PFK		2.7.1.11	forward	ATP C10H12N5O13P3; D-Fructose 6-phosphate	C00002; C05345	ADP C10H12N5O10P2; D-Fructose 1,6-bisphosphate; H+	C00008; C00354; C00080	b3916	P0A796	P0A796		0.016	brenda	1	D-fructose 6-phosphate	Escherichia coli	P0A796	30.0	7.2		100.0
ALCD2x	R00754	1.1.1.71	forward	Ethanol;Nicotinamide adenine dinucleotide	C00469;C00003	Acetaldehyde;H+;Nicotinamide adenine dinucleotide - reduced	C00084;C00080;C00004	b0356	P25437	P25437		16.0905	catapro
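Downstream tools typically need the kcat values keyed by reaction rather than as a flat table. The sketch below shows one possible way to build such a lookup with the standard library. The helper kcat_lookup and the choice of key (rxn, direction, catalytic_enzyme) are assumptions made for illustration; the column names come from the kcat_full.tsv example above, and the sample keeps only a subset of them.

```python
import csv
import io


def kcat_lookup(handle):
    """Map (rxn, direction, catalytic_enzyme) to (kcat as float, source db)."""
    lookup = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        value = (row.get("kcat") or "").strip()
        if not value:
            # Skip entries that have neither a retrieved nor a predicted kcat.
            continue
        key = (row["rxn"], row["direction"], row["catalytic_enzyme"])
        lookup[key] = (float(value), (row.get("db") or "").strip())
    return lookup


# In-memory sample with a subset of the columns from the example above.
sample = io.StringIO(
    "rxn\tdirection\tcatalytic_enzyme\tkcat\tdb\n"
    "PFK\tforward\tP0A796\t0.016\tbrenda\n"
    "ALCD2x\tforward\tP25437\t16.0905\tcatapro\n"
)
table = kcat_lookup(sample)
for (rxn, direction, enzyme), (kcat, source) in table.items():
    print(rxn, direction, enzyme, kcat, source)
```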

4 — Generate summary report

The final output file kcat_full.tsv contains both experimentally retrieved and machine-learning-predicted kcat values for each combination of reaction, enzyme, and substrate(s) in the E. coli core model. This file can be used to build enzyme-constrained metabolic models.

The result can be visualized and summarized using the function generate_summary_report:

from wildkcat.visualization import generate_summary_report

generate_summary_report(
    model_path="model/e_coli_core.json",
    output_folder="output"
)

View the generated report