Example on E. coli Core Model
This page shows a step-by-step example of the WILDkCAT pipeline on the E. coli core model.
Info
The parameter output_folder specifies the directory where all files generated by the pipeline will be stored.
The purpose of this design is to centralize all results in a single location, with files being added progressively as each step is executed.
Prerequisites (cf. installation instructions)
- Install WILDkCAT from PyPI
- (Optional) Install CataPro to predict kcat values using machine learning
- Download the E. coli core model:
mkdir model
curl -o model/e_coli_core.json http://bigg.ucsd.edu/static/models/e_coli_core.json
Your working directory should contain the following folders:
- venv/ - Folder containing the Python virtual environment
- model/ - Folder containing the E. coli core model (e_coli_core.json)
- (Optional) CataPro/ - Folder containing the CataPro repository
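Once the download has finished, a quick sanity check can confirm that the model file is valid JSON and not an error page. This is a minimal sketch, assuming the file follows the BiGG JSON layout with top-level `reactions`, `metabolites`, and `genes` lists; `summarize_model` is a hypothetical helper and not part of WILDkCAT.

```python
import json
from pathlib import Path


def summarize_model(model_path):
    """Return the number of reactions, metabolites, and genes in a BiGG-style JSON model."""
    with open(model_path, encoding="utf-8") as handle:
        model = json.load(handle)
    # A missing key is counted as zero instead of raising an error
    return {key: len(model.get(key, [])) for key in ("reactions", "metabolites", "genes")}


model_file = Path("model/e_coli_core.json")
if model_file.exists():
    print(summarize_model(model_file))
else:
    print(f"{model_file} not found; run the download step first")
```

If the file is truncated or is an HTML error page, `json.load` raises an exception, which signals that the download should be repeated.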
Note
All the files used and created in this tutorial are available in the output folder of the WILDkCAT repository.
1 — Extract kcat values from E. coli core model
Time: ~3-5 min
First, create a TSV file that lists every combination of reaction, enzyme, and substrate(s) in the model.
Each row corresponds to one unique combination and is used in the next step to retrieve experimental kcat values from BRENDA and SABIO-RK.
The output file is named kcat.tsv and is saved in the specified output folder.
from wildkcat import run_extraction
run_extraction(
    model_path="model/e_coli_core.json",
    output_folder="output"
)
wildkcat extraction model/e_coli_core.json output
Example of the output file kcat.tsv:
| rxn | rxn_kegg | ec_code | direction | substrates_name | substrates_kegg | products_name | products_kegg | genes | uniprot | catalytic_enzyme | warning_ec | warning_enz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PFK | | 2.7.1.11 | forward | ATP C10H12N5O13P3;D-Fructose 6-phosphate | C00002;C05345 | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | C00008;C00354;C00080 | b3916 | P0A796 | P0A796 | | |
| GLUt2r | | | forward | L-Glutamate;H+ | C00025;C00080 | L-Glutamate;H+ | C00025;C00080 | b4077 | P21345 | P21345 | missing | |
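Rows without an EC code, such as GLUt2r above, are worth listing before moving on. The following is a minimal sketch that reads the file with the standard library, assuming it is tab-separated with the `rxn` and `ec_code` columns shown in the example and that it sits at `output/kcat.tsv`; `rows_missing_ec` is a hypothetical helper, not a WILDkCAT function.

```python
import csv
from pathlib import Path


def rows_missing_ec(tsv_path):
    """Return the reaction IDs of rows whose ec_code column is empty."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["rxn"] for row in reader if not (row.get("ec_code") or "").strip()]


kcat_file = Path("output/kcat.tsv")  # assumed location: the output folder used above
if kcat_file.exists():
    missing = rows_missing_ec(kcat_file)
    print(f"{len(missing)} rows without an EC code")
```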
2 — Retrieve experimental kcat values from BRENDA and/or SABIO-RK
Time: ~7-10 min
The retrieval step searches the BRENDA and/or SABIO-RK databases for experimentally measured turnover numbers (kcat values) matching each entry of the file produced in the previous step. The retrieved values are filtered by organism, temperature, and pH, and the closest matching kcat value for each entry is saved to the output file.
from wildkcat import run_retrieval
run_retrieval(
    output_folder="output",
    organism="Escherichia coli",
    temperature_range=(20, 45),
    pH_range=(7, 8),
    database="both"
)
wildkcat retrieval output 'Escherichia coli' 20 45 7 8
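To picture what filtering by organism, temperature, and pH means, here is a minimal sketch. It is not WILDkCAT's actual implementation: the pipeline reports a penalty_score and keeps the closest match, so its logic is likely more graded than this hard cut-off. The `Measurement` record and the `matches_conditions` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical record for one database entry (not a WILDkCAT type)."""
    kcat: float
    organism: str
    temperature: float
    ph: float


def matches_conditions(entry, organism, temperature_range, ph_range):
    """Return True if the entry satisfies all three conditions (bounds inclusive)."""
    t_min, t_max = temperature_range
    ph_min, ph_max = ph_range
    return (
        entry.organism == organism
        and t_min <= entry.temperature <= t_max
        and ph_min <= entry.ph <= ph_max
    )


entries = [
    # Values taken from the PFK row of the example output
    Measurement(kcat=88.0, organism="Escherichia coli", temperature=30.0, ph=7.2),
    # Made-up entry, included only to show a rejected measurement (temperature too high)
    Measurement(kcat=12.0, organism="Escherichia coli", temperature=55.0, ph=7.5),
]
kept = [e for e in entries if matches_conditions(e, "Escherichia coli", (20, 45), (7, 8))]
print([e.kcat for e in kept])  # [88.0]
```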
Example of the output file kcat_retrieved.tsv:
| rxn | rxn_kegg | ec_code | ec_codes | direction | substrates_name | substrates_kegg | products_name | products_kegg | genes | uniprot | catalytic_enzyme | warning_ec | warning_enz | kcat | db | penalty_score | kcat_substrate | kcat_organism | kcat_enzyme | kcat_temperature | kcat_ph | kcat_variant | kcat_id_percent | kcat_organism_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PFK | | 2.7.1.11 | 2.7.1.11 | forward | ATP C10H12N5O13P3;D-Fructose 6-phosphate | C00002;C05345 | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | C00008;C00354;C00080 | b3916 | P0A796 | P0A796 | | | 88.0 | brenda | 1 | fructose 6-phosphate | Escherichia coli | P0A796 | 30.0 | 7.2 | | 100.0 | 0 |
| GLUt2r | | | | forward | L-Glutamate;H+ | C00025;C00080 | L-Glutamate;H+ | C00025;C00080 | b4077 | P21345 | P21345 | missing | | | | 16 | | | | | | | | |
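A quick way to see how much of the model was covered is to count the rows per source database. This is a minimal sketch, assuming the file is tab-separated, uses the `db` column shown above, and is written to `output/kcat_retrieved.tsv`; `count_by_source` is a hypothetical helper, not part of WILDkCAT.

```python
import csv
from collections import Counter
from pathlib import Path


def count_by_source(tsv_path):
    """Count rows per value of the db column; rows without a source are counted as 'none'."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter((row.get("db") or "").strip() or "none" for row in reader)


retrieved = Path("output/kcat_retrieved.tsv")  # assumed location in the output folder
if retrieved.exists():
    print(count_by_source(retrieved))
```

Rows counted under `none` are those for which no experimental value was stored, which makes them candidates for the optional prediction step.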
Note
BRENDA requires a user account to access its data, whereas SABIO-RK is openly accessible and does not require registration. To use only SABIO-RK, set the parameter database='sabio_rk' in the function or pass the flag --database sabio_rk in the CLI command.
3 — (Optional) Predict missing kcat values using machine learning
3.1 - Prepare input file for CataPro
Time: ~3-5 min
Prepare the input file for CataPro by selecting the kcat entries that need a prediction: those for which no value was found in the previous step and those excluded by the penalty-score limit (limit_penalty_score). The resulting file is used to predict the missing kcat values with machine learning.
The function generates the files named catapro_input.csv and catapro_input_substrates_to_smiles.tsv in the subfolder machine_learning.
Note
The file catapro_input_substrates_to_smiles.tsv, which maps substrate names to their corresponding SMILES, is used to match the predicted kcat values back to the original kcat entries after running CataPro.
from wildkcat import run_prediction_part1
run_prediction_part1(
    output_folder="output",
    limit_penalty_score=9
)
wildkcat prediction-part1 output 9
The output file catapro_input.csv is formatted according to the requirements of CataPro, meaning it can be directly used as input for kcat prediction.
You can run CataPro with the following command:
python CataPro/inference/predict.py \
    -inp_fpath output/machine_learning/catapro_input.csv \
    -model_dpath CataPro/models \
    -batch_size 64 \
    -device cuda:0 \
    -out_fpath output/machine_learning/catapro_output.csv
3.2 - Integrate CataPro predictions
Time: ~2-5 sec
After running CataPro with the prepared input file, integrate the predicted kcat values back into the original kcat entries. The function matches the predicted values to the original entries using the substrate names and SMILES mapping file generated in the previous step.
from wildkcat import run_prediction_part2
run_prediction_part2(
    output_folder="output",
    catapro_predictions_path="output/machine_learning/catapro_output.csv",
    limit_penalty_score=9
)
wildkcat prediction-part2 output output/machine_learning/catapro_output.csv 9
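The matching step can be pictured as a two-stage lookup: substrate name to SMILES, then SMILES to predicted value. The following is a minimal sketch with in-memory dictionaries, under the assumption that predictions are keyed by enzyme and SMILES. The function name and data layout are illustrative and do not reflect the actual file columns or WILDkCAT's internal logic.

```python
def match_predictions(entries, name_to_smiles, predictions):
    """Attach a predicted kcat to each (enzyme, substrate name) entry when one is available.

    entries: list of (uniprot_id, substrate_name) tuples
    name_to_smiles: dict mapping substrate name -> SMILES
    predictions: dict mapping (uniprot_id, SMILES) -> predicted kcat
    """
    matched = {}
    for uniprot_id, substrate in entries:
        smiles = name_to_smiles.get(substrate)
        # Entries without a SMILES or without a prediction stay unmatched (None)
        matched[(uniprot_id, substrate)] = predictions.get((uniprot_id, smiles))
    return matched


# Illustrative data only: which substrate CataPro is actually run on is an assumption here
name_to_smiles = {"L-Glutamate": "NC(CCC(=O)O)C(=O)O"}
predictions = {("P21345", "NC(CCC(=O)O)C(=O)O"): 23.3748}
entries = [("P21345", "L-Glutamate"), ("P21345", "H+")]
print(match_predictions(entries, name_to_smiles, predictions))
```

Keeping the name-to-SMILES mapping in a separate file means the CataPro input only has to carry SMILES strings, while the original substrate names can still be recovered afterwards.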
Example of the output file kcat_full.tsv:
| rxn | rxn_kegg | ec_code | ec_codes | direction | substrates_name | substrates_kegg | products_name | products_kegg | genes | uniprot | catalytic_enzyme | warning_ec | warning_enz | kcat | db | penalty_score | kcat_substrate | kcat_organism | kcat_enzyme | kcat_temperature | kcat_ph | kcat_variant | kcat_id_percent | kcat_organism_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PFK | | 2.7.1.11 | 2.7.1.11 | forward | ATP C10H12N5O13P3;D-Fructose 6-phosphate | C00002;C05345 | ADP C10H12N5O10P2;D-Fructose 1,6-bisphosphate;H+ | C00008;C00354;C00080 | b3916 | P0A796 | P0A796 | | | 88.0 | brenda | 1 | fructose 6-phosphate | Escherichia coli | P0A796 | 30.0 | 7.2 | | 100.0 | 0.0 |
| GLUt2r | | | | forward | L-Glutamate;H+ | C00025;C00080 | L-Glutamate;H+ | C00025;C00080 | b4077 | P21345 | P21345 | missing | | 23.3748 | catapro | | | | | | | | | |
4 — Generate summary report
Time: ~2-5 sec
The output files kcat_retrieved.tsv (containing only values retrieved from databases) or kcat_full.tsv (including both retrieved and predicted values) can be used for integration into enzyme-constrained metabolic models.
The result can be visualized and summarized using the function generate_summary_report:
from wildkcat import generate_summary_report
generate_summary_report(
    model_path="model/e_coli_core.json",
    output_folder="output"
)
wildkcat report model/e_coli_core.json output
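For the integration into enzyme-constrained models mentioned at the start of this step, the final table can first be collapsed into a lookup keyed by reaction and direction. This is a minimal sketch, assuming a tab-separated file with the `rxn`, `direction`, and `kcat` columns shown in the examples, and assuming that several rows may share a reaction-direction pair. Taking the maximum per pair is an arbitrary illustrative choice, and `load_kcat_lookup` is a hypothetical helper, not a WILDkCAT function.

```python
import csv
from pathlib import Path


def load_kcat_lookup(tsv_path):
    """Map (rxn, direction) to the largest kcat listed for that pair, skipping empty cells."""
    lookup = {}
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            raw = (row.get("kcat") or "").strip()
            if not raw:
                continue  # no retrieved or predicted value for this row
            key = (row["rxn"], row["direction"])
            lookup[key] = max(float(raw), lookup.get(key, float("-inf")))
    return lookup


full_table = Path("output/kcat_full.tsv")  # assumed location in the output folder
if full_table.exists():
    kcats = load_kcat_lookup(full_table)
    print(f"{len(kcats)} reaction-direction pairs with a kcat value")
```

The same helper can be pointed at kcat_retrieved.tsv when the prediction step was skipped. Check the units of the kcat column against those expected by the modeling framework before using the values.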