
Main functions

The WILDkCAT package is organized into modules:

  1. Extraction: extracts kcat values from the provided model.

  2. Retrieval: retrieves kcat values from curated databases (BRENDA and SABIO-RK).

  3. Prediction: predicts missing and low-confidence kcat values using the ML-based CataPro model.

  4. Summary: generates an HTML report summarizing the percentage and quality of kcat values identified for the model, along with their data sources.

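The four modules are typically run in order. A hypothetical end-to-end driver is sketched below; the paths, organism, ranges, and thresholds are placeholder values, not package defaults, and the imports are deferred inside the function so the sketch can be defined without WILDkCAT installed:

```python
# Hypothetical end-to-end driver for the WILDkCAT pipeline. All argument
# values below are placeholders; adapt them to your model and organism.
def run_pipeline(model_path="model.xml", output_folder="results"):
    from wildkcat.processing.extract_kcat import run_extraction
    from wildkcat.processing.retrieve_kcat import run_retrieval
    from wildkcat.processing.predict_kcat import run_prediction_part1, run_prediction_part2
    from wildkcat.processing.summary import generate_summary_report

    # 1. Extract kcat-related data from the model
    run_extraction(model_path, output_folder)
    # 2. Retrieve values from BRENDA and SABIO-RK
    run_retrieval(output_folder,
                  organism="Escherichia coli",
                  temperature_range=(25, 40),
                  pH_range=(6.0, 8.0))
    # 3. Prepare CataPro input, run CataPro externally, then integrate predictions
    run_prediction_part1(output_folder, limit_penalty_score=5)
    run_prediction_part2(output_folder, "catapro_predictions.csv",
                         limit_penalty_score=5)
    # 4. Summarize everything in one HTML report
    generate_summary_report(model_path, output_folder)
```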

Extraction

wildkcat.processing.extract_kcat.run_extraction(model_path, output_folder, report=True)

Extracts kcat-related data from a metabolic model and generates output files and an optional HTML report.

Parameters:

  model_path (str, required): Path to the metabolic model file (JSON, MATLAB, or SBML format).
  output_folder (str, required): Path to the output folder where all the results will be saved.
  report (bool, optional): Whether to generate an HTML report. Default: True.
Source code in wildkcat/processing/extract_kcat.py
def run_extraction(model_path: str, 
                   output_folder: str, 
                   report: bool = True) -> None:
    """
    Extracts kcat-related data from a metabolic model and generates output files and an optional HTML report.

    Parameters:
        model_path (str): Path to the metabolic model file (JSON, MATLAB, or SBML format).
        output_folder (str): Path to the output folder where all the results will be saved.
        report (bool, optional): Whether to generate an HTML report (default: True).
    """
    # Initialize output folder
    os.makedirs(output_folder, exist_ok=True)

    # Initialize logging
    os.makedirs(os.path.join(output_folder, "logs"), exist_ok=True)
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"logs/extract_{timestamp}.log"
    logging.getLogger().addFilter(DedupFilter())
    logging.basicConfig(filename=os.path.join(output_folder, filename), encoding='utf-8', level=logging.INFO)

    # Run extraction
    model = read_model(model_path)
    df, report_statistics = create_kcat_output(model)

    # Save output
    output_path = os.path.join(output_folder, "kcat.tsv")
    df.to_csv(output_path, sep='\t', index=False)
    logging.info(f"Output saved to '{output_path}'")

    if report:
        report_extraction(model, df, report_statistics, output_folder)
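
The DedupFilter attached to the root logger above is defined elsewhere in the package. A minimal sketch of such a message-deduplicating filter (an assumption about its behavior, not WILDkCAT's actual implementation) is:

```python
import logging

class DedupFilter(logging.Filter):
    """Hypothetical filter: suppress records whose message was already emitted."""
    def __init__(self):
        super().__init__()
        self._seen = set()

    def filter(self, record):
        msg = record.getMessage()
        if msg in self._seen:
            return False  # drop duplicate message
        self._seen.add(msg)
        return True       # first occurrence passes through

f = DedupFilter()
first = f.filter(logging.LogRecord("x", logging.INFO, "", 0, "hello", None, None))
second = f.filter(logging.LogRecord("x", logging.INFO, "", 0, "hello", None, None))
```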

Retrieval

wildkcat.processing.retrieve_kcat.run_retrieval(output_folder, organism, temperature_range, pH_range, database='both', report=True)

Retrieves closest kcat values from specified databases for entries in a kcat file, applies filtering criteria, and saves the results to an output file.

Parameters:

  output_folder (str, required): Path to the output folder where the results will be saved.
  organism (str, required): Organism scientific name (e.g. "Escherichia coli", "Homo sapiens").
  temperature_range (tuple, required): Acceptable temperature range for filtering (min, max).
  pH_range (tuple, required): Acceptable pH range for filtering (min, max).
  database (str, optional): Which database(s) to query for kcat values: 'both', 'brenda', or 'sabio_rk'. Default: 'both'.
  report (bool, optional): Whether to generate an HTML report from the retrieved data. Default: True.
Source code in wildkcat/processing/retrieve_kcat.py
def run_retrieval(output_folder: str,
                  organism: str,
                  temperature_range: tuple,
                  pH_range: tuple,
                  database: str = 'both',
                  report: bool = True) -> None:
    """
    Retrieves closest kcat values from specified databases for entries in a kcat file, applies filtering criteria, 
    and saves the results to an output file.

    Parameters:
        output_folder (str): Path to the output folder where the results will be saved.
        organism (str): Organism scientific name (e.g. "Escherichia coli", "Homo sapiens").
        temperature_range (tuple): Acceptable temperature range for filtering (min, max).
        pH_range (tuple): Acceptable pH range for filtering (min, max).
        database (str, optional): Specifies which database(s) to query for kcat values. 
            Options are 'both' (default), 'brenda', or 'sabio_rk'.
        report (bool, optional): Whether to generate an HTML report using the retrieved data (default: True).        
    """
    # Load environment variables
    load_dotenv()

    # Create a dict with the general criteria
    general_criteria = {
        "Organism": organism,
        "Temperature": temperature_range,
        "pH": pH_range
    }

    # Read the kcat file
    if not os.path.exists(output_folder):
        raise FileNotFoundError(f"The specified output folder '{output_folder}' does not exist.")

    # Initialize logging
    os.makedirs(os.path.join(output_folder, "logs"), exist_ok=True)
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"logs/retrieval_{timestamp}.log"
    logging.getLogger().addFilter(DedupFilter())
    logging.basicConfig(filename=os.path.join(output_folder, filename), encoding='utf-8', level=logging.INFO)

    kcat_file_path = os.path.join(output_folder, "kcat.tsv")
    if not os.path.isfile(kcat_file_path):
        raise FileNotFoundError(f"The specified file '{kcat_file_path}' does not exist in the output folder. Please run the function 'run_extraction()' first.")

    cached_df = load_cached_progress(output_folder)
    if cached_df is not None:
        kcat_df = cached_df
        unprocessed_indices = kcat_df.index[kcat_df['processed'] == False]
        start_index = unprocessed_indices.min() if len(unprocessed_indices) > 0 else len(kcat_df)
    else:
        kcat_df = pd.read_csv(kcat_file_path, sep='\t')
        start_index = 0

        # Initialize new columns
        for col in ['kcat', 'penalty_score', 'kcat_substrate', 'kcat_organism',
                    'kcat_enzyme', 'kcat_temperature', 'kcat_ph', 'kcat_variant',
                    'kcat_db', 'kcat_id_percent', 'kcat_organism_score']:
            if col not in kcat_df.columns:
                kcat_df[col] = None

        # Initialize 'processed' column
        kcat_df['processed'] = False

    # Retrieve kcat values from databases
    request_count = 0
    for row in tqdm(kcat_df.itertuples(), total=len(kcat_df), desc="Retrieving kcat values"):

        if row.Index < start_index:
            continue  

        kcat_dict = row._asdict()

        # Extract kcat and penalty score
        best_match, penalty_score = extract_kcat(kcat_dict, general_criteria, database=database)
        kcat_df.loc[row.Index, 'penalty_score'] = penalty_score

        request_count += 1
        if request_count % 300 == 0:
            time.sleep(10)

        if best_match is not None:
            # Assign results to the main dataframe
            kcat_df.loc[row.Index, 'kcat'] = best_match['adj_kcat']
            kcat_df.loc[row.Index, 'kcat_substrate'] = best_match['Substrate']
            kcat_df.loc[row.Index, 'kcat_organism'] = best_match['Organism']
            kcat_df.loc[row.Index, 'kcat_enzyme'] = best_match['UniProtKB_AC']
            kcat_df.loc[row.Index, 'kcat_temperature'] = best_match['adj_temp']
            kcat_df.loc[row.Index, 'kcat_ph'] = best_match['pH']
            kcat_df.loc[row.Index, 'kcat_variant'] = best_match['EnzymeVariant']
            kcat_df.loc[row.Index, 'kcat_db'] = best_match['db']
            kcat_df.loc[row.Index, 'kcat_id_percent'] = best_match['id_perc']
            kcat_df.loc[row.Index, 'kcat_organism_score'] = best_match['organism_score']

        # Mark the line as processed 
        kcat_df.loc[row.Index, 'processed'] = True
        # Save partial results every 200 rows 
        if row.Index % 200 == 0 and row.Index > 0:
            save_partial_results(kcat_df, output_folder)

    # Save final 
    save_partial_results(kcat_df, output_folder)

    # Remove 'processed' column before final save
    if 'processed' in kcat_df.columns:
        kcat_df.drop(columns=['processed'], inplace=True)
    kcat_df = merge_ec(kcat_df)

    # TODO: Remove it later
    # cache_dir = os.path.join(output_folder, "cache_retrieval")
    # if os.path.exists(cache_dir):
    #     shutil.rmtree(cache_dir)
    #     logging.info("Cache folder removed after successful completion.")

    # Format the df
    kcat_df['penalty_score'] = (
        pd.to_numeric(kcat_df['penalty_score'], errors='coerce')
        .round()
        .astype('Int64')
        )

    output_path = os.path.join(output_folder, "kcat_retrieved.tsv")
    kcat_df.to_csv(output_path, sep='\t', index=False)
    logging.info(f"Output saved to '{output_path}'")

    if report:
        general_criteria.update({
            'database': database
        }) 

        report_retrieval(kcat_df, output_folder, general_criteria)
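
load_cached_progress and save_partial_results are internal helpers. The resume pattern they support, a 'processed' flag persisted to disk so an interrupted run restarts at the first unprocessed row, can be sketched with the standard library alone (file layout and column names here are illustrative, not the package's actual cache format):

```python
import csv, os, tempfile

def save_partial(rows, path):
    # Write the current state, including the 'processed' flag, so a later
    # run can resume from the first unprocessed row.
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

def resume_index(rows):
    # First row still marked unprocessed, or len(rows) if everything is done.
    for i, row in enumerate(rows):
        if row["processed"] == "False":
            return i
    return len(rows)

rows = [{"rxn": "R1", "processed": "True"},
        {"rxn": "R2", "processed": "False"}]
tmp = os.path.join(tempfile.mkdtemp(), "kcat_partial.tsv")
save_partial(rows, tmp)
with open(tmp, newline="") as fh:
    loaded = list(csv.DictReader(fh, delimiter="\t"))
start = resume_index(loaded)
```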

Prediction

wildkcat.processing.predict_kcat.run_prediction_part1(output_folder, limit_penalty_score, report=True)

Processes the kcat data file to generate input files for CataPro prediction. Optionally produces a summary report of the processed data.

Parameters:

  output_folder (str, required): Path to the output folder where the results will be saved.
  limit_penalty_score (int, required): Threshold for filtering entries based on matching score.
  report (bool, optional): Whether to generate a report from the retrieved data. Default: True.
Source code in wildkcat/processing/predict_kcat.py
def run_prediction_part1(output_folder: str,
                         limit_penalty_score: int, 
                         report: bool = True) -> None:
    """
    Processes kcat data file to generate input files for CataPro prediction.
    Optionally, it can produce a summary report of the processed data.

    Parameters:
        output_folder (str): Path to the output folder where the results will be saved.
        limit_penalty_score (int): Threshold for filtering entries based on matching score.
        report (bool, optional): Whether to generate a report using the retrieved data (default: True). 
    """
    # Initialize logging
    os.makedirs(os.path.join(output_folder, "logs"), exist_ok=True)
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"logs/prediction1_{timestamp}.log"
    logging.getLogger().addFilter(DedupFilter())
    logging.basicConfig(filename=os.path.join(output_folder, filename), encoding='utf-8', level=logging.INFO)

    # Run prediction part 1
    # Read the kcat file
    if not os.path.exists(output_folder):
        raise FileNotFoundError(f"The specified output folder '{output_folder}' does not exist.")

    kcat_file_path = os.path.join(output_folder, "kcat_retrieved.tsv")
    if not os.path.isfile(kcat_file_path):
        raise FileNotFoundError(f"The specified file '{kcat_file_path}' does not exist in the output folder. Please run the function 'run_retrieval()' first.")

    kcat_df = pd.read_csv(kcat_file_path, sep='\t')

    # Subset rows with no values or matching score above the limit
    kcat_df = kcat_df[(kcat_df['penalty_score'] >= limit_penalty_score) | (kcat_df['penalty_score'].isnull())]
    # Drop rows with no UniProt ID or no substrates_kegg
    before_enzyme_filter = len(kcat_df)
    kcat_df = kcat_df[kcat_df['uniprot'].notnull() & kcat_df['substrates_kegg'].notnull()]
    nb_missing_enzymes = before_enzyme_filter - len(kcat_df)

    # Generate CataPro input file
    catapro_input_df, substrates_to_smiles_df, report_statistics = create_catapro_input_file(kcat_df)

    # Save the CataPro input file and substrates to SMILES mapping
    os.makedirs(os.path.join(output_folder, "machine_learning"), exist_ok=True)
    output_path = os.path.join(output_folder, "machine_learning/catapro_input.csv")
    catapro_input_df.to_csv(output_path, sep=',', index=True)
    substrates_to_smiles_df.to_csv(output_path.replace('.csv', '_substrates_to_smiles.tsv'), sep='\t', index=False)
    logging.info(f"Output saved to '{output_path}'")

    # Add statistics 
    report_statistics["missing_enzymes"] = nb_missing_enzymes

    if report:
        report_prediction_input(catapro_input_df, report_statistics, output_folder)

wildkcat.processing.predict_kcat.run_prediction_part2(output_folder, catapro_predictions_path, limit_penalty_score)

Runs the second part of the kcat prediction pipeline by integrating CataPro predictions, mapping substrates to SMILES, and formatting the output.

Parameters:

  output_folder (str, required): Path to the output folder where the results will be saved.
  catapro_predictions_path (str, required): Path to the CataPro predictions CSV file.
  limit_penalty_score (int, required): Threshold for taking predictions over retrieved values.
Source code in wildkcat/processing/predict_kcat.py
def run_prediction_part2(output_folder: str,
                         catapro_predictions_path: str,
                         limit_penalty_score: int) -> None:
    """
    Runs the second part of the kcat prediction pipeline by integrating CataPro predictions,
    mapping substrates to SMILES, and formatting the output.

    Parameters:
        output_folder (str): Path to the output folder where the results will be saved.
        catapro_predictions_path (str): Path to the CataPro predictions CSV file.
        limit_penalty_score (float): Threshold for taking predictions over retrieved values.
    """ 
    # Initialize logging
    os.makedirs(os.path.join(output_folder, "logs"), exist_ok=True)
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"logs/prediction2_{timestamp}.log"
    logging.getLogger().addFilter(DedupFilter())
    logging.basicConfig(filename=os.path.join(output_folder, filename), encoding='utf-8', level=logging.INFO)

    # Run prediction part 2
    if not os.path.exists(output_folder):
        raise FileNotFoundError(f"The specified output folder '{output_folder}' does not exist.")
    kcat_file_path = os.path.join(output_folder, "kcat_retrieved.tsv")
    if not os.path.isfile(kcat_file_path):
        raise FileNotFoundError(f"The specified file '{kcat_file_path}' does not exist in the output folder. Please run the function 'run_retrieval()' first.")
    kcat_df = pd.read_csv(kcat_file_path, sep='\t')
    substrates_to_smiles_path = os.path.join(output_folder, "machine_learning/catapro_input_substrates_to_smiles.tsv")
    substrates_to_smiles = pd.read_csv(substrates_to_smiles_path, sep='\t')
    catapro_predictions_df = pd.read_csv(catapro_predictions_path, sep=',')
    kcat_df = integrate_catapro_predictions(kcat_df, 
                                            substrates_to_smiles,
                                            catapro_predictions_df
                                            )

    # Save the output as a TSV file
    kcat_df = format_output(kcat_df, limit_penalty_score)
    output_path = os.path.join(output_folder, "kcat_full.tsv")
    kcat_df.to_csv(output_path, sep='\t', index=False)
    logging.info(f"Output saved to '{output_path}'")
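
integrate_catapro_predictions and format_output are internal helpers. One plausible merge rule, inferred from the documented role of limit_penalty_score (this is an assumption, not the package's actual logic), is to keep a retrieved value only when it exists and its penalty score is strictly below the threshold:

```python
def merge_prediction(retrieved_kcat, penalty_score, predicted_kcat, limit):
    # Hypothetical merge rule: fall back to the CataPro prediction when no
    # value was retrieved or when the retrieval penalty is at/above the limit.
    if retrieved_kcat is None or penalty_score is None or penalty_score >= limit:
        return predicted_kcat, "catapro"
    return retrieved_kcat, "database"

value, source = merge_prediction(12.5, 3, 8.0, limit=5)       # good retrieval kept
value2, source2 = merge_prediction(None, None, 8.0, limit=5)  # falls back to prediction
```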

Summary report

wildkcat.processing.summary.generate_summary_report(model_path, output_folder)

Generates an HTML report summarizing the kcat extraction, retrieval, and prediction for a given model.

Parameters:

  model_path (str, required): Path to the metabolic model file (JSON, MATLAB, or SBML format).
  output_folder (str, required): Path to the output folder where the kcat file is located.
Source code in wildkcat/processing/summary.py
def generate_summary_report(model_path: str,
                            output_folder: str) -> None:
    """
    Generate an HTML report summarizing the kcat extraction, retrieval and prediction for a given model. 

    Parameters:
        model_path (str): Path to the metabolic model file (JSON, MATLAB, or SBML format).
        output_folder (str): Path to the output folder where the kcat file is located.
    """
    # Read the kcat file
    if not os.path.exists(output_folder):
        raise FileNotFoundError(f"The specified output folder '{output_folder}' does not exist.")

    kcat_full_file_path = os.path.join(output_folder, "kcat_full.tsv")
    kcat_retrieve_file_path = os.path.join(output_folder, "kcat_retrieved.tsv")
    if os.path.isfile(kcat_full_file_path):
        kcat_df = pd.read_csv(kcat_full_file_path, sep='\t')
        model = read_model(model_path)
        report_final(model, kcat_df, output_folder)
    elif os.path.isfile(kcat_retrieve_file_path):
        logging.warning(f"The file 'kcat_full.tsv' is not present in the folder '{output_folder}'; the general report will be generated without predicted values.")
        model = read_model(model_path)
        kcat_df = pd.read_csv(kcat_retrieve_file_path, sep='\t')
        report_final(model, kcat_df, output_folder)
    else: 
        raise FileNotFoundError(f"The specified folder '{output_folder}' does not contain the files 'kcat_full.tsv' or 'kcat_retrieved.tsv'. Please run at least the retrieval step.")

Matching process and scoring

The matching process is designed to select the most appropriate kcat value when multiple candidates are available.
Each candidate is first assigned a score based on several criteria, such as:

  • kcat specific criteria:
    • Substrate
    • Catalytic enzyme(s)
  • General criteria:
    • Organism
    • Temperature
    • pH

If two or more candidates receive the same score, tie-breaking rules are applied in the following order:

  1. Enzyme sequence identity – the value associated with the most similar protein sequence is preferred.
  2. Organism proximity – preference is given to kcat values measured in organisms closest to the target species.
  3. Maximal kcat value – if ambiguity remains, the largest kcat value is chosen.
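
A simplified, self-contained sketch of the additive penalty (lower is better), using the per-criterion penalties from compute_score below (enzyme 0/3, organism 0/2 and only checked when the enzyme differs, variant 0/1, pH 0/1/2, substrate 0/4) and omitting the temperature check for brevity:

```python
# Toy version of the additive penalty score; not the package's compute_score,
# which also handles temperature and Arrhenius adjustment.
def toy_score(candidate, target):
    score = 0
    if candidate["enzyme"] != target["enzyme"]:
        score += 3
        if candidate["organism"] != target["organism"]:
            score += 2  # organism is only penalized when the enzyme differs
    if candidate["variant"] != "wildtype":
        score += 1
    ph_min, ph_max = target["pH_range"]
    if candidate["pH"] is None:
        score += 1  # unknown pH: mild penalty
    elif not (ph_min <= candidate["pH"] <= ph_max):
        score += 2  # out-of-range pH: stronger penalty
    if candidate["substrate"] != target["substrate"]:
        score += 4
    return score

target = {"enzyme": "P12345", "organism": "Escherichia coli",
          "pH_range": (6.0, 8.0), "substrate": "ATP"}
perfect = {"enzyme": "P12345", "organism": "Escherichia coli",
           "variant": "wildtype", "pH": 7.0, "substrate": "ATP"}
worst = {"enzyme": "Q99999", "organism": "Homo sapiens",
         "variant": None, "pH": 2.0, "substrate": "GTP"}
```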

wildkcat.utils.matching

check_catalytic_enzyme(candidate, kcat_dict)

Checks whether the enzyme in a candidate entry matches the model's enzyme. Identifies the catalytic enzyme using the UniProt API.

Source code in wildkcat/utils/matching.py
def check_catalytic_enzyme(candidate, kcat_dict): 
    """
    Checks whether the enzyme in a candidate entry matches the model's enzyme.
    Identifies the catalytic enzyme using UniProt API.
    """
    if pd.notna(kcat_dict['catalytic_enzyme']):
        catalytic_enzymes = str(kcat_dict['catalytic_enzyme']).split(";")
        if candidate["UniProtKB_AC"] in catalytic_enzymes:
            return 0
    return 3

check_organism(candidate, general_criteria)

Checks whether the organism in a candidate entry matches the expected organism.

Source code in wildkcat/utils/matching.py
def check_organism(candidate, general_criteria): 
    """
    Checks whether the organism in a candidate entry matches the expected organism.
    """
    if candidate["Organism"] == general_criteria["Organism"]:
        return 0
    return 2

check_pH(candidate, general_criteria)

Checks whether the pH in a candidate entry matches the expected pH.

Source code in wildkcat/utils/matching.py
def check_pH(candidate, general_criteria):
    """
    Checks whether the pH in a candidate entry matches the expected pH.
    """
    ph_min, ph_max = general_criteria["pH"]
    candidate_ph = candidate.get("pH", None)
    if pd.isna(candidate_ph):  # Missing pH (checked first so None cannot raise)
        return 1
    elif ph_min <= candidate_ph <= ph_max:
        return 0
    else:  # Out of range
        return 2

check_substrate(entry, kcat_dict=None, candidate=None)

Checks whether the substrate in a candidate entry matches the model's substrates.

Source code in wildkcat/utils/matching.py
def check_substrate(entry, kcat_dict=None, candidate=None):
    """
    Checks whether the substrate in a candidate entry matches the model's substrates.
    """
    api = entry.get("db", candidate.get("db") if candidate else None)

    # Normalize names
    entry_subs = entry.get("Substrate", "")
    entry_prods = entry.get("Product", "")
    entry_kegg = entry.get("KeggReactionID")

    cand_subs = candidate.get("Substrate", "") if candidate else ""
    cand_prods = candidate.get("Product", "") if candidate else ""
    cand_kegg = candidate.get("KeggReactionID") if candidate else ""

    model_subs = (kcat_dict or {}).get("substrates_name", "")
    model_prods = (kcat_dict or {}).get("products_name", "")
    model_kegg = (kcat_dict or {}).get("rxn_kegg")

    if api == "sabio_rk":

        entry_kegg = None if pd.isna(entry_kegg) else entry_kegg
        model_kegg = None if pd.isna(model_kegg) else model_kegg
        cand_kegg  = None if pd.isna(cand_kegg) else cand_kegg

        if model_kegg and entry_kegg and _norm_name(model_kegg) == _norm_name(entry_kegg):
            if _any_intersection(entry_subs, model_subs) or _any_intersection(entry_prods, model_prods):
                return 0
        if cand_kegg and entry_kegg and _norm_name(cand_kegg) == _norm_name(entry_kegg):
            if _any_intersection(entry_subs, cand_subs) or _any_intersection(entry_prods, cand_prods):
                return 0
        base_subs = model_subs or cand_subs
        if _any_intersection(entry_subs, base_subs):
            return 0
        return 4

    elif api == "brenda":
        base_subs = model_subs or cand_subs
        if _any_intersection(entry_subs, base_subs):
            return 0
        return 4

    return 4
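
_norm_name and _any_intersection are used above but defined elsewhere in wildkcat.utils.matching. Plausible stdlib sketches (assumptions about their behavior, not the actual implementations) are:

```python
import re

def _norm_name(name):
    # Hypothetical normalization: lowercase and strip non-alphanumerics,
    # so "D-Glucose" and "d glucose" compare equal.
    return re.sub(r"[^a-z0-9]+", "", str(name).lower())

def _any_intersection(names_a, names_b):
    # Hypothetical check: do two ';'-separated name lists share any
    # normalized entry?
    set_a = {_norm_name(n) for n in str(names_a).split(";") if n.strip()}
    set_b = {_norm_name(n) for n in str(names_b).split(";") if n.strip()}
    return bool(set_a & set_b)

shared = _any_intersection("ATP; D-Glucose", "d-glucose")
disjoint = _any_intersection("ATP", "GTP")
```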

check_temperature(candidate, general_criteria, api_output, min_r2=0.8, expected_range=(50000, 150000))

Checks whether the temperature in a candidate entry matches the expected temperature. If the temperature falls outside the specified range, checks whether the Arrhenius equation can be applied.

Source code in wildkcat/utils/matching.py
def check_temperature(candidate, general_criteria, api_output, min_r2=0.8, expected_range=(50000, 150000)): 
    """
    Checks whether the temperature in a candidate entry matches the expected temperature.
    If the temperature falls outside the specified range, checks whether the Arrhenius equation can be applied.
    """

    temp_min, temp_max = general_criteria["Temperature"]
    candidate_temp = candidate.get("Temperature")

    # Within range (a missing temperature is handled after the Arrhenius attempt)
    if pd.notna(candidate_temp) and temp_min <= candidate_temp <= temp_max:
        return 0, False

    # Try to correct the kcat value using the Arrhenius equation
    ph_min, ph_max = general_criteria["pH"]

    # Base filters
    filters = (
        api_output["pH"].between(ph_min, ph_max)
        & (api_output["UniProtKB_AC"] == candidate["UniProtKB_AC"])
        & api_output["Temperature"].notna()
        & api_output["value"].notna()
    )

    valid_idx = api_output.apply(
        lambda row: check_substrate(row.to_dict(), None, candidate) == 0,
        axis=1
        )

    filters = filters & valid_idx

    temps_dispo = api_output.loc[filters, "Temperature"].nunique()
    api_filtered = api_output.loc[filters, ["Temperature", "value"]].copy()

    # Convert temperatures to Kelvin
    api_filtered["Temperature"] = api_filtered["Temperature"] + 273.15

    if temps_dispo >= 2:
        ea, r2 = calculate_ea(api_filtered)
        if r2 >= min_r2 and ea > 0:
            if not (expected_range[0] <= ea <= expected_range[1]):
                logging.warning(f"{candidate.get('ECNumber')}: Estimated Ea ({ea:.0f} J/mol) is outside the expected range {expected_range} J/mol.")
            # Go Arrhenius
            return 0, True

    if pd.isna(candidate_temp):
        return 1, False

    else:
        return 2, False
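
calculate_ea and arrhenius_equation are package internals. The underlying math is the Arrhenius relation ln k = ln A - Ea/(R*T): fitting ln(kcat) against 1/T gives -Ea/R as the slope, and a kcat measured at one temperature can then be rescaled to the target temperature. A self-contained sketch (not the package's implementation):

```python
import math

R = 8.314  # gas constant, J/(mol*K)

def fit_ea(points):
    # Least-squares slope of ln(k) vs 1/T; the slope is -Ea/R.
    xs = [1.0 / t for t, _ in points]
    ys = [math.log(k) for _, k in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return -slope * R

def adjust_kcat(k_measured, t_measured, t_target, ea):
    # k2 = k1 * exp(-Ea/R * (1/T2 - 1/T1)), temperatures in Kelvin
    return k_measured * math.exp(-ea / R * (1.0 / t_target - 1.0 / t_measured))

# Synthetic data generated with Ea = 60 kJ/mol, so the fit should recover it.
ea_true = 60000.0
points = [(t, 5.0 * math.exp(-ea_true / (R * t))) for t in (293.15, 303.15, 313.15)]
ea_fit = fit_ea(points)
k_adj = adjust_kcat(points[0][1], 293.15, 310.15, ea_fit)  # rescale to 37 C
```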

check_variant(candidate)

Checks whether the enzyme variant in a candidate entry is wildtype or unknown.

Source code in wildkcat/utils/matching.py
def check_variant(candidate):
    """
    Checks whether the enzyme variant in a candidate entry is wildtype or unknown.
    """
    if candidate["EnzymeVariant"] == "wildtype":
        return 0
    else:  # Unknown
        return 1

compute_score(kcat_dict, candidate, general_criteria, api_output)

Compute a score for the candidate based on the Kcat dictionary and general criteria.

Source code in wildkcat/utils/matching.py
def compute_score(kcat_dict, candidate, general_criteria, api_output):
    """
    Compute a score for the candidate based on the Kcat dictionary and general criteria.
    """
    score = 0
    # Check catalytic enzyme
    score += check_catalytic_enzyme(candidate, kcat_dict) # + 0 or 3
    # Check organism
    if score != 0: 
        score += check_organism(candidate, general_criteria) # + 0 or 2  
    # Check variant
    score += check_variant(candidate) # + 0, 1
    # Check pH
    score += check_pH(candidate, general_criteria) # + 0, 1 or 2
    # Check substrate 
    score += check_substrate(candidate, kcat_dict) # + 0 or 4
    # Check temperature 
    temperature_penalty, arrhenius = check_temperature(candidate, general_criteria, api_output) # + 0, 1 or 2
    score += temperature_penalty
    return score, arrhenius

find_best_match(kcat_dict, api_output, general_criteria)

Finds the best matching enzyme entry from the provided API output based on
  • Kcat specific criteria:
    • Substrate
    • Catalytic enzyme(s)
  • General criteria :
    • Organism
    • Temperature
    • pH

This function filters out mutant enzyme variants, orders the remaining entries based on enzyme and organism similarity, and iteratively computes a score for each candidate to identify the best match. If a candidate requires an Arrhenius adjustment, the kcat value is recalculated accordingly.

Parameters:

  kcat_dict (dict, required): Dictionary containing enzyme information.
  api_output (pd.DataFrame, required): DataFrame containing kcat entries and metadata from an API.
  general_criteria (dict, required): Dictionary specifying matching criteria.

Returns:

  tuple (Tuple[float, Optional[Dict[str, Any]]]):
    best_score (float): The lowest score found, representing the best match.
    best_candidate (dict or None): Dictionary of the best matching candidate's data, or None if no match is found.
Source code in wildkcat/utils/matching.py
def find_best_match(kcat_dict, api_output, general_criteria) -> Tuple[float, Optional[Dict[str, Any]]]:
    """
    Finds the best matching enzyme entry from the provided API output based on: 
        - Kcat specific criteria: 
            * Substrate 
            * Catalytic enzyme(s)
        - General criteria : 
            * Organism
            * Temperature
            * pH

    This function filters out mutant enzyme variants, orders the remaining entries based on enzyme and organism similarity,
    and iteratively computes a score for each candidate to identify the best match. If a candidate requires an Arrhenius
    adjustment, the kcat value is recalculated accordingly.

    Parameters:
        kcat_dict (dict): Dictionary containing enzyme information.
        api_output (pd.DataFrame): DataFrame containing kcat entries and metadata from an API.
        general_criteria (dict): Dictionary specifying matching criteria.

    Returns:
        tuple:
            best_score (float): The lowest score found, representing the best match.
            best_candidate (dict or None): Dictionary of the best matching candidate's data, or None if no match is found.
    """

    # 1. Remove mutant enzymes
    api_output = api_output[api_output["EnzymeVariant"].isin(['wildtype', None])].copy()
    if api_output.empty:
        return 15, None

    # 2. Compute score and adjust kcat if needed
    scores = []
    adjusted_kcats, adjusted_temps = [], []

    for _, row in api_output.iterrows():
        candidate_dict = row.to_dict()
        score, arrhenius = compute_score(kcat_dict, candidate_dict, general_criteria, api_output)
        if arrhenius:
            kcat = arrhenius_equation(candidate_dict, api_output, general_criteria)
            if 10e-8 < kcat < 10e+8: 
                candidate_dict['value'] = kcat
                candidate_dict['Temperature'] = np.mean(general_criteria["Temperature"])
            # If the kcat value calculated using the Arrhenius equation is aberrant, use the uncorrected value instead
            else:
                logging.warning(f"{candidate_dict.get('ECNumber')}: Corrected kcat ({kcat:.0f} s-1) is outside the expected range of 10e-8, 10e+8.")
                temperature = float(candidate_dict['Temperature'])
                if np.isnan(temperature): 
                    score += 1
                else: 
                    score += 2
        scores.append(score)
        adjusted_kcats.append(candidate_dict.get('value', row['value']))
        adjusted_temps.append(candidate_dict.get('Temperature', row['Temperature']))

    api_output['score'] = scores
    api_output['adj_kcat'] = adjusted_kcats
    api_output['adj_temp'] = adjusted_temps

    api_output["score"] = pd.to_numeric(api_output["score"], errors="coerce").fillna(13)
    api_output["adj_kcat"] = pd.to_numeric(api_output["adj_kcat"], errors="coerce")

    # Initialize columns for tie-breaking
    api_output['id_perc'] = -1
    api_output['organism_score'] = np.inf

    # 3. Keep only best-score candidates
    min_score = api_output['score'].min()
    tied = api_output[api_output['score'] == min_score]

    # 4. Tie-breaking
    if len(tied) > 1:
        # Tie-break with enzyme identity
        tied = closest_enz(kcat_dict, tied)
        if not tied['id_perc'].isna().all():
            max_id = tied['id_perc'].max()
            tied = tied[tied['id_perc'] == max_id]

    if len(tied) > 1:
        # Tie-break with taxonomy
        tied = closest_taxonomy(general_criteria, tied)
        if not tied['organism_score'].isna().all():
            min_tax = tied['organism_score'].min()
            tied = tied[tied['organism_score'] == min_tax]

    if len(tied) > 1:
        # Tie-break with max kcat value
        max_kcat = tied['adj_kcat'].max()
        tied = tied[tied['adj_kcat'] == max_kcat]

    # 5. Select best candidate
    best_candidate = tied.iloc[0].to_dict()
    best_candidate['catalytic_enzyme'] = kcat_dict.get('catalytic_enzyme')
    best_score = best_candidate['score']

    # 6. Compute organism_score and id_perc if not present 
    if best_candidate['organism_score'] == np.inf:
        if best_candidate.get('Organism') == general_criteria['Organism']:
            best_candidate['organism_score'] = 0
        else:
            tmp_df = pd.DataFrame([best_candidate])
            taxonomy_score = closest_taxonomy(general_criteria, tmp_df).iloc[0]['organism_score']
            best_candidate['organism_score'] = taxonomy_score

    if best_candidate['id_perc'] == -1:
        catalytic = kcat_dict.get('catalytic_enzyme')
        catalytic_list = str(catalytic).split(";") if catalytic and pd.notna(catalytic) else []

        if best_candidate.get('UniProtKB_AC') in catalytic_list:
            best_candidate['id_perc'] = 100.0
        else:
            tmp_df = pd.DataFrame([best_candidate])
            best_candidate['id_perc'] = closest_enz(kcat_dict, tmp_df).iloc[0]['id_perc']

    return best_score, best_candidate

Find closest enzyme and organism

wildkcat.utils.organism

closest_enz(kcat_dict, api_output)

Retrieves and ranks the enzyme sequences closest to the sequence of the target enzyme, based on percentage identity. If the reference UniProt ID is missing, invalid, or the sequence cannot be retrieved, the function returns the input DataFrame with "id_perc" set to None.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| kcat_dict | dict | Dictionary containing at least the key 'catalytic_enzyme' with the reference UniProt ID. | required |
| api_output | DataFrame | DataFrame containing a column "UniProtKB_AC" with UniProt IDs to compare against. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A copy of api_output with an added "id_perc" column (identity percentage). |

Source code in wildkcat/utils/organism.py
def closest_enz(kcat_dict, api_output) -> pd.DataFrame:
    """
    Retrieves and ranks the enzyme sequences closest to the sequence of the target enzyme, based on percentage identity.
    If the reference UniProt ID is missing, invalid, or the sequence cannot be retrieved, the function returns the input DataFrame with "id_perc" set to None.

    Parameters:    
        kcat_dict (dict): Dictionary containing at least the key 'catalytic_enzyme' with the reference UniProt ID.
        api_output (pd.DataFrame): DataFrame containing a column "UniProtKB_AC" with UniProt IDs to compare against.

    Returns:
        pd.DataFrame: A copy of `api_output` with an added "id_perc" column (identity percentage). 
    """

    def _calculate_identity(seq_ref, seq_db):
        """
        Returns the percentage of identical characters between two sequences.
        Adapted from https://gist.github.com/JoaoRodrigues/8c2f7d2fc5ae38fc9cb2 

        Parameters: 
            seq_ref (str): The reference sequence.
            seq_db (str): The sequence to compare against.

        Returns: 
            float: The percentage of identical characters between the two sequences.
        """
        matches = [a == b for a, b in zip(seq_ref, seq_db)]
        return (100 * sum(matches)) / len(seq_ref)

    ref_uniprot_id = kcat_dict.get('catalytic_enzyme')
    if pd.isna(ref_uniprot_id) or (";" in str(ref_uniprot_id)):
        api_output = api_output.copy()
        api_output["id_perc"] = None
        return api_output

    ref_seq = convert_uniprot_to_sequence(ref_uniprot_id)
    if ref_seq is None:
        api_output = api_output.copy()
        api_output["id_perc"] = None
        return api_output

    aligner = Align.PairwiseAligner()
    identity_scores = []

    for uniprot_id in api_output["UniProtKB_AC"]:
        if pd.isna(uniprot_id):
            identity_scores.append(None)
            continue
        seq = convert_uniprot_to_sequence(uniprot_id)
        if seq is None:
            identity_scores.append(None)
            continue
        elif len(seq) == 0:
            identity_scores.append(0)
            continue

        alignments = aligner.align(ref_seq, seq)
        aligned_ref, aligned_db = alignments[0]
        id_score = _calculate_identity(aligned_ref, aligned_db)
        identity_scores.append(id_score)

    api_output = api_output.copy()
    api_output["id_perc"] = identity_scores

    return api_output
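
The identity metric underlying closest_enz can be illustrated in isolation. This sketch assumes the two sequences are already aligned (the function above performs the alignment with Biopython first); the sequences are invented:

```python
def percent_identity(seq_ref: str, seq_db: str) -> float:
    """Percentage of positions in seq_ref that match seq_db (sequences assumed pre-aligned)."""
    matches = sum(a == b for a, b in zip(seq_ref, seq_db))
    return 100 * matches / len(seq_ref)

# Identical sequences score 100; one substitution across four residues scores 75.
print(percent_identity("MKTA", "MKTA"))  # 100.0
print(percent_identity("MKTA", "MKSA"))  # 75.0
```

Note that the score is normalized by the length of the reference sequence, so gaps in the database sequence lower identity but do not inflate it.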

closest_taxonomy(general_criteria, api_output)

Retrieves and ranks organisms based on their taxonomic similarity to the reference organism.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| general_criteria | dict | Dictionary containing at least the key 'Organism' with the reference organism. | required |
| api_output | DataFrame | DataFrame containing a column "Organism". | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A copy of api_output with an added "organism_score" column. |

Source code in wildkcat/utils/organism.py
def closest_taxonomy(general_criteria, api_output) -> pd.DataFrame: 
    """
    Retrieves and ranks organisms based on their taxonomic similarity to the reference organism.

    Parameters:    
        general_criteria (dict): Dictionary containing at least the key 'Organism' with the reference organism.
        api_output (pd.DataFrame): DataFrame containing a column "Organism". 

    Returns:
        pd.DataFrame: A copy of `api_output` with an added "organism_score" column.
    """
    @lru_cache(maxsize=None)
    def _fetch_taxonomy(species_name): 
        """
        Fetches the taxonomic lineage for a given species name using NCBI Entrez.

        Parameters:
            species_name (str): The name of the species.

        Returns: 
            list: A list of scientific names representing the taxonomic lineage.
        """
        Entrez.email = os.getenv("ENTREZ_EMAIL")

        for attempt in range(1, 4): # Retry up to 3 times
            try:
                handle = Entrez.esearch(db="taxonomy", term=species_name)
                record = Entrez.read(handle)
                if not record["IdList"]:
                    return []
                tax_id = record["IdList"][0]

                handle = Entrez.efetch(db="taxonomy", id=tax_id, retmode="xml")
                records = Entrez.read(handle, validate=False)
                if not records:
                    return []

                lineage = [taxon["ScientificName"] for taxon in records[0]["LineageEx"]]
                lineage.append(records[0]["ScientificName"])  # include the species itself
                return lineage

            except (HTTPError, URLError) as e:
                if attempt < 3:
                    sleep_time = 5
                    time.sleep(sleep_time)
                else:
                    return []

            except Exception as e:
                print(f"[Error] Unexpected error for '{species_name}': {e}")
                return []

    @lru_cache(maxsize=None)
    def _calculate_taxonomy_score(ref_organism, target_organism): 
        """
        Calculate a taxonomy distance score between reference and target organisms.

        Parameters: 
            ref_organism (str): The reference organism's name.
            target_organism (str): The target organism's name.

        Returns:
            int: distance between reference and target organisms (0 = identical species, higher = more distant).
        """
        ref_lineage = _fetch_taxonomy(ref_organism)
        target_lineage = _fetch_taxonomy(target_organism)

        if not target_lineage: # If target organism is not found
            return len(ref_lineage) + 1  # Penalize missing taxonomy

        similarity = 0

        for taxon in target_lineage: 
            if taxon in ref_lineage:
                similarity += 1
            else:
                break
        return len(ref_lineage) - similarity

    ref_organism = general_criteria['Organism']
    api_output = api_output.copy()
    api_output["organism_score"] = [
        _calculate_taxonomy_score(ref_organism, target) 
        for target in api_output["Organism"]
    ]
    return api_output
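
To make the scoring concrete, here is a standalone sketch of the distance computation with hardcoded, illustrative lineages (the real function fetches lineages from NCBI Entrez):

```python
def taxonomy_distance(ref_lineage, target_lineage):
    """Distance = reference lineage length minus the number of shared taxa (0 = same species)."""
    if not target_lineage:
        return len(ref_lineage) + 1  # penalize missing taxonomy
    shared = 0
    for taxon in target_lineage:
        if taxon in ref_lineage:
            shared += 1
        else:
            break
    return len(ref_lineage) - shared

# Illustrative (truncated) lineages:
ecoli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales", "Escherichia coli"]
salmonella = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales", "Salmonella enterica"]
print(taxonomy_distance(ecoli, ecoli))       # 0
print(taxonomy_distance(ecoli, salmonella))  # 1
print(taxonomy_distance(ecoli, []))          # 6
```

Lower scores therefore mean closer relatives, which is why the tie-breaking step above keeps the minimum organism_score.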

Correct the kcat value using the Arrhenius equation

wildkcat.utils.temperature

arrhenius_equation(candidate, api_output, general_criteria)

Estimates the kcat value at a target temperature using the Arrhenius equation, based on available experimental data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| candidate | dict | Information about the enzyme candidate. | required |
| api_output | DataFrame | DataFrame containing experimental kcat values. | required |
| general_criteria | dict | Dictionary specifying selection criteria, including 'Temperature' and 'pH'. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Estimated kcat value at the objective temperature, calculated using the Arrhenius equation. |

Source code in wildkcat/utils/temperature.py
def arrhenius_equation(candidate, api_output, general_criteria) -> float:
    """
    Estimates the kcat value at a target temperature using the Arrhenius equation, based on available experimental data.

    Parameters:
        candidate (dict): Information about the enzyme candidate.
        api_output (pd.DataFrame): DataFrame containing experimental kcat values.
        general_criteria (dict): Dictionary specifying selection criteria, including 'Temperature' and 'pH'.

    Returns:
        float: Estimated kcat value at the objective temperature, calculated using the Arrhenius equation.
    """

    def calculate_kcat(temp_obj, ea, kcat_ref, temp_ref): 
        """
        Calculates the catalytic rate constant (kcat) at a given temperature using the Arrhenius equation.

        Parameters: 
            temp_obj (float): The target temperature (in Kelvin) at which to calculate kcat.
            ea (float): The activation energy calculated using find_ea(). 
            kcat_ref (float): The reference kcat value measured at temp_ref.
            temp_ref (float): The reference temperature (in Kelvin) at which kcat_ref was measured.

        Returns: 
            float: The calculated kcat value at temp_obj.
        """
        r = 8.314
        kcat_obj = kcat_ref * np.exp(ea / r * (1/temp_ref - 1/temp_obj))
        return kcat_obj

    # Objective temperature
    obj_temp = np.mean(general_criteria["Temperature"]) + 273.15

    # Format the api_output DataFrame
    ph_min, ph_max = general_criteria["pH"]
    filters = (
        (api_output["UniProtKB_AC"] == candidate["UniProtKB_AC"]) &
        api_output["Temperature"].notna() &
        api_output["value"].notna() &
        api_output["pH"].between(ph_min, ph_max)
    )
    api_filtered = api_output.loc[filters, ["Temperature", "value"]].copy()

    # Convert temperatures to Kelvin
    api_filtered["Temperature"] = api_filtered["Temperature"] + 273.15

    # Estimate the activation energy (Ea)
    ea, _ = calculate_ea(api_filtered)

    # Select one kcat for the ref
    kcat_ref = float(api_filtered['value'].iloc[0])
    temp_ref = float(api_filtered['Temperature'].iloc[0])

    kcat = calculate_kcat(obj_temp, ea, kcat_ref, temp_ref)

    return kcat
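
Numerically, the correction rescales the reference kcat by an exponential factor in the inverse temperatures. A minimal, dependency-light sketch (the Ea value is illustrative):

```python
import math

R = 8.314  # gas constant, J/(mol*K)

def arrhenius_correct(kcat_ref, temp_ref, temp_obj, ea):
    """Rescale a kcat measured at temp_ref (K) to temp_obj (K), given activation energy ea (J/mol)."""
    return kcat_ref * math.exp(ea / R * (1 / temp_ref - 1 / temp_obj))

# A kcat of 10 s-1 measured at 25 degC (298.15 K), corrected to 37 degC (310.15 K)
# with Ea = 50 kJ/mol, roughly doubles: positive Ea means kcat grows with temperature.
print(round(arrhenius_correct(10.0, 298.15, 310.15, 50_000), 1))
```

When temp_obj equals temp_ref the factor is exactly 1, so the reference value passes through unchanged.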

calculate_ea(df)

Estimate the activation energy (Ea) using the Arrhenius equation from kcat values at different temperatures.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame with at least 'Temperature' (K) and 'value' (kcat) columns. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[float, float] | Estimated activation energy (Ea) in J/mol and the R² of the linear Arrhenius fit. |

Source code in wildkcat/utils/temperature.py
def calculate_ea(df) -> tuple[float, float]:
    """
    Estimate the activation energy (Ea) using the Arrhenius equation from kcat values at different temperatures.

    Parameters:
        df (pd.DataFrame): DataFrame with at least 'Temperature' (K) and 'value' (kcat) columns.

    Returns:
        tuple[float, float]: Estimated activation energy (Ea) in J/mol and the R² of the linear Arrhenius fit.
    """

    r = 8.314  # Gas constant in J/(mol*K)

    # Filter out rows with missing values
    valid = df[['Temperature', 'value']].dropna()

    temps_K = valid['Temperature'].values
    kcats = pd.to_numeric(valid['value'], errors='coerce').values

    x = 1 / temps_K
    y = np.log(kcats)
    slope, intercept = np.polyfit(x, y, 1)

    # R2 
    y_pred = slope * x + intercept
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1 - ss_res / ss_tot if ss_tot != 0 else np.nan

    # Activation energy 
    ea = float(-slope * r)

    return ea, r2 
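
The fit above is ordinary least squares on the linearized Arrhenius plot (ln kcat against 1/T), where the slope equals -Ea/R. A dependency-free sketch that recovers a known Ea from synthetic data:

```python
import math

R = 8.314  # gas constant, J/(mol*K)

def estimate_ea(temps_K, kcats):
    """Least-squares slope of ln(kcat) vs 1/T; Ea = -slope * R."""
    xs = [1 / t for t in temps_K]
    ys = [math.log(k) for k in kcats]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope * R

# Synthetic kcats generated with Ea = 40 kJ/mol are recovered exactly (up to float error).
ea_true = 40_000
temps = [290.0, 300.0, 310.0]
kcats = [math.exp(-ea_true / (R * t)) for t in temps]
print(round(estimate_ea(temps, kcats)))  # 40000
```

With only a single (T, kcat) pair the slope is undefined, which is why the pipeline falls back to the uncorrected value when the fit is not possible or the corrected kcat is aberrant.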

API

wildkcat.api.brenda_api

create_brenda_client(wsdl_url='https://www.brenda-enzymes.org/soap/brenda_zeep.wsdl')

Creates and configures a persistent SOAP client for the BRENDA API.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| wsdl_url | str | URL to the BRENDA WSDL file. | 'https://www.brenda-enzymes.org/soap/brenda_zeep.wsdl' |

Returns:

| Type | Description |
| --- | --- |
| Client | Configured zeep SOAP client. |

Source code in wildkcat/api/brenda_api.py
def create_brenda_client(wsdl_url: str = "https://www.brenda-enzymes.org/soap/brenda_zeep.wsdl") -> Client:
    """
    Creates and configures a persistent SOAP client for the BRENDA API.

    Parameters:
        wsdl_url (str): URL to the BRENDA WSDL file.

    Returns:
        zeep.Client: Configured SOAP client.
    """
    # Configure retry logic for network resilience
    session = Session()
    retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    # Set a custom User-Agent (some servers block default Python UA)
    session.headers.update({"User-Agent": "BRENDA-Client"})

    # Create zeep transport and settings
    transport = Transport(session=session, cache=InMemoryCache())
    settings = Settings(strict=False, xml_huge_tree=True) 

    return Client(wsdl_url, settings=settings, transport=transport)

format_brenda_response(df, df_org, ec_number=None)

Merges and formats the BRENDA API response DataFrames.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame containing turnover number entries. | required |
| df_org | DataFrame | DataFrame containing organism entries. | required |
| ec_number | str | EC number for cofactor retrieval. | None |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame combining the turnover number and organism entries. |

Source code in wildkcat/api/brenda_api.py
def format_brenda_response(df, df_org, ec_number=None) -> pd.DataFrame:
    """
    Merges and formats the BRENDA API response DataFrames.

    Parameters:
        df (pd.DataFrame): DataFrame containing turnover number entries.
        df_org (pd.DataFrame): DataFrame containing organism entries.
        ec_number (str, optional): EC number for cofactor retrieval.

    Returns:
        df (pd.DataFrame): DataFrame containing both information from turnover number and organism entries.
    """
    # Format the organism response
    df_org.drop(columns=['commentary', 'textmining'], inplace=True, errors='ignore')

    # Merge on the literature column TODO: Check if this can be improved 
    df_org['literature'] = df_org['literature'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else x)
    df['literature'] = df['literature'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else x)
    df = pd.merge(df, df_org, on=['literature', 'organism'], how='inner')
    df.drop_duplicates(inplace=True)

    # Rename columns for consistency with other APIs
    df.rename(columns={
        'turnoverNumber': 'value',
        'sequenceCode' : 'UniProtKB_AC',
        'substrate': 'Substrate',
        'organism': 'Organism',
        'ecNumber': 'ECNumber'}, inplace=True) 

    # Extract pH from commentary
    df["pH"] = df["commentary"].str.extract(r"pH\s*([\d\.]+)")
    # Extract temperature from commentary (the degree sign is often returned garbled as '?', hence '\?C')
    df["Temperature"] = df["commentary"].str.extract(r"([\d\.]+)\?C")
    # Convert Temperature and pH to numeric, coercing errors to NaN
    df['Temperature'] = pd.to_numeric(df['Temperature'], errors='coerce')
    df['pH'] = pd.to_numeric(df['pH'], errors='coerce')
    # Extract enzyme variant from commentary
    df["EnzymeVariant"] = df["commentary"].apply(get_variant)
    # Drop unnecessary columns
    df.drop(columns=["literature", "turnoverNumberMaximum", "parameter.endValue", "commentary", "ligandStructureId"], inplace=True, errors='ignore')

    if ec_number is not None:
        # Remove the cofactor from the output 
        cofactor = get_cofactor(ec_number)
        # Drop the lines where the substrate is a cofactor
        df = df[~df['Substrate'].isin(cofactor)]   

    # Drop duplicates
    df.drop_duplicates(inplace=True)
    # Add a column for the db 
    df['db'] = 'brenda' 
    return df
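
The pH and temperature parsing relies on simple regular expressions over the free-text commentary. A standalone sketch (the commentary string is invented, and the degree sign is written literally here, whereas the code above matches the '?C' form the API often returns):

```python
import re

def parse_conditions(commentary):
    """Pull pH and temperature out of a BRENDA-style commentary string, if present."""
    ph = re.search(r"pH\s*([\d\.]+)", commentary)
    temp = re.search(r"([\d\.]+)\s*°C", commentary)
    return (float(ph.group(1)) if ph else None,
            float(temp.group(1)) if temp else None)

print(parse_conditions("mutant enzyme, pH 7.5, 30°C"))  # (7.5, 30.0)
print(parse_conditions("no conditions given"))          # (None, None)
```

Entries whose commentary lacks either value end up with NaN after the pd.to_numeric coercion, and the scoring step penalizes them accordingly.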

get_brenda_credentials()

Retrieves and hashes BRENDA API credentials from environment variables.

Returns:

| Type | Description |
| --- | --- |
| tuple[str, str] | (email, hashed_password) |

Source code in wildkcat/api/brenda_api.py
def get_brenda_credentials() -> tuple[str, str]:
    """
    Retrieves and hashes BRENDA API credentials from environment variables.

    Returns:
        tuple[str, str]: (email, hashed_password)
    """
    email = os.getenv("BRENDA_EMAIL")
    password = os.getenv("BRENDA_PASSWORD")

    if not email or not password:
        raise ValueError("BRENDA_EMAIL and BRENDA_PASSWORD environment variables must be set.")

    hashed_password = hashlib.sha256(password.encode("utf-8")).hexdigest()
    return email, hashed_password
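
The BRENDA SOAP API expects the password as a SHA-256 hex digest rather than plaintext; the hashing step in isolation:

```python
import hashlib

def hash_password(password: str) -> str:
    """SHA-256 hex digest of the password, as expected by the BRENDA SOAP API."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

# The digest is a 64-character hex string; for the classic test vector "abc":
print(hash_password("abc"))
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

The digest is sent in place of the password in every SOAP parameter list below.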

get_cofactor(ec_number) cached

Queries the BRENDA SOAP API to retrieve cofactor information for a given Enzyme Commission (EC) number.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ec_number | str | EC number (e.g., '1.1.1.1'). | required |

Returns:

| Type | Description |
| --- | --- |
| list | Cofactor names associated with the EC number; empty if none are recorded. |

Source code in wildkcat/api/brenda_api.py
@lru_cache(maxsize=None)
def get_cofactor(ec_number) -> list:
    """
    Queries the BRENDA SOAP API to retrieve cofactor information for a given Enzyme Commission (EC) number.

    Parameters:
        ec_number (str): EC number (e.g., '1.1.1.1').

    Returns:
        list: Cofactor names associated with the EC number; empty if none are recorded.
    """
    # Call the SOAP API
    email, hashed_password = get_brenda_credentials()
    client = create_brenda_client()

    parameters_cofactor = [
        email,
        hashed_password,
        f'ecNumber*{ec_number}',
        "cofactor*", 
        "commentary*", 
        "organism*", 
        "ligandStructureId*", 
        "literature*"
    ]

    result_cofactor = client.service.getCofactor(*parameters_cofactor)
    data = serialize_object(result_cofactor)
    df = pd.DataFrame(data)
    if df.empty:
        return []
    cofactor = df['cofactor'].unique().tolist()
    return cofactor

get_enzyme_brenda(uniprot_id, organism) cached

Queries the BRENDA SOAP API to retrieve turnover number values for a UniProt enzyme.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| uniprot_id | str | UniProt ID of the enzyme (e.g., 'P12345'). | required |
| organism | str | Name of the organism. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame containing both turnover number and organism entries. |

Source code in wildkcat/api/brenda_api.py
@lru_cache(maxsize=None)
def get_enzyme_brenda(uniprot_id, organism) -> pd.DataFrame:
    """
    Queries the BRENDA SOAP API to retrieve turnover number values for a UniProt enzyme.

    Parameters:
        uniprot_id (str): UniProt ID of the enzyme (e.g., 'P12345').
        organism (str): Name of the organism.

    Returns:
        df (pd.DataFrame): DataFrame containing both information from turnover number and organism entries.
    """

    email, hashed_password = get_brenda_credentials()
    client = create_brenda_client()

    # Define the parameters for the SOAP request
    parameters = [
        email,
        hashed_password,
        "ecNumber*",
        f"organism*{organism}",
        f"sequenceCode*{uniprot_id}",
        "commentary*", 
        "literature*",
        "textmining*"
    ]

    result = client.service.getOrganism(*parameters)

    data = serialize_object(result)

    if not data:
        logging.warning('%s: No data found for the query in BRENDA.' % f"{uniprot_id}")
        return pd.DataFrame()

    df_enz = pd.DataFrame(data)
    df_org = get_kcat_from_organism(organism)

    df = format_brenda_response(df_org, df_enz)

    if df.empty:
        logging.warning('%s: No valid data found for the query in BRENDA.' % f"{uniprot_id}")
        return pd.DataFrame()

    return df

get_kcat_from_organism(organism) cached

Queries the BRENDA SOAP API to retrieve turnover number entries for a given organism.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| organism | str | Name of the organism. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A DataFrame containing turnover number entries for the organism. |

Source code in wildkcat/api/brenda_api.py
@lru_cache(maxsize=None)
def get_kcat_from_organism(organism) -> pd.DataFrame:
    """
    Queries the BRENDA SOAP API to retrieve organism information.

    Parameters:
        organism (str): Name of the organism.

    Returns:
        pd.DataFrame: A DataFrame containing organism information.
    """
    email, hashed_password = get_brenda_credentials()
    client = create_brenda_client()

    parameters = [
        email,
        hashed_password,
        "ecNumber*",
        "turnoverNumber*", 
        "turnoverNumberMaximum*", 
        "substrate*", 
        "commentary*", 
        f"organism*{organism}", 
        "ligandStructureId*", 
        "literature*"
    ]

    result = client.service.getTurnoverNumber(*parameters)
    data = serialize_object(result)

    if not data:
        raise ValueError(f"The specified organism {organism} does not exist in the BRENDA database. Please verify the organism name.")

    # Remove None values (-999)
    data = [entry for entry in data if entry.get('turnoverNumber') is not None and entry.get('turnoverNumber') != '-999']
    if data == []:
        raise ValueError(f"No valid turnover number entries found in BRENDA for organism {organism}. Please verify the organism name.")

    df = pd.DataFrame(data)

    return df

get_turnover_number_brenda(ec_number) cached

Queries the BRENDA SOAP API to retrieve turnover number values for an Enzyme Commission (EC) number.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ec_number | str | EC number (e.g., '1.1.1.1'). | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame containing both turnover number and organism entries. |

Source code in wildkcat/api/brenda_api.py
@lru_cache(maxsize=None)
def get_turnover_number_brenda(ec_number) -> pd.DataFrame:
    """
    Queries the BRENDA SOAP API to retrieve turnover number values for an Enzyme Commission (EC) number.

    Parameters:
        ec_number (str): EC number (e.g., '1.1.1.1').

    Returns:
        df (pd.DataFrame): DataFrame containing both information from turnover number and organism entries.
    """
    email, hashed_password = get_brenda_credentials()
    client = create_brenda_client()

    # Define the parameters for the SOAP request

    parameters_kcat = [
        email,
        hashed_password,
        f'ecNumber*{ec_number}',
        "turnoverNumber*", 
        "turnoverNumberMaximum*", 
        "substrate*", 
        "commentary*", 
        "organism*", 
        "ligandStructureId*", 
        "literature*"
    ]

    parameters_org = [
        email,
        hashed_password,
        f'ecNumber*{ec_number}',
        "organism*",
        "sequenceCode*", 
        "commentary*", 
        "literature*",
        "textmining*"
    ]

    # print(client.service.__getattr__('getTurnoverNumber').__doc__)
    # print(client.service.__getattr__('getOrganism').__doc__)

    result_kcat = client.service.getTurnoverNumber(*parameters_kcat)
    result_organism = client.service.getOrganism(*parameters_org)

    # Format the response into a DataFrame
    data = serialize_object(result_kcat)
    data_organism = serialize_object(result_organism)

    if not data:
        logging.warning('%s: No data found for the query in BRENDA.' % f"{ec_number}")
        return pd.DataFrame()

    # Remove None values (-999)
    data = [entry for entry in data if entry.get('turnoverNumber') is not None and entry.get('turnoverNumber') != '-999']
    if data == []:
        logging.warning('%s: No valid data found for the query in BRENDA.' % f"{ec_number}")
        return pd.DataFrame()

    df = pd.DataFrame(data)
    df_org = pd.DataFrame(data_organism)

    # Format and merge the response
    df_formatted = format_brenda_response(df, df_org, ec_number)
    return df_formatted

get_variant(text)

Extracts the enzyme variant information from the commentary text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Commentary text from BRENDA API response. | required |

Returns:

| Type | Description |
| --- | --- |
| str \| None | The extracted enzyme variant: wildtype, mutant, or None if not found. |

Source code in wildkcat/api/brenda_api.py
def get_variant(text) -> str | None:
    """
    Extracts the enzyme variant information from the commentary text.

    Parameters:
        text (str): Commentary text from BRENDA API response.

    Returns:
        str: The extracted enzyme variant information: wildtype, mutant, or None if not found.
    """
    if text is None or pd.isna(text):
        return None
    text = text.lower()
    if "wild" in text:  # wild-type, wildtype or wild type
        return "wildtype"
    elif any(word in text for word in ["mutant", "mutated", "mutation"]):
        return "mutant"
    return None
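
Example classifications, using a pandas-free re-implementation of the same rules (note that the wildtype check wins when both keywords appear, exactly as in the original):

```python
def classify_variant(text):
    """Pandas-free mirror of get_variant: 'wildtype', 'mutant', or None."""
    if text is None:
        return None
    text = text.lower()
    if "wild" in text:  # wild-type, wildtype or wild type
        return "wildtype"
    if any(word in text for word in ["mutant", "mutated", "mutation"]):
        return "mutant"
    return None

print(classify_variant("wild-type enzyme, pH 7.0"))  # wildtype
print(classify_variant("mutant D102N"))              # mutant
print(classify_variant("recombinant enzyme"))        # None
```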

wildkcat.api.sabio_rk_api

get_enzyme_sabio(uniprot_id)

Retrieve enzyme data from SABIO-RK for a given UniProtKB accession.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| uniprot_id | str | UniProtKB accession. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame containing SABIO-RK entries for kcat. |
Source code in wildkcat/api/sabio_rk_api.py
def get_enzyme_sabio(uniprot_id) -> pd.DataFrame:
    """
    Retrieve enzyme data from SABIO-RK for a given UniProtKB accession.

    Parameters:
        uniprot_id (str): UniProtKB accession.

    Returns:
        pd.DataFrame: DataFrame containing SABIO-RK entries for kcat.
    """
    base_url = 'https://sabiork.h-its.org/sabioRestWebServices/searchKineticLaws/entryIDs'
    entryIDs = []

    # -- Retrieve entryIDs --
    query = {'format': 'txt', 'q': f'Parametertype:"kcat" AND UniProtKB_AC:"{uniprot_id}"'}

    # Make GET request
    request = requests.get(base_url, params=query)
    request.raise_for_status()
    if request.text == "no data found":
        logging.warning('%s: No data found for the query in SABIO-RK.' % f"{uniprot_id}")
        return pd.DataFrame()  # Return empty DataFrame if no data found

    entryIDs = [int(x) for x in request.text.strip().split('\n')]
    df = query_sabio(entryIDs)

    return df

get_turnover_number_sabio(ec_number) cached

Retrieve turnover number (kcat) data from SABIO-RK for a given EC number.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ec_number | str | Enzyme Commission number. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame containing SABIO-RK entries for kcat. |

Source code in wildkcat/api/sabio_rk_api.py
@lru_cache(maxsize=None)
def get_turnover_number_sabio(ec_number) -> pd.DataFrame:
    """
    Retrieve turnover number (kcat) data from SABIO-RK for a given EC number.

    Parameters:
        ec_number (str): Enzyme Commission number.

    Returns:
        pd.DataFrame: DataFrame containing SABIO-RK entries for kcat.
    """
    base_url = 'https://sabiork.h-its.org/sabioRestWebServices/searchKineticLaws/entryIDs'
    entryIDs = []

    # -- Retrieve entryIDs --
    query = {'format': 'txt', 'q': f'Parametertype:"kcat" AND ECNumber:"{ec_number}"'}

    # Make GET request
    request = requests.get(base_url, params=query)
    request.raise_for_status()
    if request.text == "no data found":
        logging.warning('%s: No data found for the query in SABIO-RK.' % f"{ec_number}")
        return pd.DataFrame()  # Return empty DataFrame if no data found

    entryIDs = [int(x) for x in request.text.strip().split('\n')]
    df = query_sabio(entryIDs)

    return df

query_sabio(entryIDs)

Retrieve SABIO-RK entries for given entry IDs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| entryIDs | list | List of SABIO-RK entry IDs. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame containing SABIO-RK entries. |

Source code in wildkcat/api/sabio_rk_api.py
def query_sabio(entryIDs) -> pd.DataFrame:
    """
    Retrieve SABIO-RK entries for given entry IDs.

    Parameters:
        entryIDs (list): List of SABIO-RK entry IDs.

    Returns:
        pd.DataFrame: DataFrame containing SABIO-RK entries.
    """
    parameters = 'https://sabiork.h-its.org/entry/exportToExcelCustomizable'

    data_field = {'entryIDs[]': entryIDs}
    # Possible fields to retrieve:
    # EntryID, Reaction, Buffer, ECNumber, CellularLocation, UniProtKB_AC, Tissue, Enzyme Variant, Enzymename, Organism
    # Temperature, pH, Activator, Cofactor, Inhibitor, KeggReactionID, KineticMechanismType, Other Modifier, Parameter,
    # Pathway, Product, PubMedID, Publication, Rate Equation, SabioReactionID, Substrate
    query = {'format':'tsv', 'fields[]':['EntryID', 'ECNumber', 'KeggReactionID', 'Reaction', 'Substrate', 'Product', 
                                         'UniProtKB_AC', 'Organism', 'Enzyme Variant', 'Temperature', 'pH', 
                                         'Parameter']}

    # Make POST request
    request = requests.post(parameters, params=query, data=data_field)
    request.raise_for_status()

    # Format the response into a DataFrame
    df = pd.read_csv(StringIO(request.text), sep='\t')
    df = df[df['parameter.name'].str.lower() == 'kcat'].reset_index(drop=True) # Keep only kcat parameters
    # Convert Temperature and pH to numeric, coercing errors to NaN
    df['Temperature'] = pd.to_numeric(df['Temperature'], errors='coerce')
    df['pH'] = pd.to_numeric(df['pH'], errors='coerce')
    # Drop unnecessary columns
    df.drop(columns=['EntryID', 'parameter.name', 'parameter.type', 'parameter.associatedSpecies', 
                     'parameter.endValue', 'parameter.standardDeviation'], inplace=True, errors='ignore')
    # Drop duplicates based on normalized Substrate and Product sets
    df["Substrate_set"] = df["Substrate"].fillna("").str.split(";").apply(lambda x: tuple(sorted(s.strip() for s in x if s.strip())))
    df["Product_set"] = df["Product"].fillna("").str.split(";").apply(lambda x: tuple(sorted(s.strip() for s in x if s.strip())))
    dedup_cols = [col for col in df.columns if col not in ["Substrate", "Product"]]
    df = df.drop_duplicates(subset=dedup_cols + ["Substrate_set", "Product_set"], keep="first")
    df = df.drop(columns=["Substrate_set", "Product_set"])
    # Rename columns for consistency (identity mappings omitted)
    df.rename(columns={
        'Enzyme Variant': 'EnzymeVariant',
        'parameter.startValue': 'value',
        'parameter.unit': 'unit'
    }, inplace=True)
    # Add a column for the db
    df['db'] = 'sabio_rk'
    return df
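
The order-insensitive deduplication above hinges on normalizing the semicolon-separated Substrate/Product fields. A minimal sketch of that step, where `normalize_set` is a hypothetical helper mirroring the lambda used in `query_sabio`:

```python
def normalize_set(field):
    """Split a semicolon-separated field, strip whitespace, drop empties,
    and sort, so equivalent sets compare equal regardless of ordering."""
    parts = (field or "").split(";")
    return tuple(sorted(s.strip() for s in parts if s.strip()))

# 'ATP; Glucose' and 'Glucose;ATP ' normalize to the same tuple,
# so the two rows would be treated as duplicates.
a = normalize_set("ATP; Glucose")
b = normalize_set("Glucose;ATP ")
```

Making the tuple sorted is what lets `drop_duplicates` treat reordered substrate lists as the same entry.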

wildkcat.api.uniprot_api

catalytic_activity(uniprot_id) cached

Retrieves the EC (Enzyme Commission) numbers associated with the catalytic activity of a given UniProt ID.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| uniprot_id | str | The UniProt identifier for the protein of interest. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] \| None | A list of EC numbers if found, otherwise None. |

Source code in wildkcat/api/uniprot_api.py
@lru_cache(maxsize=None)
def catalytic_activity(uniprot_id) -> list[str] | None:
    """
    Retrieves the EC (Enzyme Commission) numbers associated with the catalytic activity of a given UniProt ID.

    Parameters:
        uniprot_id (str): The UniProt identifier for the protein of interest.

    Returns:
        list[str] or None: A list of EC numbers if found, otherwise None.
    """
    url = f"https://rest.uniprot.org/uniprotkb/{uniprot_id}?fields=cc_catalytic_activity"
    response = requests.get(url)

    if response.status_code == 200:
        data = response.json()
        ec_numbers = []
        for comment in data.get('comments', []):
            if comment.get('commentType') == 'CATALYTIC ACTIVITY':
                reaction = comment.get('reaction', {})
                ec_number = reaction.get('ecNumber')
                if ec_number:
                    ec_numbers.append(ec_number)
        return ec_numbers if ec_numbers else None
    # logging.warning(f"No catalytic activity found for UniProt ID {uniprot_id}")
    return None
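
The JSON walk above can be exercised offline. A sketch, where `extract_ec_numbers` is a hypothetical helper operating on a minimal mock of the UniProt response payload:

```python
def extract_ec_numbers(data):
    """Collect ecNumber entries from CATALYTIC ACTIVITY comments,
    as catalytic_activity does with the live UniProt JSON."""
    ec_numbers = []
    for comment in data.get('comments', []):
        if comment.get('commentType') == 'CATALYTIC ACTIVITY':
            ec_number = comment.get('reaction', {}).get('ecNumber')
            if ec_number:
                ec_numbers.append(ec_number)
    return ec_numbers or None

# Minimal mock of a UniProt 'cc_catalytic_activity' response.
mock = {'comments': [
    {'commentType': 'CATALYTIC ACTIVITY', 'reaction': {'ecNumber': '2.7.1.1'}},
    {'commentType': 'FUNCTION'},
]}
```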

convert_uniprot_to_sequence(uniprot_id) cached

Convert a UniProt accession ID to its corresponding amino acid sequence.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| uniprot_id | str | The UniProt accession ID. | required |

Returns:

| Type | Description |
| --- | --- |
| str \| None | The amino acid sequence, or None if not found. |

Source code in wildkcat/api/uniprot_api.py
@lru_cache(maxsize=None)
def convert_uniprot_to_sequence(uniprot_id) -> str | None:
    """
    Convert a UniProt accession ID to its corresponding amino acid sequence.

    Parameters:
        uniprot_id (str): The UniProt accession ID.

    Returns:
        str: The amino acid sequence, or None if not found.
    """
    url = f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"
    response = requests.get(url)

    if response.status_code == 200:
        fasta = response.text
        lines = fasta.splitlines()
        sequence = ''.join(lines[1:])  # Skip the header
        return sequence
    else:
        # logging.warning(f"Failed to retrieve sequence for UniProt ID {uniprot_id}")
        return None
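
The FASTA handling above is just "drop the header, join the rest". A standalone sketch of that parsing step, with `fasta_to_sequence` as a hypothetical helper:

```python
def fasta_to_sequence(fasta):
    """Join all lines after the '>' header line, as
    convert_uniprot_to_sequence does with the downloaded FASTA text."""
    lines = fasta.splitlines()
    return ''.join(lines[1:])  # Skip the header

fasta = ">sp|P12345|EXAMPLE test entry\nMKTAYIAK\nQRQISFVK\n"
```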

identify_catalytic_enzyme(lst_uniprot_ids, ec)

Identifies the catalytic enzyme from a list of UniProt IDs for a given EC number.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| lst_uniprot_ids | str | A semicolon-separated string of UniProt IDs representing enzyme candidates. | required |
| ec | str | The Enzyme Commission (EC) number to match against the catalytic activity. | required |

Returns:

| Type | Description |
| --- | --- |
| str \| None | The UniProt ID of the catalytic enzyme if exactly one match is found; a semicolon-joined string of IDs if several match; None if no match is found. |

Source code in wildkcat/api/uniprot_api.py
def identify_catalytic_enzyme(lst_uniprot_ids, ec) -> str | None:
    """
    Identifies the catalytic enzyme from a list of UniProt IDs for a given EC number.

    Parameters:
        lst_uniprot_ids (str): A semicolon-separated string of UniProt IDs representing enzyme candidates.
        ec (str): The Enzyme Commission (EC) number to match against the catalytic activity.

    Returns:
        str or None: The UniProt ID of the catalytic enzyme if exactly one match is found;
                     a semicolon-joined string of IDs if several match; None if no match is found.
    """ 
    enzymes_model = lst_uniprot_ids.split(';')
    catalytic_enzyme = []
    for enzyme in enzymes_model:
        if catalytic_activity(enzyme):
            if ec in catalytic_activity(enzyme):
                catalytic_enzyme.append(enzyme)
    if catalytic_enzyme == []:
        logging.warning(f"{ec}: No catalytic enzyme found for the complex {lst_uniprot_ids}.")
        catalytic_enzyme = None 
    elif len(catalytic_enzyme) > 1:
        logging.warning(f"{ec}: Multiple catalytic enzymes found for the complex {lst_uniprot_ids}.")
        catalytic_enzyme = ';'.join(catalytic_enzyme)
    else:
        catalytic_enzyme = catalytic_enzyme[0]
    return catalytic_enzyme

Machine Learning preprocessing

wildkcat.machine_learning.catapro

convert_cid_to_smiles(cid)

Converts a PubChem Compound ID (CID) to its corresponding SMILES representation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cid | str | PubChem Compound ID. | required |

Returns:

| Type | Description |
| --- | --- |
| list \| None | A list of SMILES strings if found, otherwise None. |

Source code in wildkcat/machine_learning/catapro.py
def convert_cid_to_smiles(cid) -> list | None:    
    """
    Converts a PubChem Compound ID (CID) to its corresponding SMILES representation.

    Parameters:
        cid (str): PubChem Compound ID.

    Returns:
       list or None: A list of SMILES strings if found, otherwise None.
    """
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/property/smiles/txt"
    try:
        safe_get_with_retry = retry_api()(safe_requests_get)
        response = safe_get_with_retry(url)

        if response is None:
            return None

        response.raise_for_status()
        smiles = response.text.strip().split('\n')
        return smiles
    except Exception:
        # Broad catch (no bare except) keeps the original behaviour:
        # network, HTTP, or parsing errors all yield None
        return None

convert_kegg_compound_to_sid(kegg_compound_id)

Convert the KEGG compound ID to the PubChem Substance ID (SID).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| kegg_compound_id | str | KEGG compound ID. | required |

Returns:

| Type | Description |
| --- | --- |
| str \| None | The PubChem SID if found, otherwise None. |

Source code in wildkcat/machine_learning/catapro.py
def convert_kegg_compound_to_sid(kegg_compound_id) -> str | None:
    """
    Convert the KEGG compound ID to the PubChem Substance ID (SID).

    Parameters:
        kegg_compound_id (str): KEGG compound ID.

    Returns:
        str: The PubChem SID if found, otherwise None.
    """
    url = f"https://rest.kegg.jp/conv/pubchem/compound:{kegg_compound_id}"
    safe_get_with_retry = retry_api()(safe_requests_get)
    response = safe_get_with_retry(url)

    if response is None:
        return None

    if response.status_code != 200:
        return None

    match = re.search(r'pubchem:\s*(\d+)', response.text)
    sid = match.group(1) if match else None
    return sid
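
The regex extraction above can be checked against a sample KEGG `/conv/pubchem` response line. A sketch, with `parse_pubchem_sid` as a hypothetical helper and the response body illustrative:

```python
import re

def parse_pubchem_sid(body):
    """Extract the SID from a KEGG '/conv/pubchem' response,
    using the same pattern as convert_kegg_compound_to_sid."""
    match = re.search(r'pubchem:\s*(\d+)', body)
    return match.group(1) if match else None

# KEGG conv responses are tab-separated 'source\ttarget' lines.
body = "cpd:C00031\tpubchem:3333\n"
```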

convert_kegg_to_smiles(kegg_compound_id) cached

Convert a KEGG compound ID to its corresponding SMILES representation (via the PubChem SID and CID).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| kegg_compound_id | str | KEGG compound ID. | required |

Returns:

| Type | Description |
| --- | --- |
| list \| None | A list of SMILES strings if found, otherwise None. |

Source code in wildkcat/machine_learning/catapro.py
@lru_cache(maxsize=None)
def convert_kegg_to_smiles(kegg_compound_id) -> list | None:
    """
    Convert a KEGG compound ID to its corresponding SMILES representation (via the PubChem SID and CID).

    Parameters:
        kegg_compound_id (str): KEGG compound ID.

    Returns:
        list or None: A list of SMILES strings if found, otherwise None.
    """
    sid = convert_kegg_compound_to_sid(kegg_compound_id)
    if sid is None:
        logging.warning('%s: Failed to retrieve SID for KEGG compound ID' % (kegg_compound_id))
        return None
    cid = convert_sid_to_cid(sid)
    if cid is None:
        logging.warning('%s: Failed to retrieve CID for KEGG compound ID' % (kegg_compound_id))
        return None
    smiles = convert_cid_to_smiles(cid)
    if smiles is None:
        logging.warning('%s: Failed to retrieve SMILES for KEGG compound ID' % (kegg_compound_id))
        return None
    return smiles

convert_sid_to_cid(sid)

Converts a PubChem Substance ID (SID) to the corresponding Compound ID (CID).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sid | str | PubChem Substance ID. | required |

Returns:

| Type | Description |
| --- | --- |
| int \| None | The corresponding PubChem Compound ID (CID), or None if not found. |

Source code in wildkcat/machine_learning/catapro.py
def convert_sid_to_cid(sid) -> int | None:
    """
    Converts a PubChem Substance ID (SID) to the corresponding Compound ID (CID).

    Parameters:
        sid (str): PubChem Substance ID.

    Returns:
        int or None: The corresponding PubChem Compound ID (CID), or None if not found.
    """
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/{sid}/cids/JSON"
    safe_get_with_retry = retry_api()(safe_requests_get)
    response = safe_get_with_retry(url)

    if response is None:
        return None

    if response.status_code != 200:
        return None
    try:
        cid = response.json()['InformationList']['Information'][0]['CID'][0]
    except (KeyError, IndexError):
        cid = None  # Response did not contain a CID
    return cid
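
The defensive lookup into the PubChem JSON payload can be exercised offline. A sketch, with `parse_cid` as a hypothetical helper and the payload a minimal mock of the `cids/JSON` response:

```python
def parse_cid(payload):
    """Pull the first CID out of a PubChem 'cids/JSON' payload, guarding
    against missing keys the same way convert_sid_to_cid does."""
    try:
        return payload['InformationList']['Information'][0]['CID'][0]
    except (KeyError, IndexError):
        return None

payload = {'InformationList': {'Information': [{'CID': [5793]}]}}
```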

create_catapro_input_file(kcat_df)

Generate CataPro input file and a mapping of substrate KEGG IDs to SMILES.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| kcat_df | DataFrame | Input DataFrame containing kcat information. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| catapro_input_df | DataFrame | DataFrame for CataPro input. |
| substrates_to_smiles_df | DataFrame | Mapping between KEGG IDs and SMILES strings. |
| report_statistics | dict | Counters summarizing coverage and skipped entries. |

Source code in wildkcat/machine_learning/catapro.py
def create_catapro_input_file(kcat_df):
    """
    Generate CataPro input file and a mapping of substrate KEGG IDs to SMILES.

    Parameters: 
        kcat_df (pd.DataFrame): Input DataFrame containing kcat information.

    Returns:
        catapro_input_df (pd.DataFrame): DataFrame for CataPro input.
        substrates_to_smiles_df (pd.DataFrame): Mapping between KEGG IDs and SMILES strings.
        report_statistics (dict): Counters summarizing coverage and skipped entries.
    """
    catapro_input = []
    substrates_to_smiles = {}

    counter_no_catalytic, counter_kegg_no_matching, counter_rxn_covered, counter_cofactor = 0, 0, 0, 0
    for _, row in tqdm(kcat_df.iterrows(), total=len(kcat_df), desc="Generating CataPro input"):
        uniprot = row['uniprot']
        ec_code = row['ec_code']

        if len(uniprot.split(';')) > 1:       
            catalytic_enzyme = identify_catalytic_enzyme(uniprot, ec_code)
            if catalytic_enzyme is None or (";" in str(catalytic_enzyme)):
                counter_no_catalytic += 1
                continue
            else: 
                uniprot = catalytic_enzyme

        # If the number of KEGG compound IDs does not match the number of names
        if len([s for s in row['substrates_kegg'].split(';') if s]) != len(row['substrates_name'].split(';')):
            logging.warning(f"Number of KEGG compound IDs does not match number of names for {ec_code}: {uniprot}.")
            counter_kegg_no_matching += 1
            # continue

        sequence = convert_uniprot_to_sequence(uniprot) 
        if sequence is None:
            continue

        smiles_list = []
        names = row['substrates_name'].split(';')
        kegg_ids = row['substrates_kegg'].split(';')

        # Get the cofactor for the EC code
        cofactor = get_cofactor(ec_code) 

        for name, kegg_compound_id in zip(names, kegg_ids):
            if kegg_compound_id == '':
                continue
            if name.lower() in [c.lower() for c in cofactor]:  # TODO: Should we add a warning if no cofactor is found for a reaction? 
                counter_cofactor += 1
                continue
            smiles = convert_kegg_to_smiles(kegg_compound_id)
            if smiles is not None:
                smiles_str = smiles[0]  # TODO: If multiple SMILES, take the first one ? 
                smiles_list.append(smiles_str)
                substrates_to_smiles[kegg_compound_id] = smiles_str

        if len(smiles_list) > 0:
            for smiles in smiles_list:
                catapro_input.append({
                    "Enzyme_id": uniprot,
                    "type": "wild",
                    "sequence": sequence,
                    "smiles": smiles
                })

        counter_rxn_covered += 1

    # Generate CataPro input file
    catapro_input_df = pd.DataFrame(catapro_input)
    # Remove duplicates
    before_duplicates_filter = len(catapro_input_df)
    catapro_input_df = catapro_input_df.drop_duplicates().reset_index(drop=True)
    nb_lines_dropped = before_duplicates_filter - len(catapro_input_df)
    # Remove 'nan' values
    catapro_input_df = catapro_input_df.dropna(subset=['sequence', 'smiles'])
    catapro_input_df = catapro_input_df[(catapro_input_df['sequence'].str.strip() != '') & (catapro_input_df['smiles'].str.strip() != '')]

    # Generate reverse mapping from SMILES to KEGG IDs as TSV
    substrates_to_smiles_df = pd.DataFrame(list(substrates_to_smiles.items()), columns=['kegg_id', 'smiles'])

    report_statistics = {
        "rxn_covered": counter_rxn_covered,
        "cofactor_identified": counter_cofactor,
        "no_catalytic": counter_no_catalytic,
        "kegg_no_matching": counter_kegg_no_matching,
        "duplicates_enzyme_substrates": nb_lines_dropped,
    }

    return catapro_input_df, substrates_to_smiles_df, report_statistics
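
The substrate-selection step inside the loop (skip empty KEGG IDs, skip cofactors case-insensitively) can be sketched in isolation; `select_substrates` is a hypothetical helper and the compound IDs are illustrative:

```python
def select_substrates(names, kegg_ids, cofactors):
    """Pair substrate names with KEGG IDs, dropping empty IDs and
    cofactors (case-insensitive), as create_catapro_input_file does."""
    cofactors_lower = {c.lower() for c in cofactors}
    kept = []
    for name, kegg_id in zip(names.split(';'), kegg_ids.split(';')):
        if not kegg_id:
            continue  # No KEGG ID for this substrate name
        if name.lower() in cofactors_lower:
            continue  # Cofactors are excluded from CataPro input
        kept.append(kegg_id)
    return kept

# ATP is listed as a cofactor, so only glucose's KEGG ID survives.
kept = select_substrates("D-Glucose;ATP", "C00031;C00002", ["ATP"])
```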

integrate_catapro_predictions(kcat_df, substrates_to_smiles, catapro_predictions_df)

Integrates CataPro predictions into a kcat file. If multiple values are provided for a single combination of EC number, enzyme, and substrate, the minimum value is taken.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| kcat_df | DataFrame | Input DataFrame containing kcat information. | required |
| substrates_to_smiles | DataFrame | DataFrame mapping KEGG IDs to SMILES strings. | required |
| catapro_predictions_df | DataFrame | DataFrame containing CataPro model predictions. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The input kcat_df with an additional column 'catapro_predicted_kcat_s' containing the integrated CataPro predicted kcat (s^-1) values. |

Source code in wildkcat/machine_learning/catapro.py
def integrate_catapro_predictions(kcat_df, substrates_to_smiles, catapro_predictions_df) -> pd.DataFrame:
    """
    Integrates CataPro predictions into a kcat file.
    If multiple values are provided for a single combination of EC, Enzyme, Substrate, the minimum value is taken.

    Parameters:
        kcat_df (pd.DataFrame): Input DataFrame containing kcat information.
        substrates_to_smiles (pd.DataFrame): DataFrame mapping KEGG ID <-> SMILES.
        catapro_predictions_df (pd.DataFrame): DataFrame containing Catapro model predictions

    Returns:
        pd.DataFrame: The input kcat_df with an additional column 'catapro_predicted_kcat_s' containing
            the integrated Catapro predicted kcat(s^-1) values.
    """
    # Convert pred_log10[kcat(s^-1)] to kcat(s^-1)
    catapro_predictions_df['kcat_s'] = 10 ** catapro_predictions_df['pred_log10[kcat(s^-1)]']
    catapro_predictions_df['uniprot'] = catapro_predictions_df['fasta_id'].str.replace('_wild', '', regex=False) # Extract UniProt ID

    # Match the SMILES to KEGG IDs using substrates_to_smiles
    # If multiple KEGG IDs are found for a single SMILES, they are concatenated
    smiles_to_kegg = (
        substrates_to_smiles.groupby('smiles')['kegg_id']
        .apply(lambda x: ';'.join(sorted(set(x))))
    )
    catapro_predictions_df['substrates_kegg'] = catapro_predictions_df['smiles'].map(smiles_to_kegg)

    catapro_map = catapro_predictions_df.set_index(['uniprot', 'substrates_kegg'])['kcat_s'].to_dict()

    def get_min_pred_kcat(row):
        uniprot = row['uniprot']
        kegg_ids = str(row['substrates_kegg']).split(';')
        kcat_values = [
            catapro_map.get((uniprot, kegg_id))
            for kegg_id in kegg_ids
            if (uniprot, kegg_id) in catapro_map
        ]
        return min(kcat_values) if kcat_values else None  # If multiple substrates, take the minimum kcat value

    kcat_df['catapro_predicted_kcat_s'] = kcat_df.apply(get_min_pred_kcat, axis=1)
    return kcat_df
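
The log10-to-linear conversion and the minimum-over-substrates rule can be checked on toy frames. The column names follow the function above; the accession and compound IDs are illustrative:

```python
import pandas as pd

# Two predictions for the same enzyme, one per substrate SMILES.
preds = pd.DataFrame({
    'fasta_id': ['P12345_wild', 'P12345_wild'],
    'pred_log10[kcat(s^-1)]': [1.0, 2.0],   # kcat = 10 and 100 s^-1
    'smiles': ['OCC', 'C'],
})
mapping = pd.DataFrame({'kegg_id': ['C00469', 'C01438'],
                        'smiles': ['OCC', 'C']})
kcat_df = pd.DataFrame({'uniprot': ['P12345'],
                        'substrates_kegg': ['C00469;C01438']})

preds['kcat_s'] = 10 ** preds['pred_log10[kcat(s^-1)]']
preds['uniprot'] = preds['fasta_id'].str.replace('_wild', '', regex=False)
smiles_to_kegg = mapping.groupby('smiles')['kegg_id'].apply(
    lambda x: ';'.join(sorted(set(x))))
preds['substrates_kegg'] = preds['smiles'].map(smiles_to_kegg)
catapro_map = preds.set_index(['uniprot', 'substrates_kegg'])['kcat_s'].to_dict()

# Per kcat_df row: look up each substrate's prediction, keep the minimum.
row = kcat_df.iloc[0]
values = [catapro_map[(row['uniprot'], k)]
          for k in row['substrates_kegg'].split(';')
          if (row['uniprot'], k) in catapro_map]
min_kcat = min(values)
```

Taking the minimum over substrates is the conservative choice for a multi-substrate reaction: the slowest substrate bounds the overall turnover.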

Generate reports

wildkcat.utils.generate_reports

report_extraction(model, df, report_statistics, output_folder, shader=False)

Generates a detailed HTML report summarizing kcat extraction results from a metabolic model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Model | The metabolic model object containing reactions, metabolites, and genes. | required |
| df | DataFrame | DataFrame containing data from the run_extraction function. | required |
| report_statistics | dict | Dictionary with statistics about EC code assignment and extraction issues. | required |
| output_folder | str | Path to the output folder where the report will be saved. | required |
| shader | bool | If True, includes a shader canvas background in the report. | False |

Returns:

| Type | Description |
| --- | --- |
| None | The function saves the generated HTML report to 'reports/extract_report.html'. |

Source code in wildkcat/utils/generate_reports.py
def report_extraction(model, df, report_statistics, output_folder, shader=False) -> None:
    """
    Generates a detailed HTML report summarizing kcat extraction results from a metabolic model.

    Parameters: 
        model (cobra.Model): The metabolic model object containing reactions, metabolites, and genes.
        df (pandas.DataFrame): DataFrame containing data from the run_extraction function.
        report_statistics (dict): Dictionary with statistics about EC code assignment and extraction issues.
        output_folder (str): Path to the output folder where the report will be saved.
        shader (bool, optional): If True, includes a shader canvas background in the report. Default is False.

    Returns: 
        None: The function saves the generated HTML report to 'reports/extract_report.html'. 
    """
    # Model statistics
    nb_model_reactions = len(model.reactions)
    nb_model_metabolites = len(model.metabolites)
    nb_model_genes = len(model.genes)
    rxn_with_ec = 0
    unique_ec_codes = []

    for rxn in model.reactions:
        ec_code = rxn.annotation.get('ec-code')
        if ec_code:
            rxn_with_ec += 1
            if isinstance(ec_code, str):
                ec_code = [ec_code.strip()]
            elif isinstance(ec_code, list):
                ec_code = [x.strip() for x in ec_code if x.strip()]
            else:
                ec_code = []
            unique_ec_codes.extend(ec_code)

    nb_model_ec_codes = len(set(unique_ec_codes))

    # Kcat statistics
    nb_reactions = df['rxn'].nunique()
    nb_ec_codes = df.loc[df["warning_ec"].fillna("") == "", "ec_code"].nunique()


    # nb_missing_ec = report_statistics.get('nb_missing_ec', np.nan)
    nb_incomplete_ec = report_statistics.get('nb_incomplete_ec', np.nan)
    nb_transferred_ec = report_statistics.get('nb_transferred_ec', np.nan)
    nb_missing_gpr = report_statistics.get('nb_missing_gpr', np.nan)
    nb_missing_catalytic_enzyme = report_statistics.get('nb_missing_catalytic_enzyme', 0)
    # nb_multiple_catalytic_enzymes = report_statistics.get('nb_multiple_catalytic_enzymes', np.nan)
    nb_of_lines_dropped_no_ec_no_enzyme = report_statistics.get('nb_of_lines_dropped_no_ec_no_enzyme', np.nan)
    nb_of_reactions_dropped_no_ec_no_enzyme = report_statistics.get('nb_of_reactions_dropped_no_ec_no_enzyme', np.nan)

    rxn_coverage = 100.0 * nb_reactions / nb_model_reactions if nb_model_reactions else 0
    # percent_ec_retrieved = 100.0 * nb_ec_codes / nb_model_ec_codes if nb_model_ec_codes else 0

    rxn_ec_coverage = 100.0 * rxn_with_ec / nb_model_reactions if nb_model_reactions else 0

    # Pie Chart
    pie_data = {
        "Retrieved": nb_ec_codes,
        "Transferred": nb_transferred_ec, 
        "Incomplete": nb_incomplete_ec,
    }

    pie_data = {k: v for k, v in pie_data.items() if v > 0}

    fig = px.pie(
        names=list(pie_data.keys()),
        values=list(pie_data.values()),
        color_discrete_sequence=["#55bb55", "#ee9944", "#cc4455"]
    )
    fig.update_traces(textinfo="percent+label", textfont_size=16)
    fig.update_layout(
        title="",
        title_font=dict(size=30, color="black"),
        showlegend=True
    )

    pie_chart_html = fig.to_html(full_html=False, include_plotlyjs="cdn")

    # Time
    generated_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Html report
    html = f"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Extract kcat Report</title>
        {report_style()}
    </head>
    <body>
        <header>
            <canvas id="shader-canvas"></canvas>
            <div class="overlay">
                <h1>Extract k<sub>cat</sub> Report</h1>
                <p>Generated on {generated_time}</p>
            </div>
        </header>

        <div class="container">
            <!-- Model Overview -->
            <div class="card">
                <h2>Model Overview</h2>
                <div class="stats-grid">
                    <div class="stat-box">
                        <h3>{model.id}</h3>
                        <p>Model ID</p>
                    </div>
                    <div class="stat-box">
                        <h3>{nb_model_reactions}</h3>
                        <p>Reactions</p>
                    </div>
                    <div class="stat-box">
                        <h3>{nb_model_metabolites}</h3>
                        <p>Metabolites</p>
                    </div>
                    <div class="stat-box">
                        <h3>{nb_model_genes}</h3>
                        <p>Genes</p>
                    </div>
                </div>
            </div>

            <!-- kcat Extraction Table -->
            <div class="card">
                <h2>k<sub>cat</sub> Extraction Statistics</h2>
                <table>
                    <tr>
                        <th>Metric</th>
                        <th>Value</th>
                        <th>Visualization</th>
                    </tr>
                    <tr>
                        <td>Reaction with k<sub>cat</sub> information</td>
                        <td>{nb_reactions} ({rxn_coverage:.1f}%)</td>
                        <td>
                            <div class="progress">
                                <div class="progress-bar-table" style="width:{rxn_coverage}%;"></div>
                            </div>
                        </td>
                    </tr>
                    <tr>
                        <td>Reactions with EC information</td>
                        <td>{rxn_with_ec} ({rxn_ec_coverage:.1f}%)</td>
                        <td>
                            <div class="progress">
                                <div class="progress-bar-table" style="width:{rxn_ec_coverage}%;"></div>
                            </div>
                        </td>
                    </tr>
                    <tr>
                        <td>Total k<sub>cat</sub> in output</td>
                        <td>{len(df)}</td>
                        <td>-</td>
                    </tr>
                </table>
            </div>

            <!-- EC Issues Table -->
            <div class="card">
                <h2>Quality Control</h2>

                <p style="text-align: justify">
                    During the extraction process, several issues may arise that can impact the 
                    quality of the retrieved k<sub>cat</sub> data.
                    <br>
                    When an EC number is incomplete or has been transferred, WILDkCAT performs the 
                    retrieval based only on the available enzyme information. Such cases are indicated as 
                    'incomplete' or 'transferred' in the 'warning_ec' column. 
                    These situations reduce the likelihood of finding alternative k<sub>cat</sub> values.
                    <br>
                    Additionally, WILDkCAT attempts to identify the catalytic enzyme associated with 
                    each reaction. If no Gene–Protein–Reaction rule is available, or if the catalytic 
                    enzyme cannot be found via the UniProt API, the entry is labeled as 
                    'none' in the 'warning_enz' column.
                    <br>
                    If both the EC number and the catalytic enzyme information are missing for a given 
                    reaction, the corresponding row is removed due to insufficient information to assign 
                    a k<sub>cat</sub> value.
                </p>

                <table>
                    <tr>
                        <th>Cases</th>
                        <th>Count</th>
                    </tr>
                    <tr>
                        <td>Transferred EC codes</td>
                        <td>{nb_transferred_ec}</td>
                    </tr>
                    <tr>
                        <td>Incomplete EC codes</td>
                        <td>{nb_incomplete_ec}</td>
                    </tr>
                    <tr>
                        <td>Number of reactions without catalytic enzyme</td>
                        <td>{nb_missing_gpr + nb_missing_catalytic_enzyme}</td>
                    </tr>
                    <tr>
                        <td>Number of reactions dropped due to inconsistent or absent EC codes and enzymes</td>
                        <td>{nb_of_reactions_dropped_no_ec_no_enzyme}</td>
                    </tr>
                    <tr>
                        <td>Number of k<sub>cat</sub> values dropped due to inconsistent or absent EC codes and enzymes</td>
                        <td>{nb_of_lines_dropped_no_ec_no_enzyme}</td>
                    </tr>
                </table>
            </div>

            <!-- Pie Chart Section -->
            <div class="card">
                <h2>EC Distribution</h2>
                {pie_chart_html}
            </div>
        </div>

        <footer>WILDkCAT</footer>
    """
    if shader:
        html += report_shader()
    else: 
        html += report_simple()
    html += """
    </body>
    </html>
    """

    # Save report
    os.makedirs(os.path.join(output_folder, "reports"), exist_ok=True)
    report_path = os.path.join(output_folder, "reports/extract_report.html")
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(html)
    logging.info(f"HTML report saved to '{report_path}'")
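
The 'ec-code' annotation handling above (a cobra annotation may be a single string or a list) can be sketched as a standalone helper; `normalize_ec_annotation` is hypothetical but mirrors the loop body:

```python
def normalize_ec_annotation(ec_code):
    """Normalize a cobra 'ec-code' annotation, which may be a string or a
    list, into a clean list of EC strings (same logic as report_extraction)."""
    if isinstance(ec_code, str):
        return [ec_code.strip()]
    if isinstance(ec_code, list):
        return [x.strip() for x in ec_code if x.strip()]
    return []

codes = normalize_ec_annotation(['1.1.1.1 ', '', '2.7.1.1'])
```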

report_final(model, final_df, output_folder, shader=False)

Generate a full HTML report summarizing retrieval results, including kcat distributions and coverage.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Model | The metabolic model object containing reactions, metabolites, and genes. | required |
| final_df | DataFrame | DataFrame containing the final kcat assignments from the run_prediction_part2 function. | required |
| output_folder | str | Path to the output folder where the report will be saved. | required |
| shader | bool | If True, includes a shader canvas background in the report. | False |

Returns:

| Type | Description |
| --- | --- |
| None | The function saves the generated HTML report to 'reports/general_report.html'. |

Source code in wildkcat/utils/generate_reports.py
def report_final(model, final_df, output_folder, shader=False) -> None:
    """
    Generate a full HTML report summarizing the final kcat results, including kcat distributions and coverage.

    Parameters:
        model (cobra.Model): The metabolic model object containing reactions, metabolites, and genes.
        final_df (pd.DataFrame): DataFrame containing the final kcat assignments from the run_prediction_part2 function.
        output_folder (str): Path to the output folder where the report will be saved.
        shader (bool, optional): If True, includes a shader canvas background in the report. Default is False.

    Returns:
        None: The function saves the generated HTML report to 'reports/general_report.html'.
    """
    # Model information 
    nb_model_reactions = len(model.reactions)
    nb_model_metabolites = len(model.metabolites)
    nb_model_genes = len(model.genes)

    df = final_df.copy()
    df["db"] = df["db"].fillna("Unknown")
    generated_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Utility to convert matplotlib figures to base64 <img>
    def fig_to_base64(fig):
        buf = BytesIO()
        fig.savefig(buf, format="png", bbox_inches="tight")
        buf.seek(0)
        encoded = base64.b64encode(buf.read()).decode("utf-8")
        plt.close(fig)
        return f'<div class="plot-container"><img src="data:image/png;base64,{encoded}"></div>'

    # Distribution plots
    def plot_kcat_distribution_stacked(column_name, title, source):
        # Ensure numeric kcat
        df[column_name] = pd.to_numeric(df[column_name], errors='coerce')

        # Drop NaNs for both columns
        valid_df = df.dropna(subset=[column_name, source])
        kcat_values = valid_df[column_name]

        total = len(df)
        matched = len(kcat_values)
        match_percent = matched / total * 100 if total else 0

        if not kcat_values.empty:
            # Define log bins
            min_exp = int(np.floor(np.log10(max(1e-6, kcat_values.min()))))
            max_exp = int(np.ceil(np.log10(kcat_values.max())))
            bins = np.logspace(min_exp, max_exp, num=40)

            # Prepare data for stacked histogram
            sources = valid_df[source].unique()
            grouped_values = [valid_df.loc[valid_df[source] == src, column_name] for src in sources]

            # Fixed color mapping
            color_map = {
                "brenda": "#55bb55",   
                "sabio_rk": "#2277cc", 
                "catapro": "#eedd00",  
                "Unknown": "#dddddd" 
            }

            label_map = {
                "brenda": "Brenda",
                "sabio_rk": "Sabio-RK",
                "catapro": "CataPro",
                "Unknown": "Unknown"
            }

            colors = [color_map.get(src, "#999999") for src in sources]

            # Plot
            fig, ax = plt.subplots(figsize=(12, 6))
            ax.hist(grouped_values, bins=bins, stacked=True,
                    color=colors, label=[label_map[s] for s in sources],
                    edgecolor="white", linewidth=0.7)

            ax.set_xscale("log")
            ax.set_xlim([10**min_exp / 1.5, 10**max_exp * 1.5])
            ax.xaxis.set_major_formatter(LogFormatter(10))
            ax.yaxis.set_major_locator(MaxNLocator(integer=True))

            ax.set_xlabel("kcat (s⁻¹)", fontsize=12)
            ax.set_ylabel("Count", fontsize=12)
            ax.set_title(f"{title} (n={matched}, {match_percent:.1f}%)", fontsize=13)

            ax.legend(
                title="Source", 
                fontsize=10, 
                title_fontsize=11,
                loc='center left', 
                bbox_to_anchor=(1, 0.5),
                frameon=False
            )

            # Style
            ax.spines['top'].set_visible(False)
            ax.spines['right'].set_visible(False)
            ax.spines['left'].set_color('#444444')
            ax.spines['bottom'].set_color('#444444')

            ax.grid(True, which='major', axis='y', linestyle='--', linewidth=0.6, alpha=0.4)
            ax.grid(False, which='major', axis='x') 

            plt.tight_layout(rect=[0, 0, 0.85, 1])

            return fig_to_base64(fig)

        return "<p>No valid values available for plotting.</p>"

    img_final = plot_kcat_distribution_stacked(
        'kcat', rf"{model.id} - $k_{{\mathrm{{cat}}}}$ Distribution", "db"
    )

    db_counts = df["db"].fillna("Unknown").value_counts()
    total_db = db_counts.sum()

    # Colors
    colors = {
        "brenda": "#55bb55",
        "sabio_rk": "#2277cc",
        "catapro": "#eedd00",
        "Unknown": "#ddd"
    }

    # Fixed display order
    ordered_dbs = ["brenda", "sabio_rk", "catapro", "Unknown"]

    progress_segments = ""
    legend_items = ""

    for db in ordered_dbs:
        count = db_counts.get(db, 0)
        if total_db > 0:
            percent = count / total_db * 100
        else:
            percent = 0

        color = colors.get(db, "#ddd")

        progress_segments += f"""
            <div class="progress-segment" style="width:{percent:.1f}%; background-color:{color};"
                title="{db.capitalize()}: {percent:.1f}%"></div>
        """

        legend_items += f"""
            <span style="display:flex; align-items:center; margin-right:15px; margin-bottom:5px;">
                <span style="display:flex; align-items:center; width:16px; height:16px; 
                            background:{color}; border:1px solid #000; margin-right:5px;"></span>
                {db.capitalize()} ({percent:.1f}%)
            </span>
        """

    progress_bar = f"""
        <div class="progress-multi" style="height: 18px; margin-bottom:18px; display:flex;">
            {progress_segments}
        </div>
        <div style="margin-top:10px; display:flex; justify-content:center; flex-wrap: wrap;">
            {legend_items}
        </div>
    """

    # Statistics 
    grouped = df.groupby("rxn")
    rxns_with_kcat = grouped["kcat"].apply(lambda x: x.notna().any())
    nb_reactions = df['rxn'].nunique()
    nb_rxn_with_kcat = rxns_with_kcat.sum()
    coverage = nb_rxn_with_kcat / nb_reactions
    coverage_total = nb_rxn_with_kcat / nb_model_reactions

    kcat_values = df["kcat"].dropna()
    total = len(df)
    matched = len(kcat_values)
    match_percent = matched / total

    # HTML
    html = f"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>WILDkCAT Report</title>
        {report_style()}
    </head>
    <body>
        <header>
            <canvas id="shader-canvas"></canvas>
            <div class="overlay">
                <h1>WILDkCAT Report</h1>
                <p>Generated on {generated_time}</p>
            </div>
        </header>

        <div class="container">
            <div class="card">
                <h2>Introduction</h2>
                <p style="text-align: justify;">
                    This report provides a summary of the performance of k<sub>cat</sub> value extraction, retrieval, and prediction for the specified metabolic model. 
                    It presents statistics on k<sub>cat</sub> values successfully retrieved, whether experimental or predicted.
                </p>
            </div>

            <div class="card">
                <h2>Model Overview</h2>
                <div class="stats-grid">
                    <div class="stat-box">
                        <h3>{model.id}</h3>
                        <p>Model ID</p>
                    </div>
                    <div class="stat-box">
                        <h3>{nb_model_reactions}</h3>
                        <p>Reactions</p>
                    </div>
                    <div class="stat-box">
                        <h3>{nb_model_metabolites}</h3>
                        <p>Metabolites</p>
                    </div>
                    <div class="stat-box">
                        <h3>{nb_model_genes}</h3>
                        <p>Genes</p>
                    </div>
                </div>
            </div>

            <div class="card" style="padding:20px; margin-bottom:20px;">
                <h2 style="margin-bottom:10px;">Coverage</h2>

                <!-- Explanation -->
                <p style="text-align: justify;">
                    The coverage section reports the number of k<sub>cat</sub> values retrieved for the model and the number of reactions that have at least one 
                    associated k<sub>cat</sub> value. This provides a measure of how extensively the model’s reactions are 
                    annotated with kinetic data.
                </p>
                <p style="text-align: justify;">
                    Higher coverage indicates that a larger fraction of reactions are constrained by k<sub>cat</sub> values, 
                    improving the accuracy and reliability of enzyme-constrained simulations.
                </p>

                <!-- Global coverage progress bar -->
                {progress_bar}        

                <!-- Detailed stats -->
                <table class="table" style="width:100%; border-spacing:0; border-collapse: collapse;">
                    <tbody>
                    <tr>
                            <td style="padding:8px 12px;">Eligible reactions with at least one kcat value</td>
                            <td style="padding:8px 12px;">{nb_rxn_with_kcat} ({coverage:.1%})</td>
                            <td style="width:40%;">
                                <div class="progress" style="height:18px;">
                                    <div class="progress-bar-table" 
                                        style="width:{coverage:.1%}; background-color:#4caf50;">
                                    </div>
                                </div>
                            </td>
                        </tr>
                        <tr>
                            <td style="padding:8px 12px;">Model-wide reactions with at least one kcat value</td>
                            <td style="padding:8px 12px;">{nb_rxn_with_kcat} ({coverage_total:.1%})</td>
                            <td style="width:40%;">
                                <div class="progress" style="height:18px;">
                                    <div class="progress-bar-table" 
                                        style="width:{coverage_total:.1%}; background-color:#4caf50;">
                                    </div>
                                </div>
                            </td>
                        </tr>
                        <tr>
                            <td style="padding:8px 12px;">k<sub>cat</sub> values retrieved </td>
                            <td style="padding:8px 12px;">{matched} ({match_percent:.1%})</td>
                            <td style="width:40%;">
                                <div class="progress" style="height:18px;">
                                    <div class="progress-bar-table" 
                                        style="width:{match_percent:.1%}; background-color:#4caf50;">
                                    </div>
                                </div>
                            </td>
                        </tr>
                    </tbody>
                </table>
            </div>

            <div class="card">
                <h2>k<sub>cat</sub> Distribution</h2>
                <div class="img-section">
                    {img_final}
                </div>
            </div>
        </div>

        <footer>WILDkCAT</footer>
    """
    if shader:
        html += report_shader()
    else: 
        html += report_simple()
    html += """
    </body>
    </html>
    """

    os.makedirs(os.path.join(output_folder, "reports"), exist_ok=True)
    report_path = os.path.join(output_folder, "reports/general_report.html")
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(html)

    logging.info(f"HTML report saved to '{report_path}'")
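The coverage statistics computed inside `report_final` boil down to a groupby over reactions: a reaction counts as covered if at least one of its rows carries a kcat value. A minimal standalone sketch of that logic, using a small hypothetical `final_df` excerpt:

```python
import pandas as pd

# Hypothetical final_df excerpt: two rows for rxn R1, one each for R2 and R3.
df = pd.DataFrame({
    "rxn":  ["R1", "R1", "R2", "R3"],
    "kcat": [12.5, None, None, 3.1],
})

# A reaction is "covered" if at least one of its rows has a kcat value.
rxns_with_kcat = df.groupby("rxn")["kcat"].apply(lambda x: x.notna().any())
nb_reactions = df["rxn"].nunique()
coverage = rxns_with_kcat.sum() / nb_reactions  # R1 and R3 covered -> 2/3
```

Here `coverage` corresponds to the "Eligible reactions with at least one kcat value" row of the report; dividing by the total number of model reactions instead gives the model-wide figure.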

report_prediction_input(catapro_df, report_statistics, output_folder, shader=False)

Generate a detailed HTML report summarizing the kcat prediction input statistics.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| catapro_df | DataFrame | DataFrame containing the CataPro input data. | required |
| report_statistics | dict | Dictionary with statistics about the prediction input. | required |
| output_folder | str | Path to the output folder where the report will be saved. | required |
| shader | bool | If True, includes a shader canvas background in the report. Default is False. | False |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | The function saves the generated HTML report to 'reports/predict_report.html'. |

Source code in wildkcat/utils/generate_reports.py
def report_prediction_input(catapro_df, report_statistics, output_folder, shader=False) -> None: 
    """
    Generate a detailed HTML report summarizing the kcat prediction input statistics.

    Parameters:
        catapro_df (pd.DataFrame): DataFrame containing the CataPro input data.
        report_statistics (dict): Dictionary with statistics about the prediction input.
        output_folder (str): Path to the output folder where the report will be saved.
        shader (bool, optional): If True, includes a shader canvas background in the report. Default is False.

    Returns:
        None: The function saves the generated HTML report to 'reports/predict_report.html'.
    """
    # CataPro Statistics 
    total_catapro_entries = len(catapro_df) - 1

    # Report Statistics
    rxn_covered = report_statistics['rxn_covered']
    cofactors_covered = report_statistics['cofactor_identified']
    no_catalytic = report_statistics['no_catalytic']
    kegg_missing = report_statistics['kegg_no_matching']
    duplicates = report_statistics['duplicates_enzyme_substrates']
    missing_enzyme = report_statistics['missing_enzymes']

    total_rxn = rxn_covered + no_catalytic + kegg_missing + missing_enzyme
    rxn_coverage = (rxn_covered / total_rxn * 100) if total_rxn > 0 else 0

    # Time
    generated_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Html report
    html = f"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Predict kcat Report</title>
        {report_style()}
    </head>
    <body>
        <header>
            <canvas id="shader-canvas"></canvas>
            <div class="overlay">
                <h1>Predict k<sub>cat</sub> Report</h1>
                <p>Generated on {generated_time}</p>
            </div>
        </header>

        <div class="container">
            <!-- CataPro Overview -->
            <div class="card">
                <h2>Overview</h2>
                <div class="stats-grid">
                    <div class="stat-box">
                        <h3>{total_rxn}</h3>
                        <p>Total k<sub>cat</sub> values</p>
                    </div>
                    <div class="stat-box">
                        <h3>{rxn_covered}</h3>
                        <p>k<sub>cat</sub> to be predicted ({rxn_coverage:.2f}%)</p>
                    </div>
                </div>
            </div>

            <!-- Prediction kcat Table -->
            <div class="card">
                <h2>k<sub>cat</sub> Prediction Statistics</h2>
                <table>
                    <tr>
                        <th>Metric</th>
                        <th>Value</th>
                    </tr>
                    <tr>
                        <td>Total entries in the CataPro input file</td>
                        <td>{total_catapro_entries}</td>
                    </tr>
                    <tr>
                        <td>Number of cofactors identified</td>
                        <td>{cofactors_covered}</td>
                    </tr>
                </table>
            </div>

            <div class="card">
                <h2>Issues in k<sub>cat</sub> Predictions</h2>
                <table>
                    <tr>
                        <th>Metric</th>
                        <th>Value</th>
                    </tr>
                    <tr>
                        <td>Entries with no catalytic enzyme identified</td>
                        <td>{no_catalytic}</td>
                    </tr>
                    <tr>
                        <td>Entries with missing KEGG IDs</td>
                        <td>{kegg_missing}</td>
                    </tr>
                    <tr>
                        <td>Entries with missing enzyme information</td>
                        <td>{missing_enzyme}</td>
                    </tr>
                </table>
            </div>

            <div class="card">
                <h2>Duplicates</h2>
                <table>
                    <tr>
                        <th>Metric</th>
                        <th>Value</th>
                    </tr>
                    <tr>
                        <td>Number of duplicates</td>
                        <td>{duplicates}</td>
                    </tr>
                </table>
                <p>
                    Duplicates occur when multiple reactions share the same enzyme-substrate combination. 
                    A high number of duplicates may result from multiple enzyme complexes sharing the same catalytic enzyme.
                </p>
            </div>

            <!-- Prediction Instructions -->
            <div class="card">
                <h2>Running k<sub>cat</sub> Predictions with CataPro</h2>
                <p>
                    This report provides the input needed to run the CataPro machine learning model 
                    (<a href="https://github.com/zchwang/CataPro" target="_blank">CataPro repository</a>). 
                    Follow the instructions in the repository to set up the environment and generate k<sub>cat</sub> predictions.
                </p>
            </div>

    <footer>WILDkCAT</footer>
    """
    if shader:
        html += report_shader()
    else: 
        html += report_simple()
    html += """
    </body>
    </html>
    """

    # Save report
    os.makedirs(os.path.join(output_folder, "reports"), exist_ok=True)
    report_path = os.path.join(output_folder, "reports/predict_report.html")
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(html)
    logging.info(f"HTML report saved to '{report_path}'")
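The coverage figure shown in the report's Overview card is a simple ratio over the statistics dictionary: entries to be predicted divided by the total of covered entries plus every failure mode (duplicates are not counted in the total). A minimal sketch, assuming a hand-made `report_statistics` dict with the keys the function reads:

```python
# Hypothetical statistics, using the keys report_prediction_input expects.
report_statistics = {
    "rxn_covered": 80,
    "cofactor_identified": 12,
    "no_catalytic": 5,
    "kegg_no_matching": 10,
    "duplicates_enzyme_substrates": 3,
    "missing_enzymes": 5,
}

# Total entries = covered + every failure mode (duplicates are excluded).
total_rxn = (report_statistics["rxn_covered"]
             + report_statistics["no_catalytic"]
             + report_statistics["kegg_no_matching"]
             + report_statistics["missing_enzymes"])
rxn_coverage = report_statistics["rxn_covered"] / total_rxn * 100 if total_rxn > 0 else 0
```

With these hypothetical numbers, 80 of 100 entries (80%) would be sent to CataPro for prediction.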

report_retrieval(df, output_folder, parameters, shader=False)

Generate a styled HTML report summarizing the kcat matching results, including the kcat value distribution and the penalty score breakdown.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame containing data from the run_retrieval function. | required |
| output_folder | str | Path to the output folder where the report will be saved. | required |
| parameters | dict | Retrieval parameters (organism, pH range, temperature range, database). | required |
| shader | bool | If True, includes a shader canvas background in the report. Default is False. | False |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | The function saves the generated HTML report to 'reports/retrieve_report.html'. |

Source code in wildkcat/utils/generate_reports.py
def report_retrieval(df, output_folder, parameters, shader=False) -> None:
    """
    Generate a styled HTML report summarizing the kcat matching results,
    including the kcat value distribution and the penalty score breakdown.

    Parameters:
        df (pd.DataFrame): DataFrame containing data from the run_retrieval function.
        output_folder (str): Path to the output folder where the report will be saved.
        parameters (dict): Retrieval parameters (organism, pH range, temperature range, database).
        shader (bool, optional): If True, includes a shader canvas background in the report. Default is False.

    Returns:
        None: The function saves the generated HTML report to 'reports/retrieve_report.html'.
    """
    # Ensure numeric kcat values to avoid TypeError on comparisons
    kcat_values = pd.to_numeric(df['kcat'], errors='coerce').dropna()

    # Only use scores present in the data
    present_scores = sorted(df['penalty_score'].dropna().unique())
    score_counts = df['penalty_score'].value_counts().reindex(present_scores, fill_value=0)
    total = len(df)
    matched = len(kcat_values)
    match_percent = matched / total * 100 if total else 0
    score_percent = (score_counts / total * 100).round(2) if total else pd.Series(0, index=present_scores)

    # Gradient colors from green (best score) to red (worst score) # TODO: It could be better to create the scale dynamically based on present scores
    distinct_colors = [
        "#27ae60",
        "#43b76e",
        "#60c07c",
        "#7cc98a",
        "#98d298",
        "#b5dbb6",
        "#d1e4c4",
        "#e8e9b9",
        "#f1e9b6",
        "#f7d97c",
        "#f9c74f",
        "#f8961e",
        "#f3722c",
        "#e67e22",
        "#e74c3c",
        "#d35400",
        "#c0392b",
        "#a93226",
        "#7b241c"
    ]

    def score_color(score):
        idx = present_scores.index(score)
        return distinct_colors[idx % len(distinct_colors)]

    generated_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Parameters
    temp = parameters.get('Temperature')
    formatted_temp = f"{temp[0]} - {temp[1]}"

    ph = parameters.get('pH')
    formatted_ph = f"{ph[0]} - {ph[1]}"

    # Histogram with stacked bars for scores
    kcat_hist_base64 = ""
    if not kcat_values.empty:
        min_exp = int(np.floor(np.log10(max(1e-6, kcat_values.min()))))
        max_exp = int(np.ceil(np.log10(kcat_values.max())))
        bins = np.logspace(min_exp, max_exp, num=40)

        # Drop score groups with no valid kcat values
        hist_data = []
        valid_scores = []
        for score in present_scores:
            vals = pd.to_numeric(df[df['penalty_score'] == score]['kcat'], errors='coerce')
            vals = vals[vals.notna()]
            if not vals.empty:
                hist_data.append(vals)
                valid_scores.append(score)

        fig, ax = plt.subplots(figsize=(12, 6))

        # Stacked histogram by score
        ax.hist(hist_data, bins=bins, stacked=True, 
                color=[score_color(s) for s in valid_scores],
                label=[f"{s}" for s in valid_scores],
                edgecolor='white')

        ax.set_xscale('log')
        ax.set_xlim([10**min_exp / 1.5, 10**max_exp * 1.5])
        ax.xaxis.set_major_formatter(LogFormatter(10))
        ax.yaxis.set_major_locator(MaxNLocator(integer=True))

        ax.set_xlabel("kcat (s⁻¹)", fontsize=12)
        ax.set_ylabel("Count", fontsize=12)

        ax.legend(
            title="Penalty Score", 
            fontsize=10, 
            title_fontsize=11,
            loc='center left', 
            bbox_to_anchor=(1, 0.5),
            frameon=False
        )

        # Style
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_color('#444444')
        ax.spines['bottom'].set_color('#444444')

        ax.grid(True, which='major', axis='y', linestyle='--', linewidth=0.6, alpha=0.4)
        ax.grid(False, which='major', axis='x') 

        plt.tight_layout(rect=[0, 0, 0.85, 1])
        buf = io.BytesIO()
        plt.savefig(buf, format='png', bbox_inches='tight')
        plt.close(fig)
        kcat_hist_base64 = base64.b64encode(buf.getvalue()).decode('utf-8')

    # HTML start
    html = f"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Retrieve kcat Report</title>
        {report_style()}
    </head>
    <body>
        <header>
            <canvas id="shader-canvas"></canvas>
            <div class="overlay">
                <h1>Retrieve k<sub>cat</sub> Report</h1>
                <p>Generated on {generated_time}</p>
            </div>
        </header>

        <div class="container">
            <div class="card">
                <h2>Overview</h2>
                <div class="stats-grid">
                    <div class="stat-box">
                        <h3>{total}</h3>
                        <p>Total Entries</p>
                    </div>
                    <div class="stat-box">
                        <h3>{matched}</h3>
                        <p>Matched k<sub>cat</sub> ({match_percent:.2f}%)</p>
                    </div>
                </div>
            </div>

            <div class="card">
                <div class="card">
                <h2>Parameters</h2>
                <div class="stats-grid">
                    <div class="stat-box">
                        <h3>{parameters.get('Organism')}</h3>
                        <p>Organism name</p>
                    </div>
                    <div class="stat-box">
                        <h3>{formatted_ph}</h3>
                        <p>pH range</p>
                    </div>
                    <div class="stat-box">
                        <h3>{formatted_temp}</h3>
                        <p>Temperature range</p>
                    </div>
                     <div class="stat-box">
                        <h3>{parameters.get('database').capitalize()}</h3>
                        <p>Database(s)</p>
                    </div>
                </div>
            </div>
            </div>

            <div class="card">
                <h2>Penalty Score Distribution</h2>
                <div class="progress-stacked">
    """

    # Add progress bars only for present scores
    for score in present_scores:
        percent = score_percent.get(score, 0)
        if percent > 0:
            html += f'<div class="progress-bar" style="width:{percent}%;background:{score_color(score)};" title="Score {score}: {percent:.2f}%"></div>'

    html += """
            </div>
            <div class="legend">
    """

    # Add legend only for present scores
    for score in present_scores:
        html += f'<div class="legend-item"><div class="legend-color" style="background:{score_color(score)};"></div> Score {score}</div>'

    html += """
            </div>
            <table>
                <tr>
                    <th>Score</th>
                    <th>Count</th>
                    <th>Percent</th>
                </tr>
    """

    # Table rows only for present scores
    for score in present_scores:
        html += f'<tr><td>{score}</td><td>{score_counts[score]}</td><td>{score_percent[score]:.2f}%</td></tr>'

    html += """
            </table>
        </div>
    """

    # Histogram section (stacked by score)
    html += """
        <div class="card">
            <h2>Distribution of k<sub>cat</sub> values (Stacked by Penalty Score)</h2>
            <div class="img-section">
    """
    if kcat_hist_base64:
        html += f'<img src="data:image/png;base64,{kcat_hist_base64}" alt="k<sub>cat</sub> Distribution">'
    html += """
            </div>
        </div>
    """

    # Metadata section
    html += f"""
            <div class="card">
                <h2>Penalty Score</h2>
                <p>
                    The penalty score evaluates how well a candidate k<sub>cat</sub> entry fits the query enzyme and conditions. 
                    A lower score indicates a better match (0 = Best possible, 16 = No match).
                </p>
                <h3>Scoring process:</h3>
                <ul>
                    <li><b>Catalytic enzyme:</b> Check if the reported enzyme matches the expected catalytic enzyme(s).</li>
                    <li><b>Organism:</b> Penalize mismatches between the source organism and the target organism.</li>
                    <li><b>Enzyme variant:</b> Exclude or penalize mutant/engineered variants (wildtype preferred).</li>
                    <li><b>pH:</b> Check whether the reported pH is consistent with the desired experimental range.</li>
                    <li><b>Substrate:</b> Verify substrate compatibility with the catalytic reaction.</li>
                    <li><b>Temperature:</b> Penalize deviations from the target temperature; 
                        if possible, adjust kcat values using the Arrhenius equation.</li>
                </ul>

                <h3>Score breakdown:</h3>
                <table border="1" cellpadding="6" cellspacing="0" style="border-collapse: collapse; text-align: left;">
                    <tr>
                        <th>Criterion</th>
                        <th>Penalty</th>
                    </tr>
                    <tr>
                        <td>Substrate mismatch</td>
                        <td>+4</td>
                    </tr>
                    <tr>
                        <td>Catalytic enzyme mismatch</td>
                        <td>+3</td>
                    </tr>
                    <tr>
                        <td>Organism mismatch</td>
                        <td>+2</td>
                    </tr>
                    <tr>
                        <td>pH unknown</td>
                        <td>+1</td>
                    </tr>
                    <tr>
                        <td>pH out of range</td>
                        <td>+2</td>
                    </tr>
                    <tr>
                        <td>Temperature unknown</td>
                        <td>+1</td>
                    </tr>
                    <tr>
                        <td>Temperature out of range</td>
                        <td>+2</td>
                    </tr>
                    <tr>
                        <td>Enzyme variant unknown</td>
                        <td>+1</td>
                    </tr>
                </table>

                <p>
                    Candidates are then ranked by:
                    <ol>
                        <li>Lowest penalty score</li>
                        <li>Highest sequence identity percentage to the target enzyme</li>
                        <li>Closest organism to the target organism</li>
                        <li>Adjusted k<sub>cat</sub> value (favoring the highest value by default)</li>
                    </ol>

                <i>Please check the <a href="https://h-escoffier.github.io/WILDkCAT/explanation/explanation/#2-retrieve-experimental-kcat-values-from-brenda-andor-sabio-rk" target="_blank" rel="noopener noreferrer">documentation</a> for more details on the scoring system and the retrieval process.</i>
                </p>
            </div>

            <div class="card">
                <h2>Notes</h2>
                <p style="text-align: justify">
                    Please note that the number of rows may differ between the extraction and retrieval stages: 
                    when a single reaction–enzyme combination is associated with multiple EC numbers, WILDkCAT 
                    automatically merges these rows after retrieval, keeping only the best entry according to the 
                    criteria described above. The EC number of the selected kcat is stored in the 'ec_code' column, 
                    while all EC numbers associated with the reaction are stored in the 'ec_codes' column.
                </p>
            </div>
        </div>

        <footer>WILDkCAT</footer>
    """
    if shader: 
        html += report_shader()
    else: 
        html += report_simple()
    html += """
    </body>
    </html>
    """

    # Save HTML
    os.makedirs(os.path.join(output_folder, "reports"), exist_ok=True)
    report_path = os.path.join(output_folder, "reports/retrieve_report.html")
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(html)

    logging.info(f"HTML report saved to '{report_path}'")
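The penalty table and ranking criteria above can be sketched in Python. This is an illustrative sketch only: the dictionary fields, pH/temperature ranges, and the precomputed `tax_distance` value are assumptions for demonstration, not WILDkCAT's actual API.

```python
def penalty_score(candidate, target):
    """Sum the penalties from the score-breakdown table (hypothetical fields)."""
    score = 0
    if candidate["substrate"] != target["substrate"]:
        score += 4  # substrate mismatch
    if candidate["enzyme"] != target["enzyme"]:
        score += 3  # catalytic enzyme mismatch
    if candidate["organism"] != target["organism"]:
        score += 2  # organism mismatch
    if candidate["ph"] is None:
        score += 1  # pH unknown
    elif not target["ph_min"] <= candidate["ph"] <= target["ph_max"]:
        score += 2  # pH out of range
    if candidate["temperature"] is None:
        score += 1  # temperature unknown
    elif not target["temp_min"] <= candidate["temperature"] <= target["temp_max"]:
        score += 2  # temperature out of range
    if candidate.get("variant") is None:
        score += 1  # enzyme variant unknown
    return score


def rank_candidates(candidates, target):
    """Rank by lowest penalty, then highest sequence identity, then closest
    organism (here an assumed precomputed taxonomic distance), then highest kcat."""
    return sorted(
        candidates,
        key=lambda c: (penalty_score(c, target),
                       -c["seq_identity"],
                       c["tax_distance"],
                       -c["kcat"]),
    )
```

A candidate matching the query on every criterion scores 0 and sorts ahead of any mismatched entry regardless of its k<sub>cat</sub>.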

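The post-retrieval EC-number merge described in the Notes card can be sketched as follows. The row layout and the `penalty_score` field are illustrative assumptions, not WILDkCAT's actual data model; only the `ec_code`/`ec_codes` column names come from the report text.

```python
def merge_ec_duplicates(rows):
    """Collapse rows sharing a reaction-enzyme pair, keeping the
    lowest-penalty entry and recording every candidate EC number."""
    groups = {}
    for row in rows:
        groups.setdefault((row["reaction"], row["enzyme"]), []).append(row)
    merged = []
    for group in groups.values():
        best = min(group, key=lambda r: r["penalty_score"]).copy()
        # 'ec_code' keeps the EC number of the winning entry;
        # 'ec_codes' lists all EC numbers seen for the combination.
        best["ec_codes"] = ";".join(sorted({r["ec_code"] for r in group}))
        merged.append(best)
    return merged
```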
report_shader()

Return HTML and GLSL shader code for report background. Adapted from localthunk (https://localthunk.com)

Source code in wildkcat/utils/generate_reports.py
def report_shader(): 
    """Return HTML and GLSL shader code for report background. Adapted from localthunk (https://localthunk.com)"""
    return """
    <!-- Background adapted from original work by localthunk (https://localthunk.com) -->
    <script id="fragShader" type="x-shader/x-fragment">
    precision highp float;
    uniform vec2 iResolution;
    uniform float iTime;
    #define SPIN_ROTATION -1.0
    #define SPIN_SPEED 3.5
    #define OFFSET vec2(0.0)
    #define COLOUR_1 vec4(0.2, 0.4, 0.7, 1.0)
    #define COLOUR_2 vec4(0.6, 0.75, 0.9, 1.0)
    #define COLOUR_3 vec4(0.2, 0.2, 0.25, 1.0)
    #define CONTRAST 3.5
    #define LIGHTING 0.4
    #define SPIN_AMOUNT 0.25
    #define PIXEL_FILTER 745.0
    #define SPIN_EASE 1.0
    #define PI 3.14159265359
    #define IS_ROTATE false
    vec4 effect(vec2 screenSize, vec2 screen_coords) {
        float pixel_size = length(screenSize.xy) / PIXEL_FILTER;
        vec2 uv = (floor(screen_coords.xy*(1./pixel_size))*pixel_size - 0.5*screenSize.xy)/length(screenSize.xy) - OFFSET;
        float uv_len = length(uv);
        float speed = (SPIN_ROTATION*SPIN_EASE*0.2);
        if(IS_ROTATE) {
            speed = iTime * speed;
        }
        speed += 302.2;
        float new_pixel_angle = atan(uv.y, uv.x) + speed - SPIN_EASE*20.*(1.*SPIN_AMOUNT*uv_len + (1. - 1.*SPIN_AMOUNT));
        vec2 mid = (screenSize.xy/length(screenSize.xy))/2.;
        uv = (vec2((uv_len * cos(new_pixel_angle) + mid.x), (uv_len * sin(new_pixel_angle) + mid.y)) - mid);
        uv *= 30.;
        speed = iTime*(SPIN_SPEED);
        vec2 uv2 = vec2(uv.x+uv.y);
        for(int i=0; i < 5; i++) {
            uv2 += sin(max(uv.x, uv.y)) + uv;
            uv  += 0.5*vec2(cos(5.1123314 + 0.353*uv2.y + speed*0.131121),sin(uv2.x - 0.113*speed));
            uv  -= 1.0*cos(uv.x + uv.y) - 1.0*sin(uv.x*0.711 - uv.y);
        }
        float contrast_mod = (0.25*CONTRAST + 0.5*SPIN_AMOUNT + 1.2);
        float paint_res = min(2., max(0.,length(uv)*(0.035)*contrast_mod));
        float c1p = max(0.,1. - contrast_mod*abs(1.-paint_res));
        float c2p = max(0.,1. - contrast_mod*abs(paint_res));
        float c3p = 1. - min(1., c1p + c2p);
        float light = (LIGHTING - 0.2)*max(c1p*5. - 4., 0.) + LIGHTING*max(c2p*5. - 4., 0.);
        return (0.3/CONTRAST)*COLOUR_1 + (1. - 0.3/CONTRAST)*(COLOUR_1*c1p + COLOUR_2*c2p + vec4(c3p*COLOUR_3.rgb, c3p*COLOUR_1.a)) + light;
    }
    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        vec2 uv = fragCoord/iResolution.xy;
        fragColor = effect(iResolution.xy, uv * iResolution.xy);
    }
    void main() { mainImage(gl_FragColor, gl_FragCoord.xy); }
    </script>
    <script>
    const canvas = document.getElementById("shader-canvas");
    const gl = canvas.getContext("webgl");
    function resize() {
        canvas.width = canvas.clientWidth * window.devicePixelRatio;
        canvas.height = canvas.clientHeight * window.devicePixelRatio;
        gl.viewport(0, 0, canvas.width, canvas.height);
    }
    window.addEventListener("resize", resize);
    resize();
    const vertexSrc = `
    attribute vec2 position;
    void main() {
        gl_Position = vec4(position, 0.0, 1.0);
    }
    `;
    const fragSrc = document.getElementById("fragShader").text;
    function compileShader(src, type) {
        const shader = gl.createShader(type);
        gl.shaderSource(shader, src);
        gl.compileShader(shader);
        if (!gl.getShaderParameter(shader, gl.COMPILE_STATUS)) {
            console.error(gl.getShaderInfoLog(shader));
        }
        return shader;
    }
    const vertexShader = compileShader(vertexSrc, gl.VERTEX_SHADER);
    const fragmentShader = compileShader(fragSrc, gl.FRAGMENT_SHADER);
    const program = gl.createProgram();
    gl.attachShader(program, vertexShader);
    gl.attachShader(program, fragmentShader);
    gl.linkProgram(program);
    gl.useProgram(program);
    const positionBuffer = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, positionBuffer);
    gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([
    -1, -1, 1, -1, -1, 1,
    -1, 1, 1, -1, 1, 1
    ]), gl.STATIC_DRAW);
    const positionLoc = gl.getAttribLocation(program, "position");
    gl.enableVertexAttribArray(positionLoc);
    gl.vertexAttribPointer(positionLoc, 2, gl.FLOAT, false, 0, 0);
    const iResolutionLoc = gl.getUniformLocation(program, "iResolution");
    const iTimeLoc = gl.getUniformLocation(program, "iTime");
    function render(time) {
        resize();
        gl.uniform2f(iResolutionLoc, canvas.width, canvas.height);
        gl.uniform1f(iTimeLoc, time * 0.001);
        gl.drawArrays(gl.TRIANGLES, 0, 6);
        requestAnimationFrame(render);
    }
    requestAnimationFrame(render);
    </script>
    """

report_simple()

Return HTML code for report background.

Source code in wildkcat/utils/generate_reports.py
def report_simple():
    """Return HTML code for report background."""
    return """
    <style>
        header {
            background-color: #2980b9; /* simple blue background */
            margin: 0;
            padding: 0;
        }
    </style>
    """

report_style()

Return CSS script for report style.

Source code in wildkcat/utils/generate_reports.py
def report_style():
    """Return CSS script for report style."""
    return """
    <style>
        body {
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
            font-size: 1rem;
            line-height: 1.7;
            background-color: #f4f6f9;
            margin: 0;
            padding: 0;
            color: #333;
        }
        header {
            position: relative;
            width: 100%;
            height: 150px;
            overflow: hidden;
            display: flex;
            align-items: center;
            justify-content: center;
            color: #fff;
            text-align: center;
        }
        header canvas {
            position: absolute;
            top: 0; left: 0;
            width: 100%;
            height: 100%;
            z-index: 0;
        }
        header::before {
            content: "";
            position: absolute;
            top: 0; left: 0; right: 0; bottom: 0;
            background: linear-gradient(
                rgba(0,0,0,0.5),
                rgba(0,0,0,0.3)
            );
            z-index: 1;
        }
        header .overlay {
            position: relative;
            z-index: 2;
            padding: 10px 20px;
            border-radius: 8px;
        }
        header h1 {
            margin: 0;
            font-size: 2.5rem;
            font-weight: bold;
            text-shadow: 0 2px 6px rgba(0,0,0,0.6);
        }
        header p {
            margin: 8px 0 0;
            font-size: 1.1rem;
            text-shadow: 0 1px 4px rgba(0,0,0,0.6);
        }
        p {
            margin-bottom: 1.2rem;
        }
        .container {
            max-width: 1100px;
            margin: 30px auto;
            padding: 20px;
        }
        .card {
            background: #fff;
            border-radius: 12px;
            padding: 20px;
            margin-bottom: 20px;
            box-shadow: 0 2px 8px rgba(0,0,0,0.05);
        }
        .card h2 {
            margin-top: 0;
            color: #2980b9;
            border-bottom: 2px solid #e6e6e6;
            padding-bottom: 10px;
            font-size: 1.5rem;
        }
        .stats-grid {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
            gap: 15px;
            margin-top: 15px;
        }
        .stat-box {
            background: #f9fafc;
            border-radius: 8px;
            padding: 15px;
            text-align: center;
            border: 1px solid #e2e2e2;
        }
        table {
            width: 100%;
            border-collapse: collapse;
            margin-top: 20px;
            font-size: 0.95rem;
        }
        table th, table td {
            border: 1px solid #ddd;
            padding: 10px;
            text-align: left;
        }
        table th {
            background-color: #2980b9;
            color: #fff;
        }
        table tr:nth-child(even) {
            background-color: #f2f2f2;
        }
        .progress {
            background-color: #ddd;
            border-radius: 10px;
            overflow: hidden;
            height: 18px;
            width: 100%;
            margin-top: 5px;
        }
        .progress-stacked {
            display: flex;
            height: 18px;
            border-radius: 10px;
            overflow: hidden;
            background-color: #ddd;
            font-size: 0.75rem;
            line-height: 18px;
            color: white;
            text-shadow: 0 1px 1px rgba(0,0,0,0.2);
            margin-bottom: 10px;
        }
        .progress-bar {
            display: flex;
            align-items: center;
            justify-content: center;
            height: 100%;
            white-space: nowrap;
            overflow: hidden;
        }
        .progress-bar-table {
            background-color: #27ae60;
            height: 100%;
            text-align: right;
            padding-right: 5px;
            color: white;
            font-size: 0.8rem;
            line-height: 18px;
        }
        .progress-multi {
            display: flex;
            width: 100%;
            height: 25px;
            border-radius: 12px;
            overflow: hidden;
            border: 1px solid #ccc;
        }
        .progress-segment {
            height: 100%;
        }
        .legend {
            display: flex;
            flex-wrap: wrap;
            gap: 10px;
            font-size: 0.85rem;
            margin-top: 5px;
        }
        .legend-item {
            display: flex;
            align-items: center;
            gap: 5px;
        }
        .legend-color {
            width: 14px;
            height: 14px;
            border-radius: 3px;
            border: 1px solid #aaa;
        }
        .img-section {
            display: flex;
            flex-wrap: wrap;
            gap: 30px;
            justify-content: center;
            align-items: flex-start;
            margin-top: 20px;
        }
        footer {
            text-align: center;
            font-size: 0.9rem;
            color: #777;
            padding: 15px;
            margin-top: 20px;
            border-top: 1px solid #ddd;
        }
    </style>
    """