Calculating Error and Correlation Metrics Manually#
cinnabar’s scatter plots automatically annotate figures with statistics such as RMSE or MUE based on the recommend best practices for the observable being plotted as described in the companion paper.
Sometimes, however, you may need to access those numbers to run a statistical comparison, or feed results into a downstream analysis pipeline and cinnabar provides a convenient way to do that as well using the API.
In this tutorial we will demonstrate how to recreate the analysis metrics shown by default in the example scatter plots manually, this involves the following steps:
Extract the predicted and experimental values from a
FEMapusing the dataframe helper functions.Compute the appropriate metrics for the given observable using analysis functions available in
cinnabar.stats.Obtain bootstrap confidence intervals around each metric using
cinnabar.stats.bootstrap_statistic.
Available metrics#
The cinnabar.stats module exposes are wide range of deviation and correlation metrics, the table below maps the short-hand string identifiers (used by cinnabar internally) for each metric to their corresponding function names and descriptions:
Metric |
Function name |
Description |
|---|---|---|
|
|
Root mean squared error. |
|
|
Mean unsigned error. |
|
|
Relative absolute error, a measure of the average absolute error relative to the average magnitude of the true values. |
|
|
Normalized root mean squared error, using the true mean to normalize the RMSE. |
|
|
Coefficient of determination. |
|
|
Pearson correlation coefficient. |
|
|
Kendall Tau correlation coefficient. |
|
|
Predictive index, weighted Kendall Tau using the difference of the true values to weight ordinal comparisons. |
see the cinnabar.stats API documentation for more details.
Loading Example RBFE Results#
For this example, we will use RBFE data generated with OpenFE, which is bundled with cinnabar’s data directory. The dataset contains one set of simulated relative free energy edges (Method A) and a corresponding set of experimental absolute free energies. To demonstrate comparing two computational protocols, we also create a second method (Method B) by adding ±0.5 kcal/mol Gaussian noise to the Method A values. You can easily swap this out with your own RBFE outputs.
For more information on creating a FEMap and adding data to it, see the cinnabar API tutorial.
%matplotlib inline
import numpy as np
import pandas as pd
from openff.units import unit
from cinnabar import FEMap, stats
# Load the computational results bundled with cinnabar
rbfe_results = pd.read_csv("../cinnabar/data/computational_data.csv")
# Fix the random seed so bootstrap results are reproducible
rng = np.random.default_rng(seed=42)
femap = FEMap()
# Method A: the original calculated values
for _, result in rbfe_results.iterrows():
# add each calculated relative free energy to the FEMap
femap.add_relative_calculation(
labelA=result["Ligand1"], # string identifier for ligandA
labelB=result["Ligand2"], # string identifier for ligandB
value=result["calc_DDG"] * unit.kilocalorie_per_mole, # the calculated relative free energy with units
uncertainty=result["calc_dDDG(MBAR)"]
* unit.kilocalorie_per_mole, # the uncertainty in the calculated relative free energy with units
source="Method A", # optional string describing the source of the calculation
)
# Method B: same edges, slightly different values (simulating a second protocol)
for _, result in rbfe_results.iterrows():
# add each calculated relative free energy to the FEMap
femap.add_relative_calculation(
labelA=result["Ligand1"],
labelB=result["Ligand2"],
# add ±0.5 kcal/mol noise
value=(result["calc_DDG"] + rng.normal(0, 0.5)) * unit.kilocalorie_per_mole,
uncertainty=result["calc_dDDG(MBAR)"] * unit.kilocalorie_per_mole,
source="Method B",
)
# Experimental absolute values (shared by both methods)
experimental_results = pd.read_csv("../cinnabar/data/experimental_data.csv")
for _, exp_row in experimental_results.iterrows():
femap.add_experimental_measurement(
label=exp_row["Ligand"],
value=exp_row["expt_DG"] * unit.kilocalorie_per_mole,
uncertainty=exp_row["expt_dDG"] * unit.kilocalorie_per_mole,
source="Experimental",
)
# Generate MLE absolute values for both sources in one call.
# This produces sources "MLE(Method A)" and "MLE(Method B)" in the absolute dataframe.
femap.generate_absolute_values()
print(f"FEMap contains {femap.n_ligands} ligands")
rel_df = femap.get_relative_dataframe()
print("Computational sources:", rel_df.loc[rel_df["computational"], "source"].unique().tolist())
FEMap contains 36 ligands
Computational sources: ['Method A', 'Method B']
Extracting data from the FEMap#
Relative ΔΔG data#
get_relative_dataframe() returns a DataFrame of every directly simulated edge together with the corresponding experimental ΔΔG derived from the absolute experimental values. The dataframe is also grouped and sorted by the source column to make computing metrics for a particular method straightforward.
The columns are:
labelA,labelB: labels for the two ligands involved in the transformationDDG (kcal/mol): the simulated free energy difference (ΔΔG = ΔG_B − ΔG_A)uncertainty (kcal/mol): the reported uncertainty for the simulationsource: which method / dataset the row comes fromcomputational:Truefor calculated results,Falsefor experimental
rel_df = femap.get_relative_dataframe()
# show some method A results
rel_df.head(10)
| labelA | labelB | DDG (kcal/mol) | uncertainty (kcal/mol) | source | computational | |
|---|---|---|---|---|---|---|
| 0 | CAT-13a | CAT-13m | -0.95 | 0.13 | Method A | True |
| 1 | CAT-13a | CAT-17g | -0.02 | 0.10 | Method A | True |
| 2 | CAT-13a | CAT-17i | -0.76 | 0.11 | Method A | True |
| 3 | CAT-13b | CAT-17g | 0.36 | 0.11 | Method A | True |
| 4 | CAT-13c | CAT-17i | 0.26 | 0.11 | Method A | True |
| 5 | CAT-13d | CAT-13b | 2.12 | 0.12 | Method A | True |
| 6 | CAT-13d | CAT-13f | 0.13 | 0.13 | Method A | True |
| 7 | CAT-13d | CAT-13h | 1.46 | 0.10 | Method A | True |
| 8 | CAT-13d | CAT-13i | -0.59 | 0.13 | Method A | True |
| 9 | CAT-13d | CAT-17a | -0.78 | 0.07 | Method A | True |
# show some experimental results
rel_df.tail(10)
| labelA | labelB | DDG (kcal/mol) | uncertainty (kcal/mol) | source | computational | |
|---|---|---|---|---|---|---|
| 164 | CAT-4m | CAT-4c | 1.30 | 0.141421 | experimental | False |
| 165 | CAT-4m | CAT-4j | 0.13 | 0.141421 | experimental | False |
| 166 | CAT-4m | CAT-4k | 1.30 | 0.141421 | experimental | False |
| 167 | CAT-4m | CAT-4l | -0.19 | 0.141421 | experimental | False |
| 168 | CAT-4m | CAT-4n | 0.06 | 0.141421 | experimental | False |
| 169 | CAT-4m | CAT-4p | -0.93 | 0.141421 | experimental | False |
| 170 | CAT-4n | CAT-13k | -0.61 | 0.141421 | experimental | False |
| 171 | CAT-4o | CAT-4b | -0.25 | 0.141421 | experimental | False |
| 172 | CAT-4o | CAT-4d | 0.27 | 0.141421 | experimental | False |
| 173 | CAT-4p | CAT-13k | 0.38 | 0.141421 | experimental | False |
# show all the sources in the dataframe
print(f"Sources present: {rel_df['source'].unique().tolist()}")
Sources present: ['Method A', 'Method B', 'experimental']
To compute error metrics for a particular method we need to pair each computational edge with the corresponding experimental value. Both are already in the same DataFrame with matching (labelA, labelB) keys we simply merge on those columns, while making sure to filter out any rows with missing values (e.g. if an experimental value is missing for a particular edge, that row will be dropped from the merged dataframe).
# make a helper function to align a source in the dataframe with the experimental values
def get_paired_ddg(df: pd.DataFrame, source: str) -> pd.DataFrame:
"""Return a DataFrame with matched computational and experimental ΔΔG columns."""
# Computational rows for the requested source
calc = df[df["source"] == source][["labelA", "labelB", "DDG (kcal/mol)", "uncertainty (kcal/mol)"]]
calc = calc.rename(columns={"DDG (kcal/mol)": "DDG_calc", "uncertainty (kcal/mol)": "dDDG_calc"})
# Experimental rows (source label is always "experimental" inside the relative dataframe)
expt = df[df["source"] == "experimental"][["labelA", "labelB", "DDG (kcal/mol)", "uncertainty (kcal/mol)"]]
expt = expt.rename(columns={"DDG (kcal/mol)": "DDG_expt", "uncertainty (kcal/mol)": "dDDG_expt"})
# merge the aligned dataframes
paired = calc.merge(expt, on=["labelA", "labelB"], how="left")
# drop any rows with a missing experimental value
paired = paired.dropna(subset=["DDG_expt"])
return paired
paired_A = get_paired_ddg(rel_df, "Method A")
paired_B = get_paired_ddg(rel_df, "Method B")
print(f"Paired edges — Method A: {len(paired_A)}, Method B: {len(paired_B)}\n")
paired_A.head()
Paired edges — Method A: 58, Method B: 58
| labelA | labelB | DDG_calc | dDDG_calc | DDG_expt | dDDG_expt | |
|---|---|---|---|---|---|---|
| 0 | CAT-13a | CAT-13m | -0.95 | 0.13 | 0.08 | 0.141421 |
| 1 | CAT-13a | CAT-17g | -0.02 | 0.10 | -0.90 | 0.141421 |
| 2 | CAT-13a | CAT-17i | -0.76 | 0.11 | -0.63 | 0.141421 |
| 3 | CAT-13b | CAT-17g | 0.36 | 0.11 | -0.62 | 0.141421 |
| 4 | CAT-13c | CAT-17i | 0.26 | 0.11 | -0.15 | 0.141421 |
Absolute ΔG data#
get_absolute_dataframe() returns per-ligand absolute free energies as a DataFrame. After calling generate_absolute_values() the MLE-derived computational estimates appear alongside the experimental absolute values, directly simulated absolute values are also included if present. The dataframe is grouped and sorted by the source column to make computing metrics for a particular method straightforward.
The columns are:
label: ligand identifierDG (kcal/mol): the absolute binding free energy, either experimental, MLE-derived or directly simulateduncertainty (kcal/mol): the reported uncertaintysource: which method / dataset the row comes fromcomputational:Truefor calculated results,Falsefor experimental
abs_df = femap.get_absolute_dataframe()
# print the experimental values
abs_df.head(10)
| label | DG (kcal/mol) | uncertainty (kcal/mol) | source | computational | |
|---|---|---|---|---|---|
| 0 | CAT-13a | -8.83 | 0.1 | Experimental | False |
| 1 | CAT-13b | -9.11 | 0.1 | Experimental | False |
| 2 | CAT-13c | -9.31 | 0.1 | Experimental | False |
| 3 | CAT-13d | -10.46 | 0.1 | Experimental | False |
| 4 | CAT-13e | -9.95 | 0.1 | Experimental | False |
| 5 | CAT-13f | -9.08 | 0.1 | Experimental | False |
| 6 | CAT-13g | -9.08 | 0.1 | Experimental | False |
| 7 | CAT-13h | -9.62 | 0.1 | Experimental | False |
| 8 | CAT-13i | -9.26 | 0.1 | Experimental | False |
| 9 | CAT-13j | -8.72 | 0.1 | Experimental | False |
# print the MLE values calculated using Method B
abs_df.tail(10)
| label | DG (kcal/mol) | uncertainty (kcal/mol) | source | computational | |
|---|---|---|---|---|---|
| 98 | CAT-4c | 1.907047 | 0.110923 | MLE(Method B) | True |
| 99 | CAT-4d | -1.208185 | 0.117046 | MLE(Method B) | True |
| 100 | CAT-4i | 2.829656 | 0.109747 | MLE(Method B) | True |
| 101 | CAT-4j | 1.381073 | 0.100368 | MLE(Method B) | True |
| 102 | CAT-4k | 1.650562 | 0.105023 | MLE(Method B) | True |
| 103 | CAT-4l | 2.157689 | 0.117528 | MLE(Method B) | True |
| 104 | CAT-4m | 0.903243 | 0.089598 | MLE(Method B) | True |
| 105 | CAT-4n | 0.209548 | 0.103610 | MLE(Method B) | True |
| 106 | CAT-4o | 0.921020 | 0.096058 | MLE(Method B) | True |
| 107 | CAT-4p | -0.259442 | 0.098955 | MLE(Method B) | True |
# get all sources in the dataframe
print(f"Sources present: {abs_df['source'].unique().tolist()}")
Sources present: ['Experimental', 'MLE(Method A)', 'MLE(Method B)']
# make a helper function to align the chosen source with the experimental data
def get_paired_dg(df: pd.DataFrame, source: str) -> pd.DataFrame:
"""Return a DataFrame with matched MLE computational and experimental ΔG columns."""
calc = df[df["source"] == source][["label", "DG (kcal/mol)", "uncertainty (kcal/mol)"]]
calc = calc.rename(columns={"DG (kcal/mol)": "DG_calc", "uncertainty (kcal/mol)": "dDG_calc"})
expt = df[df["source"] == "Experimental"][["label", "DG (kcal/mol)", "uncertainty (kcal/mol)"]]
expt = expt.rename(columns={"DG (kcal/mol)": "DG_expt", "uncertainty (kcal/mol)": "dDG_expt"})
# merge the aligned dataframes
paired = calc.merge(expt, on="label", how="left")
# drop any rows with a missing experimental value
paired = paired.dropna(subset=["DG_expt"])
return paired
# MLE source names are automatically composed as "MLE(<original_source>)"
abs_paired_A = get_paired_dg(abs_df, "MLE(Method A)")
abs_paired_B = get_paired_dg(abs_df, "MLE(Method B)")
print(f"Paired ligands — MLE(Method A): {len(abs_paired_A)}, MLE(Method B): {len(abs_paired_B)}\n")
abs_paired_A.head()
Paired ligands — MLE(Method A): 36, MLE(Method B): 36
| label | DG_calc | dDG_calc | DG_expt | dDG_expt | |
|---|---|---|---|---|---|
| 0 | CAT-13a | 0.503213 | 0.066349 | -8.83 | 0.1 |
| 1 | CAT-13b | 0.378864 | 0.102983 | -9.11 | 0.1 |
| 2 | CAT-13c | 0.028110 | 0.099052 | -9.31 | 0.1 |
| 3 | CAT-13d | -1.211377 | 0.075449 | -10.46 | 0.1 |
| 4 | CAT-13e | -1.136890 | 0.099052 | -9.95 | 0.1 |
Computing metrics with cinnabar.stats#
Each metric function implemented in cinnabar.stats has a similar API, they each take two numpy arrays: y_true and y_pred — and return a single scalar. The convention throughout cinnabar is:
y_true: experimental valuesy_pred: calculated / predicted values
The arguments can be swaped however or used to compare two different calculated results as you wish.
Relative ΔΔG metrics#
y_true_A = paired_A["DDG_expt"].to_numpy()
y_pred_A = paired_A["DDG_calc"].to_numpy()
y_pred_B = paired_B["DDG_calc"].to_numpy()
# calculate the RMSE and MUE for Method A/B relative to experiment and report in a pandas DataFrame
data = []
for source, y_pred in [("Method A", y_pred_A), ("Method B", y_pred_B)]:
for metric, func in [("RMSE", stats.calculate_rmse), ("MUE", stats.calculate_mue)]:
value = func(y_true_A, y_pred)
data.append({"Method": source, "Metric": metric, "Value": value})
metrics_df = pd.DataFrame(data)
print("Relative ΔΔG metrics:")
metrics_df
Relative ΔΔG metrics:
| Method | Metric | Value | |
|---|---|---|---|
| 0 | Method A | RMSE | 1.053002 |
| 1 | Method A | MUE | 0.867586 |
| 2 | Method B | RMSE | 1.134862 |
| 3 | Method B | MUE | 0.898565 |
You can cross-check these values with the default annotations shown in the scatter plots in the cinnabar API tutorial and see that they match to 2 decimal places.
Absolute ΔG metrics#
The same functions work equally well on the MLE-derived absolute free energies, we can also calculate correlation metrics for these values matching the defaults on the absolute ΔG scatter plots in the API tutorial:
y_true_A_abs = abs_paired_A["DG_expt"].to_numpy()
y_pred_A_abs = abs_paired_A["DG_calc"].to_numpy()
y_pred_B_abs = abs_paired_B["DG_calc"].to_numpy()
# calculate the default metrics for Method A/B relative to experiment and report in a pandas DataFrame
data = []
for source, y_pred in [("MLE(Method A)", y_pred_A_abs), ("MLE(Method B)", y_pred_B_abs)]:
for metric, func in [
("RMSE", stats.calculate_rmse),
("MUE", stats.calculate_mue),
("R2", stats.calculate_r2),
("rho", stats.calculate_pearson_r),
]:
value = func(y_true_A_abs, y_pred)
data.append({"Method": source, "Metric": metric, "Value": value})
metrics_df = pd.DataFrame(data)
print("Absolute ΔG metrics:")
metrics_df
Absolute ΔG metrics:
| Method | Metric | Value | |
|---|---|---|---|
| 0 | MLE(Method A) | RMSE | 9.364494 |
| 1 | MLE(Method A) | MUE | 9.326389 |
| 2 | MLE(Method A) | R2 | 0.614966 |
| 3 | MLE(Method A) | rho | 0.784198 |
| 4 | MLE(Method B) | RMSE | 9.376977 |
| 5 | MLE(Method B) | MUE | 9.326389 |
| 6 | MLE(Method B) | R2 | 0.561993 |
| 7 | MLE(Method B) | rho | 0.749662 |
# calculate the RMSE and MUE error again after applying a shift to align the center of the predicted and experimental values
y_true_A_abs_centered = y_true_A_abs - np.mean(y_true_A_abs)
y_pred_A_abs_centered = y_pred_A_abs - np.mean(y_pred_A_abs)
y_pred_B_abs_centered = y_pred_B_abs - np.mean(y_pred_B_abs)
# calculate the default metrics for Method A/B relative to experiment and report in a pandas DataFrame
data = []
for source, y_pred in [("MLE(Method A)", y_pred_A_abs_centered), ("MLE(Method B)", y_pred_B_abs_centered)]:
for metric, func in [
("RMSE", stats.calculate_rmse),
("MUE", stats.calculate_mue),
("R2", stats.calculate_r2),
("rho", stats.calculate_pearson_r),
]:
value = func(y_true_A_abs_centered, y_pred)
data.append({"Method": source, "Metric": metric, "Value": value})
metrics_df = pd.DataFrame(data)
print("Relative ΔΔG metrics:")
metrics_df
Relative ΔΔG metrics:
| Method | Metric | Value | |
|---|---|---|---|
| 0 | MLE(Method A) | RMSE | 0.843931 |
| 1 | MLE(Method A) | MUE | 0.664549 |
| 2 | MLE(Method A) | R2 | 0.614966 |
| 3 | MLE(Method A) | rho | 0.784198 |
| 4 | MLE(Method B) | RMSE | 0.972712 |
| 5 | MLE(Method B) | MUE | 0.770764 |
| 6 | MLE(Method B) | R2 | 0.561993 |
| 7 | MLE(Method B) | rho | 0.749662 |
Bootstrap confidence intervals#
A sample estimate of a metric is not useful in isolation without an estimate of its uncertainity. bootstrap_statistic wraps any metric with bootstrap resampling to create a distribution of values from which confidence intervals can be derived (see here for more information on confidence intervals and their derivation). It returns a dictionary with the following keys:
mle: the statistic on the original (unsampled) data, or sample statisticmean: the mean value for the metric over all bootstrap samplesstderr: standard error of the statistic over all bootstrap sampleslow/high: the confidence interval bounds (default 95 %)
The statistic argument accepts any of the available metrics as short-hand strings: "RMSE", "MUE", "R2", "rho", "KTAU", "RAE", "NRMSE", or "PI".
You can also pass the experimental and predicted uncertainty arrays and ask the bootstrap to incorporate them by setting include_true_uncertainty=True and/or include_pred_uncertainty=True. This adds Gaussian noise drawn from those uncertainties during each replicate of bootstrap sampling.
# calculate the RMSE of the DDGs again with a bootstrap CI estimate
rmse_boot_A = stats.bootstrap_statistic(
y_true=y_true_A,
y_pred=y_pred_A,
dy_true=paired_A["dDDG_expt"].to_numpy(),
dy_pred=paired_A["dDDG_calc"].to_numpy(),
statistic="RMSE",
ci=0.95,
nbootstrap=1000,
include_pred_uncertainty=False,
)
print("RMSE bootstrap results for Method A (ΔΔG):")
for key, value in rmse_boot_A.items():
print(f" {key:8s}: {value:.3f} kcal/mol")
RMSE bootstrap results for Method A (ΔΔG):
mle : 1.053 kcal/mol
stderr : 0.086 kcal/mol
mean : 1.053 kcal/mol
low : 0.879 kcal/mol
high : 1.220 kcal/mol
This can then be repeated for every metric of interest using a simple loop as above.
Common tasks with FEMap dataframes#
The DataFrames returned by get_relative_dataframe() and get_absolute_dataframe() are standard pandas DataFrames, so any standard DataFrame operation works on them. A few useful patterns which can help you explore the data and prepare it for downstream analysis:
# List all unique sources
rel_df["source"].unique()
# Filter to a single method
method_a_rows = rel_df[rel_df["source"] == "Method A"]
# Filter to experimental rows only
expt_rows = rel_df[~rel_df["computational"]]
Recap#
Use
femap.get_relative_dataframe()to obtain the directly simulated ΔΔG edges together with the corresponding experimental ΔΔG values; merge on(labelA, labelB)to pair them for metric computation.Use
femap.get_absolute_dataframe()for per-ligand ΔG values; merge onlabelto pair them. Requiresfemap.generate_absolute_values()to have been called first for the MLE estimates or directly simulated absolute values to be present.All point-estimate metrics are in
cinnabar.statsand accept plainnumpyarrays.bootstrap_statisticwraps any of these with bootstrap resampling and returns the MLE estimate alongside a mean, standard error, and confidence interval.