Calculating Error and Correlation Metrics Manually#

cinnabar’s scatter plots automatically annotate figures with statistics such as RMSE or MUE based on the recommend best practices for the observable being plotted as described in the companion paper.

Sometimes, however, you may need to access those numbers to run a statistical comparison, or feed results into a downstream analysis pipeline and cinnabar provides a convenient way to do that as well using the API.

In this tutorial we will demonstrate how to recreate the analysis metrics shown by default in the example scatter plots manually, this involves the following steps:

  1. Extract the predicted and experimental values from a FEMap using the dataframe helper functions.

  2. Compute the appropriate metrics for the given observable using analysis functions available in cinnabar.stats.

  3. Obtain bootstrap confidence intervals around each metric using cinnabar.stats.bootstrap_statistic.

Available metrics#

The cinnabar.stats module exposes are wide range of deviation and correlation metrics, the table below maps the short-hand string identifiers (used by cinnabar internally) for each metric to their corresponding function names and descriptions:

Metric

Function name

Description

"RMSE"

calculate_rmse

Root mean squared error.

"MUE"

calculate_mue

Mean unsigned error.

"RAE"

calculate_rae

Relative absolute error, a measure of the average absolute error relative to the average magnitude of the true values.

"NRMSE"

calculate_nrmse

Normalized root mean squared error, using the true mean to normalize the RMSE.

"R2"

calculate_r2

Coefficient of determination.

"rho"

calculate_pearson_r

Pearson correlation coefficient.

"KTAU"

calculate_kendalls_tau

Kendall Tau correlation coefficient.

"PI"

calculate_predictive_index

Predictive index, weighted Kendall Tau using the difference of the true values to weight ordinal comparisons.

see the cinnabar.stats API documentation for more details.

Loading Example RBFE Results#

For this example, we will use RBFE data generated with OpenFE, which is bundled with cinnabar’s data directory. The dataset contains one set of simulated relative free energy edges (Method A) and a corresponding set of experimental absolute free energies. To demonstrate comparing two computational protocols, we also create a second method (Method B) by adding ±0.5 kcal/mol Gaussian noise to the Method A values. You can easily swap this out with your own RBFE outputs.

For more information on creating a FEMap and adding data to it, see the cinnabar API tutorial.

%matplotlib inline
import numpy as np
import pandas as pd
from openff.units import unit

from cinnabar import FEMap, stats

# Load the computational results bundled with cinnabar
rbfe_results = pd.read_csv("../cinnabar/data/computational_data.csv")

# Fix the random seed so bootstrap results are reproducible
rng = np.random.default_rng(seed=42)

femap = FEMap()

# Method A: the original calculated values
for _, result in rbfe_results.iterrows():
    # add each calculated relative free energy to the FEMap
    femap.add_relative_calculation(
        labelA=result["Ligand1"],  # string identifier for ligandA
        labelB=result["Ligand2"],  # string identifier for ligandB
        value=result["calc_DDG"] * unit.kilocalorie_per_mole,  # the calculated relative free energy with units
        uncertainty=result["calc_dDDG(MBAR)"]
        * unit.kilocalorie_per_mole,  # the uncertainty in the calculated relative free energy with units
        source="Method A",  # optional string describing the source of the calculation
    )

# Method B: same edges, slightly different values (simulating a second protocol)
for _, result in rbfe_results.iterrows():
    # add each calculated relative free energy to the FEMap
    femap.add_relative_calculation(
        labelA=result["Ligand1"],
        labelB=result["Ligand2"],
        # add ±0.5 kcal/mol noise
        value=(result["calc_DDG"] + rng.normal(0, 0.5)) * unit.kilocalorie_per_mole,
        uncertainty=result["calc_dDDG(MBAR)"] * unit.kilocalorie_per_mole,
        source="Method B",
    )

# Experimental absolute values (shared by both methods)
experimental_results = pd.read_csv("../cinnabar/data/experimental_data.csv")
for _, exp_row in experimental_results.iterrows():
    femap.add_experimental_measurement(
        label=exp_row["Ligand"],
        value=exp_row["expt_DG"] * unit.kilocalorie_per_mole,
        uncertainty=exp_row["expt_dDG"] * unit.kilocalorie_per_mole,
        source="Experimental",
    )

# Generate MLE absolute values for both sources in one call.
# This produces sources "MLE(Method A)" and "MLE(Method B)" in the absolute dataframe.
femap.generate_absolute_values()
print(f"FEMap contains {femap.n_ligands} ligands")
rel_df = femap.get_relative_dataframe()
print("Computational sources:", rel_df.loc[rel_df["computational"], "source"].unique().tolist())
FEMap contains 36 ligands
Computational sources: ['Method A', 'Method B']

Extracting data from the FEMap#

Relative ΔΔG data#

get_relative_dataframe() returns a DataFrame of every directly simulated edge together with the corresponding experimental ΔΔG derived from the absolute experimental values. The dataframe is also grouped and sorted by the source column to make computing metrics for a particular method straightforward.

The columns are:

  • labelA, labelB: labels for the two ligands involved in the transformation

  • DDG (kcal/mol): the simulated free energy difference (ΔΔG = ΔG_B − ΔG_A)

  • uncertainty (kcal/mol): the reported uncertainty for the simulation

  • source: which method / dataset the row comes from

  • computational: True for calculated results, False for experimental

rel_df = femap.get_relative_dataframe()
# show some method A results
rel_df.head(10)
labelA labelB DDG (kcal/mol) uncertainty (kcal/mol) source computational
0 CAT-13a CAT-13m -0.95 0.13 Method A True
1 CAT-13a CAT-17g -0.02 0.10 Method A True
2 CAT-13a CAT-17i -0.76 0.11 Method A True
3 CAT-13b CAT-17g 0.36 0.11 Method A True
4 CAT-13c CAT-17i 0.26 0.11 Method A True
5 CAT-13d CAT-13b 2.12 0.12 Method A True
6 CAT-13d CAT-13f 0.13 0.13 Method A True
7 CAT-13d CAT-13h 1.46 0.10 Method A True
8 CAT-13d CAT-13i -0.59 0.13 Method A True
9 CAT-13d CAT-17a -0.78 0.07 Method A True
# show some experimental results
rel_df.tail(10)
labelA labelB DDG (kcal/mol) uncertainty (kcal/mol) source computational
164 CAT-4m CAT-4c 1.30 0.141421 experimental False
165 CAT-4m CAT-4j 0.13 0.141421 experimental False
166 CAT-4m CAT-4k 1.30 0.141421 experimental False
167 CAT-4m CAT-4l -0.19 0.141421 experimental False
168 CAT-4m CAT-4n 0.06 0.141421 experimental False
169 CAT-4m CAT-4p -0.93 0.141421 experimental False
170 CAT-4n CAT-13k -0.61 0.141421 experimental False
171 CAT-4o CAT-4b -0.25 0.141421 experimental False
172 CAT-4o CAT-4d 0.27 0.141421 experimental False
173 CAT-4p CAT-13k 0.38 0.141421 experimental False
# show all the sources in the dataframe
print(f"Sources present: {rel_df['source'].unique().tolist()}")
Sources present: ['Method A', 'Method B', 'experimental']

To compute error metrics for a particular method we need to pair each computational edge with the corresponding experimental value. Both are already in the same DataFrame with matching (labelA, labelB) keys we simply merge on those columns, while making sure to filter out any rows with missing values (e.g. if an experimental value is missing for a particular edge, that row will be dropped from the merged dataframe).

# make a helper function to align a source in the dataframe with the experimental values
def get_paired_ddg(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Return a DataFrame with matched computational and experimental ΔΔG columns."""
    # Computational rows for the requested source
    calc = df[df["source"] == source][["labelA", "labelB", "DDG (kcal/mol)", "uncertainty (kcal/mol)"]]
    calc = calc.rename(columns={"DDG (kcal/mol)": "DDG_calc", "uncertainty (kcal/mol)": "dDDG_calc"})

    # Experimental rows (source label is always "experimental" inside the relative dataframe)
    expt = df[df["source"] == "experimental"][["labelA", "labelB", "DDG (kcal/mol)", "uncertainty (kcal/mol)"]]
    expt = expt.rename(columns={"DDG (kcal/mol)": "DDG_expt", "uncertainty (kcal/mol)": "dDDG_expt"})

    # merge the aligned dataframes
    paired = calc.merge(expt, on=["labelA", "labelB"], how="left")
    # drop any rows with a missing experimental value
    paired = paired.dropna(subset=["DDG_expt"])
    return paired


paired_A = get_paired_ddg(rel_df, "Method A")
paired_B = get_paired_ddg(rel_df, "Method B")

print(f"Paired edges — Method A: {len(paired_A)}, Method B: {len(paired_B)}\n")
paired_A.head()
Paired edges — Method A: 58, Method B: 58
labelA labelB DDG_calc dDDG_calc DDG_expt dDDG_expt
0 CAT-13a CAT-13m -0.95 0.13 0.08 0.141421
1 CAT-13a CAT-17g -0.02 0.10 -0.90 0.141421
2 CAT-13a CAT-17i -0.76 0.11 -0.63 0.141421
3 CAT-13b CAT-17g 0.36 0.11 -0.62 0.141421
4 CAT-13c CAT-17i 0.26 0.11 -0.15 0.141421

Absolute ΔG data#

get_absolute_dataframe() returns per-ligand absolute free energies as a DataFrame. After calling generate_absolute_values() the MLE-derived computational estimates appear alongside the experimental absolute values, directly simulated absolute values are also included if present. The dataframe is grouped and sorted by the source column to make computing metrics for a particular method straightforward.

The columns are:

  • label: ligand identifier

  • DG (kcal/mol): the absolute binding free energy, either experimental, MLE-derived or directly simulated

  • uncertainty (kcal/mol): the reported uncertainty

  • source: which method / dataset the row comes from

  • computational: True for calculated results, False for experimental

abs_df = femap.get_absolute_dataframe()
# print the experimental values
abs_df.head(10)
label DG (kcal/mol) uncertainty (kcal/mol) source computational
0 CAT-13a -8.83 0.1 Experimental False
1 CAT-13b -9.11 0.1 Experimental False
2 CAT-13c -9.31 0.1 Experimental False
3 CAT-13d -10.46 0.1 Experimental False
4 CAT-13e -9.95 0.1 Experimental False
5 CAT-13f -9.08 0.1 Experimental False
6 CAT-13g -9.08 0.1 Experimental False
7 CAT-13h -9.62 0.1 Experimental False
8 CAT-13i -9.26 0.1 Experimental False
9 CAT-13j -8.72 0.1 Experimental False
# print the MLE values calculated using Method B
abs_df.tail(10)
label DG (kcal/mol) uncertainty (kcal/mol) source computational
98 CAT-4c 1.907047 0.110923 MLE(Method B) True
99 CAT-4d -1.208185 0.117046 MLE(Method B) True
100 CAT-4i 2.829656 0.109747 MLE(Method B) True
101 CAT-4j 1.381073 0.100368 MLE(Method B) True
102 CAT-4k 1.650562 0.105023 MLE(Method B) True
103 CAT-4l 2.157689 0.117528 MLE(Method B) True
104 CAT-4m 0.903243 0.089598 MLE(Method B) True
105 CAT-4n 0.209548 0.103610 MLE(Method B) True
106 CAT-4o 0.921020 0.096058 MLE(Method B) True
107 CAT-4p -0.259442 0.098955 MLE(Method B) True
# get all sources in the dataframe
print(f"Sources present: {abs_df['source'].unique().tolist()}")
Sources present: ['Experimental', 'MLE(Method A)', 'MLE(Method B)']
# make a helper function to align the chosen source with the experimental data
def get_paired_dg(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Return a DataFrame with matched MLE computational and experimental ΔG columns."""
    calc = df[df["source"] == source][["label", "DG (kcal/mol)", "uncertainty (kcal/mol)"]]
    calc = calc.rename(columns={"DG (kcal/mol)": "DG_calc", "uncertainty (kcal/mol)": "dDG_calc"})

    expt = df[df["source"] == "Experimental"][["label", "DG (kcal/mol)", "uncertainty (kcal/mol)"]]
    expt = expt.rename(columns={"DG (kcal/mol)": "DG_expt", "uncertainty (kcal/mol)": "dDG_expt"})

    # merge the aligned dataframes
    paired = calc.merge(expt, on="label", how="left")
    # drop any rows with a missing experimental value
    paired = paired.dropna(subset=["DG_expt"])
    return paired


# MLE source names are automatically composed as "MLE(<original_source>)"
abs_paired_A = get_paired_dg(abs_df, "MLE(Method A)")
abs_paired_B = get_paired_dg(abs_df, "MLE(Method B)")

print(f"Paired ligands — MLE(Method A): {len(abs_paired_A)}, MLE(Method B): {len(abs_paired_B)}\n")
abs_paired_A.head()
Paired ligands — MLE(Method A): 36, MLE(Method B): 36
label DG_calc dDG_calc DG_expt dDG_expt
0 CAT-13a 0.503213 0.066349 -8.83 0.1
1 CAT-13b 0.378864 0.102983 -9.11 0.1
2 CAT-13c 0.028110 0.099052 -9.31 0.1
3 CAT-13d -1.211377 0.075449 -10.46 0.1
4 CAT-13e -1.136890 0.099052 -9.95 0.1

Computing metrics with cinnabar.stats#

Each metric function implemented in cinnabar.stats has a similar API, they each take two numpy arrays: y_true and y_pred — and return a single scalar. The convention throughout cinnabar is:

  • y_true: experimental values

  • y_pred: calculated / predicted values

The arguments can be swaped however or used to compare two different calculated results as you wish.

Relative ΔΔG metrics#

y_true_A = paired_A["DDG_expt"].to_numpy()
y_pred_A = paired_A["DDG_calc"].to_numpy()
y_pred_B = paired_B["DDG_calc"].to_numpy()

# calculate the RMSE and MUE for Method A/B relative to experiment and report in a pandas DataFrame
data = []

for source, y_pred in [("Method A", y_pred_A), ("Method B", y_pred_B)]:
    for metric, func in [("RMSE", stats.calculate_rmse), ("MUE", stats.calculate_mue)]:
        value = func(y_true_A, y_pred)
        data.append({"Method": source, "Metric": metric, "Value": value})


metrics_df = pd.DataFrame(data)
print("Relative ΔΔG metrics:")
metrics_df
Relative ΔΔG metrics:
Method Metric Value
0 Method A RMSE 1.053002
1 Method A MUE 0.867586
2 Method B RMSE 1.134862
3 Method B MUE 0.898565

You can cross-check these values with the default annotations shown in the scatter plots in the cinnabar API tutorial and see that they match to 2 decimal places.

Absolute ΔG metrics#

The same functions work equally well on the MLE-derived absolute free energies, we can also calculate correlation metrics for these values matching the defaults on the absolute ΔG scatter plots in the API tutorial:

y_true_A_abs = abs_paired_A["DG_expt"].to_numpy()
y_pred_A_abs = abs_paired_A["DG_calc"].to_numpy()
y_pred_B_abs = abs_paired_B["DG_calc"].to_numpy()

# calculate the default metrics for Method A/B relative to experiment and report in a pandas DataFrame
data = []
for source, y_pred in [("MLE(Method A)", y_pred_A_abs), ("MLE(Method B)", y_pred_B_abs)]:
    for metric, func in [
        ("RMSE", stats.calculate_rmse),
        ("MUE", stats.calculate_mue),
        ("R2", stats.calculate_r2),
        ("rho", stats.calculate_pearson_r),
    ]:
        value = func(y_true_A_abs, y_pred)
        data.append({"Method": source, "Metric": metric, "Value": value})
metrics_df = pd.DataFrame(data)
print("Absolute ΔG metrics:")
metrics_df
Absolute ΔG metrics:
Method Metric Value
0 MLE(Method A) RMSE 9.364494
1 MLE(Method A) MUE 9.326389
2 MLE(Method A) R2 0.614966
3 MLE(Method A) rho 0.784198
4 MLE(Method B) RMSE 9.376977
5 MLE(Method B) MUE 9.326389
6 MLE(Method B) R2 0.561993
7 MLE(Method B) rho 0.749662
NOTE The accuracy metrics (RMSE and MUE) are significantly worse than those reported on the absolute ΔG scatter plots in the API tutorial. This is due the MLE values being centered around 0 rather than the experimental mean, this offset is automatically removed by default in the scatter plot. The correlation metrics (R² and Pearson r) are unaffected by this shift and match the API tutorial values.
# calculate the RMSE and MUE error again after applying a shift to align the center of the predicted and experimental values
y_true_A_abs_centered = y_true_A_abs - np.mean(y_true_A_abs)
y_pred_A_abs_centered = y_pred_A_abs - np.mean(y_pred_A_abs)
y_pred_B_abs_centered = y_pred_B_abs - np.mean(y_pred_B_abs)

# calculate the default metrics for Method A/B relative to experiment and report in a pandas DataFrame
data = []
for source, y_pred in [("MLE(Method A)", y_pred_A_abs_centered), ("MLE(Method B)", y_pred_B_abs_centered)]:
    for metric, func in [
        ("RMSE", stats.calculate_rmse),
        ("MUE", stats.calculate_mue),
        ("R2", stats.calculate_r2),
        ("rho", stats.calculate_pearson_r),
    ]:
        value = func(y_true_A_abs_centered, y_pred)
        data.append({"Method": source, "Metric": metric, "Value": value})
metrics_df = pd.DataFrame(data)
print("Relative ΔΔG metrics:")
metrics_df
Relative ΔΔG metrics:
Method Metric Value
0 MLE(Method A) RMSE 0.843931
1 MLE(Method A) MUE 0.664549
2 MLE(Method A) R2 0.614966
3 MLE(Method A) rho 0.784198
4 MLE(Method B) RMSE 0.972712
5 MLE(Method B) MUE 0.770764
6 MLE(Method B) R2 0.561993
7 MLE(Method B) rho 0.749662

Bootstrap confidence intervals#

A sample estimate of a metric is not useful in isolation without an estimate of its uncertainity. bootstrap_statistic wraps any metric with bootstrap resampling to create a distribution of values from which confidence intervals can be derived (see here for more information on confidence intervals and their derivation). It returns a dictionary with the following keys:

  • mle: the statistic on the original (unsampled) data, or sample statistic

  • mean: the mean value for the metric over all bootstrap samples

  • stderr: standard error of the statistic over all bootstrap samples

  • low / high: the confidence interval bounds (default 95 %)

The statistic argument accepts any of the available metrics as short-hand strings: "RMSE", "MUE", "R2", "rho", "KTAU", "RAE", "NRMSE", or "PI".

You can also pass the experimental and predicted uncertainty arrays and ask the bootstrap to incorporate them by setting include_true_uncertainty=True and/or include_pred_uncertainty=True. This adds Gaussian noise drawn from those uncertainties during each replicate of bootstrap sampling.

# calculate the RMSE of the DDGs again with a bootstrap CI estimate
rmse_boot_A = stats.bootstrap_statistic(
    y_true=y_true_A,
    y_pred=y_pred_A,
    dy_true=paired_A["dDDG_expt"].to_numpy(),
    dy_pred=paired_A["dDDG_calc"].to_numpy(),
    statistic="RMSE",
    ci=0.95,
    nbootstrap=1000,
    include_pred_uncertainty=False,
)

print("RMSE bootstrap results for Method A (ΔΔG):")
for key, value in rmse_boot_A.items():
    print(f"  {key:8s}: {value:.3f} kcal/mol")
RMSE bootstrap results for Method A (ΔΔG):
  mle     : 1.053 kcal/mol
  stderr  : 0.086 kcal/mol
  mean    : 1.053 kcal/mol
  low     : 0.879 kcal/mol
  high    : 1.220 kcal/mol

This can then be repeated for every metric of interest using a simple loop as above.

Common tasks with FEMap dataframes#

The DataFrames returned by get_relative_dataframe() and get_absolute_dataframe() are standard pandas DataFrames, so any standard DataFrame operation works on them. A few useful patterns which can help you explore the data and prepare it for downstream analysis:

# List all unique sources
rel_df["source"].unique()

# Filter to a single method
method_a_rows = rel_df[rel_df["source"] == "Method A"]

# Filter to experimental rows only
expt_rows = rel_df[~rel_df["computational"]]

Recap#

  • Use femap.get_relative_dataframe() to obtain the directly simulated ΔΔG edges together with the corresponding experimental ΔΔG values; merge on (labelA, labelB) to pair them for metric computation.

  • Use femap.get_absolute_dataframe() for per-ligand ΔG values; merge on label to pair them. Requires femap.generate_absolute_values() to have been called first for the MLE estimates or directly simulated absolute values to be present.

  • All point-estimate metrics are in cinnabar.stats and accept plain numpy arrays.

  • bootstrap_statistic wraps any of these with bootstrap resampling and returns the MLE estimate alongside a mean, standard error, and confidence interval.