Calculating Error and Correlation Metrics Manually#

cinnabar’s scatter plots automatically annotate figures with statistics such as RMSE or MUE based on the recommend best practices for the observable being plotted as described in the companion paper.

Sometimes, however, you may need to access those numbers to run a statistical comparison, or feed results into a downstream analysis pipeline and cinnabar provides a convenient way to do that as well using the API.

In this tutorial we will demonstrate how to recreate the analysis metrics shown by default in the example scatter plots manually, this involves the following steps:

Extract the predicted and experimental values from a FEMap using the dataframe helper functions.
Compute the appropriate metrics for the given observable using analysis functions available in cinnabar.stats.
Obtain bootstrap confidence intervals around each metric using cinnabar.stats.bootstrap_statistic.

Available metrics#

The cinnabar.stats module exposes are wide range of deviation and correlation metrics, the table below maps the short-hand string identifiers (used by cinnabar internally) for each metric to their corresponding function names and descriptions:

Metric	Function name	Description
`"RMSE"`	`calculate_rmse`	Root mean squared error.
`"MUE"`	`calculate_mue`	Mean unsigned error.
`"RAE"`	`calculate_rae`	Relative absolute error, a measure of the average absolute error relative to the average magnitude of the true values.
`"NRMSE"`	`calculate_nrmse`	Normalized root mean squared error, using the true mean to normalize the RMSE.
`"R2"`	`calculate_r2`	Coefficient of determination.
`"rho"`	`calculate_pearson_r`	Pearson correlation coefficient.
`"KTAU"`	`calculate_kendalls_tau`	Kendall Tau correlation coefficient.
`"PI"`	`calculate_predictive_index`	Predictive index, weighted Kendall Tau using the difference of the true values to weight ordinal comparisons.

see the cinnabar.stats API documentation for more details.

Loading Example RBFE Results#

For this example, we will use RBFE data generated with OpenFE, which is bundled with cinnabar’s data directory. The dataset contains one set of simulated relative free energy edges (Method A) and a corresponding set of experimental absolute free energies. To demonstrate comparing two computational protocols, we also create a second method (Method B) by adding ±0.5 kcal/mol Gaussian noise to the Method A values. You can easily swap this out with your own RBFE outputs.

For more information on creating a FEMap and adding data to it, see the cinnabar API tutorial.

%matplotlib inline
import numpy as np
import pandas as pd
from openff.units import unit

from cinnabar import FEMap, stats

# Load the computational results bundled with cinnabar
rbfe_results = pd.read_csv("../cinnabar/data/computational_data.csv")

# Fix the random seed so bootstrap results are reproducible
rng = np.random.default_rng(seed=42)

femap = FEMap()

# Method A: the original calculated values
for _, result in rbfe_results.iterrows():
    # add each calculated relative free energy to the FEMap
    femap.add_relative_calculation(
        labelA=result["Ligand1"],  # string identifier for ligandA
        labelB=result["Ligand2"],  # string identifier for ligandB
        value=result["calc_DDG"] * unit.kilocalorie_per_mole,  # the calculated relative free energy with units
        uncertainty=result["calc_dDDG(MBAR)"]
        * unit.kilocalorie_per_mole,  # the uncertainty in the calculated relative free energy with units
        source="Method A",  # optional string describing the source of the calculation
    )

# Method B: same edges, slightly different values (simulating a second protocol)
for _, result in rbfe_results.iterrows():
    # add each calculated relative free energy to the FEMap
    femap.add_relative_calculation(
        labelA=result["Ligand1"],
        labelB=result["Ligand2"],
        # add ±0.5 kcal/mol noise
        value=(result["calc_DDG"] + rng.normal(0, 0.5)) * unit.kilocalorie_per_mole,
        uncertainty=result["calc_dDDG(MBAR)"] * unit.kilocalorie_per_mole,
        source="Method B",
    )

# Experimental absolute values (shared by both methods)
experimental_results = pd.read_csv("../cinnabar/data/experimental_data.csv")
for _, exp_row in experimental_results.iterrows():
    femap.add_experimental_measurement(
        label=exp_row["Ligand"],
        value=exp_row["expt_DG"] * unit.kilocalorie_per_mole,
        uncertainty=exp_row["expt_dDG"] * unit.kilocalorie_per_mole,
        source="Experimental",
    )

# Generate MLE absolute values for both sources in one call.
# This produces sources "MLE(Method A)" and "MLE(Method B)" in the absolute dataframe.
femap.generate_absolute_values()
print(f"FEMap contains {femap.n_ligands} ligands")
rel_df = femap.get_relative_dataframe()
print("Computational sources:", rel_df.loc[rel_df["computational"], "source"].unique().tolist())

FEMap contains 36 ligands
Computational sources: ['Method A', 'Method B']

Extracting data from the FEMap#

Relative ΔΔG data#

get_relative_dataframe() returns a DataFrame of every directly simulated edge together with the corresponding experimental ΔΔG derived from the absolute experimental values. The dataframe is also grouped and sorted by the source column to make computing metrics for a particular method straightforward.

The columns are:

labelA, labelB: labels for the two ligands involved in the transformation
DDG (kcal/mol): the simulated free energy difference (ΔΔG = ΔG_B − ΔG_A)
uncertainty (kcal/mol): the reported uncertainty for the simulation
source: which method / dataset the row comes from
computational: True for calculated results, False for experimental

rel_df = femap.get_relative_dataframe()
# show some method A results
rel_df.head(10)

	labelA	labelB	DDG (kcal/mol)	uncertainty (kcal/mol)	source	computational
0	CAT-13a	CAT-13m	-0.95	0.13	Method A	True
1	CAT-13a	CAT-17g	-0.02	0.10	Method A	True
2	CAT-13a	CAT-17i	-0.76	0.11	Method A	True
3	CAT-13b	CAT-17g	0.36	0.11	Method A	True
4	CAT-13c	CAT-17i	0.26	0.11	Method A	True
5	CAT-13d	CAT-13b	2.12	0.12	Method A	True
6	CAT-13d	CAT-13f	0.13	0.13	Method A	True
7	CAT-13d	CAT-13h	1.46	0.10	Method A	True
8	CAT-13d	CAT-13i	-0.59	0.13	Method A	True
9	CAT-13d	CAT-17a	-0.78	0.07	Method A	True

# show some experimental results
rel_df.tail(10)

	labelA	labelB	DDG (kcal/mol)	uncertainty (kcal/mol)	source	computational
164	CAT-4m	CAT-4c	1.30	0.141421	experimental	False
165	CAT-4m	CAT-4j	0.13	0.141421	experimental	False
166	CAT-4m	CAT-4k	1.30	0.141421	experimental	False
167	CAT-4m	CAT-4l	-0.19	0.141421	experimental	False
168	CAT-4m	CAT-4n	0.06	0.141421	experimental	False
169	CAT-4m	CAT-4p	-0.93	0.141421	experimental	False
170	CAT-4n	CAT-13k	-0.61	0.141421	experimental	False
171	CAT-4o	CAT-4b	-0.25	0.141421	experimental	False
172	CAT-4o	CAT-4d	0.27	0.141421	experimental	False
173	CAT-4p	CAT-13k	0.38	0.141421	experimental	False

# show all the sources in the dataframe
print(f"Sources present: {rel_df['source'].unique().tolist()}")

Sources present: ['Method A', 'Method B', 'experimental']

To compute error metrics for a particular method we need to pair each computational edge with the corresponding experimental value. Both are already in the same DataFrame with matching (labelA, labelB) keys we simply merge on those columns, while making sure to filter out any rows with missing values (e.g. if an experimental value is missing for a particular edge, that row will be dropped from the merged dataframe).

# make a helper function to align a source in the dataframe with the experimental values
def get_paired_ddg(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Return a DataFrame with matched computational and experimental ΔΔG columns."""
    # Computational rows for the requested source
    calc = df[df["source"] == source][["labelA", "labelB", "DDG (kcal/mol)", "uncertainty (kcal/mol)"]]
    calc = calc.rename(columns={"DDG (kcal/mol)": "DDG_calc", "uncertainty (kcal/mol)": "dDDG_calc"})

    # Experimental rows (source label is always "experimental" inside the relative dataframe)
    expt = df[df["source"] == "experimental"][["labelA", "labelB", "DDG (kcal/mol)", "uncertainty (kcal/mol)"]]
    expt = expt.rename(columns={"DDG (kcal/mol)": "DDG_expt", "uncertainty (kcal/mol)": "dDDG_expt"})

    # merge the aligned dataframes
    paired = calc.merge(expt, on=["labelA", "labelB"], how="left")
    # drop any rows with a missing experimental value
    paired = paired.dropna(subset=["DDG_expt"])
    return paired


paired_A = get_paired_ddg(rel_df, "Method A")
paired_B = get_paired_ddg(rel_df, "Method B")

print(f"Paired edges — Method A: {len(paired_A)}, Method B: {len(paired_B)}\n")
paired_A.head()

Paired edges — Method A: 58, Method B: 58

	labelA	labelB	DDG_calc	dDDG_calc	DDG_expt	dDDG_expt
0	CAT-13a	CAT-13m	-0.95	0.13	0.08	0.141421
1	CAT-13a	CAT-17g	-0.02	0.10	-0.90	0.141421
2	CAT-13a	CAT-17i	-0.76	0.11	-0.63	0.141421
3	CAT-13b	CAT-17g	0.36	0.11	-0.62	0.141421
4	CAT-13c	CAT-17i	0.26	0.11	-0.15	0.141421

Absolute ΔG data#

get_absolute_dataframe() returns per-ligand absolute free energies as a DataFrame. After calling generate_absolute_values() the MLE-derived computational estimates appear alongside the experimental absolute values, directly simulated absolute values are also included if present. The dataframe is grouped and sorted by the source column to make computing metrics for a particular method straightforward.

The columns are:

label: ligand identifier
DG (kcal/mol): the absolute binding free energy, either experimental, MLE-derived or directly simulated
uncertainty (kcal/mol): the reported uncertainty
source: which method / dataset the row comes from
computational: True for calculated results, False for experimental

abs_df = femap.get_absolute_dataframe()
# print the experimental values
abs_df.head(10)

	label	DG (kcal/mol)	uncertainty (kcal/mol)	source	computational
0	CAT-13a	-8.83	0.1	Experimental	False
1	CAT-13b	-9.11	0.1	Experimental	False
2	CAT-13c	-9.31	0.1	Experimental	False
3	CAT-13d	-10.46	0.1	Experimental	False
4	CAT-13e	-9.95	0.1	Experimental	False
5	CAT-13f	-9.08	0.1	Experimental	False
6	CAT-13g	-9.08	0.1	Experimental	False
7	CAT-13h	-9.62	0.1	Experimental	False
8	CAT-13i	-9.26	0.1	Experimental	False
9	CAT-13j	-8.72	0.1	Experimental	False

# print the MLE values calculated using Method B
abs_df.tail(10)

	label	DG (kcal/mol)	uncertainty (kcal/mol)	source	computational
98	CAT-4c	1.907047	0.110923	MLE(Method B)	True
99	CAT-4d	-1.208185	0.117046	MLE(Method B)	True
100	CAT-4i	2.829656	0.109747	MLE(Method B)	True
101	CAT-4j	1.381073	0.100368	MLE(Method B)	True
102	CAT-4k	1.650562	0.105023	MLE(Method B)	True
103	CAT-4l	2.157689	0.117528	MLE(Method B)	True
104	CAT-4m	0.903243	0.089598	MLE(Method B)	True
105	CAT-4n	0.209548	0.103610	MLE(Method B)	True
106	CAT-4o	0.921020	0.096058	MLE(Method B)	True
107	CAT-4p	-0.259442	0.098955	MLE(Method B)	True

# get all sources in the dataframe
print(f"Sources present: {abs_df['source'].unique().tolist()}")

Sources present: ['Experimental', 'MLE(Method A)', 'MLE(Method B)']

# make a helper function to align the chosen source with the experimental data
def get_paired_dg(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Return a DataFrame with matched MLE computational and experimental ΔG columns."""
    calc = df[df["source"] == source][["label", "DG (kcal/mol)", "uncertainty (kcal/mol)"]]
    calc = calc.rename(columns={"DG (kcal/mol)": "DG_calc", "uncertainty (kcal/mol)": "dDG_calc"})

    expt = df[df["source"] == "Experimental"][["label", "DG (kcal/mol)", "uncertainty (kcal/mol)"]]
    expt = expt.rename(columns={"DG (kcal/mol)": "DG_expt", "uncertainty (kcal/mol)": "dDG_expt"})

    # merge the aligned dataframes
    paired = calc.merge(expt, on="label", how="left")
    # drop any rows with a missing experimental value
    paired = paired.dropna(subset=["DG_expt"])
    return paired


# MLE source names are automatically composed as "MLE(<original_source>)"
abs_paired_A = get_paired_dg(abs_df, "MLE(Method A)")
abs_paired_B = get_paired_dg(abs_df, "MLE(Method B)")

print(f"Paired ligands — MLE(Method A): {len(abs_paired_A)}, MLE(Method B): {len(abs_paired_B)}\n")
abs_paired_A.head()

Paired ligands — MLE(Method A): 36, MLE(Method B): 36

	label	DG_calc	dDG_calc	DG_expt	dDG_expt
0	CAT-13a	0.503213	0.066349	-8.83	0.1
1	CAT-13b	0.378864	0.102983	-9.11	0.1
2	CAT-13c	0.028110	0.099052	-9.31	0.1
3	CAT-13d	-1.211377	0.075449	-10.46	0.1
4	CAT-13e	-1.136890	0.099052	-9.95	0.1

Computing metrics with `cinnabar.stats`#

Each metric function implemented in cinnabar.stats has a similar API, they each take two numpy arrays: y_true and y_pred — and return a single scalar. The convention throughout cinnabar is:

y_true: experimental values
y_pred: calculated / predicted values

The arguments can be swaped however or used to compare two different calculated results as you wish.

Relative ΔΔG metrics#

y_true_A = paired_A["DDG_expt"].to_numpy()
y_pred_A = paired_A["DDG_calc"].to_numpy()
y_pred_B = paired_B["DDG_calc"].to_numpy()

# calculate the RMSE and MUE for Method A/B relative to experiment and report in a pandas DataFrame
data = []

for source, y_pred in [("Method A", y_pred_A), ("Method B", y_pred_B)]:
    for metric, func in [("RMSE", stats.calculate_rmse), ("MUE", stats.calculate_mue)]:
        value = func(y_true_A, y_pred)
        data.append({"Method": source, "Metric": metric, "Value": value})


metrics_df = pd.DataFrame(data)
print("Relative ΔΔG metrics:")
metrics_df

Relative ΔΔG metrics:

	Method	Metric	Value
0	Method A	RMSE	1.053002
1	Method A	MUE	0.867586
2	Method B	RMSE	1.134862
3	Method B	MUE	0.898565

You can cross-check these values with the default annotations shown in the scatter plots in the cinnabar API tutorial and see that they match to 2 decimal places.

Absolute ΔG metrics#

The same functions work equally well on the MLE-derived absolute free energies, we can also calculate correlation metrics for these values matching the defaults on the absolute ΔG scatter plots in the API tutorial:

y_true_A_abs = abs_paired_A["DG_expt"].to_numpy()
y_pred_A_abs = abs_paired_A["DG_calc"].to_numpy()
y_pred_B_abs = abs_paired_B["DG_calc"].to_numpy()

# calculate the default metrics for Method A/B relative to experiment and report in a pandas DataFrame
data = []
for source, y_pred in [("MLE(Method A)", y_pred_A_abs), ("MLE(Method B)", y_pred_B_abs)]:
    for metric, func in [
        ("RMSE", stats.calculate_rmse),
        ("MUE", stats.calculate_mue),
        ("R2", stats.calculate_r2),
        ("rho", stats.calculate_pearson_r),
    ]:
        value = func(y_true_A_abs, y_pred)
        data.append({"Method": source, "Metric": metric, "Value": value})
metrics_df = pd.DataFrame(data)
print("Absolute ΔG metrics:")
metrics_df

Absolute ΔG metrics:

	Method	Metric	Value
0	MLE(Method A)	RMSE	9.364494
1	MLE(Method A)	MUE	9.326389
2	MLE(Method A)	R2	0.614966
3	MLE(Method A)	rho	0.784198
4	MLE(Method B)	RMSE	9.376977
5	MLE(Method B)	MUE	9.326389
6	MLE(Method B)	R2	0.561993
7	MLE(Method B)	rho	0.749662

NOTE The accuracy metrics (RMSE and MUE) are significantly worse than those reported on the absolute ΔG scatter plots in the API tutorial. This is due the MLE values being centered around 0 rather than the experimental mean, this offset is automatically removed by default in the scatter plot. The correlation metrics (R² and Pearson r) are unaffected by this shift and match the API tutorial values.

# calculate the RMSE and MUE error again after applying a shift to align the center of the predicted and experimental values
y_true_A_abs_centered = y_true_A_abs - np.mean(y_true_A_abs)
y_pred_A_abs_centered = y_pred_A_abs - np.mean(y_pred_A_abs)
y_pred_B_abs_centered = y_pred_B_abs - np.mean(y_pred_B_abs)

# calculate the default metrics for Method A/B relative to experiment and report in a pandas DataFrame
data = []
for source, y_pred in [("MLE(Method A)", y_pred_A_abs_centered), ("MLE(Method B)", y_pred_B_abs_centered)]:
    for metric, func in [
        ("RMSE", stats.calculate_rmse),
        ("MUE", stats.calculate_mue),
        ("R2", stats.calculate_r2),
        ("rho", stats.calculate_pearson_r),
    ]:
        value = func(y_true_A_abs_centered, y_pred)
        data.append({"Method": source, "Metric": metric, "Value": value})
metrics_df = pd.DataFrame(data)
print("Relative ΔΔG metrics:")
metrics_df

Relative ΔΔG metrics:

	Method	Metric	Value
0	MLE(Method A)	RMSE	0.843931
1	MLE(Method A)	MUE	0.664549
2	MLE(Method A)	R2	0.614966
3	MLE(Method A)	rho	0.784198
4	MLE(Method B)	RMSE	0.972712
5	MLE(Method B)	MUE	0.770764
6	MLE(Method B)	R2	0.561993
7	MLE(Method B)	rho	0.749662

Bootstrap confidence intervals#

A sample estimate of a metric is not useful in isolation without an estimate of its uncertainity. bootstrap_statistic wraps any metric with bootstrap resampling to create a distribution of values from which confidence intervals can be derived (see here for more information on confidence intervals and their derivation). It returns a dictionary with the following keys:

mle: the statistic on the original (unsampled) data, or sample statistic
mean: the mean value for the metric over all bootstrap samples
stderr: standard error of the statistic over all bootstrap samples
low / high: the confidence interval bounds (default 95 %)

The statistic argument accepts any of the available metrics as short-hand strings: "RMSE", "MUE", "R2", "rho", "KTAU", "RAE", "NRMSE", or "PI".

You can also pass the experimental and predicted uncertainty arrays and ask the bootstrap to incorporate them by setting include_true_uncertainty=True and/or include_pred_uncertainty=True. This adds Gaussian noise drawn from those uncertainties during each replicate of bootstrap sampling.

# calculate the RMSE of the DDGs again with a bootstrap CI estimate
rmse_boot_A = stats.bootstrap_statistic(
    y_true=y_true_A,
    y_pred=y_pred_A,
    dy_true=paired_A["dDDG_expt"].to_numpy(),
    dy_pred=paired_A["dDDG_calc"].to_numpy(),
    statistic="RMSE",
    ci=0.95,
    nbootstrap=1000,
    include_pred_uncertainty=False,
)

print("RMSE bootstrap results for Method A (ΔΔG):")
for key, value in rmse_boot_A.items():
    print(f"  {key:8s}: {value:.3f} kcal/mol")

RMSE bootstrap results for Method A (ΔΔG):
  mle     : 1.053 kcal/mol
  stderr  : 0.086 kcal/mol
  mean    : 1.053 kcal/mol
  low     : 0.879 kcal/mol
  high    : 1.220 kcal/mol

This can then be repeated for every metric of interest using a simple loop as above.

Common tasks with `FEMap` dataframes#

The DataFrames returned by get_relative_dataframe() and get_absolute_dataframe() are standard pandas DataFrames, so any standard DataFrame operation works on them. A few useful patterns which can help you explore the data and prepare it for downstream analysis:

# List all unique sources
rel_df["source"].unique()

# Filter to a single method
method_a_rows = rel_df[rel_df["source"] == "Method A"]

# Filter to experimental rows only
expt_rows = rel_df[~rel_df["computational"]]

Recap#

Use femap.get_relative_dataframe() to obtain the directly simulated ΔΔG edges together with the corresponding experimental ΔΔG values; merge on (labelA, labelB) to pair them for metric computation.
Use femap.get_absolute_dataframe() for per-ligand ΔG values; merge on label to pair them. Requires femap.generate_absolute_values() to have been called first for the MLE estimates or directly simulated absolute values to be present.
All point-estimate metrics are in cinnabar.stats and accept plain numpy arrays.
bootstrap_statistic wraps any of these with bootstrap resampling and returns the MLE estimate alongside a mean, standard error, and confidence interval.