Re-Quantifying detections using toffee and 2D modified Gaussian

Through the many visualisations of toffee data, and appreciation for the TOF detector, one recognises that the data is roughly gaussian in both retention time and m/z index space. Indeed, as we will see below, it is possible to model the raw data as an analytic function to improve the quantification results. In doing so, the least-squares method used to fit the function to the data mimicks a high-frequency filter and performs basic noise removal. By fixing the retention time of the analytic function for all fragments, it is also possible to pull apart co-eluting peaks and count only the contribution from the peak of interest.

The elution profile from the LC column gives a log-normal distribution that is skewed towards the left, while the data is normally distributed in index m/z space that we attribute to the distribution of kinetic energy that individual ions obtain within the TOF mass analyser. These observations imply that we can perform an optimization to find the peak location, spread, and amplitude of the gaussian functions for each fragment, \(j\).

\[F_j \left( \sigma_t, \sigma_m, t_0, m_{0,j}, a_j, c_j \right) = \frac{1}{t} \cdot \frac{a_j}{\sigma_t\sqrt{2 \pi}} \exp{\left[-\frac{(\log{t} - \log{t_0})^2}{2 \sigma_t^2} - \frac{(m - m_{0,j})^2}{2 \sigma_m^2}\right]} + C_j \left( \sigma_t, t_0, c_j \right)\]

with chemical noise, for MS1 signal only, defined as

\[C_j \left( \sigma_m, m_{0,j}, c_j \right) = c_j \exp{\left[-\frac{(m - m_{0,j})^2}{2 \sigma_m^2}\right]}\]

and the minimisation function

\[G \left( \sigma_t, \sigma_m, t_0, \vec{m_0}, \vec{a}, \vec{c} \right) = \sum_j G_j \text{ , where } G_j = F_j \left( \sigma_t, \sigma_m, t_0, m_{0,j}, a_j, c_j \right) - I_j\]

where - \(t\) and \(m\) represent the retention time and \(i_\sqrt{m/z}\) space respectively - \(t_0\) and \(\vec{m_0}\) are the peak location, where we assume that the retention time for all fragments must be constant, but the mass offset is allowed to be different for each one to account for calibration offsets - \(\sigma_t\), \(\sigma_m\) are the spread of each gaussian - \(\vec{a}\) is the amplitude for each fragment - \(\vec{c}\) is the amplitude of chemical noise for each fragment, and - \(I_j\) is the raw intensity data for a given fragment.

By building a model such as a 2D gaussian, we are then able to calculate the volume under the curve, with a hypothesis that this should give a more robust measure of the intensity from sample to sample.

Furthermore, to improve robustness during least-squares fitting, we apply the following transforms to the parameters:

  • \(f(n, x) = n (1 + \tanh{x}) / 2\) is a sigmoid function for constraining \(x\) to the range of \(0, n\)

  • \(\sigma_t = \sigma_t^{\prime2}\), where \(\sigma_t^\prime\) is the value used during optimisation

  • \(\sigma_m = \sigma_m^{\prime2}\), where \(\sigma_m^\prime\) is the value used during optimisation

  • \(t_0 = f(num_t, t_0^\prime)\), where \(t_0^\prime\) is the value used during optimisation, and \(num_t\) is the number of retention time pixels. This constrains \(t_0\) to fall within the valid range of retention times

  • \(m_0 = f(num_m, m_0^\prime)\), where \(m_0^\prime\) is the value used during optimisation, and \(num_m\) is the number of index m/z pixels. This constrains \(m_0\) to fall within the valid range of index m/z values

  • \(a_j = a_j^{\prime2}\), where \(a_j^\prime\) is the value used during optimisation

  • \(c_j = c_j^{\prime2}\), where \(c_j^\prime\) is the value used during optimisation

[1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import os
import datetime

from IPython.display import SVG, Image, display

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

import seaborn as sns

import toffee
from toffee import requant
from toffee.util import calculate_isotope_mz
from toffee.viz import ToffeeFragmentsPlotter

from tqdm import tqdm

base_dir = os.environ.get('DIA_TEST_DATA_REPO', None)
assert base_dir is not None

sns.set()
[2]:
print(' Analysis date:', datetime.datetime.now())
print('toffee version:', toffee.__version__)
 Analysis date: 2019-08-12 10:15:31.826510
toffee version: 0.14.2

Fitting the data to an analytic model using the C++ library

In toffee.requant there are two classes that enable the requantification to be performed using PyProphet results as the input.

Swath Gold Standard data

[3]:
sgs_files = [2, 5, 8, 10]
force = False

sgs_quantifiers = {}
sgs_requantified = []
for f in tqdm(sgs_files):
    name = f'{f:03d}'
    sample = f'napedro_l120224_{name}_sw'
    output = sample + '.csv.gz'
    tof_fname = f'{base_dir}/SwathGoldStandard/tof/{sample}.tof'
    pyprophet_fname = f'{base_dir}/SwathGoldStandard/osw/{sample}.osw'
    q = requant.PyProphetSQLiteRequantifier(
        output_filename=output,
        toffee_filename=tof_fname,
        pyprophet_filename=pyprophet_fname,
        max_protein_q_value_rs=1.0,
        max_protein_q_value_experiment_wide=1.0,
        max_protein_q_value_global=1.0,
    )
    sgs_quantifiers[f] = q

    if not os.path.exists(output) or force:
        q.run()

    df = pd.read_csv(output)
    df['InjectionName'] = name
    df['PQM'] = df.FullPeptideName + '_' + df.Charge.map(str)
    sgs_requantified.append(df)

sgs_requantified = pd.concat(sgs_requantified)
sgs_requantified.head()
100%|██████████| 4/4 [00:00<00:00, 10.48it/s]
[3]:
run_id peptide_id Sequence FullPeptideName Charge peak_group_rank RT Intensity m_score m_score_peptide_run_specific ... MS1Intensity MS2Intensity ModelParamSigmaRT ModelParamSigmaMz ModelParamRT0 ModelParamMz0MS1 ModelParamMz0MS2 ModelParamAmplitudes InjectionName PQM
0 1368487973123354311 1 AAEDFTLLVK AAEDFTLLVK(UniMod:259) 2 1 4209.56 80.0 0.032005 0.032005 ... 5.028338 1.668722 0.064002 11.891527 15.757658 14.731366 12.945118 8.66231,8.142178.14217;1.04211.0421,2.753782.7... 002 AAEDFTLLVK(UniMod:259)_2
1 1368487973123354311 4 AAGASAQVLGQEGK AAGASAQVLGQEGK(UniMod:259) 2 1 2017.42 822.0 0.003899 0.003899 ... 76.928972 15.230757 0.088230 5.231933 14.949945 20.636998 20.908421 305.804,170.775170.775;41.441141.4411,16.10781... 002 AAGASAQVLGQEGK(UniMod:259)_2
2 1368487973123354311 17 ADSTGTLVITDPTR ADSTGTLVITDPTR(UniMod:267) 2 1 3105.00 180.0 0.003899 0.003899 ... 8.074463 2.976146 0.065825 2.685659 15.040877 20.335423 21.117461 49.0696,42.791442.7914;2.926292.92629,8.41438.... 002 ADSTGTLVITDPTR(UniMod:267)_2
3 1368487973123354311 22 AEVAALAAENK AEVAALAAENK(UniMod:259) 2 1 2221.28 2658.0 0.003899 0.003899 ... 189.337273 57.394183 0.082998 5.633617 15.131340 20.949316 21.025522 810.276,374.571374.571;70.898270.8982,87.69587... 002 AEVAALAAENK(UniMod:259)_2
4 1368487973123354311 26 AFGYYGPLR AFGYYGPLR(UniMod:267) 2 1 3723.90 320.0 0.034796 0.034796 ... 19.167878 9.666694 0.110380 9.718412 14.489525 23.953597 19.389373 35.5049,38.795338.7953;8.423798.42379,7.632937... 002 AFGYYGPLR(UniMod:267)_2

5 rows × 24 columns

ProCan90 data

[4]:
procan90_files = ['03-01', '06-07']
force = False

procan90_quantifiers = {}
procan90_requantified = []
for f in tqdm(procan90_files):
    sample = f'ProCan90-M{f}'
    output = sample + '.csv.gz'
    tof_fname = f'{base_dir}/ProCan90/tof/{sample}.tof'
    pyprophet_fname = f'{base_dir}/ProCan90/osw/{sample}.osw'
    q = requant.PyProphetSQLiteRequantifier(
        output_filename=output,
        toffee_filename=tof_fname,
        pyprophet_filename=pyprophet_fname,
    )
    procan90_quantifiers[f] = q

    if not os.path.exists(output) or force:
        q.run()

    df = pd.read_csv(output)
    df['InjectionName'] = sample
    df['PQM'] = df.FullPeptideName + '_' + df.Charge.map(str)
    procan90_requantified.append(df)

procan90_requantified = pd.concat(procan90_requantified)
procan90_requantified.head()
100%|██████████| 2/2 [00:14<00:00,  7.43s/it]
[4]:
run_id peptide_id Sequence FullPeptideName Charge peak_group_rank RT Intensity m_score m_score_peptide_run_specific ... MS1Intensity MS2Intensity ModelParamSigmaRT ModelParamSigmaMz ModelParamRT0 ModelParamMz0MS1 ModelParamMz0MS2 ModelParamAmplitudes InjectionName PQM
0 5145023687975356512 0 AAAAAAAAAAAAAAAGAGAGAK AAAAAAAAAAAAAAAGAGAGAK 3 1 2277.710 4944.0 0.016517 0.016516 ... 611.732180 88.886325 0.289517 3.981082 14.301614 19.422064 20.641688 929.679,4790.634790.63;198.689198.689,133.7331... ProCan90-M03-01 AAAAAAAAAAAAAAAGAGAGAK_3
1 5145023687975356512 3 AAAAAAAAAAGRVGM AAAAAAAAAAGRVGM 2 1 1609.310 14029.0 0.000583 0.000583 ... 2086.219858 201.586612 0.218611 3.505546 15.515581 22.769187 21.015362 14288.9,6270.76270.7;757.188757.188,649649,387... ProCan90-M03-01 AAAAAAAAAAGRVGM_2
2 5145023687975356512 13 AAAAAADPNAAWAAYYSHYYQQPPGPVPGPAPAPAAPPAQGEPPQP... AAAAAADPNAAWAAYYSHYYQQPPGPVPGPAPAPAAPPAQGEPPQP... 5 1 3275.670 2988.0 0.044867 0.044866 ... 289.512315 87.778535 0.291877 2.737636 13.559004 21.539127 21.589122 404.298,2301.282301.28;27.238927.2389,136.0631... ProCan90-M03-01 AAAAAADPNAAWAAYYSHYYQQPPGPVPGPAPAPAAPPAQGEPPQP...
3 5145023687975356512 16 AAAAADLANR AAAAADLANR 2 1 475.722 1992.0 0.012483 0.012482 ... 479.286636 134.605709 0.355179 11.511312 17.247381 19.503196 16.965990 545.168,1321.941321.94;221.035221.035,120.5912... ProCan90-M03-01 AAAAADLANR_2
4 5145023687975356512 19 AAAAASAAGPGGLVAGK AAAAASAAGPGGLVAGK 2 1 978.743 978.0 0.004057 0.004057 ... 355.547847 31.256479 0.307980 7.227377 11.911672 9.497525 16.879286 1193.04,604.938604.938;17.968217.9682,66.87976... ProCan90-M03-01 AAAAASAAGPGGLVAGK_2

5 rows × 24 columns

Compare intensities calculated by the toffee.requant C++ library

[5]:
def plot_intensities(requantified, cols, colname, divisor=None, expected_intensity=None, ylim=None):
    log_func = np.log2
    log_func_name = '(log2)'
    colname = f'{colname} {log_func_name}'

    def load(df, col):
        df = df[['PQM', 'InjectionName', col]].set_index(['PQM', 'InjectionName']).unstack('InjectionName')
        if divisor is not None:
            df = df.divide(df[(col, divisor)], axis=0)
        df = df.apply(log_func)
        return df.stack().reset_index(drop=False).rename(columns={col: colname})

    results = []
    for col in cols:
        df = load(requantified, col)
        df['Source'] = col
        results.append(df)
    results = pd.concat(results)

    x = 'InjectionName'
    col = 'Source'
    if expected_intensity is not None:
        x, col = col, x
    g = sns.catplot(
        kind='box',
        data=results,
        y=colname,
        x=x,
        col=col,
        palette='viridis',
        zorder=5,
        sharey=ylim is not None,
        showfliers=False,
    )

    if expected_intensity is not None:
        for ax, e in zip(g.axes.flatten(), expected_intensity):
            for i in range(len(results[x].unique())):
                ax.plot([i - 0.5, i + 0.5], log_func([e, e]), 'r--', zorder=1)
    if ylim is not None:
        ax.set_ylim(log_func(ylim))
    plt.show()

Swath Gold Standard data

[6]:
plot_intensities(
    sgs_requantified,
    ['Intensity', 'MS2Intensity', 'MS1Intensity'],
    'Intensity',
    expected_intensity=[2.0 ** (d - 10) for d in sgs_files],
    divisor='010',
    ylim=(1e-4, 2),
)
../_images/jupyter_requant_11_0.png

ProCan90 data

[7]:
plot_intensities(
    procan90_requantified,
    ['Intensity', 'MS2Intensity', 'MS1Intensity'],
    'Intensity',
    expected_intensity=(1.0, 1.0),
    divisor='ProCan90-M06-07',
    ylim=(0.01, 100),
)
../_images/jupyter_requant_13_0.png

Explore the raw data

High-scoring results

[8]:
def explore_raw_data(file, quantifiers, peptide_ids=None, plotter=None):
    q = quantifiers[file]
    if plotter is None:
        plotter = ToffeeFragmentsPlotter(q.toffee_filename)
    srl = q._add_ms2_windows(plotter.extractor.swath_run, q.srl)

    if peptide_ids is None:
        peptide_ids = q.scores \
            .sort_values(['m_score', 'Intensity'], ascending=(True, False)) \
            .iloc[100:115] \
            .peptide_id \
            .tolist()

    all_srl = srl.loc[srl.peptide_id.isin(peptide_ids)].copy()
    for ms2_name in sorted(all_srl.MS2Name.unique()):
        print('*' * 80)
        ms2_srl = all_srl.loc[all_srl.MS2Name == ms2_name]
        for peptide_id in ms2_srl.peptide_id.unique():
            print('-' * 50)

            fragments = ms2_srl.loc[ms2_srl.peptide_id == peptide_id]
            pre = fragments['PrecursorMz'].tolist()[:1]
            pro = fragments['ProductMz'].tolist()[:6]

            peptide = q.scores.loc[q.scores.peptide_id == peptide_id, 'FullPeptideName'].iloc[0]
            charge = q.scores.loc[q.scores.peptide_id == peptide_id, 'Charge'].iloc[0]
            rt = q.scores.loc[q.scores.peptide_id == peptide_id, 'RT'].iloc[0]
            rt_zoom = (rt - 300.0, rt + 300.0)

            print(f'ms2_name = \'{ms2_name}\'')
            print(f'peptide_id, peptide, charge, rt, pre = \'{peptide_id}\', \'{peptide}\', {charge}, {rt}, {pre[0]}')
            print(f'pro = {pro}')

            output_fname = '/tmp/fig.png'
            fig = plotter.create_plotly_figure(
                precursor_mz_list=pre,
                product_mz_list=pro,
                psm_name='{} | extraction_width = {} {}'.format(
                    plotter.tof_fname.split('/')[-1].split('.')[0],
                    plotter.extractor.extraction_width,
                    'ppm' if plotter.extractor.use_ppm else 'px',
                ),
                number_of_isotopes=0,
                log_heatmap=True,
                normalise_per_fragment=False,
                retention_time_zoom=rt_zoom,
                aspect_scale=0.2,
                retention_time_line=rt,
                output_fname=output_fname,
            )
            display(Image(output_fname))

SGS data

[9]:
sgs_file = 5
sgs_peptide_ids = [
    476,
    477,
    723,
    920,
]
explore_raw_data(
    file=sgs_file,
    quantifiers=sgs_quantifiers,
    peptide_ids=sgs_peptide_ids,
)
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-008'
peptide_id, peptide, charge, rt, pre = '476', 'IVDVDLTSEGK(UniMod:259)', 2, 2962.99, 592.318
pro = [642.355, 757.382, 856.45, 971.477]
../_images/jupyter_requant_18_1.png
--------------------------------------------------
ms2_name = 'ms2-008'
peptide_id, peptide, charge, rt, pre = '723', 'NPDGTVTVISR(UniMod:267)', 2, 2360.46, 584.813
pro = [484.312, 684.428, 842.497, 957.524]
../_images/jupyter_requant_18_3.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-010'
peptide_id, peptide, charge, rt, pre = '477', 'IVEIYGPESSGK(UniMod:259)', 2, 2900.32, 643.84
pro = [669.329, 832.393, 945.477, 1074.52]
../_images/jupyter_requant_18_5.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-014'
peptide_id, peptide, charge, rt, pre = '920', 'SDSSDTPPLPSPPGK(UniMod:259)', 2, 2804.66, 745.367
pro = [800.476, 897.528, 998.576, 1113.6]
../_images/jupyter_requant_18_7.png

ProCan90 data

[10]:
procan90_file = '03-01'
procan90_peptide_ids = [  # these are found by examing the raw data
    197043,
    4114,
    114292,
    193918,
    33530,
#     29585,
#     34385,
#     199044,
#     109281,
#     16535,
#     31437,
#     113336,
#     192320,
]
explore_raw_data(
    file=procan90_file,
    quantifiers=procan90_quantifiers,
    peptide_ids=procan90_peptide_ids,
)
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-024'
peptide_id, peptide, charge, rt, pre = '197043', 'VSFELFADK', 2, 3025.44, 528.27404785
pro = [435.22381592, 480.24526978, 593.3293457, 722.37194824, 869.44036865, 956.47235107]
../_images/jupyter_requant_20_1.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-035'
peptide_id, peptide, charge, rt, pre = '4114', 'AIPQLQGYLR', 2, 2607.96, 579.83514404
pro = [487.77453613, 508.28781128, 636.34637451, 749.43041992, 877.48901367, 974.54180908]
../_images/jupyter_requant_20_3.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-048'
peptide_id, peptide, charge, rt, pre = '114292', 'NALESYAFNMK', 2, 2491.33, 644.30554199
pro = [539.26464844, 610.30175781, 773.36505127, 860.39709473, 989.43969727, 1102.5238037]
../_images/jupyter_requant_20_5.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-049'
peptide_id, peptide, charge, rt, pre = '193918', 'VIHDNFGIVEGLMTTVHAITATQK', 4, 4158.86, 649.59545898
pro = [548.30383301, 579.28851318, 732.42504883, 783.37841797, 869.48394775, 896.46246338]
../_images/jupyter_requant_20_7.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-055'
peptide_id, peptide, charge, rt, pre = '33530', 'GGGGNFGPGPGSNFR', 2, 1180.7, 689.31835938
pro = [580.28381348, 677.33654785, 734.35803223, 831.4107666, 888.43225098, 1035.5006104]
../_images/jupyter_requant_20_9.png

Outliers defined by the C++ library results

In fitting an analytic model to the raw data, we now have a new set of parameters that can be used for exploring outliers; where an outlier is defined in a manner similar to plotting the whiskers on a box plot. For certain parameters, if the value from the fit model falls outside a multiplier of the inter-quartile range, then it is classed as an outlier.

[11]:
def plot_params(requantified, cols, colname, whis=5.0):
    groups = requantified.groupby('InjectionName')

    n_injections = len(groups)
    colours = [plt.cm.viridis(i / n_injections) for i in range(n_injections)]

    scale = 5
    ncols = len(cols)
    fig, axes = plt.subplots(ncols=ncols, figsize=(scale * ncols, scale))
    for i, (ax, col) in enumerate(zip(axes, cols)):
        xlims = []
        for colour, (injection, df) in zip(colours, groups):

            mp_groups = df[col]
            mp_groups.plot(kind='kde', ax=ax, bw_method=0.3, c=colour, label=injection)

            # plot outlier limits
            q1 = mp_groups.quantile(0.25)
            q3 = mp_groups.quantile(0.75)
            iqr = q3 - q1
            outlier_offset = whis * iqr
            lower = q1 - outlier_offset
            upper = q3 + outlier_offset
            ax.axvline(lower, ls='--', c=colour, alpha=0.5)
            ax.axvline(upper, ls='--', c=colour, alpha=0.5)

            xlim = list(ax.get_xlim())
            xlim[0] = max(xlim[0], lower - 2 * outlier_offset)
            xlim[1] = min(xlim[1], upper + 2 * outlier_offset)
            xlims.append(xlim)

        ax.set_xlim((min(x[0] for x in xlims), max(x[1] for x in xlims)))
        ax.set_title(col)
        if i == 0:
            ax.legend()
    plt.show()

def mark_outliers(requantified, col, whis=5.0):
    mp_groups = requantified[col].groupby(requantified.InjectionName)
    mean = mp_groups.mean()

    # calculate outliers as you would in a box plot
    q1 = mp_groups.quantile(0.25)
    q3 = mp_groups.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr

    delta_col = 'Delta' + col
    requantified[delta_col] = pd.np.NaN

    outlier_col = 'Outlier' + col
    requantified[outlier_col] = pd.np.NaN
    for n in lower.index:
        mask = requantified.InjectionName == n

        requantified.loc[mask, delta_col] = (requantified.loc[mask, col] - mean.loc[n]).abs()

        requantified.loc[mask, outlier_col] = (
            (requantified.loc[mask, col] < lower.loc[n]) |
            (requantified.loc[mask, col] > upper.loc[n])
        )

def get_outlier_peptides(requantified, injection_name, n=8, force=False, **kwargs):
    col_mz0 = 'ModelParamMz0MS2'
    col_rt0 = 'ModelParamRT0'
    col_sigma_rt = 'ModelParamSigmaRT'

    for col in [col_mz0, col_rt0, col_sigma_rt]:
        delta_col = 'Delta' + col
        outlier_col = 'Outlier' + col

        if delta_col not in requantified.columns or force:
            mark_outliers(requantified, col, **kwargs)

    outlier_col = 'PotentialOutlier'
    requantified[outlier_col] = (
        requantified['Outlier' + col_mz0] |
        requantified['Outlier' + col_rt0] |
        requantified['Outlier' + col_sigma_rt]
    )
    display(requantified[outlier_col].groupby([
        requantified['InjectionName'],
        requantified[outlier_col],
    ]).count())

    delta_col = 'PotentialOutlierDelta'
    requantified[delta_col] = (
        requantified['Delta' + col_mz0] *
        requantified['Delta' + col_rt0] *
        requantified['Delta' + col_sigma_rt]
    )

    outlier_peptide_ids = requantified.loc[
        (requantified.InjectionName == injection_name) &
        requantified[outlier_col]
    ].sort_values([delta_col, 'Intensity'], ascending=(False, False)).head(n)
    return outlier_peptide_ids.peptide_id.tolist()

SGS data

[12]:
plot_params(
    sgs_requantified,
    ['ModelParamSigmaRT', 'ModelParamSigmaMz', 'ModelParamRT0', 'ModelParamMz0MS1', 'ModelParamMz0MS2'],
    'ModelParam',
)
sgs_outlier_peptide_ids = get_outlier_peptides(
    requantified=sgs_requantified,
    injection_name=f'{sgs_file:03d}',
    force=True,
    whis=5.0,
)
../_images/jupyter_requant_24_0.png
InjectionName  PotentialOutlier
002            False               161
               True                 22
005            False               258
               True                 18
008            False               331
               True                  8
010            False               331
               True                 12
Name: PotentialOutlier, dtype: int64
[13]:
explore_raw_data(
    file=sgs_file,
    quantifiers=sgs_quantifiers,
    peptide_ids=sgs_outlier_peptide_ids,
)
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-011'
peptide_id, peptide, charge, rt, pre = '1158', 'YYDYTLSINGK(UniMod:259)', 2, 3657.93, 672.832
pro = [526.307, 639.392, 740.439, 1018.53]
../_images/jupyter_requant_25_1.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-012'
peptide_id, peptide, charge, rt, pre = '985', 'TIVGVSEDFAALR(UniMod:267)', 2, 4574.05, 694.376
pro = [587.354, 918.455, 1074.55, 1173.61]
../_images/jupyter_requant_25_3.png
--------------------------------------------------
ms2_name = 'ms2-012'
peptide_id, peptide, charge, rt, pre = '1051', 'VESPVLPPVLVPK(UniMod:259)', 2, 4738.54, 691.431
pro = [757.506, 870.59, 1066.71, 1153.74]
../_images/jupyter_requant_25_5.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-014'
peptide_id, peptide, charge, rt, pre = '212', 'EYTEGVNGQPSIR(UniMod:267)', 2, 2438.35, 730.356
pro = [667.376, 781.419, 937.509, 1066.55]
../_images/jupyter_requant_25_7.png
--------------------------------------------------
ms2_name = 'ms2-014'
peptide_id, peptide, charge, rt, pre = '695', 'MIVDPVEPHGEMK(UniMod:259)', 2, 2851.66, 745.367
pro = [706.343, 835.386, 1031.51, 1245.6]
../_images/jupyter_requant_25_9.png
--------------------------------------------------
ms2_name = 'ms2-014'
peptide_id, peptide, charge, rt, pre = '920', 'SDSSDTPPLPSPPGK(UniMod:259)', 2, 2804.66, 745.367
pro = [800.476, 897.528, 998.576, 1113.6]
../_images/jupyter_requant_25_11.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-015'
peptide_id, peptide, charge, rt, pre = '239', 'FPSPVSHADDLYGK(UniMod:259)', 2, 3293.18, 770.88
pro = [789.387, 1013.48, 1209.6, 1296.63]
../_images/jupyter_requant_25_13.png
--------------------------------------------------
ms2_name = 'ms2-015'
peptide_id, peptide, charge, rt, pre = '729', 'NTDFNGVNNIHQK(UniMod:259)', 2, 2128.47, 754.87
pro = [647.371, 917.504, 1031.55, 1178.62]
../_images/jupyter_requant_25_15.png

ProCan90 data

[14]:
plot_params(
    procan90_requantified,
    ['ModelParamSigmaRT', 'ModelParamSigmaMz', 'ModelParamRT0', 'ModelParamMz0MS1', 'ModelParamMz0MS2'],
    'ModelParam',
)
procan90_outlier_peptide_ids = get_outlier_peptides(
    requantified=procan90_requantified,
    injection_name=f'ProCan90-M{procan90_file}',
    force=True,
)
../_images/jupyter_requant_27_0.png
InjectionName    PotentialOutlier
ProCan90-M03-01  False               17022
                 True                 2377
ProCan90-M06-07  False               23297
                 True                 3138
Name: PotentialOutlier, dtype: int64
[15]:
explore_raw_data(
    file=procan90_file,
    quantifiers=procan90_quantifiers,
    peptide_ids=procan90_outlier_peptide_ids,
)
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-004'
peptide_id, peptide, charge, rt, pre = '122878', 'QNVPIIR', 2, 1062.29, 420.2585144
pro = [401.28707886, 498.33984375, 597.40826416, 711.45117188]
../_images/jupyter_requant_28_1.png
--------------------------------------------------
ms2_name = 'ms2-004'
peptide_id, peptide, charge, rt, pre = '132921', 'REDFLR', 2, 1997.44, 418.22467041
pro = [435.27142334, 550.29840088, 661.3303833]
../_images/jupyter_requant_28_3.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-049'
peptide_id, peptide, charge, rt, pre = '38727', 'HGEPEEDIVGLQAFQER', 3, 2662.94, 651.98156738
pro = [579.28851318, 778.38421631, 794.29516602, 907.37921143, 948.48974609, 1063.4691162]
../_images/jupyter_requant_28_5.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-050'
peptide_id, peptide, charge, rt, pre = '114165', 'NADMQEELIASLEEQLK', 3, 4136.85, 654.3225708
pro = [517.29803467, 759.42468262, 818.29852295, 846.45672607, 917.49383545, 931.38256836]
../_images/jupyter_requant_28_7.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-057'
peptide_id, peptide, charge, rt, pre = '195660', 'VNITPAEVGVLVGK', 2, 3101.9, 698.41394043
pro = [416.28674316, 525.30310059, 572.37664795, 871.52471924, 968.57751465, 1069.6252441]
../_images/jupyter_requant_28_9.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-064'
peptide_id, peptide, charge, rt, pre = '18270', 'EAGNINQSLLTLGR', 2, 2626.65, 743.40460205
pro = [446.27215576, 559.35620117, 672.44030762, 759.47229004, 887.53088379, 1001.5737915]
../_images/jupyter_requant_28_11.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-089'
peptide_id, peptide, charge, rt, pre = '192141', 'VEEASPGRPSSVDTLLSPTALIDSILR', 3, 4217.8, 941.84143066
pro = [488.31912231, 603.34606934, 716.43011475, 1098.6517334, 1185.6837158, 1311.6175537]
../_images/jupyter_requant_28_13.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-096'
peptide_id, peptide, charge, rt, pre = '179998', 'SLLSIPNTDYIQLLSEIAK', 2, 4206.23, 1059.5882568
pro = [660.39263916, 773.47674561, 901.53533936, 1177.6827393, 1393.7573242, 1604.8530273]
../_images/jupyter_requant_28_15.png

Building a python model

[16]:
import scipy
import scipy.optimize
import scipy.special


class Requant():
    def __init__(
        self,
        tof_fname,
        rt_pixel_half_width=15,
        mz_pixel_half_width=20,
        num_ms1_isotopes=1,
        debug_plots=False,
    ):
        self.swath_run = toffee.SwathRun(tof_fname)
        self.ms1_swath_map = self.swath_run.loadSwathMap(toffee.ToffeeWriter.MS1_NAME)
        self.rt_pixel_half_width = rt_pixel_half_width
        self.mz_pixel_half_width = mz_pixel_half_width
        self.num_ms1_isotopes = num_ms1_isotopes
        self.debug_plots = debug_plots

    def run(self, ms2_name, peak_groups):
        ms2_swath_map = self.swath_run.loadSwathMap(ms2_name)
        results = []
        for peak_group in peak_groups:
            assert isinstance(peak_group, requant.RequantPeakGroup)
            raw_data = self._extract_raw_data(ms2_swath_map, peak_group)
            if raw_data is None:
                continue
            ms1_intensity, ms2_intensity = self._fit_intensities(*raw_data)
            results.append((peak_group.modifiedSequence(), peak_group.charge(), ms1_intensity, ms2_intensity))
        return results

    def _extract_raw_data(self, ms2_swath_map, peak_group):
        rt_range = toffee.RetentionTimeRangeWithPixelHalfWidth(
            peak_group.retentionTime(),
            self.rt_pixel_half_width,
        )
        expected_shape = (2 * self.rt_pixel_half_width + 1, 2 * self.mz_pixel_half_width + 1)
        raw_data = []
        normalisers = []
        delta_rt = []
        delta_mz = []

        def extract(sm, mz):
            mz_range = toffee.MassOverChargeRangeWithPixelHalfWidth(mz, self.mz_pixel_half_width)
            chrom = sm.filteredExtractedIonChromatogram(mz_range, rt_range)
            if chrom.intensities.shape == expected_shape:
                r = chrom.intensities.astype(np.float)
                n = max(1, r.max())
                raw_data.append(r / n)
                normalisers.append(n)
                delta_rt.append(chrom.retentionTime[1] - chrom.retentionTime[0])
                delta_mz.append(np.diff(chrom.massOverCharge).mean())
            else:
                print(mz, chrom.intensities.shape, '!=', expected_shape)

        # MS1 data
        for i in range(self.num_ms1_isotopes + 1):
            isotope_mass_offset = 1.0033548
            mz = peak_group.massOverCharge() + i * isotope_mass_offset / peak_group.charge()
            extract(self.ms1_swath_map, mz)
        if len(raw_data) == 0:
            return None

        # MS2 data
        for fragment_mz in peak_group.fragmentMassOverCharges():
            extract(ms2_swath_map, fragment_mz)

        return delta_rt, delta_mz, raw_data, normalisers

    def _fit_intensities(self, delta_rt, delta_mz, raw_data, normalisers):
        n_ms1 = self.num_ms1_isotopes + 1
        n_frag = len(raw_data)
        n_rt = self.rt_pixel_half_width * 2 + 1
        n_mz = self.mz_pixel_half_width * 2 + 1
        assert n_mz > n_rt
        n_slice = n_rt * n_mz

        rt = np.arange(n_rt) + 0.5
        mz = np.arange(n_mz) + 0.5
        rt, mz = np.meshgrid(rt, mz, indexing='ij')

        raw_data_vec = np.zeros((n_frag * n_rt * n_mz,))
        rt_vec = np.zeros_like(raw_data_vec)
        mz_vec = np.zeros_like(raw_data_vec)
        for i_frag, raw in enumerate(raw_data):
            idx = slice(i_frag * n_slice, (i_frag + 1) * n_slice)
            raw_data_vec[idx] = raw.ravel()
            rt_vec[idx] = rt.ravel()
            mz_vec[idx] = mz.ravel()
            for i_rt in range(n_rt):
                for i_mz in range(n_mz):
                    idx = i_frag * n_slice + i_rt * n_mz + i_mz
                    assert raw_data_vec[idx] == raw[i_rt, i_mz]
                    assert rt_vec[idx] == rt[i_rt, i_mz]
                    assert mz_vec[idx] == mz[i_rt, i_mz]
        # pre-allocate for fast calculations
        fit_data_vec = np.zeros_like(raw_data_vec)
        sigma_rt_vec = np.zeros_like(raw_data_vec)
        sigma_mz_vec = np.zeros_like(raw_data_vec)
        rt0_vec = np.zeros_like(raw_data_vec)
        mz0_vec = np.zeros_like(raw_data_vec)
        amplitude_vec = np.zeros_like(raw_data_vec)
        chem_noise_amplitude_vec = np.zeros_like(raw_data_vec)

        radius_vec = np.zeros_like(raw_data_vec)

        def sigmoid(n, x):
            # n * scipy.special.expit(x)
            return 0.5 * n * (1 + np.tanh(x))

        def guassian_2d(sigma_rt, sigma_mz, rt0, mz0_ms1, mz0_ms2, amplitude, chem_noise_amplitude):

            # set up vectors, enforce positivity and ranges
            rt0_vec[:] = sigmoid(n_rt, rt0)
            sigma_rt_vec[:] = sigma_rt ** 2
            sigma_mz_vec[:] = sigma_mz ** 2
            for i_frag in range(n_frag):
                idx = slice(i_frag * n_slice, (i_frag + 1) * n_slice)
                amplitude_vec[idx] = amplitude[i_frag] ** 2
                if i_frag < n_ms1:
                    mz0_vec[idx] = sigmoid(n_mz, mz0_ms1)
                    chem_noise_amplitude_vec[idx] = chem_noise_amplitude[i_frag] ** 2
                else:
                    mz0_vec[idx] = sigmoid(n_mz, mz0_ms2)

            # calculate the fit model
            exp_rt = -0.5 * ((np.log(rt_vec) - np.log(rt0_vec)) / sigma_rt_vec) ** 2
            denom_rt = np.sqrt(2 * np.pi) * rt_vec * sigma_rt_vec

            exp_mz = -0.5 * ((mz_vec - mz0_vec) / sigma_mz_vec) ** 2

            signal_vec = amplitude_vec * np.exp(exp_rt + exp_mz) / denom_rt
            chem_noise_vec = chem_noise_amplitude_vec * np.exp(exp_mz)

            return signal_vec + chem_noise_vec

        def all_guassian_2d(sigma_rt, sigma_mz, rt0, mz0_ms1, mz0_ms2, *amplitudes_and_noise):
            assert len(amplitudes_and_noise) == n_frag + n_ms1
            fit_data_vec[:] = guassian_2d(
                sigma_rt,
                sigma_mz,
                rt0,
                mz0_ms1,
                mz0_ms2,
                amplitudes_and_noise[:n_frag],
                amplitudes_and_noise[n_frag:],
            )

        def residual(params):
            all_guassian_2d(*params)
            return fit_data_vec - raw_data_vec

        # sigma_x, sigma_y, x0, y0_ms1, y0_ms2
        guess = [0.5, 2.0, 0.0, 0.0, 0.0]
        n = len(guess)
        # add amplitude vector
        guess = guess + [1.0] * n_frag
        # add chemical noise vector
        guess = guess + [0.1] * n_ms1

        result = scipy.optimize.least_squares(residual, guess, method='lm')

        ms1_intensity, ms2_intensity = np.NaN, np.NaN
        if result.success:
            # print(result)
            # print('')
            # print('n_frag', n_frag)
            # print('sigma_x, sigma_y, x0', result.x[:n])
            # print('y0', result.x[n:n + n_frag])
            # print('amplitude', result.x[n + n_frag:n + 2 * n_frag])
            # print('noise', result.x[n + 2 * n_frag:])

            if self.debug_plots:
                self._make_debug_plots(raw_data_vec, fit_data_vec)

            # calculate final fit
            params = result.x.copy()
            params[-n_ms1:] = 0.0  # set checmical noise to zero
            all_guassian_2d(*params)

            area_under_curve = []
            raw_area_under_curve = []
            for i_frag in range(n_frag):
                idx = slice(i_frag * n_slice, (i_frag + 1) * n_slice)
                a = fit_data_vec[idx].sum()
                a *= normalisers[i_frag]
                a *= delta_rt[i_frag]
                a *= delta_mz[i_frag]
                area_under_curve.append(a)
                a = raw_data_vec[idx].sum()
                a *= normalisers[i_frag]
                a *= delta_rt[i_frag]
                a *= delta_mz[i_frag]
                raw_area_under_curve.append(a)

            # integrate intensity -- subtracting noise
            ms1_intensity = sum(area_under_curve[:n_ms1])
            ms2_intensity = sum(area_under_curve[n_ms1:])
            raw_ms1_intensity = sum(raw_area_under_curve[:n_ms1])
            raw_ms2_intensity = sum(raw_area_under_curve[n_ms1:])
            print('Raw MS1 & MS2:', raw_ms1_intensity, raw_ms2_intensity)
            print('Fit MS1 & MS2:', ms1_intensity, ms2_intensity)

        return ms1_intensity, ms2_intensity

    def _make_debug_plots(self, raw_data_vec, fit_data_vec):
        import matplotlib.pyplot as plt

        n_rt = self.rt_pixel_half_width * 2 + 1
        n_mz = self.mz_pixel_half_width * 2 + 1
        assert n_mz > n_rt
        n_slice = n_rt * n_mz
        n_frag = raw_data_vec.shape[0] // n_slice

        raw_data = []
        fit_data = []
        for i_frag in range(n_frag):
            idx = slice(i_frag * n_slice, (i_frag + 1) * n_slice)
            raw_data.append(raw_data_vec[idx].reshape(n_rt, n_mz))
            fit_data.append(fit_data_vec[idx].reshape(n_rt, n_mz))

        ncols = len(raw_data)
        nrows = 4
        scale = 2
        figsize = (scale * ncols * 1.5, scale * nrows)
        cmap = plt.cm.viridis
        for take_transpose in [False, True]:
            fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
            for raw, fit, (ax1, ax2, ax3, ax4) in zip(raw_data, fit_data, axes.T):
                x = np.arange(n_rt)
                y = np.arange(n_mz)
                if take_transpose:
                    raw = raw.T
                    fit = fit.T
                    x, y = y, x
                diff = raw - fit
                vmin, vmax = 0, max(raw.max(), fit.max())
                levels = np.linspace(vmin, vmax, num=30)

                ax1.contourf(x, y, raw.T, vmin=vmin, vmax=vmax, levels=levels, cmap=cmap)
                ax2.contourf(x, y, fit.T, vmin=vmin, vmax=vmax, levels=levels, cmap=cmap)
                ax3.contourf(x, y, np.abs(diff).T, vmin=vmin, vmax=vmax, levels=levels, cmap=cmap)
                for ax in (ax1, ax2, ax3):
                    ax.get_xaxis().set_visible(False)
                    ax.get_yaxis().set_visible(False)

                ax4.plot(x, np.sum(raw, axis=1))
                ax4.plot(x, np.sum(fit, axis=1))
                ax4.plot(x, np.sum(diff, axis=1))

            x_label = 'RT' if not take_transpose else 'm/z'
            y_label = 'RT' if take_transpose else 'm/z'
            axes[0, 0].set_ylabel(f'Raw data ({y_label})')
            axes[1, 0].set_ylabel(f'Fit data ({y_label})')
            axes[2, 0].set_ylabel(f'Difference ({y_label})')
            axes[3, 0].set_ylabel(f'Integrated over {y_label}')
            for ax in axes[-1, :]:
                ax.set_xlabel(x_label)

            fig.tight_layout()
            plt.show()

def run_python_requant(file, quantifiers, peptide_ids=None):
    q = quantifiers[file]
    swath_run = toffee.SwathRun(q.toffee_filename)
    srl = q._add_ms2_windows(swath_run, q.srl)

    if peptide_ids is None:
        peptide_ids = q.scores \
            .sort_values(['m_score', 'Intensity'], ascending=(True, False)) \
            .iloc[100:115] \
            .peptide_id \
            .tolist()

    all_srl = srl.loc[srl.peptide_id.isin(peptide_ids)].copy()
    operator = Requant(q.toffee_filename, debug_plots=True)

    for ms2_name in sorted(all_srl.MS2Name.unique()):
        print('*' * 80)
        ms2_srl = all_srl.loc[all_srl.MS2Name == ms2_name]
        for peptide_id in ms2_srl.peptide_id.unique():
            print('-' * 50)

            fragments = ms2_srl.loc[ms2_srl.peptide_id == peptide_id]
            pre = fragments['PrecursorMz'].tolist()[:1]
            pro = fragments['ProductMz'].tolist()[:6]

            peptide = q.scores.loc[q.scores.peptide_id == peptide_id, 'FullPeptideName'].iloc[0]
            charge = q.scores.loc[q.scores.peptide_id == peptide_id, 'Charge'].iloc[0]
            rank = q.scores.loc[q.scores.peptide_id == peptide_id, 'peak_group_rank'].iloc[0]
            rt = q.scores.loc[q.scores.peptide_id == peptide_id, 'RT'].iloc[0]

            print(f'ms2_name = \'{ms2_name}\'')
            print(f'peptide, charge, rank, rt, pre = \'{peptide}\', {charge}, {rank}, {rt}, {pre[0]}')
            print(f'pro = {pro}')
            assert len(pre) > 0
            assert len(pro) > 0

            peak_group = requant.RequantPeakGroup(peptide, charge, rank, rt, pre[0], pro)
            print(operator.run(ms2_name, [peak_group]))

Quantify & visualise the high scoring peptides

SGS data

[17]:
run_python_requant(
    file=sgs_file,
    quantifiers=sgs_quantifiers,
    peptide_ids=sgs_peptide_ids,
)
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-008'
peptide, charge, rank, rt, pre = 'IVDVDLTSEGK(UniMod:259)', 2, 1, 2962.99, 592.318
pro = [642.355, 757.382, 856.45, 971.477]
../_images/jupyter_requant_33_1.png
../_images/jupyter_requant_33_2.png
Raw MS1 & MS2: 1901.0151427865394 832.5617156912243
Fit MS1 & MS2: 1913.1799130745012 614.3915440003645
[('IVDVDLTSEGK(UniMod:259)', 2, 1913.1799130745012, 614.3915440003645)]
--------------------------------------------------
ms2_name = 'ms2-008'
peptide, charge, rank, rt, pre = 'NPDGTVTVISR(UniMod:267)', 2, 1, 2360.46, 584.813
pro = [484.312, 684.428, 842.497, 957.524]
../_images/jupyter_requant_33_4.png
../_images/jupyter_requant_33_5.png
Raw MS1 & MS2: 1068.797569083983 226.6386387566327
Fit MS1 & MS2: 964.1610009504825 201.86531477130598
[('NPDGTVTVISR(UniMod:267)', 2, 964.1610009504825, 201.86531477130598)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-010'
peptide, charge, rank, rt, pre = 'IVEIYGPESSGK(UniMod:259)', 2, 1, 2900.32, 643.84
pro = [669.329, 832.393, 945.477, 1074.52]
../_images/jupyter_requant_33_7.png
../_images/jupyter_requant_33_8.png
Raw MS1 & MS2: 1343.0514804124023 587.4011974819572
Fit MS1 & MS2: 1361.4908623989802 540.0065696055367
[('IVEIYGPESSGK(UniMod:259)', 2, 1361.4908623989802, 540.0065696055367)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-014'
peptide, charge, rank, rt, pre = 'SDSSDTPPLPSPPGK(UniMod:259)', 2, 1, 2804.66, 745.367
pro = [800.476, 897.528, 998.576, 1113.6]
../_images/jupyter_requant_33_10.png
../_images/jupyter_requant_33_11.png
Raw MS1 & MS2: 592.4653454073219 122.54244489367036
Fit MS1 & MS2: 505.67140106941656 88.20516067864703
[('SDSSDTPPLPSPPGK(UniMod:259)', 2, 505.67140106941656, 88.20516067864703)]

ProCan90 data

[18]:
run_python_requant(
    file=procan90_file,
    quantifiers=procan90_quantifiers,
    peptide_ids=procan90_peptide_ids,
)
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-024'
peptide, charge, rank, rt, pre = 'VSFELFADK', 2, 1, 3025.44, 528.27404785
pro = [435.22381592, 480.24526978, 593.3293457, 722.37194824, 869.44036865, 956.47235107]
../_images/jupyter_requant_35_1.png
../_images/jupyter_requant_35_2.png
Raw MS1 & MS2: 28778.380265487547 6040.770575850728
Fit MS1 & MS2: 24396.972393554213 5284.702891876719
[('VSFELFADK', 2, 24396.972393554213, 5284.702891876719)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-035'
peptide, charge, rank, rt, pre = 'AIPQLQGYLR', 2, 1, 2607.96, 579.83514404
pro = [487.77453613, 508.28781128, 636.34637451, 749.43041992, 877.48901367, 974.54180908]
../_images/jupyter_requant_35_4.png
../_images/jupyter_requant_35_5.png
Raw MS1 & MS2: 30576.567646871867 4026.2126532143407
Fit MS1 & MS2: 12613.59797450274 3776.8008016967888
[('AIPQLQGYLR', 2, 12613.59797450274, 3776.8008016967888)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-048'
peptide, charge, rank, rt, pre = 'NALESYAFNMK', 2, 1, 2491.33, 644.30554199
pro = [539.26464844, 610.30175781, 773.36505127, 860.39709473, 989.43969727, 1102.5238037]
../_images/jupyter_requant_35_7.png
../_images/jupyter_requant_35_8.png
Raw MS1 & MS2: 69309.96573677666 4298.2830446508515
Fit MS1 & MS2: 29909.713574711433 4079.6536675590323
[('NALESYAFNMK', 2, 29909.713574711433, 4079.6536675590323)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-049'
peptide, charge, rank, rt, pre = 'VIHDNFGIVEGLMTTVHAITATQK', 4, 1, 4158.86, 649.59545898
pro = [548.30383301, 579.28851318, 732.42504883, 783.37841797, 869.48394775, 896.46246338]
../_images/jupyter_requant_35_10.png
../_images/jupyter_requant_35_11.png
Raw MS1 & MS2: 36075.46972826726 4581.8361977096065
Fit MS1 & MS2: 31349.392247620683 3715.5025255865135
[('VIHDNFGIVEGLMTTVHAITATQK', 4, 31349.392247620683, 3715.5025255865135)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-055'
peptide, charge, rank, rt, pre = 'GGGGNFGPGPGSNFR', 2, 1, 1180.7, 689.31835938
pro = [580.28381348, 677.33654785, 734.35803223, 831.4107666, 888.43225098, 1035.5006104]
../_images/jupyter_requant_35_13.png
../_images/jupyter_requant_35_14.png
Raw MS1 & MS2: 18984.09727504625 3965.848553410137
Fit MS1 & MS2: 15462.107708268175 3531.5971517279813
[('GGGGNFGPGPGSNFR', 2, 15462.107708268175, 3531.5971517279813)]

Quantify & visualise the outlier peptides

SGS data

[19]:
run_python_requant(
    file=sgs_file,
    quantifiers=sgs_quantifiers,
    peptide_ids=sgs_outlier_peptide_ids,
)
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-011'
peptide, charge, rank, rt, pre = 'YYDYTLSINGK(UniMod:259)', 2, 1, 3657.93, 672.832
pro = [526.307, 639.392, 740.439, 1018.53]
../_images/jupyter_requant_38_1.png
../_images/jupyter_requant_38_2.png
Raw MS1 & MS2: 93.2215802795607 5.8908822504070155
Fit MS1 & MS2: 6.277257202415132e-06 3.8327705199061106
[('YYDYTLSINGK(UniMod:259)', 2, 6.277257202415132e-06, 3.8327705199061106)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-012'
peptide, charge, rank, rt, pre = 'TIVGVSEDFAALR(UniMod:267)', 2, 1, 4574.05, 694.376
pro = [587.354, 918.455, 1074.55, 1173.61]
../_images/jupyter_requant_38_4.png
../_images/jupyter_requant_38_5.png
Raw MS1 & MS2: 30.68848719714169 3.290575667970753
Fit MS1 & MS2: 2.5708773559614384 0.5779817408456864
[('TIVGVSEDFAALR(UniMod:267)', 2, 2.5708773559614384, 0.5779817408456864)]
--------------------------------------------------
ms2_name = 'ms2-012'
peptide, charge, rank, rt, pre = 'VESPVLPPVLVPK(UniMod:259)', 2, 1, 4738.54, 691.431
pro = [757.506, 870.59, 1066.71, 1153.74]
../_images/jupyter_requant_38_7.png
../_images/jupyter_requant_38_8.png
Raw MS1 & MS2: 30.267127515105777 3.626600767829161
Fit MS1 & MS2: 12.886623304127127 3.457505682641535
[('VESPVLPPVLVPK(UniMod:259)', 2, 12.886623304127127, 3.457505682641535)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-014'
peptide, charge, rank, rt, pre = 'EYTEGVNGQPSIR(UniMod:267)', 2, 1, 2438.35, 730.356
pro = [667.376, 781.419, 937.509, 1066.55]
../_images/jupyter_requant_38_10.png
../_images/jupyter_requant_38_11.png
Raw MS1 & MS2: 571.6679698524539 159.05646570609215
Fit MS1 & MS2: 569.5681733014899 134.3360836457496
[('EYTEGVNGQPSIR(UniMod:267)', 2, 569.5681733014899, 134.3360836457496)]
--------------------------------------------------
ms2_name = 'ms2-014'
peptide, charge, rank, rt, pre = 'MIVDPVEPHGEMK(UniMod:259)', 2, 1, 2851.66, 745.367
pro = [706.343, 835.386, 1031.51, 1245.6]
../_images/jupyter_requant_38_13.png
../_images/jupyter_requant_38_14.png
Raw MS1 & MS2: 1431.3119159291473 33.13252129485163
Fit MS1 & MS2: 732.8624744794847 1.724481319894012
[('MIVDPVEPHGEMK(UniMod:259)', 2, 732.8624744794847, 1.724481319894012)]
--------------------------------------------------
ms2_name = 'ms2-014'
peptide, charge, rank, rt, pre = 'SDSSDTPPLPSPPGK(UniMod:259)', 2, 1, 2804.66, 745.367
pro = [800.476, 897.528, 998.576, 1113.6]
../_images/jupyter_requant_38_16.png
../_images/jupyter_requant_38_17.png
Raw MS1 & MS2: 592.4653454073219 122.54244489367036
Fit MS1 & MS2: 505.67140106941656 88.20516067864703
[('SDSSDTPPLPSPPGK(UniMod:259)', 2, 505.67140106941656, 88.20516067864703)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-015'
peptide, charge, rank, rt, pre = 'FPSPVSHADDLYGK(UniMod:259)', 2, 1, 3293.18, 770.88
pro = [789.387, 1013.48, 1209.6, 1296.63]
../_images/jupyter_requant_38_19.png
../_images/jupyter_requant_38_20.png
Raw MS1 & MS2: 87.85798084824421 6.512684995738631
Fit MS1 & MS2: 56.37263352773063 3.721348294688908
[('FPSPVSHADDLYGK(UniMod:259)', 2, 56.37263352773063, 3.721348294688908)]
--------------------------------------------------
ms2_name = 'ms2-015'
peptide, charge, rank, rt, pre = 'NTDFNGVNNIHQK(UniMod:259)', 2, 1, 2128.47, 754.87
pro = [647.371, 917.504, 1031.55, 1178.62]
../_images/jupyter_requant_38_22.png
../_images/jupyter_requant_38_23.png
Raw MS1 & MS2: 77.90595348022188 7.52938575325957
Fit MS1 & MS2: 61.73288675398399 3.375192854570683
[('NTDFNGVNNIHQK(UniMod:259)', 2, 61.73288675398399, 3.375192854570683)]

ProCan90 data

[20]:
run_python_requant(
    file=procan90_file,
    quantifiers=procan90_quantifiers,
    peptide_ids=procan90_outlier_peptide_ids,
)
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-004'
peptide, charge, rank, rt, pre = 'QNVPIIR', 2, 1, 1062.29, 420.2585144
pro = [401.28707886, 498.33984375, 597.40826416, 711.45117188]
../_images/jupyter_requant_40_1.png
../_images/jupyter_requant_40_2.png
Raw MS1 & MS2: 2748.484368356817 66.35065408147446
Fit MS1 & MS2: 2799.4163691023036 45.93175784643054
[('QNVPIIR', 2, 2799.4163691023036, 45.93175784643054)]
--------------------------------------------------
ms2_name = 'ms2-004'
peptide, charge, rank, rt, pre = 'REDFLR', 2, 1, 1997.44, 418.22467041
pro = [435.27142334, 550.29840088, 661.3303833]
../_images/jupyter_requant_40_4.png
../_images/jupyter_requant_40_5.png
Raw MS1 & MS2: 1371.4563234616016 75.66820364750609
Fit MS1 & MS2: 148.42967820914942 24.819024099327837
[('REDFLR', 2, 148.42967820914942, 24.819024099327837)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-049'
peptide, charge, rank, rt, pre = 'HGEPEEDIVGLQAFQER', 3, 1, 2662.94, 651.98156738
pro = [579.28851318, 778.38421631, 794.29516602, 907.37921143, 948.48974609, 1063.4691162]
../_images/jupyter_requant_40_7.png
../_images/jupyter_requant_40_8.png
Raw MS1 & MS2: 2602.4031337189504 258.6497463384277
Fit MS1 & MS2: 1916.315200109147 267.09342144412926
[('HGEPEEDIVGLQAFQER', 3, 1916.315200109147, 267.09342144412926)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-050'
peptide, charge, rank, rt, pre = 'NADMQEELIASLEEQLK', 3, 1, 4136.85, 654.3225708
pro = [517.29803467, 759.42468262, 818.29852295, 846.45672607, 917.49383545, 931.38256836]
../_images/jupyter_requant_40_10.png
../_images/jupyter_requant_40_11.png
Raw MS1 & MS2: 11887.751376622578 206.13587926308224
Fit MS1 & MS2: 10432.259282198298 14.468248025629837
[('NADMQEELIASLEEQLK', 3, 10432.259282198298, 14.468248025629837)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-057'
peptide, charge, rank, rt, pre = 'VNITPAEVGVLVGK', 2, 1, 3101.9, 698.41394043
pro = [416.28674316, 525.30310059, 572.37664795, 871.52471924, 968.57751465, 1069.6252441]
../_images/jupyter_requant_40_13.png
../_images/jupyter_requant_40_14.png
Raw MS1 & MS2: 1822.2111671039456 88.9707300880708
Fit MS1 & MS2: 1422.579423536166 40.31177107899468
[('VNITPAEVGVLVGK', 2, 1422.579423536166, 40.31177107899468)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-064'
peptide, charge, rank, rt, pre = 'EAGNINQSLLTLGR', 2, 1, 2626.65, 743.40460205
pro = [446.27215576, 559.35620117, 672.44030762, 759.47229004, 887.53088379, 1001.5737915]
../_images/jupyter_requant_40_16.png
../_images/jupyter_requant_40_17.png
Raw MS1 & MS2: 5431.12478549956 124.70365258035318
Fit MS1 & MS2: 3157.8429424718797 36.32079911217963
[('EAGNINQSLLTLGR', 2, 3157.8429424718797, 36.32079911217963)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-089'
peptide, charge, rank, rt, pre = 'VEEASPGRPSSVDTLLSPTALIDSILR', 3, 1, 4217.8, 941.84143066
pro = [488.31912231, 603.34606934, 716.43011475, 1098.6517334, 1185.6837158, 1311.6175537]
../_images/jupyter_requant_40_19.png
../_images/jupyter_requant_40_20.png
Raw MS1 & MS2: 2300.6600852753154 130.07829467804535
Fit MS1 & MS2: 699.0792733909628 98.30380772826076
[('VEEASPGRPSSVDTLLSPTALIDSILR', 3, 699.0792733909628, 98.30380772826076)]
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-096'
peptide, charge, rank, rt, pre = 'SLLSIPNTDYIQLLSEIAK', 2, 1, 4206.23, 1059.5882568
pro = [660.39263916, 773.47674561, 901.53533936, 1177.6827393, 1393.7573242, 1604.8530273]
../_images/jupyter_requant_40_22.png
../_images/jupyter_requant_40_23.png
Raw MS1 & MS2: 2132.882937447126 114.49477907659808
Fit MS1 & MS2: 228.34049788928704 70.15997919249725
[('SLLSIPNTDYIQLLSEIAK', 2, 228.34049788928704, 70.15997919249725)]

Build testing data for the C++ library

We already use data from the DIA_TEST_DATA_REPO in our unit testing, so use the python implementation to build test data for the C++ tests.

[21]:
peptides_charges_and_rt = [
    ('DAVTPADFSEWSK', 2, 2609.69),
    ('EATFGVDESANK', 2, 830.17),
    ('GDLDAASYYAPVR', 2, 1882.61),
    # ms2-025
    ('ELHINLIPNKQDR', 3, 1842.77),
    ('KSDIDEIVLVGGSTR', 3, 2231.21),
    # ms2-052
    ('LAAEESVGTMGNR', 2, 826.85),
    ('TPVISGGPYYER', 2, 1467.61),
    # ms2-056
    ('TLEEDEEELFK', 2, 2659.49),
    ('VQVALEELQDLK', 2, 3044.61),
    # ms2-066
    ('GILAAEESVGTMGNR', 2, 1962.29),
    ('SPFTVGVAAPLDLSK', 2, 3263.73),
]
peptides = [p[0] for p in peptides_charges_and_rt]
srl_fname = base_dir + '/ProCan90/srl/hek_srl.OpenSwath.iRT.tsv'
tof_fname = base_dir + '/ProCan90/tof/ProCan90-M03-01.iRT.tof'
[22]:
plotter = ToffeeFragmentsPlotter(tof_fname)
[23]:
srl = pd.read_csv(srl_fname, sep='\t')
srl = srl.loc[srl.PeptideSequence.isin(peptides)]

ms2_map = plotter.extractor.swath_run.mapPrecursorsToMS2Names(srl.PrecursorMz.unique(), 0.0, 1.0)
srl['MS2Name'] = srl.PrecursorMz.map(ms2_map)

n_isotopes = 0
precursors, products = calculate_isotope_mz(
    precursor_and_product_df=srl,
    n_isotopes=n_isotopes,
    sort_by_intensity=True,
)

all_srl = srl.copy()
for ms2_name in sorted(all_srl.MS2Name.unique()):
    print('*' * 80)
    srl = all_srl.loc[all_srl.MS2Name == ms2_name]
    for peptide, charge, rt in [p for p in peptides_charges_and_rt if p[0] in srl.PeptideSequence.unique()]:
        print('-' * 50)
        pqp = srl.loc[srl.PeptideSequence == peptide, 'TransitionGroupId'].unique()[0]
        ms2_name = srl.loc[srl.PeptideSequence == peptide, 'MS2Name'].unique()[0]
        pre = precursors.loc[precursors.Id == pqp, 'IsotopeMz'].tolist()
        pro = products.loc[products.GroupId == pqp, 'IsotopeMz'].tolist()[:6]
        assert len(pre) > 0
        assert len(pro) > 0
        print(f'ms2_name = \'{ms2_name}\'')
        print(f'peptide, charge, rt, pre = \'{peptide}\', {charge}, {rt}, {pre[0]}')
        print(f'pro = {pro}')
        output_fname = '/tmp/fig.png'
        fig = plotter.create_plotly_figure(
            precursor_mz_list=pre,
            product_mz_list=pro,
            psm_name='{} | extraction_width = {} {}'.format(
                plotter.tof_fname.split('/')[-1].split('.')[0],
                plotter.extractor.extraction_width,
                'ppm' if plotter.extractor.use_ppm else 'px',
            ),
            number_of_isotopes=n_isotopes,
            log_heatmap=True,
            normalise_per_fragment=False,
            retention_time_line=rt,
            output_fname=output_fname,
        )
        display(Image(output_fname))
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-025'
peptide, charge, rt, pre = 'ELHINLIPNKQDR', 3, 1842.77, 530.630126953125
pro = [757.395141601563, 607.31982421875, 720.403930664063, 660.342346191406, 546.2994384765631, 1097.60620117188]
../_images/jupyter_requant_44_1.png
--------------------------------------------------
ms2_name = 'ms2-025'
peptide, charge, rt, pre = 'KSDIDEIVLVGGSTR', 3, 2231.21, 530.28955078125
pro = [576.309997558594, 801.3988647460941, 689.39404296875, 688.314819335938, 900.46728515625, 559.272216796875]
../_images/jupyter_requant_44_3.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-046'
peptide, charge, rt, pre = 'EATFGVDESANK', 2, 830.17, 634.293701171875
pro = [819.38427734375, 663.29443359375, 419.22488403320295, 966.4526977539059, 548.2674560546881, 762.3628540039059]
../_images/jupyter_requant_44_5.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-052'
peptide, charge, rt, pre = 'LAAEESVGTMGNR', 2, 826.85, 667.822082519531
pro = [635.29296875, 821.393432617188, 950.43603515625, 1150.51574707031, 734.3613891601559, 1079.47863769531]
../_images/jupyter_requant_44_7.png
--------------------------------------------------
ms2_name = 'ms2-052'
peptide, charge, rt, pre = 'TPVISGGPYYER', 2, 1467.61, 669.838073730469
pro = [928.415893554688, 1041.5, 1140.568359375, 841.383911132813, 784.362426757813, 727.340942382813]
../_images/jupyter_requant_44_9.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-056'
peptide, charge, rt, pre = 'TLEEDEEELFK', 2, 2659.49, 691.3220825195309
pro = [1167.50524902344, 1038.46264648438, 909.4199829101559, 794.39306640625, 717.2937622070309, 846.3363647460941]
../_images/jupyter_requant_44_11.png
--------------------------------------------------
ms2_name = 'ms2-056'
peptide, charge, rt, pre = 'VQVALEELQDLK', 2, 3044.61, 692.8877563476559
pro = [1058.57287597656, 1157.64123535156, 987.5357055664059, 616.366455078125, 874.45166015625, 745.409057617188]
../_images/jupyter_requant_44_13.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-057'
peptide, charge, rt, pre = 'GDLDAASYYAPVR', 2, 1882.61, 699.3384399414059
pro = [855.435913085938, 926.473022460938, 997.510131835938, 768.403930664063, 605.340576171875, 1112.537109375]
../_images/jupyter_requant_44_15.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-062'
peptide, charge, rt, pre = 'DAVTPADFSEWSK', 2, 2609.69, 726.835693359375
pro = [1066.48400878906, 969.4312133789059, 1167.53173828125, 420.22415161132795, 898.3941040039059, 783.3671875]
../_images/jupyter_requant_44_17.png
********************************************************************************
--------------------------------------------------
ms2_name = 'ms2-066'
peptide, charge, rt, pre = 'GILAAEESVGTMGNR', 2, 1962.29, 752.8748168945309
pro = [1150.51574707031, 1221.55285644531, 635.29296875, 821.393432617188, 1079.47863769531, 950.43603515625]
../_images/jupyter_requant_44_19.png
--------------------------------------------------
ms2_name = 'ms2-066'
peptide, charge, rt, pre = 'SPFTVGVAAPLDLSK', 2, 3263.73, 751.4166870117191
pro = [970.556762695313, 814.466918945313, 1069.62524414063, 1170.6728515625, 913.5353393554691, 1317.74133300781]
../_images/jupyter_requant_44_21.png
[24]:
operator = Requant(tof_fname, debug_plots=True)

for ms2_name in sorted(all_srl.MS2Name.unique()):
    print('*' * 80)
    srl = all_srl.loc[all_srl.MS2Name == ms2_name]
    for peptide, charge, rt in [p for p in peptides_charges_and_rt if p[0] in srl.PeptideSequence.unique()]:
        print('-' * 50)
        pqp = srl.loc[srl.PeptideSequence == peptide, 'TransitionGroupId'].unique()[0]
        charge = srl.loc[srl.PeptideSequence == peptide, 'PrecursorCharge'].unique()[0]
        ms2_name = srl.loc[srl.PeptideSequence == peptide, 'MS2Name'].unique()[0]
        rank = 1

        pre = precursors.loc[precursors.Id == pqp, 'IsotopeMz'].tolist()
        pro = products.loc[products.GroupId == pqp, 'IsotopeMz'].tolist()[:6]
        assert len(pre) > 0
        assert len(pro) > 0
        print(pqp, ms2_name)

        peak_group = requant.RequantPeakGroup(peptide, charge, rank, rt, pre[0], pro)
        print(operator.run(ms2_name, [peak_group]))
********************************************************************************
--------------------------------------------------
ELHINLIPNKQDR_3 ms2-025
546.2994384765631 (31, 38) != (31, 41)
../_images/jupyter_requant_45_1.png
../_images/jupyter_requant_45_2.png
Raw MS1 & MS2: 7971.382612753305 1360.6360110167914
Fit MS1 & MS2: 5936.0447804902815 1105.2600737423677
[('ELHINLIPNKQDR', 3, 5936.0447804902815, 1105.2600737423677)]
--------------------------------------------------
KSDIDEIVLVGGSTR_3 ms2-025
../_images/jupyter_requant_45_4.png
../_images/jupyter_requant_45_5.png
Raw MS1 & MS2: 3663.5420920012957 548.9509062646167
Fit MS1 & MS2: 1565.341063713942 467.98236860566055
[('KSDIDEIVLVGGSTR', 3, 1565.341063713942, 467.98236860566055)]
********************************************************************************
--------------------------------------------------
EATFGVDESANK_2 ms2-046
419.22488403320295 (31, 36) != (31, 41)
../_images/jupyter_requant_45_7.png
../_images/jupyter_requant_45_8.png
Raw MS1 & MS2: 2727.821488812674 294.9427344606671
Fit MS1 & MS2: 1408.4605688391075 249.52289699122136
[('EATFGVDESANK', 2, 1408.4605688391075, 249.52289699122136)]
********************************************************************************
--------------------------------------------------
LAAEESVGTMGNR_2 ms2-052
../_images/jupyter_requant_45_10.png
../_images/jupyter_requant_45_11.png
Raw MS1 & MS2: 2402.871796360551 274.4929809051114
Fit MS1 & MS2: 640.5041539081958 193.0167084570027
[('LAAEESVGTMGNR', 2, 640.5041539081958, 193.0167084570027)]
--------------------------------------------------
TPVISGGPYYER_2 ms2-052
../_images/jupyter_requant_45_13.png
../_images/jupyter_requant_45_14.png
Raw MS1 & MS2: 18987.131273955467 3611.313791924786
Fit MS1 & MS2: 13335.611364021508 3232.3448694321823
[('TPVISGGPYYER', 2, 13335.611364021508, 3232.3448694321823)]
********************************************************************************
--------------------------------------------------
TLEEDEEELFK_2 ms2-056
../_images/jupyter_requant_45_16.png
../_images/jupyter_requant_45_17.png
Raw MS1 & MS2: 14489.968375957485 1537.2048833212073
Fit MS1 & MS2: 11542.81942725199 1334.0724131818538
[('TLEEDEEELFK', 2, 11542.81942725199, 1334.0724131818538)]
--------------------------------------------------
VQVALEELQDLK_2 ms2-056
616.366455078125 (31, 40) != (31, 41)
../_images/jupyter_requant_45_19.png
../_images/jupyter_requant_45_20.png
Raw MS1 & MS2: 2704.855085515457 233.11135740071734
Fit MS1 & MS2: 1448.0969122618142 242.93041021698042
[('VQVALEELQDLK', 2, 1448.0969122618142, 242.93041021698042)]
********************************************************************************
--------------------------------------------------
GDLDAASYYAPVR_2 ms2-057
../_images/jupyter_requant_45_22.png
../_images/jupyter_requant_45_23.png
Raw MS1 & MS2: 9132.382213427156 2131.9018901521886
Fit MS1 & MS2: 7664.530453314555 1927.7800850246947
[('GDLDAASYYAPVR', 2, 7664.530453314555, 1927.7800850246947)]
********************************************************************************
--------------------------------------------------
DAVTPADFSEWSK_2 ms2-062
420.22415161132795 (31, 36) != (31, 41)
../_images/jupyter_requant_45_25.png
../_images/jupyter_requant_45_26.png
Raw MS1 & MS2: 10443.635559962993 1055.7539806959778
Fit MS1 & MS2: 8230.962938048931 991.0828664359409
[('DAVTPADFSEWSK', 2, 8230.962938048931, 991.0828664359409)]
********************************************************************************
--------------------------------------------------
GILAAEESVGTMGNR_2 ms2-066
../_images/jupyter_requant_45_28.png
../_images/jupyter_requant_45_29.png
Raw MS1 & MS2: 6142.502002452005 1221.8592417017833
Fit MS1 & MS2: 4804.03914396974 1101.5187370787942
[('GILAAEESVGTMGNR', 2, 4804.03914396974, 1101.5187370787942)]
--------------------------------------------------
SPFTVGVAAPLDLSK_2 ms2-066
../_images/jupyter_requant_45_31.png
../_images/jupyter_requant_45_32.png
Raw MS1 & MS2: 2656.414106639162 136.0974727587502
Fit MS1 & MS2: 580.3664099675456 126.41985619771064
[('SPFTVGVAAPLDLSK', 2, 580.3664099675456, 126.41985619771064)]
[ ]: