Sub-sampling the data of a toffee file to just include standard peptides

In order to build a set of test files with known properties that can be used within a testing framework, we wish to take a full experimental file and extract out a subset of the data. In this example, we are going to take a HEK293 control that has a collection of spike-in retention time standards, extract the data just for those peptides such that this file can be used in regression tests.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
import scipy
import scipy.sparse
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm

import toffee
from toffee import util
from toffee.manipulation import Subsampler

sns.set()
sns.set_color_codes()
[3]:
toffee.__version__
[3]:
'0.12.18'

Sub-sample using the Spectral Reference Library (SRL)

Open the transition library for the iRT peptides

[4]:
base_dir = os.environ.get('DIA_TEST_DATA_REPO', None)
assert base_dir is not None
srl_fname = base_dir + '/ProCan90/srl/hek_srl.OpenSwath.iRT.tsv'
srl = pd.read_csv(srl_fname, sep='\t')

and then create the object for sub-sampling a toffee file using the peptides in teh reference library.

[5]:
tof_fname = base_dir + '/ProCan90/tof/ProCan90-M03-01.tof'
subsampler = Subsampler(
    tof_fname=tof_fname,
    precursor_and_product_df=srl,
    filter_ms2_windows=['ms2-025', 'ms2-041'],
)

Running the sub-sampling is as simple as passing in the output file name.

[6]:
test_tof_fname = os.path.basename(tof_fname).replace('.tof', '.test.tof')
subsampler.run(test_tof_fname)
100%|██████████| 2/2 [00:25<00:00, 11.43s/it]

Properties of the toffee HDF5 file

Using standard HDF tools we can interrogate the new toffee file for things such as compression, dataset and attribute names, or even the Intrinsic Mass Spacing parameters.

[7]:
!ls -lhF $test_tof_fname
-rw-r--r--  1 btully  470603294   867K 15 Apr 20:43 ProCan90-M03-01.test.tof
[8]:
!h5ls -vg $test_tof_fname/ms1
Opened "ProCan90-M03-01.test.tof" with sec2 driver.
ms1                      Group
    Attribute: IMSAlpha scalar
        Type:      native double
        Data:  7.01548e-05
    Attribute: IMSBeta scalar
        Type:      native double
        Data:  -5.92871e-05
    Attribute: IMSGamma scalar
        Type:      native int
        Data:  266664
    Attribute: firstScanRetentionTimeOffset scalar
        Type:      native double
        Data:  0.17
    Attribute: precursorCenter scalar
        Type:      native double
        Data:  -1
    Attribute: precursorLower scalar
        Type:      native double
        Data:  -1
    Attribute: precursorUpper scalar
        Type:      native double
        Data:  -1
    Attribute: scanCycleTime scalar
        Type:      native double
        Data:  3.32
    Location:  1:405261
    Links:     1
[9]:
!h5ls -v $test_tof_fname/ms1
Opened "ProCan90-M03-01.test.tof" with sec2 driver.
IMSAlphaPerScan          Dataset {1627/1627}
    Location:  1:413217
    Links:     1
    Chunks:    {1627} 13016 bytes
    Storage:   13016 logical bytes, 6975 allocated bytes, 186.61% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native double
IMSBetaPerScan           Dataset {1627/1627}
    Location:  1:422560
    Links:     1
    Chunks:    {1627} 13016 bytes
    Storage:   13016 logical bytes, 7470 allocated bytes, 174.24% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native double
imsCoord                 Dataset {257400/257400}
    Location:  1:432398
    Links:     1
    Chunks:    {257400} 1029600 bytes
    Storage:   1029600 logical bytes, 198728 allocated bytes, 518.10% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native unsigned int
intensity                Dataset {257400/257400}
    Location:  1:633670
    Links:     1
    Chunks:    {257400} 1029600 bytes
    Storage:   1029600 logical bytes, 251727 allocated bytes, 409.01% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native unsigned int
retentionTimeIdx         Dataset {1627/1627}
    Location:  1:406565
    Links:     1
    Chunks:    {1627} 6508 bytes
    Storage:   6508 logical bytes, 3956 allocated bytes, 164.51% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native unsigned int
[10]:
!h5ls -vg $test_tof_fname/ms2-041
Opened "ProCan90-M03-01.test.tof" with sec2 driver.
ms2-041                  Group
    Attribute: IMSAlpha scalar
        Type:      native double
        Data:  7.01547e-05
    Attribute: IMSBeta scalar
        Type:      native double
        Data:  -3.32516e-05
    Attribute: IMSGamma scalar
        Type:      native int
        Data:  142548
    Attribute: firstScanRetentionTimeOffset scalar
        Type:      native double
        Data:  1.467
    Attribute: precursorCenter scalar
        Type:      native double
        Data:  611.5
    Attribute: precursorLower scalar
        Type:      native double
        Data:  608.5
    Attribute: precursorUpper scalar
        Type:      native double
        Data:  614.5
    Attribute: scanCycleTime scalar
        Type:      native double
        Data:  3.32
    Location:  1:174499
    Links:     1
[11]:
!h5ls -v $test_tof_fname/ms2-041
Opened "ProCan90-M03-01.test.tof" with sec2 driver.
IMSAlphaPerScan          Dataset {1627/1627}
    Location:  1:181206
    Links:     1
    Chunks:    {1627} 13016 bytes
    Storage:   13016 logical bytes, 8244 allocated bytes, 157.88% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native double
IMSBetaPerScan           Dataset {1627/1627}
    Location:  1:191818
    Links:     1
    Chunks:    {1627} 13016 bytes
    Storage:   13016 logical bytes, 4755 allocated bytes, 273.73% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native double
imsCoord                 Dataset {96051/96051}
    Location:  1:198941
    Links:     1
    Chunks:    {96051} 384204 bytes
    Storage:   384204 logical bytes, 172435 allocated bytes, 222.81% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native unsigned int
intensity                Dataset {96051/96051}
    Location:  1:373920
    Links:     1
    Chunks:    {96051} 384204 bytes
    Storage:   384204 logical bytes, 28973 allocated bytes, 1326.08% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native unsigned int
retentionTimeIdx         Dataset {1627/1627}
    Location:  1:175803
    Links:     1
    Chunks:    {1627} 6508 bytes
    Storage:   6508 logical bytes, 2707 allocated bytes, 240.41% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      native unsigned int

Compare to the original data

By opening up the file we just created, we can ensure that the data it contains matches exactly the same data as the original file. In doing so, we can prove that the data used in testing later is indistiguisable regardless of which file is being used.

First off, we need a series of functions that check data inside numpy and scipy matrices.

[12]:
def sparse_assert_allclose(A, B, atol=1e-8):
    """
    Test to see if two sparse matrices compare equal.
    Modified from: https://stackoverflow.com/a/47771340/758811
    """
    np.testing.assert_array_equal(A.shape, B.shape)
    r1, c1, v1 = scipy.sparse.find(A)
    r2, c2, v2 = scipy.sparse.find(B)
    np.testing.assert_array_equal(r1, r2)
    np.testing.assert_array_equal(c1, c2)
    np.testing.assert_allclose(v1, v2, atol=atol)

def chromatogram_assert_allclose(A, B):
    np.testing.assert_allclose(
        A.retentionTime,
        B.retentionTime
    )
    np.testing.assert_allclose(
        A.massOverCharge,
        B.massOverCharge
    )
    sparse_assert_allclose(
        A.intensities,
        B.intensities
    )

For further checking, we can plot the raw data of each file alongside one another

[13]:
def plot_raw(original_chrom, test_chrom, mz_value, window):
    def _plot_raw_impl(chrom, ax1, ax2, title, label_y=True):
        xlim = [chrom.retentionTime[0] / 60.0, chrom.retentionTime[-1] / 60.0]
        dense_intensities = chrom.intensities.toarray()

        intensities = np.log10(1 + dense_intensities.T)
        ax1.matshow(intensities, extent=xlim + [0, 6.0], cmap=plt.cm.viridis)
        ax1.set_xlim(xlim)
        ax1.set_ylabel('m/z (Da)')
        ax1.yaxis.set_visible(False)
        ax1.set_title(title, y=0.9)

        ax2.plot(chrom.retentionTime / 60.0, np.sum(dense_intensities, axis=1))
        ax2.set_xlim(xlim)
        ax2.set_xlabel('Retention Time (min)')
        if label_y:
            ax2.set_ylabel('Intensity')

    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 3), sharex=True, sharey='row')
    _plot_raw_impl(original_chrom, axes[0, 0], axes[1, 0], 'Original Data')
    _plot_raw_impl(test_chrom, axes[0, 1], axes[1, 1], 'Test Data', label_y=False)
    fig.suptitle('{} Data | Mass Over Charge = {:.2F}'.format(window, mz_value))
    fig.tight_layout()
    plt.show()

Again, loading the MS1 data is more expensive than others, so we only want to do it once

[14]:
test_swath_run = toffee.SwathRun(test_tof_fname)
test_ms1_swath_map = test_swath_run.loadSwathMap(toffee.ToffeeWriter.MS1_NAME)

Finally, let’s loop through the precursors and products that were used by the sub-samplerand compare the new file to the existing data

[15]:
ppm_width = 100
for k, ms2_name in enumerate(tqdm(sorted(subsampler.precursors.ms2Name.unique().tolist()))):
    # filter the library
    try:
        precursors = subsampler.precursors[subsampler.precursors.ms2Name == ms2_name]
        products = subsampler.products[subsampler.products.ms2Name == ms2_name]
    except KeyError:
        # this will happen if the window is not in the library
        continue

    # load the MS2 swath map
    ms2_swath_map = subsampler.swath_run.loadSwathMap(ms2_name)
    test_ms2_swath_map = test_swath_run.loadSwathMap(ms2_name)

    # collect the MS1 data
    for i, (_, row) in enumerate(precursors.iterrows()):
        mz = row.IsotopeMz
        mz_range = toffee.MassOverChargeRangeWithPPMFullWidth(mz, ppm_width)
        ms1_xic = subsampler.ms1_swath_map.extractedIonChromatogramSparse(mz_range)
        test_ms1_xic = test_ms1_swath_map.extractedIonChromatogramSparse(mz_range)
        if i == 0 and k < 4:
            plot_raw(ms1_xic, test_ms1_xic, mz, window='MS1')
        try:
            chromatogram_assert_allclose(ms1_xic, test_ms1_xic)
        except AssertionError:
            plot_raw(ms1_xic, test_ms1_xic, mz, window=f'Error! MS1')

    # collect the MS2 data
    for i, (_, row) in enumerate(products.iterrows()):
        mz = row.IsotopeMz
        mz_range = toffee.MassOverChargeRangeWithPPMFullWidth(mz, ppm_width)
        ms2_xic = ms2_swath_map.extractedIonChromatogramSparse(mz_range)
        test_ms2_xic = test_ms2_swath_map.extractedIonChromatogramSparse(mz_range)
        if i == 0 and k < 4:
            plot_raw(ms2_xic, test_ms2_xic, mz, window=ms2_name)
        try:
            chromatogram_assert_allclose(ms2_xic, test_ms2_xic)
        except AssertionError:
            plot_raw(ms2_xic, test_ms2_xic, mz, window=f'Error! {ms2_name}')
  0%|          | 0/2 [00:00<?, ?it/s]
../_images/jupyter_filtering-just-for-iRT-peptides_23_1.png
../_images/jupyter_filtering-just-for-iRT-peptides_23_2.png
 50%|█████     | 1/2 [00:01<00:01,  1.96s/it]
../_images/jupyter_filtering-just-for-iRT-peptides_23_4.png
../_images/jupyter_filtering-just-for-iRT-peptides_23_5.png
100%|██████████| 2/2 [00:03<00:00,  1.82s/it]

And now, clean up

[16]:
if os.path.isfile(test_tof_fname):
    os.remove(test_tof_fname)