toffee
: A library for fast access to DIAMS data¶
Data generated by DIAMS exists in proprietary data formats that are opaque to opensource software. Typically, a user will convert this data to the open mzML
format before utilising a tool such as OpenSWATH for analysis. Unfortunately, the conversion of data formats results in a significant expansion in the file size, and as the downstream algorithms are largely I/O bound, this leads to an unneccessary increase in analysis time.
Based on the physics of mass analyzers, it is possible to create a more highlycompressed format whilst still retaining profile raw data. Ultimately, this is an implementation of the Intrinsic Mass Spacing (IMS) as described in Schneider (2016) and reflects the fact that mass analyzers a discrete sensors that operate in the time domain, and transform results into the mass over charge domain.
In order to investigate this fully, we took a Sciex SWATHMS file from a recent run through the ProCan facility, and a Thermo QExacutive DIA RAW file from Peckner2018. Each of these raw files was converted to profile mzML
using msconvert
and then using pyteomics
we extracted the raw m/z data and saved to HDF5 for fast read access.
[1]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
import scipy
import h5py
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import clear_output
clear_output(wait=False) # clear import errors that aren't useful for tutorial
sns.set()
sns.set_color_codes()
base_dir = os.environ.get('DIA_TEST_DATA_REPO', None)
assert base_dir is not None
Sciex / Time of Flight (TOF) Mass Analyzers¶
We recognised an opportunity to represent SWATHMS data in a different form based on the physics of Time of Flight (TOF) detection (see wikipedia).
Time of Flight Physics¶
We know that the ions, of charge \(q\), are passed through an electric potential \(U\)
and will result in a kinetic energy
assuming no acceleration in the TOF tube (of length \(d\)), then velocity \(v=d/t\). Equating the energies and solving for time we get
From here, we can calculate a relationship between \(\Delta t\) and \(\Delta \sqrt{m/q}\)
Thus, if there is a discrete and constant spacing in \(\Delta t\) due to the physics of the detector, we can assume there will be a discrete and constant spacing of \(\sqrt{m/q_2}\sqrt{m/q_1}\).
Schneider (2016) refers to this, \(\alpha\), as the Intrinsic Mass Spacing (IMS).
[2]:
class CalculatorTOF(object):
IMS_TRANSFORM = 'sqrt(m/z)'
FNAME = base_dir + '/ProCan90/ProCan90M0301.ms2050.h5'
def __init__(self, mz):
self.mz = mz
self.root_mz = np.sqrt(mz)
def calculate_ims_params_directly(self):
delta_root_mz = np.abs(np.diff(self.root_mz))
min_delta_root_mz = delta_root_mz.min()
min_alpha_list = delta_root_mz[delta_root_mz <= 1.4 * min_delta_root_mz]
alpha = np.median(min_alpha_list)
alpha_mad = np.median(np.abs(min_alpha_list  alpha))
beta = np.mean(self.residual([alpha, 0.0]))
return alpha, alpha_mad, beta
@classmethod
def calculate_deltas_in_ims_space(cls, mz):
root_mz = np.sqrt(mz)
return np.diff(root_mz)
def residual_ppm(self, params, offset=0):
return self.residual(params, offset=offset) * 1e6 / self.mz
def residual(self, params, offset=0):
return self._func(params, offset=offset)  self.mz
def ims_coord(self, params, offset=0):
a, b = params
i = (self.root_mz  b) / a
return np.around(i, decimals=0) + offset
def _func(self, params, offset=0):
a, b = params
i = self.ims_coord(params, offset=offset)
return (a * i + b) ** 2
This raw data is plotted below for both an MS1 window and the middle MS2 window where the xaxis is m/z and the yaxis plots the difference between consecutive m/z values in the raw data.
[3]:
def plot_raw_deltas(Calculator, xlim, ylim):
fname = Calculator.FNAME
windows = ['ms1', 'ms2050']
ncols = len(windows)
fig, axes = plt.subplots(nrows=1, ncols=ncols, figsize=(5 * ncols, 5), num=1, sharey=True, sharex=True)
if not isinstance(axes, np.ndarray):
axes = np.array([axes])
with h5py.File(fname, 'r') as f:
for (window_name, ax) in zip(windows, axes):
window = f[window_name]
title = window_name
if window_name != 'ms1':
title += '  {:.2F} > {:.2F}'.format(window.attrs['lower'], window.attrs['upper'])
ax.set_title(title)
for scan_key in list(window.keys())[500:700]:
mz = window[scan_key]['mz'].value
ax.semilogy(mz[1:], np.diff(mz), 'k.', ms=0.5, alpha=0.5)
for ax in axes:
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_xlabel('m/z (Da)')
axes[0].set_ylabel('Difference between m/z (Da)')
fig.tight_layout()
plot_raw_deltas(CalculatorTOF, xlim=(100, 2000), ylim=(1e3, 1e1))
It is clear from this plot that a consistent pattern emerges. From visual inspection, this seems to match our hypothesis of a \(\sqrt{m/z}\) relationship. Indeed, if we plot this along the yaxis, the Intrinsic Mass Spacing of the TOF spectrum becomes obvious. The figure below mimics Figure 3.2 of Schneider (2016) where the lowest set of data is the IMS and each line above that represents an integer multiple of the IMS. This can be seen clearly by the red lines that show the first 10 multiples of the IMS. The IMS multiplier minus one indicates the number of missing data points in the spectrum (likely due to no ions hitting the detector during this time).
[4]:
def plot_in_ims_space(Calculator, xlim, ylim):
fname = Calculator.FNAME
windows = ['ms1', 'ms2050']
ncols = len(windows)
fig, axes = plt.subplots(nrows=1, ncols=ncols, figsize=(5 * ncols, 5), num=1, sharey=True, sharex=True)
with h5py.File(fname, 'r') as f:
for (window_name, ax) in zip(windows, axes):
window = f[window_name]
title = window_name
if window_name != 'ms1':
title += '  {:.2F} > {:.2F}'.format(window.attrs['lower'], window.attrs['upper'])
ax.set_title(title)
min_list = list()
for scan_key in list(window.keys())[500:700]:
mz = window[scan_key]['mz'].value
deltas_ims = Calculator.calculate_deltas_in_ims_space(mz)
ax.semilogy(mz[1:], deltas_ims, 'k.', ms=0.5, alpha=0.5)
min_val = deltas_ims.min()
min_list.extend(list(deltas_ims[deltas_ims < 1.1 * min_val]))
min_list = np.array(min_list)
min_val = min_list.min()
min_list = min_list[min_list < 1.1 * min_val]
min_ims_alpha, stddev_min_ims_alpha = np.mean(min_list), np.std(min_list)
fmt = '{:<7s} :: IMS alpha = {:.8e} +/ {:.3e}'
print(fmt.format(window_name, min_ims_alpha, stddev_min_ims_alpha))
multipliers = [i + 1 for i in range(15)]
for multiplier in multipliers:
ims = multiplier * min_ims_alpha
ax.plot([1275, 1825], [ims, ims], 'r')
ax.text(
1875,
ims,
'{} x IMS'.format(multiplier),
horizontalalignment='left',
verticalalignment='center',
color='r',
fontsize=7,
)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_xlabel('m/z (Da)')
axes[0].set_ylabel('Difference between m/z after IMS transform')
fig.tight_layout()
fig.suptitle('IMS transform: ' + Calculator.IMS_TRANSFORM, y=1.02)
plot_in_ims_space(CalculatorTOF, xlim=(0, 2000), ylim=(3e5, 1e2))
ms1 :: IMS alpha = 7.01547425e05 +/ 4.462e11
ms2050 :: IMS alpha = 7.01547256e05 +/ 4.387e11
It is clear from this plot that a consistent pattern emerges. From visual inspection, this seems to match our hypothesis of a \(\sqrt{m/z}\) relationship. Indeed, if we plot this along the yaxis, the Intrinsic Mass Spacing of the TOF spectrum becomes obvious. The figure below mimics Figure 3.2 of Schneider (2016) where the lowest set of data is the IMS and each line above that represents an integer multiple of the IMS. This can be seen clearly by the red lines that show the first 10 multiples of the IMS. The IMS multiplier minus one indicates the number of missing data points in the spectrum (likely due to no ions hitting the detector during this time).
Using IMS for data compression¶
Knowing that the m/z data is driven by an integer relationship with the IMS allows two opportunities for nearlossless data compression:
We can drop any \(m/z\) values with corresponding zero intensities as these data points can be added back to the spectrum without losing fidelity
We can store the IMS value as a double, and convert the \(m/z\) vector to a vector of unsigned 32bit integers that represent an index
The relationship to convert between vectors is summarised as
where \(\alpha_{IMS}\) is the IMS value, \(i_{\sqrt{m/z}}\) is the unsigned integer index of the \(m/z\) value in \(\sqrt{m/z}\) space, and \(\beta_{trunc}\) is to control for any truncation error that may be introduced from the mzML file.
In the figure below, we plot out (in black) the mass error, in ppm, of the converted mass over charge compared to the values collected from the mzML. The red dots indicate the mass error if the index (\(i_{\sqrt{m/z}}\)) is offset by \(0.25\) in either direction. In general, this shows that the bounds of the indices are tight, but there is a latent mass error that comes from translating the floatingpoint \(m/z\) value into integer space.
Thermo / Orbitrap¶
Old
New
where \(\gamma\) is the minimum value of the IMS coords such that these are zeroindexed in the file
[6]:
class CalculatorOrbitrap(object):
IMS_TRANSFORM = '1 / sqrt(m/z)'
FNAME = base_dir + '/thermo/CC20160706_P100_Plate34_PC3_T3_P0034_A01_acq_01.h5'
def __init__(self, mz):
self.mz = mz
self.inverse_root_mz = 1 / np.sqrt(mz)
def calculate_ims_params_directly(self):
delta_root_mz = np.abs(np.diff(self.inverse_root_mz))
min_delta_root_mz = delta_root_mz.max()
min_alpha_list = delta_root_mz[delta_root_mz >= 1.4 * min_delta_root_mz]
alpha = np.median(min_alpha_list)
alpha_mad = np.median(np.abs(min_alpha_list  alpha))
beta = np.mean(self.residual([alpha, 0.0]))
return alpha, alpha_mad, beta
@classmethod
def calculate_deltas_in_ims_space(cls, mz):
inverse_root_mz = 1 / np.sqrt(mz)
return np.diff(inverse_root_mz)
def residual_ppm(self, params, offset=0):
return self.residual(params, offset=offset) * 1e6 / self.mz
def residual(self, params, offset=0):
return self._func(params, offset=offset)  self.mz
def ims_coord(self, params, offset=0):
a, b = params
i = (self.inverse_root_mz  b) / a
return np.around(i, decimals=0) + offset
def _func(self, params, offset=0):
i = self.ims_coord(params, offset=offset)
a, b = params
return (a * i + b) ** 2
[7]:
plot_raw_deltas(CalculatorOrbitrap, xlim=(200, 2500), ylim=(1e4, 1e2))
[8]:
plot_in_ims_space(CalculatorOrbitrap, xlim=(200, 2500), ylim=(1e8, 1e5))
ms1 :: IMS alpha = 1.34410614e08 +/ 1.244e10
ms2050 :: IMS alpha = 2.68852104e08 +/ 2.746e10
Scan data saved as a point cloud¶
If we consider the raw data collected within a given window across a full gradient during a SWATHMS run, each data point represents:
Mass over charge: The abscissa refers to the mass over charge ratio of the peptide ion
Retention time: The ordinate refers to the time at which the peptide ion is elutes from the Liquid Chromatography column
Intensity: The applicate is a count of the number of times a peptide ion is detected on the sensor
We have shown that mass over charge can be represented by an unsigned integer, as can intensity. Furthermore, the retention time for each data point can be considered as an index into an associated (but much smaller) vector of doubles. In this manner, we can save the raw data into a file format as 3 vectors of unsigned 32bit integers with some associated metadata.
toffee
takes this exact approach, saving each SWATH window as a group in an HDF5 file. Each group has:
double
attributeslower
,center
, andupper
to describe the properties of the window;double
attributesscanCycleTime
(\(t_c\)) andfirstScanRetentionTimeOffset
(\(t_0\)) that describe the retention time vector for each window such that \(t(i) = t_c i + t_0\) in seconds;a
uint32
dataset ofretentionTimeIdx
, of equal length as the retention time vector, that represents indices into themzIMSCoord
andintensity
between which the corresponding value of \(t(i)\) is applied;double
attributesimsAlpha
(\(\alpha_{IMS}\)) andimsBeta
(\(\beta_{trunc}\)) for the Intrinsic Mass Spacing m/z conversion such that \(\frac{m}{z} = \left( \alpha_{IMS} i_{\sqrt{m/z}} + \beta_{trunc} \right)^2\); anduint32
datasets ofmzIMSCoord
andintensity
for the point cloud, with one entry for each point in the cloud.
In taking this approach, we have found a file format with lossless compression that can represent a SWATHMS mzML
file with a tenfold decrease in the file size.
File 





ProCan90M0301 
576.06 
533.25 
5,596.73 
92.57 
ProCan90M0601 
779.12 
732.04 
7,555.10 
93.95 
Property 






count 
90 
90 
90 
90 
90 
mean 
1.12 
1.08 
11.02 
96.81 
984.17 
std 
0.27 
0.26 
2.77 
3.13 
16.72 
min 
0.54 
0.49 
5.14 
91.95 
952.54 
25% 
0.93 
0.95 
8.91 
95.77 
973.97 
50% 
1.21 
1.17 
12.02 
96.60 
986.83 
75% 
1.28 
1.24 
12.72 
97.27 
995.92 
max 
1.57 
1.54 
16.09 
106.05 
1021.46 
[ ]: