toffee: A library for fast access to Time of Flight SWATH-MS data

Toffee


Toffee is a library and file format for Time of Flight SWATH-MS data. The file format provides lossless compression that results in files of a similar size to those from the proprietary and closed vendor format. In addition, the high-performance C++ library implements spatial data structures to allow a user to extract spectrographic (slice along mass over charge axis) and chromatographic (slice along retention time) data in constant time.
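To make the two access patterns concrete, here is a toy Python sketch. It is not toffee's actual data structure or API (the library itself is C++); the class `ToyPointCloud` and its methods are hypothetical. It only illustrates how indexing the same points by both axes lets either kind of slice be a direct lookup rather than a pass over every point.

```python
from collections import defaultdict


class ToyPointCloud:
    """Hypothetical stand-in for an indexed (m/z, retention time) point cloud."""

    def __init__(self, points):
        # points: iterable of (mz_index, rt_index, intensity) tuples
        self._by_rt = defaultdict(list)
        self._by_mz = defaultdict(list)
        for mz_idx, rt_idx, intensity in points:
            self._by_rt[rt_idx].append((mz_idx, intensity))
            self._by_mz[mz_idx].append((rt_idx, intensity))

    def spectrum(self, rt_idx):
        """All points at one retention time, ordered by m/z index."""
        return sorted(self._by_rt.get(rt_idx, []))

    def chromatogram(self, mz_idx):
        """All points at one m/z index, ordered by retention time index."""
        return sorted(self._by_mz.get(mz_idx, []))


cloud = ToyPointCloud([(10, 0, 5.0), (11, 0, 7.0), (10, 1, 6.0), (12, 2, 1.0)])
```

Here `cloud.spectrum(0)` collects the points of one scan, while `cloud.chromatogram(10)` follows one m/z index through time; neither call has to visit unrelated points.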

Toffee was born out of a need to store and access SWATH-MS data in a state-of-the-art high-throughput proteomics facility, ProCan, capable of generating thousands of files per month. Using the mzML file format in this environment would quickly outstrip the available storage hardware, and we believe that such a constraint limits the potential of this technology. The challenges around mzML can be summarised into three categories:

  1. File size: Biobank-scale proteomics facilities may run upwards of 100,000 SWATH-MS runs; operating in a manner typical of ProCan results in Sciex wiff files of 1-2 GB each that unpack to 10-20 GB when converted to mzML, leading to petabytes of data that need to be stored and archived. Furthermore, this increase in file size adds significant time to processing, making analytics software largely IO-bound. On the ProCan90 dataset, toffee files are 95-100% the size of the original vendor files.

  2. Random access: Indexed mzML substantially improves randomly accessing single scan data (at constant retention time), yet algorithms often require slices along the mass over charge axis and this requires iterating over the full mzML file. Toffee facilitates a different access model allowing near constant time slicing in both retention time and mass over charge axes.

  3. Testability: A key challenge to improving downstream software is the slow iterative cycle imposed by storing experimental data in mzML. Building reliable and robust algorithms requires a strong testing framework of both unit and regression tests, and a test harness that encourages developers to use it. The IO-bound nature of mzML files risks creating artificial barriers to test adoption. However, by solving points 1 and 2 above, extremely small toffee files (small enough to be committed to the repository) can be generated with exemplar data for integration into unit and regression testing frameworks.
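The storage figures in point 1 follow from simple arithmetic. The sketch below uses only the numbers quoted above (100,000 runs, 1-2 GB per wiff file, 10-20 GB per mzML file) and decimal units (1 PB = 1,000,000 GB); the function name is illustrative.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def total_petabytes(n_runs, gb_per_file):
    """Total storage in PB for n_runs files of gb_per_file GB each."""
    return n_runs * gb_per_file / GB_PER_PB


n_runs = 100_000
wiff_range = (total_petabytes(n_runs, 1), total_petabytes(n_runs, 2))
mzml_range = (total_petabytes(n_runs, 10), total_petabytes(n_runs, 20))
print(f"wiff: {wiff_range[0]}-{wiff_range[1]} PB")
print(f"mzML: {mzml_range[0]}-{mzml_range[1]} PB")
```

At this scale the mzML copies alone come to 1-2 PB, against 0.1-0.2 PB for the vendor files.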

Toffee files are based on the open HDF5 format and can thus be read by many different programming languages. Within the toffee documentation, the ToffeeWriter class outlines the structure of the HDF5 file, and should be considered the canonical description.

In addition to the file format, toffee is also a high-performance C++ library for accessing the data in toffee files. By and large, the python classes are direct wrappings of the C++ code and API documentation can be considered largely equivalent. We use pybind11 for wrapping, and this will automatically take care of conversions of numpy and scipy matrices to corresponding Eigen matrices, albeit by creating copies.

For Users

Toffee is made available through the conda python packaging system. It can be installed using:

conda install --yes -c cmriprocan toffee

It is also available as a minimal cmriprocan/toffee Docker image that contains only conda and toffee, and as cmriprocan/openms-toffee, a Docker image for those operating a containerised workflow.

For Developers

We are basing our development workflow around Microsoft Visual Studio Code and conda. The following should help you set up a development environment. In general, we aim to use conda ‘env’ to manage dependencies.

  1. If you haven’t already, install git using your favourite method and clone this repository

  2. If you haven’t already, install conda

  3. If you haven’t already, install anaconda-client using conda install --yes anaconda-client

  4. If you haven’t already (and you’re on a mac), install the MacOS SDK

curl -L -o MacOSX10.9.sdk.tar.xz https://github.com/phracker/MacOSX-SDKs/releases/download/10.13/MacOSX10.9.sdk.tar.xz
tar -xJvf MacOSX10.9.sdk.tar.xz
sudo mv MacOSX10.9.sdk /opt/MacOSX10.9.sdk
  5. Log in to anaconda client as ProCanSoftEngRobot – you may need to ask the team for credentials

  6. If you haven’t already, download VSCode and install

  7. If you haven’t already, open VSCode and install the following extensions (look for the icon on the left side that looks like a square)

  • Microsoft “python” extension

  • Microsoft “C++” extension

  • vector-of-bool “CMake Tools” extension

  • Microsoft “Visual Studio Code Tools for AI” extension (this gets you jupyter notebooks working, among other things)

  8. From within VSCode, open this repository’s root directory; you don’t need to worry about workspaces

  9. Open up a terminal in VSCode (ctrl + backtick works on MacOS)

  • Change into the .dev-environment folder and run bash create_dev_conda_environment.sh – this will set up all of the dependencies in a conda environment called dev-toffee

  10. Open the Command Palette (cmd + shift + p on MacOS), search for “python: select interpreter” and choose any value. This will create a settings.json file in .vscode in the root of the repository.

  11. Copy the following into <repository-root>/.vscode/settings.json, being sure to replace <your-anaconda-root> with the correct path

{
    "python.pythonPath": "<your-anaconda-root>/envs/dev-toffee/bin/python",
    "cmake.cmakePath": "<your-anaconda-root>/envs/dev-toffee/bin/cmake",
    "cmake.generator": "Ninja",
    "cmake.configureSettings": {
        "CMAKE_MAKE_PROGRAM": "<your-anaconda-root>/envs/dev-toffee/bin/ninja",
        "CMAKE_C_COMPILER": "<your-anaconda-root>/envs/dev-toffee/bin/clang",
        "CMAKE_CXX_COMPILER": "<your-anaconda-root>/envs/dev-toffee/bin/clangxx"
    },
    "cmake.configureOnOpen": true,
    "files.associations": {
        "array": "cpp",
        "*.tcc": "cpp",
        "cctype": "cpp",
        "clocale": "cpp",
        "cmath": "cpp",
        "complex": "cpp",
        "cstdarg": "cpp",
        "cstddef": "cpp",
        "cstdint": "cpp",
        "cstdio": "cpp",
        "cstdlib": "cpp",
        "cstring": "cpp",
        "ctime": "cpp",
        "cwchar": "cpp",
        "cwctype": "cpp",
        "deque": "cpp",
        "forward_list": "cpp",
        "list": "cpp",
        "unordered_map": "cpp",
        "unordered_set": "cpp",
        "vector": "cpp",
        "exception": "cpp",
        "optional": "cpp",
        "fstream": "cpp",
        "functional": "cpp",
        "initializer_list": "cpp",
        "iomanip": "cpp",
        "iosfwd": "cpp",
        "iostream": "cpp",
        "istream": "cpp",
        "limits": "cpp",
        "memory": "cpp",
        "new": "cpp",
        "numeric": "cpp",
        "ostream": "cpp",
        "sstream": "cpp",
        "stdexcept": "cpp",
        "streambuf": "cpp",
        "string_view": "cpp",
        "system_error": "cpp",
        "cinttypes": "cpp",
        "type_traits": "cpp",
        "tuple": "cpp",
        "typeindex": "cpp",
        "typeinfo": "cpp",
        "utility": "cpp",
        "valarray": "cpp",
        "variant": "cpp",
        "atomic": "cpp"
    },
    "python.linting.pylintEnabled": false,
    "git.autofetch": true,
    "python.linting.flake8Enabled": true,
    "python.linting.flake8Args": [
        "--max-line-length=120"
    ],
    "editor.rulers": [120],
    "python.unitTest.unittestEnabled": false,
    "python.unitTest.nosetestsEnabled": false,
    "python.unitTest.pyTestEnabled": true,
    "editor.minimap.enabled": false,
    "C_Cpp.intelliSenseEngineFallback": "Disabled"
}
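
Before reloading VSCode, it can help to confirm that the file parses and that no <your-anaconda-root> placeholder was left behind. A minimal sketch using only the Python standard library follows; the function name `check_settings` is hypothetical, and note that Python's `json` module is stricter than VSCode (it rejects comments and trailing commas).

```python
import json


def check_settings(text, placeholder="<your-anaconda-root>"):
    """Return a list of problems found in the text of a settings.json file."""
    try:
        settings = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]

    problems = []

    def walk(value, key_path):
        # Recurse through nested objects and lists looking for leftover placeholders.
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{key_path}/{key}")
        elif isinstance(value, list):
            for i, child in enumerate(value):
                walk(child, f"{key_path}[{i}]")
        elif isinstance(value, str) and placeholder in value:
            problems.append(f"placeholder still present in {key_path}")

    walk(settings, "")
    return problems
```

Calling `check_settings(open(".vscode/settings.json").read())` from the repository root returns an empty list when the file is ready to use.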

We follow the OpenVDB style guide for the C++ and PEP-8 for our python code, so please aim to stay consistent with the rest of the code base. Contributions will pass through peer review, and style will be one element that is reviewed.

Changes

Change Log

0.14

0.14.3

  • Introduced a new concept of using a raw toffee file to “re-quantify” the results of PyProphet. In essence, we can use the retention time reported by PyProphet, and the m/z values in the search library, to anchor the data we extract from the toffee file. From here, we can then fit an analytic 2D Gaussian surface to the raw data using least-squares. See docs/jupyter/requant.ipynb for details of both the equations and the results. The function can be called using the following:

usage: requantify_pyprophet_sqlite [-h] [--max_q_value_rs MAX_Q_VALUE_RS]
                                   [--max_peptide_q_value_rs MAX_PEPTIDE_Q_VALUE_RS]
                                   [--max_protein_q_value_rs MAX_PROTEIN_Q_VALUE_RS]
                                   [--max_peptide_q_value_experiment_wide MAX_PEPTIDE_Q_VALUE_EXPERIMENT_WIDE]
                                   [--max_protein_q_value_experiment_wide MAX_PROTEIN_Q_VALUE_EXPERIMENT_WIDE]
                                   [--max_peptide_q_value_global MAX_PEPTIDE_Q_VALUE_GLOBAL]
                                   [--max_protein_q_value_global MAX_PROTEIN_Q_VALUE_GLOBAL]
                                   [--max_peak_group_rank MAX_PEAK_GROUP_RANK]
                                   [--lower_window_overlap LOWER_WINDOW_OVERLAP]
                                   [--upper_window_overlap UPPER_WINDOW_OVERLAP]
                                   output_filename toffee_filename
                                   pyprophet_filename

Take the SQLite output from PyProphet and re-quantifies the intensities. The
new file will contain the following columns || "ProteinName": The identifier
of the protein || "Sequence": The identifier of the peptide ||
"FullPeptideName": The identifier of the precursor || "Charge": The charge of
the precursor || "peak_group_rank": The rank of the precursor peak group ||
"MS1Intensity": The newly quantified MS1 intensity || "MS2Intensity": The
newly quantified MS2 intensity || "ModelParamSigmaRT": The Sigma RT parameter
of the analytic model || "ModelParamSigmaMz": The Sigma m/z parameter of the
analytic model || "ModelParamRT0": The RT0 parameter of the analytic model ||
"ModelParamMz0MS1": The m/z_0 parameter of the analytic model for MS1 ||
"ModelParamMz0MS2": The m/z_0 parameter of the analytic model for MS2 ||
"ModelParamAmplitudes": The amplitude parameters of the analytic model with
";" separating MS1 and MS2, and "," separating each fragment.

positional arguments:
  output_filename       Filename for the output results (*.csv.gz).
  toffee_filename       The raw data toffee filename (*.tof).
  pyprophet_filename    Filename for the PyProphet SQLite results that matches
                        the toffee file (*.osw).

optional arguments:
  -h, --help            show this help message and exit
  --max_q_value_rs MAX_Q_VALUE_RS
                        Run specific peak group FDR threshold.
  --max_peptide_q_value_rs MAX_PEPTIDE_Q_VALUE_RS
                        Run specific peptide FDR threshold.
  --max_protein_q_value_rs MAX_PROTEIN_Q_VALUE_RS
                        Run specific protein FDR threshold.
  --max_peptide_q_value_experiment_wide MAX_PEPTIDE_Q_VALUE_EXPERIMENT_WIDE
                        Experiment wide peptide FDR threshold.
  --max_protein_q_value_experiment_wide MAX_PROTEIN_Q_VALUE_EXPERIMENT_WIDE
                        Experiment wide protein FDR threshold.
  --max_peptide_q_value_global MAX_PEPTIDE_Q_VALUE_GLOBAL
                        Global peptide FDR threshold.
  --max_protein_q_value_global MAX_PROTEIN_Q_VALUE_GLOBAL
                        Global protein FDR threshold.
  --max_peak_group_rank MAX_PEAK_GROUP_RANK
                        Number of peak groups to consider.
  --lower_window_overlap LOWER_WINDOW_OVERLAP
                        Positive value to indicate the MS2 window lower
                        overlap (in Da).This should match the settings used in
                        OpenMSToffee/OpenSwath.
  --upper_window_overlap UPPER_WINDOW_OVERLAP
                        Positive value to indicate the MS2 window upper
                        overlap (in Da).This should match the settings used in
                        OpenMSToffee/OpenSwath
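
As a sketch of consuming the output, the following reads the gzipped CSV with the Python standard library and splits "ModelParamAmplitudes" as described in the help text. The column names come from the help text above; the function names, and the assumption that the file is a comma-delimited CSV with a header row, are illustrative guesses rather than documented behaviour.

```python
import csv
import gzip


def parse_amplitudes(field):
    """Split 'a,b;c,d' into ([MS1 amplitudes], [MS2 amplitudes]) as floats."""
    ms1_text, _, ms2_text = field.partition(";")
    ms1 = [float(x) for x in ms1_text.split(",") if x]
    ms2 = [float(x) for x in ms2_text.split(",") if x]
    return ms1, ms2


def read_requant_results(path):
    """Yield one dict per row of a requantify_pyprophet_sqlite output file."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            # Replace the raw string with the parsed (MS1, MS2) amplitude lists.
            row["ModelParamAmplitudes"] = parse_amplitudes(row["ModelParamAmplitudes"])
            yield row
```

All other columns are left as the strings found in the file, so numeric fields such as "MS2Intensity" still need converting by the caller.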

0.14.2

  • Significant performance improvement in the Sciex raw data reader – memory usage down by >60% and runtime down by 50%

0.14.1

  • Added new conversion method that converts raw Sciex data directly to toffee (PD-892)

$ raw_sciex_data_to_toffee --help
usage: raw_sciex_data_to_toffee [-h] [--filter_ms2_window FILTER_MS2_WINDOW]
                                [--hide_progress_bar] [--debug]
                                zip_filename toffee_filename

Convert raw Sciex zip data file to toffee

positional arguments:
  zip_filename          The input filename (*.zip).
  toffee_filename       The output filename (*.tof).

optional arguments:
  -h, --help            show this help message and exit
  --filter_ms2_window FILTER_MS2_WINDOW
                        If positive integer, only this MS2 window will be
                        included.
  --hide_progress_bar   If set, then progress bar will not be shown
  --debug               If set, then debugging logs will be printed
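
When converting many files, the command line above can be driven from a script. The sketch below only builds the argument list from the flags shown in the help text; the helper name `build_conversion_command` and the file names are hypothetical, and actually running the command (for example with `subprocess.run(cmd, check=True)`) requires toffee to be installed.

```python
from pathlib import Path


def build_conversion_command(zip_filename, ms2_window=None, hide_progress_bar=False, debug=False):
    """Build argv for raw_sciex_data_to_toffee; the output sits beside the input as *.tof."""
    toffee_filename = str(Path(zip_filename).with_suffix(".tof"))
    cmd = ["raw_sciex_data_to_toffee"]
    if ms2_window is not None:
        cmd += ["--filter_ms2_window", str(ms2_window)]
    if hide_progress_bar:
        cmd.append("--hide_progress_bar")
    if debug:
        cmd.append("--debug")
    cmd += [zip_filename, toffee_filename]
    return cmd


cmd = build_conversion_command("run_001.zip", ms2_window=3, hide_progress_bar=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.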

0.13

0.13.1

0.12

0.12.18

  • Updated features for the manual validation tool based on feedback from first round of validation (PD-881)

0.12.17

  • Added a small app to enable visual/manual validation of retention times picked for specific peptide queries (PD-881)

  • Changed return signature of ToffeeFragmentsPlotter::load_raw_data to include MS1 chromatogram

0.12.16

  • Added optional flag so that when you are loading a SwathMap, you can adjust the IMS coords to minimise the PPM error when slicing the data as a 2D image. (PD-879)

0.12.15

  • Bumped psims requirement to 0.1.27 that incorporates our fix for the bug in lxml into their __exit__ method. This should be much more robust against catching other errors during the xml serialisation. Removed the fix from our code (PD-875)

0.12.14

  • Zero-intensity points in a spectrum are not copied from mzML to toffee. These can be losslessly recovered. (PD-876)
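
One way to see why dropping zeros need not lose information: if each spectrum lives on a known integer index grid, any index without a stored point must have had zero intensity. The sketch below illustrates that idea with hypothetical helper names; it is an assumption about the reasoning, not a description of toffee's internals.

```python
def drop_zeros(dense):
    """Keep only (index, intensity) pairs with non-zero intensity."""
    return [(i, value) for i, value in enumerate(dense) if value != 0]


def restore_zeros(sparse, length):
    """Rebuild the dense list; indices that were not stored are zero."""
    dense = [0] * length
    for i, value in sparse:
        dense[i] = value
    return dense


dense = [0, 0, 12, 40, 7, 0, 0, 3, 0]
sparse = drop_zeros(dense)
```

Round-tripping `sparse` through `restore_zeros(sparse, len(dense))` gives back the original list.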

0.12.13

  • Added helper function to SwathRun to report immediately whether there is any MS1 data in the toffee file (PD-642)

0.12.12

  • Fixed the bug where lxml would crash on closing large mzML files (PD-875)

  • Extracted header data directly from the mzML file and stored in the toffee file (PD-873). This required that headers were moved from being an HDF5 attribute to a dataset, so the file format version has been bumped to 1.2. This is not a breaking change within toffee.

0.12.11

  • Robustness improvements to last – and a slight change to CircleCI config to hopefully build the Docker image.

0.12.10

  • Enabled conversion of mzML to toffee files using pyteomics. This now means the toffee library is completely stand-alone from the OpenMS code base. psims and pyteomics both need to be installed using pip as their conda versions are not up to date. (PD-871)

0.12.9

  • Reverted changes to the IMS indices that were made when constructing a SwathMap, as this led to downstream lossy data when, for example, creating in-silico dilutions. (PD-870)

0.12.8

  • Added code to efficiently sub-sample toffee files to only include data for specifically requested peptides. This is very useful for creating small files that can be used in downstream regression testing, without requiring GB of download. (PD-869)

0.12.7

  • Added first step for visualisation – this is based on plotly and enables an interactive figure to be generated for a given peptide (transition group) with a specified number of isotopes. (PD-868)

0.12.6

  • Added code to enable combining two toffee files where one serves as a background and peptides from the other are added with an ‘in-silico’ dilution at known retention times. This is extremely useful for testing purposes. (PD-867)

0.12.5

  • Added ability to convert toffee to mzML using the psims library (PD-793)

0.12.4

  • Renamed SwathMapSummary to SwathMapInMemorySpectrumAccess, and gave this a common base class with SwathMapSpectrumAccess

  • Added function to return m/z transformer for SwathMapInMemorySpectrumAccess and SwathMapSpectrumAccess.

  • Updated the example notebook that shows how to sub-sample a toffee file for just the iRT precursors.

0.12.3

  • Added an uncompressed cache file for using with SwathMapSpectrumAccess. This gives a significant improvement in performance, as you are no longer uncompressing data, effectively meaning that HDF5 acts like a memmap.

0.12.2

  • Small change to the IMS alpha calculation step. There are certain situations where numerical error will mean that the alpha value will flip-flop between iterations. This is caught and one value is accepted; a clearer error is thrown when that doesn’t work.

0.12.1

In general, we have switched away from using least squares to calculate alpha and beta in favour of the more robust direct method prototyped using python – the results of this prototyping are currently being used in the preparation of the Toffee manuscript and will be included as supplementary material. There is a regression test that compares the results of the python code to this C++ implementation to ensure they are equivalent.

This now enables us to store alpha and beta on a “per scan” basis and thus get lossless compression between m/z and the integer index space. The file format version has been bumped to v1.1, although it remains backwards compatible to v0.2.
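
The idea of mapping m/z onto an integer index space can be sketched as follows. The sketch assumes the usual time-of-flight relationship in which sqrt(m/z) is linear in an integer index, sqrt(m/z) = alpha * i + beta; this functional form, the parameter values, and the function names are assumptions for illustration and are not taken from the toffee source. It shows that decoding an index with the same parameters used to encode it reproduces a grid point exactly, and how the ppm error is measured when slightly different parameters are used to decode.

```python
import math


def mz_to_index(mz, alpha, beta):
    """Nearest integer index for an m/z value, assuming sqrt(m/z) = alpha * i + beta."""
    return round((math.sqrt(mz) - beta) / alpha)


def index_to_mz(index, alpha, beta):
    """Inverse of mz_to_index under the same assumed relationship."""
    return (alpha * index + beta) ** 2


def ppm_error(measured, reference):
    """Relative error in parts per million."""
    return abs(measured - reference) / reference * 1e6


# Made-up per-scan parameters and a slightly different "median" pair.
scan_alpha, scan_beta = 2.0e-4, 1.0e-3
median_alpha, median_beta = 2.0e-4, 1.00002e-3

true_mz = index_to_mz(100_000, scan_alpha, scan_beta)
idx = mz_to_index(true_mz, scan_alpha, scan_beta)
exact = ppm_error(index_to_mz(idx, scan_alpha, scan_beta), true_mz)
approx = ppm_error(index_to_mz(idx, median_alpha, median_beta), true_mz)
```

With these made-up numbers `exact` is zero and `approx` is small but non-zero, which is the qualitative difference between per-scan and shared parameters.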

Toffee data can now be loaded in three modes:

  • SwathMap which uses the median values for alpha and beta and enables the user to slice the raw data like an image. Using the library in this manner results in a 2-5 ppm mass accuracy loss as alpha and beta do not vary across retention time.

  • SwathMapSummary where you can only access the data to quickly produce plots such as totalIonChromatogram. This mode can only be used on files created with a format >= v1.1.

  • SwathMapSpectrumAccess where you can only access the data scan-by-scan in a manner akin to how one would read an mzML or wiff file. Using the library in this mode is essentially lossless (ppm error < 1e-6), at the cost of not being able to extract data by slicing through the mass over charge axis. This mode can only be used on files created with a format >= v1.1.

All of these modes are loaded through the SwathRun object as before, and there is no reason that the same algorithm cannot make use of more than one of them depending on need. They are const correct, and so will play nicely in shared memory parallelism.

0.11

0.11.1

  • Changed the method for calculating the IMS coords to be more accurate via Levenberg-Marquardt non-linear least squares (PD-800)

  • Version of toffee library used to create a file now stored as a parameter

0.10

0.10.7

  • Added ability to convert toffee SwathMap back to raw data (PD-793)

0.10.6

  • Fixed duplicate m/z IMS coordinates bug (PD-749)

0.10.5

  • Fixed IMS gamma underflow bug

0.10.4

  • Fixed IMS gamma off-by-one error that could occur when looking at the lowest m/z value in a window

License

MIT Copyright (c) 2017-2019 Children’s Medical Research Institute (CMRI)
