{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Re-Quantifying detections using toffee and 2D modified Gaussian\n", "\n", "Through the many visualisations of toffee data, and an appreciation of the TOF detector, one recognises that the data is roughly Gaussian in both retention time and index m/z space. Indeed, as we will see below, it is possible to model the raw data as an analytic function to improve the quantification results. In doing so, the least-squares method used to fit the function to the data acts like a filter for high-frequency noise and so performs basic noise removal. By fixing the retention time of the analytic function for all fragments, it is also possible to pull apart co-eluting peaks and count only the contribution from the peak of interest.\n", "\n", "The elution profile from the LC column follows a log-normal distribution, with its peak leaning towards the left and a longer tail on the right, while in index m/z space the data is normally distributed, which we attribute to the distribution of kinetic energy that individual ions obtain within the TOF mass analyser. 
These observations imply that we can perform an optimisation to find the peak location, spread, and amplitude of the Gaussian functions for each fragment, $j$.\n", "\n", "$$\n", "F_j \\left( \\sigma_t, \\sigma_m, t_0, m_{0,j}, a_j, c_j \\right) = \\frac{1}{t} \\cdot \\frac{a_j}{\\sigma_t\\sqrt{2 \\pi}} \\exp{\\left[-\\frac{(\\log{t} - \\log{t_0})^2}{2 \\sigma_t^2} - \\frac{(m - m_{0,j})^2}{2 \\sigma_m^2}\\right]} + C_j \\left( \\sigma_m, m_{0,j}, c_j \\right)\n", "$$\n", "\n", "with chemical noise, for MS1 signal only, defined as\n", "\n", "$$\n", "C_j \\left( \\sigma_m, m_{0,j}, c_j \\right) = c_j \\exp{\\left[-\\frac{(m - m_{0,j})^2}{2 \\sigma_m^2}\\right]}\n", "$$\n", "\n", "and the minimisation function\n", "\n", "$$\n", "G \\left( \\sigma_t, \\sigma_m, t_0, \\vec{m_0}, \\vec{a}, \\vec{c} \\right) = \\sum_j G_j \\text{ , where } G_j = F_j \\left( \\sigma_t, \\sigma_m, t_0, m_{0,j}, a_j, c_j \\right) - I_j\n", "$$\n", "\n", "where\n", "- $t$ and $m$ are the coordinates in retention time and index m/z ($i_{\\sqrt{m/z}}$) space respectively\n", "- $t_0$ and $\\vec{m_0}$ are the peak locations, where we assume that the retention time must be the same for all fragments, but the mass offset is allowed to differ for each one to account for calibration offsets\n", "- $\\sigma_t$, $\\sigma_m$ are the spread of each Gaussian\n", "- $\\vec{a}$ is the amplitude for each fragment\n", "- $\\vec{c}$ is the amplitude of chemical noise for each fragment, and\n", "- $I_j$ is the raw intensity data for a given fragment.\n", "\n", "By building a model such as a 2D Gaussian, we are then able to calculate the volume under the curve, with the hypothesis that this should give a more robust measure of the intensity from sample to sample.\n", "\n", "Furthermore, to improve robustness during least-squares fitting, we apply the following transforms to the parameters:\n", "\n", "- $f(n, x) = n (1 + \\tanh{x}) / 2$ is a sigmoid function that maps any real $x$ to the range $(0, n)$\n", "- $\\sigma_t = 
\\sigma_t^{\\prime2}$, where $\\sigma_t^\\prime$ is the value used during optimisation\n", "- $\\sigma_m = \\sigma_m^{\\prime2}$, where $\\sigma_m^\\prime$ is the value used during optimisation\n", "- $t_0 = f(num_t, t_0^\\prime)$, where $t_0^\\prime$ is the value used during optimisation, and $num_t$ is the number of retention time pixels. This constrains $t_0$ to fall within the valid range of retention times\n", "- $m_0 = f(num_m, m_0^\\prime)$, where $m_0^\\prime$ is the value used during optimisation, and $num_m$ is the number of index m/z pixels. This constrains $m_0$ to fall within the valid range of index m/z values\n", "- $a_j = a_j^{\\prime2}$, where $a_j^\\prime$ is the value used during optimisation\n", "- $c_j = c_j^{\\prime2}$, where $c_j^\\prime$ is the value used during optimisation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import os\n", "import datetime\n", "\n", "from IPython.display import SVG, Image, display\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "import numpy as np\n", "\n", "import pandas as pd\n", "\n", "import seaborn as sns\n", "\n", "import toffee\n", "from toffee import requant\n", "from toffee.util import calculate_isotope_mz\n", "from toffee.viz import ToffeeFragmentsPlotter\n", "\n", "from tqdm import tqdm\n", "\n", "base_dir = os.environ.get('DIA_TEST_DATA_REPO', None)\n", "assert base_dir is not None\n", "\n", "sns.set()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Analysis date: 2019-08-12 10:15:31.826510\n", "toffee version: 0.14.2\n" ] } ], "source": [ "print(' Analysis date:', datetime.datetime.now())\n", "print('toffee version:', toffee.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting the data to an analytic model using the C++ library\n", "\n", "In 
`toffee.requant` there are two classes that enable the requantification to be performed using `PyProphet` results as the input." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Swath Gold Standard data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 4/4 [00:00<00:00, 10.48it/s]\n" ] }, { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>run_id</th><th>peptide_id</th><th>Sequence</th><th>FullPeptideName</th><th>Charge</th><th>peak_group_rank</th><th>RT</th><th>Intensity</th><th>m_score</th><th>m_score_peptide_run_specific</th><th>...</th><th>MS1Intensity</th><th>MS2Intensity</th><th>ModelParamSigmaRT</th><th>ModelParamSigmaMz</th><th>ModelParamRT0</th><th>ModelParamMz0MS1</th><th>ModelParamMz0MS2</th><th>ModelParamAmplitudes</th><th>InjectionName</th><th>PQM</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1368487973123354311</td><td>1</td><td>AAEDFTLLVK</td><td>AAEDFTLLVK(UniMod:259)</td><td>2</td><td>1</td><td>4209.56</td><td>80.0</td><td>0.032005</td><td>0.032005</td><td>...</td><td>5.028338</td><td>1.668722</td><td>0.064002</td><td>11.891527</td><td>15.757658</td><td>14.731366</td><td>12.945118</td><td>8.66231,8.142178.14217;1.04211.0421,2.753782.7...</td><td>002</td><td>AAEDFTLLVK(UniMod:259)_2</td></tr>\n",
"<tr><th>1</th><td>1368487973123354311</td><td>4</td><td>AAGASAQVLGQEGK</td><td>AAGASAQVLGQEGK(UniMod:259)</td><td>2</td><td>1</td><td>2017.42</td><td>822.0</td><td>0.003899</td><td>0.003899</td><td>...</td><td>76.928972</td><td>15.230757</td><td>0.088230</td><td>5.231933</td><td>14.949945</td><td>20.636998</td><td>20.908421</td><td>305.804,170.775170.775;41.441141.4411,16.10781...</td><td>002</td><td>AAGASAQVLGQEGK(UniMod:259)_2</td></tr>\n",
"<tr><th>2</th><td>1368487973123354311</td><td>17</td><td>ADSTGTLVITDPTR</td><td>ADSTGTLVITDPTR(UniMod:267)</td><td>2</td><td>1</td><td>3105.00</td><td>180.0</td><td>0.003899</td><td>0.003899</td><td>...</td><td>8.074463</td><td>2.976146</td><td>0.065825</td><td>2.685659</td><td>15.040877</td><td>20.335423</td><td>21.117461</td><td>49.0696,42.791442.7914;2.926292.92629,8.41438....</td><td>002</td><td>ADSTGTLVITDPTR(UniMod:267)_2</td></tr>\n",
"<tr><th>3</th><td>1368487973123354311</td><td>22</td><td>AEVAALAAENK</td><td>AEVAALAAENK(UniMod:259)</td><td>2</td><td>1</td><td>2221.28</td><td>2658.0</td><td>0.003899</td><td>0.003899</td><td>...</td><td>189.337273</td><td>57.394183</td><td>0.082998</td><td>5.633617</td><td>15.131340</td><td>20.949316</td><td>21.025522</td><td>810.276,374.571374.571;70.898270.8982,87.69587...</td><td>002</td><td>AEVAALAAENK(UniMod:259)_2</td></tr>\n",
"<tr><th>4</th><td>1368487973123354311</td><td>26</td><td>AFGYYGPLR</td><td>AFGYYGPLR(UniMod:267)</td><td>2</td><td>1</td><td>3723.90</td><td>320.0</td><td>0.034796</td><td>0.034796</td><td>...</td><td>19.167878</td><td>9.666694</td><td>0.110380</td><td>9.718412</td><td>14.489525</td><td>23.953597</td><td>19.389373</td><td>35.5049,38.795338.7953;8.423798.42379,7.632937...</td><td>002</td><td>AFGYYGPLR(UniMod:267)_2</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 24 columns</p>
\n", "\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>run_id</th><th>peptide_id</th><th>Sequence</th><th>FullPeptideName</th><th>Charge</th><th>peak_group_rank</th><th>RT</th><th>Intensity</th><th>m_score</th><th>m_score_peptide_run_specific</th><th>...</th><th>MS1Intensity</th><th>MS2Intensity</th><th>ModelParamSigmaRT</th><th>ModelParamSigmaMz</th><th>ModelParamRT0</th><th>ModelParamMz0MS1</th><th>ModelParamMz0MS2</th><th>ModelParamAmplitudes</th><th>InjectionName</th><th>PQM</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>5145023687975356512</td><td>0</td><td>AAAAAAAAAAAAAAAGAGAGAK</td><td>AAAAAAAAAAAAAAAGAGAGAK</td><td>3</td><td>1</td><td>2277.710</td><td>4944.0</td><td>0.016517</td><td>0.016516</td><td>...</td><td>611.732180</td><td>88.886325</td><td>0.289517</td><td>3.981082</td><td>14.301614</td><td>19.422064</td><td>20.641688</td><td>929.679,4790.634790.63;198.689198.689,133.7331...</td><td>ProCan90-M03-01</td><td>AAAAAAAAAAAAAAAGAGAGAK_3</td></tr>\n",
"<tr><th>1</th><td>5145023687975356512</td><td>3</td><td>AAAAAAAAAAGRVGM</td><td>AAAAAAAAAAGRVGM</td><td>2</td><td>1</td><td>1609.310</td><td>14029.0</td><td>0.000583</td><td>0.000583</td><td>...</td><td>2086.219858</td><td>201.586612</td><td>0.218611</td><td>3.505546</td><td>15.515581</td><td>22.769187</td><td>21.015362</td><td>14288.9,6270.76270.7;757.188757.188,649649,387...</td><td>ProCan90-M03-01</td><td>AAAAAAAAAAGRVGM_2</td></tr>\n",
"<tr><th>2</th><td>5145023687975356512</td><td>13</td><td>AAAAAADPNAAWAAYYSHYYQQPPGPVPGPAPAPAAPPAQGEPPQP...</td><td>AAAAAADPNAAWAAYYSHYYQQPPGPVPGPAPAPAAPPAQGEPPQP...</td><td>5</td><td>1</td><td>3275.670</td><td>2988.0</td><td>0.044867</td><td>0.044866</td><td>...</td><td>289.512315</td><td>87.778535</td><td>0.291877</td><td>2.737636</td><td>13.559004</td><td>21.539127</td><td>21.589122</td><td>404.298,2301.282301.28;27.238927.2389,136.0631...</td><td>ProCan90-M03-01</td><td>AAAAAADPNAAWAAYYSHYYQQPPGPVPGPAPAPAAPPAQGEPPQP...</td></tr>\n",
"<tr><th>3</th><td>5145023687975356512</td><td>16</td><td>AAAAADLANR</td><td>AAAAADLANR</td><td>2</td><td>1</td><td>475.722</td><td>1992.0</td><td>0.012483</td><td>0.012482</td><td>...</td><td>479.286636</td><td>134.605709</td><td>0.355179</td><td>11.511312</td><td>17.247381</td><td>19.503196</td><td>16.965990</td><td>545.168,1321.941321.94;221.035221.035,120.5912...</td><td>ProCan90-M03-01</td><td>AAAAADLANR_2</td></tr>\n",
"<tr><th>4</th><td>5145023687975356512</td><td>19</td><td>AAAAASAAGPGGLVAGK</td><td>AAAAASAAGPGGLVAGK</td><td>2</td><td>1</td><td>978.743</td><td>978.0</td><td>0.004057</td><td>0.004057</td><td>...</td><td>355.547847</td><td>31.256479</td><td>0.307980</td><td>7.227377</td><td>11.911672</td><td>9.497525</td><td>16.879286</td><td>1193.04,604.938604.938;17.968217.9682,66.87976...</td><td>ProCan90-M03-01</td><td>AAAAASAAGPGGLVAGK_2</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 24 columns</p>
\n", "