Crossmark

Diagnostics and correction of batch effects in large‐scale proteomic studies: a tutorial

Crossref DOI link: https://doi.org/10.15252/msb.202110240

Published Online: 2021-08-25

Published Print: 2021-08-01

Update policy: https://doi.org/10.1007/springer_crossmark_policy

Authors

Čuklina, Jelena https://orcid.org/0000-0002-5220-8642
Lee, Chloe H https://orcid.org/0000-0002-6232-7119
Williams, Evan G https://orcid.org/0000-0002-9746-376X
Sajic, Tatjana https://orcid.org/0000-0003-4282-1336
Collins, Ben C https://orcid.org/0000-0003-0827-3495
Rodríguez Martínez, María https://orcid.org/0000-0003-3766-4233
Sharma, Varun S https://orcid.org/0000-0002-4531-640X
Wendt, Fabian https://orcid.org/0000-0002-2501-536X
Goetze, Sandra https://orcid.org/0000-0001-6880-8020
Keele, Gregory R https://orcid.org/0000-0002-1843-7900
Wollscheid, Bernd https://orcid.org/0000-0002-3923-1610
Aebersold, Ruedi https://orcid.org/0000-0002-9576-3267
Pedrioli, Patrick G A https://orcid.org/0000-0001-6719-9139

Funding

Funding for this research was provided by:

H2020 European Research Council (668858)

H2020 European Research Council (ERC‐20140AdG‐670821)

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (IZLRZ3_163911)

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (3100A0‐688‐107679)

License Information

Text and Data Mining valid from 2021-08-01

Version of Record valid from 2021-08-25

More Information

Article History

Received: 22 January 2021

Revised: 16 July 2021

Accepted: 26 July 2021

First Online: 25 August 2021

Conflict of interest

: The authors declare that they have no conflict of interest.

: Box 1

: Proteomic experiments now routinely profile hundreds or thousands of proteins across hundreds of samples. However, detecting all proteins without missing values across the whole dataset is not yet feasible. The patterns of "missingness" are known to be batch‐specific (Karpievitch et al , ), and some workflows are susceptible to a rapid inflation of missing values as the number of batches increases (Brenes et al , ). This is also true for the largest datasets of this manuscript: aging mouse DIA and TMT datasets (see Box 1 Figure, Figs and for details).

: Note that although missing values are more common for low‐abundance peptides (an issue related to the dynamic range and sensitivity of the mass spectrometer), the problem can also arise from fundamental peptide interference, regardless of peptide abundance or acquisition parameters.

: Missing values can also affect batch effect correction methods. For instance, the current implementation of ComBat (Johnson et al , ) does not work if a peptide is missing in one batch. One possible solution is to remove all peptides with missing values before batch correction (Lee et al , ). However, this may discard valuable quantitative information. Thus, methods that are more robust to missing data, such as median centering, can sometimes be better suited to proteomic data.
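
: The robustness of median centering to missing data can be illustrated with a minimal sketch. It assumes per‐peptide, per‐batch centering of log‐transformed intensities, with `None` marking a missing value; the function names and the example values are hypothetical, and this is not the implementation used in the tutorial. The second helper flags batches in which a peptide has no observed value at all, which is one reading of "missing in one batch".

```python
from statistics import median


def median_center_by_batch(values, batches):
    """Center one peptide's log-intensities on the median of each batch.

    values  -- list of floats, with None marking a missing value
    batches -- list of batch labels, one per sample (same length as values)
    """
    # Gather the observed (non-missing) values of every batch.
    observed = {}
    for value, batch in zip(values, batches):
        if value is not None:
            observed.setdefault(batch, []).append(value)

    # The median is computed from observed values only.
    batch_medians = {batch: median(vals) for batch, vals in observed.items()}

    # Missing values stay missing; nothing is imputed.
    return [
        None if value is None else value - batch_medians[batch]
        for value, batch in zip(values, batches)
    ]


def batches_without_observations(values, batches):
    """Return the set of batches in which this peptide was never quantified."""
    seen = {batch for value, batch in zip(values, batches) if value is not None}
    return set(batches) - seen


# Hypothetical log2 intensities for one peptide across two batches.
values = [10.0, 12.0, None, 20.0, 22.0, 21.0]
batches = ["A", "A", "A", "B", "B", "B"]
centered = median_center_by_batch(values, batches)
```

: In this sketch, a peptide that is absent from an entire batch is simply left uncorrected in that batch, so no peptide has to be discarded beforehand.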

: Missing values are often imputed by filling them with zeros or random small values (Tyanova et al , ), or by re‐quantification of elution traces (Röst et al , ). Such imputation, however, can introduce batch‐ or peptide‐specific bias, as seen in Figs and . In turn, this bias skews batch effect diagnostic methods such as hierarchical clustering, PCA, or PVCA: the batch effect assessment becomes biased because the clustering pattern is driven by missing values (Fig ). One can estimate this effect by varying the fraction of missing values and assessing to what extent the apparent batch effects are driven by consistently quantified peptides versus peptides containing missing values (Fig ).
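
: One way to vary the fraction of missing values, sketched here under stated assumptions, is to keep only peptides whose missing fraction is below a chosen threshold and to rerun the diagnostic of choice on each subset. The peptide names, intensities, and thresholds are hypothetical, and the diagnostic itself (clustering, PCA, or PVCA) is deliberately left out.

```python
def missing_fraction(values):
    """Fraction of samples in which a peptide is missing (None)."""
    return sum(value is None for value in values) / len(values)


def filter_by_missing_fraction(matrix, max_missing):
    """Keep peptides whose missing fraction does not exceed max_missing.

    matrix -- dict mapping peptide name to a list of per-sample intensities
    """
    return {
        peptide: values
        for peptide, values in matrix.items()
        if missing_fraction(values) <= max_missing
    }


# Hypothetical peptide-by-sample matrix (four samples).
matrix = {
    "pepA": [10.1, 10.3, 9.9, 10.0],    # consistently quantified
    "pepB": [8.2, None, 8.0, 8.4],      # 25% missing
    "pepC": [None, None, 7.5, None],    # 75% missing
}

# Size of each subset on which the diagnostic would be rerun.
subset_sizes = {
    threshold: len(filter_by_missing_fraction(matrix, threshold))
    for threshold in (0.0, 0.25, 1.0)
}
```

: If the apparent batch clustering weakens as the threshold approaches zero, that suggests it was driven by missing values rather than by consistently quantified peptides.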

: More importantly, imputed values bias the analysis beyond the batch effect adjustment stage. As shown in Box 1 Figure B and C, when re‐quantified values inferred from MS elution traces are used ("with requants"), the correlation within batches appears higher than the correlation between replicates; this problem is not observed when imputation is not used ("no requants"). Imputation at lower levels also affects protein inference.

: Finally, provided that there are enough confidently quantified values, many downstream analysis techniques, such as differential expression or protein correlation analyses, can handle missing values. We therefore advise avoiding imputation or, at least, performing it after batch correction whenever possible.
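
: As an illustration of a correlation analysis that tolerates missing values without imputation, the following sketch computes a Pearson correlation over pairwise‐complete observations only. The function name, the `min_pairs` cut‐off, and the example values are assumptions made for illustration.

```python
from math import sqrt


def pairwise_complete_correlation(x, y, min_pairs=3):
    """Pearson correlation using only positions observed in both x and y.

    Returns None when fewer than min_pairs complete pairs exist or when
    either series is constant over the complete pairs.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < min_pairs:
        return None

    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Hypothetical intensities of two samples over five peptides.
sample_1 = [1.0, 2.0, None, 4.0, 5.0]
sample_2 = [2.0, 4.0, 6.0, None, 10.0]
r = pairwise_complete_correlation(sample_1, sample_2)
```

: Requiring a minimum number of complete pairs reflects the condition stated above that enough confidently quantified values must be available.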

: Box 1 Figure. The problem of missing values in batch effect diagnosis and correction: aging mouse study. (A) Hierarchical clustering and heatmap of normalized data, with missing values shown in black; the missing values are non‐randomly associated with the batch. (B) Heatmap of selected sample correlations: stronger correlation of samples within Batch 2 (blue) and Batch 3 (brown) is visible in the data with "requants", whereas replicate correlation is much more prominent in the data without "requants". (C) Distribution of selected sample correlations, showing the same effect as in (B).

Document is current
