Keywords: 2-DE; Sensitivity; Proteoforms; Human proteome project
AB: Amido Black; Cy5min: Cy5 Minimal Labeling; Cy5sat: Cy5 Saturated Labeling; PTM: Post-Translational Modification
Since completion of the Human Genome Project determined the number of protein-coding genes, the quest is on to derive similar information regarding the human proteome [1–4]. In this study, we consider the entire complement of protein species (proteoforms) in a given cell or organism as the proteome width.
Assuming the “one gene, one protein” mantra, there should be >20,000 human proteins [5,6]. In reality the situation is far more complex, and some estimates suggest there may be 100 protein variants from a single protein-coding gene [7,8]. Variations can include single amino acid substitutions (SAPs) derived from non-synonymous SNPs, translation of alternatively spliced transcripts (AS), and post-translationally modified forms (PTM). The inventory of protein diversity by mass spectrometry was termed population proteomics over 5 years ago.
Proteome width is currently investigated using shotgun mass spectrometry, which has identified up to 8,000 different species in one cell line. For blood plasma the benchmark of 2,000 identified proteins was recently achieved, but this is evidently just the tip of the iceberg and will certainly be expanded when identification of multiple proteoforms becomes commonplace. The term “protein species” has been used traditionally [13,14], but it is being replaced by “proteoforms” in the top-down mass spectrometry community [15,16]. It is still not known exactly how many proteoforms are present in a given biological sample, but estimates based on theory range from 10⁴ to 10⁶ species (Supplementary Figure S1). We tried to address this question experimentally by investigating the ability of 2-D gel electrophoresis to evaluate the proteome width. The number of spots should be proportional to the proteome width, and in our approach we exploited this proportionality to assign the width of the proteome of essentially different specimens.
For estimation of proteome width, 2-D electrophoresis (2-DE) is a more appropriate method. In contrast to conventional gel-free proteomic approaches, 2-DE allows detection of AS, SAPs and PTMs, which can all affect protein properties. Existing bottom-up mass spectrometry methods cannot achieve this aim, while a top-down approach is still challenging.
In order to investigate proteome width, we used data from previous studies on blood plasma, tissues and cells. From these studies the sensitivity of different staining methods was assigned, and a number of 2-D gels were produced using different dyes and varying amounts of protein. The proteome width was considered in two steps: (1) determination of the number of protein spots on a 2-DE image to assess the sensitivity of different staining dyes, and (2) extrapolation of the spots-to-sensitivity function to estimate the number of spots at the highest theoretical sensitivity.
The actual sensitivity was assigned to different staining methods by preparing a dilution series (BSA was used to estimate sensitivity; Figure 1a) and determining the lowest detectable concentration (Supplementary Table S1). The response (Z) is then estimated for each staining method (Figure 2b). The number of protein spots is plotted as a function of the amount of protein loaded on the gel; more protein loaded equates to more spots present, up to saturation levels, after which gels become overloaded. The slope of this dependence differs between dyes; therefore the dyes were characterized using the following formula:
Z = number of spots (#) / amount of protein (ng)
Figure 2(a,b,c): Number of protein spots on a 2-DE gel of blood plasma as a function of: (a) amount of total protein applied to the gel: AB – Amido Black, CBB – Coomassie Brilliant Blue, ST – Silver Thiosulfate, GA – Glutaric Aldehyde, Cy5 – fluorescent labelling by Cyanines, Cy5-sat – Cyanines saturated; and (b) sensitivity of the dye (double-log scale). S = −lg C, where C is the lowest detectable concentration, M. (c) Number of spots as a function of the sensitivity of the staining dyes for HepG2, bacterial cells (datapoints were taken from ), and for human blood plasma (datapoints were taken from ). S = −lg C, where C is the limit of detection of the analytical method.
Data from different staining methods were then placed on a single plot (Figure 1c). Experimental data were used to derive the dependency of the response (Z) on the sensitivity of the staining method. The dependency can then be extrapolated to hypothetical detection limits (e.g., one molecule in 1 L of blood, or one molecule per cell) by multiplying the response by the total amount of protein in a given volume of blood plasma or in a cell. As exemplified in Figure 1c, given a particular value of the detection limit (DL) on the sensitivity axis, if DL1 = one molecule per cell, then the following formula can be applied:
# proteins = Z (DL1)*Q,
where Q is the total amount of protein in the cell. To calculate Z, a dynamic range of five orders of magnitude was covered using selected gel staining methods (Supplementary Table S1). Assuming that a dye with comparatively higher sensitivity develops more protein spots, we explored the spot-to-sensitivity dependency of different biomaterials.
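The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 12 spots per 1 μg figure is the silver-stain value quoted later in the text, and the response is assumed to be expressed in spots per ng so that Q must also be given in ng.

```python
def response(n_spots, protein_ng):
    """Z = number of spots / amount of protein loaded (ng)."""
    if protein_ng <= 0:
        raise ValueError("amount of protein must be positive")
    return n_spots / protein_ng


def species_estimate(z_at_dl, total_protein_ng):
    """# proteins = Z(DL) * Q, with Q in the same units as used for Z (ng)."""
    return z_at_dl * total_protein_ng


# Illustration: 12 spots observed for 1 ug (1000 ng) of loaded protein.
z_silver = response(12, 1000.0)
print(f"Z = {z_silver:.3f} spots per ng")

# Multiplying Z by the same amount of protein recovers the spot count,
# which is a quick consistency check on the units.
print(f"spots = {species_estimate(z_silver, 1000.0):.1f}")
```

In the extrapolation step, `z_at_dl` would be read from the fitted response-to-sensitivity dependency at the chosen detection limit rather than measured directly.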
In previous work we demonstrated how to calculate the number of protein species in human blood plasma. The number of protein spots can be shown as a function of the total amount of protein applied to the 2-DE gel (Figure 2a). This same function was applied for each dye, and regions corresponding to optimal gel loading were approximated from the linear trend. This method is now generalized and was applied to data on plasma, as well as data for HepG2 and bacterial cells.
The spot-to-sensitivity function was used to estimate the number of protein species that could be detected, given unlimited sensitivity. The theoretically feasible maximum sensitivity could be either one molecule per 1 L of blood plasma (the reverse Avogadro number), or one molecule per average cell for bacteria or HepG2 cells.
This dependence (Figure 2a) can then be used to compare the experiments at fixed amounts of protein in the sample. Generally, the number of spots is a function of two parameters: sensitivity and amount of material. The dependency on sensitivity can therefore be assigned by fixing a value for the amount of protein, and the number of spots produced by different dyes can then be compared for that fixed amount of total protein on the gel. Less sensitive dyes such as Amido Black (AB) and Coomassie (CBB) produce only one or two spots for 1 μg of loaded total protein, whereas more sensitive stains such as silver thiosulfate (ST) can give 12 spots, and fluorescent dyes such as Cy5 and Cy5-sat can give even more.
Substituting x = 1 μg in the equations in Figure 2, the number of spots was plotted as a function of the dye sensitivity (Figure 2b). Each experimental point in the figure corresponds to a particular dye, and the points are well approximated by a power-law dependency, which gives a straight line on a double-log plot (R² = 0.93).
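The straight-line approximation on a double-log plot can be sketched as an ordinary least-squares fit of lg(number of spots) against S = −lg C. The data points below are made up for illustration and are not the values behind Figure 2b; the helper names are likewise hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical (S, number of spots) pairs, one per staining method.
sensitivities = [6.0, 8.0, 10.0, 12.0]
spot_counts = [1.0, 10.0, 100.0, 1000.0]

# Fit in log space: lg(spots) = slope * S + intercept.
log_spots = [math.log10(n) for n in spot_counts]
slope, intercept, r_squared = fit_line(sensitivities, log_spots)


def predict_spots(s):
    """Extrapolate the fitted line to a sensitivity S outside the data."""
    return 10 ** (slope * s + intercept)


print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.2f}")
print(f"spots at S=18: {predict_spots(18.0):.0f}")
```

Because the fit is done in log space, small changes in the slope translate into order-of-magnitude changes in the extrapolated spot count, which is worth keeping in mind when reading the estimates that follow.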
Firstly, linear regression was used to approximate the number of protein species in blood plasma (Figure 2b). For a sensitivity of one molecule per 1 μL (10⁻¹⁸ M), blood plasma could yield 14,500 spots (the benchmark of 10⁻¹⁸ M is used as the lowest clinically relevant value, enabling the detection of a biomarker shed from a cancerous focus of less than 1 mm in diameter). At the present time, high-resolution 2-DE of plasma combined with pre-separation and sample depletion can only resolve 400 spots. This discrepancy means that many potential protein biomarkers are yet to be identified and therefore cannot yet be exploited.
The dependency (Figure 2b) was also extended to the lowest detection limit, known as the reverse Avogadro number. The physiological role of ultra-rare protein species present at only one molecule per 1 L is unknown at present, but such species may well be physiologically relevant. At a detection limit of 10⁻²⁴ M the dependency shown in Figure 2b would yield 1.75 million different protein species. This value closely matches the total number of modified and unmodified protein species annotated in the NextProt database (~1.8 million, not including somatic mutations).
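For reference, the two concentration benchmarks used above follow directly from the Avogadro constant. The worked conversion below is a sketch using the rounded value N_A ≈ 6.022 × 10²³ mol⁻¹; the symbols V and N_A are introduced here for illustration only.

```latex
% Molar concentration corresponding to one molecule in a volume V
C = \frac{1}{N_A \, V}

% One molecule per 1 uL (V = 10^{-6} L)
C = \frac{1}{6.022 \times 10^{23}\,\mathrm{mol^{-1}} \times 10^{-6}\,\mathrm{L}}
  \approx 1.7 \times 10^{-18}\,\mathrm{M}, \qquad S = -\lg C \approx 17.8

% One molecule per 1 L (the reverse Avogadro number)
C = \frac{1}{6.022 \times 10^{23}\,\mathrm{mol^{-1}} \times 1\,\mathrm{L}}
  \approx 1.7 \times 10^{-24}\,\mathrm{M}, \qquad S = -\lg C \approx 23.8
```

Rounded to the nearest order of magnitude, these give the 10⁻¹⁸ M and 10⁻²⁴ M detection limits quoted in the text.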
The experiments with the eukaryotic HepG2 and bacterial cells were performed following the same protocol as was used for blood plasma, and the data on three different types of biomaterial were acquired (Figure 2c; Supplementary Table S2). The dependency was higher for HepG2 cells than for plasma, and higher still for bacterial cells. The spot-to-sensitivity function therefore appeared to be specific for the type of biomaterial: the exponent (power coefficient) was 0.75 for plasma, 1 for HepG2, and 1.5 for bacterial cells. Interestingly, this function was indistinguishable between the analyzed bacterial species, E. coli and P. furiosus.
From the trends in Figure 2c the proteome width can be probed. With plasma, at a sensitivity of 10⁻²⁴ M, millions of protein species could be distinguished. However, extrapolating to the reverse Avogadro number seems meaningless for cells because of their limited volume. The extrapolation to one molecule per cell is more meaningful; for a HepG2 cell with a 20 μm diameter this corresponds to about 10⁻¹² M. From the equation in Figure 2c, the number of protein spots expected for a single HepG2 cell was 368. In practice, however, a population of cells rather than a single cell is used for analysis. Normalizing the response Z (Figure 1c) to 1 ng of total protein loaded on the gel corresponded to 10³ HepG2 cells. Therefore, to resolve a protein species contributed by one cell out of the thousand pooled cells, a sensitivity of 10⁻¹⁵ M (10⁻¹² M diluted 10³ times) should be approached. From the spot-to-sensitivity equation (Figure 2c), at this sensitivity an average HepG2 cell could generate 18,000 different protein spots.
The smaller volume of a bacterial cell means that a concentration of one molecule per bacterium is 10⁻⁹ M rather than 10⁻¹² M for a HepG2 cell. However, many more bacterial cells are used to generate the same amount of protein sample. A sensitivity of 10⁻¹² M is therefore appropriate for detecting one protein species in 1,000 bacterial cells. Applying this function (Figure 2c), at a sensitivity of 10⁻¹² M, 6,900 spots could be generated from a typical bacterial cell.
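The per-cell concentrations quoted above can be checked with a short calculation. The sketch below assumes a spherical HepG2 cell of 20 μm diameter and, as an assumption not stated in the text, a bacterial cell volume of about 1 μm³ (1 fL); the function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def sphere_volume_litres(diameter_m):
    """Volume of a sphere in litres, given its diameter in metres."""
    radius = diameter_m / 2.0
    volume_m3 = (4.0 / 3.0) * math.pi * radius ** 3
    return volume_m3 * 1000.0  # 1 m^3 = 1000 L


def one_molecule_molarity(cell_volume_l, n_cells=1):
    """Molarity of a single molecule in the pooled volume of n_cells cells."""
    return 1.0 / (AVOGADRO * cell_volume_l * n_cells)


hepg2_volume = sphere_volume_litres(20e-6)  # 20 um diameter
bacterium_volume = 1e-15                    # assumed ~1 um^3 = 1 fL

for label, volume in (("HepG2", hepg2_volume), ("bacterium", bacterium_volume)):
    single = one_molecule_molarity(volume)
    pooled = one_molecule_molarity(volume, n_cells=1000)
    print(f"{label}: single cell {single:.1e} M (S = {-math.log10(single):.1f}), "
          f"1000 cells {pooled:.1e} M (S = {-math.log10(pooled):.1f})")
```

To the nearest order of magnitude this gives 10⁻¹² M and 10⁻¹⁵ M for HepG2, and 10⁻⁹ M and 10⁻¹² M for the bacterial cell, in line with the values used in the text.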
Our estimate of 7,000–18,000 protein species per cell may be an underestimate, due to problems associated with 2-DE [23-26]. For instance, 2-D gels have limited resolution, and a single protein spot can contain up to 20 protein species. Despite these drawbacks, our speculative estimates for the number of proteoforms appear reasonable because they rest on two straightforward correlations: the number of spots vs. the amount of protein loaded, and the number of spots vs. the sensitivity of the staining method. It should be emphasized that the proteome width can vary between different cells. Although confocal microscopy and other observational methods can investigate single cells, high-throughput proteomic approaches cannot operate at this level at the present time. The average cell is therefore used in proteomics, and this may be the average of thousands or even millions of cells. The problem of proteome heterogeneity is relevant to blood plasma as well. The proteome width of blood plasma depends on the minimal sample volume that could represent the whole diversity of proteoforms.
There are evident objections to the approach presented herein. First, the total number of proteins was estimated simply by calculation and linear regression, while the estimate is highly dependent on the staining procedure used for each dye. The described method also cannot account for the dependence of the proteome on growth stage or culture conditions in mammalian or bacterial cells. However, the overall trend is captured in the experiments, and is consistent with the difference in the dynamic range of plasma, cells and bacteria [10,28,29].
To emphasize the problem, consider Figure 3, borrowed from the publication of . It shows a normal distribution of molecules versus their concentrations, as obtained in proteomic experiments. The figure looks reassuring, since these observations are consistent with the conventional biochemical view of measurements.
However, from the 2-DE data presented in this article we conclude that such a distribution does not hold when individual molecules are observed. 2-DE was previously the main method of proteomics and was then undeservedly forgotten, although it is free of one shortcoming: it allows proteins to be observed as separate proteoforms rather than inferred from the identification of a peptide mixture. This is indicated by the thin line drawn over the picture in Figure 1.
The experiments reported herein show that the number of observed biomolecules increases without apparent limit when we increase sensitivity, selectivity or any other analytical parameter. This is comparable to the observation of stars and galaxies: whenever we construct a new telescope, we see more objects.
This is an important consequence of the curves in Figure 2, although it is not easy to accept at first. We fail to see individual molecules not merely for technical reasons such as resolution or dynamic range. Post-genome molecular science should accept that we simply do not know what we would find if all of these molecules became visible at once.
The work was done in the framework of the State Academies fundamental research program (2013-2020) and RSF grant (# 15-15-30041).