Whole-genome (re)sequencing provides new opportunities to discover Copy Number Variation (CNV) in the genome. Owing to the continuous reduction in sequencing costs, it has become the principal methodology for detecting CNV in livestock. One parameter that increases the genotyping cost is the depth of coverage during sequencing. The main aim of this note was to assess how CNV identification varies with depth of coverage and read-length in genome sequencing. The results indicate that sequences obtained with short read-lengths require less depth of coverage than those obtained with long read-lengths. In addition, small CNV require deeper coverage to be detected. These results can reduce discovery and genotyping costs, since sequencing technologies with short read-lengths are often less costly. Finally, a general formula was derived to optimize sequencing costs.
Keywords: Copy number variation, Depth of coverage, Livestock, Read-length
Copy Number Variation (CNV) represents a significant source of genetic diversity in mammals, covering ~12% of the genome, and has been shown to be associated with phenotypes (diseases/traits) in humans. Next-Generation Sequencing (NGS) technology allows whole-genome (re)sequencing at very low cost per sequence and provides a wealth of information to tackle genetic problems, such as identifying the molecular basis of complex traits that are difficult to study with conventional approaches. To discover (detect, validate and characterize) or genotype CNV on whole genome sequences, array Comparative Genomic Hybridization (aCGH) has so far been the most widely used technique. In aCGH experiments, genomic DNA samples are co-hybridized on the same oligonucleotide array, and differences in genomic variation relative to the reference sample lead to CNV detection.
Studies that identified CNV using aCGH are currently available for cattle [5,6], chicken, swine and goat. The sequencing effort and its cost are an important limit on the identification of CNV in livestock populations. One of the parameters that most strongly affects genotyping cost is the sequencing coverage. The main aim of this note is to assess the effects of depth of coverage (X) and read-length of the sequencer (RL) on the accuracy of the estimated number of copies present in a CNV. The parameter values considered are intended to represent the most common resequencing technologies available.
First, we need the number of reads (Nr) of the sequence, which is calculated as
$$N_r = \frac{L_g \cdot X}{RL}$$
where Lg is the genome length, X is the depth of coverage and RL is the read-length of the sequencer.
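As a quick illustration, the number of reads follows from the usual definition of depth of coverage, X = Nr · RL / Lg. The sketch below evaluates it for the genome length, read-lengths and depths considered in this note; the function name is only illustrative.

```python
def number_of_reads(genome_length_bp, read_length_bp, coverage):
    """Reads needed to cover a genome of the given length at depth X.

    Rearranges the definition of coverage, X = Nr * RL / Lg.
    """
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return coverage * genome_length_bp / read_length_bp


LG = 2_344_000_000  # bovine genome length in bp (2,344 Mb, as used in this note)

for rl in (30, 90, 150, 300):      # read-lengths in bp
    for x in (10, 20, 30):         # depths of coverage
        nr = number_of_reads(LG, rl, x)
        print(f"RL={rl:>3} bp, X={x:>2}: Nr = {nr:,.0f}")
```

For a fixed depth, shorter reads imply proportionally more reads, which is what drives the differences in count accuracy discussed later.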
The number of reads that fall within a CNV (K) is a function of the number of tandem repeats (copies) within the CNV (n), the size of each copy (S), RL and Lg, and is calculated as follows:
Assuming a Poisson distribution of K (X, Klambauer), the variance of K equals its mean:
$$\mathrm{Var}(K) = K$$
Finally, the coefficient of variation of the number of counts (CV) is
$$CV(K) = \frac{\sqrt{\mathrm{Var}(K)}}{K} = \frac{1}{\sqrt{K}} \qquad (3)$$
which is used as a measure of the accuracy of the estimates of K.
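The following minimal sketch shows how the coefficient of variation behaves under a Poisson model. Note the assumption: the expression used for K here, K = Nr · n · S / Lg = X · n · S / RL, is a simple approximation (reads placed uniformly, edge effects ignored) adopted for illustration only, and it may differ from the equation for K used in this note. The function names are hypothetical.

```python
import math


def expected_reads_in_cnv(coverage, read_length_bp, copies, copy_size_bp):
    """Approximate expected read count K inside a CNV.

    ASSUMPTION (not taken from the note): reads are uniformly placed and
    edge effects are ignored, so K = Nr * n * S / Lg = X * n * S / RL.
    """
    return coverage * copies * copy_size_bp / read_length_bp


def cv_of_count(k):
    """CV of a Poisson count: Var(K) = K, so CV = sqrt(K) / K = 1 / sqrt(K)."""
    if k <= 0:
        raise ValueError("expected count must be positive")
    return 1.0 / math.sqrt(k)


# Small CNV (1.6 kb, single copy) at the read-lengths and depths considered here.
for rl in (30, 90, 150, 300):
    for x in (10, 20, 30):
        k = expected_reads_in_cnv(x, rl, copies=1, copy_size_bp=1_600)
        print(f"RL={rl:>3} bp, X={x:>2}: K ~ {k:8.1f}, CV ~ {cv_of_count(k):.3f}")
```

Under this approximation the CV shrinks with deeper coverage, shorter reads and larger CNV, because each of these increases the expected count K.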
The input parameters used to assess the accuracy of K were: the length of the bovine genome (Lg = 2,344 megabases) as reported previously, the CNV size (S = 1 to 200 kb) according to the results of Fadista, the read-length (RL = 30, 90, 150 and 300 bp) and the depth of coverage of the sequence (X = 10, 20 and 30).
To evaluate the number of times that random fragments of read-length size (RL = 30, 90, 150 and 300 bp) fell inside one CNV, three CNV sizes were considered: 1.6, 105.5 and 220.1 kb per copy. The fragments were extracted randomly in silico from the bovine genome sequence, and their number corresponded to the number of reads of the genome. The proportion of fragments inside a CNV [E] was estimated as follows:
$$E = \frac{\text{number of read fragments counted inside the CNV}}{N_r}$$
where Nr is the number of reads.
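The in silico procedure can be sketched as a small simulation. This is not the authors' pipeline: it assumes that fragment start positions are drawn uniformly along a region and that a fragment counts as "inside" only when it lies entirely within the CNV. The region length, CNV position and read count below are hypothetical values chosen so the example runs quickly.

```python
import random


def proportion_inside_cnv(region_length, cnv_start, cnv_size,
                          read_length, n_reads, seed=0):
    """Estimate E = (fragments fully inside the CNV) / Nr by random sampling."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    cnv_end = cnv_start + cnv_size
    inside = 0
    for _ in range(n_reads):
        # Uniform start position such that the fragment fits in the region.
        start = rng.randrange(0, region_length - read_length + 1)
        if cnv_start <= start and start + read_length <= cnv_end:
            inside += 1
    return inside / n_reads


# Hypothetical toy region: 1 Mb containing a 100 kb CNV, 100 bp fragments.
e_hat = proportion_inside_cnv(
    region_length=1_000_000, cnv_start=400_000, cnv_size=100_000,
    read_length=100, n_reads=100_000,
)
print(f"Estimated proportion of fragments inside the CNV: {e_hat:.4f}")
```

With these toy values the estimate should land close to the ratio of CNV size to region length (about 0.1), with sampling noise that shrinks as the number of reads grows.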
The coefficient of variation of K is a function of the read-length and of the depth of coverage. CV(K) decreases with shorter reads and deeper coverage, as shown in Figure 1.
Additionally, as the CNV length increases, the coefficient of variation decreases, independently of the depth of coverage and the read-lengths tested here. The number of fragments included in a CNV extracted in silico differs only marginally from the prediction given by the formula (Table 1).
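Table 1: Estimated number of copies present in one copy number variation (CNV) for varying read-length, depth of coverage and CNV size.
Columns: RL (a); Nr (b); number of fragments inside the CNV in silico (c) for CNV sizes of 1.6 kb, 105.5 kb and 220.1 kb; number of fragments inside the CNV by formula (d) for CNV sizes of 1.6 kb, 105.5 kb and 220.1 kb.
Note: (a) RL = read-length; (b) Nr = number of fragments; (c) number of fragments inside the CNV in silico; (d) number of fragments inside the CNV by formula.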
The in silico experiment was repeated and the results did not change, because the read-length was constant. The proportions of reads inside a small (1.6 kb), a medium (105.5 kb) or a large (220.1 kb) CNV (0.001%, 0.075% and 0.155%, respectively) were the same at 10X, 20X and 30X. This shows that the depth of coverage did not affect the expected copy number estimate of the genotyping, only the accuracy of this estimate.
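A short worked derivation may help to see why depth affects only the accuracy. It assumes the simple approximation K ≈ Nr · n · S / Lg (uniform read placement, edge effects ignored) together with Nr = X · Lg / RL and a Poisson-distributed K; this approximation is an assumption made here for illustration and is not necessarily the exact expression used in this note.

```latex
\begin{align}
E &= \frac{K}{N_r} \approx \frac{n\,S}{L_g}
  && \text{(independent of } X \text{)} \\
CV(K) &= \frac{1}{\sqrt{K}} \approx \sqrt{\frac{RL}{X\,n\,S}}
  && \text{(decreases as } X \text{ increases)}
\end{align}
```

Under these assumptions, X cancels out of the expected proportion but remains in the coefficient of variation, which matches the behaviour described above.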
In cattle, the average size of CNV is 72.3 kb, with a median of 16.7 kb (Min = 1.7 kb; Max = 2,031 kb). The detection and genotyping of CNV by sequencing depend on the read-length of the sequencer and the size of the CNV. These parameters must be taken into account to determine the depth of coverage required to minimize the cost of genotyping by whole-genome or targeted (re)sequencing. If whole-genome (re)sequencing is used, deep coverage is recommended so that even the smallest CNV can be genotyped accurately. However, when a specific region is sequenced to detect one or several CNV, the formula can be used to optimize the depth of coverage and increase the accuracy of CNV genotyping. Its application is not restricted to cattle; it can also be used in other organisms for which knowledge is less advanced, in order to optimize the economic effort.
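To make the optimization concrete, the sketch below inverts the Poisson coefficient of variation, CV = 1/sqrt(K), to obtain the depth needed for a chosen accuracy. It relies on the illustrative approximation K = X · n · S / RL (uniform read placement, edge effects ignored), which is an assumption and may differ from the formula derived in this note; the target CV of 0.1 and the function name are likewise hypothetical.

```python
def required_coverage(target_cv, read_length_bp, copies, copy_size_bp):
    """Depth X needed so that the Poisson CV of the read count meets target_cv.

    ASSUMPTION (for illustration): K = X * n * S / RL, hence
    CV = 1 / sqrt(K)  =>  X = RL / (n * S * CV**2).
    """
    if target_cv <= 0:
        raise ValueError("target CV must be positive")
    return read_length_bp / (copies * copy_size_bp * target_cv ** 2)


TARGET_CV = 0.1  # hypothetical accuracy target

# CNV sizes in bp: smallest tested here, cattle median, cattle mean.
for size_bp in (1_600, 16_700, 72_300):
    for rl in (30, 90, 150, 300):
        x = required_coverage(TARGET_CV, rl, copies=1, copy_size_bp=size_bp)
        print(f"S={size_bp:>6} bp, RL={rl:>3} bp: X >= {x:.2f}")
```

Under this approximation, the required depth grows linearly with read-length and falls with CNV size, so a small CNV sequenced with long reads is the most demanding case.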
An inherent problem of NGS data is the considerable read-mapping ambiguity. Several methods for detecting CNV are based on read depth and assume a Poisson distribution. Recently, several completely sequenced genomes were examined, and the Poisson assumption was found to be violated by some NGS technologies. Despite this, the results of this study show that sequences obtained with shorter read-lengths require less depth of coverage, and that deeper coverage is required when small CNV are sought. Based on this conclusion, an advantage for the scientific community is that technologies with shorter read-lengths also tend to be less costly.
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 222664 (“Quantomics”).
This Publication reflects only the author’s views and the European Community is not liable for any use that may be made of the information contained herein.