ENCODE Project at NHGRI Quality Metrics

The ENCODE consortium analyzes the quality of the data produced using a variety of metrics. This page includes spreadsheets showing quality metrics for many ENCODE datasets, along with descriptions of what these metrics are, and what they appear to measure. These quality metrics will be updated on occasion to include analysis of more recent data.

It is important to note that the development of quality metrics for evaluating epigenomic assays is an area of research, so standards are emerging as more metrics are used with more datasets and types of experiments. The typical values for a quality metric can differ considerably between experiment types, or even between different features in the same experiment type, such as different antibodies used in ChIP-seq experiments. Currently, no single measurement identifies all high-quality or low-quality samples. As with quality control for other types of experiments, multiple assessments (including manual inspection of tracks) are useful because they may capture different concerns. Comparison within an experimental method can be very helpful in identifying stochastic error in experiments (for example, comparing replicates to each other, comparing values for one antibody in several cell types, or comparing the same antibody and cell type in different labs).

Many of the software tools used for quality metrics, along with their citations, can be found on the Software Tools page. Other documents related to data quality are the Data Standards page and the antibody validation documents accessed from the Antibodies page. For questions about the ENCODE Quality Metrics, contact Mike Pazin, NHGRI.

Updated 26 March 2014

Quality metric spreadsheets

Datasets are divided into DNase-seq, FAIRE-seq, TF ChIP-seq, Histone ChIP-seq, and ChIP Controls. The ReadMe worksheet provides a summary description of the metrics (described in more detail below). The Assays, Cells, and Treatments are defined in the ENCODE Controlled Vocabulary of registered terms. The Identifiers reference the dataset at UCSC (via the Table Browser, metaDb table) and the ENCODE analysis site (filenames).

Download spreadsheets in: Excel format (.xls), OpenOffice spreadsheet format (.ods)

Definitions of quality metrics

Uniquely mappable reads (N_uniq map reads):
The number of sequence reads for this sample that can be aligned to a single genomic location; this does not distinguish between reads that were obtained multiple times (redundant reads) and reads obtained only once (non-redundant reads). A larger number of reads from a sufficiently complex library increases the chances of finding all true binding sites; however, the number of reads required is not known with certainty, and likely depends on enrichment, antibody quality in ChIP experiments, and the fraction of the genome containing the feature being measured.

Self-consistent peaks, IDR n (Self Cons IDR):
An estimate of the number of enriched regions in a single sample. A dataset is divided into 2 pseudo-replicates that are analyzed by peak-calling at relaxed stringency followed by IDR filtering at the indicated IDR threshold.
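
The division into pseudo-replicates can be sketched as follows. The text above does not say how the split is made; this sketch assumes reads are shuffled at random and cut into two halves, and the function name pseudo_replicates and the seed parameter are hypothetical. Peak calling and IDR filtering are not shown.

```python
import random

def pseudo_replicates(reads, seed=0):
    """Split reads into two pseudo-replicates of (nearly) equal size.

    Assumption: random assignment; the source page does not specify the method.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = list(reads)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Toy input: 11 read identifiers (hypothetical).
rep_a, rep_b = pseudo_replicates(range(11))
print(len(rep_a), len(rep_b))  # 5 6
```

Each pseudo-replicate would then be passed to the peak caller separately, and the two peak lists compared.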

Replicate-consistent peaks, IDR n (Rep Cons IDR):
The number of enriched regions, determined by applying IDR (Irreproducible Discovery Rate) to this sample and a replicate. Potential enriched regions are identified using a peak caller at very low stringency, then the IDR method is used to determine which peaks are signal and which are noise, at the indicated IDR threshold. Because this analysis is performed on pairs of datasets, the number of peaks reported is identical for the two datasets in a pair.

Signal Portion of Tags (SPOT):
A measure of enrichment, analogous to the commonly used fraction of reads in peaks metric. SPOT calculates the fraction of reads that fall in tag-enriched regions identified using the Hotspot program, from a sample of 5 million reads (Hotspot and SPOT are described on the ENCODE Software Tools page). Note that because methods of measuring enrichment based on the fraction of reads that fall in peaks are sensitive to how enriched regions are determined, comparison is possible only when using the identical peak caller and parameters. Larger SPOT values indicate higher signal to noise; 1.0 is the maximum possible value (all reads are signal) and 0 is the minimum possible value (all reads are noise). For FAIRE, more than 10 million reads are typically required to reliably detect peaks.
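
The fraction-of-reads idea behind SPOT can be illustrated with a minimal sketch. This is not the Hotspot program: the enriched regions are simply given as input, and the function name fraction_in_regions, the half-open interval convention, and the toy data are assumptions made for illustration.

```python
from bisect import bisect_right

def fraction_in_regions(read_positions, regions):
    """Fraction of reads whose position lies in any [start, end) region.

    Assumes the regions do not overlap; they are sorted here by start.
    """
    if not read_positions:
        raise ValueError("no reads supplied")
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    hits = 0
    for pos in read_positions:
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if i >= 0 and pos < regions[i][1]:
            hits += 1
    return hits / len(read_positions)

# Toy data (hypothetical): 8 read positions and 2 enriched regions.
reads = [5, 12, 18, 40, 55, 70, 71, 90]
hotspots = [(10, 20), (50, 75)]
print(fraction_in_regions(reads, hotspots))  # 0.625
```

Here 5 of the 8 reads fall inside an enriched region, giving 0.625.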

PCR Bottleneck Coefficient (PBC):
A measure of library complexity, i.e. how skewed the distribution of read counts per location is towards 1 read per location.

PBC = N1/Nd

(where N1 = the number of genomic locations to which EXACTLY one uniquely mapping read maps, and Nd = the number of genomic locations to which AT LEAST one uniquely mapping read maps, i.e. the number of non-redundant, uniquely mapping reads).

PBC is further described on the ENCODE Software Tools page. Provisionally, 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking, 0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. Very low values can indicate a technical problem, such as PCR bias, or a biological finding, such as a very rare genomic feature. Nuclease-based assays (DNase, MNase) detecting features with base-pair resolution (transcription factor footprints, positioned nucleosomes) are expected to recover the same read multiple times, resulting in a lower PBC score for these assays. Note that the most complex library, random DNA, would approach 1.0; thus the very highest values can also indicate technical problems with libraries. Some labs outside of ENCODE remove redundant reads; once this has been done, the value of this metric is 1.0 and the metric is no longer meaningful. Among ENCODE datasets, 82% of TF ChIP, 89% of histone ChIP, 77% of DNase, 98% of FAIRE, and 97% of control datasets have no or mild bottlenecking.
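
The PBC formula and the provisional ranges above can be sketched as follows. The function names pbc and bottleneck_category are hypothetical, and because the quoted ranges share their endpoints, the choice to place a boundary value in the higher category is an assumption.

```python
from collections import Counter

def pbc(locations):
    """PBC = N1 / Nd for an iterable of uniquely mapped read locations.

    Each location is any hashable key, e.g. (chromosome, position, strand).
    """
    counts = Counter(locations)
    if not counts:
        raise ValueError("no reads supplied")
    n1 = sum(1 for c in counts.values() if c == 1)  # locations with exactly one read
    nd = len(counts)                                # locations with at least one read
    return n1 / nd

def bottleneck_category(value):
    """Map a PBC value onto the provisional ranges (boundary handling assumed)."""
    if value < 0.5:
        return "severe"
    if value < 0.8:
        return "moderate"
    if value < 0.9:
        return "mild"
    return "none"

# Toy data (hypothetical): 5 reads covering 4 distinct locations.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-"),
         ("chr2", 40, "+"), ("chr2", 90, "-")]
score = pbc(reads)
print(score, bottleneck_category(score))  # 0.75 moderate
```

In this toy input, three of the four occupied locations carry exactly one read, so PBC = 3/4.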

Normalized Strand Cross-correlation coefficient (NSC):
A measure of enrichment derived without dependence on prior determination of enriched regions. Forward and reverse strand read coverage signal tracks are computed (the number of uniquely mapping read starts at each base in the genome, counted separately on the + and - strands). The forward and reverse tracks are shifted towards and away from each other by incremental distances, and for each shift the Pearson correlation coefficient is computed. In this way, a cross-correlation profile is computed, representing the correlation between forward and reverse strand coverage at different shifts. The highest cross-correlation value is obtained at a strand shift equal to the predominant fragment length in the dataset, as a result of clustering/enrichment of relatively fixed-size fragments around the binding sites of the target factor or feature.
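
The cross-correlation profile described above can be sketched on toy data. This is a simplified illustration, not the tool used by ENCODE: it uses short per-base count lists rather than a genome, considers only non-negative shifts, and the names pearson and cross_correlation_profile are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant track")
    return cov / sqrt(var_x * var_y)

def cross_correlation_profile(plus, minus, max_shift):
    """Correlate plus[i] with minus[i + shift] for each shift in 0..max_shift.

    plus and minus are per-base counts of read starts on each strand.
    """
    if len(plus) != len(minus):
        raise ValueError("strand tracks must have the same length")
    profile = {}
    for shift in range(max_shift + 1):
        n = len(plus) - shift  # number of overlapping bases after shifting
        profile[shift] = pearson(plus[:n], minus[shift:shift + n])
    return profile

# Toy tracks (hypothetical): two sites, minus-strand starts offset by 6 bases.
plus = [0] * 40
minus = [0] * 40
for site, depth in [(5, 3), (20, 2)]:
    plus[site] = depth
    minus[site + 6] = depth

profile = cross_correlation_profile(plus, minus, max_shift=10)
best_shift = max(profile, key=profile.get)
print(best_shift)  # 6
```

The profile peaks at a shift of 6, the offset built into the toy tracks, which plays the role of the predominant fragment length.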

The NSC is the ratio of the maximal cross-correlation value (which occurs at a strand shift equal to the fragment length) to the background cross-correlation (the minimum cross-correlation value over all possible strand shifts). Higher values indicate more enrichment, values less than 1.1 are relatively low NSC scores, and the minimum possible value is 1 (no enrichment). This score is sensitive to technical effects; for example, high-quality antibodies such as those against H3K4me3 and CTCF score well for all cell types and ENCODE production groups, and variation in enrichment in particular IPs is detected as stochastic variation. This score is also sensitive to biological effects; narrow marks score higher than broad marks (H3K4me3 vs H3K36me3, H3K27me3) for all cell types and ENCODE production groups, and features present in some individual cells in a population, but not others, are expected to have lower scores.
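
The NSC ratio can be sketched from a cross-correlation profile supplied as a mapping from strand shift to correlation. The function name nsc and the profile values are hypothetical, and the sketch assumes the profile maximum lies at the fragment length, as stated above for enriched data.

```python
def nsc(profile):
    """Normalized strand cross-correlation from a {shift: correlation} profile.

    Assumes the profile maximum occurs at the fragment length and that the
    minimum (background) correlation is positive.
    """
    background = min(profile.values())
    if background <= 0:
        raise ValueError("background cross-correlation must be positive")
    return max(profile.values()) / background

# Hypothetical profile values, for illustration only.
profile = {0: 0.20, 50: 0.21, 100: 0.23, 150: 0.26, 200: 0.24, 250: 0.21}
value = nsc(profile)
print(round(value, 2))  # 1.3
```

A completely flat profile gives the minimum value of 1, matching the "no enrichment" case.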

Relative Strand Cross-correlation coefficient (RSC):
A measure of enrichment derived without dependence on prior determination of enriched regions. Forward and reverse strand read coverage signal tracks are computed (the number of uniquely mapping read starts at each base in the genome, counted separately on the + and - strands). The forward and reverse tracks are shifted towards and away from each other by incremental distances, and for each shift the Pearson correlation coefficient is computed. In this way, a cross-correlation profile is computed, representing the correlation between forward and reverse strand coverage at different shifts. The highest cross-correlation value is obtained at a strand shift equal to the predominant fragment length in the dataset, as a result of clustering/enrichment of relatively fixed-size fragments around the binding sites of the target factor. For short-read datasets (< 100 bp reads) and large genomes with a significant number of non-uniquely mappable positions (e.g., human and mouse), a cross-correlation phantom peak is also observed at a strand shift equal to the read length. This read-length peak is an effect of the variable and dispersed mappability of positions across the genome. For a significantly enriched dataset, the fragment-length cross-correlation peak (representing clustering of fragments around target sites) should be larger than the mappability-based read-length peak.

The RSC is the ratio of the fragment-length cross-correlation value minus the background cross-correlation value, divided by the phantom-peak cross-correlation value minus the background cross-correlation value. The minimum possible value is 0 (no signal), highly enriched experiments have values greater than 1, and values much less than 1 may indicate low quality.
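
The RSC ratio can be sketched directly from its three inputs. The function and parameter names and the numeric values are hypothetical; in practice the three values would be read off a cross-correlation profile at the fragment length, at the read length, and at its minimum.

```python
def rsc(cc_fragment, cc_phantom, cc_background):
    """Relative strand cross-correlation.

    (fragment-length value - background) / (phantom-peak value - background)
    """
    denominator = cc_phantom - cc_background
    if denominator <= 0:
        raise ValueError("phantom-peak value must exceed the background value")
    return (cc_fragment - cc_background) / denominator

# Hypothetical cross-correlation values, for illustration only.
value = rsc(cc_fragment=0.26, cc_phantom=0.23, cc_background=0.20)
print(round(value, 2))  # 2.0
```

With these values the fragment-length peak rises twice as far above background as the phantom peak, giving an RSC of about 2.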

Definitions of ChIP-seq specific quality metrics

MACS FDR 0.01:
This is the number of enriched regions identified by MACS using an FDR threshold of 0.01 (1%).

Under seq:
If set to 1, the dataset was manually annotated as likely to be undersequenced.

Diff rep:
If set to 1, this row is a replicate that differs from the other replicates (based on self-consistency, NSC, or RSC). Therefore, this sample should not be used for replicate-based comparisons such as IDR.

Manual low S/N:
If set to 1, the data were manually annotated as having low signal to noise. This could be the result of under-sequencing, poor enrichment during ChIP, poor antibody quality, or the biological nature of the feature being examined.

Auto low S/N:
If set to 1, the data have low signal to noise, scored automatically as NSC < 1.09 and RSC < 0.9. This could be the result of under-sequencing, poor enrichment during ChIP, poor antibody quality, or the biological nature of the feature being examined.
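
The automatic flag can be sketched as a simple test on the two scores, using the cutoffs quoted above. The function name auto_low_sn and its parameter names are hypothetical.

```python
def auto_low_sn(nsc, rsc, nsc_cutoff=1.09, rsc_cutoff=0.9):
    """Return 1 when both scores fall below their cutoffs, otherwise 0."""
    return 1 if (nsc < nsc_cutoff and rsc < rsc_cutoff) else 0

print(auto_low_sn(1.05, 0.6))  # 1
print(auto_low_sn(1.30, 0.6))  # 0
```

Both conditions must hold for the flag to be set, so a dataset with a low RSC but an NSC at or above 1.09 is not flagged.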

Revoke flag:
These datasets have been revoked by the production lab. R = revoked dataset, D = duplicate good dataset.