Uniquely mappable reads (N_uniq map reads):
The number of sequence reads for this sample that can be aligned to a single genomic
location. This count does not distinguish between reads that were obtained multiple times
(redundant reads) and reads obtained only once (non-redundant reads). A larger number of
reads from a sufficiently complex
library increases the chances of finding all true binding sites; however, the
number of reads required is not known with certainty, and likely depends on
enrichment, antibody quality in ChIP experiments, and the fraction of the genome
containing the feature being measured.
Self-consistent peaks, IDR n (Self Cons IDR):
An estimate of the number of enriched regions in a single sample. A dataset
is divided into two pseudo-replicates, which are analyzed by peak calling at relaxed
stringency followed by IDR (Irreproducible Discovery Rate) filtering at the indicated IDR threshold.
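The pseudo-replicate step can be sketched as follows. This is a minimal illustration that assumes reads are assigned to the two halves at random; the text above does not say how the split is made. The function name `split_pseudoreplicates` and the `seed` parameter are hypothetical, and peak calling and IDR filtering are not shown.

```python
import random

def split_pseudoreplicates(reads, seed=0):
    """Randomly split reads into two pseudo-replicates of near-equal size.

    Assumption: a random split; a fixed seed keeps the split reproducible.
    """
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Toy read identifiers standing in for aligned reads.
reads = [f"read_{i}" for i in range(11)]
pr1, pr2 = split_pseudoreplicates(reads)
```

Each pseudo-replicate would then go through relaxed-stringency peak calling before the IDR comparison.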
Replicate-consistent peaks, IDR n (Rep Cons IDR):
The number of enriched regions determined by applying IDR (Irreproducible Discovery Rate)
analysis to this sample and a replicate. Potential enriched regions are identified using a
peak caller at very low stringency; the IDR method then determines which peaks
are signal and which are noise at the indicated IDR threshold. Because this analysis is
performed on pairs of datasets, the reported number of peaks is identical for both
datasets in a pair.
Signal Portion of Tags (SPOT):
A measure of enrichment, analogous to the commonly used
fraction-of-reads-in-peaks metric. SPOT calculates the fraction of reads that fall in
tag-enriched regions identified using the Hotspot program, from a sample of 5 million reads
(Hotspot and SPOT are described on the ENCODE Software Tools page). Note that because methods of
measuring enrichment based on determining the fraction of reads that fall in peaks are
sensitive to the determination of enriched regions, comparison is possible only when using
the identical peak caller and parameters. Larger SPOT values indicate higher signal-to-noise;
1.0 is the maximum possible value (all reads are signal) and 0 is the minimum possible
value (all reads are noise). For FAIRE, more than 10 million reads are typically required to
reliably detect peaks.
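The fraction underlying SPOT can be sketched as below. This is not the Hotspot program: the enriched regions are taken as given input, the 5 million read subsample is not drawn, and the function name `fraction_in_regions` and the half-open interval convention are assumptions made for illustration.

```python
def fraction_in_regions(read_positions, regions):
    """Fraction of reads whose position falls inside any enriched region.

    read_positions: iterable of (chrom, pos)
    regions: iterable of (chrom, start, end), treated as half-open [start, end)
    """
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))
    total = 0
    inside = 0
    for chrom, pos in read_positions:
        total += 1
        if any(start <= pos < end for start, end in by_chrom.get(chrom, [])):
            inside += 1
    return inside / total if total else 0.0

# Toy input: two of the four reads fall inside an enriched region.
reads = [("chr1", 105), ("chr1", 150), ("chr1", 500), ("chr2", 20)]
hotspots = [("chr1", 100, 200), ("chr2", 0, 10)]
spot = fraction_in_regions(reads, hotspots)
```

Because the result depends entirely on which regions are passed in, two values are comparable only when the regions come from the same peak caller and parameters.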
PCR Bottleneck Coefficient (PBC):
A measure of library complexity, i.e. how skewed the distribution of read counts per location
is towards 1 read per location.
PBC = N1/Nd
(where N1 = the number of genomic locations to which EXACTLY one uniquely mapping read maps,
and Nd = the number of genomic locations to which AT LEAST one uniquely mapping read maps,
i.e. the number of non-redundant, uniquely mapping reads).
PBC is further described on the ENCODE Software Tools page. Provisionally, 0-0.5 is severe bottlenecking, 0.5-0.8
is moderate bottlenecking, 0.8-0.9 is mild bottlenecking, while 0.9-1.0 is no bottlenecking.
Very low values can indicate a technical problem, such as PCR bias, or a biological finding,
such as a very rare genomic feature. Nuclease-based assays (DNase, MNase) detecting features
with base-pair resolution (transcription factor footprints, positioned nucleosomes) are
expected to recover the same read multiple times, resulting in a lower PBC score for these
assays. Note that the most complex library, random DNA, would approach 1.0; thus the very
highest values can indicate technical problems with libraries. Some labs outside of ENCODE
remove redundant reads; once this has been done, the value of this metric is 1.0 and the
metric is no longer meaningful. 82% of TF ChIP, 89% of histone ChIP, 77% of
DNase, 98% of FAIRE, and 97% of control ENCODE datasets have no or mild bottlenecking.
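The PBC formula and the provisional ranges above can be sketched as follows. The representation of a location as a (chromosome, strand, start) tuple is an assumption, as is the handling of the boundary values (0.5, 0.8, 0.9), which the stated ranges share between adjacent categories. The function names are hypothetical.

```python
from collections import Counter

def pbc(read_locations):
    """PBC = N1 / Nd for an iterable of uniquely mapped read locations."""
    counts = Counter(read_locations)
    n_distinct = len(counts)                              # Nd: locations with >= 1 read
    n_one = sum(1 for c in counts.values() if c == 1)     # N1: locations with exactly 1 read
    if n_distinct == 0:
        raise ValueError("no reads supplied")
    return n_one / n_distinct

def bottleneck_label(value):
    """Provisional category; boundary values go to the higher range (assumption)."""
    if value < 0.5:
        return "severe"
    if value < 0.8:
        return "moderate"
    if value < 0.9:
        return "mild"
    return "none"

# Toy input: 8 reads over 5 distinct locations, 3 of which are hit exactly once.
locations = [
    ("chr1", "+", 100), ("chr1", "+", 100),
    ("chr1", "-", 250),
    ("chr2", "+", 40),
    ("chr2", "+", 90), ("chr2", "+", 90), ("chr2", "+", 90),
    ("chr3", "-", 7),
]
value = pbc(locations)
label = bottleneck_label(value)
```

Running this on reads that have already been deduplicated returns 1.0, which is why the metric is uninformative for such data.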
Normalized Strand Cross-correlation coefficient (NSC):
A measure of enrichment derived without dependence on prior determination of enriched regions.
Forward and reverse strand read coverage signal tracks are computed (number of unique mapping
read starts at each base in the genome on the + and - strand counted separately). The forward
and reverse tracks are shifted towards and away from each other by incremental distances and
for each shift, the Pearson correlation coefficient is computed. In this way, a cross-correlation
profile is computed, representing the correlation between forward and reverse strand coverage at
different shifts. The highest cross-correlation value is obtained at a strand shift equal to the
predominant fragment length in the dataset, as a result of clustering/enrichment of relatively
fixed-size fragments around the binding sites of the target factor or feature.
The NSC is the ratio of the maximal cross-correlation value (which occurs at a strand shift equal
to the fragment length) to the background cross-correlation (the minimum cross-correlation value
over all possible strand shifts). Higher values indicate more enrichment, values less than 1.1 are
relatively low NSC scores, and the minimum possible value is 1 (no enrichment). This score is
sensitive to technical effects; for example, high-quality antibodies, such as those against H3K4me3
and CTCF, score well for all cell types and ENCODE production groups, and variation in enrichment
in particular IPs is detected as stochastic variation. This score is also sensitive to biological effects; narrow marks
score higher than broad marks (H3K4me3 vs H3K36me3, H3K27me3) for all cell types and ENCODE production
groups, and features present in some individual cells, but not others, in a population are expected
to have lower scores.
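The cross-correlation profile and the NSC ratio described above can be sketched on toy data as follows. This is an illustration only, not the ENCODE implementation: the tracks are short synthetic lists, only shifts from 0 to a chosen `max_shift` are scanned rather than all possible shifts, and the function names are hypothetical. The toy tracks include a regional coverage difference so that the background correlation stays positive.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cross_correlation_profile(fwd, rev, max_shift):
    """Map each shift s to the correlation of fwd[i] with rev[i + s]."""
    profile = {}
    for s in range(max_shift + 1):
        n = len(fwd) - s
        profile[s] = pearson(fwd[:n], rev[s:])
    return profile

def nsc(profile):
    """Maximal cross-correlation divided by the minimum over scanned shifts."""
    return max(profile.values()) / min(profile.values())

# Synthetic read-start tracks (made-up values): uneven baseline coverage
# plus four sites where reverse-strand starts sit FRAG_LEN bases
# downstream of forward-strand starts.
GENOME_LEN, FRAG_LEN = 400, 30
fwd = [1 if i < 200 else 0 for i in range(GENOME_LEN)]
rev = list(fwd)
for site in (50, 120, 260, 330):
    fwd[site] += 10
    rev[site + FRAG_LEN] += 10

profile = cross_correlation_profile(fwd, rev, max_shift=60)
estimated_fragment_length = max(profile, key=profile.get)
nsc_value = nsc(profile)
```

On this toy input the profile peaks at the shift equal to `FRAG_LEN`, mirroring how the fragment-length peak arises from fragments clustering around binding sites.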
Relative Strand Cross-correlation coefficient (RSC):
A measure of enrichment derived without dependence on prior determination of enriched regions.
Forward and reverse strand read coverage signal tracks are computed (number of unique mapping read
starts at each base in the genome on the + and - strand counted separately). The forward and reverse
tracks are shifted towards and away from each other by incremental distances and for each shift, the
Pearson correlation coefficient is computed. In this way, a cross-correlation profile is computed
representing the correlation values between forward and reverse strand coverage at different shifts.
The highest cross-correlation value is obtained at a strand shift equal to the predominant fragment
length in the dataset, as a result of clustering/enrichment of relatively fixed-size fragments around
the binding sites of the target factor. For short-read datasets (< 100 bp reads) and large genomes
with a significant number of non-uniquely mappable positions (e.g., human and mouse), a
cross-correlation phantom-peak is also observed at a strand-shift equal to the read length. This
read-length peak is an effect of the variable and dispersed mappability of positions across the
genome. For a significantly enriched dataset, the fragment length cross-correlation peak (representing
clustering of fragments around target sites) should be larger than the mappability-based read-length peak.
The RSC is the fragment-length cross-correlation value minus the background cross-correlation
value, divided by the phantom-peak cross-correlation value minus the background cross-correlation
value. The minimum possible value is 0 (no signal); highly enriched experiments have values greater
than 1, and values much less than 1 may indicate low quality.