Differential analysis of HiChIP loops

FitHiChIP also supports finding differential HiChIP loops between two categories (such as two different cell types), each with one or multiple replicates.

The script Differential_Analysis_Script.sh contains command to execute differential analysis.

Source code of the differential analysis is provided in Imp_Scripts/DiffAnalysisHiChIP.r

Basics

The objective is to find differential HiChIP loops among two categories (each with one or more replicates).

In addition, if user provides ChIP-seq alignment files for the input categories (details below), the interacting 1D bins in these differential loops are categorized as differential or non-differential 1D (ChIP) bins.

Specifically, the code categorizes 1D bins in three classes, depending on their difference in ChIP-seq coverage among the two input categories:

  1. HD (i.e. high difference) - when the 1D bin is differential (by EdgeR) according to its ChIP-seq coverage in the two input categories.

  2. ND (i.e. no difference) - when the 1D bin is non-differential (by EdgeR) and has < X% difference in ChIP-seq coverage in the two input categories.

    • Here X is by default 25, and is determined by the parameter --CovThr (mentioned below)

  3. LD (i.e. low difference) - when the 1D bin is non-differential (by EdgeR) but has >= X% difference in ChIP-seq coverage in the two input categories, where X is determined by the parameter --CovThr.

Output differential loops are categorized in the following five classes:

  1. HD-HD - when both interacting bins in the differential loops belong in the class HD

  2. ND-ND - when both interacting bins in the differential loops belong in the class ND

  3. LD-LD - when both interacting bins in the differential loops belong in the class LD

  4. ND-LD - when one interacting bin in the differential loops belong in the class ND, while the other bin belongs within LD

  5. HD-LDorND - when one interacting bin in the differential loops belong in the class HD, while the other bin belongs in either LD or ND.

The category ND-ND (category 2) corresponds to the differential loops involving non-differential 1D bins. That is, differential loops solely due to the changes in 3D (looping).

Options

Parameters associated with this script is:

--AllLoopList
  • Comma or colon separated list of loops with FitHiChIP significance (FDR) values, for all categories and all replicates.

  • User should provide the file *PREFIX*.interactions_FitHiC.bed (please check the page Details of FitHiChIP outputs). Mandatory parameter.

  • Files can be in gzipped format as well.

--ChrSizeFile
  • File having the chromosome sizes for the reference genome.

  • Mandatory parameter.

--FDRThr
  • FDR significance threshold for FitHiChIP loops. Default = 0.01.

--BackgroundFDRThr
  • FDR threshold for FitHiChIP loops which will be considered for defining the background model. Default is 1, means every FitHiChIP contact in the input samples would be considered as background loops of the edgeR model. A value of 0.01 would indicate that the FitHiChIP loops with q-value < 0.01 in at least one sample would be used as the background. If the input samples have very low contact counts for most of the input samples and loops, user may alter this parameter (like 0.1) to include only the significant (or moderately significant) FitHiChIP loops (in at least one input sample) to define the background model.

--ChIPAlignFileList
  • ** Optional parameter **

  • Comma or colon separated list of ChIP-seq alignment files.

  • Either two files, one for each category, are to be given.

  • Or, provide ChIP seq alignment files one for each sample.

Note

  • ChIP-seq alignment files matching with the given HiChIP data (cell type, histone modifications of interest) can be downloaded from ENCODE (https://www.encodeproject.org/).

  • File can either be in BAM or Bedgraph (4-column) format.

--CovThr
  • Applicable if ChIPAlignFileList parameter is not NULL.

  • Maximum difference of ChIP-seq coverage (in percentage) between two categories to label a non-differential (by EdgeR) 1D bin as ND (i.e. no difference).

  • Default = 25, means non-differential 1D bins with difference of ChIP coverage < 25% are considered as ND.

  • In such a case, non-differential 1D bins with difference of ChIP coverage >= 25% are considered as LD (i.e. low difference).

--OutDir
  • Output directory to contain differential analysis results.

  • Mandatory parameter.

--CategoryList
  • Comma or colon separated list of strings, representing labels of two categories.

  • Default: Category1, Category2.

--ReplicaCount
  • Comma or colon separated list representing the count of samples (replicates) belonging to individual categories.

  • Default: 1,1 (means one replicate per sample exists).

--ReplicaLabels1
  • Comma or colon separated list of the label of replicates for the first category.

  • Default: R1,R2, etc.

--ReplicaLabels2
  • Comma or colon separated list of the label of replicates for the second category.

  • Default: R1,R2, etc.

--FoldChangeThr
  • EdgeR fold change threshold.

  • Default = 2 means log2(2) = 1 is employed as the fold change threshold on the EdgeR output.

--DiffFDRThr
  • FDR threshold for EdgeR.

  • Default is 0.05, means that loops with FDR < 0.05, and fold change >= log2(FoldChangeThr) would be considered as differential.

--bcv
  • If number of replicates is 1 for either of the input categories, this value is the square-root-dispersion.

  • Default and recommended value is 0.4. For details, please see the EdgeR manual.

Example 1

Assume the following input conditions:

  1. Two input categories of HiChIP data, denoted as cat1 and cat2.

  2. Each category has two replicates, denoted as repl1 and repl2.

  3. hg19 as the reference genome

  4. One ChIP-seq alignment file for each category.

Then, the command for differential analysis would be:

Rscript Imp_Scripts/DiffAnalysisHiChIP.r \

—AllLoopList cat1_repl1_file,cat1_repl2_file,cat2_repl1_File,cat2_repl2_file \

--ChrSizeFile TestData/chrom_hg19.sizes \

—ChIPAlignFileList cat1_ChIPAlign.bam,cat2_ChIPAlign.bam \

--FDRThr 0.01 \

—BackgroundFDRThr 0.1 \

--CovThr 25 \

—OutDir /home/user/diffanalysis/outdir \

--CategoryList CellLine1,CellLine2 \

—ReplicaCount 2,2 \

--ReplicaLabels1 R1,R2 \

—ReplicaLabels2 R1,R2 \

--FoldChangeThr 2 \

—DiffFDRThr 0.05 \

--bcv 0.4

  • Here, cat1_repl1_file denotes the loops (containing FitHiChIP significance values) of category 1, replicate 1. Similar notations for the other files.

  • In the parameter --AllLoopList, all the samples of category 1 should be specified first, followed by the samples in category 2

  • User can also specify the files using colons (:) as a separator instead of comma.

  • The file cat1_ChIPAlign.bam represents ChIP-seq alignment file for the category 1. Similar notation for category 2.

Example 2

Assume the following input conditions:

  1. Two input categories denoted as cat1 and cat2.

  2. Each category has two replicates, denoted as repl1 and repl2.

  3. hg19 is the reference genome

  4. One ChIP-seq alignment file for individual samples.

Then, the command for differential analysis would be:

Rscript Imp_Scripts/DiffAnalysisHiChIP.r \

—AllLoopList cat1_repl1_file,cat1_repl2_file,cat2_repl1_File,cat2_repl2_file \

--ChrSizeFile TestData/chrom_hg19.sizes \

—FDRThr 0.01 \

--BackgroundFDRThr 0.1 \

—CovThr 25 \

--ChIPAlignFileList cat1_R1_ChIPAlign.bam,cat1_R2_ChIPAlign.bam,cat2_R1_ChIPAlign.bam,cat2_R2_ChIPAlign.bam \

—OutDir /home/user/diffanalysis/outdir \

--CategoryList CellLine1,CellLine2 \

—ReplicaCount 2,2 --ReplicaLabels1 R1,R2 --ReplicaLabels2 R1,R2 \

--FoldChangeThr 2 --DiffFDRThr 0.05 --bcv 0.4

  • Here, cat1_R1_ChIPAlign.bam denotes the ChIP-seq alignment for replicate 1 of the category 1. Similar notations for the other files.

Details of differential analysis outputs

Within the specified output directory OutDir, following files / folders are generated by the differential analysis:

  1. Input_Parameters*.log : lists the input command line options.

  2. MasterSheet_*Loops.bed :

    Union of all FitHiChIP loops of the given categories and samples, with the respective contact counts and interaction significance values.

  3. EdgeR_Loops_Ov_FitHiChIP_Sig_One_Repl :

    • Folder containing the EdgeR differential loops which are significant (in terms of FitHiChIP FDR threshold) in at least one input sample.

    • The file DiffLoops.bed contains differential loops.

    ** If ChIP-seq alignment files are provided **

    • Underlying folders have name X-Y where X and Y denote the type of 1D bin (LD, ND or HD) according to their ChIP-coverage specific differential analysis.

    • Within each folder, file DiffLoops.bed contains differential loops of the particular category.

    • Files DiffLoops_Excl_*.bed contain differential loops exclusive to one category (i.e. significant in only that category).