Various utility scripts / functions
1. generating contact matrices of different resolution (bin size)
The script ContactMatDiffRes.sh within the folder Imp_Scripts generates contact matrix of desired bin size given a valid pairs file generated from the HiC-pro pipeline. Parameters associated with this script is:
- -V ValidPairsFile
Name of the valid pairs file generated from the HiC-pro pipeline
- -B BinSize
Size of the bin (in bytes: target resolution): default 5000 (5 Kb)
- -C ChrSizeFile
File containing the size of the chromosomes corresponding to the reference genome (such as hg19.chrom.sizes)
- -D OutDir
Output directory which will contain the contact matrix of the target resolution
- -H HiCProDir
Directory containing the HiC-pro installed package.
A folder Matrix_BinSize${BinSize} will be created under the specified output directory. The bin specific interval file name is Matrix_abs.bed and the corresponding contact matrix file name is Matrix.matrix
2. inferring peaks from HiChIP data (for use in the HiChIP pipeline)
The script PeakInferHiChIP.sh within the folder Imp_Scripts is used to infer peaks from HiChIP data. The script can be used if HiC-pro pipeline is already executed on a given pair of reads (such as .fastq.gz read pairs). The script uses macs2 package for inferring the peaks. Parameters associated with this script are as follows:
- -H HiCProDir
Directory containing the reads generated by HiC-pro pipeline. Within this directory, files of the formats .ValidPairs, .DEPairs, .REPairs, and .SCPairs are present, which corresponds to different categories of reads generated by the HiC-pro pipeline.
- -D OutDir
Directory to contain the output set of peaks. Default: current directory
- -R refGenomeStr
Reference genome specific string used for MACS2. Default is 'hs' for human chromosome. For mouse, specify 'mm'.
- -M MACS2ParamStr
String depicting the parameters for MACS2. Default: "--nomodel --extsize 147 -q 0.01"
- -L ReadLength
Length of reads for the HiC-pro generated reads. Default 75
The script uses all of the DE, SC, RE and validpairs reads generated from the HiC-pro pipeline to infer peaks. The folder MACS2_ExtSize within the specified output directory contains the MACS2 generated peaks.
3. merging a set of ChIP-seq alignment files
The script MergeBAMInferPeak.sh within the folder Imp_Scripts processes a set of ChIP-seq alignment files, to generate a merged alignment, infer peak (using MACS2) from the generated alignment, and also determines the coverage of individual bins (according to a specified fixed bin size) with respect to individual input ChIP-seq alignment files. Output of this script is used as an input to the differential analysis module (mentioned next). Parameters associated with this script are as follows:
- -I InpFile
A list of ChIP-seq BAM files. Multiple BAM files are to be mentioned in the format: -I bamfile1.bam -I bamfile2.bam -I bamfile3.bam and so on
- -D OutBaseDir
Directory containing the output set of peaks. Default: current directory
- -R refGenome
Reference genome string used for MACS2. Default is 'hs' for human chromosome. For mouse, specify 'mm'
- -M MACS2ParamStr
String depicting the parameters for MACS2. Default: "--nomodel --extsize 147 -q 0.01"
- -C ChrSizeFile
Filename containing the chromosome size information for the reference genome
- -b BinSize
BinSize in base pair. Used to compute the ChIP-seq coverage for individual input BAM files. Default = 5000 (means 5 Kb)
Output files generated from this script:
${OutBaseDir}/merged_input.bam: merged ChIP-seq alignment file
${OutBaseDir}/MACS2_Out/out.macs2_peaks.narrowPeak: Peak (derived by MACS2) file generated by merged ChIP-seq alignment.
${OutBaseDir}/ChIPCoverage1.bed, ${OutBaseDir}/ChIPCoverage2.bed, ..., ${OutBaseDir}/ChIPCoverageN.bed,
where N is the number of input ChIP-seq alignment files. Each of these output files contain the coverage of corresponding input alignment file (with respect to the specified bin size).
4. Producing FitHiChIP loops with different FDR thresholds
It is useful to obtain significant loops on different FDR threholds, without running the complete FitHiChIP module with different FDR threshold parameters.
For a custom FDR threshold of 0.05, first user should apply the following awk script on the file PREFIX.interactions_FitHiC.bed (mentioned in section 4.1.4, file no C):
awk '((NR==1) || ($NF<0.05))' ${PREFIX}.interactions_FitHiC.bed > ${PREFIX}.interactions_FitHiC_Q0.05.bed
5. Applying merge filtering on any set of significant loops (generated by FitHiChIP or any other loop calling method)
It is useful to apply the merge filtering on any set of significant loops, not just of FitHiChIP.
Assuming the source code of FitHiChIP is placed within the folder ${CodeDir}, user can merge the adjacent loops using the following command:
python ${CodeDir}/src/CombineNearbyInteraction.py \
—InpFile ${PREFIX}.interactions_FitHiC_Q0.05.bed \
--OutFile ${PREFIX}.interactions_FitHiC_Q0.05_MergeNearContacts.bed \
—headerInp 1 --binsize 5000 --percent 100 --Neigh 2
where, the options are:
- --InpFile InpFile
Input file containing FitHiChIP significant interactions.
- --OutFile OutFile
Output file to contain the significant and merge filtered loops.
- --headerInp Header
Boolean value (0/1): 1 means InpFile has header line. 0 means no header line is in the input loop file.
- --binsize BinSize
Bin size (in bp). For example, 5000 means 5 Kb bin size.
- --percent PctVal
% of significant loops considered for merge filtering (default PctVal = 100). Users are advised to keep this value.
- --Neigh Val
Window size (number of bins) for merge filtering. Default is Val = 2, means 2X2 bins are employed as window. Users are advised to keep this value.
The file ${PREFIX}.interactions_FitHiC_Q0.05_MergeNearContacts.bed is the set of loops generated after merging adjacent ones.
6. Visualization of the significant loops in WashU epigenome browser
Check the page Details of FitHiChIP outputs for a detailed description of FitHiChIP output signficant loops and their visualization in WashU epigenome browser.
7. Simulating HiChIP contact matrix from HiC and ChIP-seq data
In FitHiChIP manuscript, we have checked its robustness by simulating HiChIP data from reference HiC and ChIP-seq data. Here we provide a brief description of such simulation procedure.
The file Imp_Scripts/HiC_Simulate_by_ChIP_Coverage.r contains the simulation procedure.
Note
The simulation procedure and corresponding implementation is still very basic.
The generated HiChIP contact map is shown to exhibit high bias for low range contacts. So, user may use this implementation as a basic template, and may incorporate their own modifications.
The script is to be executed as
Rscript Imp_Scripts/HiC_Simulate_by_ChIP_Coverage.r [OPTIONS]
Where, [OPTIONS] are the following:
- --chr
Comma or colon separated list of chromosome names for which the simulation would be carried out. Its value can be a single chromosome like chr1, or a list of chromosomes like chr1,chr2
If nothing is provided, all chromosomes within the specified chromosome size file of the reference genome would be analyzed.
- --OutDir
Output directory to contain the simulated results. Mandatory parameter.
- --ChrSizeFile
File having the chromosome sizes for the reference genome.
Mandatory parameter.
- --ChIPAlignFile
ChIP-seq alignment file. Mandatory parameter.
Can either be in BAM or Bedgraph (4-column) format**.
- --HiCMapFile
Reference HiC contact map file, having the following fields in tab delimited format: chr1, start1, end1, chr2, start2, end2, cc.
Here the triplet (chr,start,end) corresponds to one interacting interval. The field cc denotes the contact count between corresponding interacting regions. It can be either raw contact count, or normalized (by ICE).
Note
Output HiChIP would have the same resolution (bin size) as the input HiC map.
- --TotalRead
Total coverage of the output simulated HiChIP contact matrix. That is, total contact count in the simulated HiChIP contact map would be equal to this integer value.
Default is 1000000, means the output simulated map would have 1M reads.
Within the specified output folder OutDir, folders of the following name are created:
Simulate_HiChIP_chr_*chrName*_B_*BinSize*
where, chrName corresponds to one chromosome within the list of chromosomes specified, and BinSize corresponds to the bin size (resolution) of the reference HiC loops.
Within this folder, the file Dumped_Balanced_CC_RegrFactor.bed contains the simulated HiChIP contact map for the corresponding chromosome.
Its first 6 fields are the interacting regions
7th field is the raw contact count obtained by the ChIP-seq coverage values
8th (last) field is the scaled contact count according to the parameter --TotalRead. This contact count is to be used for application, or for visualization.
8. Applying FitHiChIP on capture HiC data
FitHiChIP can be applied on the Promoter Capture HiC assay as well.
1) First, user needs to run HiC-pro using the command: ./HiC-Pro -i data -o output -c config -s mapping -s proc_hic -s quality_checks -s merge_persample
(as suggested in https://github.com/nservant/HiC-Pro/issues/102)
This will generate allValidPairs file with the list of valid interactions.
Then, use the design bait file of the promoter capture HiC data, and extract the corresponding intervals (chromosome, start, end) in a separate file X. This file X will be used as the peak file in FitHiChIP execution (via the configuration parameter "PeakFile=")
We have used peak to all interactions (IntType=3) using coverage bias regression when executing FitHiChIP on promoter capture HiC data.