Setting up the configuration file
As mentioned in the Quick Start module, FitHiChIP is executed by typing the following command in a bash terminal (assuming the executable is in the current directory):
sh FitHiChIP_HiCPro.sh -C configuration_file_name
This section describes the parameters of the configuration file and their recommended values. Each entry of the configuration file has the following format:
Param=ParamValue
where,
"Param" indicates one parameter (variable)
"ParamValue" is the corresponding value (numeric or string format).
Note
We recommend specifying absolute paths for all files and folders in the configuration file.
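As an illustration of the Param=ParamValue format, the following Python sketch parses configuration text into a dictionary and reports path-like values that are not absolute. It is not part of FitHiChIP: the helper names (parse_config, relative_paths), the example paths, and the choice to skip blank lines and lines starting with # are assumptions made for this example.

```python
import posixpath

# Parameters whose values are file or folder paths (chosen for this sketch).
PATH_PARAMS = ("ValidPairs", "Interval", "Matrix", "Bed", "HIC", "COOL",
               "PeakFile", "OutDir", "ChrSizeFile")


def parse_config(text):
    """Parse 'Param=ParamValue' lines into a dict of strings."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Assumption: blank lines and '#' comment lines carry no parameter.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def relative_paths(params):
    """Return names of path parameters whose non-empty value is not absolute."""
    return [name for name in PATH_PARAMS
            if params.get(name) and not posixpath.isabs(params[name])]


example = """\
# hypothetical entries, for illustration only
PeakFile=/data/sample/peaks.narrowPeak
OutDir=results
BINSIZE=5000
"""

config = parse_config(example)
print(config["BINSIZE"])        # values are kept as strings
print(relative_paths(config))   # parameters that should be made absolute
```

In this toy input, only OutDir is reported, because its value is a relative path.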
Input HiChIP contact matrix file - different formats supported
FitHiChIP supports input HiChIP interaction files via one of four possible options:
** Option 1: HiC-Pro compatible input **
Using the ValidPairs= option, the user can provide the valid pairs file generated by the HiC-Pro pipeline.
The file can be either in simple text format, or can be gzipped.
Note
The valid pairs file generated by the HiC-Pro pipeline is usually named ${HICPRODIR}/hic_results/data/rawdata/rawdata.allValidPairs, where ${HICPRODIR} is the directory containing the HiC-Pro output.
Additionally, the user can specify the following two optional parameters to provide HiC-Pro compatible bin interval and contact matrix files.
Using the Interval= option, the user can provide the file listing the bins / intervals of the interaction matrix.
This file is generated by HiC-Pro, specifically by applying its utility function build_matrix to the valid pairs file.
The size of each interval is determined by the bin size.
Each bin is also assigned a distinct number.
By default, the name of this file ends with the suffix _abs.bed.
Note
Chromosome information needs to be alphanumeric (similar to the valid pairs option).
Using the Matrix= option, the user can provide the file listing the number of interactions (contacts) between the bins listed in the Interval file.
The file has three columns: 1) first interacting bin number, 2) second interacting bin number, 3) contact count between those two bins.
It is generated by applying the HiC-Pro utility function build_matrix to the valid pairs file.
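To make the relationship between the two files concrete, here is a minimal Python sketch that joins Matrix rows to genomic coordinates through the bin numbers in the Interval file. It assumes the Interval file has four whitespace-separated columns (chromosome, start, end, bin number); the function names and the toy data are hypothetical and not taken from FitHiChIP.

```python
def load_bins(interval_lines):
    """Map each bin number to its (chromosome, start, end) interval."""
    bins = {}
    for line in interval_lines:
        # Assumed column order: chromosome, start, end, bin number.
        chrom, start, end, bin_id = line.split()[:4]
        bins[int(bin_id)] = (chrom, int(start), int(end))
    return bins


def annotate_contacts(matrix_lines, bins):
    """Turn 'bin1 bin2 count' rows into coordinate-based rows."""
    rows = []
    for line in matrix_lines:
        first_bin, second_bin, count = line.split()[:3]
        rows.append(bins[int(first_bin)] + bins[int(second_bin)] + (int(count),))
    return rows


# Toy data at 5 Kb bin size (not real HiChIP output).
interval_lines = [
    "chr1\t0\t5000\t1",
    "chr1\t5000\t10000\t2",
    "chr1\t10000\t15000\t3",
    "chr1\t15000\t20000\t4",
    "chr1\t20000\t25000\t5",
]
matrix_lines = ["1\t5\t7", "2\t3\t3"]

contacts = annotate_contacts(matrix_lines, load_bins(interval_lines))
for row in contacts:
    print("\t".join(str(field) for field in row))
```

Each printed row has seven tab-separated fields: the two intervals followed by their contact count.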
** Option 2: Tab delimited contact matrix format **
Using the Bed= option, the user can provide a tab-delimited text file containing the binned intervals and their contact counts.
Format: chr1 start1 end1 chr2 start2 end2 contact
The file can be generated by any Hi-C or HiChIP data processing pipeline, or by custom scripts.
This option is useful when the user has already processed the contact matrix information with another pipeline.
** Option 3: .hic formatted contact file **
Using the HIC= option, the user can provide a .hic file (compatible with Juicebox / Juicer tools) containing the contact matrix information.
Note
The .hic file must contain the target resolution, which is specified in the BINSIZE parameter (described below).
** Option 4: .cool / .mcool formatted contact file **
Using the COOL= option, the user can provide a .cool or .mcool file containing the contact matrix information.
This option supports various HiChIP datasets from the 4DNucleome portal.
Note
As with .hic files, the .cool / .mcool file must contain the target resolution, which is specified in the BINSIZE parameter (described below).
Other files / directories
PeakFile=
Reference ChIP-seq or HiChIP peak file.
Note
Mandatory when the user is analyzing HiChIP data and the option IntType (described below) is 1, 2, 3 or 5.
Users may use pre-computed ChIP-seq peaks from ENCODE (https://www.encodeproject.org/),
or compute HiChIP peaks and use them (described in the page Various utility scripts / functions).
Note
Users may employ MACS2, or the tool hichipper (Lareau et al 2018), to compute peaks from HiChIP data and provide the bed formatted peak file as an input to FitHiChIP.
If MACS2 is employed, the .narrowPeak formatted peak file is to be provided.
OutDir=
Output directory that will contain all the results.
Default: present working directory.
ChrSizeFile=
File containing the chromosome sizes of the reference genome. Mandatory parameter.
For example, the file chrom_hg19.sizes within the folder "TestData" is applicable for the reference genome hg19.
Note
Chromosome size file for the reference genome hg38 can be obtained from the link: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes
Another option is to run the UCSC utility "fetchChromSizes" (https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes) with the name of the reference genome to download the corresponding chromosome size file.
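UCSC chromosome size files contain one line per chromosome with two tab-separated columns: the chromosome name and its length in base pairs. The Python sketch below reads such content and counts how many fixed-size bins each chromosome spans at a given bin size. The function names and the toy chromosome names and lengths are hypothetical; this is an illustration, not FitHiChIP's own code.

```python
def read_chrom_sizes(lines):
    """Parse 'name<TAB>length' lines into a dict of integer lengths."""
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sizes[fields[0]] = int(fields[1])
    return sizes


def count_bins(sizes, binsize):
    """Number of fixed-size bins per chromosome (last bin may be partial)."""
    return {chrom: -(-length // binsize) for chrom, length in sizes.items()}


# Toy content; a real file would be e.g. chrom_hg19.sizes.
toy_lines = ["chrA\t12000", "chrB\t5000", ""]

sizes = read_chrom_sizes(toy_lines)
print(count_bins(sizes, 5000))
```

The expression -(-length // binsize) is integer ceiling division, so a 12000 bp chromosome spans three 5 Kb bins.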
Execution options
CircularGenome=
Boolean variable indicating if the reference genome is circular.
Default: 0.
If 1 (circular genome), the calculation of genomic distance is slightly different.
IntType=
Type of interaction (foreground) reported by FitHiChIP. Options are:
1. peak to peak: contacts between all pairs of peak segments (subject to fixed size binning).
2. peak to non peak: contacts between pairs of segments such that one is a peak and the other is a non-peak segment.
3. peak to all (default): one interacting segment is a peak, while the other can be a peak or a non-peak. Thus, it encapsulates options 1 and 2.
4. all to all: interactions between every possible pair of segments, similar to Hi-C.
5. Everything from 1 to 4, that is, all of the above-mentioned interactions are computed.
Note
Default is 3, i.e. peak-to-all (both peak-to-peak and peak-to-nonpeak interactions are reported by FitHiChIP)
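The interaction types can be summarized with a small Python sketch. The options are numbered 1 to 4 in the order listed above (so 3 is peak to all), and option 5 simply computes all four. The function names, the category labels, and the idea of passing a precomputed "is this segment a peak" flag are assumptions of this example; how FitHiChIP decides that a bin is a peak segment is not covered here.

```python
def pair_category(first_is_peak, second_is_peak):
    """Classify a pair of segments by how many of them are peak segments."""
    if first_is_peak and second_is_peak:
        return "peak-to-peak"
    if first_is_peak or second_is_peak:
        return "peak-to-nonpeak"
    return "nonpeak-to-nonpeak"


# Categories reported as foreground for each IntType value.
# IntType=5 is not listed because it means "compute options 1 to 4".
FOREGROUND = {
    1: {"peak-to-peak"},
    2: {"peak-to-nonpeak"},
    3: {"peak-to-peak", "peak-to-nonpeak"},
    4: {"peak-to-peak", "peak-to-nonpeak", "nonpeak-to-nonpeak"},
}


def is_foreground(int_type, first_is_peak, second_is_peak):
    """True if the pair belongs to the interaction type selected by IntType."""
    return pair_category(first_is_peak, second_is_peak) in FOREGROUND[int_type]


print(is_foreground(3, True, False))   # peak to all keeps peak-to-nonpeak pairs
print(is_foreground(1, True, False))   # peak to peak does not
```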
BINSIZE=
Size of the bins, i.e. the resolution employed. Default: 5000 (5 Kb resolution).
LowDistThr=
Lower distance threshold of interaction between two intervals (CIS).
Default: 20000 (indicates 20 Kb).
Interactions below this distance threshold will not be considered for statistical significance.
UppDistThr=
Upper distance threshold of interaction between two intervals (CIS).
Default: 2000000 (indicates 2 Mb).
Interactions above this distance threshold will not be considered for statistical significance.
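The effect of the two distance thresholds can be sketched in Python as follows. This example assumes that the distance between two intervals is the absolute difference of their start coordinates on a linear (non-circular) genome; the exact distance definition used by FitHiChIP is not stated here, and the function name is hypothetical.

```python
def within_distance_range(start1, start2, low_dist_thr=20000, upp_dist_thr=2000000):
    """True if a CIS pair is neither below LowDistThr nor above UppDistThr."""
    # Assumption: distance = difference between interval start coordinates.
    distance = abs(start2 - start1)
    return low_dist_thr <= distance <= upp_dist_thr


# Toy start coordinates with 5 Kb bins and the default thresholds.
print(within_distance_range(0, 15000))     # below 20 Kb: excluded
print(within_distance_range(0, 20000))     # exactly 20 Kb: kept
print(within_distance_range(0, 2005000))   # above 2 Mb: excluded
```

Both bounds are treated as inclusive, since only interactions below the lower threshold or above the upper threshold are excluded.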
QVALUE=
FDR (q-value) cutoff for detecting significant interactions.
Default: 0.01
UseP2PBackgrnd=
Can be 0 or 1. Applicable only for peak to all interactions (IntType = 3).
1: peak to peak (stringent) background locus pairs are used for contact probability estimation. Corresponds to FitHiChIP(S).
0: peak to all (loose) background locus pairs are used for contact probability estimation. Corresponds to FitHiChIP(L).
Note
We recommend running FitHiChIP with both the loose and the stringent setting.
Specifically, for low to moderate sequencing depth, use 0, whereas for very high sequencing depth, use 1.
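One way to prepare both runs is to derive two parameter sets from a common base, differing only in UseP2PBackgrnd. The Python sketch below does this with plain dictionaries. The function name is hypothetical, and giving each variant its own PREFIX is only a labeling choice made for this example, not a FitHiChIP requirement.

```python
def loose_and_stringent(base):
    """Return two copies of a parameter dict: FitHiChIP(L) and FitHiChIP(S)."""
    variants = {}
    for label, flag in (("L", 0), ("S", 1)):
        params = dict(base)
        params["IntType"] = 3              # UseP2PBackgrnd applies to peak to all only
        params["UseP2PBackgrnd"] = flag    # 0 = loose, 1 = stringent
        # Labeling choice for this sketch: tag the output prefix with L or S.
        params["PREFIX"] = f"{base.get('PREFIX', 'FitHiChIP')}_{label}"
        variants[label] = params
    return variants


base = {"BINSIZE": 5000, "QVALUE": 0.01, "PREFIX": "FitHiChIP"}
variants = loose_and_stringent(base)
for label, params in variants.items():
    print(label, params["UseP2PBackgrnd"], params["PREFIX"])
```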
BiasType=
Can be 1 or 2. Indicates the type of bias correction used.
A value of 1 means that coverage bias regression is used (default).
A value of 2 means that ICE bias regression is used.
Note
Results in the manuscript employ BiasType=1
MergeInt=
Has the value 0 or 1.
If 1, merge filtering (+M) is enabled; if 0, it is disabled.
We recommend setting this value as 1.
PREFIX=
Prefix string used before any output file name. Default: "FitHiChIP"
OverWrite=
A binary variable (1/0).
If 1, overwrites existing FitHiChIP output files.
Otherwise (0) skips re-computing existing outputs.
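Putting the parameters together, the following Python sketch assembles a set of entries from the defaults listed on this page, checks the mandatory ones, and renders them in the Param=ParamValue format. It is a minimal illustration under stated assumptions: the helper names and the placeholder paths are hypothetical, MergeInt=1 is the recommended value rather than a documented default, and the PeakFile check assumes HiChIP data is being analyzed.

```python
# Defaults as documented on this page (MergeInt is the recommended value).
DEFAULTS = {
    "CircularGenome": 0,
    "IntType": 3,
    "BINSIZE": 5000,
    "LowDistThr": 20000,
    "UppDistThr": 2000000,
    "QVALUE": 0.01,
    "BiasType": 1,
    "MergeInt": 1,
    "PREFIX": "FitHiChIP",
}
INPUT_OPTIONS = ("ValidPairs", "Bed", "HIC", "COOL")


def check_config(params):
    """Return a list of problems with the mandatory parameters."""
    problems = []
    if not any(params.get(name) for name in INPUT_OPTIONS):
        problems.append("no input file given (ValidPairs, Bed, HIC or COOL)")
    if not params.get("ChrSizeFile"):
        problems.append("ChrSizeFile is mandatory")
    if int(params.get("IntType", 3)) in (1, 2, 3, 5) and not params.get("PeakFile"):
        problems.append("PeakFile is mandatory for IntType 1, 2, 3 or 5")
    return problems


def render_config(params):
    """Render a dict as 'Param=ParamValue' lines."""
    return "\n".join(f"{name}={value}" for name, value in params.items()) + "\n"


# Placeholder paths, for illustration only.
user_params = {
    "ValidPairs": "/abs/path/to/rawdata.allValidPairs",
    "ChrSizeFile": "/abs/path/to/chrom_hg19.sizes",
    "OutDir": "/abs/path/to/output",
}
params = {**user_params, **DEFAULTS}

problems = check_config(params)
config_text = render_config(params)
print(problems)
print(config_text)
```

With these toy entries the check reports the missing PeakFile, since IntType=3 requires one.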