Sequence-based multiscale Genome 3D Organization Prediction

Orca

Welcome! Orca is a deep learning sequence model framework for multiscale genome structure prediction. Orca can predict genome interactions from the kilobase scale to the whole-chromosome scale using only genomic sequence as input. Orca can predict the genome structural impact of any genomic variant, including very large structural variants, and can be used to design virtual genetic screens that probe the sequence basis of genome 3D organization. You can find the GitHub repo here and the publication here.

Update log

09/18/2023: Added support for complex variants in Seqstr input format. Example input "[hg38]chr9:94904000-110904000 +; chr7:5280600-21280600 -". See input section for more details.

08/11/2022: A UCSC Genome Browser trackHub with the results of a genome-wide virtual screen of 10bp disruptions is now available here. For example, you can use it to find the key CTCF motif behind your genome interaction of interest. You can also directly download the bigWig files for H1-ESC and HFF. Please refer to the publication for more details.

05/10/2022: Updated HFF and H1-ESC models to v0.2 (Nature Genetics 2022 publication version, with minor improvements over the prior bioRxiv version). Job results from before this date are no longer accessible from their original URLs. If you need to retrieve your previous prediction results and still have the job ID, feel free to contact us.

What can I use Orca for?

Predict the genome structural impacts of any genome variant, including large structural variants of almost any size.
Predict the genome 3D structure from any human genome sequence, which means you can introduce multiple variants, haplotypes, an entire assembled genome, or any sequence you wish to perform an in silico Hi-C experiment on.
Analyze sequence dependencies of genome 3D structure by performing virtual genetic screens. Orca sequence models can serve as an “in silico genome observatory” that allows designing and performing virtual genetic screens to probe the sequence basis of genome 3D organization.

What is Orca?

Orca is a deep learning sequence modeling framework for multiscale genome interaction prediction. Orca models are trained on high-resolution micro-C datasets for the H1-ESC and HFF cell lines; an additional model, trained on cohesin-depleted HCT116 Hi-C data, is used for the analysis of sequence dependencies of chromatin compartments. If you have sufficient computational resources, including GPUs, you can also train your own models on Hi-C-type data from any cooler format input by following our examples (see the training section of the code repository).

This webserver provides a user-friendly interface to many of Orca’s prediction capabilities, including predicting the multiscale genome 3D organization effects of structural variants. You can also use Orca with the code provided at our GitHub repository, which offers the full functionality, such as support for more complex variants or arbitrary input sequences. The repository also contains more information and resources about Orca.

Input

On the Orca home page, you can select a prediction mode, provide the corresponding input information, and submit the job to our job queue. An example input is provided as a reference for the input format of each prediction mode. Here we list the required input information for all prediction modes currently supported by the webserver. All coordinates should be in hg38 and 0-based, with an inclusive start coordinate and an exclusive end coordinate.

Genomic Region - Predict multiscale genome interactions centered at the specified genomic region from sequence and compare with experimental observations (micro-C). An example input is chr9:94904000-126904000.
Structural Variant - Deletion - Predict the genome structural effects of the deletion of a genomic interval. The genomic interval must be specified in hg38. You can liftOver the coordinates to hg38 if they are from a different genome assembly. An example input is chr2:220295000-222000000.
Structural Variant - Duplication - Predict the genome structural effects of the duplication of a genomic interval. An example input is chr17:70126859-71579859.
Structural Variant - Inversion - Predict the genome structural effects of the inversion of a genomic interval. An example input is chr2:218875000-220155000.
Structural Variant - Translocation (single junction) - Predict the genome structural effects of a simple translocation event that connects two chromosomal breakpoints. An example input is chr1:85691449 chr5:89533745 +/+. Two breakpoint positions and their two orientations are needed. The orientations decide how the breakpoints are connected: a ‘+’ sign indicates that the left side of the breakpoint is used, and a ‘-’ sign indicates that the right side is used. For example, ‘+/+’ connects chr1:0-85691449 with chr5:0-89533745, while ‘+/-’ connects chr1:0-85691449 with chr5:89533745-chromosome end.
Complex variant - Seqstr format - Construct your customized sequences by following the Seqstr format. Genome assembly is disabled in this mode, as it should be specified within the Seqstr input. An example is "[hg38]chr9:94904000-110904000 +; chr7:5280600-21280600 -", which uses the hg38 assembly, and creates a customized sequence by concatenating two regions, chromosome 9:94904000-110904000 on the plus strand, and chromosome 7:5280600-21280600 on the minus strand. Your customized sequence should be 32Mb long. Shorter sequences will result in a failed job, and longer sequences will be truncated from both ends to 32Mb.
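
The coordinate conventions above can be checked with a few lines of Python before submitting a job. The following is a minimal sketch and not part of Orca or Seqstr: the helper names (parse_region, region_length, simple_seqstr_length, translocation_sides) are hypothetical, and the Seqstr handling assumes only the simple "chr:start-end strand" segments used in the example, not the full Seqstr syntax.

```python
import re

TARGET_LENGTH = 32_000_000  # 32Mb sequence length stated for the Seqstr mode

REGION_RE = re.compile(r"^(chr\w+):(\d+)-(\d+)$")


def parse_region(text):
    """Parse 'chr:start-end' into (chrom, start, end); 0-based, end exclusive."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chr:start-end region: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if end <= start:
        raise ValueError(f"end must be greater than start: {text!r}")
    return chrom, start, end


def region_length(text):
    """Length in bp; with an exclusive end this is simply end - start."""
    _, start, end = parse_region(text)
    return end - start


def simple_seqstr_length(seqstr):
    """Total length of ';'-separated 'chr:start-end strand' segments.

    Hypothetical helper: covers only the simple form shown in the example.
    """
    total = 0
    for segment in seqstr.split(";"):
        # Drop an optional leading assembly tag such as "[hg38]".
        segment = re.sub(r"^\s*\[[^\]]*\]\s*", "", segment)
        if not segment.strip():
            continue
        region = segment.split()[0]  # ignore the trailing strand sign
        total += region_length(region)
    return total


def translocation_sides(spec):
    """Map 'chrA:pos chrB:pos s1/s2' to [(chrom, pos, side), ...].

    Follows the rule described above: '+' uses the left side of the
    breakpoint (0 to pos), '-' uses the right side (pos to chromosome end).
    """
    first, second, orientations = spec.split()
    signs = orientations.split("/")
    if len(signs) != 2 or any(sign not in "+-" or not sign for sign in signs):
        raise ValueError(f"orientation must look like '+/-': {orientations!r}")
    sides = []
    for bp, sign in zip((first, second), signs):
        chrom, pos = bp.split(":")
        sides.append((chrom, int(pos), "left" if sign == "+" else "right"))
    return sides
```

With the example inputs from this page, region_length("chr9:94904000-126904000") and simple_seqstr_length("[hg38]chr9:94904000-110904000 +; chr7:5280600-21280600 -") both evaluate to 32,000,000 bp.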

Output

As an example output, here we show the visualizations generated for the prediction of a duplication variant. Orca generates one file (genomic region prediction) or multiple files (structural variant prediction), each containing a series of multi-level predictions that zoom into a breakpoint of the variant or into the corresponding position(s) of the breakpoint in the reference sequence.

Example reference sequence predictions for duplication variant (breakpoint):

Example alternative sequence predictions for duplication variant (right boundary):

For all prediction modes, predicted interaction matrices at multiple scales (1Mb, 2Mb, 4Mb, … ) are visualized with heatmaps, where each pixel represents the interaction between a pair of genomic positions. The interaction scores are represented by log fold over the distance-based background scores (log being natural logarithm). The distance-based background is the expected contact score based on the genomic distance (available from our code repository). We also visualize the observed micro-C data side-by-side for comparison whenever appropriate.
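
Because the scores are natural-log fold changes over the distance-based background, they can be converted back to plain fold changes with an exponential. The snippet below is a small illustrative sketch; the function names are hypothetical and are not part of the Orca code.

```python
import math


def log_fold(contact, background):
    """Natural-log fold of a contact score over its distance-based background."""
    return math.log(contact / background)


def log_fold_to_fold(score):
    """Convert a natural-log fold score back to a plain fold change."""
    return math.exp(score)
```

Under this convention, a score of 0 corresponds to an interaction equal to the background expectation, positive scores indicate enrichment over the background, and negative scores indicate depletion.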

In addition to the visualizations in PDF format, the results page also allows downloading the numerical predictions in PyTorch serialization format with the extension '.pth'. The .pth file can be loaded with torch.load. Each file contains a Python dictionary. If the prediction mode is one of the structural variant prediction modes, the dictionary stores multiple dictionaries, each corresponding to an output file as described above. Each dictionary includes:

predictions - Multi-level predictions for H1-ESC and HFF cell types.
experiments - Observations for H1-ESC and HFF cell types that match the predictions (only available for the reference allele).
chr - The chromosome name.
start_coords - Start coordinates for the prediction at each level.
end_coords - End coordinates for the prediction at each level.
annos - Annotation information. A list indicating the relative variant positions for each interaction matrix, saved for plotting purpose.
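
As a rough sketch of how a downloaded file might be inspected, the helpers below load a .pth file with torch.load and summarize the keys listed above. The function names and the file name in the usage note are hypothetical, and the exact types and nesting of the stored values are assumptions; the code only relies on the keys named on this page.

```python
EXPECTED_KEYS = (
    "predictions", "experiments", "chr",
    "start_coords", "end_coords", "annos",
)


def load_orca_result(path):
    """Load a downloaded .pth result onto the CPU."""
    import torch  # imported lazily so the helpers below work without PyTorch

    # Recent PyTorch releases restrict unpickling by default; if loading
    # fails, passing weights_only=False may be needed. Only do that for
    # files you trust.
    return torch.load(path, map_location="cpu")


def iter_results(obj):
    """Yield (name, result_dict) pairs.

    Assumption: a genomic region result is a single dictionary with the
    keys above, while a structural variant result nests several such
    dictionaries under its top-level keys.
    """
    if "predictions" in obj:
        yield "", obj
        return
    for name, sub in obj.items():
        if isinstance(sub, dict):
            yield name, sub


def summarize_result(result):
    """Describe each documented key that is present in one result dictionary."""
    summary = {}
    for key in EXPECTED_KEYS:
        if key not in result:
            continue  # e.g. 'experiments' is absent for alternative alleles
        value = result[key]
        if hasattr(value, "__len__") and not isinstance(value, str):
            summary[key] = f"{type(value).__name__} of length {len(value)}"
        else:
            summary[key] = repr(value)
    return summary
```

A typical session would call load_orca_result("my_job.pth") (a placeholder file name), loop over iter_results on the returned object, and print summarize_result for each entry to see how many levels are stored in start_coords and end_coords.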

Questions and feedback?

Thank you for using Orca. If you have any questions or feedback, you can let us know at our user email group [email protected].