scifate_glmnet(adata, gene_filter_rate=0.1, cell_filter_UMI=10000, core_n_lasso=1, core_n_filtering=1, motif_ref='https://www.dropbox.com/s/s8em539ojl55kgf/motifAnnotations_hgnc.csv?dl=1', TF_link_ENCODE_ref='https://www.dropbox.com/s/bjuope41pte7mf4/df_gene_TF_link_ENCODE.csv?dl=1', nt_layers=['X_new', 'X_total'])
Reconstruction of a regulatory network (Cao et al., Nature Biotechnology, 2020) from TFs to other target genes via LASSO regression between the total expression of known transcription factors and the newly synthesized RNA of potential targets. The inferred regulatory relationships between TFs and targets are further filtered based on evidence of promoter motifs (not implemented currently) and the ENCODE ChIP-seq peaks. The Python wrapper for the glmnet Fortran code, glmnet_python (https://github.com/bbalasub1/glmnet_python), is used. More details on lasso regression with glmnet_python can be found here (https://github.com/bbalasub1/glmnet_python/blob/master/test/glmnet_examples.ipynb).
Note that this function can be applied both to metabolic labeling single-cell assays with newly synthesized and total RNA and to regular single-cell assays with unspliced and spliced transcripts. Furthermore, you can replace either the new or the unspliced RNA with dynamo-estimated cell-wise velocity, transcription, splicing and degradation rates for each gene (and, similarly, replace the expression values of transcription factors with those of RNA-binding, ribosomal, epigenetic or epitranscriptomic factors, etc.) to infer the total regulatory effects, as well as the transcriptional, splicing and post-transcriptional regulation, of different factors.
In addition, this approach will be fully integrated with Scribe (Qiu et al., 2020), which employs restricted directed information to determine causality by estimating the strength of information transferred from a potential regulator to its downstream target. In contrast to lasso regression, Scribe can learn both linear and non-linear causality in deterministic and stochastic systems. It also incorporates rigorous procedures to alleviate sampling bias and builds upon novel estimators and regularization techniques to facilitate inference of large-scale causal networks.
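The regression idea described above can be illustrated with a small, self-contained sketch. It is not the glmnet_python code path used by this function: it swaps in a plain cyclic coordinate-descent LASSO written in NumPy and runs it on simulated, centered data. The helper names (`soft_threshold`, `lasso_cd`), the penalty value and the simulated matrices are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    Illustrative stand-in for glmnet; assumes X and y are already centered.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-column mean square
    resid = y - X @ beta                   # current residual
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # correlation of column j with the partial residual (excluding j)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Simulated data (hypothetical): 200 cells, 5 TFs, one target gene.
rng = np.random.default_rng(0)
n_cells, n_tfs = 200, 5
tf_total = rng.normal(size=(n_cells, n_tfs))            # "total" TF expression
target_new = (2.0 * tf_total[:, 0] - 1.5 * tf_total[:, 2]
              + 0.1 * rng.normal(size=n_cells))          # "new" RNA of the target

X = tf_total - tf_total.mean(axis=0)
y = target_new - target_new.mean()

beta = lasso_cd(X, y, lam=0.2)
linked_tfs = np.flatnonzero(beta)   # indices of TFs with a non-zero coefficient
print(beta, linked_tfs)
```

In this toy setting only the two TFs that actually drive the target keep non-zero coefficients; the L1 penalty sets the remaining coefficients to exactly zero, which is what makes the fitted coefficients usable as candidate TF-target links.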
adata (AnnData) – AnnData object that includes both newly synthesized and total gene expression of cells. Alternatively, the object should include both unspliced and spliced gene expression of cells.
gene_filter_rate (float (default: 0.1)) – minimum fraction of cells in which a gene must be expressed for the gene to be kept during gene filtering.
cell_filter_UMI (int (default: 10000)) – minimum number of UMIs for cell filtering.
core_n_lasso (int (default: 1)) – number of cores for lasso regression in linkage analysis. By default it is 1 and parallelization is turned off. Parallel computing can significantly speed up the computation, especially for datasets that involve many cells or genes. For datasets with few cells or genes, however, it can reduce speed because of the additional overhead. User discretion is advised.
core_n_filtering (int (default: 1)) – number of cores for filtering TF-gene links. Not used currently.
motif_ref (str (default: ‘https://www.dropbox.com/s/s8em539ojl55kgf/motifAnnotations_hgnc.csv?dl=1’)) – The path to the TF binding motif data as described above. It provides the list of TF gene names and is used to process the adata object to generate the TF expression and target new expression matrices for the glmnet-based TF-target synthesis rate linkage analysis. It is currently not used for motif-based filtering. By default it is a Dropbox link to the data that we host. Other motif references can be downloaded from RcisTarget: https://resources.aertslab.org/cistarget/. The human motif matrix can be downloaded from June’s shared folder: https://shendure-web.gs.washington.edu/content/members/cao1025/public/nobackup/sci_fate/data/hg19-tss-centered-10kb-7species.mc9nr.feather
TF_link_ENCODE_ref (str (default: ‘https://www.dropbox.com/s/bjuope41pte7mf4/df_gene_TF_link_ENCODE.csv?dl=1’)) – The path to the TF ChIP-seq data. By default it is a Dropbox link to the data that we host. Other data can be downloaded from: https://amp.pharm.mssm.edu/Harmonizome/dataset/ENCODE+Transcription+Factor+Targets.
nt_layers (list([str, str]) (default: [‘X_new’, ‘X_total’])) – The layers that will be used for the network inference. Note that the layers can be changed flexibly. See the description of this function above.
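As a rough illustration of what the gene_filter_rate and cell_filter_UMI thresholds mean, the sketch below applies both to a tiny dense count matrix with NumPy. It is not the function's internal code: the helper name `filter_cells_and_genes`, the order of the two steps (cells first, then genes) and the inclusive `>=` comparisons are assumptions, and the toy thresholds are scaled down to fit the toy data.

```python
import numpy as np

def filter_cells_and_genes(counts, gene_filter_rate=0.1, cell_filter_UMI=10000):
    """Return boolean masks (keep_cells, keep_genes) for a cells x genes matrix.

    Hypothetical helper: cells are kept when their total UMI count reaches
    cell_filter_UMI; genes are kept when the fraction of retained cells that
    express them reaches gene_filter_rate.
    """
    counts = np.asarray(counts)
    umi_per_cell = counts.sum(axis=1)
    keep_cells = umi_per_cell >= cell_filter_UMI
    # fraction of retained cells with at least one count for each gene
    frac_expressing = (counts[keep_cells] > 0).mean(axis=0)
    keep_genes = frac_expressing >= gene_filter_rate
    return keep_cells, keep_genes

# Toy matrix: 4 cells x 3 genes (values are made up).
toy_counts = np.array([
    [8, 0, 5],
    [0, 0, 3],
    [6, 1, 7],
    [9, 0, 4],
])

keep_cells, keep_genes = filter_cells_and_genes(
    toy_counts, gene_filter_rate=0.5, cell_filter_UMI=10
)
filtered = toy_counts[keep_cells][:, keep_genes]
print(keep_cells, keep_genes, filtered.shape)
```

Here the second cell is dropped for having too few UMIs, and the second gene is dropped because only one of the three remaining cells expresses it.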
Note that if your internet connection is slow, we recommend downloading the motif_ref and TF_link_ENCODE_ref files and supplying those two arguments with the local paths where the downloaded datasets are saved.
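One way to follow this recommendation is to cache the two reference files locally and pass the resulting paths. The sketch below uses only the Python standard library; the helper name `local_copy` and the cache directory name are hypothetical, and the final call is left commented out because it needs network access, an existing `adata` object and a dynamo installation (how `scifate_glmnet` is imported depends on your setup).

```python
import os
import urllib.request
from urllib.parse import urlparse

MOTIF_URL = "https://www.dropbox.com/s/s8em539ojl55kgf/motifAnnotations_hgnc.csv?dl=1"
ENCODE_URL = "https://www.dropbox.com/s/bjuope41pte7mf4/df_gene_TF_link_ENCODE.csv?dl=1"

def local_copy(url, cache_dir):
    """Return a local path for `url`, downloading the file only if it is missing.

    Hypothetical helper: the file name is taken from the URL path, so the
    query string (e.g. '?dl=1') is ignored.
    """
    os.makedirs(cache_dir, exist_ok=True)
    filename = os.path.basename(urlparse(url).path)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Example usage (commented out: requires network access and dynamo):
# motif_path = local_copy(MOTIF_URL, "scifate_refs")
# encode_path = local_copy(ENCODE_URL, "scifate_refs")
# adata = scifate_glmnet(adata, motif_ref=motif_path, TF_link_ENCODE_ref=encode_path)
```

Because the download is skipped when the file already exists, repeated runs reuse the cached copies instead of fetching them again.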
An updated adata object with a new key scifate in the .uns attribute, which stores the raw lasso regression results
and the filtered results after applying Fisher's exact test to the ChIP-seq peaks.
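To give a sense of what a Fisher exact test on ChIP-seq evidence measures, the sketch below computes a one-sided enrichment p-value for the overlap between the targets a TF receives from lasso regression and the targets supported by ChIP-seq peaks. It is a generic illustration in pure Python, not the filtering code of this function: the helper name `fisher_enrichment_p`, the choice of background gene universe and the toy gene sets are assumptions.

```python
from math import comb

def fisher_enrichment_p(predicted, chip_targets, background):
    """One-sided Fisher exact p-value for over-representation.

    Probability of observing at least the actual overlap between `predicted`
    and `chip_targets` when drawing len(predicted) genes at random from
    `background` (hypergeometric upper tail).
    """
    background = set(background)
    predicted = set(predicted) & background
    chip_targets = set(chip_targets) & background
    N = len(background)              # genes in the universe
    K = len(chip_targets)            # genes with ChIP-seq support
    n = len(predicted)               # genes linked to the TF by lasso
    k = len(predicted & chip_targets)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)

# Toy example (made-up gene names): 20 background genes, 5 ChIP-seq targets.
background = [f"g{i}" for i in range(20)]
chip_targets = ["g0", "g1", "g2", "g3", "g4"]

enriched = fisher_enrichment_p(["g0", "g1", "g2", "g3", "g10"], chip_targets, background)
not_enriched = fisher_enrichment_p(["g10", "g11", "g12", "g13", "g14"], chip_targets, background)
print(enriched, not_enriched)
```

A small p-value, as for the first set, indicates that the lasso-derived targets overlap the ChIP-seq targets more than expected by chance, which is the kind of evidence used to retain a TF-target link.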