Analysis of molecular phenotypes, such as gene expression, transcription factor binding, chromatin accessibility, and translation, is an important part of understanding the molecular basis of gene regulation and, eventually, organismal-level phenotypes such as human disease susceptibility. The development of cheap high-throughput sequencing (HTS) technologies, together with experimental protocols, has increased the use of HTS data as measurements of molecular phenotypes (e.g., RNA-seq, ChIP-seq, and ATAC-seq). HTS data provide high-resolution measurements across the whole genome that represent how molecular phenotypes vary along the genome. We develop statistical methods that better exploit the high-resolution information in these data and apply them to different biological questions in genomics. In this talk, I will briefly introduce two projects: 1) wavelet-based methods for identifying genetic variants associated with chromatin accessibility, and 2) a mixture of hidden Markov models for inferring translated coding sequences.
Identifying differences between multiple groups in molecular and cellular phenotypes measured by high-throughput sequencing assays is a frequent task in genomics applications. Common problems include identifying genetic variants associated with gene expression using RNA-seq data and detecting differences in chromatin accessibility across tissues or conditions using DNase-seq or ATAC-seq data. These high-throughput sequencing data provide high-resolution measurements of how traits vary along the whole genome in each sample. However, typical analyses fail to exploit the full potential of these high-resolution measurements, instead aggregating the data at coarser resolutions, such as genes or windows of fixed length. In this talk, I will present two multi-scale methods that more fully exploit the high-resolution data. In the first part, I will introduce a wavelet-based approach and demonstrate that it has more power than simpler window-based approaches in identifying genetic variants associated with chromatin accessibility. I will also illustrate how the estimated shape of the genotype effect can help in understanding the potential mechanisms underlying the identified associations. In the second part, I will discuss potential limitations of the wavelet-based approach in analyses of data sets with small sample sizes or low sequencing depths. To address these issues, I will present another approach that models the count nature of the sequencing data directly using multi-scale models for inhomogeneous Poisson processes, and demonstrate that the proposed models have substantially more power than the wavelet-based approach in such data sets. While we developed these methods with specific applications to sequencing data in mind, they apply naturally to the analysis of many functional phenotypes.
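To make the contrast between window-based and multi-scale summaries concrete, here is a minimal sketch. It is not the speaker's method: it uses a plain orthonormal Haar transform, and the function name `haar_decompose` and the two read-count profiles are made up for illustration. The two profiles have the same total count over the window, so a window-level sum cannot tell them apart, while the wavelet detail coefficients can.

```python
import numpy as np

def haar_decompose(signal):
    """Orthonormal Haar transform of a signal whose length is a power of two.

    Returns the single scaling coefficient and a list of detail-coefficient
    arrays ordered from the coarsest scale to the finest.
    (Illustrative helper, not taken from the talk.)
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    while x.size > 1:
        even, odd = x[0::2], x[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # local differences at this scale
        x = (even + odd) / np.sqrt(2.0)              # local averages passed to the next scale
    return x[0], details[::-1]

# Hypothetical read counts at 8 adjacent positions for two groups.
group_a = [0, 0, 8, 8, 0, 0, 0, 0]
group_b = [0, 0, 0, 0, 8, 8, 0, 0]

scale_a, details_a = haar_decompose(group_a)
scale_b, details_b = haar_decompose(group_b)

print("window totals:", sum(group_a), sum(group_b))
print("coarsest detail coefficients:", details_a[0], details_b[0])
```

In this toy case the window totals and the scaling coefficients agree, but the coarsest detail coefficients have opposite signs, reflecting that the signal sits in the left half of the window for one group and the right half for the other.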
Using elliptic regularity results, we construct, for every starting point, weak solutions to SDEs in R^d with Sobolev diffusion and locally integrable drift coefficient up to their explosion times. Subsequently, we develop non-explosion criteria which allow for linear growth, singularities of the drift coefficient inside an arbitrarily large compact set, and an interplay between the drift and the diffusion coefficient. Moreover, we show strict irreducibility of the solution, which by construction is a strong Markov process with continuous sample paths on the one-point compactification of R^d. Joint work with Haesung Lee.
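The 8th CMC Noon Math Walk
Date: Friday, December 1, 12:00 - 13:15
Venue: Room 3435, Natural Science Building (E6-1), KAIST
Speaker: Prof. Jae Kwang Kim (KAIST)
Title: Statistics in the Era of Big Data
Abstract: This talk gave an overview of the statistical issues that arise when big data are used for social science research in the era of big data, and of the points that require care when trying to resolve them. In particular, it covered statistical checks for selection bias and information bias, which arise easily in big data, and how these biases can be addressed.
Participation: pre-register via https://goo.gl/forms/lJdtJG2HGToWdYMO2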