




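R is a programming language and development environment widely used in statistics and data science; because it is free, open source, and flexible, it has attracted growing attention. The China-R Conference grew out of discussions about the R language. In 2008, the Capital of Statistics (COS) held the first China-R Conference at Renmin University of China. The conference has grown steadily since then and is now held in multiple cities across China, with cumulative attendance of more than 20,000 people and more than 3,000 participating organizations. Its program covers applications of data science across many sectors, including healthcare, biology, finance, manufacturing, automation, and the internet. The conference has had a far-reaching influence, promoting the adoption and development of R, and of data science as a whole, in China.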















Gene selection with nonlinear instrumental variable regression incorporating network structures



Genetical genomics data provide promising opportunities for integrative analysis of gene expression and genotype data. Lin et al. (2015) recently proposed an instrumental variables (IV) regression framework to select important genes from high-dimensional genetical genomics data. IV regression addresses the endogeneity caused by potential correlation between gene expressions and the error terms, and hence improves the performance of gene selection. Because genes function in networks to fulfill their joint tasks, incorporating network or graph structures in a regression model can further improve gene selection performance. Furthermore, gene expressions can be nonlinearly regulated or modified by environmental variables. In this work, we propose a graph-constrained penalized nonlinear IV regression framework that addresses the endogeneity issue and improves selection performance by taking gene network structures into account. We propose a two-step estimation procedure that adopts a network-constrained regularization method to obtain better variable selection and estimation, and we further establish selection consistency. Simulations and a real data analysis are conducted to show the utility of the method.

This is a joint work with Bin Gao and Xu Liu.
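As a rough illustration of why an instrument helps with endogeneity, the following is a minimal sketch of ordinary linear two-stage least squares with a single instrument. It is not the graph-constrained penalized nonlinear procedure described in the abstract; the function names and the toy data (instrument `z`, unobserved confounder `u`, true effect 3) are assumptions made only for this example.

```python
def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Least-squares slope of y regressed on x (with intercept)."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den


def two_stage_least_squares(z, x, y):
    """Plain linear 2SLS with one instrument z for one regressor x."""
    # Stage 1: project the endogenous regressor x onto the instrument z.
    b1 = slope(z, x)
    a1 = mean(x) - b1 * mean(z)
    x_hat = [a1 + b1 * zi for zi in z]
    # Stage 2: regress the response on the fitted values from stage 1.
    return slope(x_hat, y)


# Toy data (hypothetical): u is an unobserved confounder that enters both
# x and y, so x is correlated with the error term of the y equation.
z = [-1.0, 1.0, -1.0, 1.0]          # instrument, uncorrelated with u
u = [-1.0, -1.0, 1.0, 1.0]          # unobserved confounder
x = [2 * zi + ui for zi, ui in zip(z, u)]
y = [3 * xi + 5 * ui for xi, ui in zip(x, u)]  # true effect of x is 3

ols_estimate = slope(x, y)                      # biased by the confounder
iv_estimate = two_stage_least_squares(z, x, y)  # recovers the true effect
```

On this toy data the naive regression of `y` on `x` absorbs part of the confounder's effect, while the two-stage estimate uses only the variation in `x` that is driven by the instrument.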




A smooth collaborative recommender system



In recent years, there has been a growing demand for efficient recommender systems that track users’ preferences and recommend potential items of interest to users. In this talk, I will present a smooth collaborative recommender system that utilizes dependency information among users and items sharing similar characteristics, under the singular value decomposition framework. The proposed method incorporates the neighborhood structure among user-item pairs by exploiting covariates to improve prediction performance. One key advantage of the proposed method is that it leads to more effective recommendations for “cold-start” users and items, whose preference information is completely missing from the training set. As this type of data involves large-scale customer records, an efficient scheme will be proposed to achieve scalable computing. The advantage is confirmed in a variety of simulated experiments as well as in one large-scale real example on music listening counts. If time permits, the asymptotic properties will also be discussed.




Big Data: It Is Not About the Size of the Data or the Data Themselves; What Matters Is Statistical Thinking








Tweedie-Type Formulae and Regression Calibration



Regression calibration is one of the most commonly used bias-reduction techniques in measurement error modelling. However, Tweedie’s formula, originally discovered for normal measurement errors, has never been used for regression calibration; instead, many approximate algorithms have been developed for the same purpose. In this talk, we shall introduce a set of Tweedie-type formulae not only for multivariate normal measurement error, but also for multivariate Laplace measurement error, a typical example of the ordinary smooth case. Potential applications of these Tweedie-type formulae in parametric/semiparametric regression models and in neural networks with measurement errors will also be discussed.
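For readers unfamiliar with the classical result, here is a minimal sketch of Tweedie’s formula in the univariate normal-error case: if W = X + U with U ~ N(0, σ²), then E[X | W = w] = w + σ² · d/dw log f_W(w), where f_W is the marginal density of W. The check below assumes a normal distribution for X purely so that the answer can be compared with the known conjugate posterior mean; the function names and parameter values are illustrative, and this does not cover the Laplace or multivariate formulae of the talk.

```python
import math


def log_normal_density(w, mean, variance):
    """Log density of N(mean, variance) evaluated at w."""
    return -0.5 * math.log(2 * math.pi * variance) - (w - mean) ** 2 / (2 * variance)


def tweedie_calibration(w, sigma2, log_marginal, step=1e-4):
    """E[X | W = w] = w + sigma2 * d/dw log f_W(w) for normal errors.

    The score of the marginal density is approximated by a central
    finite difference of the supplied log density.
    """
    score = (log_marginal(w + step) - log_marginal(w - step)) / (2 * step)
    return w + sigma2 * score


# Assumed toy setting: X ~ N(mu, tau2), U ~ N(0, sigma2), so W ~ N(mu, tau2 + sigma2).
mu, tau2, sigma2 = 0.0, 1.0, 1.0
w_observed = 2.0

calibrated = tweedie_calibration(
    w_observed,
    sigma2,
    lambda w: log_normal_density(w, mu, tau2 + sigma2),
)

# Known conjugate-normal posterior mean, used only as a reference value.
conjugate = (tau2 * w_observed + sigma2 * mu) / (tau2 + sigma2)
```

The appeal of the formula for regression calibration is that it needs only the marginal density of the observed, error-prone variable, which can in principle be estimated from data, rather than the distribution of the unobserved X.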




A generalized association test based on U statistics



Sequencing-based studies are emerging as a major tool for genetic association studies of complex diseases. These studies pose great challenges to traditional statistical methods because of the high dimensionality of the data and the low frequency of genetic variants. Moreover, there is great interest in biology and epidemiology in identifying genetic risk factors that contribute to multiple disease phenotypes. The multiple phenotypes often follow different distributions, which brings an additional challenge to the current statistical framework. In this talk, I will introduce a generalized similarity U test, referred to as GSU. GSU is a similarity-based test that can handle high-dimensional genotypes and phenotypes. We study the properties of GSU and provide an efficient p-value calculation for the association test. Through simulation, we find that GSU has advantages over existing methods in terms of power and robustness to phenotype distributions.
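To give a feel for what a similarity-based U statistic looks like, the sketch below averages the product of a genotype similarity and a centered phenotype similarity over all ordered pairs of distinct subjects. This is a generic illustration, not the GSU statistic as defined by the authors: the two similarity functions, the single-variant genotype coding, and the toy data are all assumptions chosen for simplicity.

```python
def similarity_u_statistic(genotypes, phenotypes):
    """Generic similarity-weighted U statistic over pairs i != j.

    Genotype similarity: 1 - |g_i - g_j| / 2 for minor-allele counts 0, 1, 2
    (an assumed, IBS-like choice).
    Phenotype similarity: product of mean-centered phenotype values
    (an assumed cross-product choice).
    """
    n = len(genotypes)
    y_bar = sum(phenotypes) / n
    centered = [y - y_bar for y in phenotypes]

    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            genetic_sim = 1.0 - abs(genotypes[i] - genotypes[j]) / 2.0
            phenotype_sim = centered[i] * centered[j]
            total += genetic_sim * phenotype_sim
    return total / (n * (n - 1))


# Toy data (hypothetical): subjects with the same genotype share the same
# phenotype level, so the statistic is positive.
toy_genotypes = [0, 0, 2, 2]
toy_phenotypes = [1.0, 1.0, 3.0, 3.0]
u_value = similarity_u_statistic(toy_genotypes, toy_phenotypes)
```

Under no association, genetically similar pairs are no more alike in phenotype than other pairs, so a statistic of this kind stays near zero; large positive values suggest association. A real test would also need a null distribution to obtain a p-value, which this sketch omits.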







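The Bayesian Confidence Propagation Neural Network (BCPNN) and the Gamma Poisson Shrinker (GPS) are the adverse drug reaction signal detection algorithms adopted by the World Health Organization (WHO) and the U.S. Food and Drug Administration (FDA), respectively. The R package PhViD provides a partial implementation of both algorithms, but it does not report the indicators that adverse drug reaction signal detection staff commonly use, such as IC, EBGM, and EBGM05. After studying the literature on the two algorithms, we used R and Mathematica to implement both algorithms in full, together with all of their indicators. We then computed adverse drug reaction signals for 2004 through 2016 for the Jiangxi Provincial Center for Adverse Drug Reaction Monitoring. This talk will discuss the statistical principles behind the two algorithms and the details of their implementation.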
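As background for the IC indicator, the sketch below computes the raw observed-to-expected log ratio for one drug-event pair from report counts. This is only the unshrunk quantity that the information component is built around: the BCPNN reports a Bayesian posterior estimate of it, and the GPS indicators (EBGM, EBGM05) come from an empirical Bayes model that is not shown here. The function name and the counts are hypothetical.

```python
import math


def raw_information_component(n_drug_event, n_drug, n_event, n_total):
    """log2 of observed over expected report counts for a drug-event pair.

    n_drug_event: reports mentioning both the drug and the event
    n_drug:       reports mentioning the drug
    n_event:      reports mentioning the event
    n_total:      all reports in the database
    The expected count assumes the drug and the event are reported independently.
    """
    expected = n_drug * n_event / n_total
    return math.log2(n_drug_event / expected)


# Hypothetical counts: 20 joint reports where independence would predict 2.
ic_raw = raw_information_component(
    n_drug_event=20, n_drug=100, n_event=200, n_total=10000
)
```

A value above zero means the pair is reported more often than independence would predict. With small counts the raw ratio is very unstable, which is the motivation for the Bayesian shrinkage used in both BCPNN and GPS.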


Time: 06-24 09:15 - 16:40

