SemiParSampleSel(SemiParSampleSel)
SemiParSampleSel() belongs to R package: SemiParSampleSel
Semiparametric Sample Selection Modelling with Continuous Response
----------Description----------
SemiParSampleSel can be used to fit continuous response sample selection models where the linear predictors are flexibly specified using parametric and regression spline components. During the model fitting process, the possible presence of correlated error equations is accounted for. Regression spline bases are extracted from the package mgcv. Multi-dimensional smooths are available via the use of penalized thin plate regression splines (isotropic). The current implementation does not support scale invariant tensor product smooths.
----------Usage----------
SemiParSampleSel(formula.eq1, formula.eq2, data=list(),
                 iterlimSP=50, pr.tol=1e-6,
                 gamma=1, aut.sp=TRUE, fp=FALSE, start.v=NULL,
                 rinit=1, rmax=100, fterm=sqrt(.Machine$double.eps),
                 mterm=sqrt(.Machine$double.eps),
                 control=list(maxit=50,tol=1e-6,step.half=25,
                              rank.tol=sqrt(.Machine$double.eps)))
----------Arguments----------
Argument: formula.eq1
A GAM formula for equation 1. s terms are used to specify smooth functions of predictors. SemiParSampleSel supports the use of shrinkage smoothers for variable selection purposes. See the examples below and the documentation of mgcv for further details on GAM formula specifications. Note that this formula MUST refer to the selection equation.
Argument: formula.eq2
A GAM formula for equation 2 (the outcome equation).
Argument: data
An optional data frame, list or environment containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which SemiParSampleSel is called.
Argument: iterlimSP
A positive integer specifying the maximum number of loops to be performed before the smoothing parameter estimation step is terminated.
Argument: pr.tol
Tolerance to use in judging convergence of the algorithm when automatic smoothing parameter selection is used.
Argument: gamma
An inflation factor for the model degrees of freedom in the UBRE score. Smoother models can be obtained by setting this parameter to a value greater than 1. Typically gamma=1.4 achieves this.
Argument: aut.sp
If TRUE, then automatic multiple smoothing parameter selection is carried out. If FALSE, then smoothing parameters are set to the values obtained from the univariate fits.
Argument: fp
If TRUE, then a fully parametric model with regression splines is fitted. This only makes sense if used jointly with aut.sp=FALSE. See the example below.
Argument: start.v
Although strictly not recommended, starting values for all model parameters can be provided here. Otherwise these are obtained using the Heckman sample selection correction approach.
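To make the default starting values more concrete, the following is a minimal base R sketch of the classic Heckman two-step correction on simulated data. It is not the package's internal code: the simulated variables (x, z, sel, y), the exclusion restriction z, and the name start_sketch are all hypothetical and chosen for illustration only.

```r
## Hypothetical simulated data; not taken from the package
set.seed(1)
n <- 5000
x <- rnorm(n)                                # covariate in both equations
z <- rnorm(n)                                # enters the selection equation only
e1 <- rnorm(n)                               # selection equation error
e2 <- 0.7 * e1 + sqrt(1 - 0.7^2) * rnorm(n)  # outcome error, correlation 0.7 with e1
sel <- as.numeric(0.5 + x + z + e1 > 0)      # selection indicator
y <- 1 + 2 * x + e2                          # outcome
y[sel == 0] <- NA                            # outcome observed only when selected
d <- data.frame(sel, y, x, z)

## Step 1: probit model for the selection equation
probit <- glm(sel ~ x + z, family = binomial(link = "probit"), data = d)
eta <- predict(probit, type = "link")
d$imr <- dnorm(eta) / pnorm(eta)             # inverse Mills ratio

## Step 2: least squares on the selected sample, adding the inverse Mills ratio
step2 <- lm(y ~ x + imr, data = d, subset = sel == 1)

## Coefficients that a two-step approach of this kind would supply as starting values
start_sketch <- c(coef(probit), coef(step2))
start_sketch
```

The coefficient on imr estimates the product of the error correlation and the outcome error standard deviation, so a clearly non-zero value signals selection on unobservables.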
Argument: rinit
Starting trust region radius. The trust region radius is adjusted as the algorithm proceeds. Bad initial values waste a few steps while the radius is adjusted, but do not prevent the algorithm from working properly. See the documentation of trust for further details.
Argument: rmax
Maximum allowed trust region radius. This may be set very large. If set small, the algorithm traces a steepest descent path.
Argument: fterm
Positive scalar giving the tolerance at which the difference in objective function values in a step is considered close enough to zero to terminate the algorithm.
Argument: mterm
Positive scalar giving the tolerance at which the two-term Taylor-series approximation to the difference in objective function values in a step is considered close enough to zero to terminate the algorithm.
Argument: control
It is a list containing iteration control constants with the following elements: maxit: maximum number of iterations of the magic algorithm; tol: tolerance to use in judging convergence; step.half: if a trial step fails then the method tries halving it up to a maximum of step.half times; rank.tol: constant used to test for numerical rank deficiency of the problem. See the documentation of magic in mgcv for further details.
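The roles of rinit, rmax, fterm and mterm can be illustrated with a simplified one-dimensional trust region iteration. This is only a sketch under stated assumptions: the objective f, its derivatives g and h, the helper trust_1d, the iterlim argument and the 0.25/0.75 thresholds are hypothetical textbook choices, and the code is not the implementation in the trust package.

```r
## Hypothetical 1D objective with its first and second derivatives
f <- function(x) x^4 - 3 * x^2 + x
g <- function(x) 4 * x^3 - 6 * x + 1
h <- function(x) 12 * x^2 - 6

trust_1d <- function(x, rinit = 1, rmax = 100,
                     fterm = sqrt(.Machine$double.eps),
                     mterm = sqrt(.Machine$double.eps),
                     iterlim = 200) {
  r <- rinit
  iter <- 0
  while (iter < iterlim) {
    iter <- iter + 1
    gx <- g(x)
    hx <- h(x)
    ## Minimise the quadratic model gx*p + 0.5*hx*p^2 subject to |p| <= r:
    ## take the Newton step if it fits inside the region, else step to the boundary
    newton_ok <- hx > 0 && abs(gx / hx) <= r
    p <- if (newton_ok) -gx / hx else -sign(gx) * r
    pred <- -(gx * p + 0.5 * hx * p^2)   # reduction predicted by the model
    if (pred < mterm) break              # mterm-style stop: model predicts no gain
    actual <- f(x) - f(x + p)            # reduction actually achieved
    rho <- actual / pred
    if (rho < 0.25) {
      r <- r / 4                         # poor agreement: shrink the radius
    } else if (rho > 0.75 && !newton_ok) {
      r <- min(2 * r, rmax)              # good boundary step: expand, capped at rmax
    }
    if (rho > 0) {
      x <- x + p                         # accept the step
      if (actual < fterm) break          # fterm-style stop: negligible change
    }
  }
  list(x = x, iter = iter, radius = r)
}

fast <- trust_1d(2)                  # default starting radius
slow <- trust_1d(2, rinit = 0.01)    # poor starting radius
c(fast$iter, slow$iter)
```

Both runs reach the same local minimum. The run with the very small starting radius spends its first iterations doubling the radius, which is the behaviour described under rinit.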
----------Details----------
The association between the responses is modelled by the correlation parameter ρ of a bivariate normal distribution. In a semiparametric bivariate sample selection model the linear predictors are flexibly specified using parametric components and smooth functions of covariates. Replacing the smooth components with their regression spline expressions yields a fully parametric bivariate sample selection model. In principle, classic maximum likelihood estimation can be employed. However, to avoid overfitting, penalized likelihood maximization has to be employed instead. Here the use of penalty matrices allows for the suppression of that part of smooth term complexity which has no support from the data. The tradeoff between smoothness and fitness is controlled by smoothing parameters associated with the penalty matrices. Smoothing parameters are chosen to minimize the approximate Un-Biased Risk Estimator (UBRE) score.
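As a rough illustration of how an UBRE-type score trades off fit against model complexity, and of how the gamma argument inflates the degrees of freedom term, consider the base R sketch below. It is deliberately simplified: it uses a Gaussian response, a polynomial basis as a stand-in for a regression spline basis, a ridge-type penalty matrix S, a known error variance sigma2 and a grid search over a single smoothing parameter. All of these are assumptions for illustration and none of them reflects the package's internal code.

```r
## Hypothetical simulated data
set.seed(2)
n <- 200
x <- sort(runif(n))
sigma2 <- 0.3^2                          # error variance, assumed known here
y <- sin(2 * pi * x) + rnorm(n, sd = sqrt(sigma2))

X <- cbind(1, poly(x, 9))                # stand-in basis: intercept + 9 polynomial terms
S <- diag(c(0, rep(1, 9)))               # stand-in penalty: intercept left unpenalized

## UBRE-type score: RSS/n + 2*gamma*sigma2*edf/n - sigma2
ubre <- function(lambda, gamma) {
  B <- solve(crossprod(X) + lambda * S, t(X))  # (X'X + lambda*S)^{-1} X'
  fitted <- X %*% (B %*% y)
  edf <- sum(diag(X %*% B))                    # trace of the influence matrix
  sum((y - fitted)^2) / n + 2 * gamma * sigma2 * edf / n - sigma2
}

lambdas <- 10^seq(-4, 2, length.out = 61)
lambda_1  <- lambdas[which.min(sapply(lambdas, ubre, gamma = 1))]
lambda_14 <- lambdas[which.min(sapply(lambdas, ubre, gamma = 1.4))]
c(gamma_1 = lambda_1, gamma_1.4 = lambda_14)
```

Because the effective degrees of freedom decrease as the smoothing parameter grows, a larger gamma can only move the selected smoothing parameter upwards (or leave it unchanged), which is why values such as 1.4 give smoother fits.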
The optimization problem is solved by the Newton-Raphson method. Automatic smoothing parameter selection is integrated using a performance-oriented iteration approach (Gu, 1992; Wood, 2004) combined with a "leapfrog" algorithm (Smith, 1996). Roughly speaking, at each iteration, (i) the penalized weighted least squares problem is solved, then (ii) the smoothing parameters of that problem are estimated by approximate UBRE. Steps (i) and (ii) are iterated until convergence. Details of the underlying fitting methods are given in Marra and Radice (submitted).
----------Value----------
The function returns an object of class SemiParSampleSel as described in SemiParSampleSelObject.
----------WARNINGS----------
Any automatic smoothing parameter selection procedure is unlikely to work well when the data have low information content. In the current context, convergence failure may especially occur when ρ is high and the total number and selected number of observations are low. If this happens, then one might either (i) lower the total number of parameters to estimate by reducing the dimension of the regression spline bases, (ii) set the smoothing parameters to the values obtained from the univariate fits (aut.sp=FALSE), or (iii) set the smoothing parameters to the values obtained from the non-converged algorithm. The default option is (iii).
Fully parametric modelling is allowed for. However, it is not possible to specify one linear predictor as a function of parametric and smooth components, and the other as a function of parametric terms only. If continuous covariates are available, then we should let the data determine which effects are linear or non-linear and for which equations.
----------Author(s)----------
Maintainer: Giampiero Marra <giampiero@stats.ucl.ac.uk>
----------References----------
----------See Also----------
InfCr, plot.SemiParSampleSel, SemiParSampleSel-package, SemiParSampleSelObject, summary.SemiParSampleSel
----------Examples----------
library(SemiParSampleSel)
library(mvtnorm)  # provides rmvnorm(), used to simulate the data
############
## Generate data
## Correlation between the two equations and covariate correlation 0.5
## Sample size 2000
set.seed(0)
n <- 2000
rhC <- rhU <- 0.5
SigmaU <- matrix(c(1,rhU,rhU,1),2,2)
U <- rmvnorm(n,rep(0,2),SigmaU)
SigmaC <- matrix( c(1,rhC,rhC,
                    rhC,1,rhC,
                    rhC,rhC,1), 3 , 3)
cov <- rmvnorm(n,rep(0,3),SigmaC, method="svd")
cov <- pnorm(cov)
bi <- round(cov[,1]); x1 <- cov[,2]; x2 <- cov[,3]
f11 <- function(x) -0.7*(4*x + 2.5*x^2 + 0.7*sin(5*x) + cos(7.5*x))
f12 <- function(x) -0.4*( -0.3 - 1.6*x + sin(5*x))
f21 <- function(x) 0.6*(exp(x) + sin(2.9*x))
## ys is the selection indicator, yo the outcome (set to 0 when not selected)
ys <- 0.58 + 2.5*bi + f11(x1) + f12(x2) + U[, 1] > 0
y <- -0.68 - 1.5*bi + f21(x1) + U[, 2]
yo <- y*(ys > 0)
dataSim <- data.frame(ys,yo,bi,x1,x2)
## CLASSIC SAMPLE SELECTION MODEL
## the first equation MUST be the selection equation
out <- SemiParSampleSel(ys ~ bi + x1 + x2,
                        yo ~ bi + x1,
                        data=dataSim)
summary(out)
InfCr(out)
InfCr(out,cr="BIC")
## SEMIPARAMETRIC SAMPLE SELECTION MODEL
## the first equation MUST be the selection equation
## "cr" cubic regression spline basis - "cs" shrinkage version of "cr"
## "tp" thin plate regression spline basis - "ts" shrinkage version of "tp"
## for smooths of one variable, "cr/cs" and "tp/ts" achieve similar results
## k is the basis dimension - default is 10
## m is the order of the penalty for the specific term - default is 2
out <- SemiParSampleSel(ys ~ bi + s(x1,bs="cr",k=10,m=2) + s(x2,bs="cr",k=10),
                        yo ~ bi + s(x1,bs="cr",k=10),
                        data=dataSim)
InfCr(out)
## compare the two summary outputs
## the second output produces a summary of the results obtained when only
## the outcome equation is fitted, i.e. selection bias is not accounted for
summary(out)
summary(out$gam2)
## estimated smooth function plots
## the red line is the true curve
## the blue line is the naive curve not accounting for selection bias
x1 <- sort(x1)
f21.x1 <- f21(x1)[order(x1)]-mean(f21(x1))
plot(out, eq=2, select=1, ylim=c(-1,0.8)); lines(x1, f21.x1, col="red")
par(new=TRUE)
plot(out$gam2, select=1, se=FALSE, col="blue",ylim=c(-1,0.8),ylab="",rug=FALSE)