R sdtoolkit package: sdprim() function help documentation

loveR · posted 2012-9-29 23:39:19

sdprim(sdtoolkit)
sdprim() belongs to the R package: sdtoolkit

Patient Rule Induction Method Adapted For Scenario Discovery

Translator: biostatistic.net robot LoveR

----------Description----------

This is the primary function for the sdtoolkit package.  It organizes many subsidiary functions into an interactive session to aid the user in identifying policy-relevant scenarios.  It is based on Friedman and Fisher's PRIM, but includes additional modifications and diagnostics to better suit the scenario discovery task.

----------Usage----------

sdprim(x, y = NULL,
thresh = NULL,
peel.alpha = 0.1,
paste.alpha = 0.05,
mass.min = 0.001,
pasting = TRUE,
box.init = NULL,
coverage = TRUE,
outfile = "boxsum.txt",
csvfile = "primboxes.csv",
repro = TRUE,
nbump = 10,
dfrac = 0.5,
threshtype = ">")

----------Arguments----------

Argument: x
Matrix of input variables - all values must be numeric at present.  (No non-binary categorical values allowed.)

Argument: y
Output vector, which may or may not already be thresholded to zeros and ones.  If not thresholded, you must specify both thresh and threshtype below.

Argument: thresh
Numeric - if y is not a zero-one vector, you must specify a real value on which to threshold the data.

Argument: peel.alpha
Peeling parameter for PRIM.  Typically around 0.05 to 0.1.

Argument: paste.alpha
Pasting parameter for PRIM.

Argument: mass.min
Minimum fraction of original points that must remain to allow continued peeling.

Argument: pasting
Logical indicating whether or not to paste.

Argument: box.init
Matrix specifying the initial box bounds, in case you want to limit the area over which PRIM searches.  The format is 2 rows by ncol(x), with the first row specifying lower bounds on each dimension, and the second row specifying upper bounds.  The matrix should have column names matching colnames(x).

Argument: coverage
Logical, indicating whether or not to provide coverage-oriented statistics during plotting, rather than support-oriented statistics.

Argument: outfile
Optional character, naming a text file where the box summary information will be copied.  Set equal to NA (no quotes) if you would not like a file written out.

Argument: csvfile
Optional character, naming a default csv file where CARs-readable output will be sent.  Regardless of what you put here, sdprim will also ask if you would like to write it out as csv, and give you the chance to change the filename.

Argument: repro
Logical, specifying whether or not to automatically generate reproducibility statistics by rerunning PRIM on random samplings of the dataset.  The only reason to set this to FALSE is a prohibitively long run time.  An alternative is to lower the number of resamplings, nbump, below.

Argument: nbump
Integer - If repro = TRUE, how many resamplings should be performed?  Currently this only allows sampling without replacement.

Argument: dfrac
Numeric between 0 and 1.  If repro = TRUE, what fraction of the dataset should be resampled each time?

Argument: threshtype
Required only if the output data is not already in a zero-one format.  A relational operator entered as a character (either "<", ">", "<=" or ">="), describing how the values in y should be thresholded.  The threshold is treated as being on the right side of the operator; thus, for example, "<" would turn all values in y that are less than thresh into ones, and all other values into zeros.
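
As a rough illustration of two of the arguments above, the following sketch mimics the documented meaning of thresh and threshtype and builds a matrix in the shape described for box.init.  It does not call sdprim and is not sdtoolkit's internal code; the row names and the depth limit of 400 are our own assumptions, chosen only for illustration.

```r
# Illustrative only: mimics the documented semantics, not sdtoolkit internals.
data(quakes)
inputs <- as.matrix(quakes[, c("lat", "long", "depth", "stations")])
yout   <- quakes[, "mag"]

# threshtype is a relational operator given as a character string, with
# thresh on its right-hand side: values with y > thresh become 1, others 0.
threshtype <- ">"
thresh     <- 5.0
y01 <- 1 * match.fun(threshtype)(yout, thresh)
table(y01)

# box.init: 2 rows by ncol(x); row 1 = lower bounds, row 2 = upper bounds,
# with column names matching colnames(x).
box.init <- apply(inputs, 2, range)
rownames(box.init) <- c("lower", "upper")   # row names are our own labels
box.init["upper", "depth"] <- 400           # hypothetical limit on the search area
box.init
```

A matrix built this way could then be passed as box.init, with y01 (or yout plus thresh and threshtype) as the output.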

----------Details----------

Here we discuss several terminology issues that will be useful in understanding the output below, and then describe the interactive process used to identify scenarios.

This package is generally oriented around producing boxes, which are defined by a set of orthogonal restrictions on the input (uncertainty) space of a model.  In a policy context, a box or set of related boxes can be interpreted as a scenario.  See the references at the end of this entry for a more thorough explanation of the concept of analytically generated scenarios.

Two important measures of scenario adequacy are termed coverage and density.  These are defined based on the binary output variable, which typically denotes some measure of "interestingness" of the input-output combination.  Coverage is the ratio of interesting points (those with an output value of one) captured by the box to the total interesting points in the dataset, and density is the fraction of captured points that are actually interesting (ones in the box to total points in the box).  (These have analogues to Type I and Type II errors and other measures from information theory.)
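
The two definitions can be written out directly.  The sketch below uses a hypothetical helper, box_stats(), which is not part of sdtoolkit, and an arbitrary restriction on the quakes data chosen only to have something to measure.

```r
# Coverage and density of one box, following the definitions in the text.
# box_stats() is a hypothetical helper, not an sdtoolkit function.
box_stats <- function(x, y01, box) {
  # box: 2 x ncol(x) matrix, row 1 = lower bounds, row 2 = upper bounds
  inside <- rep(TRUE, nrow(x))
  for (j in seq_len(ncol(x))) {
    inside <- inside & x[, j] >= box[1, j] & x[, j] <= box[2, j]
  }
  c(coverage = sum(y01[inside]) / sum(y01),     # interesting captured / all interesting
    density  = sum(y01[inside]) / sum(inside),  # interesting captured / all captured
    support  = mean(inside))                    # captured / all points
}

data(quakes)
x   <- as.matrix(quakes[, c("lat", "long", "depth", "stations")])
y01 <- 1 * (quakes$mag > 5.0)

box <- apply(x, 2, range)   # start from the full data range
box[1, "stations"] <- 50    # arbitrary restriction: at least 50 stations
box_stats(x, y01, box)
```

An unrestricted box has coverage 1 and a density equal to the overall share of interesting points; tightening the box typically trades coverage for density.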

The basic PRIM algorithm operates in two steps, first generating a trajectory of boxes (in coverage-density space) that provides a tradeoff frontier for scenarios, with coverage, density and the number of restricted dimensions generally in tension.  From this trajectory, the user selects and further inspects one or more candidate boxes that appropriately balance the measures of interest.  After identifying and possibly modifying a box, the data points contained within that box are removed from the input matrix, and a new trajectory is generated.  This process (covering) can continue until the user is satisfied with the total coverage achieved, or until other conditions prohibit further covering.  The resulting set of selected boxes is referred to as a box sequence, and is the primary product of the PRIM algorithm.
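
To make the idea of a peeling trajectory concrete, here is a deliberately simplified sketch.  It is not the sdtoolkit implementation: peel_trajectory() is a hypothetical name, there is no pasting step, and ties in the data are handled crudely.  At each step it tries peeling a fraction peel.alpha off the low or high end of every dimension and keeps the peel that leaves the highest mean output.

```r
# Simplified PRIM peeling sketch (no pasting, no diagnostics).
# peel_trajectory() is hypothetical and is not sdtoolkit code.
peel_trajectory <- function(x, y01, peel.alpha = 0.1, mass.min = 0.05) {
  n <- nrow(x)
  inside <- rep(TRUE, n)
  traj <- data.frame(coverage = 1, density = mean(y01), mass = 1)
  repeat {
    best <- NULL
    best_mean <- -Inf
    for (j in seq_len(ncol(x))) {
      xj <- x[inside, j]
      # candidate peels: drop the lowest or the highest alpha fraction
      keep_lo <- inside & x[, j] > quantile(xj, peel.alpha)
      keep_hi <- inside & x[, j] < quantile(xj, 1 - peel.alpha)
      for (keep in list(keep_lo, keep_hi)) {
        if (sum(keep) / n >= mass.min) {
          m <- mean(y01[keep])
          if (m > best_mean) { best_mean <- m; best <- keep }
        }
      }
    }
    if (is.null(best)) break          # no peel leaves enough mass: stop
    inside <- best
    traj <- rbind(traj, data.frame(coverage = sum(y01[inside]) / sum(y01),
                                   density  = mean(y01[inside]),
                                   mass     = mean(inside)))
  }
  traj
}

data(quakes)
x    <- as.matrix(quakes[, c("lat", "long", "depth", "stations")])
y01  <- 1 * (quakes$mag > 5.0)
traj <- peel_trajectory(x, y01)
head(traj)
```

Plotting traj$coverage against traj$density gives a rough picture of the tradeoff frontier from which candidate boxes are chosen.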

The sdprim algorithm is very interactive, and the user will receive several prompts while running it.  Note that during this process, at least on MS Windows versions, you will receive a "busy" hourglass even when you have an activated R console awaiting user input.  Input can still (must) be entered in this state.

The first prompt asks whether the user would like the peeling trajectory displayed with dimension contours, with dominating points, or in the standard form.

Dimension contours represent coverage and density values achieved by boxes while holding the total number of restricted dimensions constant.  They do not represent a complete or optimal contour derived by considering the frontier of all possible boxes of a given dimension, but rather simply display the coverage and density associated with boxes that could be created by dropping dimensions (in order of least importance) from other boxes in the current trajectory.  That is, if there is a 3-dimensional box with restrictions x1 < 1, x8 > 5, x2 < 200, then the box defined by x1 < 1 and x8 > 5 will be included in the contour for 2-d boxes.  Often, each point on a dimension contour will be defined by different restrictions on the same set of dimensions, but this is not always true.

Dominating points are those coverage and density combinations which satisfy two conditions:  They are associated with boxes generated by dropping dimensions from those boxes on the trajectory, and they lie "above and to the right" of the current trajectory.  That is, they are more simply defined, and also extend the density-coverage frontier.  While a dominating point may not represent an overall ideal tradeoff between density, coverage and interpretability, it should be locally preferred to points nearby.

After selecting which display option they would like, the user is then asked to select candidate points from the trajectory displayed.  Clicking on points displays a number for the box associated with that point.  When clicking on points on the original trajectory (denoted by large filled circles), the number shown refers to the index number of exactly that box.  When clicking on dominating points or points on contour lines, the number shown references the box on the trajectory from which the box represented by the point of interest was derived.  That is, points not on the trajectory are all derived by taking some box on the trajectory and removing some dimension restrictions - and the number shown when you click on those points refers to the original full box.  This should be borne in mind later on.

After identifying candidate boxes, their statistics are displayed in the R console.  The user then picks a box to consider in more detail.  The program then shows various additional information about the box, and afterwards the user is given the opportunity to drop specific dimension restrictions.  This completes the process of selecting a single box.  The user is then given the opportunity to continue covering, in which they repeat the process above on all the data not encompassed by the previously selected box.
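
The covering step can be pictured with a small sketch.  The two boxes below are hand-picked, hypothetical restrictions on the quakes data rather than output of sdprim, and in_box() is a helper of our own; the point is only that each new box is evaluated on the data left over after earlier boxes are removed.

```r
# Covering sketch: after a box is chosen, its points are removed and the
# search is repeated on what is left.  Boxes are hypothetical, not sdprim output.
data(quakes)
x   <- as.matrix(quakes[, c("lat", "long", "depth", "stations")])
y01 <- 1 * (quakes$mag > 5.0)

in_box <- function(x, box) {
  # TRUE for rows of x that lie inside the 2 x ncol(x) bounds matrix
  apply(x, 1, function(r) all(r >= box[1, ] & r <= box[2, ]))
}

full <- apply(x, 2, range)
box1 <- full; box1[1, "stations"] <- 60   # first box
box2 <- full; box2[1, "stations"] <- 40   # second box, applied to the remainder

captured1 <- in_box(x, box1)
remaining <- !captured1                   # data left for the next round
captured2 <- remaining & in_box(x, box2)

# Coverage accumulates over the box sequence
c(box1  = sum(y01[captured1]) / sum(y01),
  box2  = sum(y01[captured2]) / sum(y01),
  total = sum(y01[captured1 | captured2]) / sum(y01))
```

Because the second box only sees the remaining points, its own coverage and density depend on which box was selected first.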

----------Value----------

A box sequence object, which is a list, each entry containing a box object (which, in an ideal, more formalized world, would be a class).  A box object contains a great deal of information about the particular box in question, including its definition and associated statistics (described below).  Additionally, the box sequence object contains two attributes, estats and olap, which are ensemble statistics for the entire box sequence.  While the structure of the output below may be of interest to advanced users, there is no need for the non-R user to be familiar with these outputs, as there are multiple functions for interpreting and displaying the output in a more friendly manner, such as seqinfo and dimplot.

y.mean: Numeric, the mean of the points inside the box.  If the output data is 0-1, then this is the density.  Note that if this is something other than the first box in the sequence, the density is contingent on the covering process up to that point.

box: A 2 by d matrix giving the absolute bounds of the box, including those bounds that were not restricted.  For unrestricted dimension ends, they are taken from the range of the data along that dimension.

mass: The number of points inside the box, expressed as a fraction of the (sub)dataset used to generate this box.  The "full" mass can be found in the olap attribute of the box sequence.

dimlist: A list of three logical vectors (either, lower, and upper), each having length equal to the number of input dimensions.  upper and lower indicate whether the upper and lower ends were restricted, and either is just an OR of lower and upper.  Thus, one way to see only the restricted box dimensions is with the code bs[[boxnumber]]$box[,bs[[boxnumber]]$dimlist$either], where "bs" is replaced by the name of the box sequence object.

morestats: A matrix with one row per restricted box definition, which contains the "remove variable" statistics.  The columns are as follows:  Column number in input matrix, density, coverage, support.

relcoverage: Tracks the coverage of the entire trajectory from which this box was taken, normalized to the subdataset the box was taken from.

pvvalist: A matrix giving the quasi-p-values for each dimension restriction.

freqmat: A matrix giving the reproducibility statistics (assuming sdprim was called with argument repro=TRUE).  The columns correspond to the columns of the input matrix, and the first row gives the reproducibility statistics when PRIM was matched on coverage, the second when it was matched on density.  The entries represent the fraction of time each dimension was restricted when PRIM was rerun on nbump random subsamples (of size N*dfrac) of the dataset.

index: For the sake of reproducibility, this gives the index of the box in the trajectory from which it was selected.  Note that, unless this is the first box in the sequence, the accuracy of this index is contingent on selection of the identical index, dimensions and restrictions for the previous boxes, and parameters for PRIM - i.e., it refers to the trajectory that results after the previous boxes in the sequence have been selected.

relbox: A matrix of structure similar to box, except that the bounds are normalized so that they range from zero to one.  These are used in the dimplot command for visualizing dimension restrictions.
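
The dimlist indexing idiom can be tried on a hand-built stand-in.  The object bs below is a mock box sequence containing only the box and dimlist fields, constructed by us from the quakes data; a real object would come from sdprim() and carry the other fields as well.

```r
# Mock box sequence with one box, built by hand to show the documented
# fields; a real object would be returned by sdprim().
data(quakes)
x <- as.matrix(quakes[, c("lat", "long", "depth", "stations")])

box <- apply(x, 2, range)
box[1, "stations"] <- 50    # lower end of 'stations' restricted (hypothetical)
box[2, "depth"]    <- 300   # upper end of 'depth' restricted (hypothetical)

lower <- box[1, ] > apply(x, 2, min)
upper <- box[2, ] < apply(x, 2, max)
bs <- list(list(box = box,
                dimlist = list(either = lower | upper,
                               lower  = lower,
                               upper  = upper)))

boxnumber <- 1
restricted <- bs[[boxnumber]]$box[, bs[[boxnumber]]$dimlist$either]
restricted
```

If only one dimension is restricted, R drops the result to a plain vector; adding drop = FALSE inside the brackets keeps it as a matrix with its column name.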

----------Box Sequence Object Attributes----------

The overall box sequence object returns three attributes as well.  The list estats contains "ensemble" statistics for the entire dataset, as follows:



ecov: Total coverage for the box sequence.

esup: Total support for the box sequence.

eden: Overall density for the space encompassed by the box sequence.

npts: Total points in the dataset used.

ninter: Total number of interesting points in the dataset (after thresholding).

intin: Total number of interesting points captured by the box sequence.

totin: Total number of points captured by the box sequence.

totdims: Total dimensions in the input data.
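
Assuming the ensemble statistics follow the same coverage, density and support definitions used for single boxes (an assumption on our part, not something confirmed here), the count fields and the ratio fields would be related as in this sketch.  All numbers are made up.

```r
# Hypothetical counts, chosen only to illustrate how the fields relate
estats <- list(npts = 1000, ninter = 200, totin = 180, intin = 150, totdims = 4)

estats$ecov <- estats$intin / estats$ninter  # interesting captured / all interesting
estats$eden <- estats$intin / estats$totin   # interesting captured / all captured
estats$esup <- estats$totin / estats$npts    # captured / all points

unlist(estats[c("ecov", "eden", "esup")])
```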

The attribute olaps contains a matrix that holds several pieces of information.  The diagonals display the absolute coverage of box i (in position matrix[i,i]).  The lower triangle displays the overlap in total interesting points between boxes i and j.  The upper triangle displays the total points in common.  All are expressed as a fraction of total points in the dataset.
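
A matrix laid out this way can be read with base R helpers.  The values below are made up for a two-box sequence; a real matrix would be taken from the object returned by sdprim.

```r
# Mock 2-box overlap matrix with made-up values
olaps <- matrix(c(0.60, 0.03,
                  0.01, 0.25), nrow = 2, byrow = TRUE)

diag(olaps)               # per-box entries, position [i, i]
olaps[lower.tri(olaps)]   # interesting points shared by boxes i and j
olaps[upper.tri(olaps)]   # all points shared by boxes i and j
```

Since the two triangles mean different things, the matrix is not symmetric and should not be treated as one.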

----------Author(s)----------

Benjamin P. Bryant, bryant@prgs.edu

----------See Also----------

sd.start for reading in and cleaning data, seqinfo for viewing the output of sdprim, and dimplot for visualizing dimension restrictions.

----------Examples----------

#Load some example data to play with:
data(quakes)
#quakes is a 1000 by 5 dataset of earthquake information.  This has no obvious
#policy significance, but we can use this built-in dataset to illustrate the use
#of PRIM.

#Here are the columns:
colnames(quakes)

#We will say magnitude is the output of interest, and call earthquakes greater
#than 5.0 'interesting.'  We can then call sdprim two different ways.

#First, make an input matrix from columns 1,2,3 and 5
inputs <- quakes[,c(1:3,5)]  #could also do quakes[,-4]

#Now pull out our unthresholded y vector:
yout <- quakes[,"mag"] #could also do quakes[,4]

#Now we can either call sdprim and threshold inside PRIM, like this:
## Not run: myboxes <- sdprim(x=inputs, y=yout, thresh=5.0, threshtype=">")

#Or we can first threshold yout:
ythresh <- 1*(yout>5.0)

#and then call sdprim without worrying about the thresholds:
## Not run: myboxes <- sdprim(x=inputs, y=ythresh)

Please credit the source when reposting: from the Biostatistics Home site (http://www.biostatistic.net).
