R package RODM: RODM_create_oc_model() function help documentation

loveR · posted 2012-9-27 22:51:56

RODM_create_oc_model(RODM)
RODM_create_oc_model() belongs to R package: RODM

                                    Create an O-cluster model

                                       Translator: LoveR, the 生物统计家园网 robot

----------Description----------

This function creates an O-cluster model.

----------Usage----------

RODM_create_oc_model(database,
                     data_table_name,
                     case_id_column_name,
                     model_name = "OC_MODEL",
                     auto_data_prep = TRUE,
                     num_clusters = NULL,
                     max_buffer = NULL,
                     sensitivity = NULL,
                     retrieve_outputs_to_R = TRUE,
                     leave_model_in_dbms = TRUE,
                     sql.log.file = NULL)

----------Arguments----------

Argument: database
Database ODBC channel identifier returned from a call to RODM_open_dbms_connection.

Argument: data_table_name
Database table/view containing the training dataset.

Argument: case_id_column_name
Column in data_table_name that uniquely identifies each row (case).

Argument: model_name
ODM model name.

Argument: auto_data_prep
Whether or not ODM should invoke automatic data preparation for the build.

Argument: num_clusters
Setting that specifies the number of clusters for the clustering model.

Argument: max_buffer
Buffer size for O-Cluster. Default is 50,000.

Argument: sensitivity
A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density. Default is 0.5.

Argument: retrieve_outputs_to_R
Flag controlling whether the output results are moved to the R environment.

Argument: leave_model_in_dbms
Flag controlling whether the model is deleted or left in the RDBMS.

Argument: sql.log.file
File to which the log of all SQL calls made by this function is appended.
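As a sketch of how the optional arguments fit together, the wrapper below passes non-default values for sensitivity, max_buffer and sql.log.file. The wrapper name build_oc, the log file name and the tuning values are hypothetical and chosen only for illustration; only the RODM_create_oc_model() argument names come from the Usage section above. Calling it requires the RODM package and an open database connection.

```r
# Hypothetical wrapper around RODM_create_oc_model(); the tuning values are
# illustrative assumptions, not recommendations.
build_oc <- function(DB, table_name, case_id,
                     sensitivity = 0.7, max_buffer = 10000) {
  RODM::RODM_create_oc_model(
    database = DB,                    # channel from RODM_open_dbms_connection
    data_table_name = table_name,     # table/view with the training data
    case_id_column_name = case_id,    # unique row identifier column
    model_name = "OC_MODEL",
    num_clusters = 5,
    max_buffer = max_buffer,          # batch buffer size (documented default 50,000)
    sensitivity = sensitivity,        # peak-density fraction (documented default 0.5)
    retrieve_outputs_to_R = TRUE,
    leave_model_in_dbms = TRUE,
    sql.log.file = "oc_build.log")    # hypothetical log file name
}
```

Because the RODM call sits inside the function body, defining build_oc needs no database; the connection is only touched when the function is actually called, for example build_oc(DB, "ds", "ROW_ID").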

----------Details----------

The O-Cluster algorithm creates a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The resulting clusters define dense areas in the attribute space.

The clusters are described by intervals along the attribute axes and the corresponding centroids and histograms. A parameter called sensitivity defines a baseline density level. Only areas with peak density above this baseline level can be identified as clusters.

The k-means algorithm tessellates the space even when natural clusters may not exist. For example, if there is a region of uniform density, k-means tessellates it into n clusters (where n is specified by the user). O-Cluster separates areas of high density by placing cutting planes through areas of low density. O-Cluster needs multi-modal histograms (peaks and valleys). If an area has projections with uniform or monotonically changing density, O-Cluster does not partition it.
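The idea of cutting through a low-density valley between two peaks can be illustrated on a single 1-D histogram. The sketch below is a simplified toy, not Oracle's implementation: the function name find_valley_cut, the min_ratio threshold and the bin counts are all assumptions made for illustration. It returns the index of the lowest bin that has a clearly higher peak on both sides, and NA when the histogram is uniform or monotonic.

```r
# Toy valley detector on a vector of histogram bin counts.
# min_ratio (assumed) says how much higher than the valley both peaks must be.
find_valley_cut <- function(counts, min_ratio = 2) {
  n <- length(counts)
  best <- NA_integer_
  if (n < 3) return(best)
  for (i in 2:(n - 1)) {
    left_peak  <- max(counts[1:(i - 1)])
    right_peak <- max(counts[(i + 1):n])
    # A valley needs a clearly higher peak on both sides of bin i
    is_valley <- min(left_peak, right_peak) >= min_ratio * max(counts[i], 1)
    if (is_valley && (is.na(best) || counts[i] < counts[best])) best <- i
  }
  best
}

bimodal  <- c(2, 10, 20, 9, 1, 0, 1, 8, 22, 11, 3)  # two peaks, one valley
uniform  <- rep(10, 11)                             # no structure
monotone <- 1:11                                    # steadily increasing

find_valley_cut(bimodal)   # index of the empty bin between the two peaks
find_valley_cut(uniform)   # NA: nothing to separate
find_valley_cut(monotone)  # NA: no valley flanked by two peaks
```

A k-means run on the uniform case would still return n clusters, whereas a valley-based rule like this one declines to split it.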

The clusters discovered by O-Cluster are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numerical attributes and multinomial distributions for categorical attributes.
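For numerical attributes only, the scoring step described above amounts to Bayes' rule over a mixture whose components are products of independent normals. The sketch below is a minimal illustration under that assumption; the function name score_point and the cluster parameters are hypothetical, and categorical (multinomial) attributes are left out.

```r
# Posterior cluster probabilities for one point under a mixture of
# products of independent normal distributions (numeric attributes only).
# means, sds: matrices with one row per cluster, one column per attribute.
score_point <- function(x, priors, means, sds) {
  log_lik <- vapply(seq_along(priors), function(k) {
    sum(dnorm(x, mean = means[k, ], sd = sds[k, ], log = TRUE))
  }, numeric(1))
  log_post <- log(priors) + log_lik
  post <- exp(log_post - max(log_post))  # subtract the max for numerical stability
  post / sum(post)
}

# Two hypothetical clusters in the (X1, Y1) plane
priors <- c(0.5, 0.5)
means  <- rbind(c(2, 1), c(8, 4))
sds    <- rbind(c(1, 2), c(2, 1.5))

p <- score_point(c(2.2, 0.8), priors, means, sds)
which.max(p)  # cluster with the highest posterior probability
```

Working in log space keeps the product of many small densities from underflowing when there are many attributes.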

Keep the following in mind if you choose to prepare the data for O-Cluster:
1. O-Cluster does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000). It will only read another batch if it believes, based on statistical tests, that there may still exist clusters that it has not yet uncovered.
2. Because O-Cluster may stop the model build before it reads all of the data, it is highly recommended that the data be randomized.
3. Binary attributes should be declared as categorical. O-Cluster maps categorical data to numerical values.
4. The use of Oracle Data Mining's equi-width binning transformation with automated estimation of the required number of bins is highly recommended.
5. The presence of outliers can significantly impact clustering algorithms. Use a clipping transformation before binning or normalizing. Outliers with equi-width binning can prevent O-Cluster from detecting clusters. As a result, the whole population appears to fall within a single cluster.
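Points 2, 4 and 5 above can be sketched on the R side before the table is pushed to the database. This is an illustration only: Oracle Data Mining has its own in-database transformations, and the helper names clip and equi_width_bin, the quantile cut-offs and the bin count are assumptions. The toy data has one extreme outlier to show why clipping should come before equi-width binning.

```r
# Clip (winsorize) values to the given lower/upper quantiles.
clip <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Equi-width binning: cut() with a single number splits the range evenly.
equi_width_bin <- function(x, n_bins = 10) {
  cut(x, breaks = n_bins, labels = FALSE, include.lowest = TRUE)
}

# Toy data: 199 evenly spread values plus one extreme outlier
raw <- data.frame(ROW_ID = 1:200,
                  X1 = c(seq(0, 10, length.out = 199), 1e6))

# Without clipping, the outlier stretches the range and all other
# points collapse into the first bin.
table(equi_width_bin(raw$X1))

# Randomize row order, then clip before binning.
set.seed(1)
shuffled <- raw[sample(nrow(raw)), ]
shuffled$X1_BIN <- equi_width_bin(clip(shuffled$X1), n_bins = 10)
table(shuffled$X1_BIN)
```

The first table shows the failure mode described in point 5; the second shows the points spread across the bins once the outlier is clipped.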

For more details on the algorithm implementation, parameter settings and characteristics of the ODM function itself, consult the following Oracle documents: ODM Concepts, ODM Developer's Guide, Oracle PL/SQL Packages: Data Mining, and Oracle Database SQL Language Reference (Data Mining functions), listed in the references below.

----------Value----------

If retrieve_outputs_to_R is TRUE, returns a list with the following elements:

model.model_settings      Table of settings used to build the model.
model.model_attributes    Table of attributes used to build the model.
oc.clusters               General per-cluster information.
oc.split_predicate        Cluster split predicates.
oc.taxonomy               Parent-child cluster relationship.
oc.centroid               Per cluster-attribute centroid information.
oc.histogram              Per cluster-attribute histogram information.
oc.rule                   Cluster rules.
oc.leaf_cluster_count     Leaf clusters with support.
oc.assignment             Assignment of training data to clusters (with probability).

----------Author(s)----------

Pablo Tamayo pablo.tamayo@oracle.com

Ari Mozes ari.mozes@oracle.com

----------References----------

B.L. Milenova and M.M. Campos, Clustering Large Databases with Numeric and Nominal Values Using Orthogonal Projection, Proceedings of the 29th VLDB Conference, Berlin, Germany (2003).
Oracle9i O-Cluster: Scalable Clustering of Large High Dimensional Data Sets http://www.oracle.com/technology/products/bi/odm/pdf/o_cluster_algorithm.pdf
Oracle Data Mining Concepts 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/toc.htm
Oracle Data Mining Application Developer's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28131/toc.htm
Oracle Data Mining Administrator's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28130/toc.htm
Oracle Database PL/SQL Packages and Types Reference 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_datmin.htm#ARPLS192
Oracle Database SQL Language Reference (Data Mining functions) 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions001.htm#SQLRF20030

----------See Also----------

RODM_apply_model

----------Examples----------

## Not run:
DB <- RODM_open_dbms_connection(dsn="orcl11g", uid= "rodm", pwd = "rodm")

### Clustering a 2D multi-Gaussian distribution of points into clusters

set.seed(seed=6218945)
X1 <- c(rnorm(100, mean = 2, sd = 1), rnorm(100, mean = 8, sd = 2), rnorm(100, mean = 5, sd = 0.6),
      rnorm(100, mean = 4, sd = 1), rnorm(100, mean = 10, sd = 1)) # Create and merge 5 Gaussian distributions
Y1 <- c(rnorm(100, mean = 1, sd = 2), rnorm(100, mean = 4, sd = 1.5), rnorm(100, mean = 6, sd = 0.5),
      rnorm(100, mean = 3, sd = 0.2), rnorm(100, mean = 2, sd = 1))
ds <- data.frame(cbind(X1, Y1))
n.rows <- length(ds[,1])                                                 # Number of rows
row.id <- matrix(seq(1, n.rows), nrow=n.rows, ncol=1, dimnames= list(NULL, c("ROW_ID"))) # Row id
ds <- cbind(row.id, ds)                                                    # Add row id to dataset
RODM_create_dbms_table(DB, "ds")

oc <- RODM_create_oc_model(
database = DB,                # database ODBC channel identifier
data_table_name = "ds",       # database table containing the input dataset
case_id_column_name = "ROW_ID", # case id to enable assignments during build
num_clusters = 5)

oc2 <- RODM_apply_model(
database = DB,                # database ODBC channel identifier
data_table_name = "ds",       # database table containing the input dataset
model_name = "OC_MODEL",
supplemental_cols = c("X1","Y1"))

x1a <- oc2$model.apply.results[, "X1"]
y1a <- oc2$model.apply.results[, "Y1"]
clu <- oc2$model.apply.results[, "CLUSTER_ID"]
c.numbers <- unique(as.numeric(clu))
c.assign <- match(clu, c.numbers)
# One color per cluster actually found (a fixed 3-color map leaves any extra clusters unplotted)
color.map <- rainbow(length(c.numbers))
color <- color.map[c.assign]
nf <- layout(matrix(c(1, 2), 1, 2, byrow=TRUE), widths = c(1, 1), heights = 1, respect = FALSE)
plot(x1a, y1a, pch=20, col=1, xlab="X1", ylab="Y1", main="Original Data Points")
plot(x1a, y1a, pch=20, type = "n", xlab="X1", ylab="Y1", main="After OC clustering")
points(x1a, y1a, col = color, pch=20)   # color each point by its assigned cluster
legend(5, -0.5, legend=paste("Cluster", seq_along(c.numbers)), pch = rep(20, length(c.numbers)),
   col = color.map, pt.bg = color.map, cex = 0.8, pt.cex=1, bty="n")

oc       # look at the model details and cluster assignments

RODM_drop_model(DB, "OC_MODEL") # Drop the model
RODM_drop_dbms_table(DB, "ds") # Drop the database table

RODM_close_dbms_connection(DB)

## End(Not run)

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).
