R language RODM package: help documentation for the RODM_create_kmeans_model() function

loveR · posted 2012-9-27 22:51:28

RODM_create_kmeans_model(RODM)
RODM_create_kmeans_model() belongs to the R package: RODM

Create a Hierarchical k-means model

Translator: LoveR, the robot of 生物统计家园网 (biostatistic.net)

----------Description----------

This function creates a Hierarchical k-means model.

----------Usage----------

RODM_create_kmeans_model(database,
                     data_table_name,
                     case_id_column_name = NULL,
                     model_name = "KM_MODEL",
                     auto_data_prep = TRUE,
                     num_clusters = NULL,
                     block_growth = NULL,
                     conv_tolerance = NULL,
                     euclidean_distance = TRUE,
                     iterations = NULL,
                     min_pct_attr_support = NULL,
                     num_bins = NULL,
                     variance_split = TRUE,
                     retrieve_outputs_to_R = TRUE,
                     leave_model_in_dbms = TRUE,
                     sql.log.file = NULL)

----------Arguments----------

Argument: database
Database ODBC channel identifier returned from a call to RODM_open_dbms_connection.

Argument: data_table_name
Database table/view containing the training dataset.

Argument: case_id_column_name
Column in data_table_name that uniquely identifies each row (case).

Argument: model_name
ODM model name.

Argument: auto_data_prep
Whether or not ODM should invoke automatic data preparation for the build.

Argument: num_clusters
Setting that specifies the number of clusters for a clustering model.

Argument: block_growth
Setting that specifies the growth factor for the memory used to hold cluster data for k-Means.

Argument: conv_tolerance
Setting that specifies the convergence tolerance for k-Means.

Argument: euclidean_distance
Distance function (cosine, euclidean or fast_cosine). Note that the Usage section passes this argument as a logical value, with TRUE as the default.

Argument: iterations
Setting that specifies the number of iterations for k-Means.

Argument: min_pct_attr_support
Setting that specifies the minimum percent required for attributes in rules.

Argument: num_bins
Setting that specifies the number of histogram bins for k-Means.

Argument: variance_split
Setting that specifies the split criterion for k-Means.

Argument: retrieve_outputs_to_R
Flag controlling whether the output results are moved to the R environment.

Argument: leave_model_in_dbms
Flag controlling whether the model is deleted or left in the RDBMS.

Argument: sql.log.file
File to which the log of all SQL calls made by this function is appended.
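
As a rough illustration of how the arguments above fit together, the following sketch wraps a build call that overrides a few settings. The wrapper name `build_km`, the model name "KM_MODEL_TUNED", the log file name, and the numeric values are all assumptions chosen for illustration; they are not recommended defaults, and the values are only assumed to be acceptable to the database. The function needs an open RODM connection to actually run.

```r
# Hypothetical wrapper around RODM_create_kmeans_model(); argument names are
# taken from the Usage section, the values are illustrative assumptions.
build_km <- function(DB, table_name) {
  RODM_create_kmeans_model(
    database = DB,                      # ODBC channel from RODM_open_dbms_connection
    data_table_name = table_name,       # database table/view with the training data
    case_id_column_name = "ROW_ID",     # assumes the table has a ROW_ID column
    model_name = "KM_MODEL_TUNED",      # illustrative model name
    num_clusters = 5,
    iterations = 10,                    # illustrative value
    conv_tolerance = 0.01,              # illustrative value
    euclidean_distance = TRUE,
    retrieve_outputs_to_R = TRUE,
    leave_model_in_dbms = TRUE,
    sql.log.file = "km_build.sql")      # illustrative log file name
}
```

With a connection `DB` and a table "ds" in place, the call would look like `km <- build_km(DB, "ds")`.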

----------Details----------

The k-means algorithm (kmeans) uses a distance-based similarity measure and tessellates the data space, creating hierarchies. It handles large data volumes via summarization and supports sparse data. It is especially useful when the dataset has a moderate number of numerical attributes and the number of clusters is fixed in advance. The main parameter settings correspond to the choice of distance function (e.g., Euclidean or cosine), number of iterations, convergence tolerance and split criterion.
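
To make the role of the distance function concrete, here is a minimal base R sketch that runs without RODM or a database. It is not the ODM implementation; the helper names `euclid_dist` and `cosine_dist` and the toy points are assumptions made for illustration. It shows that the same points can be assigned to different centroids depending on the measure.

```r
# Hypothetical helpers (not part of RODM): distance from each row of x to one centroid.
euclid_dist <- function(x, centroid) {
  sqrt(rowSums(sweep(x, 2, centroid)^2))
}
cosine_dist <- function(x, centroid) {
  1 - as.vector(x %*% centroid) / (sqrt(rowSums(x^2)) * sqrt(sum(centroid^2)))
}

# Toy data: three points and two centroids.
pts <- rbind(c(3, 1), c(0.5, 0.5), c(2, 0))
centroids <- rbind(c(1, 0), c(3, 3))

# Distance matrices: one row per point, one column per centroid.
d_euc <- sapply(seq_len(nrow(centroids)), function(k) euclid_dist(pts, centroids[k, ]))
d_cos <- sapply(seq_len(nrow(centroids)), function(k) cosine_dist(pts, centroids[k, ]))

# Nearest centroid per point (index of the smallest distance in each row).
assign_euc <- max.col(-d_euc, ties.method = "first")
assign_cos <- max.col(-d_cos, ties.method = "first")
```

Euclidean distance favors the centroid that is closest in position, while cosine distance favors the centroid pointing in the most similar direction, so the first two points switch clusters between `assign_euc` and `assign_cos`.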

For more details on the algorithm implementation, parameter settings and characteristics of the ODM function itself, consult the following Oracle documents: ODM Concepts, ODM Developer's Guide, Oracle SQL Packages: Data Mining, and Oracle Database SQL Language Reference (Data Mining functions), listed in the references below.

----------Value----------

If retrieve_outputs_to_R is TRUE, returns a list with the following elements:
model.model_settings: Table of settings used to build the model.
model.model_attributes: Table of attributes used to build the model.
km.clusters: General per-cluster information.
km.taxonomy: Parent-child cluster relationship.
km.centroid: Per cluster-attribute centroid information.
km.histogram: Per cluster-attribute histogram information.
km.rule: Cluster rules.
km.leaf_cluster_count: Leaf clusters with support.
km.assignment: Assignment of training data to clusters (with probability).
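
A small sketch of how one might check which of the documented elements a returned list actually contains. The helper `check_km_outputs` and the mock object `km_mock` are hypothetical and stand in for a real result, which would come from RODM_create_kmeans_model() with retrieve_outputs_to_R = TRUE.

```r
# Element names as listed in the Value section.
expected <- c("model.model_settings", "model.model_attributes", "km.clusters",
              "km.taxonomy", "km.centroid", "km.histogram", "km.rule",
              "km.leaf_cluster_count", "km.assignment")

# Hypothetical helper: report which documented elements are present in a result list.
check_km_outputs <- function(km) {
  data.frame(element = expected,
             present = expected %in% names(km),
             stringsAsFactors = FALSE)
}

# Mock result with only two elements, used here in place of a real model object.
km_mock <- list(km.clusters = data.frame(), km.assignment = data.frame())
report <- check_km_outputs(km_mock)
```

Individual tables are then reached with the usual list syntax, for example `km$km.centroid`.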

----------Author(s)----------

Pablo Tamayo <pablo.tamayo@oracle.com>

Ari Mozes <ari.mozes@oracle.com>

----------References----------

Oracle Data Mining Concepts 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/toc.htm
Oracle Data Mining Application Developer's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28131/toc.htm
Oracle Data Mining Administrator's Guide 11g Release 1 (11.1)  http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28130/toc.htm
Oracle Database PL/SQL Packages and Types Reference 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_datmin.htm#ARPLS192
Oracle Database SQL Language Reference (Data Mining functions) 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions001.htm#SQLRF20030

----------See Also----------

RODM_apply_model

----------Examples----------

## Not run:
DB <- RODM_open_dbms_connection(dsn="orcl11g", uid= "rodm", pwd = "rodm")

### Clustering a 2D multi-Gaussian distribution of points into clusters

set.seed(seed=6218945)
X1 <- c(rnorm(100, mean = 2, sd = 1), rnorm(100, mean = 8, sd = 2), rnorm(100, mean = 5, sd = 0.6),
      rnorm(100, mean = 4, sd = 1), rnorm(100, mean = 10, sd = 1)) # Create and merge 5 Gaussian distributions
Y1 <- c(rnorm(100, mean = 1, sd = 2), rnorm(100, mean = 4, sd = 1.5), rnorm(100, mean = 6, sd = 0.5),
      rnorm(100, mean = 3, sd = 0.2), rnorm(100, mean = 2, sd = 1))
ds <- data.frame(X1, Y1)
n.rows <- nrow(ds)                          # Number of rows
ds <- cbind(ROW_ID = seq_len(n.rows), ds)   # Add row id to dataset
RODM_create_dbms_table(DB, "ds")            # Push the data frame to the database as table "ds"

km <- RODM_create_kmeans_model(
   database = DB,                  # database ODBC channel identifier
   data_table_name = "ds",         # database table containing the input dataset
   case_id_column_name = "ROW_ID", # case id to enable assignments during build
   num_clusters = 5)

km2 <- RODM_apply_model(
   database = DB,                  # database ODBC channel identifier
   data_table_name = "ds",         # database table containing the input dataset
   model_name = "KM_MODEL",
   supplemental_cols = c("X1","Y1"))

x1a <- km2$model.apply.results[, "X1"]
y1a <- km2$model.apply.results[, "Y1"]
clu <- km2$model.apply.results[, "CLUSTER_ID"]
c.numbers <- unique(as.numeric(clu))
c.assign <- match(clu, c.numbers)
color.map <- c("blue", "green", "red", "orange", "purple")
color <- color.map[c.assign]
nf <- layout(matrix(c(1, 2), 1, 2, byrow=TRUE), widths = c(1, 1), heights = 1, respect = FALSE)
plot(x1a, y1a, pch=20, col=1, xlab="X1", ylab="Y1", main="Original Data Points")
plot(x1a, y1a, pch=20, type = "n", xlab="X1", ylab="Y1", main="After kmeans clustering")
points(x1a, y1a, col = color, pch = 20)   # color each point by its assigned cluster
legend(5, -0.5, legend=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5"), pch = rep(20, 5),
   col = color.map, pt.bg = color.map, cex = 0.8, pt.cex=1, bty="n")

km       # look at the model details and cluster assignments

RODM_drop_model(DB, "KM_MODEL") # Drop the model
RODM_drop_dbms_table(DB, "ds") # Drop the database table

RODM_close_dbms_connection(DB)

## End(Not run)

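When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).

Notes:
Note 1: For the convenience of learners, this document was translated by LoveR, the 生物统计家园网 robot. It is intended only as a personal reference for learning R; 生物统计家园 reserves the copyright.
Note 2: Because the translation is automatic, some inaccuracies are unavoidable. Comparing the Chinese and English text carefully can help when learning R.
Note 3: If you find an inaccuracy, please reply in this thread and we will revise the text over time.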
