RODM_create_kmeans_model(RODM)
RODM_create_kmeans_model() belongs to the R package RODM
Create a Hierarchical k-means model
----------Description----------
This function creates a hierarchical k-means model.
----------Usage----------
RODM_create_kmeans_model(database,
                         data_table_name,
                         case_id_column_name = NULL,
                         model_name = "KM_MODEL",
                         auto_data_prep = TRUE,
                         num_clusters = NULL,
                         block_growth = NULL,
                         conv_tolerance = NULL,
                         euclidean_distance = TRUE,
                         iterations = NULL,
                         min_pct_attr_support = NULL,
                         num_bins = NULL,
                         variance_split = TRUE,
                         retrieve_outputs_to_R = TRUE,
                         leave_model_in_dbms = TRUE,
                         sql.log.file = NULL)
----------Arguments----------
Argument: database
Database ODBC channel identifier returned from a call to RODM_open_dbms_connection.
Argument: data_table_name
Database table or view containing the training dataset.
Argument: case_id_column_name
Column in data_table_name that uniquely identifies each row (case).
Argument: model_name
ODM model name.
Argument: auto_data_prep
Whether ODM should invoke automatic data preparation for the build.
Argument: num_clusters
Setting that specifies the number of clusters for the clustering model.
Argument: block_growth
Setting that specifies the growth factor for the memory that holds cluster data for k-means.
Argument: conv_tolerance
Setting that specifies the convergence tolerance for k-means.
Argument: euclidean_distance
Distance function (cosine, euclidean or fast_cosine). The default, TRUE, selects Euclidean distance.
Argument: iterations
Setting that specifies the number of iterations for k-means.
Argument: min_pct_attr_support
Setting that specifies the minimum percent required for attributes in rules.
Argument: num_bins
Setting that specifies the number of histogram bins for k-means.
Argument: variance_split
Setting that specifies the split criterion for k-means.
Argument: retrieve_outputs_to_R
Flag controlling whether the output results are moved to the R environment.
Argument: leave_model_in_dbms
Flag controlling whether the model is left in the RDBMS or deleted.
Argument: sql.log.file
File to which the log of all SQL calls made by this function is appended.
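As a rough sketch of how these arguments fit together, the settings can be collected in a named list and passed with do.call. The argument names come from the usage above; every value shown (table name, number of iterations, tolerance) is an illustrative assumption rather than a recommendation, and DB stands for a connection previously opened with RODM_open_dbms_connection.

```r
# Hypothetical settings for illustration only; the values are assumptions,
# not defaults or recommendations taken from the RODM documentation.
km_args <- list(
  data_table_name       = "ds",        # table or view with the training data
  case_id_column_name   = "ROW_ID",    # column that uniquely identifies a row
  model_name            = "KM_MODEL",
  num_clusters          = 5,
  iterations            = 10,
  conv_tolerance        = 0.01,
  euclidean_distance    = TRUE,
  retrieve_outputs_to_R = TRUE
)

# Build the model only when RODM is installed and an open connection
# named DB already exists in the workspace.
if (requireNamespace("RODM", quietly = TRUE) && exists("DB")) {
  km <- do.call(RODM::RODM_create_kmeans_model,
                c(list(database = DB), km_args))
}
```

Arguments left out of the list keep the defaults shown in the usage section.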
----------Details----------
The k-means algorithm uses a distance-based similarity measure and tessellates the data space, creating a hierarchy of clusters. It handles large data volumes via summarization and supports sparse data. It is especially useful when the dataset has a moderate number of numerical attributes and the number of clusters is known in advance. The main parameter settings are the choice of distance function (e.g., Euclidean or cosine), the number of iterations, the convergence tolerance and the split criterion.
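To make the role of the distance function concrete, the following sketch assigns points to their nearest centroid under either Euclidean or cosine distance. It is a plain in-memory R illustration, not the ODM implementation; the helper name assign_clusters and the sample points are made up for this example.

```r
# Assign each row of x to its nearest centroid under the chosen distance.
# x and centroids are numeric matrices with the same number of columns.
assign_clusters <- function(x, centroids, distance = c("euclidean", "cosine")) {
  distance <- match.arg(distance)
  d <- vapply(seq_len(nrow(centroids)), function(k) {
    ck <- centroids[k, ]
    if (distance == "euclidean") {
      sqrt(rowSums(sweep(x, 2, ck)^2))
    } else {
      # cosine distance = 1 - cosine similarity
      1 - as.vector(x %*% ck) / (sqrt(rowSums(x^2)) * sqrt(sum(ck^2)))
    }
  }, numeric(nrow(x)))
  d <- matrix(d, nrow = nrow(x))            # keep matrix shape for one-row input
  max.col(-d, ties.method = "first")        # column index of the smallest distance
}

pts <- rbind(c(1.2, 0.9), c(3, 0.2), c(9, 0.5))
ctr <- rbind(c(1, 1), c(10, 0))

eu <- assign_clusters(pts, ctr, "euclidean")
co <- assign_clusters(pts, ctr, "cosine")
print(rbind(euclidean = eu, cosine = co))
```

The second point is close to the first centroid in position but nearly parallel to the second, so the two distance functions assign it to different clusters.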
For more details on the algorithm implementation, parameter settings and characteristics of the ODM function itself, consult the following Oracle documents, listed in the references below: ODM Concepts, ODM Developer's Guide, Oracle PL/SQL Packages: Data Mining, and Oracle Database SQL Language Reference (Data Mining functions).
----------Value----------
If retrieve_outputs_to_R is TRUE, returns a list with the following elements:
model.model_settings: Table of settings used to build the model.
model.model_attributes: Table of attributes used to build the model.
km.clusters: General per-cluster information.
km.taxonomy: Parent-child cluster relationship.
km.centroid: Per cluster-attribute centroid information.
km.histogram: Per cluster-attribute histogram information.
km.rule: Cluster rules.
km.leaf_cluster_count: Leaf clusters with support.
km.assignment: Assignment of training data to clusters (with probability).
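A returned list can be checked against the element names above with a small helper like the one below. This is a sketch that assumes the elements come back as data frames; describe_km_outputs, km_mock and the column names inside the mock are hypothetical and exist only so the example runs without a database.

```r
# Element names as listed in the Value section.
expected <- c("model.model_settings", "model.model_attributes",
              "km.clusters", "km.taxonomy", "km.centroid", "km.histogram",
              "km.rule", "km.leaf_cluster_count", "km.assignment")

# Report which documented elements are present and how many rows each has.
describe_km_outputs <- function(km) {
  data.frame(
    element = expected,
    present = expected %in% names(km),
    rows    = vapply(expected, function(nm) {
      if (is.data.frame(km[[nm]])) nrow(km[[nm]]) else NA_integer_
    }, integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Mock result standing in for the output of RODM_create_kmeans_model.
km_mock <- list(
  km.clusters = data.frame(CLUSTER_ID = 1:3),
  km.taxonomy = data.frame(PARENT = c(1, 1), CHILD = c(2, 3))
)

out <- describe_km_outputs(km_mock)
print(out)
```

With a real model, calling describe_km_outputs(km) gives a quick overview before drilling into individual elements such as km$km.clusters.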
----------Author(s)----------
Pablo Tamayo <pablo.tamayo@oracle.com>
Ari Mozes <ari.mozes@oracle.com>
----------References----------
Oracle Data Mining Concepts 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/toc.htm
Oracle Data Mining Application Developer's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28131/toc.htm
Oracle Data Mining Administrator's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28130/toc.htm
Oracle Database PL/SQL Packages and Types Reference 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_datmin.htm#ARPLS192
Oracle Database SQL Language Reference (Data Mining functions) 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions001.htm#SQLRF20030
----------See Also----------
RODM_apply_model
----------Examples----------
## Not run:
DB <- RODM_open_dbms_connection(dsn="orcl11g", uid="rodm", pwd="rodm")

### Clustering a 2D multi-Gaussian distribution of points into clusters
set.seed(seed=6218945)
# Create and merge 5 Gaussian distributions
X1 <- c(rnorm(100, mean = 2, sd = 1), rnorm(100, mean = 8, sd = 2), rnorm(100, mean = 5, sd = 0.6),
        rnorm(100, mean = 4, sd = 1), rnorm(100, mean = 10, sd = 1))
Y1 <- c(rnorm(100, mean = 1, sd = 2), rnorm(100, mean = 4, sd = 1.5), rnorm(100, mean = 6, sd = 0.5),
        rnorm(100, mean = 3, sd = 0.2), rnorm(100, mean = 2, sd = 1))
ds <- data.frame(cbind(X1, Y1))
n.rows <- nrow(ds)                                                                        # Number of rows
row.id <- matrix(seq(1, n.rows), nrow=n.rows, ncol=1, dimnames=list(NULL, c("ROW_ID")))   # Row id
ds <- cbind(row.id, ds)                                                                   # Add row id to dataset
RODM_create_dbms_table(DB, "ds")                                                          # Push the table to the database

km <- RODM_create_kmeans_model(
   database = DB,                      # database ODBC channel identifier
   data_table_name = "ds",             # database table containing the input dataset
   case_id_column_name = "ROW_ID",     # case id to enable assignments during build
   num_clusters = 5)

km2 <- RODM_apply_model(
   database = DB,                      # database ODBC channel identifier
   data_table_name = "ds",             # database table containing the input dataset
   model_name = "KM_MODEL",
   supplemental_cols = c("X1","Y1"))

x1a <- km2$model.apply.results[, "X1"]
y1a <- km2$model.apply.results[, "Y1"]
clu <- km2$model.apply.results[, "CLUSTER_ID"]
c.numbers <- unique(as.numeric(clu))   # distinct cluster ids, in order of first appearance
c.assign <- match(clu, c.numbers)      # map each cluster id to an index 1..5
color.map <- c("blue", "green", "red", "orange", "purple")
color <- color.map[c.assign]

nf <- layout(matrix(c(1, 2), 1, 2, byrow=TRUE), widths = c(1, 1), heights = 1, respect = FALSE)
plot(x1a, y1a, pch=20, col=1, xlab="X1", ylab="Y1", main="Original Data Points")
plot(x1a, y1a, pch=20, type = "n", xlab="X1", ylab="Y1", main="After kmeans clustering")
points(x1a, y1a, col=color, pch=20)    # draw all points at once, colored by cluster
legend(5, -0.5, legend=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5"), pch = rep(20, 5),
       col = color.map, pt.bg = color.map, cex = 0.8, pt.cex=1, bty="n")

km                                     # look at the model details and cluster assignments

RODM_drop_model(DB, "KM_MODEL")        # Drop the model
RODM_drop_dbms_table(DB, "ds")         # Drop the database table
RODM_close_dbms_connection(DB)
## End(Not run)
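The color assignment in the example relies on match() to turn cluster identifiers into consecutive indices. The sketch below reproduces that step on made-up identifiers, on the assumption that the leaf clusters of a hierarchical model need not be numbered 1 to 5; the values in clu are invented for illustration.

```r
# Made-up leaf cluster ids, standing in for the CLUSTER_ID column.
clu <- c(7, 5, 9, 5, 8, 6, 7)

c.numbers <- unique(as.numeric(clu))   # ids in order of first appearance
c.assign  <- match(clu, c.numbers)     # consecutive index for each id

color.map <- c("blue", "green", "red", "orange", "purple")
color     <- color.map[c.assign]

# Lookup table showing which id ended up with which legend slot and color.
lookup <- data.frame(cluster_id = c.numbers,
                     legend     = paste("Cluster", seq_along(c.numbers)),
                     color      = color.map[seq_along(c.numbers)],
                     stringsAsFactors = FALSE)
print(lookup)
```

Because the indices follow order of first appearance, "Cluster 1" in the legend is simply the first identifier encountered, not necessarily the cluster with identifier 1. Using sort(unique(clu)) instead would tie the legend order to the identifier order.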