RODM_create_svm_model(RODM)
RODM_create_svm_model() belongs to R package: RODM
Create an ODM Support Vector Machine model
----------Description----------
This function creates an ODM Support Vector Machine model.
----------Usage----------
RODM_create_svm_model(database,
                      data_table_name,
                      case_id_column_name = NULL,
                      target_column_name = NULL,
                      model_name = "SVM_MODEL",
                      mining_function = "classification",
                      auto_data_prep = TRUE,
                      class_weights = NULL,
                      active_learning = TRUE,
                      complexity_factor = NULL,
                      conv_tolerance = NULL,
                      epsilon = NULL,
                      kernel_cache_size = NULL,
                      kernel_function = NULL,
                      outlier_rate = NULL,
                      std_dev = NULL,
                      retrieve_outputs_to_R = TRUE,
                      leave_model_in_dbms = TRUE,
                      sql.log.file = NULL)
----------Arguments----------
Argument: database
Database ODBC channel identifier returned from a call to RODM_open_dbms_connection.
Argument: data_table_name
Database table/view containing the training dataset.
Argument: case_id_column_name
Column in data_table_name that uniquely identifies each row (case).
Argument: target_column_name
Target column name in data_table_name.
Argument: model_name
ODM model name.
Argument: mining_function
Type of mining function for the SVM model: "classification" (default), "regression" or "anomaly_detection".
Argument: auto_data_prep
Whether or not ODM should invoke automatic data preparation for the build.
Argument: class_weights
User-specified weights for the target classes.
Argument: active_learning
Whether or not ODM should use active learning.
Argument: complexity_factor
Setting that specifies the complexity factor for SVM. The default is NULL.
Argument: conv_tolerance
Setting that specifies the convergence tolerance for SVM. The default is 0.001.
Argument: epsilon
Regularization setting for regression, similar to the complexity factor. Epsilon specifies the allowable residuals, or noise, in the data. The default is NULL.
Argument: kernel_cache_size
Setting that specifies the Gaussian kernel cache size (bytes) for SVM. The default is 5e+07.
Argument: kernel_function
Setting that specifies the kernel function for SVM (Gaussian or Linear). The default is to let ODM decide based on the data.
Argument: outlier_rate
Setting that specifies the desired rate of outliers in the training data for anomaly detection with one-class SVM. The default is NULL.
Argument: std_dev
Setting that specifies the standard deviation for the SVM Gaussian kernel. The default is NULL (algorithm generated).
Argument: retrieve_outputs_to_R
Flag controlling whether the output results are moved to the R environment.
Argument: leave_model_in_dbms
Flag controlling whether the model is left in the RDBMS or deleted.
Argument: sql.log.file
File to which the log of all SQL calls made by this function is appended.
----------Details----------
Support Vector Machines (SVMs) for classification belong to a class of algorithms known as "kernel" methods (Cristianini and Shawe-Taylor 2000). Kernel methods rely on applying predefined functions (kernels) to the input data. The boundary between classes is a function of the predictor values. The key concept behind SVMs is that the points lying closest to the boundary, i.e., the support vectors, can be used to define the boundary. The goal of the SVM algorithm is to identify the support vectors and assign them weights that produce an optimal, largest-margin, class-separating boundary.
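As an illustration of how weighted support vectors define a boundary, the following is a minimal sketch in base R. It uses the textbook Gaussian kernel and a hand-made toy model; the function names, the toy support vectors, and the weights are assumptions for illustration only and do not reflect ODM's internal parametrization.

```r
# Textbook Gaussian kernel between two numeric vectors.
# sigma plays the role of the kernel standard deviation (illustrative only).
gaussian_kernel <- function(x, z, sigma = 1) {
  exp(-sum((x - z)^2) / (2 * sigma^2))
}

# Decision value f(x) = sum_i w_i * K(sv_i, x) + b.
# Each weight w_i is assumed to already carry the class sign of support vector i.
svm_decision <- function(x, support_vectors, weights, bias, sigma = 1) {
  k <- apply(support_vectors, 1, gaussian_kernel, z = x, sigma = sigma)
  sum(weights * k) + bias
}

# Hypothetical toy model: two support vectors, one per class.
sv <- rbind(c(0, 0), c(2, 2))
w  <- c(-1, 1)

d_near_second <- svm_decision(c(1.8, 2.1), sv, w, bias = 0)  # positive side
d_near_first  <- svm_decision(c(0.1, 0.2), sv, w, bias = 0)  # negative side
```

The sign of the decision value gives the predicted class; only the support vectors enter the sum, which is why they suffice to define the boundary.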
This function calls Oracle Data Mining's SVM implementation (for details see Milenova et al 2005), which supports classification, regression and anomaly detection (one-class classification) with linear or Gaussian kernels, and automatically and efficiently estimates the complexity factor (C) and standard deviation (sigma). It also supports sparse data, which makes it very efficient for problems such as text mining. SVM regression uses an epsilon-insensitive loss function and works particularly well for high-dimensional noisy data. The scalability and usability of this function are particularly useful when deploying predictive models in a production database data mining system.
The implementation also supports active learning, which forces the SVM algorithm to restrict learning to the most informative training examples rather than attempting to use the entire body of data. In most cases, the resulting models have predictive accuracy comparable to that of a standard (exact) SVM model. Active learning provides a significant improvement in both linear and Gaussian SVM models, whether for classification, regression, or anomaly detection. However, active learning is especially advantageous when using the Gaussian kernel, because nonlinear models can otherwise grow to be very large and can place considerable demands on memory and other system resources.
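The epsilon-insensitive loss mentioned above for SVM regression can be sketched in a few lines of base R. This is the standard textbook definition, not code taken from ODM; the function name and the sample values are assumptions for illustration.

```r
# Epsilon-insensitive loss: residuals no larger than epsilon cost nothing,
# larger residuals are penalized linearly by the amount they exceed epsilon.
eps_insensitive_loss <- function(actual, predicted, epsilon = 0.1) {
  pmax(0, abs(actual - predicted) - epsilon)
}

# Hypothetical values: the first residual (0.05) falls inside the epsilon tube.
actual    <- c(1.00, 2.00, 3.00)
predicted <- c(1.05, 2.50, 2.00)
loss <- eps_insensitive_loss(actual, predicted, epsilon = 0.1)
```

A larger epsilon tolerates more noise around the fitted function, which is why the epsilon argument is described as specifying the allowable residuals in the data.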
The SVM algorithm operates natively on numeric attributes. The function automatically "explodes" categorical data into a set of binary attributes, one per category value. For example, a character column for marital status with values married or single would be transformed into two numeric attributes: married and single. Each new attribute takes the value 1 (true) or 0 (false).
When there are missing values in columns with simple data types (not nested), SVM interprets them as missing at random. The algorithm automatically replaces missing categorical values with the mode and missing numerical values with the mean.
SVM requires the normalization of numeric input. Normalization places the values of numeric attributes on the same scale and prevents attributes with a large original scale from biasing the solution. Normalization also minimizes the likelihood of overflows and underflows. Furthermore, normalization brings the numerical attributes to the same scale (0,1) as the exploded categorical data.
The SVM algorithm automatically handles missing value treatment and the transformation of categorical data, but normalization and outlier detection must be handled manually unless automatic data preparation (auto_data_prep) is enabled, in which case an outlier-aware normalization is applied automatically (see Value).
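The preparation steps described above (missing value replacement, categorical "explosion" and normalization) can be mimicked locally in base R to see what they do to the data. This is only a sketch on a hypothetical toy data frame; it uses plain min-max scaling, which is an assumption and not necessarily the outlier-aware normalization that ODM applies.

```r
# Hypothetical toy data: one categorical and one numeric column,
# each with one missing value.
df <- data.frame(
  marital = c("married", "single", NA, "married"),
  age     = c(30, NA, 50, 40),
  stringsAsFactors = FALSE
)

# Replace the missing categorical value with the mode
# and the missing numeric value with the mean.
mode_value <- names(which.max(table(df$marital)))
df$marital[is.na(df$marital)] <- mode_value
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# "Explode" the categorical column into one 0/1 attribute per category value.
for (level in unique(df$marital)) {
  df[[paste0("marital_", level)]] <- as.numeric(df$marital == level)
}

# Min-max normalization puts the numeric attribute on the same 0 to 1 scale
# as the exploded binary attributes.
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
df$age_scaled <- min_max(df$age)

prepared <- df[, c("marital_married", "marital_single", "age_scaled")]
```

After these steps every column in prepared is numeric and lies between 0 and 1, which is the form the SVM algorithm works on.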
For more details on the algorithm implementation see Milenova et al 2005. For details on the parameters and characteristics of the ODM function itself, consult the ODM Concepts, the ODM Developer's Guide and the Oracle SQL Packages: Data Mining documents in the references below.
----------Value----------
If retrieve_outputs_to_R is TRUE, returns a list with the following elements:
model.model_settings: Table of settings used to build the model.
model.model_attributes: Table of attributes used to build the model.
If the model that was built uses a linear kernel, then the following is additionally returned:
svm.coefficients: The coefficients of the SVM model, one for each input attribute. If auto_data_prep is TRUE, these coefficients are in the transformed space (after automatic outlier-aware normalization is applied).
----------Author(s)----------
Pablo Tamayo <pablo.tamayo@oracle.com>
Ari Mozes <ari.mozes@oracle.com>
----------References----------
B. L. Milenova, J. S. Yarmus, and M. M. Campos. SVM in Oracle Database 10g: removing the barriers to widespread adoption of support vector machines. In Proceedings of the 31st International Conference on Very Large Data Bases (Trondheim, Norway, August 30 - September 02, 2005), pp. 1152-1163. ISBN: 1-59593-154-6.
B. L. Milenova and M. M. Campos. Mining high-dimensional data for information fusion: a database-centric approach. 8th International Conference on Information Fusion, 25-28 July 2005. ISBN: 0-7803-9286-8.
John Shawe-Taylor and Nello Cristianini. Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.
Oracle Data Mining Concepts 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/toc.htm
Oracle Data Mining Application Developer's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28131/toc.htm
Oracle Data Mining Administrator's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28130/toc.htm
Oracle Database PL/SQL Packages and Types Reference 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_datmin.htm#ARPLS192
Oracle Database SQL Language Reference (Data Mining functions) 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions001.htm#SQLRF20030
----------See Also----------
RODM_apply_model
----------Examples----------
## Not run:
DB <- RODM_open_dbms_connection(dsn="orcl11g", uid="rodm", pwd="rodm")
# Separating three Gaussian classes in 2D
X1 <- c(rnorm(200, mean = 2, sd = 1), rnorm(300, mean = 8, sd = 2), rnorm(300, mean = 5, sd = 0.6))
Y1 <- c(rnorm(200, mean = 1, sd = 2), rnorm(300, mean = 4, sd = 1.5), rnorm(300, mean = 6, sd = 0.5))
target <- c(rep(1, 200), rep(2, 300), rep(3, 300))
ds <- data.frame(cbind(X1, Y1, target))
n.rows <- length(ds[,1])                                # Number of rows
set.seed(seed=6218945)
random_sample <- sample(1:n.rows, ceiling(n.rows/2))    # Split dataset randomly into train/test subsets
ds_train <- ds[random_sample,]                          # Training set
ds_test <- ds[setdiff(1:n.rows, random_sample),]        # Test set
RODM_create_dbms_table(DB, "ds_train")                  # Push the training table to the database
RODM_create_dbms_table(DB, "ds_test")                   # Push the testing table to the database
svm <- RODM_create_svm_model(database = DB,             # Create ODM SVM classification model
                             data_table_name = "ds_train",
                             target_column_name = "target")
svm2 <- RODM_apply_model(database = DB,                 # Predict test data
                         data_table_name = "ds_test",
                         model_name = "SVM_MODEL",
                         supplemental_cols = c("X1","Y1","TARGET"))
color.map <- c("blue", "green", "red")
color <- color.map[svm2$model.apply.results[, "TARGET"]]
plot(svm2$model.apply.results[, "X1"],
     svm2$model.apply.results[, "Y1"],
     pch=20, col=color, ylim=c(-2,10), xlab="X1", ylab="Y1",
     main="Test Set")
actual <- svm2$model.apply.results[, "TARGET"]
predicted <- svm2$model.apply.results[, "PREDICTION"]
# Overplot misclassified test points in black
for (i in 1:length(ds_test[,1])) {
  if (actual[i] != predicted[i])
    points(x=svm2$model.apply.results[i, "X1"],
           y=svm2$model.apply.results[i, "Y1"],
           col = "black", pch=20)
}
legend(6, 1.5, legend=c("Class 1 (correct)", "Class 2 (correct)", "Class 3 (correct)", "Error"),
       pch = rep(20, 4), col = c(color.map, "black"), pt.bg = c(color.map, "black"), cex = 1.20,
       pt.cex=1.5, bty="n")
RODM_drop_model(DB, "SVM_MODEL")        # Drop the model
RODM_drop_dbms_table(DB, "ds_train")    # Drop the database table
RODM_drop_dbms_table(DB, "ds_test")     # Drop the database table
## End(Not run)
### SVM Classification
# Predicting survival in the sinking of the Titanic based on passenger's sex, age, class, etc.
## Not run:
data(titanic3, package="PASWR")                                             # Load survival data from Titanic
ds <- titanic3[,c("pclass", "survived", "sex", "age", "fare", "embarked")]  # Select subset of attributes
ds[,"survived"] <- ifelse(ds[,"survived"] == 1, "Yes", "No")                # Rename target values
n.rows <- length(ds[,1])                                                    # Number of rows
random_sample <- sample(1:n.rows, ceiling(n.rows/2))    # Split dataset randomly into train/test subsets
titanic_train <- ds[random_sample,]                     # Training set
titanic_test <- ds[setdiff(1:n.rows, random_sample),]   # Test set
RODM_create_dbms_table(DB, "titanic_train")             # Push the training table to the database
RODM_create_dbms_table(DB, "titanic_test")              # Push the testing table to the database
svm <- RODM_create_svm_model(database = DB,             # Create ODM SVM classification model
                             data_table_name = "titanic_train",
                             target_column_name = "survived",
                             model_name = "SVM_MODEL",
                             mining_function = "classification")
svm2 <- RODM_apply_model(database = DB,                 # Predict test data
                         data_table_name = "titanic_test",
                         model_name = "SVM_MODEL",
                         supplemental_cols = "survived")
print(svm2$model.apply.results[1:10,])                  # Print example of prediction results
actual <- svm2$model.apply.results[, "SURVIVED"]
predicted <- svm2$model.apply.results[, "PREDICTION"]
# as.numeric() replaces as.real(), which is no longer available in current R
probs <- as.numeric(as.character(svm2$model.apply.results[, "'Yes'"]))
table(actual, predicted, dnn = c("Actual", "Predicted"))                    # Confusion matrix
library(verification)
perf.auc <- roc.area(ifelse(actual == "Yes", 1, 0), probs)                  # Compute ROC and plot
auc.roc <- signif(perf.auc$A, digits=3)
auc.roc.p <- signif(perf.auc$p.value, digits=3)
roc.plot(ifelse(actual == "Yes", 1, 0), probs, binormal=TRUE, plot="both", xlab="False Positive Rate",
         ylab="True Positive Rate", main= "Titanic survival ODM SVM model ROC Curve")
text(0.7, 0.4, labels= paste("AUC ROC:", signif(perf.auc$A, digits=3)))
text(0.7, 0.3, labels= paste("p-value:", signif(perf.auc$p.value, digits=3)))
RODM_drop_model(DB, "SVM_MODEL")             # Drop the model
RODM_drop_dbms_table(DB, "titanic_train")    # Drop the database table
RODM_drop_dbms_table(DB, "titanic_test")     # Drop the database table
## End(Not run)
### SVM Regression
# Approximating a one-dimensional non-linear function
## Not run:
X1 <- 10 * runif(500) - 5
Y1 <- X1*cos(X1) + 2*runif(500)
ds <- data.frame(cbind(X1, Y1))
RODM_create_dbms_table(DB, "ds")                # Push the training table to the database
svm <- RODM_create_svm_model(database = DB,     # Create ODM SVM regression model
                             data_table_name = "ds",
                             target_column_name = "Y1",
                             mining_function = "regression")
svm2 <- RODM_apply_model(database = DB,         # Predict training data
                         data_table_name = "ds",
                         model_name = "SVM_MODEL",
                         supplemental_cols = "X1")
plot(X1, Y1, pch=20, col="blue")
points(x=svm2$model.apply.results[, "X1"],
       svm2$model.apply.results[, "PREDICTION"], pch=20, col="red")
legend(-4, -1.5, legend = c("actual", "SVM regression"), pch = c(20, 20), col = c("blue", "red"),
       pt.bg = c("blue", "red"), cex = 1.20, pt.cex=1.5, bty="n")
RODM_drop_model(DB, "SVM_MODEL")    # Drop the model
RODM_drop_dbms_table(DB, "ds")      # Drop the database table
## End(Not run)
### Anomaly detection
# Finding outliers in a two-dimensional discrete distribution of points
## Not run:
X1 <- c(rnorm(200, mean = 2, sd = 1), rnorm(300, mean = 8, sd = 2))
Y1 <- c(rnorm(200, mean = 2, sd = 1.5), rnorm(300, mean = 8, sd = 1.5))
ds <- data.frame(cbind(X1, Y1))
RODM_create_dbms_table(DB, "ds")                # Push the table to the database
svm <- RODM_create_svm_model(database = DB,     # Create ODM SVM anomaly detection model
                             data_table_name = "ds",
                             target_column_name = NULL,
                             model_name = "SVM_MODEL",
                             mining_function = "anomaly_detection")
svm2 <- RODM_apply_model(database = DB,         # Predict training data
                         data_table_name = "ds",
                         model_name = "SVM_MODEL",
                         supplemental_cols = c("X1","Y1"))
plot(X1, Y1, pch=20, col="white")
# A prediction of 1 marks a typical point; anything else is drawn as an anomaly
col <- ifelse(svm2$model.apply.results[, "PREDICTION"] == 1, "green", "red")
for (i in 1:500) points(x=svm2$model.apply.results[i, "X1"],
                        y=svm2$model.apply.results[i, "Y1"],
                        col = col[i], pch=20)
legend(8, 2, legend = c("typical", "anomaly"), pch = c(20, 20), col = c("green", "red"),
       pt.bg = c("green", "red"), cex = 1.20, pt.cex=1.5, bty="n")
RODM_drop_model(DB, "SVM_MODEL")    # Drop the model
RODM_drop_dbms_table(DB, "ds")      # Drop the database table
RODM_close_dbms_connection(DB)
## End(Not run)
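The evaluation step of the classification example (confusion matrix and ROC area) can be tried without a database connection. The sketch below uses hypothetical labels and probabilities in place of svm2$model.apply.results, and computes the area under the ROC curve from ranks (the Mann-Whitney statistic) in base R rather than with roc.area from the verification package; the helper name auc_rank and the 0.5 cutoff are assumptions for illustration.

```r
# Hypothetical labels and "Yes" probabilities standing in for the
# SURVIVED and 'Yes' columns of svm2$model.apply.results.
actual    <- c("Yes", "No", "Yes", "No", "Yes", "No")
probs     <- c(0.9, 0.2, 0.6, 0.7, 0.8, 0.1)
predicted <- ifelse(probs >= 0.5, "Yes", "No")

# Confusion matrix and overall accuracy (both factors contain "No" and "Yes",
# so the table is square and its diagonal holds the correct predictions).
confusion <- table(actual, predicted, dnn = c("Actual", "Predicted"))
accuracy  <- sum(diag(confusion)) / sum(confusion)

# AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
# pairs in which the positive case gets the higher score.
# rank() assigns average ranks to ties, so tied pairs count as one half.
auc_rank <- function(is_positive, score) {
  r     <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_rank(actual == "Yes", probs)
```

With real model output, actual and probs would be taken from the apply results exactly as in the classification example, and the rank-based value should agree with the area reported by roc.area.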