RODM_create_glm_model(RODM)
RODM_create_glm_model() belongs to R package: RODM
Create an ODM Generalized Linear Model
Translator: 生物统计家园网 (biostatistic.net) robot LoveR
----------Description----------
This function creates an ODM generalized linear model.
----------Usage----------
RODM_create_glm_model(database,
data_table_name,
case_id_column_name = NULL,
target_column_name,
model_name = "GLM_MODEL",
mining_function = "classification",
auto_data_prep = TRUE,
class_weights = NULL,
weight_column_name = NULL,
conf_level = NULL,
reference_class_name = NULL,
missing_value_treatment = NULL,
ridge_regression = NULL,
ridge_value = NULL,
vif_for_ridge = NULL,
diagnostics_table_name = NULL,
retrieve_outputs_to_R = TRUE,
leave_model_in_dbms = TRUE,
sql.log.file = NULL)
----------Arguments----------
Argument: database
Database ODBC channel identifier returned from a call to RODM_open_dbms_connection.
Argument: data_table_name
Database table/view containing the training dataset.
Argument: case_id_column_name
Column in data_table_name that uniquely identifies each row (case).
Argument: target_column_name
Target column name in data_table_name.
Argument: model_name
ODM model name.
Argument: mining_function
Type of mining function for the GLM model: "classification" (default) or "regression".
Argument: auto_data_prep
Whether ODM should invoke automatic data preparation for the build.
Argument: class_weights
User-specified weights for the target classes.
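The classification example on this page passes class weights as a data frame with target_value and class_weight columns. The following is a minimal sketch of building such a table in base R. The target vector is made up, and the inverse-frequency weighting scheme is an assumption chosen for illustration, not something ODM prescribes.

```r
# Hypothetical target values; in practice this would be the target column
# of the training data.
survived <- c("Yes", "No", "No", "No", "Yes", "No", "No", "No")

# Class frequencies (table() orders the classes alphabetically)
freq <- table(survived)

# Inverse-frequency weights: the rarer class gets the larger weight.
# This scheme is an illustrative assumption, not an ODM default.
weights <- data.frame(
  target_value = names(freq),
  class_weight = as.numeric(max(freq) / freq),
  stringsAsFactors = FALSE
)
print(weights)
```

A data frame shaped like this could then be supplied as class_weights, as the Titanic example does with hand-picked values.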
Argument: weight_column_name
Name of a column in data_table_name that contains a weighting factor for the rows. Row weights can be used as a compact representation of repeated rows, and can also be used to emphasize certain rows during model construction.
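The idea of row weights as a compact representation of repeated rows can be checked outside the database. The sketch below uses base R's lm(), not ODM, and made-up data: fitting on the expanded rows and fitting on the compact rows with a weight column should give the same coefficient estimates.

```r
# Compact representation: each row carries a repeat count in column w.
compact <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 3, 5, 4),
  w = c(4, 1, 1, 1)
)

# Expanded representation: the first row is physically repeated 4 times.
expanded <- compact[rep(seq_len(nrow(compact)), times = compact$w), c("x", "y")]

fit_expanded <- lm(y ~ x, data = expanded)               # repeated rows
fit_weighted <- lm(y ~ x, data = compact, weights = w)   # row weights

# The coefficient estimates agree up to numerical tolerance.
print(coef(fit_expanded))
print(coef(fit_weighted))
```

Only the point estimates are compared here. In base R the standard errors generally differ between the two fits because the residual degrees of freedom are computed from the number of rows.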
Argument: conf_level
The confidence level for coefficient confidence intervals.
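To illustrate what a confidence level for coefficient intervals controls, here is a small base R sketch using lm() and confint() on made-up data. It is not ODM code; it only shows that a higher level yields wider intervals around the same estimates.

```r
# Made-up data with a roughly linear relationship
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

fit <- lm(y ~ x)

# Coefficient confidence intervals at two confidence levels
ci95 <- confint(fit, level = 0.95)
ci99 <- confint(fit, level = 0.99)

print(ci95)
print(ci99)   # same centers, wider intervals
```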
Argument: reference_class_name
The target value to be used as the reference value in a logistic regression model. Probabilities will be produced for the other (non-reference) class. By default, the algorithm chooses the value with the highest prevalence (the most cases) as the reference class.
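The role of the reference class can be sketched with base R's glm(), which follows a similar convention: the first factor level is treated as the reference ("failure") and fitted probabilities refer to the other class. The data below are made up, and picking the most prevalent class mirrors the default described above; this is an analogy, not the ODM implementation.

```r
# Made-up binary target with "No" as the most prevalent class
d <- data.frame(
  x = 1:8,
  survived = c("No", "No", "Yes", "No", "No", "Yes", "No", "Yes")
)

# Choose the most prevalent value as the reference class
ref <- names(which.max(table(d$survived)))
d$survived <- relevel(factor(d$survived), ref = ref)

# Logistic regression; probabilities are for the non-reference class ("Yes")
fit <- glm(survived ~ x, data = d, family = binomial)
p_yes <- predict(fit, type = "response")
print(round(p_yes, 3))
```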
Argument: missing_value_treatment
How to handle missing values. Either replace them with the mean (numerical attributes) or mode (categorical attributes) by setting ODMS_MISSING_VALUE_MEAN_MODE, or delete the entire row when a missing value is present by setting ODMS_MISSING_VALUE_DELETE_ROW.
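The two treatments can be pictured with a small base R sketch on made-up data. The helper stat_mode() is hypothetical and written here only for illustration; the code mimics the described behavior locally and does not call ODM.

```r
# Made-up data with missing values in a numeric and a categorical column
d <- data.frame(
  age = c(22, NA, 35, 41, NA),
  embarked = c("S", "C", NA, "S", "S"),
  stringsAsFactors = FALSE
)

# Treatment 1: delete every row that contains a missing value
d_deleted <- d[complete.cases(d), ]

# Treatment 2: replace by the mean (numeric) or the mode (categorical)
stat_mode <- function(v) names(which.max(table(v)))   # hypothetical helper
d_imputed <- d
d_imputed$age[is.na(d$age)] <- mean(d$age, na.rm = TRUE)
d_imputed$embarked[is.na(d$embarked)] <- stat_mode(d$embarked)

print(d_deleted)
print(d_imputed)
```

Row deletion keeps only fully observed cases, while mean/mode replacement keeps every row at the cost of filling in typical values.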
Argument: ridge_regression
Whether or not ridge regression will be enabled. By default, the algorithm determines whether or not to use ridge. You can explicitly enable ridge by setting GLMS_RIDGE_REGRESSION to GLMS_RIDGE_REG_ENABLE. Ridge applies to both regression and classification mining functions. When ridge is enabled, no prediction bounds are produced by the PREDICTION_BOUNDS SQL operator.
Argument: ridge_value
The value for the ridge parameter used by the algorithm. This setting is only used when you explicitly enable ridge regression by setting GLMS_RIDGE_REGRESSION to GLMS_RIDGE_REG_ENABLE. If ridge regression is enabled internally by the algorithm, the ridge parameter is determined by the algorithm.
Argument: vif_for_ridge
(Linear regression only) Whether or not to produce Variance Inflation Factor (VIF) statistics when ridge is being used. By default, VIF is not produced when ridge is enabled. When you explicitly enable ridge regression by setting GLMS_RIDGE_REGRESSION to GLMS_RIDGE_REG_ENABLE, you can request VIF statistics by setting GLMS_VIF_FOR_RIDGE to GLMS_VIF_RIDGE_ENABLE; the algorithm will produce VIF if enough system resources are available.
Argument: diagnostics_table_name
Name of a database table, which must not already exist, to hold per-row diagnostic information. Requires case_id_column_name to be specified. The table remains in the database and must be dropped explicitly when no longer needed.
Argument: retrieve_outputs_to_R
Flag controlling whether the output results are moved to the R environment.
Argument: leave_model_in_dbms
Flag controlling whether the model is deleted or left in the RDBMS.
Argument: sql.log.file
File to which the log of all SQL calls made by this function is appended.
----------Details----------
Generalized linear models (GLM) implement logistic regression for classification of binary targets and linear regression for continuous targets. GLM classification supports confidence bounds for prediction probabilities. GLM regression supports confidence bounds for predictions. The supported models are linear regression and logistic regression with the logit link and binomial variance functions. Ridge regression is a technique that compensates for multicollinearity. Oracle Data Mining supports ridge regression for both regression and classification mining functions. The algorithm automatically uses ridge if it detects singularity (exact multicollinearity) in the data.
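Why ridge helps under exact multicollinearity can be seen in a few lines of base R. The sketch below uses made-up data and the textbook closed-form ridge estimator; the lambda value is an arbitrary illustrative choice, and nothing here reflects how ODM implements or tunes ridge internally.

```r
# Two predictors that are exactly collinear: x2 is a multiple of x1.
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- 2 * x1
y  <- c(3.1, 4.9, 7.2, 9.0, 10.8, 13.1)

X   <- cbind(x1, x2)    # no intercept, to keep the sketch short
XtX <- crossprod(X)     # singular because of the collinearity

# Ordinary least squares: the normal equations have no unique solution,
# so solve() raises an error, which is caught and mapped to NULL.
ols <- tryCatch(solve(XtX, crossprod(X, y)), error = function(e) NULL)

# Ridge adds lambda to the diagonal, which makes the system solvable.
lambda <- 0.1           # illustrative value, not an ODM default
ridge  <- solve(XtX + lambda * diag(ncol(X)), crossprod(X, y))

print(is.null(ols))     # TRUE: the unpenalized system could not be solved
print(ridge)
```

The penalty selects one well-defined coefficient vector among the infinitely many that fit the collinear predictors equally well.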
For more details on the algorithm implementation, parameter settings, and characteristics of the ODM function itself, consult the following Oracle documents: ODM Concepts, ODM Developer's Guide, Oracle SQL Packages: Data Mining, and Oracle Database SQL Language Reference (Data Mining functions), listed in the references below.
----------Value----------
If retrieve_outputs_to_R is TRUE, returns a list with the following elements:
model.model_settings: Table of settings used to build the model.
model.model_attributes: Table of attributes used to build the model.
glm.globals: Global details for the GLM model.
glm.coefficients: The coefficients of the GLM model, along with more per-attribute information.
----------Author(s)----------
Pablo Tamayo <pablo.tamayo@oracle.com>
Ari Mozes <ari.mozes@oracle.com>
----------References----------
Dobson, Annette J. and Barnett, Adrian G. (2008) An Introduction to Generalized Linear Models, Third Edition. Texts in Statistical Science, 77. Chapman & Hall/CRC Press, Boca Raton, FL.
B. L. Milenova, J. S. Yarmus, and M. M. Campos. SVM in Oracle Database 10g: removing the barriers to widespread adoption of support vector machines. In Proceedings of the 31st International Conference on Very Large Data Bases (Trondheim, Norway, August 30 - September 02, 2005). pp. 1152-1163, ISBN: 1-59593-154-6.
Milenova, B.L. and Campos, M.M. Mining high-dimensional data for information fusion: a database-centric approach. 8th International Conference on Information Fusion, 2005. Publication Date: 25-28 July 2005. ISBN: 0-7803-9286-8.
John Shawe-Taylor & Nello Cristianini. Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.
Oracle Data Mining Concepts 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/toc.htm
Oracle Data Mining Application Developer's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28131/toc.htm
Oracle Data Mining Administrator's Guide 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28130/toc.htm
Oracle Database PL/SQL Packages and Types Reference 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_datmin.htm#ARPLS192
Oracle Database SQL Language Reference (Data Mining functions) 11g Release 1 (11.1) http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions001.htm#SQLRF20030
----------See Also----------
RODM_apply_model
----------Examples----------
## Not run:
library(RODM)
DB <- RODM_open_dbms_connection(dsn="orcl11g", uid="rodm", pwd="rodm")
### GLM Classification
# Predict survival in the sinking of the Titanic based on passenger's sex, age, class, etc.
data(titanic3, package="PASWR")   # Load survival data from Titanic
ds <- titanic3[,c("pclass", "survived", "sex", "age", "fare", "embarked")]   # Select subset of attributes
ds[,"survived"] <- ifelse(ds[,"survived"] == 1, "Yes", "No")   # Rename target values
n.rows <- nrow(ds)   # Number of rows
random_sample <- sample(1:n.rows, ceiling(n.rows/2))   # Split dataset randomly in train/test subsets
titanic_train <- ds[random_sample,]   # Training set
train.rows <- nrow(titanic_train)   # Number of rows
row.id <- matrix(seq(1, train.rows), nrow=train.rows, ncol=1, dimnames=list(NULL, c("ROW_ID")))   # Row id
titanic_train <- cbind(row.id, titanic_train)   # Add row id to dataset
titanic_test <- ds[setdiff(1:n.rows, random_sample),]   # Test set
RODM_create_dbms_table(DB, "titanic_train")   # Push the training table to the database
RODM_create_dbms_table(DB, "titanic_test")   # Push the testing table to the database
# Weight one class more heavily than the other
weights <- data.frame(
    target_value = c("Yes", "No"),
    class_weight = c(1, 10))
glm <- RODM_create_glm_model(database = DB,   # Create ODM GLM classification model
                             data_table_name = "titanic_train",
                             case_id_column_name = "ROW_ID",
                             target_column_name = "survived",
                             model_name = "GLM_MODEL",
                             class_weights = weights,
                             diagnostics_table_name = "GLM_DIAG",
                             mining_function = "classification")
glm2 <- RODM_apply_model(database = DB,   # Predict test data
                         data_table_name = "titanic_test",
                         model_name = "GLM_MODEL",
                         supplemental_cols = "survived")
print(glm2$model.apply.results[1:10,])   # Print example of prediction results
actual <- glm2$model.apply.results[, "SURVIVED"]
predicted <- glm2$model.apply.results[, "PREDICTION"]
probs <- as.numeric(as.character(glm2$model.apply.results[, "'Yes'"]))   # Probability of class "Yes"
table(actual, predicted, dnn = c("Actual", "Predicted"))   # Confusion matrix
library(verification)
perf.auc <- roc.area(ifelse(actual == "Yes", 1, 0), probs)   # Compute ROC and plot
auc.roc <- signif(perf.auc$A, digits=3)
auc.roc.p <- signif(perf.auc$p.value, digits=3)
roc.plot(ifelse(actual == "Yes", 1, 0), probs, binormal=TRUE, plot="both", xlab="False Positive Rate",
         ylab="True Positive Rate", main="Titanic survival ODM GLM model ROC Curve")
text(0.7, 0.4, labels=paste("AUC ROC:", auc.roc))
text(0.7, 0.3, labels=paste("p-value:", auc.roc.p))
glm   # Look at the model details
# Access and look at the per-row diagnostics from model training
diaginfo <- sqlQuery(DB, query = "SELECT * FROM GLM_DIAG")
diaginfo
RODM_drop_model(DB, "GLM_MODEL")   # Drop the model
RODM_drop_dbms_table(DB, "GLM_DIAG")   # Drop the diagnostics table
RODM_drop_dbms_table(DB, "titanic_train")   # Drop the training table
RODM_drop_dbms_table(DB, "titanic_test")   # Drop the testing table
## End(Not run)
### GLM Regression
## Not run:
x1 <- 2 * runif(200)
noise <- 3 * runif(200) - 1.5
y1 <- 2 + 2*x1 + x1*x1 + noise
dataset <- data.frame(x1, y1)
names(dataset) <- c("X1", "Y1")
RODM_create_dbms_table(DB, "dataset")   # Push the training table to the database
glm <- RODM_create_glm_model(database = DB,   # Create ODM GLM model
                             data_table_name = "dataset",
                             target_column_name = "Y1",
                             mining_function = "regression")
glm2 <- RODM_apply_model(database = DB,   # Predict training data
                         data_table_name = "dataset",
                         model_name = "GLM_MODEL",
                         supplemental_cols = "X1")
dev.new(height=8, width=12)   # Open a new graphics device (portable across platforms)
plot(x1, y1, pch=20, col="blue")
points(x=glm2$model.apply.results[, "X1"],
       y=glm2$model.apply.results[, "PREDICTION"], pch=20, col="red")
legend(0.5, 9, legend = c("actual", "GLM regression"), pch = c(20, 20),
       col = c("blue", "red"),
       pt.bg = c("blue", "red"), cex = 1.20, pt.cex=1.5, bty="n")
RODM_drop_model(DB, "GLM_MODEL")   # Drop the model
RODM_drop_dbms_table(DB, "dataset")   # Drop the database table
RODM_close_dbms_connection(DB)
## End(Not run)
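The classification example relies on the verification package for the ROC area. As a rough cross-check that needs no extra package and no database, the confusion matrix and AUC can be computed in base R with the rank-sum (Mann-Whitney) formula. The actual and probs vectors below are made up; they stand in for the SURVIVED and 'Yes' columns of model.apply.results, and the 0.5 cutoff is an assumption.

```r
# Made-up stand-ins for the apply results
actual <- c("Yes", "No", "Yes", "No", "No", "Yes")
probs  <- c(0.9, 0.2, 0.6, 0.7, 0.1, 0.8)

# Confusion matrix at an assumed 0.5 probability cutoff
predicted <- ifelse(probs >= 0.5, "Yes", "No")
conf <- table(actual, predicted, dnn = c("Actual", "Predicted"))
print(conf)

# AUC via the rank-sum formula: the fraction of (positive, negative)
# pairs in which the positive case has the higher score.
pos   <- actual == "Yes"
r     <- rank(probs)
n_pos <- sum(pos)
n_neg <- sum(!pos)
auc   <- (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```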