R XML package: readHTMLTable() help documentation

loveR · posted 2012-10-1 23:57:53

readHTMLTable(XML)
readHTMLTable() belongs to R package: XML

                                    Read data from one or more HTML tables

                                       Translator: LoveR, the 生物统计家园网 robot

----------Description----------

This function and its methods provide reasonably robust ways of extracting data from HTML tables in an HTML document. One can read all the tables in a document identified by file name or URL, or in a document that has already been parsed via htmlParse. Alternatively, one can specify an individual <table> node in the document.

The methods use heuristics to determine the header labels for the columns, the name of the table, and so on.
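As a minimal sketch of the behavior described above, the following reads every table from a small inline HTML string instead of a live URL. The HTML content, the id value "pop", and the choice of stringsAsFactors = FALSE are assumptions made for illustration only; they do not come from the original help page.

```r
library(XML)

# Hypothetical HTML content, used so the example does not depend on a web page.
html <- paste0(
  "<html><body>",
  "<table id='pop'>",
  "<tr><th>Country</th><th>Population</th></tr>",
  "<tr><td>Iceland</td><td>372,520</td></tr>",
  "<tr><td>Malta</td><td>519,562</td></tr>",
  "</table>",
  "</body></html>"
)

# Parse the string explicitly as text, then read all tables in the document.
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# The result is a list with one element per <table>. The element names are
# chosen heuristically by the package (an id attribute or caption may be used).
names(tables)

# The <th> cells in the first row are expected to become the column names.
tables[[1]]
```

Parsing with asText = TRUE avoids any ambiguity about whether the string is a file name or HTML content.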

----------Usage----------

readHTMLTable(doc, header = NA,
            colClasses = NULL, skip.rows = integer(), trim = TRUE,
            elFun = xmlValue, as.data.frame = TRUE, which = integer(),
            ...)

----------Arguments----------

Argument: doc
the HTML document. This can be a file name, a URL, an already parsed HTMLInternalDocument, an HTML node of class XMLInternalElementNode, or a character vector containing the HTML content to parse and process.

Argument: header
either a logical value indicating whether the table has column labels (e.g. in the first row or in a thead), or a character vector giving the names to use for the resulting columns. This can also be a logical vector, in which case the individual values are used in turn for the different tables. This allows the caller to control whether each table is processed as having column names. Alternatively, one can read a specific table via the which parameter and control how it is processed with a single scalar logical.
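The two forms of header can be sketched as follows, using a hypothetical table that contains only data rows. The HTML content and the column names "name" and "value" are assumptions made for this example.

```r
library(XML)

# Hypothetical table with no <th> cells: every row is data.
html <- paste0(
  "<table>",
  "<tr><td>alpha</td><td>1</td></tr>",
  "<tr><td>beta</td><td>2</td></tr>",
  "</table>"
)
doc <- htmlParse(html, asText = TRUE)
node <- getNodeSet(doc, "//table")[[1]]

# header = FALSE: treat every row as data; columns receive default names.
noHeader <- readHTMLTable(node, header = FALSE, stringsAsFactors = FALSE)

# header as a character vector: every row is still data, but the caller
# supplies the column names directly.
named <- readHTMLTable(node, header = c("name", "value"),
                       stringsAsFactors = FALSE)

names(noHeader)
names(named)
```

Supplying the names as a character vector is useful when the page has no usable header row.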

Argument: colClasses
either a list or a vector that gives the names of the data types for the different columns in the table, or alternatively a function used to convert the string values to the appropriate type. A NULL entry means that the corresponding column is dropped from the result. Note that currently the conversion occurs before the vectors are converted to a data frame (if as.data.frame is TRUE). As a result, to ensure that character vectors remain characters and do not become factors, use stringsAsFactors = FALSE. This typically applies only to an individual table, and so to the method applied to an XMLInternalElementNode object. In addition to the usual names of R data types ("integer", "numeric", "logical", "character", etc.), one can use "FormattedInteger", "FormattedNumber" and "Percent" to specify that the values are numbers, possibly with commas (,) separating groups of digits, or a number followed by a percent sign (%). This mechanism allows one to introduce new classes and specify them as targets in colClasses.
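A minimal sketch of colClasses, mixing a type name, the "FormattedInteger" target mentioned above, and a conversion function. The table content and the helper stripCommas are hypothetical and exist only for this illustration.

```r
library(XML)

# Hypothetical table whose numeric cells contain thousands separators.
html <- paste0(
  "<table>",
  "<tr><th>Country</th><th>Population</th><th>Area</th></tr>",
  "<tr><td>Iceland</td><td>372,520</td><td>103,000</td></tr>",
  "<tr><td>Malta</td><td>519,562</td><td>316</td></tr>",
  "</table>"
)
doc <- htmlParse(html, asText = TRUE)
node <- getNodeSet(doc, "//table")[[1]]

# Hypothetical helper: drop commas, then convert to numeric.
stripCommas <- function(x) as.numeric(gsub(",", "", x))

# A list is used so that type names and a function can be mixed.
tb <- readHTMLTable(node,
                    colClasses = list("character", "FormattedInteger", stripCommas),
                    stringsAsFactors = FALSE)
str(tb)
```

A list rather than a character vector is needed here because one of the entries is a function.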

Argument: skip.rows
an integer vector indicating which rows to ignore.

Argument: trim
a logical value indicating whether to remove leading and trailing white space from the content cells.

Argument: elFun
a function which, if specified, is called when converting each cell. Currently, only the node is passed to it. In the future, we might additionally pass the index of the column so that the function has some context, e.g. whether the value is a row label or a regular value, or the column type if the caller knows it.

Argument: as.data.frame
a logical value indicating whether to turn the resulting table(s) into data frames or leave them as matrices.

Argument: which
an integer vector identifying which tables to return from within the document. This applies to the method for the document, not to individual tables.
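The effect of which on the return value can be sketched with a hypothetical two-table document; the HTML content below is an assumption made for illustration.

```r
library(XML)

# Hypothetical document containing two tables.
html <- paste0(
  "<html><body>",
  "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>",
  "<table><tr><th>y</th><th>z</th></tr><tr><td>3</td><td>4</td></tr></table>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Without which, the document method returns a list with one element per table.
all_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
length(all_tables)

# With a single index in which, only that table is read and it is
# returned directly rather than wrapped in a list.
second <- readHTMLTable(doc, which = 2, stringsAsFactors = FALSE)
second
```

Selecting one table this way also makes a scalar header value unambiguous, since it applies to that table only.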

Argument: ...
currently, additional parameters that are passed on to as.data.frame if as.data.frame is TRUE. We may change this to use these as additional arguments for calls to elFun.

----------Value----------

If the document (either by name or as a parsed tree) is specified, the return value is a list of data frames or matrices. If a single HTML node is provided, the return value is a single data frame or matrix.

----------Author(s)----------

Duncan Temple Lang

----------References----------

----------See Also----------

htmlParse, getNodeSet, xpathSApply

----------Examples----------

library(XML)

# u = "http://en.wikipedia.org/wiki/World_population"
u = "http://en.wikipedia.org/wiki/List_of_countries_by_population"

tables = readHTMLTable(u)
names(tables)

tables[[2]]
  # Print the table. Note that the values are all characters,
  # not numbers. Also the column names have a preceding X since
  # R doesn't allow the variable names to start with digits.
tmp = tables[[2]]

  # We can transform this to get the rows to be years and the columns
  # to be population counts. We'll create a matrix.
vals = cbind(year = as.integer(gsub("X", "", names(tmp)[-1])),
            matrix(as.integer(gsub(",", "", as.character(unlist(tmp[-1])))),
                  ncol(tmp)-1, byrow = TRUE, dimnames = list(NULL, as.character(tmp[[1]]))))

# Let's just read the second table directly by itself.
doc = htmlParse(u)
tableNodes = getNodeSet(doc, "//table")
tb = readHTMLTable(tableNodes[[2]])

  # Let's try to adapt the values on the fly.
  # We'll create a function that turns a th/td node into a value.
tryAsInteger = function(node) {
               val = xmlValue(node)
               ans = as.integer(gsub(",", "", val))
               if(is.na(ans))
                  val
               else
                  ans
            }

tb = readHTMLTable(tableNodes[[2]], elFun = tryAsInteger)

tb = readHTMLTable(tableNodes[[2]], elFun = tryAsInteger,
                     colClasses = c("character", rep("integer", 9)))

zz = readHTMLTable("http://www.inflationdata.com/Inflation/Consumer_Price_Index/HistoricalCPI.aspx")
if(any(i <- sapply(zz, ncol) == 14)) {  # guard against the structure of the page changing.
  zz = zz[[which(i)[1]]]  # 4th table
      # convert columns to numeric.  Could use colClasses in the call to readHTMLTable()
  zz[-1] = lapply(zz[-1], function(x) as.numeric(gsub(".* ", "", as.character(x))))
  matplot(1:12, t(zz[-c(1, 14)]), type = "l")
}

# From Marsh Feldman on R-help
doc <- "http://www.nber.org/cycles/cyclesmain.html"
   # The main table is the second one because it's embedded in the page table.
table <- getNodeSet(htmlParse(doc), "//table")[[2]]
xt <- readHTMLTable(table,
                  header = c("peak","trough","contraction",
                           "expansion","trough2trough","peak2peak"),
                  colClasses = c("character","character","character",
                                 "character","character","character"),
                  trim = TRUE, stringsAsFactors = FALSE
               )

if(FALSE) {
# Here is a totally different way of reading tables from HTML documents.
# The data are formatted using a PRE and so can be read via read.table
u = "http://tidesonline.nos.noaa.gov/data_read.shtml?station_info=9414290+San+Francisco,+CA"
h = htmlParse(u)
p = getNodeSet(h, "//pre")
con = textConnection(xmlValue(p[[2]]))
tides = read.table(con)
}

# header as a logical vector
tt = readHTMLTable("http://www.sfgate.com/weather/rainfall.shtml",
                  header = c(FALSE, FALSE, TRUE, FALSE, FALSE))

# Read only the third table, treating it as having a header
tt = readHTMLTable("http://www.sfgate.com/weather/rainfall.shtml",
                     which = 3, header = TRUE)

if(require(RCurl)) {
  tt =  getURL("http://www.omegahat.org/RCurl/testPassword/table.html",  userpwd = "bob:duncantl")
  readHTMLTable(tt)
}

Please credit the source when reposting: 生物统计家园网 (http://www.biostatistic.net).

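Notes:
Note 1: For the convenience of learners, this document was translated by LoveR, the 生物统计家园网 robot. It is intended only as a personal reference for learning R; 生物统计家园 retains the copyright.
Note 2: Because the translation was produced automatically by a robot, inaccuracies are inevitable. Carefully comparing the Chinese and English text while reading can help in learning R.
Note 3: If you find an inaccuracy, please reply below this post and we will revise it over time.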
