scrape(scrapeR)
scrape() belongs to the R package scrapeR
A Tool For Scraping and Parsing HTML and XML Documents From the Web
Description
This function assists the user with retrieving HTML and XML files, parsing their contents, and diagnosing potential errors that may have occurred along the way.
Usage
scrape(url=NULL,object=NULL,file=NULL,chunkSize=50,maxSleep=5,
userAgent=unlist(options("HTTPUserAgent")),follow=FALSE,
headers=TRUE,parse=TRUE,isXML=FALSE,.encoding=integer(),
verbose=FALSE)
Arguments
url
a vector of URLs, each as a character string. Either the url, object, or the file parameter must be provided.
object
character; the name of an R object that contains the raw source code of an HTML or XML document. This parameter is likely useful when a previous call to scrape simply gathered document source code, followed redirects, and/or returned the headers, thus allowing the user to inspect the output first for potential problems before deciding to parse it into an R-friendly tree-like structure. Either the object, url, or the file parameter must be provided.
file
a vector of paths to local files, each as a character string. Either the file, url, or the object parameter must be provided.
chunkSize
integer; if a vector of urls is supplied whose size is greater than the value of chunkSize, the urls will be split into chunks of size chunkSize. By splitting the urls into chunks, the number of simultaneous HTTP requests is reduced, thus placing less burden on the server. The default value of chunkSize is 50. It is not recommended that one specify a value of chunkSize larger than 100.
maxSleep
integer; if the vector of urls is larger than the value of chunkSize, the function will "sleep" for ceiling(runif(1,min=0,max=maxSleep)) seconds between chunks. It is often helpful to use a sleep parameter when making repeated HTTP requests so as to not overwhelm the servers with gapless sequential requests. The default value for this parameter is 5.
userAgent
the User-Agent HTTP header that is supplied with any HTTP requests made by this function. This header is used to identify your HTTP calls to the host server. It is strongly recommended that one use an informative User-Agent header, perhaps with a link to one's email or web address. This information may prove helpful to system administrators when they are unsure of the legitimacy of your HTTP requests, as it provides them a way of contacting you. See the URL reference for "User-Agent" headers below for more information. By default, the User-Agent header is assigned the value given by unlist(options("HTTPUserAgent")), but the user is encouraged to construct a customized version (see the sketch after this argument list).
follow
logical; should these HTTP requests follow URL redirects if they are encountered? Here, redirection will only occur with HTTP requests for which the status code is of the 3xx type (see the reference to HTTP status codes below). This parameter is only meaningful if the url parameter is supplied. The default value for this parameter is FALSE.
headers
logical; should these HTTP requests retrieve the resulting HTTP headers? This parameter is only meaningful if the url parameter is supplied. The default value for this parameter is TRUE, as shown in the Usage section.
parse
logical; should the url or file vectors be parsed into R-friendly tree-like structures? See xmlTreeParse for more information about this feature and how the object is returned. If parse==TRUE, this tree-like structure is easily navigable using the XPath language (see the corresponding url reference provided below and the help page for xpathSApply). The default value for this parameter is TRUE.
isXML
logical; do the url or file vectors point to well-formed XML files? See xmlTreeParse for the differences between parsing XML and HTML documents. The default value for this parameter is FALSE.
.encoding
integer or a string; identifies the encoding of the retrieved content. See getURL for more information.
verbose
logical; should the function print extra information to the console? The default value for this parameter is FALSE.
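For example, the following sketch (the URL vector and the User-Agent string here are hypothetical, not taken from the package) shows how chunkSize, maxSleep, and a customized userAgent might be combined when requesting many pages:

# Hypothetical illustration: request many pages in chunks of 25, sleep up to
# 10 seconds between chunks, and identify yourself with a custom User-Agent.
myPages<-paste0("http://cran.r-project.org/web/packages/page",1:500,".html")
mySources<-scrape(url=myPages,chunkSize=25,maxSleep=10,
	userAgent="myScraper/0.1 (contact: me@example.org)",
	headers=TRUE,parse=FALSE)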
Value
If url or file is supplied, then either the raw source code of the urls (files) is returned as a list of (potentially long) character vectors (when parse==FALSE), or a list of R-friendly tree-like structures of the documents is returned (when parse==TRUE). If object is supplied, then either the raw source code contained within the object is returned as a list object of (potentially long) character strings (when parse==FALSE), or a list object of R-friendly tree-like structures for the documents is returned (when parse==TRUE). If url or object is supplied, the resulting object may have the following attributes:
redirect.URL: the destination URLs that resulted from a series of redirects, if they occurred; else NA. This is only returned if follow==TRUE.

headers: the HTTP headers resulting from these HTTP requests. These are only returned if headers==TRUE.
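For instance, assuming a hypothetical previous call such as result<-scrape(url=someURL,headers=TRUE,follow=TRUE) (result and someURL are illustrative names, not objects created by the examples below), these attributes can be read with attributes():

attributes(result)$headers        # the HTTP response headers (present when headers==TRUE)
attributes(result)$redirect.URL   # the destination URL(s) after redirects (present when follow==TRUE)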
Author(s)
Ryan M. Acton <racton@uci.edu>, http://www.ryanacton.com
References
Duncan Temple Lang. (2009). XML: Tools for parsing and generating XML within R and S-Plus. http://CRAN.R-project.org/package=XML.
Duncan Temple Lang. (2009). RCurl: General network (HTTP/FTP/...) client interface for R. http://CRAN.R-project.org/package=RCurl.
Information about HTTP status codes: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html.
Information about User-Agent headers: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.43.
Information about the XPath language: http://www.w3schools.com/XPath/default.asp.
Examples
## Not run:
## Example 1. Getting all of the package names available for download
## from CRAN (http://cran.r-project.org/web/packages/)
# First, pull in the page's source code, check for (and follow) a page redirection,
# and retrieve the headers before deciding to parse the code.
pageSource<-scrape(url="http://cran.r-project.org/web/packages/",headers=TRUE,
parse=FALSE)
# Second, inspect the headers to ensure a status code of 200, which means the page
# was served properly. If okay, then parse the object into an XML tree and retrieve
# all of the package names.
if(attributes(pageSource)$headers["statusCode"]==200) {
page<-scrape(object="pageSource")
xpathSApply(page,"//table//td/a",xmlValue)
} else {
cat("There was an error with the page. \n")
}
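# A variation on the call above (a sketch, not part of the package's original
# example): also follow any 3xx redirects and keep the destination URL.
pageSource2<-scrape(url="http://cran.r-project.org/web/packages/",headers=TRUE,
	follow=TRUE,parse=FALSE)
attributes(pageSource2)$redirect.URL   # destination URL(s), or NA if no redirect occurred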
## End(Not run)
## Example 2. Parsing a local XML file, then pulling out information of interest
# First, locate and parse the demo recipe file supplied with this package
fileToLoad<-system.file("recipe.xml",package="scrapeR")
mmmCookies<-scrape(file=fileToLoad,isXML=TRUE)
# Next, retrieve the names of the dry ingredients that I'll need to buy
xpathSApply(mmmCookies[[1]],"//recipe/ingredient[@type='dry']/item",xmlValue)
# Next, remind myself how much flour is needed
paste(xpathSApply(mmmCookies[[1]],"//item[.='flour']/preceding-sibling::amount",
	xmlValue),
	xpathSApply(mmmCookies[[1]],"//item[.='flour']/preceding-sibling::unit",
	xmlValue))
# Finally, remind myself who the author of this recipe is
xpathSApply(mmmCookies[[1]],"//recipe",xmlGetAttr,"from")