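I had a job that involved checking a fairly large number of HTML files, and grinding through it with grep plus Perl one-liners looked painful, so today I decided to parse the HTML files with Java and deal with them that way.
When it comes to HTML parsers for Java, the first one that comes to my mind is NekoHTML.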
CyberNeko HTML Parser
http://nekohtml.sourceforge.net/
However, it is quite old by now and does not support HTML5.
So I went looking for other parsers. I found two, which I introduce here.
Since the goal is to parse HTML, the parser has to cope with HTML that lacks closing tags, like the following.
index.html
<!DOCTYPE html>
<html>
<head>
<title>タイトル</title>
</head>
<body>
<div id="wrapper">
  <h1>見出し</h1>
  <br>
  <p class="text-content">Hello</p>
  <p class="text-content">World</p>
</div>
</body>
</html>
Nothing remarkable, just ordinary HTML.
If you parse this with a regular DOM parser,
jaxp.groovy
import javax.xml.parsers.*

def dbf = DocumentBuilderFactory.newInstance()
def db = dbf.newDocumentBuilder()
def doc = db.parse(new File('index.html'))
println(doc)
it fails spectacularly.
$ groovy jaxp.groovy
[Fatal Error] index.html:12:7: The element type "br" must be terminated by the matching end-tag "</br>".
Caught: org.xml.sax.SAXParseException; systemId: file:/xxxxx/index.html; lineNumber: 12; columnNumber: 7; The element type "br" must be terminated by the matching end-tag "</br>".
org.xml.sax.SAXParseException; systemId: file:/xxxxx/index.html; lineNumber: 12; columnNumber: 7; The element type "br" must be terminated by the matching end-tag "</br>".
	at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
	at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
	at jaxp.run(jaxp.groovy:5)
So, to parse HTML, you really do want an HTML parser.
By the way, although I keep saying Java, writing the code in Java is a hassle, so the examples are in Groovy.
jsoup
This is the parser I ended up using. It supports HTML5.
jsoup: Java HTML Parser
http://jsoup.org/
Declare it as a Maven dependency (well, here it is Grape) as follows.
@Grab('org.jsoup:jsoup:1.7.3')
import org.jsoup.Jsoup
Then parse the target file and you get a Document.
def file = new File('index.html')
def encoding = 'UTF-8'
def doc = Jsoup.parse(file, encoding)
The Document here is not org.w3c.dom.Document but jsoup's own class.
From this point you can run all sorts of convenient operations on Element and its subclasses. Note that Document is itself a subclass of Element.
You can look elements up by id, by class, or with CSS selectors in the style of jQuery, which is quite handy.
// Look up by id
def elm1 = doc.getElementById('wrapper')

// Look up by class
def elems1 = doc.getElementsByClass('text-content')

// Look up with a CSS selector
def elems2 = doc.select('p.text-content')

// Look up with a CSS selector
def elems3 = doc.select('#wrapper')
Element#select returns an instance of a class called Elements. This class implements Collection and List, so iterating over the results is easy. In Groovy, grep, each and the like work as usual.
For the other methods, see the Javadoc for details.
Element
http://jsoup.org/apidocs/org/jsoup/nodes/Element.html
This turned out to be a pretty good parser. It looks like it will be quite useful.
Beyond that, you can parse directly from a URL,
def doc = Jsoup.connect('https://www.google.co.jp/').get()
or parse an HTML fragment.
jsoup-fragment.groovy
@Grab('org.jsoup:jsoup:1.7.3')
import org.jsoup.Jsoup

def html = '''\
<div id="wrapper">
  <h1>見出し</h1>
  <br>
  <p class="text-content">Hello</p>
  <p class="text-content">World</p>
</div>'''

def doc = Jsoup.parseBodyFragment(html)
def body = doc.body()
The official Cookbook suggests there is a lot more it can do.
jsoup cookbook
http://jsoup.org/cookbook/
The Validator.nu HTML Parser
By the time I found jsoup I had already used it for the job, so I did not use this one on this occasion, but I will introduce it as well.
The Validator.nu HTML Parser
http://about.validator.nu/htmlparser/
Firefox 4's HTML5 parser reportedly uses this one.
It appears to follow the W3C standards and offers SAX and DOM APIs. It is also available from Maven Central, so DOM parsing code can be written like this.
validator-nu.groovy
@Grab('nu.validator.htmlparser:htmlparser:1.4')
import nu.validator.htmlparser.dom.HtmlDocumentBuilder
import org.xml.sax.InputSource

def builder = new HtmlDocumentBuilder()
def reader = new BufferedReader(new InputStreamReader(new FileInputStream('index.html'), 'UTF-8'))
def doc = builder.parse(new InputSource(reader))

// From here on, work with doc as an org.w3c.dom.Document

reader.close()
HtmlDocumentBuilder#parse returns an org.w3c.dom.Document, so from there ordinary DOM programming is all you need.
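As a rough illustration of that "ordinary DOM programming", here is a minimal sketch in plain Java that uses only JDK classes; the same calls can be made from Groovy. The class name DomWalk and the helpers textsByClass and parseWellFormed are hypothetical names made up for this sketch. To stay self-contained it builds the Document from a well-formed XML string with JAXP, as a stand-in for the Document that the HTML parser would return.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalk {
    // Collects the text of every <p> whose class attribute equals className.
    public static List<String> textsByClass(Document doc, String className) {
        List<String> texts = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            if (className.equals(p.getAttribute("class"))) {
                texts.add(p.getTextContent());
            }
        }
        return texts;
    }

    // Assumption: stand-in for the HTML parser. It only accepts well-formed XML,
    // so the sample below writes <br/> instead of <br>.
    public static Document parseWellFormed(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<div id=\"wrapper\"><h1>Heading</h1><br/>"
                + "<p class=\"text-content\">Hello</p>"
                + "<p class=\"text-content\">World</p></div>";
        Document doc = parseWellFormed(xhtml);
        System.out.println(textsByClass(doc, "text-content")); // prints [Hello, World]
    }
}
```

Compared with jsoup's select, the DOM API needs an explicit loop and a cast for each node, which is the main trade-off when choosing between the two parsers.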
I suppose this is the one to reach for when you need to work with DOM or SAX. For selecting nodes by path, XPath is the way to go.
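The XPath route could look roughly like the sketch below, written in plain Java with only the JDK's javax.xml.xpath API. The class name XPathSelect and its helper methods are hypothetical. One assumption to verify against the parser's documentation: as far as I know, HtmlDocumentBuilder places elements in the XHTML namespace, in which case a plain `//p` matches nothing. The sketch mimics that with a namespace-aware JAXP parse of an XHTML string and sidesteps the issue with local-name().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSelect {
    // Evaluates an XPath expression and returns the text content of each matched node.
    public static List<String> select(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    // Assumption: namespace-aware stand-in for the Document an HTML parser would return.
    public static Document parseNamespaced(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div id=\"wrapper\"><h1>Heading</h1><br/>"
                + "<p class=\"text-content\">Hello</p>"
                + "<p class=\"text-content\">World</p></div></body></html>";
        Document doc = parseNamespaced(xhtml);

        // Unprefixed names only match elements in no namespace, so this finds nothing.
        System.out.println(select(doc, "//p")); // prints []

        // Matching on local-name() works whether or not the elements are namespaced.
        String byPath = "//*[@id='wrapper']/*[local-name()='p' and @class='text-content']";
        System.out.println(select(doc, byPath)); // prints [Hello, World]
    }
}
```

An alternative to local-name() is to register a NamespaceContext on the XPath object and write prefixed names; that is tidier for long expressions but needs more setup.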
Both look useful, so it is worth keeping both in mind.