It has taken reasonable shape, so I am releasing it before I lose interest.
This is an XPathEvaluator that works, for now at least, with Python 2.5* and BeautifulSoup 3.0.7* or 3.1.0*.
Archive file (ZIP): BSXPath.py: XPathEvaluator Extension for BeautifulSoup
A sample that uses the file above (BSXPath.py) is available here.
[Added 2009/04/05]
I have published a service that uses BSXPath.py.
A service for creating and sharing feed patterns for arbitrary sites
Usage
from BSXPath import BSXPathEvaluator,XPathResult

#*** Setup
document = BSXPathEvaluator(<html>) # html: HTML text
# Note: BSXPathEvaluator is a subclass of BeautifulSoup.
#       The resulting object (document) can also use BeautifulSoup methods.

#*** Basic operation
result = document.evaluate(<expression>,<node>,None,<type>,None)
# expression: XPath expression
# node : context node to evaluate from (the return value of BSXPathEvaluator (document) is the ROOT)
# type : XPathResult.<name> (the form in which you want the result)
# name : ANY_TYPE(0), NUMBER_TYPE(1), STRING_TYPE(2), BOOLEAN_TYPE(3)
#        UNORDERED_NODE_ITERATOR_TYPE(4), ORDERED_NODE_ITERATOR_TYPE(5)
#        UNORDERED_NODE_SNAPSHOT_TYPE(6), ORDERED_NODE_SNAPSHOT_TYPE(7)
#        ANY_UNORDERED_NODE_TYPE(8), FIRST_ORDERED_NODE_TYPE(9)
# Note: the 3rd argument (resolver) and the 5th argument (result) must always be None (not implemented)

# --- When XPathResult.ANY_TYPE(0) is specified
type = result.nodeType
# This is one of XPathResult.NUMBER_TYPE(1)/STRING_TYPE(2)/BOOLEAN_TYPE(3)/UNORDERED_NODE_ITERATOR_TYPE(4),
# so handle the result accordingly

# --- When XPathResult.NUMBER_TYPE(1) is specified
value = result.numberValue

# --- When XPathResult.STRING_TYPE(2) is specified
value = result.stringValue

# --- When XPathResult.BOOLEAN_TYPE(3) is specified
value = result.booleanValue

# --- When XPathResult.ANY_UNORDERED_NODE_TYPE(8) or XPathResult.FIRST_ORDERED_NODE_TYPE(9) is specified
value = result.singleNodeValue

# --- When any of XPathResult.UNORDERED_NODE_ITERATOR_TYPE(4)/ORDERED_NODE_ITERATOR_TYPE(5)
#     /UNORDERED_NODE_SNAPSHOT_TYPE(6)/ORDERED_NODE_SNAPSHOT_TYPE(7) is specified
length = result.snapshotLength
node = result.snapshotItem(<number>)
for i in range(length):
    value = result.snapshotItem(i)

#*** WRAPPER functions
nodes = document.getItemList(<expression>[,<node>]) # returns a list of nodes
first = document.getFirstItem(<expression>[,<node>]) # returns only the first node
# expression: XPath expression
# node : context node to evaluate from (defaults to the return value of BSXPathEvaluator (document))
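The per-type attribute access described above can be folded into one small dispatch helper. The following is only a sketch: `extract_value` and `FakeResult` are hypothetical names that are not part of BSXPath, the numeric constants are copied from the list above so that the code runs without BSXPath installed, and the `nodeType` attribute used for ANY_TYPE is taken from the usage notes as written.

```python
# Result-type constants as listed in the usage notes, redefined locally
# only so this sketch runs without BSXPath installed.
ANY_TYPE = 0
NUMBER_TYPE = 1
STRING_TYPE = 2
BOOLEAN_TYPE = 3
ITERATOR_AND_SNAPSHOT_TYPES = (4, 5, 6, 7)
SINGLE_NODE_TYPES = (8, 9)


def extract_value(result, result_type):
    """Return the plain Python value held by an XPathResult-like object.

    `result_type` is the constant that was passed to evaluate().
    Hypothetical helper, not part of BSXPath.
    """
    if result_type == ANY_TYPE:
        # Per the usage notes, the actual type is reported on the result.
        result_type = result.nodeType
    if result_type == NUMBER_TYPE:
        return result.numberValue
    if result_type == STRING_TYPE:
        return result.stringValue
    if result_type == BOOLEAN_TYPE:
        return result.booleanValue
    if result_type in SINGLE_NODE_TYPES:
        return result.singleNodeValue
    if result_type in ITERATOR_AND_SNAPSHOT_TYPES:
        return [result.snapshotItem(i) for i in range(result.snapshotLength)]
    raise ValueError("unsupported result type: %r" % (result_type,))


class FakeResult(object):
    """Stand-in for an XPathResult so the sketch can run on its own."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)

    def snapshotItem(self, index):
        # Assumes the stand-in was created with an `items` list.
        return self.items[index]


demo = FakeResult(stringValue="Hello", items=["<p>a</p>", "<p>b</p>"], snapshotLength=2)
print(extract_value(demo, STRING_TYPE))
print(extract_value(demo, 7))
```

With a real BSXPath result, the same helper would be called as `extract_value(document.evaluate(expr, document, None, t, None), t)`, assuming the attributes behave as described in the usage notes.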
Sample
from BSXPath import BSXPathEvaluator,XPathResult

html = """
<html><head><title>Hello, DOM 3 XPath!</title></head>
<body><h1>Hello, DOM 3 XPath!</h1><p>This is XPathEvaluator Extension for BeautifulSoup.</p>
<p>This is based on JavaScript-XPath!</p></body>
"""

document = BSXPathEvaluator(html)

result = document.evaluate('//h1/text()[1]',document,None,XPathResult.STRING_TYPE,None)
print result.stringValue
# Hello, DOM 3 XPath!

result = document.evaluate('//h1',document,None,XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,None)
print result.snapshotLength
# 1
print result.snapshotItem(0)
# <h1>Hello, DOM 3 XPath!</h1>

nodes = document.getItemList('//p')
print len(nodes)
# 2
print nodes
# [<p>This is XPathEvaluator Extension for BeautifulSoup.</p>, <p>This is based on JavaScript-XPath!</p>]

first = document.getFirstItem('//p')
print first
# <p>This is XPathEvaluator Extension for BeautifulSoup.</p>
Acknowledgements
- The XPath parsing logic is taken almost unchanged from id:amachang's JavaScript-XPath. Porting it was hard enough on its own; amachang, who wrote the original, is truly amazing!
- Thanks also to Leonard Richardson for providing BeautifulSoup!
Notes
- I still do not have a good grasp of XPath, or of DOM for that matter, so the behavior is probably shaky (hey!) *1.
- For what it is worth, I have tested it with the data at
http://svn.coderepos.org/share/lang/javascript/javascript-xpath/trunk/test/functional/data/
With the data as of 2009/3/24 (0000 to 0012), two cases in 0002 fail and all the others pass.
The failing cases in 0002 are './/blockquote/text()' and './/blockquote/node()'.
Possibly because of how BeautifulSoup works, when the HTML looks like '<...>\n <...>', the whitespace before the second tag appears not to be kept as a text node. The cause looks deep-rooted, so this will probably be hard to fix.
- The archive file includes a test script (TEST_BSXPath.py), a batch file for the Windows command prompt that runs all the tests together (testbsx.cmd), and its test results.
Running the batch file creates a ".\testbsxresult" folder and saves the results in it.
- BeautifulSoup 3.0.7* seems to be less prone to parse errors than 3.1.0*.
Currently the 3.0.x series is better at parsing bad HTML than the 3.1 series.
- Do not expect much in terms of speed. It is quite slow. If you know a way to make it faster, please tell me.
- I am also a beginner at Python, so I expect the code is written oddly in places. Advice on how to do things better is welcome.
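To make the whitespace issue behind the two failing 0002 cases concrete, here is a small sketch of what a whitespace-preserving parse reports for HTML of the form '<...>\n <...>'. It uses the standard-library HTML parser rather than BeautifulSoup, so it only illustrates the text node that expressions such as './/blockquote/text()' would be expected to see; it makes no claim about BeautifulSoup's internals beyond the observation above. `TextNodeCollector` and `collect_text_nodes` are hypothetical names introduced for this illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TextNodeCollector(HTMLParser):
    """Record every run of character data, including whitespace-only runs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.text_nodes = []

    def handle_data(self, data):
        self.text_nodes.append(data)


def collect_text_nodes(html):
    """Return the character-data runs of `html` in document order."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.text_nodes


sample = "<blockquote>\n <p>quoted</p></blockquote>"
nodes = collect_text_nodes(sample)
print(nodes)                            # all text runs, whitespace included
print([n for n in nodes if n.strip()])  # what remains if whitespace-only runs are dropped
```

The whitespace run between `<blockquote>` and `<p>` appears as its own entry in the first list, which is the kind of node that changes the result of a `text()` or `node()` test when it is missing.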
*1: To do this properly I suppose I would have to read through the W3C specifications, but...