Introduction
nikkie here.
I recently learned something new about regular expressions.
\p{L}
What do you think this is?
Table of contents
- Introduction
- Table of contents
- From MDN: "Unicode character class escape"
- An example from "Unicode character class escape"
- Consulting Unicode Technical Standard #18
- Looking it up in 『正規表現辞典』
- In Python, use regex
- Conclusion
- P.S. Where did I see \p in the first place?
From MDN: "Unicode character class escape"
"A Unicode character class escape is a kind of character class escape that matches a set of characters specified by a Unicode property."
Character class escapes are things like \d and \w in regular expressions!
I have seen those before.
"A character class escape is an escape sequence that represents a set of characters."
Character class escape: \d, \D, \w, \W, \s, \S - JavaScript | MDN
So \p and \P are escape sequences that represent a set of characters.
The "Unicode character class escape" documentation continues as follows.
"It is only supported in Unicode-aware mode."
This refers to the u flag [1].
Regular expressions - JavaScript | MDN
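To see for myself why the u flag matters, here is a minimal sketch. It assumes a JavaScript engine that follows the usual web-compatible regex syntax, where \p without the u flag is treated as a plain escaped "p". The variable names are my own.

```javascript
// Without the u flag, \p is just an escaped "p" and {L} is literal text.
const withoutFlag = /\p{L}/;
// With the u flag, \p{L} is a Unicode character class escape for letters.
const withFlag = /\p{L}/u;

const results = {
  literalWithoutFlag: withoutFlag.test("p{L}"), // true: matches the literal text "p{L}"
  letterWithoutFlag: withoutFlag.test("a"),     // false: "a" does not contain "p{L}"
  letterWithFlag: withFlag.test("a"),           // true: "a" is a letter
  digitWithFlag: withFlag.test("1"),            // false: "1" is not a letter
};

console.log(results);
```

So forgetting the flag does not raise an error. The pattern silently means something else.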
According to the Description section:
"Every Unicode character has a set of properties that describe it.
For example, the character a has the General_Category property with value Lowercase_Letter, and the Script property with value Latn."
Wait, what!? I had no idea characters had properties.
"For example, a can be matched by \p{Lowercase_Letter} (the General_Category property name is optional) as well as \p{Script=Latn}."
\P is the negation of \p.
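The quoted description can be tried out with a small sketch like the one below. The sample strings are my own, and the non-ASCII character is written as a \u escape so the snippet does not depend on file encoding.

```javascript
// "a" has General_Category=Lowercase_Letter and Script=Latn.
const lowercaseLetter = /\p{Lowercase_Letter}/u;
const latinScript = /\p{Script=Latn}/u;
// \P{L} matches any character that is NOT a letter.
const notLetter = /\P{L}/gu;

const checks = {
  lowerA: lowercaseLetter.test("a"),      // true
  upperA: lowercaseLetter.test("A"),      // false: "A" is an Uppercase_Letter
  latinA: latinScript.test("a"),          // true
  hiraganaA: latinScript.test("\u3042"),  // false: U+3042 (hiragana "a") is not Latin script
};
const nonLetters = "a1 b!".match(notLetter); // ["1", " ", "!"]

console.log(checks, nonLetters);
```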
An example from "Unicode character class escape"
The "General categories" (一般カテゴリー) section [2] has an example using \p{L}.
It fills in the remark quoted above that "the General_Category property name is optional".
The following regular expression literals all mean the same thing:
/\p{L}/gu
/\p{General_Category=Letter}/gu (with the General_Category property name written out)
/\p{Letter}/gu (with General_Category omitted)
I ran this in the developer tools console of a browser (Firefox).
const story = "It's the Cheshire Cat: now I shall have somebody to talk to.";
story.match(/\p{L}/gu);
Array(46) [ "I", "t", "s", "t", "h", "e", "C", "h", "e", "s", … ]
It matched the letters, leaving out whitespace and punctuation!
Rewriting it as
story.match(/\p{General_Category=Letter}/gu);
gives the same result.
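Instead of comparing the console output by eye, the three spellings can be checked against each other in code. This is just a sketch with variable names of my own choosing.

```javascript
const story = "It's the Cheshire Cat: now I shall have somebody to talk to.";

// The three spellings listed in the MDN example.
const patterns = [
  /\p{L}/gu,
  /\p{General_Category=Letter}/gu,
  /\p{Letter}/gu,
];

// Join each match list into one string so the results are easy to compare.
const joined = patterns.map((pattern) => story.match(pattern).join(""));
const allSame = joined.every((letters) => letters === joined[0]);

console.log(joined[0].length, allSame); // 46 true
```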
Consulting Unicode Technical Standard #18
"Unicode Technical Standard #18: Unicode Regular Expressions" has a section called "General Category Property" (1.2.5) [3].
https://unicode.org/reports/tr18/#General_Category_Property
Its table lists the property values together with their abbreviations (the Long form and Abb. columns).
- Letter (abbreviated L)
- Uppercase Letter (abbreviated Lu)
Even when the notation varies in case, or in whether spaces and underscores are present, it reportedly denotes the same set of characters.
any of the following should be equivalent: \p{Lu}, \p{lu}, \p{uppercase letter}, \p{Uppercase Letter}, \p{Uppercase_Letter}, and \p{uppercaseletter}
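One caveat: this loose matching is what UTS #18 recommends, and as far as I can tell JavaScript does not follow it, accepting only the exact spellings. The sketch below checks which of the six spellings compile under the u flag. isValidUnicodePattern is a hypothetical helper I made up for this check.

```javascript
// Hypothetical helper: returns true if the pattern source compiles with the u flag.
function isValidUnicodePattern(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected
  }
}

// The six spellings that UTS #18 says should be equivalent.
const spellings = [
  "Lu",
  "lu",
  "uppercase letter",
  "Uppercase Letter",
  "Uppercase_Letter",
  "uppercaseletter",
];

const accepted = spellings.filter((name) => isValidUnicodePattern(`\\p{${name}}`));

console.log(accepted); // ["Lu", "Uppercase_Letter"]
```

In other words, when writing for JavaScript it seems safest to copy the property value exactly as listed, including case and underscores.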
Looking it up in 『正規表現辞典』
03-04-06 "\p{...}, \P{...}: match characters that satisfy a condition based on a Unicode property"
I found the examples interesting.
\p{Lu} (Uppercase Letter) matches, for example, the full-width Ｇ. \p{InHiragana} matches hiragana.
(The test string below means "half-width G, full-width Ｇ".)
"半角のG 全角のＧ".match(/\p{Lu}/gu);
Array [ "G", "Ｇ" ]
"半角のG 全角のＧ".match(/\p{InHiragana}/gu);
resulted in
Uncaught SyntaxError: invalid property name in regular expression
Is it unsupported, I wonder?
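My understanding is that InHiragana refers to the Unicode block, and that JavaScript's \p{...} covers General_Category, Script, Script_Extensions, and binary properties, but not blocks. Under that assumption, a sketch of two alternatives: match by the Hiragana script, or spell out the Hiragana block range (U+3040 to U+309F) by hand. The two are not exactly the same set of characters, but they agree on this test string.

```javascript
// "半角のG 全角のＧ", written with \u escapes to avoid encoding problems.
const text = "\u534a\u89d2\u306eG \u5168\u89d2\u306e\uff27";

// Alternative 1: match by script instead of by block.
const byScript = text.match(/\p{Script=Hiragana}/gu);

// Alternative 2: write the Hiragana block range explicitly.
const byBlockRange = text.match(/[\u3040-\u309F]/gu);

console.log(byScript, byBlockRange); // both contain the two U+306E characters
```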
In Python, use regex
regex is an extension of the standard library's re module.
It supports Unicode codepoint properties!
>>> import regex
>>> regex.findall(r"\p{Lu}", "半角のG 全角のＧ")
['G', 'Ｇ']
>>> regex.findall(r"\p{InHiragana}", "半角のG 全角のＧ")
['の', 'の']
Run with Python 3.10.9 and regex 2024.4.16.
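Conclusion
I learned about Unicode character class escapes.
- With \p{} and \P{} you specify a Unicode character property inside the {}, which designates a set of characters just like \d does
  - For example, \p{Lu} specifies Uppercase Letter, and it matches full-width letters as well as half-width ones
- In JavaScript, they are used in regular expression literals with the unicode flag set
Being able to specify something like InHiragana with a Unicode character class escape seems handy.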
P.S. Where did I see \p in the first place?
While reading the VS Code implementation! [4]
https://github.com/microsoft/vscode/blob/1.88.1/src/vs/editor/contrib/linesOperations/browser/linesOperations.ts#L1135
public static titleBoundary = new BackwardsCompatibleRegExp('(^|[^\\p{L}\\p{N}\']|((^|\\P{L})\'))\\p{L}', 'gmu');
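To get a feel for what this pattern matches, here is a sketch that feeds the same pattern string to a plain RegExp. BackwardsCompatibleRegExp is internal to VS Code, so I am not using it here, and toTitleCase is a hypothetical helper of mine, not VS Code's actual code path.

```javascript
// Same pattern string as titleBoundary, but with a plain RegExp (an assumption for this sketch).
const titleBoundary = new RegExp("(^|[^\\p{L}\\p{N}']|((^|\\P{L})'))\\p{L}", "gmu");

// Hypothetical helper: uppercase every letter that the pattern treats as the start of a word.
function toTitleCase(text) {
  return text.replace(titleBoundary, (match) => match.toUpperCase());
}

console.log(toTitleCase("it's the cheshire cat")); // "It's The Cheshire Cat"
console.log(toTitleCase("\u00e9lan vital"));       // "\u00c9lan Vital" (accented letters count as \p{L})
```

The apostrophe handling is what keeps the "s" in "it's" from being treated as the start of a new word.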
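Besides the regular expression literal notation /\p{Lu}/gu shown in the main text, JavaScript also lets you create a regular expression with a RegExp object [5].
With the latter, the pattern is passed as a string, so the backslash itself has to be escaped with another backslash.
That, as I understand it, is why it becomes '\\p'.
ref: Regular expressions - JavaScript | MDN
"半角のG 全角のＧ".match(new RegExp("\\p{Lu}", "gu"));
Array [ "G", "Ｇ" ]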
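A small sketch of what happens to the backslash in each spelling. String.raw is not something the VS Code source uses here. I am only adding it as another way to avoid the doubled backslash.

```javascript
// In an ordinary string literal, "\\p" is a backslash followed by "p".
const escaped = "\\p{Lu}";
// A single backslash before "p" is simply dropped, leaving "p{Lu}".
const unescaped = "\p{Lu}";
// String.raw keeps backslashes as written, so no doubling is needed.
const raw = String.raw`\p{Lu}`;

// "半角のG 全角のＧ", written with \u escapes to avoid encoding problems.
const text = "\u534a\u89d2\u306eG \u5168\u89d2\u306e\uff27";
const matches = text.match(new RegExp(raw, "gu")); // ["G", "\uff27"]

console.log(escaped === raw, unescaped, matches);
```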
- [1] I also mentioned flags in an article I wrote earlier to understand regular expressions.
- [2] 「一般カテゴリー」 is presumably the Japanese translation of the property name General_Category.
- [3] It is shown under "General categories" in the "Examples" of the MDN document referenced in this article.
- [4] This article is part of my preparation for a talk.
- [5] https://developer.mozilla.org/ja/docs/Web/JavaScript/Guide/Regular_Expressions#%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%8F%BE%E3%81%AE%E4%BD%9C%E6%88%90