Last time, I looked into uploading data with bulkloader and tried out various things.
Uploading data with bulkloader on GAE/Py - すぎゃーんメモ
This time I looked into downloading and deleting data.
Downloading data
Downloading works almost the same way as uploading. The remote_api handler configuration and so on are the same as last time.
The class to use is bulkloader.Exporter rather than bulkloader.Loader.
For example, suppose the Datastore holds data for the TimeLine model used last time.
# model.py
from google.appengine.ext import db

class TimeLine(db.Model):
    name = db.StringProperty()
    text = db.TextProperty()
    time = db.DateTimeProperty()
The script I prepared for downloading looks like this.
# loader.py
from google.appengine.tools import bulkloader
from model import TimeLine

class TimeLineExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'TimeLine', [
            ('name', str, None),
            ('text', lambda x: x.encode('utf-8'), None),
            ('time', str, None),
        ])

exporters = [TimeLineExporter]
As with the Loader class, the conversion for each property is specified inside the __init__() method.
The third element of each tuple in the list apparently specifies a default value.
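To make the (property name, converter, default) tuples a bit more concrete, here is a rough sketch in plain Python, without the App Engine SDK. The helper encode_entity and the dict-based entities are hypothetical, and the rule that the default is used only when a property is missing is an assumption, not something confirmed from the SDK.

```python
from datetime import datetime

# Hypothetical stand-in for the property list passed to Exporter.__init__:
# (property name, converter, default value).
PROPERTIES = [
    ('name', str, None),
    ('text', lambda x: x.encode('utf-8'), None),
    ('time', str, 'unknown'),
]

def encode_entity(entity, properties):
    """Apply each converter to the matching property of a dict-like entity.

    Assumption: the default is used only when the property is missing.
    """
    row = []
    for name, converter, default in properties:
        if name in entity:
            row.append(converter(entity[name]))
        else:
            row.append(default)
    return row

# One entity with every property, one without 'time'.
full_row = encode_entity(
    {'name': 'name40', 'text': u'テキスト40',
     'time': datetime(1900, 1, 1, 10, 42, 26)},
    PROPERTIES)
partial_row = encode_entity({'name': 'name41', 'text': u'テキスト41'}, PROPERTIES)
```

In this sketch each converter turns one property into its output form, and the third tuple element only comes into play for the entity that lacks a time.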
With this in place, just running the "appcfg.py download_data" command downloads the data and writes it to a file.
$ PYTHONPATH=. appcfg.py download_data --config_file=loader.py --filename=./result.csv --kind=TimeLine <target application directory>
Does the file name given to "--filename" have to be an absolute path? Passing just a bare file name fails with FileNotWritableError, so it seems it has to be an absolute path or a path relative to the current directory (such as ./result.csv). A file name that already exists is also rejected.
This writes all data of the specified kind out as CSV.
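As for the FileNotWritableError with a bare file name for "--filename": one possible explanation, and this is only a guess that has not been checked against the SDK source, is that the writability check looks at the directory part of the given name. For a bare file name that directory part is an empty string. The helper describe_output_path below is hypothetical and only shows what the standard library reports in that case.

```python
import os

def describe_output_path(filename):
    """Return the directory part of filename and whether it looks writable.

    Hypothetical helper for illustration, not part of the SDK.
    """
    directory = os.path.dirname(filename)
    return directory, os.access(directory, os.W_OK)

# A bare file name has an empty directory part, which os.access rejects.
bare_dir, bare_ok = describe_output_path('result.csv')

# With a leading "./" the directory part is the current directory.
dotted_dir, _ = describe_output_path('./result.csv')
```

If the SDK does something along these lines, prefixing the name with ./ would be enough, which matches the command shown above.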
Downloading in formats other than CSV
As with uploading, overriding a method lets you save the file in a format other than CSV.
In the bulkloader.Exporter class, the method to override is output_entities.
def output_entities(self, entity_generator):
    """Outputs the downloaded entities.

    This implementation writes CSV.

    Args:
      entity_generator: A generator that yields the downloaded entities
        in key order.
    """
    CheckOutputFile(self.output_filename)
    output_file = open(self.output_filename, 'w')
    logger.debug('Export complete, writing to file')
    output_file.writelines(self.__SerializeEntity(entity) + '\n'
                           for entity in entity_generator)
The argument passed in is a generator that yields the entities. The file name has already been stored in self.output_filename by earlier processing.
The CSV records are actually built by the __SerializeEntity method and, inside it, the __EncodeEntity method, but these are private methods and so cannot be overridden in the normal way (well, in practice you can override them by defining a method named something like "_Exporter__EncodeEntity"...).
The proper approach is probably to rewrite only output_entities, the main output method.
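The remark about private methods comes down to Python's name mangling. Here is a minimal sketch with stand-in classes (Exporter, NaiveExporter and MangledExporter are all hypothetical and unrelated to the real bulkloader code) showing why a plain double-underscore override is ignored and why the mangled name works.

```python
class Exporter(object):
    """Stand-in for a base class with a double-underscore method."""

    def __encode(self, value):
        return 'base:%s' % value

    def output(self, value):
        # Inside this class body the call is compiled as self._Exporter__encode.
        return self.__encode(value)


class NaiveExporter(Exporter):
    def __encode(self, value):
        # Mangled to _NaiveExporter__encode, so Exporter.output never calls it.
        return 'naive:%s' % value


class MangledExporter(Exporter):
    def _Exporter__encode(self, value):
        # Same mangled name as the base method, so this one does override it.
        return 'mangled:%s' % value


naive_result = NaiveExporter().output('x')
mangled_result = MangledExporter().output('x')
```

It works, but it depends on the base class name and on an internal method, which is a good reason to stick with overriding output_entities.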
Example: output as XML
The db.Model class in App Engine has an instance method called to_xml, so it is convenient to use it as is.
from google.appengine.tools import bulkloader
from model import TimeLine

class TimeLineExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'TimeLine', [
            ('name', str, None),
            ('text', lambda x: x.encode('utf-8'), None),
            ('time', str, None),
        ])

    def output_entities(self, entity_generator):
        # Wrap the per-entity XML fragments in a single root element.
        file = open(self.output_filename, 'w')
        file.write('<entities>\n')
        for entity in entity_generator:
            file.write(entity.to_xml().encode('utf-8'))
        file.write('</entities>\n')
        file.close()

exporters = [TimeLineExporter]
Like this.
$ PYTHONPATH=. appcfg.py download_data --config_file=loader.py --filename=./result.xml --kind=TimeLine ../application
$ cat result.xml
<entities>
<entity kind="TimeLine" key="aghzdWdpMTk4MnIQCxIIVGltZUxpbmUY198BDA">
  <key>tag:sugi1982.gmail.com,2009-07-09:TimeLine[aghzdWdpMTk4MnIQCxIIVGltZUxpbmUY198BDA]</key>
  <property name="name" type="string">name40</property>
  <property name="text" type="text">テキスト40</property>
  <property name="time" type="gd:when">1900-01-01 10:42:26</property>
</entity>
<entity kind="TimeLine" key="aghzdWdpMTk4MnIQCxIIVGltZUxpbmUY2N8BDA">
  <key>tag:sugi1982.gmail.com,2009-07-09:TimeLine[aghzdWdpMTk4MnIQCxIIVGltZUxpbmUY2N8BDA]</key>
  <property name="name" type="string">name41</property>
  <property name="text" type="text">テキスト41</property>
  <property name="time" type="gd:when">1900-01-01 07:11:17</property>
</entity>
...
</entities>
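The wrapping idea (one root element around the per-entity fragments) can be tried without the SDK. In this sketch entity_to_xml is a hypothetical stand-in for db.Model.to_xml() and produces a much simpler fragment than the real method. The result is parsed back with the standard library to check that it is well-formed.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def entity_to_xml(kind, props):
    """Hypothetical stand-in for db.Model.to_xml(): one <entity> fragment."""
    lines = ['<entity kind="%s">' % kind]
    for name, value in props:
        # Escape the text content; names are assumed to be plain identifiers.
        lines.append('  <property name="%s">%s</property>'
                     % (name, escape(value)))
    lines.append('</entity>\n')
    return '\n'.join(lines)

def write_entities(out, fragments):
    """Wrap the fragments in a single <entities> root element."""
    out.write('<entities>\n')
    for fragment in fragments:
        out.write(fragment)
    out.write('</entities>\n')

buf = io.StringIO()
write_entities(buf, [
    entity_to_xml('TimeLine', [('name', 'name40'), ('text', 'a < b')]),
    entity_to_xml('TimeLine', [('name', 'name41'), ('text', u'テキスト41')]),
])

# Parsing raises an error if the combined document is not well-formed.
root = ET.fromstring(buf.getvalue())
```

Without the surrounding root element the file would just be a series of sibling entity elements, which an XML parser rejects.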
Example: output as JSON
Just convert each entity into data such as a dictionary so that it can be serialized as JSON.
from google.appengine.tools import bulkloader
from model import TimeLine

class TimeLineExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'TimeLine', [
            ('name', str, None),
            ('text', lambda x: x.encode('utf-8'), None),
            ('time', str, None),
        ])

    def output_entities(self, entity_generator):
        from django.utils import simplejson
        file = open(self.output_filename, 'w')
        # Collect plain dicts, then dump them as one JSON array.
        entities = []
        for entity in entity_generator:
            entities.append({
                'name': entity.name,
                'text': entity.text,
                'time': str(entity.time),
            })
        file.write(simplejson.dumps(entities))
        file.close()

exporters = [TimeLineExporter]
Like this.
$ PYTHONPATH=. appcfg.py download_data --config_file=loader.py --filename=./result.json --kind=TimeLine ../application
$ cat result.json
[{"text": "\u30c6\u30ad\u30b9\u30c840", "name": "name40", "time": "1900-01-01 10:42:26"}, {...
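The same conversion can be sketched with the standard json module instead of django.utils.simplejson. Entity here is a hypothetical namedtuple standing in for a TimeLine entity. The point is only that a datetime has to become a string before dumping, and that non-ASCII text ends up as \u escapes by default, as in the output above.

```python
import json
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for a TimeLine entity.
Entity = namedtuple('Entity', ['name', 'text', 'time'])

def entities_to_json(entities):
    """Convert entities to plain dicts and serialize them as a JSON array."""
    rows = [
        # datetime is not JSON serializable, so convert it to a string first.
        {'name': e.name, 'text': e.text, 'time': str(e.time)}
        for e in entities
    ]
    return json.dumps(rows)

dumped = entities_to_json([
    Entity('name40', u'テキスト40', datetime(1900, 1, 1, 10, 42, 26)),
])
restored = json.loads(dumped)
```

Loading the string back gives the original text again, so the escapes are only a matter of representation.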
Deleting all data
This may not be what it was designed for, but the output_entities method of the Exporter class gets to touch every entity, so instead of writing them to a file (or after writing them), you can delete those entities.
A bad example
Take the entities out one at a time and delete them one by one.
from google.appengine.tools import bulkloader
from model import TimeLine

class TimeLineExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'TimeLine', [
            ('name', str, None),
            ('text', lambda x: x.encode('utf-8'), None),
            ('time', str, None),
        ])

    def output_entities(self, entity_generator):
        for entity in entity_generator:
            entity.delete()

exporters = [TimeLineExporter]
This does delete all the entities, but it calls the Datastore API every single time, so the load on the Datastore is probably high.
Checking the Dashboard, deleting 100 entities took about 123 "Datastore API Calls" and a fair amount of time.
Bulk delete, part 1
The google.appengine.ext.db package provides a function called delete, which deletes everything in one go when given a list of entities or keys.
from google.appengine.tools import bulkloader
from model import TimeLine

class TimeLineExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'TimeLine', [
            ('name', str, None),
            ('text', lambda x: x.encode('utf-8'), None),
            ('time', str, None),
        ])

    def output_entities(self, entity_generator):
        from google.appengine.ext import db
        db.delete(list(entity_generator))

exporters = [TimeLineExporter]
With this, the number of API calls for the same 100 entities dropped to 24, and the time taken became much shorter.
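The difference between the two approaches is easy to see with a toy model. FakeDatastore below is a hypothetical in-memory stand-in that only counts delete calls. The Dashboard figures (123 and 24) apparently include calls other than the deletes themselves, so the counts here are not meant to match them.

```python
class FakeDatastore(object):
    """Hypothetical in-memory stand-in that counts delete calls."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.delete_calls = 0

    def delete(self, keys):
        """Delete one key or a list of keys in a single call."""
        if not isinstance(keys, (list, tuple, set)):
            keys = [keys]
        self.delete_calls += 1
        self.keys.difference_update(keys)

# One call per entity, like entity.delete() in a loop.
one_by_one = FakeDatastore(range(100))
for key in list(one_by_one.keys):
    one_by_one.delete(key)

# One call for the whole list, like db.delete(list_of_entities).
batched = FakeDatastore(range(100))
batched.delete(list(batched.keys))
```

Both end up with an empty store, but the first makes 100 delete calls and the second makes one.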
However, this approach can run into a problem.
If there are too many entities, this error is raised:
google.appengine.api.datastore_errors.BadRequestError: cannot delete more than 500 entities in a single call
Bulk delete, part 2
Instead of deleting everything in one go, repeat the bulk delete in chunks of a few hundred.
from google.appengine.tools import bulkloader
from model import TimeLine

class TimeLineExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'TimeLine', [
            ('name', str, None),
            ('text', lambda x: x.encode('utf-8'), None),
            ('time', str, None),
        ])

    def output_entities(self, entity_generator):
        from google.appengine.ext import db
        while True:
            entities = []
            for entity in entity_generator:
                entities.append(entity)
                if len(entities) >= 300:
                    break
            print len(entities)
            if len(entities) == 0:
                break
            db.delete(entities)

exporters = [TimeLineExporter]
Collect up to about 300 entities in a list, then delete them in one go. I wrapped that in a while loop.
This avoids the error above.
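The same chunking can be written a little more compactly with itertools.islice. This is a sketch in plain Python. The helpers chunked and delete_in_batches are hypothetical names, and the delete argument stands in for db.delete.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of at most `size` items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def delete_in_batches(entities, delete, batch_size=300):
    """Call `delete` once per batch and return the batch sizes."""
    sizes = []
    for batch in chunked(entities, batch_size):
        delete(batch)
        sizes.append(len(batch))
    return sizes

# 750 fake entities from a generator, "deleted" by appending to a list.
deleted = []
sizes = delete_in_batches((n for n in range(750)), deleted.extend)
```

Because islice pulls from the same iterator each time, each batch picks up where the previous one stopped, just like the for loop with break above.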
Open issues
However, if the above is repeated for long enough, the load may get too high and the request may time out.
Ideally the work would be split across Threads, as with data upload and download, and done with some waiting in between...
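What "split across threads with some waiting" might look like, as a rough sketch only: a few worker threads take batches off a queue, delete each one and pause briefly before taking the next. Everything here (delete_worker, threaded_delete, fake_delete, the pause value) is hypothetical and uses current Python's queue and threading modules rather than anything from bulkloader.

```python
import threading
import time
from queue import Queue

def delete_worker(batches, delete, pause, lock, done):
    """Take batches off the queue, delete each, then pause briefly."""
    while True:
        batch = batches.get()
        if batch is None:      # sentinel: no more work for this worker
            return
        delete(batch)
        with lock:
            done.append(len(batch))
        time.sleep(pause)      # crude throttle between delete calls

def threaded_delete(all_batches, delete, workers=3, pause=0.01):
    """Run the batches through a small pool of throttled worker threads."""
    batches, lock, done = Queue(), threading.Lock(), []
    threads = [
        threading.Thread(target=delete_worker,
                         args=(batches, delete, pause, lock, done))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for batch in all_batches:
        batches.put(batch)
    for _ in threads:
        batches.put(None)      # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(done)

removed = set()
removed_lock = threading.Lock()

def fake_delete(batch):
    # Hypothetical stand-in for db.delete.
    with removed_lock:
        removed.update(batch)

sizes_done = threaded_delete(
    [list(range(i, min(i + 300, 750))) for i in range(0, 750, 300)],
    fake_delete)
```

A fixed sleep is the simplest possible throttle. Backing off after errors would probably be a better fit for real timeouts, but that is beyond this sketch.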
Really, the deletion should happen at the point where the data is fetched from the Datastore (that part seems to run in multiple threads), but unfortunately the only thing that can be customized here seems to be the Exporter class, and it can only work with the generator that is produced after all the data has been downloaded.
The actual querying and fetching is done by the GetEntities method of the RequestManager class, which is called from BulkExporterThread. It would be nice to hack around there somehow...
Addendum
I wrote one.
I wrote bulkdeleter.py for deleting all Datastore data on GAE/Py - すぎゃーんメモ