Jekyll2023-11-16T18:01:03+08:00https://utf7.github.io/feed.xmlutf7utf7 的个人博客/欢迎关注公众号yechaotalkutf7/Yechao Chen2023-11-16T18:01:03+08:002023-11-16T18:01:03+08:00https://utf7.github.io/2023/11/16/2022-07-20-spark-resource-used-metrics<p>大数据资源使用情况统计:</p> <p>物理内存使用率: 物理CPU使用率:</p> <p>物理 CPU使用率(白天/晚上): 物理 内存使用率(白天/晚上):</p> <p>YARN CPU使用率: YARN 内存使用率:</p> <p>YARN CPU使用率(白天/晚上): YARN 内存使用率(白天/晚上):</p> <p>Spark 相关:<br /> 每秒 Container 数量<br /> 每天 Spark 任务数量. 平均每个 Spark 任务 executor 数量. 平均每个 Spark Executor 申请的堆内存(spark.executor.memory). 平均每个 Spark Executor 申请的堆外内存(spark.executor.memoryOverhead). 平均每个 Spark Executor 申请的 Cores 数量(spark.executor.cores). 平均每个 Spark Executor Core 对应的内存比例(spark.executor.memory=8G,spark.executor.cores=4,则为2). 平均每个 Spark Executor 申请的总内存(spark.executor.memory+spark.executor.memoryOverhead). 平均 Spark Executor 堆内存使用率 sum(executor.max_heap_used*executorNum)/sum(spark.executor.memory*executorNum) 注意算法. 平均 Spark Executor 堆外内存使用率 sum(executor.max_offheap_used*executorNum)/sum(spark.executor.memoryOverhead*executorNum) 注意算法. 平均 Spark Executor 总内存使用率 sum((executor.max_offheap_used+executor.max_heap_used)*executorNum)/sum((spark.executor.memory+spark.executor.memoryOverhead)*executorNum) 注意算法(不太严谨,堆内外的峰值内存不一定同时出现).</p> <p>平均每个Spark 任务的 Task 数量. 平均每个Spark shuffle 读的数据量. 平均每个Spark shuffle 写的数据量. 
。。。。</p>utf7/Yechao Chen2023 H2 生活 OKR2023-09-03T00:00:00+08:002023-09-03T00:00:00+08:00https://utf7.github.io/2023/09/03/my-life-okr-for-2023H2<h1 id="2023-h2-生活-okr">2023 H2 生活 OKR</h1> <h2 id="o1-健康管理升级通过引入良好的健康管理机制并在日常生活中落地以改善睡眠质量增强免疫力保持精力延缓衰老提高观感使得更加健康活力">O1 【健康管理升级】通过引入良好的健康管理机制并在日常生活中落地,以改善睡眠质量,增强免疫力、保持精力,延缓衰老,提高观感,使得更加健康活力。</h2> <h3 id="kr1------通过健身运动以及控制饮食等手段降低脂肪含量将体重从125下降到120降低4-降低关节心血管心脏等健康风险-为长期可持续健康发展打下坚实基础同时提供更好的观感使得看起来更年轻">KR1 通过健身、运动以及控制饮食等手段,降低脂肪含量,将体重从125下降到120,降低4% ,降低关节、心血管、心脏等健康风险, 为长期可持续健康发展打下坚实基础,同时提供更好的观感,使得看起来更年轻。</h3> <h3 id="kr2------通过引入早睡晚起中午午休星期天补觉以及少看手机等策略来改善睡眠质量以缓解疲劳增强免疫力提高大脑活力保持精力增强免疫力以及延缓衰老为更好的生活和工作打下坚实的基础">KR2 通过引入早睡晚起、中午午休、星期天补觉以及少看手机等策略,来改善睡眠质量,以缓解疲劳,增强免疫力,提高大脑活力,保持精力,增强免疫力以及延缓衰老,为更好的生活和工作打下坚实的基础。</h3> <h3 id="kr3------通过撸铁游泳散步以及进食高蛋白食品减少糖分饮料咖啡因摄入提高肌肉含量提高不限于胸大肌腹肌三角肌斜方肌背阔肌的维度和颗粒度使其更加有锐度和颗粒感同时提高肺活量和改善心血管健康增强免疫力改善心情使得身体更健康">KR3 通过撸铁、游泳、散步以及进食高蛋白食品、减少糖分、饮料、咖啡因摄入,提高肌肉含量,提高不限于胸大肌、腹肌、三角肌、斜方肌、背阔肌的维度和颗粒度,使其更加有锐度和颗粒感,同时提高肺活量和改善心血管健康,增强免疫力,改善心情,使得身体更健康。</h3> <h2 id="o2------家庭管理升级响应号召积极生娃改善家庭关系关注小孩学习成长提高幸福感">O2 【家庭管理升级】响应号召,积极生娃;改善家庭关系,关注小孩学习成长,提高幸福感。</h2> <h3 id="kr1------投入更多时间在小孩陪伴上一块陪伴其学习生活以及健康快乐成长">KR1 投入更多时间在小孩陪伴上,一块陪伴其学习、生活以及健康快乐成长。</h3> <h3 id="kr2------通过积极响应国家二胎号召为国家以及人类的可持续发展做贡献积极生娃丰富家庭生活增加家庭成员之间的互动和交流缓解独生子女的压力避免感到孤独提高娃的协作精神">KR2 通过积极响应国家二胎号召,为国家以及人类的可持续发展做贡献,积极生娃,丰富家庭生活,增加家庭成员之间的互动和交流,缓解独生子女的压力,避免感到孤独,提高娃的协作精神。</h3> <h3 id="kr3------每月至少做4次饭4次市内或者周边出行游玩完成一次旅游出行">KR3 每月至少做4次饭,4次市内或者周边出行游玩,完成一次旅游出行。</h3> <h2 id="o3---生活管理升级通过完成装修夯实家居基础环境为家人提供良好的生活和学习环境改善财务状况">O3 【生活管理升级】通过完成装修,夯实家居基础环境,为家人提供良好的生活和学习环境;改善财务状况。</h2> <h3 id="kr1------引入水电木瓦油等工艺高质量完成硬装工作满足水电燃气保暖制冷安全防护等基本需求">KR1 引入水电木瓦油等工艺,高质量完成硬装工作,满足水、电、燃气、保暖、制冷、安全防护等基本需求。</h3> <h3 id="kr2------通过引入全屋定制淋浴房马桶花洒等来满足基本日常储物洗簌等居住诉求">KR2 通过引入全屋定制、淋浴房、马桶、花洒等来满足基本日常储物、洗簌等居住诉求。</h3> <h3 id="kr3------通过购买家具家电以及引入软装搭配来满足日常家庭居住休闲娱乐诉求同时改善家居观感">KR3 通过购买家具、家电以及引入软装搭配来满足日常家庭居住、休闲、娱乐诉求,同时改善家居观感。</h3> <h3 id="kr4----通过开源节流主要手段是节流改善家庭入不敷出的财务状况提高家庭抗风险能力和长期可持续发展">KR4 
通过开源节流(主要手段是节流),改善家庭入不敷出的财务状况,提高家庭抗风险能力,保障长期可持续发展。</h3> <h2 id="o4-无用技能升级调研至少1个无用技能比如乐器拳击摩托车等考虑学习掌握一个无用技能">O4 【无用技能升级】调研至少1个无用技能,比如乐器、拳击、摩托车等,考虑学习掌握一个无用技能。</h2> <h3 id="kr1----调研乐器口琴吉他等拳击等无用技能丰富日常生活">KR1 调研乐器(口琴、吉他等)、拳击等无用技能,丰富日常生活。</h3>utf7/Yechao Chen2023 H2 生活 OKR程序员书写用词规范2022-06-12T00:00:00+08:002022-06-12T00:00:00+08:00https://utf7.github.io/2022/06/12/word-specification<p>书写用词规范,特别需要注意大小写组合,否则会显得不专业:比如 hbase,mysql,sla,hadoop,clickhouse 都是不推荐的,常见的名词需要注意大小写</p> <p>常见的名词规范如下:</p> <p>Hadoop<br /> HDFS<br /> YARN<br /> MapReduce 或者 MR<br /> EMR<br /> Hive<br /> HBase<br /> Cassandra<br /> MySQL<br /> Spark<br /> Flink<br /> Kafka<br /> TensorFlow 或者 TF<br /> Elasticsearch 或者 ES<br /> Kylin<br /> Hudi<br /> Iceberg<br /> Parquet<br /> ORC<br /> Presto<br /> Trino<br /> ClickHouse 或者 CK<br /> Doris<br /> StarRocks<br /> Impala<br /> ZooKeeper 或者 ZK<br /> RocksDB<br /> Java<br /> Linux <br /> CentOS<br /> Docker<br /> Kubernetes 或者 K8s<br /> SLA<br /> Tableau</p>utf7/Yechao Chen程序员书写用词规范将github 代码push 到自己的仓库并保留commit log2022-03-29T00:00:00+08:002022-03-29T00:00:00+08:00https://utf7.github.io/2022/03/29/push-remote-repo-to-your-git<h2 id="1将github-某个版本放到自己的仓库保留commit-log">1.将github 某个版本放到自己的仓库,保留commit log</h2> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone git@github.com:apache/hbase.git
cd hbase/
git tag
git checkout rel/2.4.11
git remote set-url origin ssh://git@yourgit:port/bigdata/HBase
git remote -v
git branch test
git checkout test
git push origin test:2.4.11
</code></pre></div></div>utf7/Yechao Chen将github 代码push 到自己的仓库并保留commit log给容器镜像瘦身的一个小技巧2022-03-14T00:00:00+08:002022-03-14T00:00:00+08:00https://utf7.github.io/2022/03/14/reduce-docker-image-size<h2 id="1镜像瘦身">1.镜像瘦身</h2> <p>执行完 <code class="language-plaintext highlighter-rouge">yum</code> 命令以后,可以删除 <code class="language-plaintext highlighter-rouge">cache</code></p> <div class="language-plaintext highlighter-rouge"><div 
class="highlight"><pre class="highlight"><code> yum -q clean all &amp;&amp; \ rm -rf /var/cache/yum &amp;&amp; \ </code></pre></div></div> <p>如:</p> <p>https://github.com/utf7/Dockerfile/blob/master/jdk/jdk11/Dockerfile</p> <p>这个镜像清理yum cache 可以减少 130多M</p> <p><img width="998" alt="镜像大小的比较" src="/images/posts/k8s/dockerfile/clear-yum.jpg" /></p>utf7/Yechao Chen给容器镜像瘦身的一个小tipSpark Shuffle Service 配置不合理导致的任务失败以及NodeManager OOM 问题分析2021-05-12T00:00:00+08:002021-05-12T00:00:00+08:00https://utf7.github.io/2021/05/12/spark-shuffle-service-problems<h2 id="1">1.</h2> <p>最近集群有 Spark 任务会出现失败,查看 Spark 日志发现会有 Executor 挂掉的信息。</p> <p>一般 Executor 挂掉的话,通常会是几种情况:</p> <p>1、Executor 自己挂了,比如申请内存不够等,executor 运行过程中 OOM等。</p> <p>2、Executor 所在的节点出现问题,比如宕机。</p> <p>3、Executor 所在节点的NodeManager 挂了。因为Executor 都是NodeManager 进程 fork 出来的,NodeManager 挂的话,Executor 通常也会挂掉</p> <p>4、操作系统内存紧张触发了OS的OOM Killer 功能,被操作系统干掉了。</p> <p>5、Executor 被NodeManager watch 到内存超过使用的界限等原因,被NodeManager 给干掉了。</p> <p>6、人为 kill (可能性较小)</p> <h2 id="2">2.</h2> <p>查看 Spark 日志并没有发现 Executor 有 OOM 的情况。</p> <p>登陆 Executor 所在的节点,发现机器也正常。查看 NodeManager 情况</p> <p>ps -ef|grep NodeManager|grep -v grep 查看发现 NodeManager 也在。后来仔细看了一下 NodeManager 的启动时间发现不对,是最近的时间,证明 NodeManager 之前挂过,现在又启动了</p> <p>(我们有 NodeManager 自动拉起的功能,所以挂掉以后立刻被拉起来了)</p> <p>check NodeManager 日志</p> <p>发现是 NodeManager OOM了</p> <pre><code class="language-log"> WARN org.spark_project.io.netty.channel.AbstractChannelHandlerContext: An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception: java.lang.OutOfMemoryError: GC overhead limit exceeded WARN org.spark_project.io.netty.channel.AbstractChannelHandlerContext: An exception ' java.lang.OutOfMemoryError: Java heap space FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[DeletionService #3,5,main] threw an Error. Shutting down now... 
java.lang.OutOfMemoryError: Java heap space </code></pre> <p>OK,那么到这里原因大概就知道了</p> <p>NodeManager OOM 导致的 Executor 挂掉。</p> <p>那么为什么会 OOM 呢?</p> <p>ps -ef|grep NodeManager|grep -v grep check 了一下 NodeManager 进程,发现设置了 4096m 的堆内存(-Xmx4096m)</p> <p>根据之前的经验,线上应用 NodeManager 推荐内存大概是 8-12G 比较靠谱,4G 确实有点低了。</p> <p>是不是设置一下 8-12G 内存就 ok 了呢?当然我们还是要分析一下为什么会 OOM,到底什么占用了内存?是不是有内存泄露?是个例还是很多节点类似?</p> <p>使用 salt 全量统计了一下所有 NodeManager 日志关键字段,发现并不是个例,有其他节点也存在 NodeManager OOM 的情况,找到另外一个节点来 check 一下(该节点其实 GC 不过来,已经卡死了)。</p> <h2 id="3">3.</h2> <p>首先是保留一些信息:</p> <p>1、保留堆栈信息:</p> <pre><code class="language-shell">jstack 31549 &gt; nm-31549.jstack1;
sleep 2s;
jstack 31549 &gt; nm-31549.jstack2;
sleep 5s;
jstack 31549 &gt; nm-31549.jstack3;
</code></pre> <p>2、保留句柄:</p> <pre><code class="language-shell">lsof -p 31549 &gt; nm-31549.fd
</code></pre> <p>3、check 一下 GC:用 jstat -gcutil 分析 GC 情况。</p> <p>(图略)</p> <p>4、jmap 看一下内存情况:</p> <pre><code class="language-shell">jmap -histo 31549 &gt; jmap-nm-31549.jmap
</code></pre> <p>5、dump 内存:</p> <pre><code class="language-shell">jmap -dump:format=b,file=nm-31549.heap.bin 31549
</code></pre> <p>check 下 jmap 的信息</p> <p>(图略)</p> <p>发现有大量的 LocalCache 对象以及 ShuffleIndexInformation</p> <p>ps:</p> <p>[B 是指 byte[] 数组</p> <p>[C 是指 char[] 数组</p> <p>一般 Java 进程 jmap 大部分都是 byte[],char[],String 这些,是比较正常的,通常看不出来什么,基本上没啥关系。</p> <p>主要怀疑对象在 ShuffleIndexInformation 和 LocalCache 上面</p> <p>check 了一下代码,发现 Hadoop 中并没有这个代码。</p> <p>从名字看这块肯定是 Shuffle 相关的,所以自然而然就想到了 Spark Shuffle Service。这里简单说一下,通常 Shuffle 是在 Executor 中的,executor 挂的话,就会导致 shuffle 数据丢失、任务失败,所以 spark 后来做了 Shuffle Service,原理是将 Shuffle 放到 NodeManager 中来做,NodeManager 一般不忙,用来做 Shuffle 也是可以的,这样的话,Executor 挂了,NodeManager 其实还在,稳定性有一定提高;当然现在还有一个思路是将 shuffle 做成单独的服务放在外面,这是另外的一个话题了。</p> <p>查看 Spark 代码,Spark Shuffle Service 的代码在 NodeManager 中跑,ExternalShuffleBlockResolver 内部有一个 <code class="language-plaintext highlighter-rouge">LoadingCache&lt;File, ShuffleIndexInformation&gt; shuffleIndexCache</code>,该 cache 主要用于 cache shuffle 的 index 信息。</p> <p>cache 默认配置是 100m,由参数 spark.shuffle.service.index.cache.size 来配置。</p> <p>查看当前配置发现是 4096m</p> <pre><code class="language-shell">grep -A 1 "spark.shuffle.service.index.cache.size" /etc/apps/hadoop-conf/yarn-site.xml
</code></pre> 
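<p>除了上面的 grep,也可以用一小段脚本从 yarn-site.xml 里把这个配置解析出来。下面是一个示意脚本(假设配置文件就是上面的 /etc/apps/hadoop-conf/yarn-site.xml;为了能独立运行,这里用内置的示例字符串代替读文件):</p>

```python
# 从 yarn-site.xml 里解析某个配置项的小脚本(示意)
# 实际使用时可以把 SAMPLE 换成 open("/etc/apps/hadoop-conf/yarn-site.xml").read()
import xml.etree.ElementTree as ET

SAMPLE = """<configuration>
  <property>
    <name>spark.shuffle.service.index.cache.size</name>
    <value>4096m</value>
  </property>
</configuration>"""


def get_conf(xml_text, key):
    # 遍历所有 <property> 节点,返回 name 匹配的 value;找不到返回 None
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return None


print(get_conf(SAMPLE, "spark.shuffle.service.index.cache.size"))  # 4096m
```

<p>(仅为示意,线上直接用前面的 grep 命令也完全够用。)</p>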
<pre><code class="language-xml">&lt;name&gt;spark.shuffle.service.index.cache.size&lt;/name&gt;
&lt;value&gt;4096m&lt;/value&gt;
</code></pre> <p>当前 NodeManager 堆内存一共也就 4096m,所以当 cache 到一定程度的时候,OOM 就可想而知了。</p> <p>cache 相关代码设置如下:</p> <pre><code class="language-java">ExternalShuffleBlockResolver(
    TransportConf conf,
    File registeredExecutorFile,
    Executor directoryCleaner) throws IOException {
  this.conf = conf;
  this.registeredExecutorFile = registeredExecutorFile;
  String indexCacheSize = conf.get("spark.shuffle.service.index.cache.size", "100m");
  CacheLoader&lt;File, ShuffleIndexInformation&gt; indexCacheLoader =
      new CacheLoader&lt;File, ShuffleIndexInformation&gt;() {
        public ShuffleIndexInformation load(File file) throws IOException {
          return new ShuffleIndexInformation(file);
        }
      };
  shuffleIndexCache = CacheBuilder.newBuilder()
      .maximumWeight(JavaUtils.byteStringAsBytes(indexCacheSize))
      .weigher(new Weigher&lt;File, ShuffleIndexInformation&gt;() {
        public int weigh(File file, ShuffleIndexInformation indexInfo) {
          return indexInfo.getSize(); // 这块计算其实是有点问题的
        }
      })
      .build(indexCacheLoader);
</code></pre> <p>对 dump 出来的内存进行分析,发现与上面的结论是一致的:</p> <p>发现4G内存,shuffleIndexCache 就占用了3.7G</p> <p>(图略)</p> <p>(图略)</p> <p>check 了代码,感觉还是有一些可以改进的地方:</p> <p>具体见注释</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shuffleIndexCache = CacheBuilder.newBuilder()
    .maximumWeight(JavaUtils.byteStringAsBytes(indexCacheSize))
    .weigher(new Weigher&lt;File, ShuffleIndexInformation&gt;() {
      public int weigh(File file, ShuffleIndexInformation indexInfo) {
        // return indexInfo.getSize();
        // 这块计算其实是有点问题的,只计算了 indexInfo 文件的 size,其实还会有 overhead
        // 当然这次的问题主要应该不是 overhead,而是堆内存一共就配置了 4096m,而 cache 也配置了这么大。
        // 实际上 cache 是会超过配置的值的,比如文件多,但是 index 文件内容比较少的情况下。
        // 对这块代码做了一点小改动,具体可以参考 HBase 等代码来实际计算 Java 对象大小。
        // 不过这块没有太大必要算得那么细,但是还是需要考虑 overhead
        // 当然 index.cache.size 设置成 256m 的话,对于内存 8G 的 NM,即使有 8 倍的 overhead,其实问题也不大。
        // 我们这里设置成了 512m
        return file.getAbsolutePath().length() + 128 + indexInfo.getSize();
      }
    })
</code></pre></div></div> <p>最终做了如下三个修改:</p> <p>1.修改 
spark.shuffle.service.index.cache.size=512m</p> <p>2.同时修改 NodeManager 堆内存为10240m(10G)</p> <p>3.同时对 shuffleIndexCache 内存占用的计算做了一些小小的修正(这里不是主要原因,这次主要还是内存配置与 cache.size 不匹配导致的;但即使内存配置比较大,如果 shuffleIndexCache 配置也比较大,也有可能出现 shuffleIndexCache 配置 2g、实际占用远大于 2g 的情况,所以规避一下风险)。</p> <p>修改以后任务稳定了很多。</p> <p>观察了一段时间,发现再也没有 NodeManager OOM 了</p> <p>后来搜了一下社区,确实有类似的关于 shuffleIndexCache 内存计算的改进。相关 JIRA:</p> <p>https://issues.apache.org/jira/browse/SPARK-21501</p> <p>https://issues.apache.org/jira/browse/SPARK-33206</p>utf7/Yechao ChenSpark Shuffle Service 配置不合理导致的任务失败以及NodeManager OOM 问题分析salt 中执行 awk 注意事项2020-11-10T00:00:00+08:002020-11-10T00:00:00+08:00https://utf7.github.io/2020/11/10/salt-exec-awk-failed-case<p>使用 <code class="language-plaintext highlighter-rouge">salt</code> 执行 awk 时,发现会执行失败</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt "*" cmd.run " ps -ef|grep NodeManager|grep -v grep|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk '{print $2}'|xargs lsof -p|grep REG|grep /usr/lib/locale/locale-archive" </code></pre></div></div> <p>会发现执行失败,需要在 $2 前面加转义符号:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt "*" cmd.run " ps -ef|grep NodeManager|grep -v grep|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk '{print \$2}'|xargs lsof -p|grep REG|grep /usr/lib/locale/locale-archive" </code></pre></div></div>utf7/Yechao Chensalt 中执行 awk 注意事项salt 中关于locale的问题2020-11-10T00:00:00+08:002020-11-10T00:00:00+08:00https://utf7.github.io/2020/11/10/salt-locale-problem<h2 id="记录一次诡异的-hivespark-乱码问题">记录一次诡异的 Hive/Spark 乱码问题</h2> <p>最近遇到一个诡异的事情,Hive/Spark 写入汉字乱码。</p> <p>检查 Hive/Spark 的各种参数也没有发现有什么变动,客户端配置也是一样的。</p> <p>有一个诡异的地方是</p> <p><code class="language-plaintext highlighter-rouge">spark-sql -f xxx.sql</code>,如果 <code class="language-plaintext highlighter-rouge">xxx.sql</code> 中有汉字,则也会有问题。</p> <p><code 
class="language-plaintext highlighter-rouge">spark-submit client</code> 模式则没有问题,<code class="language-plaintext highlighter-rouge">cluster</code> 模式则会有问题。</p> <p>这里面唯一区别是 <code class="language-plaintext highlighter-rouge">client</code> 的 <code class="language-plaintext highlighter-rouge">Driver</code> 在 本地,<code class="language-plaintext highlighter-rouge">cluster</code> 在 <code class="language-plaintext highlighter-rouge">YARN(NodeManager)</code> 节点。</p> <p>另外 <code class="language-plaintext highlighter-rouge">Spark/MR</code> 读写数据也有问题,不过 <code class="language-plaintext highlighter-rouge">hive cli</code> 不启动 <code class="language-plaintext highlighter-rouge">MR</code> 的话,则没有问题。。。比如 <code class="language-plaintext highlighter-rouge">select * from t limit 10</code> 这种不会启动 <code class="language-plaintext highlighter-rouge">MR</code>,则是本地运行的。</p> <p>分析来看,不管是 <code class="language-plaintext highlighter-rouge">Hive/Spark</code> 只要任务跑在 YARN 上面则会有问题。</p> <p>所以这块感觉和 <code class="language-plaintext highlighter-rouge">NodeManager</code> 有关系,因为 <code class="language-plaintext highlighter-rouge">on YARN</code> 的进程都是 <code class="language-plaintext highlighter-rouge">NodeManager fork</code> 出来的。</p> <p>因为最近在做 <code class="language-plaintext highlighter-rouge">Zstandard</code> 修改了代码,所以同事怀疑是否代码引起的。我反复check了代码,并没有涉及到字符编码相关的内容。不应该出现此类问题。</p> <p>由于修改了 <code class="language-plaintext highlighter-rouge">Hadoop</code> 代码,需要重启 <code class="language-plaintext highlighter-rouge">NodeManager</code> ,最近变动就是重启了 <code class="language-plaintext highlighter-rouge">NodeManager</code>以及添加了 <code class="language-plaintext highlighter-rouge">Zstandard</code> 的支持。</p> <p>登陆出现问题的主机</p> <p>执行 <code class="language-plaintext highlighter-rouge">locale</code></p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nv">LANG</span><span class="o">=</span>en_US.UTF-8 <span class="nv">LCCTYPE</span><span class="o">=</span><span 
class="s2">"en_US.UTF-8"</span> <span class="nv">LC_NUMERIC</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_TIME</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_COLLATE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_MONETARY</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_MESSAGES</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_PAPER</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_NAME</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_ADDRESS</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_TELEPHONE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_MEASUREMENT</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_IDENTIFICATION</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_ALL</span><span class="o">=</span> </code></pre></div></div> <p>貌似也没啥问题。百思不得其解:</p> <p>此时赶快将节点回滚到之前的版本,然后慢慢重启 <code class="language-plaintext highlighter-rouge">NodeManager</code>,发现确实问题渐渐修复了。</p> <p>还有另外一个现象是并不是所有内容都是乱码的,这个猜测和 <code class="language-plaintext highlighter-rouge">Map/Reduce</code> 以及 <code class="language-plaintext highlighter-rouge">Spark Executor</code> 跑在不同的 <code class="language-plaintext highlighter-rouge">NodeManager</code> 节点有关系。</p> <p>这时候根据多年的经验直觉,赶快保存的异常的节点 <code class="language-plaintext highlighter-rouge">NodeManager</code> 保留的 <code class="language-plaintext highlighter-rouge">lsof</code> 句柄信息,</p> <p>diff 一下进程加载的 fd,是不是少加载了什么导致的,比较2个进程加载的fd 发现有一个</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps <span class="nt">-ef</span>|grep <span class="nt">-v</span> <span class="nb">grep</span>|grep 
org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk <span class="s1">'{print $2}'</span>|xargs lsof <span class="nt">-p</span> |sort <span class="nt">-n</span> <span class="o">&gt;</span> lsof_before </code></pre></div></div> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps <span class="nt">-ef</span>|grep <span class="nt">-v</span> <span class="nb">grep</span>|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk <span class="s1">'{print $2}'</span>|xargs lsof <span class="nt">-p</span> |sort <span class="nt">-n</span> <span class="o">&gt;</span> lsof_after. </code></pre></div></div> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff lsof_before lsof_after </code></pre></div></div> <p>然后发现少了</p> <p><code class="language-plaintext highlighter-rouge">/usr/lib/locale/locale-archive</code></p> <p>咦,什么鬼。。。为啥没有这个,分明我登陆节点执行,<code class="language-plaintext highlighter-rouge">locale</code> 输出是正常的。</p> <p>所以就把问题定位点放在 <code class="language-plaintext highlighter-rouge">/usr/lib/locale/locale-archive</code> 为什么少了这个地方。</p> <p>出问题的 NodeManager 启动命令如下:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt <span class="s1">'nm-*'</span> cmd.run <span class="s1">'sudo -u yarn bash -c "source /etc/profile;yarn-daemon.sh start nodemanager"'</span> </code></pre></div></div> <p>我在一个节点测试了一下,使用</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo</span> <span class="nt">-u</span> yarn bash <span class="nt">-c</span> <span class="s2">"source /etc/profile;yarn-daemon.sh start nodemanager"</span> </code></pre></div></div> <p>启动一个 NM 测试,发现并没有什么异常。。。。太诡异了。。。</p> <p>节点还没有重启完,我赶快把所有的节点检查了一下,是否都丢失 <code class="language-plaintext highlighter-rouge">/usr/lib/locale/locale-archive</code></p> <div class="language-shell highlighter-rouge"><div 
class="highlight"><pre class="highlight"><code>salt <span class="s2">"*"</span> cmd.run <span class="s2">" ps -ef|grep -v grep|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk '{print </span><span class="se">\$</span><span class="s2">2}'|xargs lsof -p|grep REG|grep /usr/lib/locale/locale-archive"</span> </code></pre></div></div> <p>发现我之前启动的节点都是没有 <code class="language-plaintext highlighter-rouge">locale-archive</code> 的,而同事新启动的则是没有问题的。莫不是启动方式有问题???诡异。<br /> 执行: <code class="language-plaintext highlighter-rouge">salt -E "nm-*" cmd.run "locale"</code></p> <pre><code class="language-plaintext">LANG=en_US.UTF-8
LC_CTYPE=C
LC_NUMERIC=C
LC_TIME=C
LC_COLLATE=C
LC_MONETARY=C
LC_MESSAGES=C
LC_PAPER=C
LC_NAME=C
LC_ADDRESS=C
LC_TELEPHONE=C
LC_MEASUREMENT=C
LC_IDENTIFICATION=C
LC_ALL=
</code></pre> <p>发现 <code class="language-plaintext highlighter-rouge">locale</code> 很多输出都是<code class="language-plaintext highlighter-rouge">C</code>,这个与我之前在机器上面执行的结果是不一致的。</p> <p>所以应该是 <code class="language-plaintext highlighter-rouge">salt</code> 导致的,问了一下同事是如何重启的,他是 kill 掉 <code class="language-plaintext highlighter-rouge">NodeManager</code>,并没有手动启动,而是让监控脚本自动拉起的。</p> <p>而我当时为了启动快一些,是直接使用 <code class="language-plaintext highlighter-rouge">salt</code> 批量 <code class="language-plaintext highlighter-rouge">stop</code> 然后 <code class="language-plaintext highlighter-rouge">start</code> 的。</p> <p>至此原因排查出来了,原来是 <code class="language-plaintext highlighter-rouge">salt</code> 的锅。</p> <p>后来去github 把salt 代码拉了下来,看了一下。问题出在这边:</p> <p><code class="language-plaintext highlighter-rouge">salt</code> 默认会 <code class="language-plaintext highlighter-rouge">reset_system_locale</code>,代码截图如下:</p> <p><img src="/images/posts/salt/reset_system_locale_code1.png" alt="reset_system_locale_code1" /></p> <p><img src="/images/posts/salt/reset_system_locale_code2.png" alt="reset_system_locale_code2" /></p> <p>个人感觉这边 salt 处理得很有问题,虽然提供了参数,但我仍然认为是个 bug,不知道为啥会考虑默认重置 locale。</p> <p>很坑!!!</p> <p>看了一下代码,可以执行<code 
class="language-plaintext highlighter-rouge">salt</code> 命令的时候,添加参数 <code class="language-plaintext highlighter-rouge">reset_system_locale=False</code> 解决:</p> <p>修改 <code class="language-plaintext highlighter-rouge">NodeManger</code> 启动脚本,<code class="language-plaintext highlighter-rouge">Hive/Spark</code> 乱码问题解决</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt <span class="s1">'nm-*'</span> cmd.run <span class="nv">reset_system_locale</span><span class="o">=</span>False <span class="s1">'sudo -u yarn bash -c "source /etc/profile;yarn-daemon.sh start nodemanager"'</span> </code></pre></div></div> <p>验证:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt <span class="s2">"nm-*"</span> cmd.run <span class="s2">" ps -ef|grep -v grep|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk '{print </span><span class="se">\$</span><span class="s2">2}'|xargs lsof -p|grep REG|grep /usr/lib/locale/locale-archive"</span> </code></pre></div></div> <p>如果有使用 <code class="language-plaintext highlighter-rouge">salt</code> 的同学,具体可以使用如下命令来测试 测试这个问题:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt <span class="s1">'nm-*'</span> cmd.run <span class="nv">reset_system_locale</span><span class="o">=</span>False <span class="s1">'locale'</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> salt 'nm-*' cmd.run 'locale' </code></pre></div></div> <p>PS: 关于 <code class="language-plaintext highlighter-rouge">locale</code> : 可参考如下几个链接:</p> <p>1.https://wiki.archlinux.org/index.php/locale<br /> 2.https://man7.org/linux/man-pages/man1/localedef.1.html<br /> 3.https://linuxhint.com/locales_debian/</p>utf7/Yechao Chensalt 中关于locale的问题Spark SQL 正确的传递 Hive 
参数2020-10-18T00:00:00+08:002020-10-18T00:00:00+08:00https://utf7.github.io/2020/10/18/spark-sql-params<p>使用 <code class="language-plaintext highlighter-rouge">spark-sql</code> 导入动态分区时,出现来错误</p> <pre><code class="language-log">Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1793, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1793.; </code></pre> <p>提示增加 <code class="language-plaintext highlighter-rouge">hive.exec.max.dynamic.partitions</code> 的值, 可是我明明通过 <code class="language-plaintext highlighter-rouge">--conf hive.exec.max.dynamic.partitions=1000000</code> 为什么还是报错了?难道是参数没有传递进去??? 执行 <code class="language-plaintext highlighter-rouge">spark-sql --help</code> ,通过查看 <code class="language-plaintext highlighter-rouge">spark-sql</code> 的帮助</p> <p>原来 hive的参数是通过 <code class="language-plaintext highlighter-rouge">--hiveconf</code> 来传递的而不是 <code class="language-plaintext highlighter-rouge">--conf</code></p> <p>使用 <code class="language-plaintext highlighter-rouge">--conf</code> 来传递 <code class="language-plaintext highlighter-rouge">spark</code> 参数 使用 <code class="language-plaintext highlighter-rouge">--hiveconf</code> 来传递 <code class="language-plaintext highlighter-rouge">hive</code> 参数 修改后为:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">nohup </span>spark-sql <span class="nt">--master</span> yarn <span class="nt">--deploy-mode</span> client <span class="nt">--queue</span> root.test.myqueue <span class="nt">--driver-memory</span> 24g <span class="nt">--executor-cores</span> 2 <span class="nt">--executor-memory</span> 4g <span class="nt">--num-executors</span> 256 <span class="nt">--conf</span> spark.driver.memoryOverhead<span class="o">=</span>1g <span class="nt">--conf</span> spark.executor.memoryOverhead<span class="o">=</span>1g <span class="nt">--conf</span> spark.speculation<span 
class="o">=</span><span class="nb">false</span> <span class="nt">--conf</span> spark.driver.maxResultSize<span class="o">=</span>12g <span class="nt">--conf</span> spark.sql.hive.filesourcePartitionFileCacheSize<span class="o">=</span>4621440000 <span class="nt">--conf</span> spark.sql.files.maxPartitionBytes<span class="o">=</span>268435456 <span class="nt">--conf</span> spark.sql.files.openCostInBytes<span class="o">=</span>0 <span class="nt">--conf</span> spark.sql.shuffle.partitions<span class="o">=</span>512 <span class="nt">--hiveconf</span> hive.exec.max.dynamic.partitions<span class="o">=</span>1000000 <span class="nt">--hiveconf</span> hive.exec.max.dynamic.partitions.pernode<span class="o">=</span>100000 <span class="nt">--hiveconf</span> hive.exec.max.created.files<span class="o">=</span>1000000 <span class="nt">-S</span> <span class="nt">-e</span> <span class="s2">"insert overwrite table tt partition(day,hour) select * from table1 where day&lt;='2020-10-11' distribute by day,hour , cast( floor(rand() * 8) as int);"</span> <span class="o">&gt;</span> merge.log 2&gt;&amp;1 &amp; </code></pre></div></div>utf7/Yechao ChenSpark SQL 正确的传递 Hive 参数在 Spark SQL 使用 REPARTITION Hint 来减少小文件输出2020-10-08T00:00:00+08:002020-10-08T00:00:00+08:00https://utf7.github.io/2020/10/08/use-spark-repartition-hint-to-reduce-small-files<h2 id="1-问题">1. 问题</h2> <p>公司数仓业务有一个 <code class="language-plaintext highlighter-rouge">sql</code> 任务,每天会产生大量的小文件,每个文件只有几百 <code class="language-plaintext highlighter-rouge">KB</code>~几 <code class="language-plaintext highlighter-rouge">M</code> 大小,小文件过多会对 <code class="language-plaintext highlighter-rouge">HDFS</code> 性能造成比较大的影响,同时也影响数据的读写性能(Spark 任务某些情况下会缓存文件信息),虽然开发了小文件合并工具会去定期合并小文件,但还是想从源头上来解决这个问题,尽量生成比较合适的文件大小,而不是事后补救。</p> <p>下图是优化前一天 18 点这个分区的数据,可以看出小文件问题明显。</p> <p><img src="/images/posts/spark/spark-repartition-small-files/small-files.png" alt="small-files" /></p> <h2 id="2-分析">2. 
分析</h2> <p>思路:</p> <p>1、统计每次生成的文件大小</p> <p>2、每个小文件的大小</p> <p>3、预期文件大小</p> <p>4、找到优化方法,使其满足预期大小</p> <p>统计一下每个分区的数据量,差不多每个小时 <code class="language-plaintext highlighter-rouge">1-6G</code> 数据</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span> hdfs dfs <span class="nt">-du</span> <span class="nt">-h</span> hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/ 3.4 G 10.2 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>00 2.1 G 6.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>01 1.4 G 4.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>02 1.1 G 3.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>03 1.2 G 3.6 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>04 1.8 G 5.5 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>05 3.1 G 9.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>06 3.8 G 11.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>07 3.8 G 11.5 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>08 3.9 G 11.6 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>09 4.1 G 12.2 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>10 4.6 G 13.7 G 
hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>11 5.4 G 16.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>12 4.7 G 14.0 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>13 4.1 G 12.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>14 4.1 G 12.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>15 4.1 G 12.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>16 4.5 G 13.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>17 4.6 G 13.8 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18 5.0 G 15.1 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>19 5.5 G 16.6 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>20 6.1 G 18.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>21 6.2 G 18.7 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>22 5.0 G 15.1 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>23 </code></pre></div></div> <p>使用 <code class="language-plaintext highlighter-rouge">hour=18</code> 这个分区举例:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ 
</span> hdfs dfs <span class="nt">-ls</span> <span class="nt">-h</span> <span class="nt">-R</span> hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18|wc <span class="nt">-l</span> 800 </code></pre></div></div> <p>可以看出有 <code class="language-plaintext highlighter-rouge">800</code> 个文件。</p> <p>详细看一下文件大小,有一半是 <code class="language-plaintext highlighter-rouge">9.2M</code> 左右和一半<code class="language-plaintext highlighter-rouge">2.6M</code>左右(实际是由 UNION ALL 分别生成 400个 ),其他就不列出来了.</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>9.2 M 27.6 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00397-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 9.3 M 27.8 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00398-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 9.2 M 27.5 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00399-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 2.6 M 7.7 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00400-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 2.6 M 7.7 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00401-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 </code></pre></div></div> <p>文件由 <code class="language-plaintext highlighter-rouge">SQL ETL</code> 生成,清洗入库的<code class="language-plaintext highlighter-rouge">sql</code> 如下:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="n">overwrite</span> <span class="k">TABLE</span> <span 
class="n">testdb</span><span class="p">.</span><span class="n">testtable</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="n">p1</span><span class="p">,</span> <span class="n">dt</span><span class="p">,</span> <span class="n">hour</span><span class="p">)</span> <span class="k">SELECT</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">UNION</span> <span class="k">ALL</span> <span class="k">SELECT</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t2</span> </code></pre></div></div> <p>This is a very typical workload: a scheduled job loads data into the partition for a given point in time.</p> <p>Two important facts can be read from the SQL above:</p> <p>1. The table has three partition columns, <code class="language-plaintext highlighter-rouge">p1,dt,hour</code>, where <code class="language-plaintext highlighter-rouge">dt</code> is the date and <code class="language-plaintext highlighter-rouge">hour</code> is the hour.</p> <p>2. Data is loaded with <code class="language-plaintext highlighter-rouge">insert overwrite SELECT</code>, and there is <code class="language-plaintext highlighter-rouge">1</code> <code class="language-plaintext highlighter-rouge">UNION ALL</code>.</p> <p>Each SELECT apparently ended up with 400 partitions, so the two UNION ALL branches together wrote 800 files.</p> <p>Ideally, each file should stay around 100-200M (slightly below the HDFS BLOCK size is recommended); for example, with a 128MB HDFS BLOCK, aim for roughly 100M per file.</p> <p>The table currently receives 1-6G of data per hour.</p> <h2 id="3-优化">3. 
Optimization</h2> <p>Spark SQL supports the <code class="language-plaintext highlighter-rouge">/*+ REPARTITION(N) */</code> hint to <code class="language-plaintext highlighter-rouge">repartition</code> the data.</p> <p>Based on the 1-6 G of data per hourly partition and the ~100M target file size, we set <code class="language-plaintext highlighter-rouge">40/20 PARTITION</code>s for the two branches:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="n">overwrite</span> <span class="k">TABLE</span> <span class="n">testdb</span><span class="p">.</span><span class="n">testtable</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="n">p1</span><span class="p">,</span> <span class="n">dt</span><span class="p">,</span> <span class="n">hour</span><span class="p">)</span> <span class="k">SELECT</span> <span class="cm">/*+ REPARTITION(40) */</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">UNION</span> <span class="k">ALL</span> <span class="k">SELECT</span> <span class="cm">/*+ REPARTITION(20) */</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t2</span> </code></pre></div></div> <p>After the optimization, the newly generated files are clearly larger, all around 95M; the small-file problem is solved.</p> <p><img src="/images/posts/spark/spark-repartition-small-files/big-files.png" alt="small-files" /></p> <p>One odd observation: after the optimization the data takes roughly <code class="language-plaintext highlighter-rouge">5%-10%</code> more space. The table format is <code class="language-plaintext highlighter-rouge">ORC</code>; intuitively, merging <code class="language-plaintext highlighter-rouge">ORC</code> files should make the data slightly smaller, or at least leave it unchanged, since ORC keeps statistics and can encode and compress runs of similar values very effectively. After some digging, the cause appears to be that <code class="language-plaintext highlighter-rouge">repartition</code> performs a full <code class="language-plaintext highlighter-rouge">shuffle</code>, which scatters similar rows apart and hurts <code class="language-plaintext highlighter-rouge">ORC</code> encoding and compression. Besides REPARTITION, Spark's partitioning hints also include <code class="language-plaintext highlighter-rouge">COALESCE</code>, so we can use <code class="language-plaintext highlighter-rouge">COALESCE</code> instead of <code 
class="language-plaintext highlighter-rouge">REPARTITION</code>.</p> <blockquote> <h4 id="partitioning-hints-types">Partitioning Hints Types</h4> <ul> <li><strong>COALESCE</strong></li> </ul> <p>The <code class="language-plaintext highlighter-rouge">COALESCE</code> hint can be used to reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter.</p> <ul> <li><strong>REPARTITION</strong></li> </ul> <p>The <code class="language-plaintext highlighter-rouge">REPARTITION</code> hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters.</p> <ul> <li><strong>REPARTITION_BY_RANGE</strong></li> </ul> <p>The <code class="language-plaintext highlighter-rouge">REPARTITION_BY_RANGE</code> hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.</p> </blockquote> <p>For the difference between coalesce and repartition, see link [3]:</p> <blockquote> <p>The repartition algorithm does a <strong>full shuffle</strong> of the data and creates equal sized partitions of data. coalesce combines existing partitions to avoid a <strong>full shuffle</strong>.</p> </blockquote> <p>Reading the source code confirms this: RDD's repartition simply calls the coalesce function with the shuffle parameter set to true. Since our goal here is just to reduce the number of small files, <code class="language-plaintext highlighter-rouge">coalesce</code> works.</p> <div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/** * Return a new RDD that has exactly numPartitions partitions. * * Can increase or decrease the level of parallelism in this RDD. Internally, this uses * a shuffle to redistribute data. * * If you are decreasing the number of partitions in this RDD, consider using `coalesce`, * which can avoid performing a shuffle. 
*/</span> <span class="c1">// repartition calls coalesce with the shuffle flag set to true</span> <span class="k">def</span> <span class="nf">repartition</span><span class="o">(</span><span class="n">numPartitions</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)(</span><span class="k">implicit</span> <span class="n">ord</span><span class="k">:</span> <span class="kt">Ordering</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="kc">null</span><span class="o">)</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="n">withScope</span> <span class="o">{</span> <span class="nf">coalesce</span><span class="o">(</span><span class="n">numPartitions</span><span class="o">,</span> <span class="n">shuffle</span> <span class="k">=</span> <span class="kc">true</span><span class="o">)</span> <span class="o">}</span> <span class="cm">/** * Return a new RDD that is reduced into `numPartitions` partitions. * * This results in a narrow dependency, e.g. if you go from 1000 partitions * to 100 partitions, there will not be a shuffle, instead each of the 100 * new partitions will claim 10 of the current partitions. If a larger number * of partitions is requested, it will stay at the current number of partitions. * * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, * this may result in your computation taking place on fewer nodes than * you like (e.g. one node in the case of numPartitions = 1). To avoid this, * you can pass shuffle = true. This will add a shuffle step, but means the * current upstream partitions will be executed in parallel (per whatever * the current partitioning is). * * @note With shuffle = true, you can actually coalesce to a larger number * of partitions. 
This is useful if you have a small number of partitions, * say 100, potentially with a few partitions being abnormally large. Calling * coalesce(1000, shuffle = true) will result in 1000 partitions with the * data distributed using a hash partitioner. The optional partition coalescer * passed in must be serializable. */</span> <span class="k">def</span> <span class="nf">coalesce</span><span class="o">(</span><span class="n">numPartitions</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">shuffle</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">false</span><span class="o">,</span> <span class="n">partitionCoalescer</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">PartitionCoalescer</span><span class="o">]</span> <span class="k">=</span> <span class="nv">Option</span><span class="o">.</span><span class="py">empty</span><span class="o">)</span> <span class="o">(</span><span class="k">implicit</span> <span class="n">ord</span><span class="k">:</span> <span class="kt">Ordering</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="kc">null</span><span class="o">)</span> <span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="n">withScope</span> <span class="o">{</span> <span class="nf">require</span><span class="o">(</span><span class="n">numPartitions</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="o">,</span> <span class="n">s</span><span class="s">"Number of partitions ($numPartitions) must be positive."</span><span class="o">)</span> <span class="nf">if</span> <span class="o">(</span><span class="n">shuffle</span><span class="o">)</span> <span class="o">{</span> <span class="cm">/** Distributes elements evenly across 
output partitions, starting from a random partition. */</span> <span class="k">val</span> <span class="nv">distributePartition</span> <span class="k">=</span> <span class="o">(</span><span class="n">index</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">items</span><span class="k">:</span> <span class="kt">Iterator</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=&gt;</span> <span class="o">{</span> <span class="k">var</span> <span class="n">position</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Random</span><span class="o">(</span><span class="nv">hashing</span><span class="o">.</span><span class="py">byteswap32</span><span class="o">(</span><span class="n">index</span><span class="o">)).</span><span class="py">nextInt</span><span class="o">(</span><span class="n">numPartitions</span><span class="o">)</span> <span class="nv">items</span><span class="o">.</span><span class="py">map</span> <span class="o">{</span> <span class="n">t</span> <span class="k">=&gt;</span> <span class="c1">// Note that the hash code of the key will just be the key itself. 
The HashPartitioner</span> <span class="c1">// will mod it with the number of total partitions.</span> <span class="n">position</span> <span class="k">=</span> <span class="n">position</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">(</span><span class="n">position</span><span class="o">,</span> <span class="n">t</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="k">:</span> <span class="kt">Iterator</span><span class="o">[(</span><span class="kt">Int</span>, <span class="kt">T</span><span class="o">)]</span> <span class="c1">// include a shuffle step so that our upstream tasks are still distributed</span> <span class="k">new</span> <span class="nc">CoalescedRDD</span><span class="o">(</span> <span class="k">new</span> <span class="nc">ShuffledRDD</span><span class="o">[</span><span class="kt">Int</span>, <span class="kt">T</span>, <span class="kt">T</span><span class="o">](</span> <span class="nf">mapPartitionsWithIndexInternal</span><span class="o">(</span><span class="n">distributePartition</span><span class="o">,</span> <span class="n">isOrderSensitive</span> <span class="k">=</span> <span class="kc">true</span><span class="o">),</span> <span class="k">new</span> <span class="nc">HashPartitioner</span><span class="o">(</span><span class="n">numPartitions</span><span class="o">)),</span> <span class="n">numPartitions</span><span class="o">,</span> <span class="n">partitionCoalescer</span><span class="o">).</span><span class="py">values</span> <span class="o">}</span> <span class="k">else</span> <span class="o">{</span> <span class="k">new</span> <span class="nc">CoalescedRDD</span><span class="o">(</span><span class="k">this</span><span class="o">,</span> <span class="n">numPartitions</span><span class="o">,</span> <span class="n">partitionCoalescer</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div></div> <p>The final optimized <code class="language-plaintext 
highlighter-rouge">SQL</code> statement:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="n">overwrite</span> <span class="k">TABLE</span> <span class="n">testdb</span><span class="p">.</span><span class="n">testtable</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="n">p1</span><span class="p">,</span> <span class="n">dt</span><span class="p">,</span> <span class="n">hour</span><span class="p">)</span> <span class="k">SELECT</span> <span class="cm">/*+ COALESCE(40) */</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">UNION</span> <span class="k">ALL</span> <span class="k">SELECT</span> <span class="cm">/*+ COALESCE(20) */</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t2</span> </code></pre></div></div> <p>After this change, the output size stays essentially the same as before.</p> <p><strong>Questions to ponder:</strong></p> <ol> <li>Why were there 800 (400+400) <code class="language-plaintext highlighter-rouge">partition</code>s before?</li> <li>Could this problem be solved by setting <code class="language-plaintext highlighter-rouge">spark.sql.shuffle.partitions=40</code> instead, and what would be the difference?</li> <li>What is the difference between repartition and coalesce?</li> <li>How many kinds of repartition methods does Spark provide?</li> </ol> <h2 id="参考链接">References</h2> <ol> <li>https://spark.apache.org/docs/3.0.1/sql-ref-syntax-qry-select-hints.html</li> <li>https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4#.36o8a7b5j</li> <li>https://stackoverflow.com/questions/54218006/why-does-the-repartition-method-increase-file-size-on-disk</li> </ol>utf7/Yechao Chen在 Spark SQL 使用 REPARTITION Hint 来减少小文件输出
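<p>As a quick sanity check on the 40/20 numbers chosen in the hints, the sizing rule used in this post (total data volume divided by a ~100M target file size) can be sketched as a tiny helper. This is an illustrative Python snippet with made-up hourly volumes, not code from the job itself:</p>

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def num_partitions(total_bytes: int, target_file_bytes: int = 100 * MB) -> int:
    """Partition count so that each output file lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Hypothetical volumes for the two UNION ALL branches of one hourly partition:
print(num_partitions(4 * GB))  # -> 41, i.e. ~100 MB per file
print(num_partitions(2 * GB))  # -> 21
```

<p>With roughly 4G and 2G per branch this lands close to the 40 and 20 partitions used in the REPARTITION/COALESCE hints above; rounding to a convenient number keeps each file slightly under the HDFS block size.</p>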