Jekyll2023-11-16T18:01:03+08:00https://utf7.github.io/feed.xmlutf7utf7 的个人博客/欢迎关注公众号yechaotalkutf7/Yechao Chen2023-11-16T18:01:03+08:002023-11-16T18:01:03+08:00https://utf7.github.io/2023/11/16/2022-07-20-spark-resource-used-metrics<p>大数据资源使用情况统计:</p> <p>物理内存使用率: 物理CPU使用率:</p> <p>物理 CPU使用率(白天/晚上): 物理 内存使用率(白天/晚上):</p> <p>YARN CPU使用率: YARN 内存使用率:</p> <p>YARN CPU使用率(白天/晚上): YARN 内存使用率(白天/晚上):</p> <p>Spark 相关:<br /> 每秒 Container 数量<br /> 每天 Spark 任务数量. 平均每个 Spark 任务 executor 数量. 平均每个 Spark Executor 申请的堆内存(spark.executor.memory). 平均每个 Spark Executor 申请的堆外内存(spark.executor.memoryOverhead). 平均每个 Spark Executor 申请的 Cores 数量(spark.executor.cores). 平均每个 Spark Executor Core 对应的内存比例(spark.executor.memory=8G,spark.executor.cores=4,则为2). 平均每个 Spark Executor 申请的总内存(spark.executor.memory+spark.executor.memoryOverhead). 平均 Spark Executor 堆内存使用率 sum(executor.max_heap_used*executorNum)/sum(spark.executor.memory*executorNum) 注意算法. 平均 Spark Executor 堆外内存使用率 sum(executor.max_offheap_used*executorNum)/sum(spark.executor.memoryOverhead*executorNum) 注意算法. 平均 Spark Executor 总内存使用率 sum((executor.max_offheap_used+executor.max_heap_used)*executorNum)/sum((spark.executor.memory+spark.executor.memoryOverhead)*executorNum) 注意算法(不太严谨,堆内外的峰值内存不一定同时出现).</p> <p>平均每个Spark 任务的 Task 数量. 平均每个Spark shuffle 读的数据量. 平均每个Spark shuffle 写的数据量. 
。。。。</p>utf7/Yechao Chen2023 H2 生活 OKR2023-09-03T00:00:00+08:002023-09-03T00:00:00+08:00https://utf7.github.io/2023/09/03/my-life-okr-for-2023H2<h1 id="2023-h2-生活-okr">2023 H2 生活 OKR</h1> <h2 id="o1-健康管理升级通过引入良好的健康管理机制并在日常生活中落地以改善睡眠质量增强免疫力保持精力延缓衰老提高观感使得更加健康活力">O1 【健康管理升级】通过引入良好的健康管理机制并在日常生活中落地,以改善睡眠质量,增强免疫力、保持精力,延缓衰老,提高观感,使得更加健康活力。</h2> <h3 id="kr1------通过健身运动以及控制饮食等手段降低脂肪含量将体重从125下降到120降低4-降低关节心血管心脏等健康风险-为长期可持续健康发展打下坚实基础同时提供更好的观感使得看起来更年轻">KR1 通过健身、运动以及控制饮食等手段,降低脂肪含量,将体重从125下降到120,降低4% ,降低关节、心血管、心脏等健康风险, 为长期可持续健康发展打下坚实基础,同时提供更好的观感,使得看起来更年轻。</h3> <h3 id="kr2------通过引入早睡晚起中午午休星期天补觉以及少看手机等策略来改善睡眠质量以缓解疲劳增强免疫力提高大脑活力保持精力增强免疫力以及延缓衰老为更好的生活和工作打下坚实的基础">KR2 通过引入早睡晚起、中午午休、星期天补觉以及少看手机等策略,来改善睡眠质量,以缓解疲劳,增强免疫力,提高大脑活力,保持精力,增强免疫力以及延缓衰老,为更好的生活和工作打下坚实的基础。</h3> <h3 id="kr3------通过撸铁游泳散步以及进食高蛋白食品减少糖分饮料咖啡因摄入提高肌肉含量提高不限于胸大肌腹肌三角肌斜方肌背阔肌的维度和颗粒度使其更加有锐度和颗粒感同时提高肺活量和改善心血管健康增强免疫力改善心情使得身体更健康">KR3 通过撸铁、游泳、散步以及进食高蛋白食品、减少糖分、饮料、咖啡因摄入,提高肌肉含量,提高不限于胸大肌、腹肌、三角肌、斜方肌、背阔肌的维度和颗粒度,使其更加有锐度和颗粒感,同时提高肺活量和改善心血管健康,增强免疫力,改善心情,使得身体更健康。</h3> <h2 id="o2------家庭管理升级响应号召积极生娃改善家庭关系关注小孩学习成长提高幸福感">O2 【家庭管理升级】响应号召,积极生娃;改善家庭关系,关注小孩学习成长,提高幸福感。</h2> <h3 id="kr1------投入更多时间在小孩陪伴上一块陪伴其学习生活以及健康快乐成长">KR1 投入更多时间在小孩陪伴上,一块陪伴其学习、生活以及健康快乐成长。</h3> <h3 id="kr2------通过积极响应国家二胎号召为国家以及人类的可持续发展做贡献积极生娃丰富家庭生活增加家庭成员之间的互动和交流缓解独生子女的压力避免感到孤独提高娃的协作精神">KR2 通过积极响应国家二胎号召,为国家以及人类的可持续发展做贡献,积极生娃,丰富家庭生活,增加家庭成员之间的互动和交流,缓解独生子女的压力,避免感到孤独,提高娃的协作精神。</h3> <h3 id="kr3------每月至少做4次饭4次市内或者周边出行游玩完成一次旅游出行">KR3 每月至少做4次饭,4次市内或者周边出行游玩,完成一次旅游出行。</h3> <h2 id="o3---生活管理升级通过完成装修夯实家居基础环境为家人提供良好的生活和学习环境改善财务状况">O3 【生活管理升级】通过完成装修,夯实家居基础环境,为家人提供良好的生活和学习环境;改善财务状况。</h2> <h3 id="kr1------引入水电木瓦油等工艺高质量完成硬装工作满足水电燃气保暖制冷安全防护等基本需求">KR1 引入水电木瓦油等工艺,高质量完成硬装工作,满足水、电、燃气、保暖、制冷、安全防护等基本需求。</h3> <h3 id="kr2------通过引入全屋定制淋浴房马桶花洒等来满足基本日常储物洗簌等居住诉求">KR2 通过引入全屋定制、淋浴房、马桶、花洒等来满足基本日常储物、洗簌等居住诉求。</h3> <h3 id="kr3------通过购买家具家电以及引入软装搭配来满足日常家庭居住休闲娱乐诉求同时改善家居观感">KR3 通过购买家具、家电以及引入软装搭配来满足日常家庭居住、休闲、娱乐诉求,同时改善家居观感。</h3> <h3 id="kr4----通过开源节流主要手段是节流改善家庭入不敷出的财务状况提高家庭抗风险能力和长期可持续发展">KR4 
通过开源节流(主要手段是节流),改善家庭入不敷出的财务状况,提高家庭抗风险能力,保障长期可持续发展。</h3> <h2 id="o4-无用技能升级调研至少1个无用技能比如乐器拳击摩托车等考虑学习掌握一个无用技能">O4 【无用技能升级】调研至少1个无用技能,比如乐器、拳击、摩托车等,考虑学习掌握一个无用技能。</h2> <h3 id="kr1----调研乐器口琴吉他等拳击等无用技能丰富日常生活">KR1 调研乐器(口琴、吉他等)、拳击等无用技能,丰富日常生活。</h3>utf7/Yechao Chen2023 H2 生活 OKR程序员书写用词规范2022-06-12T00:00:00+08:002022-06-12T00:00:00+08:00https://utf7.github.io/2022/06/12/word-specification<p>书写用词规范,特别需要注意大小写组合,否则会显得不专业:比如 hbase,mysql,sla,hadoop,clickhouse 都是不推荐的,常见的名词需要注意大小写</p> <p>常见的名词规范如下:</p> <p>Hadoop<br /> HDFS<br /> YARN<br /> MapReduce 或者 MR<br /> EMR<br /> Hive<br /> HBase<br /> Cassandra<br /> MySQL<br /> Spark<br /> Flink<br /> Kafka<br /> TensorFlow 或者 TF<br /> Elasticsearch 或者 ES<br /> Kylin<br /> Hudi<br /> Iceberg<br /> Parquet<br /> ORC<br /> Presto<br /> Trino<br /> ClickHouse 或者 CK<br /> Doris<br /> StarRocks<br /> Impala<br /> ZooKeeper 或者 ZK<br /> RocksDB<br /> Java<br /> Linux <br /> CentOS<br /> Docker<br /> Kubernetes 或者 K8s<br /> SLA<br /> Tableau</p>utf7/Yechao Chen程序员书写用词规范将github 代码push 到自己的仓库并保留commit log2022-03-29T00:00:00+08:002022-03-29T00:00:00+08:00https://utf7.github.io/2022/03/29/push-remote-repo-to-your-git<h2 id="1将github-某个版本放到自己的仓库保留commit-log">1.将github 某个版本放到自己的仓库,保留commit log</h2> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone git@github.com:apache/hbase.git
cd hbase/
git tag
git checkout rel/2.4.11
git remote set-url origin ssh://git@yourgit:port/bigdata/HBase
git remote -v
git branch test
git checkout test
git push origin test:2.4.11
</code></pre></div></div>utf7/Yechao Chen将github 代码push 到自己的仓库并保留commit log给容器镜像瘦身的一个小技巧2022-03-14T00:00:00+08:002022-03-14T00:00:00+08:00https://utf7.github.io/2022/03/14/reduce-docker-image-size<h2 id="1镜像瘦身">1.镜像瘦身</h2> <p>执行完 <code class="language-plaintext highlighter-rouge">yum</code> 命令以后,可以删除 <code class="language-plaintext highlighter-rouge">cache</code></p> <div class="language-plaintext highlighter-rouge"><div 
class="highlight"><pre class="highlight"><code> yum -q clean all &amp;&amp; \ rm -rf /var/cache/yum &amp;&amp; \ </code></pre></div></div> <p>如:</p> <p>https://github.com/utf7/Dockerfile/blob/master/jdk/jdk11/Dockerfile</p> <p>这个镜像清理yum cache 可以减少 130多M</p> <p><img width="998" alt="镜像大小的比较" src="/images/posts/k8s/dockerfile/clear-yum.jpg" /></p>utf7/Yechao Chen给容器镜像瘦身的一个小tipSpark Shuffle Service 配置不合理导致的任务失败以及NodeManager OOM 问题分析2021-05-12T00:00:00+08:002021-05-12T00:00:00+08:00https://utf7.github.io/2021/05/12/spark-shuffle-service-problems<h2 id="1">1.</h2> <p>最近集群有 Spark 任务会出现失败,查看 Spark 日志发现会有 Executor 挂掉的信息。</p> <p>一般 Executor 挂掉的话,通常会是几种情况:</p> <p>1、Executor 自己挂了,比如申请内存不够等,executor 运行过程中 OOM等。</p> <p>2、Executor 所在的节点出现问题,比如宕机。</p> <p>3、Executor 所在节点的NodeManager 挂了。因为Executor 都是NodeManager 进程 fork 出来的,NodeManager 挂的话,Executor 通常也会挂掉</p> <p>4、操作系统内存紧张触发了OS的OOM Killer 功能,被操作系统干掉了。</p> <p>5、Executor 被NodeManager watch 到内存超过使用的界限等原因,被NodeManager 给干掉了。</p> <p>6、人为 kill (可能性较小)</p> <h2 id="2">2.</h2> <p>查看 Spark 日志并没有发现 Executor 有 OOM 的情况。</p> <p>登陆 Executor 所在的节点,发现机器也正常。查看 NodeManager 情况</p> <p>ps -ef|grep NodeManager|grep -v grep 查看发现 NodeManager 也在。后来仔细看了一下 NodeManager 的启动时间发现不对,是最近的时间,证明 NodeManager 之前挂过,现在又启动了</p> <p>(我们有 NodeManager 自动拉起的功能,所以挂掉以后立刻被拉起来了)</p> <p>check NodeManager 日志</p> <p>发现是 NodeManager OOM了</p> <pre><code class="language-log"> WARN org.spark_project.io.netty.channel.AbstractChannelHandlerContext: An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception: java.lang.OutOfMemoryError: GC overhead limit exceeded WARN org.spark_project.io.netty.channel.AbstractChannelHandlerContext: An exception ' java.lang.OutOfMemoryError: Java heap space FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[DeletionService #3,5,main] threw an Error. Shutting down now... 
java.lang.OutOfMemoryError: Java heap space </code></pre> <p>OK,那么到这里原因大概就知道了</p> <p>NodeManager OOM 导致的 Executor 挂掉。</p> <p>那么为什么会 OOM 呢?</p> <p>ps -ef|grep NodeManager|grep -v grep check 了一下 NodeManager 进程,发现设置了 4096m 的堆内存(-Xmx4096m)</p> <p>根据之前的经验,线上应用 NodeManager 推荐内存大概是 8-12G 比较靠谱,4G 确实有点低了。</p> <p>是不是设置一下 8-12G 内存就 ok 了呢?当然我们还是要分析一下为什么会 OOM,到底什么占用了内存?是不是有内存泄露?是个例还是很多节点类似?</p> <p>使用 salt 全量统计了一下所有 NodeManager 日志关键字段,发现并不是个例,有其他节点也存在 NodeManager OOM 的情况,找到另外一个节点来 check 一下(该节点其实 GC 不过来,已经卡死了)。</p> <h2 id="3">3.</h2> <p>首先是保留一些信息:</p> <p>1、保留堆栈信息:</p> <pre><code class="language-shell">jstack 31549 &gt; nm-31549.jstack1;
sleep 2s;
jstack 31549 &gt; nm-31549.jstack2;
sleep 5s;
jstack 31549 &gt; nm-31549.jstack3;
</code></pre> <p>2、保留句柄:</p> <pre><code class="language-shell">lsof -p 31549 &gt; nm-31549.fd
</code></pre> <p>3、check 一下 GC:用 jstat -gcutil 分析 GC 情况。</p> <p>(图略)</p> <p>4、jmap 看一下内存情况:</p> <pre><code class="language-shell">jmap -histo 31549 &gt; jmap-nm-31549.jmap
</code></pre> <p>5、dump 内存:</p> <pre><code class="language-shell">jmap -dump:format=b,file=nm-31549.heap.bin 31549
</code></pre> <p>check 下 jmap 的信息</p> <p>(图略)</p> <p>发现有大量的 LocalCache 对象以及 ShuffleIndexInformation</p> <p>ps:</p> <p>[B 是指 byte[] 数组</p> <p>[C 是指 char[] 数组</p> <p>一般 Java 进程 jmap 大部分都是 byte[],char[],String 这些,是比较正常的,通常看不出来什么,基本上没啥关系。</p> <p>主要怀疑对象在 ShuffleIndexInformation 和 LocalCache 上面</p> <p>check 了一下代码,发现 Hadoop 中并没有这个代码。</p> <p>从名字看这块肯定是 Shuffle 相关的,所以自然而然就想到了 Spark Shuffle Service。这里简单说一下,通常 Shuffle 是在 Executor 中的,executor 挂的话,就会导致 shuffle 数据丢失、任务失败,所以 spark 后来做了 Shuffle Service,原理是将 Shuffle 放到 NodeManager 中来做,NodeManager 一般不忙,用来做 Shuffle 也是可以的,这样的话,Executor 挂了,NodeManager 其实还在,稳定性有一定提高;当然现在还有一个思路是将 shuffle 做成单独的服务放在外面,这是另外的一个话题了。</p> <p>查看 Spark 代码,Spark Shuffle Service 的代码在 NodeManager 中跑,ExternalShuffleBlockResolver 内部有一个 <code class="language-plaintext highlighter-rouge">LoadingCache&lt;File, ShuffleIndexInformation&gt; shuffleIndexCache</code>,该 cache 主要用于 cache shuffle 的 index 信息。</p> <p>cache 默认配置是 100m,由参数 spark.shuffle.service.index.cache.size 来配置。</p> <p>查看当前配置发现是 4096m</p> <pre><code class="language-shell">grep -A 1 "spark.shuffle.service.index.cache.size" /etc/apps/hadoop-conf/yarn-site.xml
</code></pre> 
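<p>除了上面的 grep,也可以用一小段脚本从 yarn-site.xml 里把这个配置解析出来。下面是一个示意脚本(假设配置文件就是上面的 /etc/apps/hadoop-conf/yarn-site.xml;为了能独立运行,这里用内置的示例字符串代替读文件):</p>

```python
# 从 yarn-site.xml 里解析某个配置项的小脚本(示意)
# 实际使用时可以把 SAMPLE 换成 open("/etc/apps/hadoop-conf/yarn-site.xml").read()
import xml.etree.ElementTree as ET

SAMPLE = """<configuration>
  <property>
    <name>spark.shuffle.service.index.cache.size</name>
    <value>4096m</value>
  </property>
</configuration>"""


def get_conf(xml_text, key):
    # 遍历所有 <property> 节点,返回 name 匹配的 value;找不到返回 None
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return None


print(get_conf(SAMPLE, "spark.shuffle.service.index.cache.size"))  # 4096m
```

<p>(仅为示意,线上直接用前面的 grep 命令也完全够用。)</p>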
<pre><code class="language-xml">&lt;name&gt;spark.shuffle.service.index.cache.size&lt;/name&gt;
&lt;value&gt;4096m&lt;/value&gt;
</code></pre> <p>当前 NodeManager 堆内存一共也就 4096m,所以当 cache 到一定程度的时候,OOM 就可想而知了。</p> <p>cache 相关代码设置如下:</p> <pre><code class="language-java">ExternalShuffleBlockResolver(
    TransportConf conf,
    File registeredExecutorFile,
    Executor directoryCleaner) throws IOException {
  this.conf = conf;
  this.registeredExecutorFile = registeredExecutorFile;
  String indexCacheSize = conf.get("spark.shuffle.service.index.cache.size", "100m");
  CacheLoader&lt;File, ShuffleIndexInformation&gt; indexCacheLoader =
      new CacheLoader&lt;File, ShuffleIndexInformation&gt;() {
        public ShuffleIndexInformation load(File file) throws IOException {
          return new ShuffleIndexInformation(file);
        }
      };
  shuffleIndexCache = CacheBuilder.newBuilder()
      .maximumWeight(JavaUtils.byteStringAsBytes(indexCacheSize))
      .weigher(new Weigher&lt;File, ShuffleIndexInformation&gt;() {
        public int weigh(File file, ShuffleIndexInformation indexInfo) {
          return indexInfo.getSize(); // 这块计算其实是有点问题的
        }
      })
      .build(indexCacheLoader);
</code></pre> <p>对 dump 出来的内存进行分析,发现与上面的结论是一致的:</p> <p>发现4G内存,shuffleIndexCache 就占用了3.7G</p> <p>(图略)</p> <p>(图略)</p> <p>check 了代码,感觉还是有一些可以改进的地方:</p> <p>具体见注释</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shuffleIndexCache = CacheBuilder.newBuilder()
    .maximumWeight(JavaUtils.byteStringAsBytes(indexCacheSize))
    .weigher(new Weigher&lt;File, ShuffleIndexInformation&gt;() {
      public int weigh(File file, ShuffleIndexInformation indexInfo) {
        // return indexInfo.getSize();
        // 这块计算其实是有点问题的,只计算了 indexInfo 文件的 size,其实还会有 overhead
        // 当然这次的问题主要应该不是 overhead,而是堆内存一共就配置了 4096m,而 cache 也配置了这么大。
        // 实际上 cache 是会超过配置的值的,比如文件多,但是 index 文件内容比较少的情况下。
        // 对这块代码做了一点小改动,具体可以参考 HBase 等代码来实际计算 Java 对象大小。
        // 不过这块没有太大必要算得那么细,但是还是需要考虑 overhead
        // 当然 index.cache.size 设置成 256m 的话,对于内存 8G 的 NM,即使有 8 倍的 overhead,其实问题也不大。
        // 我们这里设置成了 512m
        return file.getAbsolutePath().length() + 128 + indexInfo.getSize();
      }
    })
</code></pre></div></div> <p>最终做了如下三个修改:</p> <p>1.修改 
spark.shuffle.service.index.cache.size=512m</p> <p>2.同时修改 NodeManager 堆内存为10240m(10G)</p> <p>3.同时对 shuffleIndexCache 内存占用的计算做了一些小小的修正(这里不是主要原因,这次主要还是内存配置与 cache.size 不匹配导致的;但即使内存配置比较大,如果 shuffleIndexCache 配置也比较大,也有可能出现 shuffleIndexCache 配置 2g、实际占用远大于 2g 的情况,所以规避一下风险)。</p> <p>修改以后任务稳定了很多。</p> <p>观察了一段时间,发现再也没有 NodeManager OOM 了</p> <p>后来搜了一下社区,确实有类似的关于 shuffleIndexCache 内存计算的改进。相关 JIRA:</p> <p>https://issues.apache.org/jira/browse/SPARK-21501</p> <p>https://issues.apache.org/jira/browse/SPARK-33206</p>utf7/Yechao ChenSpark Shuffle Service 配置不合理导致的任务失败以及NodeManager OOM 问题分析salt 中执行 awk 注意事项2020-11-10T00:00:00+08:002020-11-10T00:00:00+08:00https://utf7.github.io/2020/11/10/salt-exec-awk-failed-case<p>使用 <code class="language-plaintext highlighter-rouge">salt</code> 执行 awk 时,发现会执行失败</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt "*" cmd.run " ps -ef|grep NodeManager|grep -v grep|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk '{print $2}'|xargs lsof -p|grep REG|grep /usr/lib/locale/locale-archive" </code></pre></div></div> <p>会发现执行失败,需要在 $2 前面加转义符号:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt "*" cmd.run " ps -ef|grep NodeManager|grep -v grep|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk '{print \$2}'|xargs lsof -p|grep REG|grep /usr/lib/locale/locale-archive" </code></pre></div></div>utf7/Yechao Chensalt 中执行 awk 注意事项salt 中关于locale的问题2020-11-10T00:00:00+08:002020-11-10T00:00:00+08:00https://utf7.github.io/2020/11/10/salt-locale-problem<h2 id="记录一次诡异的-hivespark-乱码问题">记录一次诡异的 Hive/Spark 乱码问题</h2> <p>最近遇到一个诡异的事情,Hive/Spark 写入汉字乱码。</p> <p>检查 Hive/Spark 的各种参数也没有发现有什么变动,客户端配置也是一样的。</p> <p>有一个诡异的地方是</p> <p><code class="language-plaintext highlighter-rouge">spark-sql -f xxx.sql</code>,如果 <code class="language-plaintext highlighter-rouge">xxx.sql</code> 中有汉字,则也会有问题。</p> <p><code 
class="language-plaintext highlighter-rouge">spark-submit client</code> 模式则没有问题,<code class="language-plaintext highlighter-rouge">cluster</code> 模式则会有问题。</p> <p>这里面唯一区别是 <code class="language-plaintext highlighter-rouge">client</code> 的 <code class="language-plaintext highlighter-rouge">Driver</code> 在 本地,<code class="language-plaintext highlighter-rouge">cluster</code> 在 <code class="language-plaintext highlighter-rouge">YARN(NodeManager)</code> 节点。</p> <p>另外 <code class="language-plaintext highlighter-rouge">Spark/MR</code> 读写数据也有问题,不过 <code class="language-plaintext highlighter-rouge">hive cli</code> 不启动 <code class="language-plaintext highlighter-rouge">MR</code> 的话,则没有问题。。。比如 <code class="language-plaintext highlighter-rouge">select * from t limit 10</code> 这种不会启动 <code class="language-plaintext highlighter-rouge">MR</code>,则是本地运行的。</p> <p>分析来看,不管是 <code class="language-plaintext highlighter-rouge">Hive/Spark</code> 只要任务跑在 YARN 上面则会有问题。</p> <p>所以这块感觉和 <code class="language-plaintext highlighter-rouge">NodeManager</code> 有关系,因为 <code class="language-plaintext highlighter-rouge">on YARN</code> 的进程都是 <code class="language-plaintext highlighter-rouge">NodeManager fork</code> 出来的。</p> <p>因为最近在做 <code class="language-plaintext highlighter-rouge">Zstandard</code> 修改了代码,所以同事怀疑是否代码引起的。我反复check了代码,并没有涉及到字符编码相关的内容。不应该出现此类问题。</p> <p>由于修改了 <code class="language-plaintext highlighter-rouge">Hadoop</code> 代码,需要重启 <code class="language-plaintext highlighter-rouge">NodeManager</code> ,最近变动就是重启了 <code class="language-plaintext highlighter-rouge">NodeManager</code>以及添加了 <code class="language-plaintext highlighter-rouge">Zstandard</code> 的支持。</p> <p>登陆出现问题的主机</p> <p>执行 <code class="language-plaintext highlighter-rouge">locale</code></p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nv">LANG</span><span class="o">=</span>en_US.UTF-8 <span class="nv">LCCTYPE</span><span class="o">=</span><span 
class="s2">"en_US.UTF-8"</span> <span class="nv">LC_NUMERIC</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_TIME</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_COLLATE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_MONETARY</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_MESSAGES</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_PAPER</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_NAME</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_ADDRESS</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_TELEPHONE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_MEASUREMENT</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_IDENTIFICATION</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span> <span class="nv">LC_ALL</span><span class="o">=</span> </code></pre></div></div> <p>貌似也没啥问题。百思不得其解:</p> <p>此时赶快将节点回滚到之前的版本,然后慢慢重启 <code class="language-plaintext highlighter-rouge">NodeManager</code>,发现确实问题渐渐修复了。</p> <p>还有另外一个现象是并不是所有内容都是乱码的,这个猜测和 <code class="language-plaintext highlighter-rouge">Map/Reduce</code> 以及 <code class="language-plaintext highlighter-rouge">Spark Executor</code> 跑在不同的 <code class="language-plaintext highlighter-rouge">NodeManager</code> 节点有关系。</p> <p>这时候根据多年的经验直觉,赶快保存的异常的节点 <code class="language-plaintext highlighter-rouge">NodeManager</code> 保留的 <code class="language-plaintext highlighter-rouge">lsof</code> 句柄信息,</p> <p>diff 一下进程加载的 fd,是不是少加载了什么导致的,比较2个进程加载的fd 发现有一个</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps <span class="nt">-ef</span>|grep <span class="nt">-v</span> <span class="nb">grep</span>|grep 
org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk <span class="s1">'{print $2}'</span>|xargs lsof <span class="nt">-p</span> |sort <span class="nt">-n</span> <span class="o">&gt;</span> lsof_before </code></pre></div></div> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps <span class="nt">-ef</span>|grep <span class="nt">-v</span> <span class="nb">grep</span>|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk <span class="s1">'{print $2}'</span>|xargs lsof <span class="nt">-p</span> |sort <span class="nt">-n</span> <span class="o">&gt;</span> lsof_after. </code></pre></div></div> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff lsof_before lsof_after </code></pre></div></div> <p>然后发现少了</p> <p><code class="language-plaintext highlighter-rouge">/usr/lib/locale/locale-archive</code></p> <p>咦,什么鬼。。。为啥没有这个,分明我登陆节点执行,<code class="language-plaintext highlighter-rouge">locale</code> 输出是正常的。</p> <p>所以就把问题定位点放在 <code class="language-plaintext highlighter-rouge">/usr/lib/locale/locale-archive</code> 为什么少了这个地方。</p> <p>出问题的 NodeManager 启动命令如下:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt <span class="s1">'nm-*'</span> cmd.run <span class="s1">'sudo -u yarn bash -c "source /etc/profile;yarn-daemon.sh start nodemanager"'</span> </code></pre></div></div> <p>我在一个节点测试了一下,使用</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo</span> <span class="nt">-u</span> yarn bash <span class="nt">-c</span> <span class="s2">"source /etc/profile;yarn-daemon.sh start nodemanager"</span> </code></pre></div></div> <p>启动一个 NM 测试,发现并没有什么异常。。。。太诡异了。。。</p> <p>节点还没有重启完,我赶快把所有的节点检查了一下,是否都丢失 <code class="language-plaintext highlighter-rouge">/usr/lib/locale/locale-archive</code></p> <div class="language-shell highlighter-rouge"><div 
class="highlight"><pre class="highlight"><code>salt <span class="s2">"*"</span> cmd.run <span class="s2">" ps -ef|grep -v grep|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk '{print </span><span class="se">\$</span><span class="s2">2}'|xargs lsof -p|grep REG|grep /usr/lib/locale/locale-archive"</span> </code></pre></div></div> <p>发现我之前启动的节点都是没有 <code class="language-plaintext highlighter-rouge">locale-archive</code> 的,而同事新启动的则是没有问题的。莫不是启动方式有问题???诡异。<br /> 执行: <code class="language-plaintext highlighter-rouge">salt -E "nm-*" cmd.run "locale"</code></p> <pre><code class="language-plaintext">LANG=en_US.UTF-8
LC_CTYPE=C
LC_NUMERIC=C
LC_TIME=C
LC_COLLATE=C
LC_MONETARY=C
LC_MESSAGES=C
LC_PAPER=C
LC_NAME=C
LC_ADDRESS=C
LC_TELEPHONE=C
LC_MEASUREMENT=C
LC_IDENTIFICATION=C
LC_ALL=
</code></pre> <p>发现 <code class="language-plaintext highlighter-rouge">locale</code> 很多输出都是<code class="language-plaintext highlighter-rouge">C</code>,这个与我之前在机器上面执行的结果是不一致的。</p> <p>所以应该是 <code class="language-plaintext highlighter-rouge">salt</code> 导致的,问了一下同事是如何重启的,他是 kill 掉 <code class="language-plaintext highlighter-rouge">NodeManager</code>,并没有手动启动,而是让监控脚本自动拉起的。</p> <p>而我当时为了启动快一些,是直接使用 <code class="language-plaintext highlighter-rouge">salt</code> 批量 <code class="language-plaintext highlighter-rouge">stop</code> 然后 <code class="language-plaintext highlighter-rouge">start</code> 的。</p> <p>至此原因排查出来了,原来是 <code class="language-plaintext highlighter-rouge">salt</code> 的锅。</p> <p>后来去github 把salt 代码拉了下来,看了一下。问题出在这边:</p> <p><code class="language-plaintext highlighter-rouge">salt</code> 默认会 <code class="language-plaintext highlighter-rouge">reset_system_locale</code>,代码截图如下:</p> <p><img src="/images/posts/salt/reset_system_locale_code1.png" alt="reset_system_locale_code1" /></p> <p><img src="/images/posts/salt/reset_system_locale_code2.png" alt="reset_system_locale_code2" /></p> <p>个人感觉这边 salt 处理得很有问题,虽然提供了参数,但我仍然认为是个 bug,不知道为啥会考虑默认重置 locale。</p> <p>很坑!!!</p> <p>看了一下代码,可以执行<code 
class="language-plaintext highlighter-rouge">salt</code> 命令的时候,添加参数 <code class="language-plaintext highlighter-rouge">reset_system_locale=False</code> 解决:</p> <p>修改 <code class="language-plaintext highlighter-rouge">NodeManger</code> 启动脚本,<code class="language-plaintext highlighter-rouge">Hive/Spark</code> 乱码问题解决</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt <span class="s1">'nm-*'</span> cmd.run <span class="nv">reset_system_locale</span><span class="o">=</span>False <span class="s1">'sudo -u yarn bash -c "source /etc/profile;yarn-daemon.sh start nodemanager"'</span> </code></pre></div></div> <p>验证:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt <span class="s2">"nm-*"</span> cmd.run <span class="s2">" ps -ef|grep -v grep|grep org.apache.hadoop.yarn.server.nodemanager.NodeManager|awk '{print </span><span class="se">\$</span><span class="s2">2}'|xargs lsof -p|grep REG|grep /usr/lib/locale/locale-archive"</span> </code></pre></div></div> <p>如果有使用 <code class="language-plaintext highlighter-rouge">salt</code> 的同学,具体可以使用如下命令来测试 测试这个问题:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>salt <span class="s1">'nm-*'</span> cmd.run <span class="nv">reset_system_locale</span><span class="o">=</span>False <span class="s1">'locale'</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> salt 'nm-*' cmd.run 'locale' </code></pre></div></div> <p>PS: 关于 <code class="language-plaintext highlighter-rouge">locale</code> : 可参考如下几个链接:</p> <p>1.https://wiki.archlinux.org/index.php/locale<br /> 2.https://man7.org/linux/man-pages/man1/localedef.1.html<br /> 3.https://linuxhint.com/locales_debian/</p>utf7/Yechao Chensalt 中关于locale的问题Spark SQL 正确的传递 Hive 
参数2020-10-18T00:00:00+08:002020-10-18T00:00:00+08:00https://utf7.github.io/2020/10/18/spark-sql-params<p>使用 <code class="language-plaintext highlighter-rouge">spark-sql</code> 导入动态分区时,出现来错误</p> <pre><code class="language-log">Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1793, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1793.; </code></pre> <p>提示增加 <code class="language-plaintext highlighter-rouge">hive.exec.max.dynamic.partitions</code> 的值, 可是我明明通过 <code class="language-plaintext highlighter-rouge">--conf hive.exec.max.dynamic.partitions=1000000</code> 为什么还是报错了?难道是参数没有传递进去??? 执行 <code class="language-plaintext highlighter-rouge">spark-sql --help</code> ,通过查看 <code class="language-plaintext highlighter-rouge">spark-sql</code> 的帮助</p> <p>原来 hive的参数是通过 <code class="language-plaintext highlighter-rouge">--hiveconf</code> 来传递的而不是 <code class="language-plaintext highlighter-rouge">--conf</code></p> <p>使用 <code class="language-plaintext highlighter-rouge">--conf</code> 来传递 <code class="language-plaintext highlighter-rouge">spark</code> 参数 使用 <code class="language-plaintext highlighter-rouge">--hiveconf</code> 来传递 <code class="language-plaintext highlighter-rouge">hive</code> 参数 修改后为:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">nohup </span>spark-sql <span class="nt">--master</span> yarn <span class="nt">--deploy-mode</span> client <span class="nt">--queue</span> root.test.myqueue <span class="nt">--driver-memory</span> 24g <span class="nt">--executor-cores</span> 2 <span class="nt">--executor-memory</span> 4g <span class="nt">--num-executors</span> 256 <span class="nt">--conf</span> spark.driver.memoryOverhead<span class="o">=</span>1g <span class="nt">--conf</span> spark.executor.memoryOverhead<span class="o">=</span>1g <span class="nt">--conf</span> spark.speculation<span 
class="o">=</span><span class="nb">false</span> <span class="nt">--conf</span> spark.driver.maxResultSize<span class="o">=</span>12g <span class="nt">--conf</span> spark.sql.hive.filesourcePartitionFileCacheSize<span class="o">=</span>4621440000 <span class="nt">--conf</span> spark.sql.files.maxPartitionBytes<span class="o">=</span>268435456 <span class="nt">--conf</span> spark.sql.files.openCostInBytes<span class="o">=</span>0 <span class="nt">--conf</span> spark.sql.shuffle.partitions<span class="o">=</span>512 <span class="nt">--hiveconf</span> hive.exec.max.dynamic.partitions<span class="o">=</span>1000000 <span class="nt">--hiveconf</span> hive.exec.max.dynamic.partitions.pernode<span class="o">=</span>100000 <span class="nt">--hiveconf</span> hive.exec.max.created.files<span class="o">=</span>1000000 <span class="nt">-S</span> <span class="nt">-e</span> <span class="s2">"insert overwrite table tt partition(day,hour) select * from table1 where day&lt;='2020-10-11' distribute by day,hour , cast( floor(rand() * 8) as int);"</span> <span class="o">&gt;</span> merge.log 2&gt;&amp;1 &amp; </code></pre></div></div>utf7/Yechao ChenSpark SQL 正确的传递 Hive 参数在 Spark SQL 使用 REPARTITION Hint 来减少小文件输出2020-10-08T00:00:00+08:002020-10-08T00:00:00+08:00https://utf7.github.io/2020/10/08/use-spark-repartition-hint-to-reduce-small-files<h2 id="1-问题">1. 问题</h2> <p>公司数仓业务有一个 <code class="language-plaintext highlighter-rouge">sql</code> 任务,每天会产生大量的小文件,每个文件只有几百 <code class="language-plaintext highlighter-rouge">KB</code>~几 <code class="language-plaintext highlighter-rouge">M</code> 大小,小文件过多会对 <code class="language-plaintext highlighter-rouge">HDFS</code> 性能造成比较大的影响,同时也影响数据的读写性能(Spark 任务某些情况下会缓存文件信息),虽然开发了小文件合并工具会去定期合并小文件,但还是想从源头上来解决这个问题,尽量生成比较合适的文件大小,而不是事后补救。</p> <p>下图是优化前一天 18 点这个分区的数据,可以看出小文件问题明显。</p> <p><img src="/images/posts/spark/spark-repartition-small-files/small-files.png" alt="small-files" /></p> <h2 id="2-分析">2. 
分析</h2> <p>思路:</p> <p>1、统计每次生成的文件大小</p> <p>2、每个小文件的大小</p> <p>3、预期文件大小</p> <p>4、找到优化方法,使其满足预期大小</p> <p>统计一下每个分区的数据量,差不多每个小时 <code class="language-plaintext highlighter-rouge">1-6G</code> 数据</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span> hdfs dfs <span class="nt">-du</span> <span class="nt">-h</span> hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/ 3.4 G 10.2 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>00 2.1 G 6.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>01 1.4 G 4.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>02 1.1 G 3.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>03 1.2 G 3.6 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>04 1.8 G 5.5 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>05 3.1 G 9.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>06 3.8 G 11.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>07 3.8 G 11.5 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>08 3.9 G 11.6 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>09 4.1 G 12.2 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>10 4.6 G 13.7 G 
hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>11 5.4 G 16.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>12 4.7 G 14.0 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>13 4.1 G 12.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>14 4.1 G 12.3 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>15 4.1 G 12.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>16 4.5 G 13.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>17 4.6 G 13.8 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18 5.0 G 15.1 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>19 5.5 G 16.6 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>20 6.1 G 18.4 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>21 6.2 G 18.7 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>22 5.0 G 15.1 G hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>23 </code></pre></div></div> <p>使用 <code class="language-plaintext highlighter-rouge">hour=18</code> 这个分区举例:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ 
</span> hdfs dfs <span class="nt">-ls</span> <span class="nt">-h</span> <span class="nt">-R</span> hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18|wc <span class="nt">-l</span> 800 </code></pre></div></div> <p>可以看出有 <code class="language-plaintext highlighter-rouge">800</code> 个文件。</p> <p>详细看一下文件大小,有一半是 <code class="language-plaintext highlighter-rouge">9.2M</code> 左右和一半<code class="language-plaintext highlighter-rouge">2.6M</code>左右(实际是由 UNION ALL 分别生成 400个 ),其他就不列出来了.</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>9.2 M 27.6 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00397-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 9.3 M 27.8 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00398-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 9.2 M 27.5 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00399-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 2.6 M 7.7 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00400-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 2.6 M 7.7 M hdfs://dw/test.db/testtable/p1<span class="o">=</span>a/dt<span class="o">=</span>2020-09-28/hour<span class="o">=</span>18/part-00401-13c5757b-294b-4010-a0f7-bbb7100455c8.c000 </code></pre></div></div> <p>文件由 <code class="language-plaintext highlighter-rouge">SQL ETL</code> 生成,清洗入库的<code class="language-plaintext highlighter-rouge">sql</code> 如下:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="n">overwrite</span> <span class="k">TABLE</span> <span 
class="n">testdb</span><span class="p">.</span><span class="n">testtable</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="n">p1</span><span class="p">,</span> <span class="n">dt</span><span class="p">,</span> <span class="n">hour</span><span class="p">)</span> <span class="k">SELECT</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">UNION</span> <span class="k">ALL</span> <span class="k">SELECT</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t2</span> </code></pre></div></div> <p>This is a very typical workload: a scheduled job loads data into the partition for a given point in time.</p> <p>Two important facts can be read from the SQL above:</p> <p>1. The table has three partition columns, <code class="language-plaintext highlighter-rouge">p1,dt,hour</code>, where <code class="language-plaintext highlighter-rouge">dt</code> is the date and <code class="language-plaintext highlighter-rouge">hour</code> is the hour.</p> <p>2. Data is loaded with <code class="language-plaintext highlighter-rouge">insert overwrite SELECT</code>, and there is <code class="language-plaintext highlighter-rouge">1</code> <code class="language-plaintext highlighter-rouge">UNION ALL</code>.</p> <p>Each SELECT apparently ended up with 400 partitions, so the two UNION ALL branches together wrote 800 files.</p> <p>Ideally, each file should stay around 100-200M (slightly below the HDFS BLOCK size is recommended); for example, with a 128MB HDFS BLOCK, aim for roughly 100M per file.</p> <p>The table currently receives 1-6G of data per hour.</p> <h2 id="3-优化">3. 
Optimization</h2> <p>Spark SQL supports the <code class="language-plaintext highlighter-rouge">/*+ REPARTITION(N) */</code> hint to <code class="language-plaintext highlighter-rouge">repartition</code> the data.</p> <p>Based on the 1-6 G of data per hourly partition and the ~100M target file size, we set <code class="language-plaintext highlighter-rouge">40/20 PARTITION</code>s for the two branches:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="n">overwrite</span> <span class="k">TABLE</span> <span class="n">testdb</span><span class="p">.</span><span class="n">testtable</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="n">p1</span><span class="p">,</span> <span class="n">dt</span><span class="p">,</span> <span class="n">hour</span><span class="p">)</span> <span class="k">SELECT</span> <span class="cm">/*+ REPARTITION(40) */</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">UNION</span> <span class="k">ALL</span> <span class="k">SELECT</span> <span class="cm">/*+ REPARTITION(20) */</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t2</span> </code></pre></div></div> <p>After the optimization, the newly generated files are clearly larger, all around 95M; the small-file problem is solved.</p> <p><img src="/images/posts/spark/spark-repartition-small-files/big-files.png" alt="small-files" /></p> <p>One odd observation: after the optimization the data takes roughly <code class="language-plaintext highlighter-rouge">5%-10%</code> more space. The table format is <code class="language-plaintext highlighter-rouge">ORC</code>; intuitively, merging <code class="language-plaintext highlighter-rouge">ORC</code> files should make the data slightly smaller, or at least leave it unchanged, since ORC keeps statistics and can encode and compress runs of similar values very effectively. After some digging, the cause appears to be that <code class="language-plaintext highlighter-rouge">repartition</code> performs a full <code class="language-plaintext highlighter-rouge">shuffle</code>, which scatters similar rows apart and hurts <code class="language-plaintext highlighter-rouge">ORC</code> encoding and compression. Besides REPARTITION, Spark's partitioning hints also include <code class="language-plaintext highlighter-rouge">COALESCE</code>, so we can use <code class="language-plaintext highlighter-rouge">COALESCE</code> instead of <code 
class="language-plaintext highlighter-rouge">REPARTITION</code>.</p> <blockquote> <h4 id="partitioning-hints-types">Partitioning Hints Types</h4> <ul> <li><strong>COALESCE</strong></li> </ul> <p>The <code class="language-plaintext highlighter-rouge">COALESCE</code> hint can be used to reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter.</p> <ul> <li><strong>REPARTITION</strong></li> </ul> <p>The <code class="language-plaintext highlighter-rouge">REPARTITION</code> hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters.</p> <ul> <li><strong>REPARTITION_BY_RANGE</strong></li> </ul> <p>The <code class="language-plaintext highlighter-rouge">REPARTITION_BY_RANGE</code> hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.</p> </blockquote> <p>For the difference between coalesce and repartition, see link [3]:</p> <blockquote> <p>The repartition algorithm does a <strong>full shuffle</strong> of the data and creates equal sized partitions of data. coalesce combines existing partitions to avoid a <strong>full shuffle</strong>.</p> </blockquote> <p>Reading the source code confirms this: RDD's repartition simply calls the coalesce function with the shuffle parameter set to true. Since our goal here is just to reduce the number of small files, <code class="language-plaintext highlighter-rouge">coalesce</code> works.</p> <div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/** * Return a new RDD that has exactly numPartitions partitions. * * Can increase or decrease the level of parallelism in this RDD. Internally, this uses * a shuffle to redistribute data. * * If you are decreasing the number of partitions in this RDD, consider using `coalesce`, * which can avoid performing a shuffle. 
*/</span> <span class="c1">// repartition calls coalesce with the shuffle flag set to true</span> <span class="k">def</span> <span class="nf">repartition</span><span class="o">(</span><span class="n">numPartitions</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)(</span><span class="k">implicit</span> <span class="n">ord</span><span class="k">:</span> <span class="kt">Ordering</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="kc">null</span><span class="o">)</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="n">withScope</span> <span class="o">{</span> <span class="nf">coalesce</span><span class="o">(</span><span class="n">numPartitions</span><span class="o">,</span> <span class="n">shuffle</span> <span class="k">=</span> <span class="kc">true</span><span class="o">)</span> <span class="o">}</span> <span class="cm">/** * Return a new RDD that is reduced into `numPartitions` partitions. * * This results in a narrow dependency, e.g. if you go from 1000 partitions * to 100 partitions, there will not be a shuffle, instead each of the 100 * new partitions will claim 10 of the current partitions. If a larger number * of partitions is requested, it will stay at the current number of partitions. * * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, * this may result in your computation taking place on fewer nodes than * you like (e.g. one node in the case of numPartitions = 1). To avoid this, * you can pass shuffle = true. This will add a shuffle step, but means the * current upstream partitions will be executed in parallel (per whatever * the current partitioning is). * * @note With shuffle = true, you can actually coalesce to a larger number * of partitions. 
This is useful if you have a small number of partitions, * say 100, potentially with a few partitions being abnormally large. Calling * coalesce(1000, shuffle = true) will result in 1000 partitions with the * data distributed using a hash partitioner. The optional partition coalescer * passed in must be serializable. */</span> <span class="k">def</span> <span class="nf">coalesce</span><span class="o">(</span><span class="n">numPartitions</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">shuffle</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">false</span><span class="o">,</span> <span class="n">partitionCoalescer</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">PartitionCoalescer</span><span class="o">]</span> <span class="k">=</span> <span class="nv">Option</span><span class="o">.</span><span class="py">empty</span><span class="o">)</span> <span class="o">(</span><span class="k">implicit</span> <span class="n">ord</span><span class="k">:</span> <span class="kt">Ordering</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="kc">null</span><span class="o">)</span> <span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="n">withScope</span> <span class="o">{</span> <span class="nf">require</span><span class="o">(</span><span class="n">numPartitions</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="o">,</span> <span class="n">s</span><span class="s">"Number of partitions ($numPartitions) must be positive."</span><span class="o">)</span> <span class="nf">if</span> <span class="o">(</span><span class="n">shuffle</span><span class="o">)</span> <span class="o">{</span> <span class="cm">/** Distributes elements evenly across 
output partitions, starting from a random partition. */</span> <span class="k">val</span> <span class="nv">distributePartition</span> <span class="k">=</span> <span class="o">(</span><span class="n">index</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">items</span><span class="k">:</span> <span class="kt">Iterator</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=&gt;</span> <span class="o">{</span> <span class="k">var</span> <span class="n">position</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Random</span><span class="o">(</span><span class="nv">hashing</span><span class="o">.</span><span class="py">byteswap32</span><span class="o">(</span><span class="n">index</span><span class="o">)).</span><span class="py">nextInt</span><span class="o">(</span><span class="n">numPartitions</span><span class="o">)</span> <span class="nv">items</span><span class="o">.</span><span class="py">map</span> <span class="o">{</span> <span class="n">t</span> <span class="k">=&gt;</span> <span class="c1">// Note that the hash code of the key will just be the key itself. 
The HashPartitioner</span> <span class="c1">// will mod it with the number of total partitions.</span> <span class="n">position</span> <span class="k">=</span> <span class="n">position</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">(</span><span class="n">position</span><span class="o">,</span> <span class="n">t</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="k">:</span> <span class="kt">Iterator</span><span class="o">[(</span><span class="kt">Int</span>, <span class="kt">T</span><span class="o">)]</span> <span class="c1">// include a shuffle step so that our upstream tasks are still distributed</span> <span class="k">new</span> <span class="nc">CoalescedRDD</span><span class="o">(</span> <span class="k">new</span> <span class="nc">ShuffledRDD</span><span class="o">[</span><span class="kt">Int</span>, <span class="kt">T</span>, <span class="kt">T</span><span class="o">](</span> <span class="nf">mapPartitionsWithIndexInternal</span><span class="o">(</span><span class="n">distributePartition</span><span class="o">,</span> <span class="n">isOrderSensitive</span> <span class="k">=</span> <span class="kc">true</span><span class="o">),</span> <span class="k">new</span> <span class="nc">HashPartitioner</span><span class="o">(</span><span class="n">numPartitions</span><span class="o">)),</span> <span class="n">numPartitions</span><span class="o">,</span> <span class="n">partitionCoalescer</span><span class="o">).</span><span class="py">values</span> <span class="o">}</span> <span class="k">else</span> <span class="o">{</span> <span class="k">new</span> <span class="nc">CoalescedRDD</span><span class="o">(</span><span class="k">this</span><span class="o">,</span> <span class="n">numPartitions</span><span class="o">,</span> <span class="n">partitionCoalescer</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div></div> <p>The final optimized <code class="language-plaintext 
highlighter-rouge">SQL</code> statement:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="n">overwrite</span> <span class="k">TABLE</span> <span class="n">testdb</span><span class="p">.</span><span class="n">testtable</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="n">p1</span><span class="p">,</span> <span class="n">dt</span><span class="p">,</span> <span class="n">hour</span><span class="p">)</span> <span class="k">SELECT</span> <span class="cm">/*+ COALESCE(40) */</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">UNION</span> <span class="k">ALL</span> <span class="k">SELECT</span> <span class="cm">/*+ COALESCE(20) */</span> <span class="p">...</span> <span class="k">FROM</span> <span class="n">t2</span> </code></pre></div></div> <p>After this change, the output size stays essentially the same as before.</p> <p><strong>Questions to ponder:</strong></p> <ol> <li>Why were there 800 (400+400) <code class="language-plaintext highlighter-rouge">partition</code>s before?</li> <li>Could this problem be solved by setting <code class="language-plaintext highlighter-rouge">spark.sql.shuffle.partitions=40</code> instead, and what would be the difference?</li> <li>What is the difference between repartition and coalesce?</li> <li>How many kinds of repartition methods does Spark provide?</li> </ol> <h2 id="参考链接">References</h2> <ol> <li>https://spark.apache.org/docs/3.0.1/sql-ref-syntax-qry-select-hints.html</li> <li>https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4#.36o8a7b5j</li> <li>https://stackoverflow.com/questions/54218006/why-does-the-repartition-method-increase-file-size-on-disk</li> </ol>utf7/Yechao Chen在 Spark SQL 使用 REPARTITION Hint 来减少小文件输出
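<p>As a quick sanity check on the 40/20 numbers chosen in the hints, the sizing rule used in this post (total data volume divided by a ~100M target file size) can be sketched as a tiny helper. This is an illustrative Python snippet with made-up hourly volumes, not code from the job itself:</p>

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def num_partitions(total_bytes: int, target_file_bytes: int = 100 * MB) -> int:
    """Partition count so that each output file lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Hypothetical volumes for the two UNION ALL branches of one hourly partition:
print(num_partitions(4 * GB))  # -> 41, i.e. ~100 MB per file
print(num_partitions(2 * GB))  # -> 21
```

<p>With roughly 4G and 2G per branch this lands close to the 40 and 20 partitions used in the REPARTITION/COALESCE hints above; rounding to a convenient number keeps each file slightly under the HDFS block size.</p>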