#!/usr/bin/env python
# vim:ts=4:sts=4:sw=4:et
#
# Author: Hari Sekhon
# Date: 2019-11-26 10:08:52 +0000 (Tue, 26 Nov 2019)
#
# https://github.com/harisekhon/devops-python-tools
#
# License: see accompanying Hari Sekhon LICENSE file
#
# If you're using my code you're welcome to connect with me on LinkedIn
# and optionally send me feedback to help steer this or other code I publish
#
# https://www.linkedin.com/in/harisekhon
#
"""

Connect to an Impala daemon and find tables with columns containing only NULLs
for all tables in all databases, or only those matching given db / table regexes

Describes each table, constructs a complex query to check each column individually for containing only NULLs,
and prints out each table's count of total columns containing only NULLs as well as the list of offending columns

Useful for catching problems with data quality or subtle ETL bugs

Rewrite of a Perl version from 2014 from my DevOps Perl Tools repo

Tested on Impala 2.7.0, 2.12.0 on CDH 5.10, 5.16 with Kerberos and SSL

Due to a thrift / impyla bug this needs exactly thrift==0.9.3, see

https://github.com/cloudera/impyla/issues/286

If you get an error like this:

    ERROR:impala.hiveserver2:Failed to open transport (tries_left=1)
    ...
    TTransportException: TSocket read 0 bytes

then check your --kerberos and --ssl settings match the cluster's settings
(Thrift and Kerberos have the worst error messages ever)

"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import sys
srcdir = os.path.abspath(os.path.dirname(__file__))
pylib = os.path.join(srcdir, 'pylib')
sys.path.append(pylib)
try:
    # pylint: disable=wrong-import-position
    from hive_tables_null_columns import HiveTablesNullColumns
except ImportError as _:
    print('module import failed: %s' % _, file=sys.stderr)
    print("Did you remember to build the project by running 'make'?", file=sys.stderr)
    print("Alternatively perhaps you tried to copy this program out without its adjacent libraries?", file=sys.stderr)
    sys.exit(4)
__author__ = 'Hari Sekhon'
__version__ = '0.4.0'


class ImpalaTablesNullColumns(HiveTablesNullColumns):

    def __init__(self):
        # Python 2.x
        super(ImpalaTablesNullColumns, self).__init__()
        # Python 3.x
        # super().__init__()
        # these are auto-set checking sys.argv[0] in HiveImpalaCLI class
        self.name = 'Impala'
        #self.default_port = 21050
        #self.default_service_name = 'impala'


if __name__ == '__main__':
    ImpalaTablesNullColumns().main()