Abstract
Probabilistic inference over large data sets is a challenging data management problem since exact inference is generally #P-hard and is most often solved approximately with sampling-based methods today. This paper proposes an alternative approach for approximate evaluation of conjunctive queries with standard relational databases: In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking methods from graphs to hypergraphs. We also adapt three relational query optimization techniques to evaluate all necessary plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.
Similar content being viewed by others
Notes
W.l.o.g. we assume \(\varvec{\mathbf {x}}_i\) to be a tuple of only variables and don’t write the constants. Selections can always be directly pushed into the database before executing the query.
Defined formally as \({\textit{ADom}}_{x_j} = \bigcup _{i: x_j \in {\texttt {Var}}(R_i)} \pi _{x_j}(R_i)\).
Extensional approaches compute the probability of any formula as a function of the probabilities of its subformulas according to syntactic rules, regardless of how those were derived. Intensional approaches reason in terms of possible worlds and keep track of dependencies [56].
Incidence matrices allow us to compactly reason about two types of relationships between variables and relations of sf-free CQs simultaneously: (i) in a column: a variable that is shared across relations, and (ii) in a row: relations that are joined by a variable. They thus allow us to reason about both the “query hypergraph” and the “dual query hypergraph” at the same time, which is helpful also for other types of problems involving sf-free CQs (see, e.g. [24]).
A conjunctive k-chain query is a query q without self-joins in which each relation is binary, all relations are joined together, and there is no single variable common to more than two relations. Furthermore, the first and last variable are head variables and can be replaced by constants: \(q(x_1, x_{k+1}) {\,:\!\!-\,}R_1(x_1, x_2), R_2(x_2, x_3), \ldots , R_k(x_k, x_{k+1})\). The fact that relations are binary entails that the query hypergraph is actually a standard graph. Similarly, the fact that a variable is not common to more than two relations also entails the “dual hypergraph” to be a graph as well. The expression chain query derives from the observation that both its hypergraph and dual hypergraph resemble a simple chain.
Notice that dissociating a table on any head variable has no implication on the probability of a query result as it does not change its lineage. We therefore only focus on dissociating existential variables.
Recall that we say a query is connected if all subgoals are connected by considering only existential variables \({\texttt {EVar}}(q)\). In other words, when computing query components we remove head variables from the query: \(q - {\texttt {HVar}}(q)\). An alternative way to write this is to first substitute all head variables by constants \(q' = q[\varvec{\mathbf {a}} / \varvec{\mathbf {x}}]\) (here \(q[\varvec{\mathbf {a}} / \varvec{\mathbf {x}}]\) denotes the query obtained by substituting each head variable \(x_i \in \varvec{\mathbf {x}}\) with the constant \(a_i \in \varvec{\mathbf {a}}\)), then to let \(q_1, \ldots , q_k\) be the components of \(q'\) connected by any variable. The query is connected if \(k=1\), otherwise it is disconnected, and \(\forall i \ne j: {\texttt {Var}}(q_i) \cap {\texttt {Var}}(q_j) \subseteq {\texttt {HVar}}(q)\).
This follows from the recursive definition of the unique safe plan of a query in Lemma 5: the top-most projection consists precisely of its separator variables.
Note that if there are no existential variables (\(\varvec{\mathbf {z}} = \varvec{\mathbf {x}}_i\)), then there is no need for the projection operator and one could instead simplify to \(\mathscr {P} \leftarrow \{R_i(\varvec{\mathbf {z}})\}\), instead of \(\mathscr {P} \leftarrow \{\pi ^p_{\varvec{\mathbf {z}}} R_i(\varvec{\mathbf {x}}_i)\}\).
A Boolean conjunctive k-star query is a query with k unary relations and one k-ary relation: \(q {\,:\!\!-\,}R_1(x_1), \ldots , R_k(x_k), U(x_1, \ldots , x_k)\). The fact that each variable appears in exactly two relations implies that the dual query hypergraph is actually a standard graph (the dual hypergraph of a query is defined by the relations as vertices and variables as the hyperedges). The expression star query derives from the observation that the query’s dual (hyper)graph resembles a star with the table U connected to all other relations.
E.g., if \(\varvec{\mathbf {x}} = \{y \}\) and \(\varvec{\Gamma } = \{ x \rightarrow y, y \rightarrow z, z \rightarrow u \}\), then \({\varvec{\mathbf {x}}}^+ = \{y, z, u \}\).
The time needed for the lineage query thus serves as minimum benchmark for any probabilistic approximation. The reported times for SampleSearch and MC are the sum of time for retrieving the lineage plus the actual calculations, without the time for reading and writing the input and output files for SampleSearch.
Results for MC with other parameters of $2 are similar. However, the evaluation time for the experiments becomes quickly infeasible.
References
Amarilli, A., Amsterdamer, Y., Milo, T.: Uncertainty in crowd data sourcing under structural constraints. In: DASFAA Workshops, pp. 351–359 (2014)
Antova, L., Jansen, T., Koch, C., Olteanu, D.: Fast and simple relational processing of uncertain data. In: ICDE, pp. 983–992 (2008)
Antova, L., Koch, C., Olteanu, D.: MayBMS: managing incomplete information with probabilistic world-set decompositions. In: ICDE, pp. 1479–1480 (2007)
Beame, P., Li, J., Roy, S., Suciu, D.: Model counting of query expressions: limitations of propositional methods. In: ICDT, pp. 177–188 (2014)
Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using BANKS. In: ICDE, pp. 431–440 (2002)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. 30(1–7), 107–117 (1998)
Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: AAAI (2010)
Chen, Y., Wang, D.Z.: Knowledge expansion over probabilistic knowledge bases. In: SIGMOD, pp. 649–660 (2014)
Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18(3), 288–321 (2000)
Colbourn, C.J.: The Combinatorics of Network Reliability. Oxford University Press, New York (1987)
Crestani, F.: Application of spreading activation techniques in information retrieval. Artif. Intell. Rev. 11(6), 453–482 (1997)
Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB J. 16(4), 523–544 (2007)
Dalvi, N.N., Suciu, D.: The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM 59(6), 30 (2012)
Davis, M., Putnam, H.: A computing procedure for quantification theory. J. ACM 7(3), 201–215 (1960)
DeepDive: http://deepdive.stanford.edu/
Detwiler, L., Gatterbauer, W., Louie, B., Suciu, D., Tarczy-Hornoch, P.: Integrating and ranking uncertain scientific data. In: ICDE, pp. 1235–1238 (2009)
Domingos, Pedro, Lowd, Daniel: Markov Logic: An Interface Layer for Artificial Intelligence. Morgan & Claypool Publishers, San Rafael (2009)
Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: KDD, pp. 601–610 (2014)
Dylla, M., Miliaraki, I., Theobald, M.: Top-k query processing in probabilistic databases with non-materialized views. In: ICDE, pp. 122–133 (2013)
Ermis, B., Bouchard, G.: Iterative splits of quadratic bounds for scalable binary tensor factorization. In: UAI, pp. 192–199 (2014)
Fink, R., Huang, J., Olteanu, D.: Anytime approximation in probabilistic databases. VLDB J. 22(6), 823–848 (2013)
Fink, R., Olteanu, D.: On the optimal approximation of queries using tractable propositional languages. In: ICDT, pp. 174–185 (2011)
Fink, R., Olteanu, D.: A dichotomy for non-repeating queries with negation in probabilistic databases. In: PODS, pp. 144–155 (2014)
Freire, C., Gatterbauer, W., Immerman, N., Meliou, A.: The complexity of resilience and responsibility for self-join-free conjunctive queries. PVLDB 9(3), 180–191 (2015)
Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15(1), 32–66 (1997)
Gatterbauer, W., Günnemann, S., Koutra, D., Faloutsos, C.: Linearized and single-pass belief propagation. PVLDB 8(5), 581–592 (2015)
Gatterbauer, W., Jha, A.K., Suciu, D.: Dissociation and propagation for efficient query evaluation over probabilistic databases. In: Proceedings of 4th International VLDB workshop on Management of Uncertain Data (MUD), pp. 83–97 (2010)
Gatterbauer, W., Suciu, V.: Dissociation and propagation for approximate lifted inference with standard relational database management systems (2013). arXiv:1310.6257 [cs.DB]
Gatterbauer, W., Suciu, D.: Oblivious bounds on the probability of Boolean functions. ACM Trans. Database Syst. (TODS) 39(1), 5 (2014)
Gatterbauer, W., Suciu, D.: Approximate lifted inference with probabilistic databases. PVLDB 8(5), 629–640 (2015)
Gogate, V., Dechter, R.: SampleSearch: importance sampling in presence of determinism. Artif. Intell. 175(2), 694–729 (2011)
Gogate, V., Domingos, P.: Formula-based probabilistic inference. In: UAI, pp. 210–219 (2010)
Gogate, V., Domingos, P.: Probabilistic theorem proving. In: UAI, pp. 256–265 (2011)
Gomes, C.P., Sabharwal, A., Selman, B.: Model counting. In: Handbook of Satisfiability, pp. 633–654 (2009)
Goyal, A., Bonchi, F., Lakshmanan, L.V.S.: Learning influence probabilities in social networks. In: WSDM, pp. 241–250 (2010)
Grädel, E., Gurevich, Y., Hirsch, C.: The complexity of query reliability. In: PODS, pp. 227–234 (1998)
Gribkoff, E., Suciu, D.: Slimshot: in-database probabilistic inference for knowledge bases. PVLDB 9(7), 552–563 (2016)
Guha, R.V., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and distrust. In: WWW, pp. 403–412 (2004)
Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: Yago2: a spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194, 28–61 (2013)
Jaeger, M., Van den Broeck, G.: Liftability of probabilistic inference: upper and lower bounds. In: StaRAI (2012)
Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C.M., Haas, P.J.: MCDB: a Monte Carlo approach to managing uncertain data. In: SIGMOD, pp. 687–700 (2008)
Jha, A., Olteanu, D., Suciu, D.: Bridging the gap between intensional and extensional query evaluation in probabilistic databases. In: EDBT, pp. 323–334 (2010)
Jha, A., Suciu, D.: Probabilistic databases with MarkoViews. PVLDB 5(11), 1160–1171 (2012)
Joshi, S., Jermaine, C.M.: Sampling-based estimators for subset-based queries. VLDB J. 18(1), 181–202 (2009)
Kennedy, O., Koch, C.: PIP: a database system for great and small expectations. In: ICDE, pp. 157–168 (2010)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
McSherry, F., Najork, M.: Computing information retrieval performance measures efficiently in the presence of tied scores. In: ECIR, pp. 414–421 (2008)
Microsoft SQL Server 2012. http://www.microsoft.com/sqlserver
Moerkotte, G.: Building query compilers. Draft version 03 Mar 2009
Niu, F., Ré, C., Doan, A., Shavlik, J.W.: Tuffy: scaling up statistical inference in markov logic networks using an RDBMS. PVLDB 4(6), 373–384 (2011)
OEIS: The on-line encyclopedia of integer sequences: http://oeis.org/
Olteanu, D., Huang, J.: Using OBDDs for efficient query evaluation on probabilistic databases. In: SUM, pp. 326–340 (2008)
Olteanu, D., Huang, J., Koch, C.: Sprout: lazy vs. eager query plans for tuple-independent probabilistic databases. In: ICDE, pp. 640–651 (2009)
Olteanu, D., Huang, J., Koch, C.: Approximate confidence computation in probabilistic databases. In: ICDE, pp. 145–156 (2010)
Pasternack, J., Roth, D.: Knowing what to believe (when you already know something). In: COLING, pp. 877–885 (2010)
Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo (1988)
Poole, D.: First-order probabilistic inference. In: IJCAI, pp. 985–991 (2003)
PostgreSQL 9.2. http://www.postgresql.org/download/
Quillian, M.R.: Semantic memory. In: Semantic Information Processing, pp. 227–270. MIT Press (1968)
Raghunathan, R., De, S., Kambhampati, S.: Bayesian networks for supporting query processing over incomplete autonomous databases. J. Intell. Inf. Syst. 42(3), 595–618 (2014)
Ré, C., Dalvi, N.N., Suciu, D.: Query evaluation on probabilistic databases. IEEE Data Eng. Bull. 29(1), 25–31 (2006)
Ré, C., Dalvi, N.N., Suciu, D.: Efficient top-k query evaluation on probabilistic data. In: ICDE, pp. 886–895 (2007)
Ré, C., Suciu, D.: Approximate lineage for probabilistic databases. PVLDB 1(1), 797–808 (2008)
Roy, S., Perduca, V., Tannen, V.: Faster query answering in probabilistic databases using read-once functions. In: ICDT, pp. 232–243 (2011)
Rumelhart, D.E., Hinton, G.E.,Williams, R.J.: Learning internal representations by error propagation. In: Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, pp 318–362. MIT Press (1986)
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23–34 (1979)
Sen, P., Deshpande, A.: Representing and querying correlated tuples in probabilistic databases. In: ICDE, pp. 596–605 (2007)
Sen, P., Deshpande, A., Getoor, L.: Read-once functions and query evaluation in probabilistic databases. PVLDB 3(1), 1068–1079 (2010)
Singh, A.P., Gordon, G.J.: Relational learning via collective matrix factorization. In: KDD, pp. 650–658 (2008)
Stoyanovich, J., Davidson, S.B., Milo, T., Tannen, V.: Deriving probabilistic databases with inference ensembles. In: ICDE, pp. 303–314 (2011)
TPC-H Benchmark. http://www.tpc.org/tpch/
Van den Broeck, G., Choi, A., Darwiche, A.: Lifted relax, compensate and then recover: from approximate to exact lifted probabilistic inference. In: UAI, pp. 131–141 (2012)
Van den Broeck, G., Meert, W., Darwiche, A.: Skolemization for weighted first-order model counting. In: KR (2014)
Van den Broeck, G., Suciu, D.: Lifted probabilistic inference in relational models. In: UAI tutorials (2014)
Van den Broeck, G., Taghipour, N., Meert, W., Davis, J., De Raedt, L.: Lifted probabilistic inference by first-order knowledge compilation. In: IJCAI, pp. 2178–2185 (2011)
Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: STOC, pp. 137–146 (1982)
Weston, J., Elisseeff, A., Zhou, D., Leslie, C.S., Noble, W.S.: Protein ranking: from local to global structure in the protein similarity network. Proc Natl Acad Sci USA 101(17), 6559–6563 (2004)
Yin, X., Han, J., Philip, S.Y.: Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20(6), 796–808 (2008)
Zeng, K., Gao, S., Mozafari, B., Zaniolo, C.: The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: SIGMOD, pp. 277–288 (2014)
Zhang, C., Ré, C.: Towards high-throughput Gibbs sampling at scale: a study across storage managers. In: SIGMOD, pp. 397–408 (2013)
Acknowledgments
This work was supported in part by NSF Grants IIS-0513877, IIS-0713576, IIS-0915054, IIS-1115188, IIS-1247469, and CAREER IIS-1553547. We like to thank Abhay Jha for help with the experiments in the workshop version of this paper, Alexandra Meliou for suggesting the name “dissociation”, and Vibhav Gogate for guidance in using his tool SampleSearch. WG would also like to thank Manfred Hauswirth for a small comment in 2007 that was crucial for the development of the ideas in this paper.
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Gatterbauer, W., Suciu, D. Dissociation and propagation for approximate lifted inference with standard relational database management systems. The VLDB Journal 26, 5–30 (2017). https://doi.org/10.1007/s00778-016-0434-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-016-0434-5