Our research aims to develop models, algorithms, and tools for highly efficient, easy-to-use data and knowledge management. Throughout our research, performance at scale is a core concern, which we address, among other techniques, by designing algorithms for a cloud (massively parallel) setting. In addition, we explore and mine rich data via machine learning techniques. Our scientific contributions fall into four interconnected areas:
As the world's affairs become increasingly digital, a large and varied set of data sources becomes available. These sources are either structured databases, such as government-gathered data (demographics, economics, taxes, elections, ...), legal records, and stock quotes for specific companies, or unstructured and semi-structured sources, including in particular graph data, sometimes endowed with semantics (see, e.g., the Linked Open Data cloud). Modern data management applications, such as data journalism, are eager to combine in innovative ways both static and dynamic information coming from structured, semi-structured, and unstructured databases and social feeds. However, current content management tools are not suited for this task, in particular when they require a lengthy, rigid cycle of data integration and consolidation in a warehouse. Thus, we need flexible tools allowing us to interconnect various kinds of data sources and query them together.
Semantic graphs, including data and knowledge, are hard for users to apprehend, due to the complexity of their structure and, often, their large volume. To help tame this complexity, our research follows several avenues. First, we build compact summaries of Semantic Web (RDF) graphs suited for a first-sight interaction with the data. Second, we devise fully automated methods of exploring RDF graphs using aggregate queries which, when evaluated over a given input graph, yield interesting results (with interestingness understood in a formal, statistical sense). Third, we study the exploration of highly heterogeneous data graphs resulting from integrating structured, semi-structured, and unstructured (text) data. In this context, we develop data abstraction methods, showing the structure of any dataset to a novice user, as well as methods for searching the graph through keyword queries.
Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take user performance goals and budgetary constraints for a task, collectively referred to as task objectives, and automatically configure an analytic job to achieve those objectives. Our goal is to develop a data analytics optimizer that can automatically determine a cluster configuration, with a suitable number of cores and other runtime system parameters, that best meets the task objectives. To achieve this, we also need to design a multi-objective optimizer that constructs a Pareto-optimal set of job configurations for task-specific objectives and recommends new job configurations to best meet these objectives.
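To fix ideas, the following minimal sketch illustrates the Pareto-dominance reasoning at the heart of such a multi-objective optimizer; the configuration parameters, objective values, and class names are purely illustrative and are not those of our actual system.

```python
from dataclasses import dataclass

@dataclass
class Config:
    cores: int
    latency_s: float   # estimated job latency (to minimize)
    cost_usd: float    # estimated monetary cost (to minimize)

def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    return (a.latency_s <= b.latency_s and a.cost_usd <= b.cost_usd
            and (a.latency_s < b.latency_s or a.cost_usd < b.cost_usd))

def pareto_front(candidates: list[Config]) -> list[Config]:
    """Keep only the configurations not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

if __name__ == "__main__":
    candidates = [Config(8, 120.0, 2.4), Config(16, 70.0, 3.1),
                  Config(32, 55.0, 5.8), Config(16, 80.0, 3.5)]
    for c in pareto_front(candidates):
        print(c)
```

The optimizer then recommends one configuration from this front according to the user's stated preference between the objectives.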
Database engines are migrating to the cloud to leverage the opportunities for efficient resource management by adapting to the variations and heterogeneity of the workloads. Resource management in a virtualized setting, like the cloud, must be enforced in a performance-efficient manner to avoid introducing execution overheads. We design elastic systems that change their configuration at runtime, with minimal cost, to adapt to the workload as it evolves. Changes in the design include both different resource allocations and different data layouts. We consider different workloads, including transactional, analytical, and mixed, and we study the performance implications of different configurations to propose a set of adaptive algorithms.
Argumentation appears when we evaluate the validity of new ideas, convince an addressee, or solve a difference of opinion. An argument contains a statement to be validated (a proposition also called claim or conclusion), a set of backing propositions (called premises, which should be accepted ideas), and a logical connection between all the pieces of information presented that allows the inference of the conclusion. In our work, we focus on fallacious arguments, where the evidence does not prove or disprove the claim; for example, in an "ad hominem" argument, a claim is declared false because the person making it has a character flaw. We study the impact of fallacies in online discussions and show the need for better tools for their detection. In addition, we look into detecting verifiable claims made by politicians. We started a collaboration with RadioFrance and with Wikidébats, a debate platform focused on providing quality arguments on controversial topics.
We are witnessing a massive shift in the way people consume information. In the past, people had an active role in selecting the news they read. More recently, information started to appear in people's social media feeds as a byproduct of their social relations. We now see a new shift brought by the emergence of online advertising platforms, where third parties can pay ad platforms to show specific information to particular groups of people through paid targeted ads. AI-driven algorithms power these targeting technologies. Our goal is to study the risks of AI-driven information targeting at three levels: (1) human level: under which conditions targeted information can influence an individual's beliefs; (2) algorithmic level: under which conditions AI-driven targeting algorithms can exploit people's vulnerabilities; and (3) platform level: whether targeting technologies lead to biases in the quality of information that different groups of people receive and assimilate. We will then use this understanding to propose protection mechanisms for platforms, regulators, and users.
Cloud computing services are developing strongly, and more and more companies and institutions run their computations in the cloud to avoid the hassle of operating their own infrastructure. Today’s cloud service providers guarantee machine availability in their Service Level Agreement (SLA), without any guarantee on performance measures for a given cost budget. Running analytics on big data systems requires the user not only to reserve suitable cloud instances over which the big data system will run, but also to set many system parameters, such as the degree of parallelism and the granularity of scheduling. Choosing values for these parameters, as well as the cloud instances, so as to meet user objectives on latency, throughput, and cost is a complex task when done manually. Hence, the need to shift cloud service models from availability guarantees to user performance objectives arises, leading to a multi-objective optimization problem. Research carried out in the team within the ERC project “Big and Fast Data Analytics” aims to develop a novel optimization framework providing guarantees on performance while controlling the cost of data processing in the cloud.
Modern journalism increasingly relies on content management technologies to represent, store, and query source data and media objects themselves. Writing news articles increasingly requires consulting several sources, interpreting their findings in context, and establishing links between related sources of information. Cedar research results directly applicable to this area provide techniques and tools for rich Web content warehouse management. Within the SourcesSay AI Chair project, we work to devise concrete algorithms and platforms to help journalists perform their work better and/or faster. This work is carried out in collaboration with journalists from RadioFrance (the “Le vrai du faux” team).
Political discussions revolve around ideological conflicts that often split the audience into two opposing parties. Both parties try to win the argument by bringing forward information. However, often this information is misleading, and its dissemination employs propaganda techniques. We investigate the impact of propaganda in online forums and we study a particular type of propagandist content, the fallacious argument. We show that identifying such arguments remains a difficult task, but one of high importance because of the pervasiveness of this type of discourse. We also explore trends around the diffusion and consumption of propaganda and how this can impact or be a reflection of society.
The enormous financial success of online advertising platforms is partially due to the precise targeting features they offer. Ad platforms collect large amounts of data on users and use powerful AI-driven algorithms to infer users’ fine-grained interests and demographics, which they make available to advertisers to target users. For instance, advertisers can target groups of users as small as tens or hundreds, and as specific as ‘‘people interested in anti-abortion movements that have a particular education level’’. Ad platforms also employ AI-driven targeting algorithms to predict how ‘‘relevant’’ ads are to particular groups of people, in order to decide to whom to deliver them. While these targeting technologies create opportunities for businesses to reach interested parties and lead to economic growth, they also open the way for interested groups to use users’ data to manipulate them, by targeting messages that resonate with each user.
Our work on Big Data and AI techniques applied to data journalism and fact-checking has attracted attention beyond our community and has been disseminated in general-audience settings, for instance through I. Manolescu's participation in panels at Médias en Seine and at the Colloque Morgenstern at Inria Sophia, and through invited keynotes, e.g., at DEBS 2022 and DASFAA 2022.
Our work in the SourcesSay project (Section 8.1.1), on propaganda detection (Section 8.4.1), and on ad transparency (Section 8.2), goes towards making information sharing on the Web more transparent and more trustworthy.
In the spring, Camille Pettineo, a journalist and data scientist, worked in the team. Through discussions with her, we devised new, more user-friendly interfaces for our software ConnectionLens, leading to the ConnectionStudio system.
In May 2023, O. Balalau, N. Barret, S. Ebel, T. Galizzi and I. Manolescu presented tools developed within the team (StatCheck, ConnectionStudio, and Abstra) at a meet-up of DataJournos, a data journalism association. Approximately 40 people attended our presentation. This has led to new collaborations between O. Balalau and ADEME/AEF on the topic of automatically detecting greenwashing through natural language analysis.
In July 2023, N. Barret and M. Mohanty gave a tutorial on ConnectionStudio and how to use it for data journalism, at the “Forum Medias et Développement” organised by CFI, the French media development agency.
In September 2023, O. Balalau, T. Bouganim and I. Manolescu participated in the SciCar ("Where Science meets Computer-Assisted Reporting") conference in Dortmund, Germany. We presented our collaboration with S. Horel (Le Monde).
In October 2023, O. Balalau, S. Ebel and T. Galizzi presented StatCheck, our fact-checking tool, at an AI day organized by RadioFrance.
In November 2023, I. Manolescu participated in a debate on “New (AI) tools for media” at the “Médias en Seine” journalism conference.
ConnectionStudio integrates highly heterogeneous data into graphs enriched with extracted entities. Studio users can discover the entities in their data, navigate across connections between datasets, and explore and query the data in many ways. The Studio currently supports CSV, JSON, XML, RDF, text, property graphs, all Office formats, and PDF datasets.
ConnectionStudio is a novel front-end to ConnectionLens, Abstra and PathWays (see also the respective Web sites). Its own novel features are outlined in a CoopIS 2023 article.
Work carried out within the ANR AI Chair SourcesSay project has focused on developing a platform for integrating arbitrary heterogeneous data into a graph, then exploring and querying that graph in a simple, intuitive manner through keyword search. The main technical challenges are: (i) how to interconnect structured and semi-structured data sources? We address this through information extraction (when an entity appears in two data sources or in two places in the same graph, we only create one node, thus interlinking the two locations) and through similarity comparisons; (ii) how to find all connections between nodes matching specific search criteria or certain keywords? This question is particularly challenging in our context, since ConnectionLens graphs can be quite large and query answers can traverse edges in both directions.
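As an illustration of point (i), the sketch below (greatly simplified and hypothetical, not the actual ConnectionLens code) shows how reusing a single node per extracted entity interlinks records coming from different sources; it assumes the networkx library, and all data shown is invented.

```python
import networkx as nx

def add_record(g: nx.MultiDiGraph, source: str, record: dict, entities: list[str]) -> None:
    """Add one record node from a data source and link it to shared entity nodes."""
    record_node = f"{source}:{record['id']}"
    g.add_node(record_node, kind="record", source=source, **record)
    for entity in entities:
        # One node per entity value: a second occurrence reuses the same node,
        # which is what interconnects otherwise unrelated sources.
        g.add_node(entity, kind="entity")
        g.add_edge(record_node, entity, label="mentions")

g = nx.MultiDiGraph()
add_record(g, "csv_contracts", {"id": "c1", "amount": 10000}, ["ACME Corp"])
add_record(g, "tweets", {"id": "t7", "text": "ACME Corp wins contract"}, ["ACME Corp"])

# Both records are now connected through the shared "ACME Corp" entity node.
print(nx.has_path(g.to_undirected(), "csv_contracts:c1", "tweets:t7"))  # True
```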
In this context, we have made the following new contributions:
ConnectionLens is available online at: ConnectionLens Gitlab repository, while ConnectionStudio is available at ConnectionStudio Gitlab repository.
To strengthen public trust and counter disinformation, computational fact-checking, leveraging digital data sources, attracts interest from journalists and the computer science community. A particular class of interesting data sources comprises statistics, that is, numerical data compiled mostly by governments, administrations, and international organizations. Statistics are often multidimensional datasets, where multiple dimensions characterize one value and the dimensions may be organized in hierarchies. To address this challenge, we developed STATCHECK, a statistic fact-checking system, in collaboration with RadioFrance. The initial technical novelties of STATCHECK were twofold: (i) we focus on multidimensional, complex-structure statistics, which have received little attention so far, despite their practical importance; and (ii) novel statistical claim extraction modules for French, an area where few resources exist. We validate the efficiency and quality of our system on large statistic datasets (hundreds of millions of facts), including the complete INSEE (French) and Eurostat (European Union) datasets, as well as French presidential election debates 2. In 2023, we have further improved the platform with: (i) the use of LLMs for both table extraction and table retrieval; (ii) an extensible and dynamic way of adding new data sources from which we gather tables (focused crawling, incremental crawling); (iii) automatic transcription and analysis of audio files; this new capability is particularly important in the collaboration with Franceinfo, since journalists work a lot with radio feeds; and (iv) detection and analysis of propaganda.
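To illustrate the core idea of checking a claimed figure against a multidimensional reference table, here is a minimal pandas sketch; the dataset, column names, values, and tolerance are hypothetical and far simpler than the actual STATCHECK pipeline.

```python
import pandas as pd

# Hypothetical multidimensional statistic: a rate broken down by region and year.
stats = pd.DataFrame({
    "region": ["France", "France", "Île-de-France"],
    "year":   [2021, 2022, 2022],
    "value":  [7.9, 7.3, 6.9],
})

def check_claim(claimed: float, dims: dict, tolerance: float = 0.05) -> str:
    """Look up the reference value matching the claimed dimensions and compare."""
    row = stats.loc[(stats[list(dims)] == pd.Series(dims)).all(axis=1)]
    if row.empty:
        return "no reference value found"
    reference = float(row["value"].iloc[0])
    relative_gap = abs(claimed - reference) / abs(reference)
    return f"reference={reference}, {'consistent' if relative_gap <= tolerance else 'inconsistent'}"

print(check_claim(7.3, {"region": "France", "year": 2022}))
```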
Online political advertising has become the cornerstone of political campaigns. The budget spent solely on political advertising in the U.S. has increased by more than 100% from the 2017-2018 U.S. election cycle to $1.6 billion during the 2020 U.S. presidential elections. Naturally, the capacity offered by online platforms to micro-target ads with political content has been worrying lawmakers, journalists, and online platforms, especially after the 2016 U.S. presidential election, where Cambridge Analytica targeted voters with political ads congruent with their personality. To curb such risks, both online platforms and regulators (through the DSA proposed by the European Commission) have agreed that researchers, journalists, and civil society need to be able to scrutinize the political ads running on large online platforms. Consequently, online platforms such as Meta and Google have implemented Ad Libraries that contain information about all political ads running on their platforms. This is the first step on a long path. Due to the volume of available data, it is impossible to go through these ads manually, and we now need automated methods and tools to assist in the scrutiny of political ads. In this work 25, we focus on political ads that are related to policy. Understanding which policies politicians or organizations promote, and to whom, is essential in determining dishonest representations. We propose automated methods based on pre-trained models to classify ads into the 14 main policy groups identified by the Comparative Agendas Project (CAP). We discuss several inherent challenges that arise. Finally, we analyze policy-related ads featured on Meta platforms during the 2022 French presidential election period.
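The sketch below illustrates the general approach of classifying ad text into policy groups with a pre-trained model. It relies on a zero-shot classifier from the HuggingFace transformers library as a stand-in for the fine-tuned models used in the paper, and the label list is only an illustrative subset of the CAP groups.

```python
from transformers import pipeline

# Illustrative subset of the CAP policy groups, used as candidate labels.
POLICY_GROUPS = ["Economy", "Health", "Environment", "Education",
                 "Immigration", "Law and Crime", "Defense"]

# A zero-shot classifier is only a stand-in here; the actual work fine-tunes
# pre-trained models on labeled ads.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ad_text = "Our plan invests in hospitals and lowers the cost of prescription drugs."
result = classifier(ad_text, candidate_labels=POLICY_GROUPS)
print(result["labels"][0], round(result["scores"][0], 3))  # most likely policy group
```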
Several targeted advertising platforms offer transparency mechanisms, but researchers and civil society have repeatedly shown that those have major limitations. In this work 22, we propose a collaborative ad transparency method to infer, without the cooperation of ad platforms, the targeting parameters used by advertisers to target their ads. Our idea is to ask users to donate data about their attributes and the ads they receive, and to use this data to infer the targeting attributes of an ad campaign. We propose a Maximum Likelihood Estimator based on a simplified Bernoulli ad delivery model. We first test our inference method through controlled ad experiments on Facebook. Then, to further investigate the potential and limitations of collaborative ad transparency, we propose a simulation framework that allows varying key parameters. We validate that our framework gives accuracies consistent with real-world observations, so that the insights from our simulations are transferable to the real world. We then perform an extensive simulation study for ad campaigns that target a combination of two attributes. Our results show that we can obtain good accuracy whenever at least ten monitored users receive an ad. This usually requires a few thousand monitored users, regardless of population size. Our simulation framework is based on a new method to generate a synthetic population with statistical properties resembling the actual population, which may be of independent interest.
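A minimal sketch of the inference principle (not the estimator from the paper): under a simplified Bernoulli delivery model, candidate targeting-attribute sets are scored by the likelihood of the observed donations and the most likely one is kept. All data, attribute names, and probabilities below are illustrative.

```python
import math
from itertools import combinations

# Donated data: each monitored user has a set of attributes and a flag
# saying whether they received the ad campaign under study (illustrative data).
users = [
    ({"age:18-25", "interest:sports"}, True),
    ({"age:18-25", "interest:music"},  False),
    ({"age:26-35", "interest:sports"}, False),
    ({"age:18-25", "interest:sports"}, True),
]

def log_likelihood(target: set, reach: float = 0.8, noise: float = 0.01) -> float:
    """Simplified Bernoulli model: a user matching the targeting sees the ad with
    probability `reach`, a non-matching user with a small `noise` probability."""
    ll = 0.0
    for attrs, received in users:
        p = reach if target <= attrs else noise
        ll += math.log(p if received else 1.0 - p)
    return ll

all_attrs = sorted(set().union(*(a for a, _ in users)))
candidates = [set(c) for r in (1, 2) for c in combinations(all_attrs, r)]
best = max(candidates, key=log_likelihood)
print("most likely targeting:", best)   # {'age:18-25', 'interest:sports'}
```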
Many researchers and organizations, such as WHO and UNICEF, have raised awareness of the dangers of advertisements targeted at children. While most existing laws only regulate ads on television that may reach children, lawmakers have been working on extending regulations to online advertising and, for example, forbidding (e.g., the DSA) or restricting (e.g., COPPA) advertising to children based on profiling. At first sight, ad platforms such as Google seem to protect children by not allowing advertisers to target their ads to users who are less than 18 years old. However, this work 24 shows that other targeting features can be exploited to reach children. For example, on YouTube, advertisers can target their ads to users watching a particular video through placement-based targeting, a form of contextual targeting. Hence, advertisers can target children by simply placing their ads in children-focused videos. Through a series of ad experiments, we show that placement-based targeting is possible on children-focused videos and, hence, enables marketing to children. In addition, our ad experiments show that advertisers can use targeting based on profiling (e.g., interest, location, behavior) in combination with placement-based advertising on children-focused videos. We discuss the lawfulness of these two practices with respect to the DSA and COPPA. Finally, we investigate to which extent real-world advertisers are employing placement-based targeting to reach children with ads on YouTube. We propose a measurement methodology consisting of building a Chrome extension able to capture ads and instrumenting six browser profiles to watch children-focused videos. Our results show that 7% of the advertisers we test use placement-based targeting. Hence, targeting children with ads on YouTube is not only hypothetically possible but also occurs in practice. We believe that the current legal and technical solutions are not enough to protect children from harm due to online advertising. A straightforward solution would be to forbid placement-based advertising on children-focused content.
Privacy-focused search engines such as DuckDuckGo, StartPage, and Qwant promote a strategy of respecting users’ privacy and promise not to track users’ search and browsing behavior. However, they rely on advertising for revenue, using Microsoft's (DuckDuckGo and Qwant) or Google's (StartPage) advertising systems. Moreover, these search engines are often silent or ambiguous on the privacy properties of the ads that appear on their search page. Our research 21 delves into the privacy properties of advertising systems used by these search engines. Our findings reveal that the privacy protections of private search engines do not sufficiently cover their advertising systems. Although these search engines refrain from identifying and tracking users and their ad clicks, the presence of ads from Google or Microsoft subjects users to the privacy-invasive practices performed by these two advertising platforms. When users click on ads on private search engines, they are often identified and tracked either by Google, Microsoft, or other third parties, through bounce tracking and UID smuggling techniques.
Growing concern over digital privacy has led to the widespread use of tracking restriction tools such as ad blockers, Virtual Private Networks (VPN), and privacy-focused web browsers. Despite these efforts, advertising companies continuously innovate to overcome these restrictions. Recently, advertising platforms like Meta have been promoting server-side tracking solutions to bypass traditional browser-based tracking restrictions.
We explore how server-side tracking technologies can link website visitors with their user accounts on Meta products. The goal is to assess the effectiveness and accuracy of employing this technology, as well as the effect of tracking restrictions on online tracking. Our methodology involves a series of experiments where we integrate Meta's client-side tracker (the Meta Pixel) and server-side technology (the Conversions API) on different web pages. We then drive traffic to these pages and evaluate the success rate of linking website visitors to their profiles on Meta products.
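For illustration, the sketch below shows what server-side event forwarding looks like from a website operator's side: the site's backend sends a visit event, keyed on hashed user identifiers, to the platform's server-side endpoint. The endpoint, versioning, and field names are modeled loosely on Meta's public Conversions API documentation and are given for illustration only, with placeholder credentials; this is not our experimental code.

```python
import hashlib
import time
import requests

PIXEL_ID = "YOUR_PIXEL_ID"      # placeholder
ACCESS_TOKEN = "YOUR_TOKEN"     # placeholder
API_VERSION = "v18.0"           # placeholder version

def sha256(value: str) -> str:
    """Identifiers are normalized and hashed before being sent."""
    return hashlib.sha256(value.strip().lower().encode()).hexdigest()

def send_page_view(email: str, client_ip: str, user_agent: str) -> None:
    """Forward a website visit from the site's server to the platform's
    server-side endpoint, bypassing any browser-based tracking restrictions."""
    payload = {
        "data": [{
            "event_name": "PageView",
            "event_time": int(time.time()),
            "action_source": "website",
            "user_data": {
                "em": [sha256(email)],            # hashed email
                "client_ip_address": client_ip,
                "client_user_agent": user_agent,
            },
        }]
    }
    url = f"https://graph.facebook.com/{API_VERSION}/{PIXEL_ID}/events"
    requests.post(url, params={"access_token": ACCESS_TOKEN}, json=payload, timeout=10)
```

Because the event is sent server-to-server, ad blockers and browser privacy settings on the visitor's side never see it, which is precisely the effect our experiments quantify.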
Humans use argumentation daily to evaluate the validity of new ideas, convince an addressee, or solve a difference of opinion. An argument contains a statement to be validated (a proposition also called claim or conclusion), a set of backing propositions (called premises, which should be accepted ideas), and a logical connection between all the pieces of information presented that allows the inference of the conclusion. In this work, we focus on fallacies: weak arguments that seem convincing even though their evidence does not prove or disprove the argument's conclusion.
Fallacy detection is part of argumentation mining, the area of natural language processing dedicated to extracting, summarizing, and reasoning over human arguments. The task is closely related to propaganda detection, where propaganda consists of a set of manipulative techniques, such as fallacies, used in a political context to enforce an agenda 3.
In the past, we have worked on propaganda 3 and fallacy detection 11. We continue this work with a CIFRE PhD, a collaboration between the Amundi company, Inria, and Télécom Paris. This thesis aims to improve fallacy detection in natural language by leveraging both language patterns and additional information, such as common-sense knowledge, encyclopedic knowledge, and logical rules. To achieve this, we focus on how fallacies can be represented and how reasoning patterns in argumentation can be classified. Amundi's interest lies in how argumentation can be applied to finding examples of greenwashing. We are currently collaborating with ADEME and AEF Info on the topic of greenwashing detection.
Open Information Extraction (OIE) is the task of extracting tuples of the form (subject, predicate, object), without any knowledge of the type and lexical form of the predicate, the subject, or the object. In this work, we focus on improving OIE quality by exploiting domain knowledge about the subject and object. More precisely, knowing that the subjects and objects in sentences are often named entities, we explore how to inject constraints into the extraction, through constrained inference and constraint-aware training. An important use case that we want to pursue next is automatically creating a knowledge base of relations between scientists and companies, i.e., identifying conflicts of interest between scientists and funding bodies, where the named entities are the names of scientists and companies, and the relation describes the conflict of interest between them. Our work 26 leverages the state-of-the-art OpenIE6 platform, which we adapt to our setting. Through a carefully constructed training dataset and constrained training, we obtain a 29.17% improvement in extraction quality.
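The sketch below illustrates the constraint in its simplest, post-hoc form: extractions whose subject or object is not anchored on a named entity are discarded. The actual work injects the constraints into OpenIE6's inference and training rather than filtering afterwards; the entities and triples shown are illustrative.

```python
def aligns_with_entity(span: str, entities: list[str]) -> bool:
    """A span satisfies the constraint if it matches (or contains) a named entity."""
    return any(e.lower() in span.lower() for e in entities)

def filter_extractions(triples: list[tuple], entities: list[str]) -> list[tuple]:
    """Constrained inference, post-hoc variant: discard (subject, predicate, object)
    tuples whose subject or object is not anchored on a named entity."""
    return [(s, p, o) for (s, p, o) in triples
            if aligns_with_entity(s, entities) and aligns_with_entity(o, entities)]

# Entities would come from an NER system; triples from an OIE system such as OpenIE6.
entities = ["Jane Doe", "ACME Corp"]
triples = [
    ("Jane Doe", "received funding from", "ACME Corp"),
    ("the study", "was published in", "2019"),
]
print(filter_extractions(triples, entities))   # keeps only the first triple
```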
This research 27 investigates the extent to which the Graph-to-Text (G2T) generation problem is addressed in existing datasets, as well as the performance of metrics for text comparison. A key focus is a new metric, FactSpotter, developed to assess the factual faithfulness of texts generated by G2T models. FactSpotter is shown to correlate highly with human annotations in terms of data correctness, coverage, and relevance. It can function as a plugin to enhance the factual accuracy of existing models; the work also examines the current challenges in G2T datasets.
FactSpotter, initially developed for evaluating the factual faithfulness of Graph-to-Text generation, has the potential to be extended into a more general text similarity metric. This would enable it to assess a wider range of text generation tasks, extending its applicability to various narrative forms and data-driven journalism.
There is an increasing gap between the fast growth of data and the limited human ability to comprehend it. Consequently, there has been a growing demand for data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work 14, we propose an interactive data exploration system as a new database service, using an approach called "explore-by-example." Our new system is designed to assist the user in performing highly effective data exploration while reducing the human effort in the process. We cast the explore-by-example problem in a principled "active learning" framework. However, traditional active learning suffers from two fundamental limitations: slow convergence and lack of robustness under label noise. To overcome these problems, we bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active-learning-based database exploration. Evaluation results using real-world datasets and user interest patterns show that our new system, both in the noise-free case and in the label-noise case, significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy, while achieving the desired efficiency for interactive data exploration.
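A minimal sketch of the explore-by-example loop with plain uncertainty sampling follows (using scikit-learn and a synthetic "user interest" region); the actual system replaces this generic active-learning strategy with algorithms that exploit properties of database queries to converge faster and resist label noise.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
tuples = rng.uniform(0, 100, size=(5000, 2))                # database tuples as feature vectors
user_interest = lambda x: (x[:, 0] > 40) & (x[:, 1] < 60)   # hidden region the user has in mind

# Start from a few labeled examples containing both relevant and irrelevant tuples.
labeled = list(rng.choice(len(tuples), 20, replace=False))
while len(set(user_interest(tuples[labeled]))) < 2:
    labeled.append(int(rng.integers(len(tuples))))

for _ in range(15):                                          # interaction rounds
    model = SVC(kernel="rbf").fit(tuples[labeled], user_interest(tuples[labeled]))
    # Uncertainty sampling: ask the user to label the tuple closest to the decision boundary.
    scores = np.abs(model.decision_function(tuples))
    scores[labeled] = np.inf
    labeled.append(int(np.argmin(scores)))

accuracy = (model.predict(tuples) == user_interest(tuples)).mean()
print(f"accuracy of the learned interest model after {len(labeled)} labels: {accuracy:.2%}")
```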
Spark has been widely used for data analytics in the cloud. Determining an optimal configuration of a Spark physical plan based on user-specified objectives is a complex task. It is challenging in three respects. Firstly, a Spark physical plan, or query, can be represented as a Directed Acyclic Graph (DAG) of “query stages,” where the parameters of each stage are controlled both at the granularity of the query (i.e., Spark-context parameters, e.g., resources shared among all stages) and at the granularity of the stage (i.e., parameters that differ from stage to stage). The correlation of parameters across these granularities makes the performance tuning of a query more complicated. Secondly, the parameters of each stage face timing constraints: Spark-context parameters must be set at compile time and cannot change during runtime, while stage-level parameters can be modified during runtime. Thirdly, Multi-Objective Optimization (MOO) is necessary when there are multiple, potentially conflicting user performance objectives such as latency and cost. This work focuses on designing algorithms that return Pareto-optimal configurations for a query whose parameters are under multi-granularity control and different timing constraints. It captures tradeoffs among the various objectives and recommends an optimal configuration based on user preferences. The expectation is to provide recommendations for all stages in a query within a few seconds.
Several real-time applications rely on dynamic graphs to model and store data arriving from multiple streams. In addition to the high ingestion rate, the storage and query execution challenges are amplified in contexts where consistency must be ensured when storing and querying the data. In this project 28, we address the challenges associated with multi-stream dynamic graph analytics. We propose a database design that can provide scalable storage and indexing to support consistent read-only analytical queries (present and historical), in the presence of real-time dynamic graph updates that arrive continuously from multiple streams.
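As an illustration of the kind of versioned storage that supports such queries, the sketch below keeps a validity interval per edge, so that read-only queries at a logical time t see a consistent snapshot while updates keep arriving. It is a toy in-memory design, not the project's actual storage and indexing scheme.

```python
from collections import defaultdict
from threading import Lock

class TemporalGraph:
    """Append-only, timestamp-versioned adjacency lists: each edge carries the
    interval during which it was valid, so a historical read at time t is
    unaffected by later updates coming from other streams."""

    def __init__(self):
        self._out = defaultdict(list)   # src -> [(dst, t_start, t_end_or_None)]
        self._lock = Lock()
        self._clock = 0

    def add_edge(self, src, dst):
        with self._lock:                # serialize updates coming from multiple streams
            self._clock += 1
            self._out[src].append((dst, self._clock, None))
            return self._clock

    def remove_edge(self, src, dst):
        with self._lock:                # removal closes the validity interval
            self._clock += 1
            self._out[src] = [(d, s, self._clock if d == dst and e is None else e)
                              for (d, s, e) in self._out[src]]
            return self._clock

    def neighbors_at(self, src, t):
        """Read-only historical query: neighbors of src as of logical time t."""
        return [d for (d, s, e) in self._out[src] if s <= t and (e is None or e > t)]

g = TemporalGraph()
t1 = g.add_edge("a", "b")
t2 = g.remove_edge("a", "b")
print(g.neighbors_at("a", t1), g.neighbors_at("a", t2))   # ['b'] []
```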
The widespread adoption of Internet-based services by software companies, as well as the scale and complexity at which they operate, have made incidents in their IT operations increasingly likely, diverse, and impactful. This has led to the rapid development of a central aspect of the "Artificial Intelligence for IT Operations" (AIOps) domain, focusing on detecting abnormal patterns in vast amounts of multivariate time series (MTS) data generated by service entities. Although numerous MTS anomaly detection methods have been developed, the state of the art still presents some limitations due to the unique challenges posed by AIOps. These challenges include 1) the presence of complex, noisy, and diverse normal behaviors, 2) the wide variety of anomaly types and the difficulty of providing detailed anomaly labels, and 3) the need to generalize to a wide variety of normal behaviors for the monitored entities.
Our research focused on designing new anomaly detection methods to address these AIOps challenges, mainly through explicit context generalization and weak supervision, as well as conducting thorough experiments to demonstrate their superiority. Concurrently, we continued developing our data science pipeline for explainable anomaly detection over MTS, to further facilitate the design and benchmarking of new techniques for the community.
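For illustration, the following sketch flags anomalous timesteps in a synthetic multivariate time series by the reconstruction error of a PCA model fitted on a window assumed to be normal; it is only a stand-in for the richer, weakly supervised models developed in this work, and all data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic multivariate time series: 2000 timesteps of 8 correlated service metrics.
latent = rng.normal(size=(2000, 2))
series = latent @ rng.normal(size=(2, 8)) + 0.05 * rng.normal(size=(2000, 8))
series[1500:1510] += 4.0          # injected anomaly: a short burst on all metrics

# Fit a model of "normal" behavior on a window assumed anomaly-free,
# then flag timesteps whose reconstruction error is far above the norm.
model = PCA(n_components=2).fit(series[:1000])
error = np.linalg.norm(series - model.inverse_transform(model.transform(series)), axis=1)
threshold = error[:1000].mean() + 5 * error[:1000].std()
print("anomalous timesteps:", np.flatnonzero(error > threshold))
```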
Hi!Paris Collaborative Project (2022-2024) coordinated by
PhD supervision: The team has supervised the following PhD students:
Engineer supervision:
Intern supervision: The team has supervised the following interns:
Part-time project supervision: The team has supervised the following part-time research projects: