Proposal Title: Dialogue-Based Question Answering with Verified Information
Project supported by the Samsung Global Research Outreach Initiative
Principal Investigator: Alessandro Moschitti
Co-Investigator: Olga Uryupina
We propose to build a novel Question Answering technology, DiQuaVe, to address one of society's longstanding problems: how to separate facts from falsehoods. Our model will provide answer (re)ranking based not only on relevance but also, and most importantly, on the veracity and reliability of the information source. Our technology will address the following problems:
How to select reliable social media sources and streams and categorize their veracity.
How to integrate information fragments from multiple streams or static knowledge sources in a way that enables us to assess their value and truthfulness.
How to enrich Question Answering models by taking veracity and reliability into account.
DiQuaVe will develop algorithms able to:
search for complex structured information, e.g., topics, events or facts in social media and streams;
derive the reliability of facts, providing supporting information;
perform innovative fact veracity classification, enabling accurate fact similarity based on semantic structures and, possibly, user discussions and interactions;
rerank candidate answers based on their truthfulness and reliability to enable a novel veracity-based QA technology.
In the last decade, the Web has shown that making information of any kind widely available to everybody has positively impacted the way people live: how they work (affecting the industrial world at every level) and how they conduct daily, cultural, social, leisure and other activities. The most common way of accessing such information is through search engines. While web search algorithms were initially oriented toward keyword-style user requests, recent reports show that more and more queries are formulated as natural language questions. Moreover, users nowadays often engage in dialogues with search interfaces, submitting follow-up questions to their initial requests. The trend is even more prominent for voice interfaces, such as Siri. This shows the importance of web-based question answering, which is currently replacing keyword-oriented search.
The impact of contaminated online content on society cannot be overestimated: it is currently believed, for example, that fake news campaigns were involved in, and might have affected, the citizens' choices in the recent presidential elections in the U.S. and France. Fake news therefore poses a direct threat to basic democratic values. The media industry is in a constant 'cat and mouse' competition with contaminated content generators, struggling to protect its market against unscrupulous online publishers. As in the internet security sector, today's content verification technologies are largely reactive to yesterday's threats, and the latest wave of fake news content has several alarming new characteristics. First, the latest fake news is no longer a set of simple, easily verifiable fragments of information. Instead, it combines a mixture of true and half-true facts with biased analytics and ill-conceived interpretations, created by professional writers with the explicit purpose of misleading and manipulating the reader. Second, these news stories are typically promoted by complex, well-thought-out campaigns that imitate natural information spread patterns and are therefore hard to identify automatically. Finally, fake news is no longer an outlier: a recent journalistic study suggests that up to 30% of all news, even from reputable sources, contains misinformation, while, according to Gartner, by 2022 the majority of individuals in mature economies will consume more false information than true information. To address this issue, multiple media groups as well as independent journalistic agencies are investing in fact-checking initiatives: projects led by human experts aiming at manual verification of their news feeds.
While the significance of these projects cannot be overestimated, they are fighting a losing battle: with multiple unscrupulous content generators publishing contaminated and manipulative information at an alarming and ever-increasing rate, manual fact-checking cannot possibly keep up with pervasive and polymorphic disinformation waves.
Fact checking relies on a combination of complex techniques. A human expert analyzes a statement, identifies important claims to be fact-checked, collects evidence on each claim, and analyzes the evidence to decide whether it supports or refutes the claim. All of these tasks are challenging even for human professionals. As a result, there is no binary labeling of facts as true or fake, even when generated by human experts. Fact-checking agencies classify news into multiple categories depending on their veracity (e.g., true, half-true, half-lie, lie, pants on fire). Moreover, different experts may disagree on the labeling of a particular news item, even if the overall analysis is not disputed. For example, Fig. 1 shows a claim by Sen. Bernie Sanders comprising multiple statements involving numbers. These numbers are obviously approximations, and it is left to human experts to find evidence on the correct percentages of voters (these can be provided by different sources and may thus, again, differ) and to interpret them as being close enough to Sen. Sanders' approximations. In this particular example, two fact-checking agencies strongly disagree with each other's analyses. This example shows that fact checking requires considerable domain knowledge and relies on advanced text understanding. While considerable effort is currently being undertaken by both academic and industrial stakeholders, automatic fact-checking technology is, at best, in its infancy.
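The multi-category labels above can be viewed as an ordinal scale rather than a binary true/fake decision. The following is an illustrative sketch, not part of our system: it encodes the five categories named above and a simple normalized distance that could quantify how far two expert labelings of the same claim diverge.

```python
# Sketch: veracity labels as an ordinal scale, with a simple distance
# to quantify disagreement between two expert labelings of one claim.
# The scale follows the categories named in the text; the distance
# measure is purely illustrative.

VERACITY_SCALE = ["pants on fire", "lie", "half-lie", "half-true", "true"]

def label_distance(a, b):
    """Ordinal distance between two veracity labels, normalized to [0, 1]."""
    ia, ib = VERACITY_SCALE.index(a), VERACITY_SCALE.index(b)
    return abs(ia - ib) / (len(VERACITY_SCALE) - 1)

# Two agencies rating the same claim: adjacent categories are a mild
# disagreement, opposite ends of the scale a strong one.
print(label_distance("half-true", "half-lie"))   # 0.25
print(label_distance("true", "pants on fire"))   # 1.0
```

Under such a scale, the disagreement in the Fig. 1 example would register as a large ordinal distance even though both agencies analyzed the same underlying numbers.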
In DiQuaVe, we aim to develop a novel Question Answering technology in which veracity and reliability are integrated as key criteria for candidate answer selection and reranking. While in the first year we mainly focus on the Question Answering and Dialogue parts of our research agenda, the second year will be dedicated mostly to the Fact Checking/Veracity technology.
During the first year of DiQuaVe, we have focused on improving machine understanding for user-generated questions. To this end, we have adopted our Text Representation Pipeline for generating high-quality representations of very short texts (user questions), incorporating features suggested and tested at recent SemEval challenges. These representations help us define pairwise similarity between questions and thus are useful for a variety of tasks, e.g., question duplicate detection. However, they do not model any global domain-level interaction between questions. To address this issue, we have proposed a novel model for supervised question clustering: we formulate question clustering as a structured output problem, building on top of individual pairwise edge similarities generated by the Text Representation Pipeline. This allows us to acquire question clusters corresponding to user intents in a data-driven way, thus considerably reducing the engineering effort associated with intent detection in dialogue modeling. We have evaluated our approach on two datasets in two different languages (English and Italian), showing a consistent improvement over conventional clustering methods.
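To illustrate the underlying idea of clustering from pairwise edge similarities, the following is a minimal sketch, not our actual structured-output model: it groups questions by taking the transitive closure over high-similarity edges, with a toy word-overlap score standing in for the learned similarities produced by the Text Representation Pipeline.

```python
# Sketch: clustering questions from pairwise similarity scores via
# connected components (union-find) over edges above a threshold.
# The word-overlap similarity is a toy stand-in for learned scores.

from itertools import combinations

def cluster_questions(questions, similarity, threshold=0.5):
    """Group questions whose pairwise similarity exceeds `threshold`,
    taking the transitive closure over similarity edges."""
    parent = {q: q for q in questions}

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path compression
            q = parent[q]
        return q

    for a, b in combinations(questions, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for q in questions:
        clusters.setdefault(find(q), []).append(q)
    return list(clusters.values())

# Toy similarity: Jaccard overlap of surface tokens.
def overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

questions = [
    "How do I reset my password?",
    "How do I reset my password quickly?",
    "What are the store opening hours?",
]
print(cluster_questions(questions, overlap, threshold=0.5))
```

In our actual model, the clustering decision is learned jointly over the whole edge structure rather than fixed by a threshold, but the sketch shows how pairwise scores induce intent clusters.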
This work is described in more detail in Haponchyk et al. (2018), listed below.
During the remaining Y1 months, we are designing a prototype fact-checking system. The system will incorporate our Text Representation into a convolutional neural network to assess facts and supporting evidence. We will evaluate our prototype in two major experiments. First, we will run a stance detection evaluation on the recently released FEVER dataset. Second, we will put all the developed technology together to implement a proof of concept for veracity-aware QA.
During the first half of Y2, we focused on creating a software prototype providing the full stack of NLP modules required for fact-checking technology. We worked on integrating state-of-the-art NLP processors into the system, as well as assessing their individual performance on the FEVER data and their impact on the overall fact-checking score. Apart from software integration and fine-tuning efforts, this also required setting up a clean evaluation framework to benchmark each component individually. Moreover, building upon this system, we further advanced the state of the art by developing a novel model for passage reranking.
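The modular design described above can be sketched as three swappable stages, so that each component can be benchmarked in isolation. All component implementations below are illustrative word-overlap placeholders, not our actual retrieval, reranking, or inference modules:

```python
# Sketch of a modular fact-checking pipeline: document retrieval,
# evidence reranking, and claim verification as pluggable stages.
# Every scoring function here is a placeholder for a learned model.

def retrieve(claim, corpus, k=3):
    """Rank documents by naive word overlap with the claim (placeholder)."""
    cw = set(claim.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(cw & set(d.lower().split())))
    return ranked[:k]

def rerank(claim, passages):
    """Reorder retrieved passages; again by overlap here, standing in
    for a learned passage reranker."""
    cw = set(claim.lower().split())
    return sorted(passages, key=lambda p: -len(cw & set(p.lower().split())))

def verify(claim, evidence):
    """Map (claim, evidence) to a label; a trivial keyword rule stands in
    for an NLI model predicting SUPPORTS / REFUTES / NOT ENOUGH INFO."""
    if not evidence:
        return "NOT ENOUGH INFO"
    return "REFUTES" if "not" in evidence[0].lower().split() else "SUPPORTS"

def fact_check(claim, corpus):
    docs = retrieve(claim, corpus)
    evidence = rerank(claim, docs)
    return verify(claim, evidence), evidence

corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
label, evidence = fact_check("Paris is the capital of France", corpus)
print(label)  # SUPPORTS
```

Because each stage exposes the same interface regardless of its internals, a stage can be replaced by a stronger model and scored individually on FEVER without touching the rest of the pipeline, which is exactly what the clean evaluation framework requires.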
During the last months of the Y2 of the project, we have focused on the following directions:
- Integrating the BERT architecture into the prototype. Our original implementations of the evidence reranking and NLI inference components were based on the CNNR (Severyn and Moschitti, 2016) and ESIM (Chen et al., 2017) architectures. Recently, we replaced them with implementations based on the novel BERT architecture (Devlin et al., 2018), shown to outperform the state of the art on a wide range of Natural Language Processing tasks.
- Experimenting with the transformer architecture. Building upon our BERT-based prototype, we explored different ways of tuning transformer models for fact-checking-related tasks. In particular, we have been experimenting with the recent trend of transfer pre-finetuning (Clark et al., 2019; Phang et al., 2018), i.e., fine-tuning BERT on a related task before fine-tuning it on the final task.
- Evaluation. We evaluate our prototype on the original FEVER DEV dataset, showing that our end-to-end pipeline obtains 62.24 FEVER points and a claim label accuracy of 64.76. We further show that our implementation achieves state-of-the-art performance when reusing the document retrieval components of the best systems in the FEVER competition. Additionally, we apply our models to the novel Symmetric FEVER test set (Schuster et al., 2019), which removes certain biases from the original evaluation, and demonstrate that our pipeline outperforms the state-of-the-art results on this dataset.
- New Dataset for Veracity-based QA. We have created a pilot corpus for veracity-based Question Answering, by manually extending the annotation in a very recent fact-checking dataset (Augenstein et al., 2019).
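The distinction between FEVER points and claim label accuracy reported above can be made concrete with a simplified sketch of the metric, under our reading of the official scorer: a claim earns a FEVER point only if the predicted label is correct and, for SUPPORTS/REFUTES claims, the predicted evidence covers at least one complete gold evidence set. The sentence identifiers below are hypothetical.

```python
# Simplified sketch of the FEVER metric vs. plain label accuracy.
# A prediction scores a FEVER point only if the label matches AND,
# for SUPPORTS/REFUTES claims, the predicted evidence contains at
# least one complete gold evidence set.

def fever_score(predictions, gold):
    """Each prediction: (label, evidence), evidence = list of sentence ids.
    Each gold item: (label, list of alternative complete evidence sets)."""
    points, correct_labels = 0, 0
    for (p_label, p_evidence), (g_label, g_sets) in zip(predictions, gold):
        if p_label != g_label:
            continue
        correct_labels += 1
        if g_label == "NOT ENOUGH INFO":
            points += 1  # no evidence requirement for NEI claims
        elif any(set(s) <= set(p_evidence) for s in g_sets):
            points += 1
    n = len(gold)
    return 100 * points / n, 100 * correct_labels / n

gold = [
    ("SUPPORTS", [["d1#0"]]),
    ("REFUTES", [["d2#3", "d2#4"]]),
    ("NOT ENOUGH INFO", []),
]
predictions = [
    ("SUPPORTS", ["d1#0", "d9#7"]),   # right label, complete evidence
    ("REFUTES", ["d2#3"]),            # right label, incomplete evidence
    ("NOT ENOUGH INFO", []),
]
print(fever_score(predictions, gold))
```

The second claim illustrates why the FEVER score is always at most the label accuracy: the label is right, but the incomplete evidence forfeits the point, which mirrors the gap between our 62.24 FEVER points and 64.76 label accuracy.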
- Haponchyk, I., Uva, A., Yu, S., Uryupina, O., and Moschitti, A. (2018) Supervised Clustering of Questions into Intents for Dialog System Applications. In EMNLP 2018.