Project 2: Representation of (and Reasoning on) Data Quality

In Representation of Data Quality, we investigate how to identify the factors that influence the quality of textual sources, such as their provenance and the perspectives they reflect. The project fits a broader aim to determine the reliability of Web data for scholarly research. In addition to data produced by scholars or other experts (such as journalists at established newspapers or broadcasters), the Web contains a wealth of textual and audio-visual data that document current events and debates as they develop across space and over time.

Because the Web is closely tied to freedom of speech, Web data are diverse and biased, which limits their uptake as sources for Humanities research. Evaluating the reliability of these “noisy” data at the scale of the Web requires contextualizing their provenance, analysing the perspectives they reflect, and comparing them with information extracted from trusted data repositories.

This project aims to define measures of trust and reliability for historical textual data by combining natural language processing (NLP), crowdsourced human annotations and social media analysis. The outcome not only supports the uptake of historical online data in research but also serves as the basis for services that promote informed decision making in the fields of consumption, health, politics and education. The human-machine extraction of entities and events will be tested on a dataset shared across the program, which includes newspapers, social media, biographies, encyclopaedias and literary texts.

The stakeholders in the specific use case for this project will be media and history researchers, as well as a wide audience.

We performed a series of preliminary analyses aimed at outlining a framework for assessing the quality of Web documents. These are based on the following overview:

[Figure: overview of the quality assessment framework]

Given a list of documents, we enrich each document with NLP-based features and use these features to predict several kinds of quality assessment.
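
The enrich-then-predict idea can be illustrated with a minimal Python sketch. It is not the project's implementation: the three surface features, the `Document` class, the `quality` labels and the nearest-neighbour predictor are all assumptions chosen to keep the example self-contained, standing in for whatever NLP features and prediction models the project actually uses.

```python
import math
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """A document with an optional quality label (hypothetical, e.g. from annotators)."""
    text: str
    quality: Optional[str] = None


def extract_features(text: str) -> List[float]:
    """Compute toy surface features as stand-ins for NLP-based features."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return [0.0, 0.0, 0.0]
    avg_sentence_len = len(tokens) / max(len(sentences), 1)
    type_token_ratio = len(set(tokens)) / len(tokens)
    avg_word_len = sum(len(t) for t in tokens) / len(tokens)
    return [avg_sentence_len, type_token_ratio, avg_word_len]


def predict_quality(labelled: List[Document], new_text: str) -> Optional[str]:
    """Return the label of the labelled document closest in feature space (1-NN)."""
    target = extract_features(new_text)
    nearest = min(
        labelled,
        key=lambda d: math.dist(extract_features(d.text), target),
    )
    return nearest.quality
```

Calling `predict_quality(labelled_documents, some_text)` returns the label of the most similar labelled document. In practice the features would be scaled and the predictor replaced by a trained model, with one model per kind of quality assessment.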

The project will provide the grounding for assessing the reliability of multimedia content on the Web.

Project team:

  • Julia Noordegraaf, Digital Heritage, Faculty of Humanities, University of Amsterdam
  • Lora Aroyo, Computer Science, Faculty of Science, VU University Amsterdam
  • Davide Ceolin, Computer Science, Faculty of Science, VU University Amsterdam