Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia’s Verifiability Redi, Miriam; Fetahu, Besnik; Morgan, Jonathan; Taraborelli, Dario (2019).
Neural Based Statement Classification for Biased Language Hube, Christoph; Fetahu, Besnik (2019).
EventKG - the Hub of Event Knowledge on the Web - and Biographical Timeline Generation Gottschalk, Simon; Demidova, Elena (2019).

One of the key requirements to facilitate the semantic analytics of information regarding contemporary and historical events on the Web, in the news and in social media is the availability of reference knowledge repositories containing comprehensive representations of events, entities and temporal relations. Existing knowledge graphs, with popular examples including DBpedia, YAGO and Wikidata, focus mostly on entity-centric information and are insufficient in terms of their coverage and completeness with respect to events and temporal relations. In this article we address this limitation, formalise the concept of a temporal knowledge graph and present its instantiation - EventKG. EventKG is a multilingual event-centric temporal knowledge graph that incorporates over 690 thousand events and over 2.3 million temporal relations obtained from several large-scale knowledge graphs and semi-structured sources and makes them available through a canonical RDF representation. Whereas popular entities often possess hundreds of relations within a temporal knowledge graph such as EventKG, generating a concise overview of the most important temporal relations for a given entity is a challenging task. In this article we demonstrate an application of EventKG to biographical timeline generation, where we adopt a distant supervision method to identify relations most relevant for an entity biography. Our evaluation results provide insights on the characteristics of EventKG and demonstrate the effectiveness of the proposed biographical timeline generation method.
TableNet: A Knowledge Graph of Interlinked Wikipedia Tables Fetahu, Besnik; Anand, Avishek; Koutraki, Maria (2019).
Asynchronous Training of Word Embeddings for Large Text Corpora Anand, Avishek; Khosla, Megha; Singh, Jaspreet; Zab, Jan-Hendrik; Zhang, Zijian in WSDM ’19 (2019). 168–176.
RDF Dataset Profiling - a Survey of Features, Methods, Vocabularies and Applications Ben Ellefi, Mohamed; Bellahsene, Zohra; Breslin, John; Demidova, Elena; Dietze, Stefan; Szymanski, Julian; Todorov, Konstantin (2018). 9(5) 677–705.

The Web of Data, and in particular Linked Data, has seen tremendous growth over the past years. However, reuse and take-up of these rich data sources is often limited and focused on a few well-known and established RDF datasets. This can be partially attributed to the lack of reliable and up-to-date information about the characteristics of available datasets. While RDF datasets vary heavily with respect to the features related to quality, provenance, interlinking, licenses, statistics and dynamics, reliable information about such features is essential to enable dataset discovery and selection in tasks such as entity linking, distributed query, search or question answering. Even though there exists a wealth of works contributing to the task of dataset profiling in general, these works are spread across a wide range of communities. In this survey, we provide a first comprehensive overview of the RDF dataset profiling features, methods, tools and vocabularies. We organize these building blocks of dataset profiling in a taxonomy and illustrate the links between the dataset profiling and feature extraction approaches and several application domains. This survey is aimed towards data practitioners, data providers and scientists, spanning a large range of communities and drawing from different fields such as dataset profiling, assessment, summarization and characterization. Ultimately, this work is intended to help the reader identify the relevant features for building a dataset profile for intended applications, together with the methods and tools capable of extracting these features from the datasets, as well as vocabularies to describe the extracted features and make them available.
EventKG: A Multilingual Event-Centric Temporal Knowledge Graph Gottschalk, Simon; Demidova, Elena in Lecture Notes in Computer Science (2018). 272–287.

One of the key requirements to facilitate semantic analytics of information regarding contemporary and historical events on the Web, in the news and in social media is the availability of reference knowledge repositories containing comprehensive representations of events and temporal relations. Existing knowledge graphs, with popular examples including DBpedia, YAGO and Wikidata, focus mostly on entity-centric information and are insufficient in terms of their coverage and completeness with respect to events and temporal relations. EventKG presented in this paper is a multilingual event-centric temporal knowledge graph that aims to address this gap. EventKG incorporates over 690 thousand contemporary and historical events and over 2.3 million temporal relations extracted from several large-scale knowledge graphs and less structured sources and makes this information available through a canonical representation. In this paper we present EventKG including its data model, extraction process, and characteristics and discuss its relevance for several real-world applications including Question Answering, timeline generation and cross-cultural analytics.
EventKG+TL: Creating Cross-Lingual Timelines from an Event-Centric Knowledge Graph Gottschalk, Simon; Demidova, Elena (2018). 164–169.
Heuristics-based Query Reordering for Federated Queries in SPARQL 1.1 and SPARQL-LD Yannakis, Thanos; Fafalios, Pavlos; Tzitzikas, Yannis (2018).
Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media Archives Fafalios, Pavlos; Iosifidis, Vasileios; Stefanidis, Kostas; Ntoutsi, Eirini (2018).

How did the popularity of the Greek Prime Minister evolve in 2015? How did the predominant sentiment about him vary during that period? Were there any controversial sub-periods? What other entities were related to him during these periods? To answer these questions, one needs to analyze archived documents and data about the query entities, such as old news articles or social media archives. In particular, user generated content posted in social networks, like Twitter and Facebook, can be seen as a comprehensive documentation of our society, and thus, meaningful analysis methods over such archived data are of immense value for sociologists, historians, and other interested parties who want to study the history and evolution of entities and events. To this end, in this paper we propose an entity-centric approach to analyze social media archives and we define measures that allow studying how entities were reflected in social media in different time periods and under different aspects, like popularity, attitude, controversiality, and connectedness with other entities. A case study using a large Twitter archive of 4 years illustrates the insights that can be gained by such an entity-centric and multi-aspect analysis.
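
As a rough illustration of the kind of time-dependent measure the abstract describes, the sketch below computes a per-month popularity score for one entity. It assumes, purely for illustration, that popularity is the fraction of tweets in a month that mention the entity and that entity annotations are already available; the type `Tweet`, the function `monthly_popularity` and the entity labels are hypothetical and not taken from the paper.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Set, Tuple

# Hypothetical input format: (posting day, set of entities annotated in the tweet).
Tweet = Tuple[date, Set[str]]


def monthly_popularity(tweets: Iterable[Tweet], entity: str) -> Dict[str, float]:
    """Return, per month, the share of tweets that mention `entity`.

    Assumption: popularity is modelled as a simple relative frequency.
    """
    totals: Counter = Counter()    # all tweets per month
    mentions: Counter = Counter()  # tweets per month that mention the entity
    for day, entities in tweets:
        month = day.strftime("%Y-%m")
        totals[month] += 1
        if entity in entities:
            mentions[month] += 1
    # Only months that contain at least one tweet appear in the result.
    return {month: mentions[month] / totals[month] for month in sorted(totals)}


sample = [
    (date(2015, 1, 5), {"Alexis_Tsipras"}),
    (date(2015, 1, 9), {"Syriza"}),
    (date(2015, 7, 5), {"Alexis_Tsipras", "Greece"}),
]
popularity = monthly_popularity(sample, "Alexis_Tsipras")
```

Plotting such a series over the months of 2015 would give one possible answer to the first question posed in the abstract; the other aspects (attitude, controversiality, connectedness) would need their own measures.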
Building and Querying Semantic Layers for Web Archives (Extended Version) Fafalios, Pavlos; Holzmann, Helge; Kasturia, Vaibhav; Nejdl, Wolfgang (2018).

Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles (“layers”) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts, and events), and publishing all these data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities, and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy.
Posthoc Interpretability of Learning to Rank Models using Secondary Training Data Singh, Jaspreet; Anand, Avishek (2018).
Learning under Feature Drifts in Textual Streams Melidis, Damianos P.; Spiliopoulou, Myra; Ntoutsi, Eirini (2018).
Towards Better Understanding Researcher Strategies in Cross-Lingual Event Analytics. Gottschalk, Simon; Bernacchi, Viola; Rogers, Richard; Demidova, Elena in Lecture Notes in Computer Science, E. Méndez, F. Crestani, C. Ribeiro, G. David, J. C. Lopes (eds.) (2018). (Vol. 11057) 139–151.
DistrustRank: Spotting False News Domains Woloszyn, Vinicius; Nejdl, Wolfgang in WebSci’18 (2018).

In this paper we propose a semi-supervised learning strategy to automatically separate fake news from reliable news sources: DistrustRank. We first select a small set of unreliable news, manually evaluated and classified by experts on fact-checking portals. Once this set is created, DistrustRank constructs a weighted graph where nodes represent websites, connected by edges based on a minimum similarity between a pair of websites. Next it computes the centrality using a biased PageRank, where a bias is applied to the selected set of seeds. As an output of the proposed model we obtain a trust (or distrust) rank that can be used in two ways: a) as a counter-bias to be applied when news about a specific subject is ranked, in order to discount possible boosts achieved by false claims; and b) to assist humans in identifying sources that are likely to be sources of fake news (or that are likely to be reputable), suggesting websites that should be examined more closely or avoided. In our experiments, DistrustRank outperforms the supervised approaches in both the ranking and the classification task.
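
The biased PageRank step mentioned in the abstract can be sketched roughly as follows. This is a generic personalised PageRank by power iteration, not the authors' implementation: the function name `biased_pagerank`, the damping factor, the iteration count, the toy site names and the assumption that similarity edges are undirected with positive weights are all illustrative choices.

```python
from typing import Dict, List, Set, Tuple


def biased_pagerank(
    edges: List[Tuple[str, str, float]],
    seeds: Set[str],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """PageRank by power iteration, with teleportation restricted to the seeds."""
    # Build a symmetric weighted adjacency map (similarity is assumed undirected).
    neighbours: Dict[str, Dict[str, float]] = {}
    for u, v, w in edges:
        neighbours.setdefault(u, {})[v] = w
        neighbours.setdefault(v, {})[u] = w
    nodes = sorted(neighbours)

    seed_nodes = [n for n in nodes if n in seeds]
    if not seed_nodes:
        raise ValueError("at least one seed must occur in the graph")

    # The bias: random jumps land only on seed websites, uniformly.
    teleport = {n: (1.0 / len(seed_nodes) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            total = sum(neighbours[u].values())
            # Spread the score of u to its neighbours in proportion to similarity.
            for v, w in neighbours[u].items():
                nxt[v] += damping * rank[u] * w / total
        rank = nxt
    return rank


# Hypothetical toy graph: a chain of four sites, one known unreliable seed.
toy_edges = [
    ("fake1", "fake2", 0.9),
    ("fake2", "news", 0.2),
    ("news", "paper", 0.8),
]
scores = biased_pagerank(toy_edges, seeds={"fake1"})
```

In this toy graph the site most similar to the seed ends up with a higher distrust score than the site furthest from it, which is the behaviour the seed bias is meant to produce.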
Detecting Biased Statements in Wikipedia. Hube, Christoph; Fetahu, Besnik (P.-A. Champin; F. L. Gandon; M. Lalmas; P. G. Ipeirotis, eds.) (2018). 1779–1786.
A Trio Neural Model for Dynamic Entity Relatedness Ranking Nguyen, Tu Ngoc; Tran, Tuan; Nejdl, Wolfgang (2018).

Measuring entity relatedness is a fundamental task for many natural language processing and information retrieval applications. Prior work often studies entity relatedness in static settings and in an unsupervised manner. However, entities in the real world are often involved in many different relationships; consequently, entity relations are very dynamic over time. In this work, we propose a neural network-based approach for dynamic entity relatedness, leveraging the collective attention as supervision. Our model is capable of learning rich and different entity representations in a joint framework. Through extensive experiments on large-scale datasets, we demonstrate that our method achieves better results than competitive baselines.
Tempas: Temporal Archive Search Based on Tags. Holzmann, Helge; Anand, Avishek (2017). abs/1702.01076
On Analyzing User Topic-Specific Platform Preferences Across Multiple Social Media Sites Lee, Roy Ka-Wei; Hoang, Tuan-Anh; Lim, Ee-Peng in WWW ’17 (2017). 1351–1359.

Topic modeling has traditionally been studied for single text collections and applied to social media data represented in the form of text documents. With the emergence of many social media platforms, users find themselves using different social media for posting content and for social interaction. While many topics may be shared across social media platforms, users typically show preferences for certain social media platform(s) over others for certain topics. Such platform preferences may even be found at the individual level. To model social media topics as well as platform preferences of users, we propose a new topic model known as MultiPlatform-LDA (MultiLDA). Instead of just merging all posts from different social media platforms into a single text collection, MultiLDA keeps one text collection for each social media platform but allows these platforms to share a common set of topics. MultiLDA further learns the user-specific platform preferences for each topic. We evaluate MultiLDA against TwitterLDA, the state-of-the-art method for social media content modeling, on two aspects: (i) the effectiveness in modeling topics across social media platforms, and (ii) the ability to predict platform choices for each post. We conduct experiments on three real-world datasets from Twitter, Instagram and Tumblr sharing a set of common users. Our experimental results show that MultiLDA outperforms TwitterLDA in both the topic modeling and the platform choice prediction tasks. We also show empirically that among the three social media platforms, "Daily matters" and "Relationship matters" are dominant topics in Twitter, "Social gathering", "Outing" and "Fashion" are dominant topics in Instagram, and "Music", "Entertainment" and "Fashion" are dominant topics in Tumblr.
ArchiveWeb: collaboratively extending and exploring web archive collections - How would you like to work with your collections? Fernando, Zeon Trevor; Marenzi, Ivana; Nejdl, Wolfgang (2017).

Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. The first part of this paper describes the important insights we gained from our initial requirements engineering phase during the first year of the project and the main functionalities of the current ArchiveWeb system for searching, constructing, exploring, and discussing web archive collections. The second part summarizes the feedback we received on this version from archiving organizations and libraries, as well as our corresponding plans for improving and extending the system for the next release.
Building and Querying Semantic Layers for Web Archives. Fafalios, Pavlos; Holzmann, Helge; Kasturia, Vaibhav; Nejdl, Wolfgang (2017). 11–20.

Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles ("layers") that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts and events), and publishing all this data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy.
Multi-aspect Entity-centric Analysis of Big Social Media Archives Fafalios, Pavlos; Iosifidis, Vasileios; Stefanidis, Kostas; Ntoutsi, Eirini (2017). 261–273.
ArchiveWeb: collaboratively extending and exploring web archive collections---How would you like to work with your collections? Fernando, Zeon Trevor; Marenzi, Ivana; Nejdl, Wolfgang (2017). 1–17.

Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. The first part of this paper describes the important insights we gained from our initial requirements engineering phase during the first year of the project and the main functionalities of the current ArchiveWeb system for searching, constructing, exploring, and discussing web archive collections. The second part summarizes the feedback we received on this version from archiving organizations and libraries, as well as our corresponding plans for improving and extending the system for the next release.
Designing Search Tasks for Archive Search Singh, Jaspreet; Anand, Avishek in CHIIR ’17 (2017). 361–364.

Longitudinal corpora like legal, corporate and newspaper archives are of immense value to a variety of users, and time as an important factor strongly influences their search behavior in these archives. While many systems have been developed to support users' temporal information needs, questions remain over how users utilize these advances to satisfy their needs. Analyzing their search behavior will provide us with novel insights into search strategy, guide better interface and system design and highlight new problems for further research. In this paper we propose a set of search tasks, with varying complexity, that IIR researchers can utilize to study user search behavior in archives. We discuss how we created and refined these tasks as the result of a pilot study using a temporal search engine. We not only propose task descriptions but also pre and post-task evaluation mechanisms that can be employed for a large-scale study (crowdsourcing). Our initial findings show the viability of such tasks for investigating search behavior in archives.
Accessing web archives from different perspectives with potential synergies Holzmann, Helge; Risse, Thomas (2017).
Fine Grained Citation Span for References in Wikipedia. Fetahu, Besnik; Markert, Katja; Anand, Avishek (2017). abs/1707.07278
Time-Aware Entity Linking Joao, Renato Stoffalette (2017).

Entity Linking is the task of automatically identifying entity mentions in a piece of text and linking them to their corresponding entries in a reference knowledge base like Wikipedia. Although there is a plethora of works on entity linking, existing state-of-the-art approaches do not explicitly consider the time aspect and specifically the temporality of an entity’s prior probability (popularity) and embedding (semantic network). Consequently, they show limited performance in annotating old documents like news or web archives, while the problem is bigger in cases of short texts with limited context, such as archives of social media posts and query logs. This thesis focuses on this problem and proposes a model that leverages time-aware prior probabilities and word embeddings in the entity linking task.
Fine Grained Citation Span for References in Wikipedia Fetahu, Besnik; Markert, Katja; Anand, Avishek (2017).
Modeling Event Importance for Ranking Daily News Events Setty, Vinay; Anand, Abhijit; Mishra, Arunav; Anand, Avishek in WSDM ’17 (2017). 231–240.

We deal with the problem of ranking news events on a daily basis for large news corpora, an essential building block for news aggregation. News ranking has been addressed in the literature before but with individual news articles as the unit of ranking. However, estimating event importance accurately requires models to quantify current day event importance as well as its significance in the historical context. Consequently, in this paper we show that a cluster of news articles representing an event is a better unit of ranking as it provides an improved estimation of popularity, source diversity and authority cues. In addition, events facilitate quantifying their historical significance by linking them with long-running topics and recent chains of events. Our main contribution in this paper is to provide effective models for improved news event ranking. To this end, we propose novel event mining and feature generation approaches for improving estimates of event importance. Finally, we conduct an extensive evaluation of our approaches on two large real-world news corpora, each of which spans more than a year with a large volume of up to tens of thousands of daily news articles. Our evaluations are large-scale and based on a clean, human-curated ground truth from the Wikipedia Current Events Portal. Experimental comparison with a state-of-the-art news ranking technique based on language models demonstrates the effectiveness of our approach.
What’s new? Analysing language-specific Wikipedia entity contexts to support entity-centric news retrieval. Zhou, Yiwei; Demidova, Elena; Cristea, Alexandra I. (N. Nguyen; R. Kowalczyk; A. Pinto; J. Cardoso, eds.) (2017). 10190 210–231.

Representation of influential entities, such as celebrities and multinational corporations, on the web can vary across languages, reflecting language-specific entity aspects, as well as divergent views on these entities in different communities. An important source of multilingual background knowledge about influential entities is Wikipedia - an online community-created encyclopaedia - containing more than 280 language editions. Such language-specific information could be applied in entity-centric information retrieval applications, in which users utilise very simple queries, mostly just the entity names, to retrieve the relevant documents. In this article we focus on the problem of creating language-specific entity contexts to support entity-centric, language-specific information retrieval applications. First, we discuss alternative ways such contexts can be built, including Graph-based and Article-based approaches. Second, we analyse the similarities and the differences in these contexts in a case study including 220 entities and five Wikipedia language editions. Third, we propose a context-based entity-centric information retrieval model that maps documents to aspect space, and apply language-specific entity contexts to perform query expansion. Last, we perform a case study to demonstrate the impact of this model in a news retrieval application. Our study illustrates that the proposed model can effectively improve the recall of entity-centric information retrieval while keeping high precision, and provide language-specific results.
Software as a first-class citizen in web archives Holzmann, Helge (2017, May).
Software citation, landing pages, and the swMATH service Sperber, Wolfram; Dalitz, Wolfgang; Holzmann, Helge (2017, October).
Multi-aspect Entity-Centric Analysis of Big Social Media Archives. Fafalios, Pavlos; Iosifidis, Vasileios; Stefanidis, Kostas; Ntoutsi, Eirini in Lecture Notes in Computer Science, J. Kamps, G. Tsakonas, Y. Manolopoulos, L. Iliadis, I. Karydis (eds.) (2017). 261–273.

Social media archives serve as important historical information sources, and thus meaningful analysis and exploration methods are of immense value for historians, sociologists and other interested parties. In this paper, we propose an entity-centric approach to analyze social media archives and we define measures that allow studying how entities are reflected in social media in different time periods and under different aspects (like popularity, attitude, controversiality, and connectedness with other entities). A case study using a large Twitter archive of 4 years illustrates the insights that can be gained by such an entity-centric multi-aspect analysis.
Ongoing Events in Wikipedia: A Cross-lingual Case Study Gottschalk, Simon; Demidova, Elena; Bernacchi, Viola; Rogers, Richard (2017). 387–388.

In order to effectively analyze information regarding ongoing events that impact local communities across language and country borders, researchers often need to perform multilingual data analysis. This analysis can be particularly challenging due to the rapidly evolving event-centric data and the language barrier. In this abstract we present preliminary results of a case study with the goal of better understanding how researchers interact with multilingual event-centric information in the context of cross-cultural studies and which methods and features they use.
Universal Distant Reading through Metadata Proxies with ArchiveSpark Holzmann, Helge; Goel, Vinay; Gustainis, Emily Novak (2017).
Towards a Ranking Model for Semantic Layers over Digital Archives Fafalios, Pavlos; Kasturia, Vaibhav; Nejdl, Wolfgang (2017). 336–337.

Archived collections of documents (like newspaper archives) serve as important information sources for historians, journalists, sociologists and other interested parties. Semantic Layers over such digital archives allow describing and publishing metadata and semantic information about the archived documents in a standard format (RDF), which in turn can be queried through a structured query language (e.g., SPARQL). This enables running advanced queries that combine metadata of the documents (like publication date) and content-based semantic information (like entities mentioned in the documents). However, the results returned by structured queries can be numerous, and they all match the query equally. Thus, there is a need to rank these results in order to promote the most important ones. In this paper, we focus on this problem and propose a ranking model that considers and combines: i) the relativeness of documents to entities, ii) the timeliness of documents, and iii) the relations among the entities.
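
To make the idea of combining the three signals concrete, here is a minimal sketch. The abstract does not say how the signals are computed or combined, so everything below is an assumption made for illustration: the class `ScoredDocument`, the function `rank_documents`, the requirement that each score is already normalised to [0, 1], and the weighted sum itself.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ScoredDocument:
    doc_id: str
    relativeness: float  # assumed: how strongly the document is about the query entities
    timeliness: float    # assumed: how important the publication date is
    relatedness: float   # assumed: how strongly co-mentioned entities relate to the query entities


def rank_documents(
    docs: Sequence[ScoredDocument],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> List[ScoredDocument]:
    """Order documents by a weighted sum of the three (pre-normalised) scores."""
    w_rel, w_time, w_ent = weights

    def combined(doc: ScoredDocument) -> float:
        return w_rel * doc.relativeness + w_time * doc.timeliness + w_ent * doc.relatedness

    # Highest combined score first; sorted() is stable, so ties keep input order.
    return sorted(docs, key=combined, reverse=True)


results = [
    ScoredDocument("doc-a", relativeness=0.2, timeliness=0.9, relatedness=0.1),
    ScoredDocument("doc-b", relativeness=0.8, timeliness=0.7, relatedness=0.6),
    ScoredDocument("doc-c", relativeness=0.5, timeliness=0.1, relatedness=0.3),
]
ranked = rank_documents(results)
```

Changing the weights shifts the emphasis, for example `rank_documents(results, weights=(0.0, 1.0, 0.0))` orders the same results by timeliness alone.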
ArchiveWeb: Collaboratively Extending and Exploring Web Archive Collections. How would you like to work with your collections? Fernando, Zeon Trevor; Marenzi, Ivana; Nejdl, Wolfgang (N. Adam; R. Furuta; E. Neuhold, eds.) (2017).

Curated web archive collections contain focused digital content which is collected by archiving organizations, groups and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. The first part of this paper describes the important insights we gained from our initial requirements engineering phase during the first year of the project and the main functionalities of the current ArchiveWeb system for searching, constructing, exploring and discussing web archive collections. The second part summarizes the feedback we received on this version from archiving organizations and libraries, as well as our corresponding plans for improving and extending the system for the next release.
Semi-supervised Identification of Rarely Appearing Persons in Video by Correcting Weak Labels Müller, Eric; Otto, Christian; Ewerth, Ralph in ICMR ’16 (2016). 381–384.

Some recent approaches for character identification in movies and TV broadcasts are realized in a semi-supervised manner by assigning transcripts and/or subtitles to the speakers. However, the labels obtained in this way achieve only an accuracy of 80%–90% and the number of training examples for the different actors is unevenly distributed. In this paper, we propose a novel approach for person identification in video by correcting and extending the training data with reliable predictions to reduce the number of annotation errors. Furthermore, the intra-class diversity of rarely speaking characters is enhanced. To address the imbalance of training data per person, we suggest two complementary prediction scores. These scores are also used to recognize whether or not a face track belongs to a (supporting) character whose identity does not appear in the transcript etc. Experimental results demonstrate the feasibility of the proposed approach, outperforming the current state of the art.
Analyzing Web Archives Through Topic and Event Focused Sub-collections Gossen, Gerhard; Demidova, Elena; Risse, Thomas in WebSci ’16 (2016). 291–295.

Web archives capture the history of the Web and are therefore an important source to study how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to researchers interested in working with these collections. In this work, we describe the challenges of working with Web archives and propose the research methodology of extracting and studying sub-collections of the archive focused on specific topics and events. We discuss the opportunities and challenges of this approach and suggest a framework for creating sub-collections.
History by Diversity: Helping Historians Search News Archives Singh, Jaspreet; Nejdl, Wolfgang; Anand, Avishek in CHIIR ’16 (2016). 183–192.

Longitudinal corpora like newspaper archives are of immense value to historical research, and time as an important factor for historians strongly influences their search behaviour in these archives. While searching for articles published over time, a key preference is to retrieve documents which cover the important aspects from important points in time, which is different from standard search behavior. To support this search strategy, we introduce the notion of a Historical Query Intent to explicitly model a historian's search task and define an aspect-time diversification problem over news archives. We present a novel algorithm, HistDiv, that explicitly models the aspects and important time windows based on a historian's information seeking behavior. By incorporating temporal priors based on publication times and temporal expressions, we diversify both on the aspect and temporal dimensions. We test our methods by constructing a test collection based on The New York Times Collection with a workload of 30 queries of historical intent assessed manually. We find that HistDiv outperforms all competitors in subtopic recall with a slight loss in precision. We also present results of a qualitative user study to determine whether this drop in precision is detrimental to user experience. Our results show that users still preferred HistDiv's ranking.
How to Search the Internet Archive Without Indexing It Kanhabua, Nattiya; Kemkes, Philipp; Nejdl, Wolfgang; Nguyen, Tu Ngoc; Reis, Felipe; Tran, Nam Khanh in Lecture Notes in Computer Science, N. Fuhr, L. Kovács, T. Risse, W. Nejdl (eds.) (2016). (Vol. 9819) 147–160.
Cobwebs from the Past and Present: Extracting Large Social Networks Using Internet Archive Data Shaltev, Miroslav; Zab, Jan-Hendrik; Kemkes, Philipp; Siersdorfer, Stefan; Zerr, Sergej in SIGIR ’16 (2016). 1093–1096.

Social graph construction from various sources has been of interest to researchers due to its application potential and the broad range of technical challenges involved. The World Wide Web provides a huge amount of continuously updated data and information on a wide range of topics created by a variety of content providers, and makes the study of extracted people networks and their temporal evolution valuable for social as well as computer scientists. In this paper we present SocGraph - an extraction and exploration system for social relations from the content of around 2 billion web pages collected by the Internet Archive over the 17-year period between 1996 and 2013. We describe methods for constructing large social graphs from extracted relations and introduce an interface to study their temporal evolution.
Analysing Temporal Evolution of Interlingual Wikipedia Article Pairs. Gottschalk, Simon; Demidova, Elena R. Perego, F. Sebastiani, J. A. Aslam, I. Ruthven, J. Zobel (eds.) (2016). 1089–1092.
Archiving Software Surrogates on the Web for Future Reference Holzmann, Helge; Sperber, Wolfram; Runnwerth, Mila N. Fuhr, L. Kovács, T. Risse, W. Nejdl (eds.) (2016). 215–226.
Linking Mathematical Software in Web Archives Holzmann, Helge; Runnwerth, Mila; Sperber, Wolfram G.-M. Greuel, T. Koch, P. Paule, A. Sommese (eds.) (2016). 419–422.
Temporal Information Retrieval Kanhabua, Nattiya; Anand, Avishek in SIGIR ’16 (2016). 1235–1238.
ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation Holzmann, Helge; Goel, Vinay; Anand, Avishek in JCDL ’16 (2016). 83–92.

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.
Finding News Citations for Wikipedia Fetahu, Besnik; Markert, Katja; Nejdl, Wolfgang; Anand, Avishek in CIKM ’16 (2016). 337–346.

An important editing policy in Wikipedia is to provide citations for added statements in Wikipedia pages, where statements can be arbitrary pieces of text, ranging from a sentence to a paragraph. In many cases citations are either outdated or missing altogether. In this work we address the problem of finding and updating news citations for statements in entity pages. We propose a two-stage supervised approach for this problem. In the first step, we construct a classifier to find out whether statements need a news citation or other kinds of citations (web, book, journal, etc.). In the second step, we develop a news citation algorithm for Wikipedia statements, which recommends appropriate citations from a given news collection. Apart from IR techniques that use the statement to query the news collection, we also formalize three properties of an appropriate citation, namely: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should be from an authoritative source. We perform an extensive evaluation of both steps, using 20 million articles from a real-world news collection. Our results are quite promising, and show that we can perform this task with high precision and at scale.
Who likes me more? Analysing entity-centric language-specific bias in multilingual Wikipedia Zhou, Yiwei; Demidova, Elena; Cristea, Alexandra I. (2016).
Exploring the past of the web: alexandria & archive-it hackathon. Anand, Avishek; Bailey, Jefferson W. Nejdl, W. Hall, P. Parigi, S. Staab (eds.) (2016). 14.
Analysing Temporal Evolution of Interlingual Wikipedia Article Pairs Gottschalk, Simon; Demidova, Elena (2016).

Wikipedia articles representing an entity or a topic in different language editions evolve independently within the scope of the language-specific user communities. This can lead to different points of view reflected in the articles, as well as complementary and inconsistent information. An analysis of how the information is propagated across the Wikipedia language editions can provide important insights into the article evolution along the temporal and cultural dimensions and support quality control. To facilitate such analysis, we present MultiWiki - a novel web-based user interface that provides an overview of the similarities and differences across the article pairs originating from different language editions on a timeline. MultiWiki enables users to observe the changes in the interlingual article similarity over time and to perform a detailed visual comparison of the article snapshots at a particular time point.
SaR-Web - A Tool to Support Search as Learning Processes Fulantelli, Giovanni; Marenzi, Ivana; Ahmad, Qazi Asim Ijaz; Taibi, Davide in CEUR Workshop Proceedings, J. Gwizdka, P. Hansen, C. Hauff, J. He, N. Kando (eds.) (2016). (Vol. 1647)
On the Applicability of Delicious for Temporal Search on Web Archives Holzmann, Helge; Nejdl, Wolfgang; Anand, Avishek in SIGIR ’16 (2016). 929–932.

Web archives are large longitudinal collections that store webpages from the past, which might be missing on the current live Web. Consequently, temporal search over such collections is essential for finding prominent missing webpages and tasks like historical analysis. However, this has been challenging due to the lack of popularity information and proper ground truth to evaluate temporal retrieval models. In this paper we investigate the applicability of external longitudinal resources to identify important and popular websites in the past and analyze the social bookmarking service Delicious for this purpose. The timestamped bookmarks on Delicious provide explicit cues about popular time periods in the past along with relevant descriptors. These are valuable to identify important documents in the past for a given temporal query. Focusing purely on recall, we analyzed more than 12,000 queries and find that using Delicious yields average recall values from 46% up to 100%, when limiting ourselves to the best represented queries in the considered dataset. This constitutes an attractive and low-overhead approach for quick access into Web archives by not dealing with the actual contents.
Search As Research Practices on the Web: The SaR-Web Platform for Cross-language Engine Results Analysis Taibi, Davide; Rogers, Richard; Marenzi, Ivana; Nejdl, Wolfgang; Ahmad, Qazi Asim Ijaz; Fulantelli, Giovanni in WebSci ’16 (2016). 367–369.

Search engines are the most utilized tools to access information on the Web. The success of large companies such as Google is due to their capacity to guide users through the vast troves of knowledge and information online. Recently, the concept of search as research has been used to shift the research focus from the workings of information-seeking tools towards methods for the social study of the Web and particularly the social meanings of engine results. In this paper, we present SaR-Web, a web search tool that provides an automatic means to carry out search as research on the Web. It compares the results of the same (translated) queries across search engine language domains, thereby enabling cross-linguistic and cross-cultural comparisons of results. SaR-Web outputs enable the comparative study of cultural mores as well as societal associations and concerns, interpreted through search engine results.
Who Likes Me More?: Analysing Entity-centric Language-specific Bias in Multilingual Wikipedia Zhou, Yiwei; Demidova, Elena; Cristea, Alexandra I. in SAC ’16 (2016). 750–757.

In this paper we take an important step towards better understanding the existence and extent of entity-centric language-specific bias in multilingual Wikipedia, and any deviation from its targeted neutral point of view. We propose a methodology using sentiment analysis techniques to systematically extract the variations in sentiments associated with real-world entities in different language editions of Wikipedia, illustrated with a case study of five Wikipedia language editions and a set of target entities from four categories.
Improving Entity Retrieval on Structured Data Fetahu, Besnik; Gadiraju, Ujwal; Dietze, Stefan (2015).
Herausforderungen für die nationale, regionale und thematische Webarchivierung und deren Nutzung Nejdl, Wolfgang; Risse, Thomas (2015). 62(3-4) 160–171.
iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling Gossen, Gerhard; Demidova, Elena; Risse, Thomas in JCDL ’15 (2015). 75–84.

Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Especially Social Media provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issues of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler.
Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives Souza, Tarcísio; Demidova, Elena; Risse, Thomas; Holzmann, Helge; Gossen, Gerhard; Szymanski, Julian (2015). 153–166.
Semantic Annotation for Microblog Topics Using Wikipedia Temporal Information Tran, Tuan; Tran, Nam-Khanh; Teka Hadgu, Asmelash; Jäschke, Robert (2015).

Trending topics in microblogs such as Twitter are valuable resources to understand social aspects of real-world events. To enable deep analyses of such trends, semantic annotation is an effective approach; yet the problem of annotating microblog trending topics is largely unexplored by the research community. In this work, we tackle the problem of mapping trending Twitter topics to entities from Wikipedia. We propose a novel model that complements traditional text-based approaches by rewarding entities that exhibit a high temporal correlation with topics during their burst time period. By exploiting temporal information from the Wikipedia edit history and page view logs, we have improved the annotation performance by 17-28%, as compared to the competitive baselines.
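The notion of rewarding entities whose temporal signal correlates with a topic's burst can be sketched as follows. The abstract does not specify the actual model, so this is only an illustration: it assumes Pearson correlation between a topic's activity series and an entity's Wikipedia page-view series, a hypothetical text-based score per entity, and a hypothetical mixing weight `alpha`.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long series; 0.0 if either series is constant."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("series must be non-empty and of equal length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_entities(
    topic_series: Sequence[float],
    text_scores: Dict[str, float],
    entity_views: Dict[str, Sequence[float]],
    alpha: float = 0.5,
) -> List[str]:
    """Rank candidate entities by a mix of text score and temporal correlation."""
    scores = {
        entity: (1 - alpha) * text_scores[entity] + alpha * pearson(topic_series, views)
        for entity, views in entity_views.items()
    }
    # Sort by descending score, then by name for a deterministic order.
    return sorted(scores, key=lambda e: (-scores[e], e))
```

In this sketch, an entity with a modest text score can still be ranked first if its page views spike at the same time as the topic.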
Groupsourcing: Team Competition Designs for Crowdsourcing Rokicki, Markus; Zerr, Sergej; Siersdorfer, Stefan in WWW ’15 (2015). 906–915.

Many data processing tasks such as semantic annotation of images, translation of texts in foreign languages, and labeling of training data for machine learning models require human input, and, on a large scale, can only be accurately solved using crowd based online work. Recent work shows that frameworks where crowd workers compete against each other can drastically reduce crowdsourcing costs, and outperform conventional reward schemes where the payment of online workers is proportional to the number of accomplished tasks ("pay-per-task"). In this paper, we investigate how team mechanisms can be leveraged to further improve the cost efficiency of crowdsourcing competitions. To this end, we introduce strategies for team based crowdsourcing, ranging from team formation processes where workers are randomly assigned to competing teams, over strategies involving self-organization where workers actively participate in team building, to combinations of team and individual competitions. Our large-scale experimental evaluation with more than 1,100 participants and overall 5,400 hours of work spent by crowd workers demonstrates that our team based crowdsourcing mechanisms are well accepted by online workers and lead to substantial performance boosts.
Learning to Detect Event-Related Queries for Web Search Kanhabua, Nattiya; Ngoc Nguyen, Tu; Nejdl, Wolfgang in WWW ’15 Companion (2015). 1339–1344.

In many cases, a user turns to search engines to find information about real-world situations, namely, political elections, sport competitions, or natural disasters. Such temporal querying behavior can be observed through a significant number of event-related queries generated in web search. In this paper, we study the task of detecting event-related queries, which is the first step for understanding temporal query intent and enabling different temporal search applications, e.g., time-aware query auto-completion, temporal ranking, and result diversification. We propose a two-step approach to detecting events from query logs. We first identify a set of event candidates by considering both implicit and explicit temporal information needs. The next step further classifies the candidates into two main categories, namely, event or non-event. In more detail, we leverage different machine learning techniques for query classification, which are trained using the feature set composed of time series features from signal processing, along with features derived from click-through information, and standard statistical features. In order to evaluate our proposed approach, we conduct an experiment using two real-world query logs with manually annotated relevance assessments for 837 events. To this end, we provide a large set of event-related queries made available for fostering research on this challenging task.
Who With Whom And How?: Extracting Large Social Networks Using Search Engines Siersdorfer, Stefan; Kemkes, Philipp; Ackermann, Hanno; Zerr, Sergej in CIKM ’15 (2015). 1491–1500.
Time-travel Translator: Automatically Contextualizing News Articles Tran, Nam Khanh; Ceroni, Andrea; Kanhabua, Nattiya; Niederée, Claudia in WWW’2015 (2015).
The iCrawl Wizard – Supporting Interactive Focused Crawl Specification Gossen, Gerhard; Demidova, Elena; Risse, Thomas (2015).
Improving Entity Retrieval on Structured Data Fetahu, Besnik; Gadiraju, Ujwal; Dietze, Stefan in Lecture Notes in Computer Science (2015). (Vol. 9366) 474–491.
Balancing Novelty and Salience: Adaptive Learning to Rank Entities for Timeline Summarization of High-impact Events Tran, Tuan A.; Niederee, Claudia; Kanhabua, Nattiya; Gadiraju, Ujwal; Anand, Avishek in CIKM ’15 (2015). 1201–1210.

Long-running, high-impact events such as the Boston Marathon bombing often develop through many stages and involve a large number of entities in their unfolding. Timeline summarization of an event by key sentences eases story digestion, but does not distinguish between what a user remembers and what she might want to re-check. In this work, we present a novel approach for timeline summarization of high-impact events, which uses entities instead of sentences for summarizing the event at each individual point in time. Such entity summaries can serve as both (1) important memory cues in a retrospective event consideration and (2) pointers for personalized event exploration. In order to automatically create such summaries, it is crucial to identify the "right" entities for inclusion. We propose to learn a ranking function for entities, with a dynamically adapted trade-off between the in-document salience of entities and the informativeness of entities across documents, i.e., the level of new information associated with an entity for a time point under consideration. Furthermore, for capturing collective attention for an entity we use an innovative soft labeling approach based on Wikipedia. Our experiments on a large real-world news dataset confirm the effectiveness of the proposed methods.
Analysing the Duration of Trending Topics in Twitter using Wikipedia Tran, Tuan; Georgescu, Mihai; Zhu, Xiaofei; Kanhabua, Nattiya (2014).
A Burstiness-aware Approach for Document Dating Kotsakos, Dimitrios; Lappas, Theodoros; Kotzias, Dimitrios; Gunopulos, Dimitrios; Kanhabua, Nattiya; Nørvåg, Kjetil (2014).
Bridging Temporal Context Gaps using Time-Aware Re-Contextualization Ceroni, Andrea; Khanh Tran, Nam; Kanhabua, Nattiya; Niederée, Claudia (2014).
Leveraging Dynamic Query Subtopics for Time-Aware Search Result Diversification Nguyen, Tu Ngoc; Kanhabua, Nattiya in Lecture Notes in Computer Science, M. de Rijke, T. Kenter, A. P. de Vries, C. Zhai, F. de Jong, K. Radinsky, K. Hofmann (eds.) (2014). (Vol. 8416) 222–234.
Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study Yuan, Wancheng; Demidova, Elena; Dietze, Stefan; Zhou, Xuan (2014). (Vol. Vol-1272) 197–200.
iCrawl: An integrated focused crawling toolbox for Web Science Gossen, Gerhard N. Brügger (ed.) (2014).

Within the scientific community an increasing interest in using Web content for research can be observed. Especially the Social Web is attractive for many humanities disciplines as it provides direct access to thoughts of many people about politics, popular topics and events. Documenting the activities on the Web and Social Web in Web archives facilitates better understanding of the public perception. However, state-of-the-art Web archive crawlers like Heritrix have significant limitations in terms of usability, functionality and maintenance with regard to the needs of the scientific community. The iCrawl project aims to provide an integrated crawling toolbox with an intuitive, flexible and extensible set of Web crawling components.
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia Kanhabua, Nattiya; Ngoc Nguyen, Tu; Niederée, Claudia (2014).
On the Value of Temporal Anchor Texts in Wikipedia Kanhabua, Nattiya; Nejdl, Wolfgang (2014).
Insights into Entity Name Evolution on Wikipedia Holzmann, Helge; Risse, Thomas B. Benatallah, A. Bestavros, Y. Manolopoulos, A. Vakali, Y. Zhang (eds.) (2014). (Vol. 8787) 47–61.

Working with Web archives raises a number of issues caused by their temporal characteristics. Depending on the age of the content, additional knowledge might be needed to find and understand older texts. Especially facts about entities are subject to change. Most severe in terms of information retrieval are name changes. In order to find entities that have changed their name over time, search engines need to be aware of this evolution. We tackle this problem by analyzing Wikipedia in terms of entity evolutions mentioned in articles regardless of the structural elements. We gathered statistics and automatically extracted minimum excerpts covering name changes by incorporating lists dedicated to that subject. In future work, these excerpts are going to be used to discover patterns and detect changes in other sources. In this work we investigate whether or not Wikipedia is a suitable source for extracting the required knowledge.
Competitive Game Designs for Improving the Cost Effectiveness of Crowdsourcing Rokicki, Markus; Chelaru, Sergiu; Zerr, Sergej; Siersdorfer, Stefan in CIKM ’14 (2014). 1469–1478.
Extraction of Evolution Descriptions from the Web Holzmann, Helge; Risse, Thomas in JCDL ’14 (2014). 413–414.

The evolution of named entities affects exploration and retrieval tasks in digital libraries. An information retrieval system that is aware of name changes can actively support users in finding former occurrences of evolved entities. However, current structured knowledge bases, such as DBpedia or Freebase, do not provide enough information about evolutions, even though the data is available on their resources, like Wikipedia. Our Evolution Base prototype will demonstrate how excerpts describing name evolutions can be identified on these websites with a promising precision. The descriptions are classified by means of models that we trained based on a recent analysis of named entity evolutions on Wikipedia.
Extraction of evolution descriptions from the web Holzmann, Helge; Risse, Thomas (2014).
Named Entity Evolution Analysis on Wikipedia Holzmann, Helge; Risse, Thomas in WebSci ’14 (2014). 241–242.

Accessing Web archives raises a number of issues caused by their temporal characteristics. Additional knowledge is needed to find and understand older texts. Especially entities mentioned in texts are subject to change. Most severe in terms of information retrieval are name changes. In order to find entities that have changed their name over time, search engines need to be aware of this evolution. We tackle this problem by analyzing Wikipedia in terms of entity evolutions mentioned in articles. We present statistical data on excerpts covering name changes, which will be used to discover similar text passages and extract evolution knowledge in future work.
Analysing and Enriching Focused Semantic Web Archives for Parliament Applications Demidova, Elena; Barbieri, Nicola; Dietze, Stefan; Funk, Adam; Holzmann, Helge; Maynard, Diana; Papailiou, Nikolaos; Peters, Wim; Risse, Thomas; Spiliotopoulos, Dimitris (2014). 6(3) 433.

The web and the social web play an increasingly important role as an information source for Members of Parliament and their assistants, journalists, political analysts and researchers. It provides important and crucial background information, like reactions to political events and comments made by the general public. The case study presented in this paper is driven by two European parliaments (the Greek and the Austrian parliament) and targets an effective exploration of political web archives. In this paper, we describe semantic technologies deployed to ease the exploration of the archived web and social web content and present evaluation results.
Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History Tran, Tuan; Nguyen, Tu Ngoc (2014).
Bridging Temporal Context Gaps Using Time-aware Re-contextualization Ceroni, Andrea; Tran, Nam Khanh; Kanhabua, Nattiya; Niederée, Claudia in SIGIR ’14 (2014). 1127–1130.

Understanding a text that was written some time ago can be compared to translating a text from another language. Complete interpretation requires a mapping, in this case, a kind of time-travel translation between present context knowledge and context knowledge at time of text creation. In this paper, we study time-aware re-contextualization, the challenging problem of retrieving concise and complementary information in order to bridge this temporal context gap. We propose an approach based on learning to rank techniques using sentence-level context information extracted from Wikipedia. The employed ranking combines relevance, complementarity and time-awareness. The effectiveness of the approach is evaluated by contextualizing articles from a news archive collection using more than 7,000 manually judged relevance pairs. To this end, we show that our approach is able to retrieve a significant amount of relevant context information for a given news article.
What Do You Want to Collect from the Web? Risse, Thomas; Demidova, Elena; Gossen, Gerhard (2014).

Today an increasing interest in collecting, analyzing and preserving Web content can be observed in many digital humanities. Especially the Social Web is attractive for many humanities disciplines as it provides direct access to statements of many people about politics, popular topics or events. In this paper we present an exemplary study that we have conducted with the aim of understanding the needs of scientists in social sciences, historical sciences and law with respect to the creation of Web archives.
The History of Web Archiving Toyoda, M.; Kitsuregawa, M. (2012). 100(Special Centennial Issue) 1441–1443.

This paper describes the history and the current challenges of archiving massive and extremely diverse amounts of user-generated data in an international environment on the World Wide Web and the technologies required for interoperability between service providers and for preserving their contents in the future.
Discovering URLs through user feedback Bai, Xiao; Cambazoglu, B. Barla; Junqueira, Flavio P. (2011). 77–86.

Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers, but not discovered by the Yahoo! Web crawler. We show that a high fraction of URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler later than the time they are first accessed by users. One important conclusion of our work is that web search engines can benefit greatly from user feedback in the form of toolbar logs for passive URL discovery.
Dremel: interactive analysis of web-scale datasets Melnik, Sergey; Gubarev, Andrey; Long, Jing Jing; Romer, Geoffrey; Shivakumar, Shiva; Tolton, Matt; Vassilakis, Theo (2010). 3(1-2) 330–339.

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
Socio-Sense: A System for Analysing the Societal Behavior from Long Term Web Archive Kitsuregawa, Masaru; Tamura, Takayuki; Toyoda, Masashi; Kaji, Nobuhiro Y. Zhang, G. Yu, E. Bertino, G. Xu (eds.) (2008). (Vol. 4976) 1–8.

We introduce the Socio-Sense Web analysis system. The system applies structural and temporal analysis methods to a long-term Web archive to obtain insight into the real society. We present an overview of the system and core methods followed by excerpts from case studies on consumer behavior analyses.
A user reputation model for a user-interactive question answering system Chen, Wei; Zeng, Qingtian; Wenyin, Liu; Hao, Tianyong (2007). 19(15) 2091–2103.

In this paper, we propose a user reputation model and apply it to a user-interactive question answering system. It combines the social network analysis approach and the user rating approach. Social network analysis is applied to analyze the impact of participant users' relations to their reputations. User rating is used to acquire direct judgment of a user's reputation based on other users' experiences with this user. Preliminary experiments show that the computed reputations based on our proposed reputation model can reflect the actual reputations of the simulated roles and therefore can fit in well with our user-interactive question answering system.
User-centric Web crawling Pandey, Sandeep; Olston, Christopher (2005). 401–411.

Search engines are the primary gateways of information access on the Web today. Behind the scenes, search engines crawl the Web to populate a local indexed repository of Web pages, used to answer user search queries. In an aggregate sense, the Web is very dynamic, causing any repository of Web pages to become out of date over time, which in turn causes query answer quality to degrade. Given the considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not feasible for a search engine to maintain its repository exactly synchronized with the Web. In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
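A usage-driven refresh policy of the kind described above can be sketched roughly as follows. The paper's actual metric and priority computation are not given in the abstract; this sketch simply assumes that each stale page has an estimated per-query quality loss and weights it by query frequency. The function names, the `stale_loss` structure and the `budget` parameter are hypothetical.

```python
import heapq
from typing import Dict, List


def refresh_priorities(
    query_freq: Dict[str, float],
    stale_loss: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Priority of a page: sum over queries of query frequency times estimated quality loss."""
    return {
        page: sum(query_freq.get(query, 0.0) * loss for query, loss in losses.items())
        for page, losses in stale_loss.items()
    }


def schedule_refresh(
    query_freq: Dict[str, float],
    stale_loss: Dict[str, Dict[str, float]],
    budget: int,
) -> List[str]:
    """Return the `budget` pages whose refresh is expected to help users most."""
    priorities = refresh_priorities(query_freq, stale_loss)
    # Pages are visited in sorted order so that ties resolve deterministically.
    return heapq.nlargest(budget, sorted(priorities), key=priorities.__getitem__)
```

Under these assumptions, a page that is only slightly stale but affects a very frequent query can outrank a badly outdated page that hardly anyone searches for.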