Neural Based Statement Classification for Biased Language Hube, Christoph; Fetahu, Besnik (2019).
TableNet: A Knowledge Graph of Interlinked Wikipedia Tables Fetahu, Besnik; Anand, Avishek; Koutraki, Maria (2019).
EventKG - the Hub of Event Knowledge on the Web - and Biographical Timeline Generation Gottschalk, Simon; Demidova, Elena (2019).

One of the key requirements to facilitate the semantic analytics of information regarding contemporary and historical events on the Web, in the news and in social media is the availability of reference knowledge repositories containing comprehensive representations of events, entities and temporal relations. Existing knowledge graphs, with popular examples including DBpedia, YAGO and Wikidata, focus mostly on entity-centric information and are insufficient in terms of their coverage and completeness with respect to events and temporal relations. In this article we address this limitation, formalise the concept of a temporal knowledge graph and present its instantiation - EventKG. EventKG is a multilingual event-centric temporal knowledge graph that incorporates over 690 thousand events and over 2.3 million temporal relations obtained from several large-scale knowledge graphs and semi-structured sources and makes them available through a canonical RDF representation. Whereas popular entities often possess hundreds of relations within a temporal knowledge graph such as EventKG, generating a concise overview of the most important temporal relations for a given entity is a challenging task. In this article we demonstrate an application of EventKG to biographical timeline generation, where we adopt a distant supervision method to identify relations most relevant for an entity biography. Our evaluation results provide insights on the characteristics of EventKG and demonstrate the effectiveness of the proposed biographical timeline generation method.
Asynchronous Training of Word Embeddings for Large Text Corpora Anand, Avishek; Khosla, Megha; Singh, Jaspreet; Zab, Jan-Hendrik; Zhang, Zijian in WSDM ’19 (2019). 168–176.
Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia’s Verifiability Redi, Miriam; Fetahu, Besnik; Morgan, Jonathan; Taraborelli, Dario (2019).
RDF Dataset Profiling - a Survey of Features, Methods, Vocabularies and Applications Ben Ellefi, Mohamed; Bellahsene, Zohra; John, Breslin; Demidova, Elena; Dietze, Stefan; Szymanski, Julian; Todorov, Konstantin (2018). 9(5) 677–705.

The Web of Data, and in particular Linked Data, has seen tremendous growth over the past years. However, reuse and take-up of these rich data sources is often limited and focused on a few well-known and established RDF datasets. This can be partially attributed to the lack of reliable and up-to-date information about the characteristics of available datasets. While RDF datasets vary heavily with respect to the features related to quality, provenance, interlinking, licenses, statistics and dynamics, reliable information about such features is essential to enable dataset discovery and selection in tasks such as entity linking, distributed query, search or question answering. Even though there exists a wealth of works contributing to the task of dataset profiling in general, these works are spread across a wide range of communities. In this survey, we provide a first comprehensive overview of the RDF dataset profiling features, methods, tools and vocabularies. We organize these building blocks of dataset profiling in a taxonomy and illustrate the links between the dataset profiling and feature extraction approaches and several application domains. This survey is aimed towards data practitioners, data providers and scientists, spanning a large range of communities and drawing from different fields such as dataset profiling, assessment, summarization and characterization. Ultimately, this work is intended to facilitate the reader to identify the relevant features for building a dataset profile for intended applications together with the methods and tools capable of extracting these features from the datasets as well as vocabularies to describe the extracted features and make them available.
Detecting Biased Statements in Wikipedia. Hube, Christoph; Fetahu, Besnik P.-A. Champin, F. L. Gandon, M. Lalmas, P. G. Ipeirotis (eds.) (2018). 1779–1786.
Building and Querying Semantic Layers for Web Archives (Extended Version) Fafalios, Pavlos; Holzmann, Helge; Kasturia, Vaibhav; Nejdl, Wolfgang (2018).

Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles (“layers”) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts, and events), and publishing all these data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities, and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy.
Posthoc Interpretability of Learning to Rank Models using Secondary Training Data Singh, Jaspreet; Anand, Avishek (2018).
EventKG: A Multilingual Event-Centric Temporal Knowledge Graph Gottschalk, Simon; Demidova, Elena in Lecture Notes in Computer Science (2018). 272–287.

One of the key requirements to facilitate semantic analytics of information regarding contemporary and historical events on the Web, in the news and in social media is the availability of reference knowledge repositories containing comprehensive representations of events and temporal relations. Existing knowledge graphs, with popular examples including DBpedia, YAGO and Wikidata, focus mostly on entity-centric information and are insufficient in terms of their coverage and completeness with respect to events and temporal relations. EventKG presented in this paper is a multilingual event-centric temporal knowledge graph that aims to address this gap. EventKG incorporates over 690 thousand contemporary and historical events and over 2.3 million temporal relations extracted from several large-scale knowledge graphs and less structured sources and makes this information available through a canonical representation. In this paper we present EventKG including its data model, extraction process, and characteristics and discuss its relevance for several real-world applications including Question Answering, timeline generation and cross-cultural analytics.
Learning under Feature Drifts in Textual Streams Melidis, Damianos P.; Spiliopoulou, Myra; Ntoutsi, Eirini (2018).
DistrustRank: Spotting False News Domains Woloszyn, Vinicius; Nejdl, Wolfgang in WebSci’18 (2018).

In this paper we propose a semi-supervised learning strategy to automatically separate fake news from reliable news sources: DistrustRank. We first select a small set of unreliable news, manually evaluated and classified by experts on fact-checking portals. Once this set is created, DistrustRank constructs a weighted graph where nodes represent websites, connected by edges based on a minimum similarity between a pair of websites. Next it computes the centrality using a biased PageRank, where a bias is applied to the selected set of seeds. As an output of the proposed model we obtain a trust (or distrust) rank that can be used in two ways: a) as a counter-bias to be applied when news about a specific subject is ranked, in order to discount possible boosts achieved by false claims; and b) to assist humans in identifying sources that are likely to be sources of fake news (or that are likely to be reputable), suggesting websites that should be examined more closely or avoided. In our experiments, DistrustRank outperforms the supervised approaches in both the ranking and the classification tasks.
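The graph-plus-biased-PageRank idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the similarity threshold, the damping factor, the iteration count and the example domains are all assumptions made for illustration, and the site-to-site similarity scores are taken as given.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, float]


def build_edges(similarities: List[Edge], min_similarity: float) -> List[Edge]:
    """Keep only site pairs whose similarity reaches the (assumed) threshold."""
    return [(a, b, s) for a, b, s in similarities if s >= min_similarity]


def biased_pagerank(
    edges: List[Edge],
    seeds: Set[str],
    damping: float = 0.85,   # assumed value, not taken from the paper
    iterations: int = 50,    # assumed value, not taken from the paper
) -> Dict[str, float]:
    """Personalized PageRank on an undirected weighted graph.

    The teleport distribution is concentrated on the seed sites, so score
    mass spreads outward from the seeds along similarity edges.
    """
    if not seeds:
        raise ValueError("at least one seed site is required")

    # Symmetric weighted adjacency map.
    neighbors: Dict[str, Dict[str, float]] = {}
    for a, b, w in edges:
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w
    for s in seeds:
        neighbors.setdefault(s, {})

    nodes = sorted(neighbors)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            total = sum(neighbors[n].values())
            if total == 0.0:
                # Isolated node: hand its mass back to the seeds.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
                continue
            for m, w in neighbors[n].items():
                new_rank[m] += damping * rank[n] * w / total
        rank = new_rank
    return rank


# Hypothetical example: one known unreliable seed and three other sites.
similarities = [
    ("fake-a.example", "fake-b.example", 0.9),
    ("fake-b.example", "mixed.example", 0.4),
    ("mixed.example", "news.example", 0.2),
    ("fake-a.example", "news.example", 0.05),  # dropped by the threshold
]
scores = biased_pagerank(build_edges(similarities, 0.1), {"fake-a.example"})
for site, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{site}: {score:.3f}")
```

In this toy graph, sites that are closer to the seed through strong similarity edges receive more score mass than distant ones, which is the intuition behind using the result as a distrust signal.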
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets Fafalios, Pavlos; Iosifidis, Vasileios; Ntoutsi, Eirini; Dietze, Stefan (2018).

Publicly available social media archives facilitate research in a variety of fields, such as data science, sociology or the digital humanities, where Twitter has emerged as one of the most prominent sources. However, obtaining, archiving and annotating large amounts of tweets is costly. In this paper, we describe TweetsKB, a publicly available corpus of currently more than 1.5 billion tweets, spanning almost 5 years (Jan'13-Nov'17). Metadata information about the tweets as well as extracted entities, hashtags, user mentions and sentiment information are exposed using established RDF/S vocabularies. Next to a description of the extraction and annotation process, we present use cases to illustrate scenarios for entity-centric information exploration, data integration and knowledge discovery facilitated by TweetsKB.
Towards Better Understanding Researcher Strategies in Cross-Lingual Event Analytics. Gottschalk, Simon; Bernacchi, Viola; Rogers, Richard; Demidova, Elena in Lecture Notes in Computer Science, E. Méndez, F. Crestani, C. Ribeiro, G. David, J. C. Lopes (eds.) (2018). (Vol. 11057) 139–151.
A Trio Neural Model for Dynamic Entity Relatedness Ranking Nguyen, Tu Ngoc; Tran, Tuan; Nejdl, Wolfgang (2018).

Measuring entity relatedness is a fundamental task for many natural language processing and information retrieval applications. Prior work often studies entity relatedness in static settings and in an unsupervised manner. However, entities in the real world are often involved in many different relationships; consequently, entity relations are very dynamic over time. In this work, we propose a neural network-based approach for dynamic entity relatedness, leveraging the collective attention as supervision. Our model is capable of learning rich and different entity representations in a joint framework. Through extensive experiments on large-scale datasets, we demonstrate that our method achieves better results than competitive baselines.
Heuristics-based Query Reordering for Federated Queries in SPARQL 1.1 and SPARQL-LD Yannakis, Thanos; Fafalios, Pavlos; Tzitzikas, Yannis (2018).
Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media Archives Fafalios, Pavlos; Iosifidis, Vasileios; Stefanidis, Kostas; Ntoutsi, Eirini (2018).

How did the popularity of the Greek Prime Minister evolve in 2015? How did the predominant sentiment about him vary during that period? Were there any controversial sub-periods? What other entities were related to him during these periods? To answer these questions, one needs to analyze archived documents and data about the query entities, such as old news articles or social media archives. In particular, user generated content posted in social networks, like Twitter and Facebook, can be seen as a comprehensive documentation of our society, and thus, meaningful analysis methods over such archived data are of immense value for sociologists, historians, and other interested parties who want to study the history and evolution of entities and events. To this end, in this paper we propose an entity-centric approach to analyze social media archives and we define measures that allow studying how entities were reflected in social media in different time periods and under different aspects, like popularity, attitude, controversiality, and connectedness with other entities. A case study using a large Twitter archive of 4 years illustrates the insights that can be gained by such an entity-centric and multi-aspect analysis.
Building and Querying Semantic Layers for Web Archives. Fafalios, Pavlos; Holzmann, Helge; Kasturia, Vaibhav; Nejdl, Wolfgang (2017). 11–20.

Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles ("layers") that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts and events), and publishing all this data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy.
Multi-aspect Entity-Centric Analysis of Big Social Media Archives. Fafalios, Pavlos; Iosifidis, Vasileios; Stefanidis, Kostas; Ntoutsi, Eirini in Lecture Notes in Computer Science, J. Kamps, G. Tsakonas, Y. Manolopoulos, L. Iliadis, I. Karydis (eds.) (2017). 261–273.

Social media archives serve as important historical information sources, and thus meaningful analysis and exploration methods are of immense value for historians, sociologists and other interested parties. In this paper, we propose an entity-centric approach to analyze social media archives and we define measures that allow studying how entities are reflected in social media in different time periods and under different aspects (like popularity, attitude, controversiality, and connectedness with other entities). A case study using a large Twitter archive of 4 years illustrates the insights that can be gained by such an entity-centric multi-aspect analysis.
Universal Distant Reading through Metadata Proxies with ArchiveSpark Holzmann, Helge; Goel, Vinay; Gustainis, Emily Novak (2017).
Fine Grained Citation Span for References in Wikipedia Fetahu, Besnik; Markert, Katja; Anand, Avishek (2017).
Ongoing Events in Wikipedia: A Cross-lingual Case Study Gottschalk, Simon; Demidova, Elena; Bernacchi, Viola; Rogers, Richard (2017). 387–388.

In order to effectively analyze information regarding ongoing events that impact local communities across language and country borders, researchers often need to perform multilingual data analysis. This analysis can be particularly challenging due to the rapidly evolving event-centric data and the language barrier. In this abstract we present preliminary results of a case study with the goal to better understand how researchers interact with multilingual event-centric information in the context of cross-cultural studies and which methods and features they use.
ArchiveWeb: Collaboratively Extending and Exploring Web Archive Collections. How would you like to work with your collections? Fernando, Zeon Trevor; Marenzi, Ivana; Nejdl, Wolfgang (N. Adam; R. Furuta; E. Neuhold, eds.) (2017).

Curated web archive collections contain focused digital content which is collected by archiving organizations, groups and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. The first part of this paper describes the important insights we gained from our initial requirements engineering phase during the first year of the project and the main functionalities of the current ArchiveWeb system for searching, constructing, exploring and discussing web archive collections. The second part summarizes the feedback we received on this version from archiving organizations and libraries, as well as our corresponding plans for improving and extending the system for the next release.
Towards a Ranking Model for Semantic Layers over Digital Archives Fafalios, Pavlos; Kasturia, Vaibhav; Nejdl, Wolfgang (2017). 336–337.

Archived collections of documents (like newspaper archives) serve as important information sources for historians, journalists, sociologists and other interested parties. Semantic Layers over such digital archives allow describing and publishing metadata and semantic information about the archived documents in a standard format (RDF), which in turn can be queried through a structured query language (e.g., SPARQL). This makes it possible to run advanced queries by combining metadata of the documents (like publication date) and content-based semantic information (like entities mentioned in the documents). However, the results returned by structured queries can be numerous, and they all match the query equally. Thus, there is a need to rank these results in order to promote the most important ones. In this paper, we focus on this problem and propose a ranking model that considers and combines: i) the relativeness of documents to entities, ii) the timeliness of documents, and iii) the relations among the entities.
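As a rough illustration of how three such signals could be combined into a single ranking, the sketch below uses a plain weighted sum. The abstract does not state how the model combines its factors, so the weighted sum, the equal default weights, the `Candidate` fields, the assumption that each score is already normalised to [0, 1], and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One query result with three precomputed scores (all hypothetical)."""
    doc_id: str
    relativeness: float  # how strongly the document is about the query entities
    timeliness: float    # how well the publication date fits the period of interest
    relatedness: float   # strength of relations among the entities involved


def combined_score(
    c: Candidate, weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)
) -> float:
    """Weighted sum of the three signals; the weights are an assumption."""
    w_rel, w_time, w_ent = weights
    return w_rel * c.relativeness + w_time * c.timeliness + w_ent * c.relatedness


def rank_results(
    candidates: List[Candidate],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> List[Candidate]:
    """Return the candidates ordered from highest to lowest combined score."""
    return sorted(candidates, key=lambda c: combined_score(c, weights), reverse=True)


# Hypothetical results that all match the same structured query equally.
results = [
    Candidate("doc-1", relativeness=0.2, timeliness=0.9, relatedness=0.1),
    Candidate("doc-2", relativeness=0.8, timeliness=0.7, relatedness=0.6),
    Candidate("doc-3", relativeness=0.5, timeliness=0.1, relatedness=0.3),
]
for c in rank_results(results):
    print(c.doc_id, round(combined_score(c), 3))
```

Changing the weights shifts the emphasis between the three factors, for example favouring timeliness when the information need is tied to a specific period.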
Modeling Event Importance for Ranking Daily News Events Setty, Vinay; Anand, Abhijit; Mishra, Arunav; Anand, Avishek in WSDM ’17 (2017). 231–240.

We deal with the problem of ranking news events on a daily basis for large news corpora, an essential building block for news aggregation. News ranking has been addressed in the literature before but with individual news articles as the unit of ranking. However, estimating event importance accurately requires models to quantify current day event importance as well as its significance in the historical context. Consequently, in this paper we show that a cluster of news articles representing an event is a better unit of ranking as it provides an improved estimation of popularity, source diversity and authority cues. In addition, events facilitate quantifying their historical significance by linking them with long-running topics and recent chains of events. Our main contribution in this paper is to provide effective models for improved news event ranking. To this end, we propose novel event mining and feature generation approaches for improving estimates of event importance. Finally, we conduct an extensive evaluation of our approaches on two large real-world news corpora, each of which spans more than a year with a large volume of up to tens of thousands of daily news articles. Our evaluations are large-scale and based on a clean human-curated ground truth from the Wikipedia Current Events Portal. Experimental comparison with a state-of-the-art news ranking technique based on language models demonstrates the effectiveness of our approach.
Designing Search Tasks for Archive Search Singh, Jaspreet; Anand, Avishek in CHIIR ’17 (2017). 361–364.

Longitudinal corpora like legal, corporate and newspaper archives are of immense value to a variety of users, and time as an important factor strongly influences their search behavior in these archives. While many systems have been developed to support users' temporal information needs, questions remain over how users utilize these advances to satisfy their needs. Analyzing their search behavior will provide us with novel insights into search strategy, guide better interface and system design and highlight new problems for further research. In this paper we propose a set of search tasks, with varying complexity, that IIR researchers can utilize to study user search behavior in archives. We discuss how we created and refined these tasks as the result of a pilot study using a temporal search engine. We not only propose task descriptions but also pre and post-task evaluation mechanisms that can be employed for a large-scale study (crowdsourcing). Our initial findings show the viability of such tasks for investigating search behavior in archives.
Tempas: Temporal Archive Search Based on Tags. Holzmann, Helge; Anand, Avishek (2017). abs/1702.01076
Software citation, landing pages, and the swMATH service Sperber, Wolfram; Dalitz, Wolfgang; Holzmann, Helge (2017, October).
Software as a first-class citizen in web archives Holzmann, Helge (2017, May).
Who Likes Me More?: Analysing Entity-centric Language-specific Bias in Multilingual Wikipedia Zhou, Yiwei; Demidova, Elena; Cristea, Alexandra I. in SAC ’16 (2016). 750–757.

In this paper we take an important step towards better understanding the existence and extent of entity-centric language-specific bias in multilingual Wikipedia, and any deviation from its targeted neutral point of view. We propose a methodology using sentiment analysis techniques to systematically extract the variations in sentiments associated with real-world entities in different language editions of Wikipedia, illustrated with a case study of five Wikipedia language editions and a set of target entities from four categories.
Temporal Information Retrieval Kanhabua, Nattiya; Anand, Avishek in SIGIR ’16 (2016). 1235–1238.
Who likes me more? Analysing entity-centric language-specific bias in multilingual Wikipedia Zhou, Yiwei; Demidova, Elena; Cristea, Alexandra I. (2016).
History by Diversity: Helping Historians Search News Archives Singh, Jaspreet; Nejdl, Wolfgang; Anand, Avishek in CHIIR ’16 (2016). 183–192.

Longitudinal corpora like newspaper archives are of immense value to historical research, and time as an important factor for historians strongly influences their search behaviour in these archives. While searching for articles published over time, a key preference is to retrieve documents which cover the important aspects from important points in time, which is different from standard search behavior. To support this search strategy, we introduce the notion of a Historical Query Intent to explicitly model a historian's search task and define an aspect-time diversification problem over news archives. We present a novel algorithm, HistDiv, that explicitly models the aspects and important time windows based on a historian's information seeking behavior. By incorporating temporal priors based on publication times and temporal expressions, we diversify both on the aspect and temporal dimensions. We test our methods by constructing a test collection based on The New York Times Collection with a workload of 30 queries of historical intent assessed manually. We find that HistDiv outperforms all competitors in subtopic recall with a slight loss in precision. We also present results of a qualitative user study to determine whether this drop in precision is detrimental to user experience. Our results show that users still preferred HistDiv's ranking.
How to Search the Internet Archive Without Indexing It Kanhabua, Nattiya; Kemkes, Philipp; Nejdl, Wolfgang; Nguyen, Tu Ngoc; Reis, Felipe; Tran, Nam Khanh in Lecture Notes in Computer Science, N. Fuhr, L. Kovács, T. Risse, W. Nejdl (eds.) (2016). (Vol. 9819) 147–160.
Search As Research Practices on the Web: The SaR-Web Platform for Cross-language Engine Results Analysis Taibi, Davide; Rogers, Richard; Marenzi, Ivana; Nejdl, Wolfgang; Ahmad, Qazi Asim Ijaz; Fulantelli, Giovanni in WebSci ’16 (2016). 367–369.

Search engines are the most utilized tools to access information on the Web. The success of large companies such as Google is owed to their capacity to guide users through the vast troves of knowledge and information online. Recently, the concept of search as research has been used to shift the research focus from the workings of information-seeking tools towards methods for the social study of the Web and particularly the social meanings of engine results. In this paper, we present SaR-Web, a web search tool that provides an automatic means to carry out search as research on the Web. It compares the results of the same (translated) queries across search engine language domains, thereby enabling cross-linguistic and cross-cultural comparisons of results. SaR-Web outputs enable the comparative study of cultural mores as well as societal associations and concerns, interpreted through search engine results.
Finding News Citations for Wikipedia Fetahu, Besnik; Markert, Katja; Nejdl, Wolfgang; Anand, Avishek in CIKM ’16 (2016). 337–346.

An important editing policy in Wikipedia is to provide citations for added statements in Wikipedia pages, where statements can be arbitrary pieces of text, ranging from a sentence to a paragraph. In many cases citations are either outdated or missing altogether. In this work we address the problem of finding and updating news citations for statements in entity pages. We propose a two-stage supervised approach for this problem. In the first step, we construct a classifier to find out whether statements need a news citation or other kinds of citations (web, book, journal, etc.). In the second step, we develop a news citation algorithm for Wikipedia statements, which recommends appropriate citations from a given news collection. Apart from IR techniques that use the statement to query the news collection, we also formalize three properties of an appropriate citation, namely: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should be from an authoritative source. We perform an extensive evaluation of both steps, using 20 million articles from a real-world news collection. Our results are quite promising, and show that we can perform this task with high precision and at scale.
On the Applicability of Delicious for Temporal Search on Web Archives Holzmann, Helge; Nejdl, Wolfgang; Anand, Avishek in SIGIR ’16 (2016). 929–932.

Web archives are large longitudinal collections that store webpages from the past, which might be missing on the current live Web. Consequently, temporal search over such collections is essential for finding prominent missing webpages and tasks like historical analysis. However, this has been challenging due to the lack of popularity information and proper ground truth to evaluate temporal retrieval models. In this paper we investigate the applicability of external longitudinal resources to identify important and popular websites in the past and analyze the social bookmarking service Delicious for this purpose. The timestamped bookmarks on Delicious provide explicit cues about popular time periods in the past along with relevant descriptors. These are valuable to identify important documents in the past for a given temporal query. Focusing purely on recall, we analyzed more than 12,000 queries and find that using Delicious yields average recall values from 46% up to 100%, when limiting ourselves to the best represented queries in the considered dataset. This constitutes an attractive and low-overhead approach for quick access into Web archives by not dealing with the actual contents.
SaR-Web - A Tool to Support Search as Learning Processes Fulantelli, Giovanni; Marenzi, Ivana; Ahmad, Qazi Asim Ijaz; Taibi, Davide in CEUR Workshop Proceedings, J. Gwizdka, P. Hansen, C. Hauff, J. He, N. Kando (eds.) (2016). (Vol. 1647)
ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation Holzmann, Helge; Goel, Vinay; Anand, Avishek in JCDL ’16 (2016). 83–92.

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.
Cobwebs from the Past and Present: Extracting Large Social Networks Using Internet Archive Data Shaltev, Miroslav; Zab, Jan-Hendrik; Kemkes, Philipp; Siersdorfer, Stefan; Zerr, Sergej in SIGIR ’16 (2016). 1093–1096.

Social graph construction from various sources has been of interest to researchers due to its application potential and the broad range of technical challenges involved. The World Wide Web provides a huge amount of continuously updated data and information on a wide range of topics created by a variety of content providers, and makes the study of extracted people networks and their temporal evolution valuable for social as well as computer scientists. In this paper we present SocGraph - an extraction and exploration system for social relations from the content of around 2 billion web pages collected by the Internet Archive over the 17-year period between 1996 and 2013. We describe methods for constructing large social graphs from extracted relations and introduce an interface to study their temporal evolution.
Named entity evolution recognition on the Blogosphere Holzmann, Helge; Tahmasebi, Nina; Risse, Thomas (2015). 15(2-4) 209–235.

Advancements in technology and culture lead to changes in our language. These changes create a gap between the language known by users and the language stored in digital archives. This gap affects users’ ability first to find content and then to interpret it. In previous work, we introduced our approach for named entity evolution recognition (NEER) in newspaper collections. Lately, increasing efforts in Web preservation have led to increased availability of Web archives covering longer time spans. However, language on the Web is more dynamic than in traditional media and many of the basic assumptions from the newspaper domain do not hold for Web data. In this paper we discuss the limitations of existing methodology for NEER. We address these by adapting an existing NEER method to work on noisy data like the Web and the Blogosphere in particular. We develop novel filters that reduce the noise and make use of Semantic Web resources to obtain more information about terms. Our evaluation shows the potential of the proposed approach.
Groupsourcing: Team Competition Designs for Crowdsourcing Rokicki, Markus; Zerr, Sergej; Siersdorfer, Stefan in WWW ’15 (2015). 906–915.

Many data processing tasks such as semantic annotation of images, translation of texts in foreign languages, and labeling of training data for machine learning models require human input, and, on a large scale, can only be accurately solved using crowd based online work. Recent work shows that frameworks where crowd workers compete against each other can drastically reduce crowdsourcing costs, and outperform conventional reward schemes where the payment of online workers is proportional to the number of accomplished tasks ("pay-per-task"). In this paper, we investigate how team mechanisms can be leveraged to further improve the cost efficiency of crowdsourcing competitions. To this end, we introduce strategies for team based crowdsourcing, ranging from team formation processes where workers are randomly assigned to competing teams, over strategies involving self-organization where workers actively participate in team building, to combinations of team and individual competitions. Our large-scale experimental evaluation with more than 1,100 participants and overall 5,400 hours of work spent by crowd workers demonstrates that our team based crowdsourcing mechanisms are well accepted by online workers and lead to substantial performance boosts.
Learning to Detect Event-Related Queries for Web Search Kanhabua, Nattiya; Ngoc Nguyen, Tu; Nejdl, Wolfgang in WWW ’15 Companion (2015). 1339–1344.

In many cases, a user turns to search engines to find information about real-world situations, namely, political elections, sport competitions, or natural disasters. Such temporal querying behavior can be observed through a significant number of event-related queries generated in web search. In this paper, we study the task of detecting event-related queries, which is the first step for understanding temporal query intent and enabling different temporal search applications, e.g., time-aware query auto-completion, temporal ranking, and result diversification. We propose a two-step approach to detecting events from query logs. We first identify a set of event candidates by considering both implicit and explicit temporal information needs. The next step further classifies the candidates into two main categories, namely, event or non-event. In more detail, we leverage different machine learning techniques for query classification, which are trained using the feature set composed of time series features from signal processing, along with features derived from click-through information, and standard statistical features. In order to evaluate our proposed approach, we conduct an experiment using two real-world query logs with manually annotated relevance assessments for 837 events. To this end, we provide a large set of event-related queries made available for fostering research on this challenging task.
iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling Gossen, Gerhard; Demidova, Elena; Risse, Thomas in JCDL ’15 (2015). 75–84.

Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Especially Social Media provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issues of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler.
Improving Entity Retrieval on Structured Data Fetahu, Besnik; Gadiraju, Ujwal; Dietze, Stefan (2015).
Semantic Annotation for Microblog Topics Using Wikipedia Temporal Information Tran, Tuan; Tran, Nam-Khanh; Teka Hadgu, Asmelash; Jäschke, Robert (2015).

Trending topics in microblogs such as Twitter are valuable resources to understand social aspects of real-world events. To enable deep analyses of such trends, semantic annotation is an effective approach; yet the problem of annotating microblog trending topics is largely unexplored by the research community. In this work, we tackle the problem of mapping trending Twitter topics to entities from Wikipedia. We propose a novel model that complements traditional text-based approaches by rewarding entities that exhibit a high temporal correlation with topics during their burst time period. By exploiting temporal information from the Wikipedia edit history and page view logs, we have improved the annotation performance by 17-28%, as compared to the competitive baselines.
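The idea of rewarding entities whose activity correlates in time with a topic can be illustrated with a minimal sketch. It assumes that the topic's tweet volume and each candidate entity's Wikipedia page views are available as equal-length daily counts over the burst period. The function names, the sample numbers and the choice of Pearson correlation are illustrative assumptions, not the model proposed in the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no temporal signal
    return cov / sqrt(var_x * var_y)


def rank_entities(topic_series, entity_series):
    """Order candidate entities by correlation with the topic, highest first."""
    scores = {name: pearson(topic_series, series)
              for name, series in entity_series.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical daily counts during a topic's burst period.
topic = [5, 40, 120, 60, 10]
candidates = {
    "Entity_A": [100, 900, 2500, 1300, 200],  # bursts together with the topic
    "Entity_B": [500, 480, 510, 490, 505],    # flat background interest
}
ranking = rank_entities(topic, candidates)
```

In a fuller system such a temporal score would be combined with a text-based similarity score rather than used on its own.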
Named Entity Evolution Recognition on the Blogosphere Holzmann, Helge; Tahmasebi, Nina; Risse, Thomas (2015). 15(2-4) 209–235.
Mining Relevant Time for Query Subtopics in Web Archives Nguyen, Tu Ngoc; Kanhabua, Nattiya; Nejdl, Wolfgang; Niederée, Claudia in TempWeb’2015 (2015).
Learning to Detect Event-Related Queries for Web Search Kanhabua, Nattiya; Nguyen, Tu Ngoc; Nejdl, Wolfgang in TempWeb’2015 (2015).
Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives Souza, Tarcísio; Demidova, Elena; Risse, Thomas; Holzmann, Helge; Gossen, Gerhard; Szymanski, Julian (2015). 153–166.
Time-travel Translator: Automatically Contextualizing News Articles Tran, Nam Khanh; Ceroni, Andrea; Kanhabua, Nattiya; Niederée, Claudia in WWW’2015 (2015).
Who With Whom And How?: Extracting Large Social Networks Using Search Engines Siersdorfer, Stefan; Kemkes, Philipp; Ackermann, Hanno; Zerr, Sergej in CIKM ’15 (2015). 1491–1500.
Improving Entity Retrieval on Structured Data Fetahu, Besnik; Gadiraju, Ujwal; Dietze, Stefan in Lecture Notes in Computer Science (2015). (Vol. 9366) 474–491.
Balancing Novelty and Salience: Adaptive Learning to Rank Entities for Timeline Summarization of High-impact Events Tran, Tuan A.; Niederee, Claudia; Kanhabua, Nattiya; Gadiraju, Ujwal; Anand, Avishek in CIKM ’15 (2015). 1201–1210.

Long-running, high-impact events such as the Boston Marathon bombing often develop through many stages and involve a large number of entities in their unfolding. Timeline summarization of an event by key sentences eases story digestion, but does not distinguish between what a user remembers and what she might want to re-check. In this work, we present a novel approach for timeline summarization of high-impact events, which uses entities instead of sentences for summarizing the event at each individual point in time. Such entity summaries can serve as both (1) important memory cues in a retrospective event consideration and (2) pointers for personalized event exploration. In order to automatically create such summaries, it is crucial to identify the "right" entities for inclusion. We propose to learn a ranking function for entities, with a dynamically adapted trade-off between the in-document salience of entities and the informativeness of entities across documents, i.e., the level of new information associated with an entity for a time point under consideration. Furthermore, for capturing collective attention for an entity we use an innovative soft labeling approach based on Wikipedia. Our experiments on large, real news datasets confirm the effectiveness of the proposed methods.
The iCrawl Wizard – Supporting Interactive Focused Crawl Specification Gossen, Gerhard; Demidova, Elena; Risse, Thomas (2015).
Insights into Entity Name Evolution on Wikipedia Holzmann, Helge; Risse, Thomas; B. Benatallah, A. Bestavros, Y. Manolopoulos, A. Vakali, Y. Zhang (eds.) (2014). (Vol. 8787) 47–61.

Working with Web archives raises a number of issues caused by their temporal characteristics. Depending on the age of the content, additional knowledge might be needed to find and understand older texts. Facts about entities are especially subject to change. Most severe in terms of information retrieval are name changes. In order to find entities that have changed their name over time, search engines need to be aware of this evolution. We tackle this problem by analyzing Wikipedia in terms of entity evolutions mentioned in articles, regardless of the structural elements. We gathered statistics and automatically extracted minimum excerpts covering name changes by incorporating lists dedicated to that subject. In future work, these excerpts are going to be used to discover patterns and detect changes in other sources. In this work we investigate whether or not Wikipedia is a suitable source for extracting the required knowledge.
Analysing and Enriching Focused Semantic Web Archives for Parliament Applications Demidova, Elena; Barbieri, Nicola; Dietze, Stefan; Funk, Adam; Holzmann, Helge; Maynard, Diana; Papailiou, Nikolaos; Peters, Wim; Risse, Thomas; Spiliotopoulos, Dimitris (2014). 6(3) 433.

The web and the social web play an increasingly important role as an information source for Members of Parliament and their assistants, journalists, political analysts and researchers. It provides important and crucial background information, like reactions to political events and comments made by the general public. The case study presented in this paper is driven by two European parliaments (the Greek and the Austrian parliament) and targets an effective exploration of political web archives. In this paper, we describe semantic technologies deployed to ease the exploration of the archived web and social web content and present evaluation results.
Bridging Temporal Context Gaps Using Time-aware Re-contextualization Ceroni, Andrea; Tran, Nam Khanh; Kanhabua, Nattiya; Niederée, Claudia in SIGIR ’14 (2014). 1127–1130.

Understanding a text that was written some time ago can be compared to translating a text from another language. Complete interpretation requires a mapping, in this case, a kind of time-travel translation between present context knowledge and context knowledge at the time of text creation. In this paper, we study time-aware re-contextualization, the challenging problem of retrieving concise and complementary information in order to bridge this temporal context gap. We propose an approach based on learning to rank techniques using sentence-level context information extracted from Wikipedia. The employed ranking combines relevance, complementarity and time-awareness. The effectiveness of the approach is evaluated by contextualizing articles from a news archive collection using more than 7,000 manually judged relevance pairs. We show that our approach is able to retrieve a significant amount of relevant context information for a given news article.
On the Value of Temporal Anchor Texts in Wikipedia Kanhabua, Nattiya; Nejdl, Wolfgang (2014).
Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History Tran, Tuan; Nguyen, Tu Ngoc (2014).
Competitive Game Designs for Improving the Cost Effectiveness of Crowdsourcing Rokicki, Markus; Chelaru, Sergiu; Zerr, Sergej; Siersdorfer, Stefan in CIKM ’14 (2014). 1469–1478.
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia Kanhabua, Nattiya; Ngoc Nguyen, Tu; Niederée, Claudia (2014).
Extraction of Evolution Descriptions from the Web Holzmann, Helge; Risse, Thomas in JCDL ’14 (2014). 413–414.

The evolution of named entities affects exploration and retrieval tasks in digital libraries. An information retrieval system that is aware of name changes can actively support users in finding former occurrences of evolved entities. However, current structured knowledge bases, such as DBpedia or Freebase, do not provide enough information about evolutions, even though the data is available on their resources, like Wikipedia. Our Evolution Base prototype will demonstrate how excerpts describing name evolutions can be identified on these websites with a promising precision. The descriptions are classified by means of models that we trained based on a recent analysis of named entity evolutions on Wikipedia.
A Burstiness-aware Approach for Document Dating Kotsakos, Dimitrios; Lappas, Theodoros; Kotzias, Dimitrios; Gunopulos, Dimitrios; Kanhabua, Nattiya; Nørvåg, Kjetil (2014).
Named Entity Evolution Analysis on Wikipedia Holzmann, Helge; Risse, Thomas in WebSci ’14 (2014). 241–242.

Accessing Web archives raises a number of issues caused by their temporal characteristics. Additional knowledge is needed to find and understand older texts. Especially entities mentioned in texts are subject to change. Most severe in terms of information retrieval are name changes. In order to find entities that have changed their name over time, search engines need to be aware of this evolution. We tackle this problem by analyzing Wikipedia in terms of entity evolutions mentioned in articles. We present statistical data on excerpts covering name changes, which will be used to discover similar text passages and extract evolution knowledge in future work.
Bridging Temporal Context Gaps using Time-Aware Re-Contextualization Ceroni, Andrea; Khanh Tran, Nam; Kanhabua, Nattiya; Niederée, Claudia (2014).
Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study Yuan, Wancheng; Demidova, Elena; Dietze, Stefan; Zhou, Xuan (2014). (Vol. 1272) 197–200.
iCrawl: An integrated focused crawling toolbox for Web Science Gossen, Gerhard; N. Brügger (ed.) (2014).

Within the scientific community an increasing interest in using Web content for research can be observed. The Social Web is especially attractive for many humanities disciplines as it provides direct access to the thoughts of many people about politics, popular topics and events. Documenting the activities on the Web and Social Web in Web archives facilitates better understanding of the public perception. However, state-of-the-art Web archive crawlers like Heritrix have significant limitations in terms of usability, functionality and maintenance with regard to the needs of the scientific community. The iCrawl project aims to provide an integrated crawling toolbox with an intuitive, flexible and extensible set of Web crawling components.
Leveraging Dynamic Query Subtopics for Time-Aware Search Result Diversification Nguyen, Tu Ngoc; Kanhabua, Nattiya in Lecture Notes in Computer Science, M. de Rijke, T. Kenter, A. P. de Vries, C. Zhai, F. de Jong, K. Radinsky, K. Hofmann (eds.) (2014). (Vol. 8416) 222–234.
Analysing the Duration of Trending Topics in Twitter using Wikipedia Tran, Tuan; Georgescu, Mihai; Zhu, Xiaofei; Kanhabua, Nattiya (2014).
What Do You Want to Collect from the Web? Risse, Thomas; Demidova, Elena; Gossen, Gerhard (2014).

Today an increasing interest in collecting, analyzing and preserving Web content can be observed in many digital humanities disciplines. The Social Web is especially attractive for many humanities disciplines as it provides direct access to statements of many people about politics, popular topics or events. In this paper we present an exemplary study that we have conducted with the aim of understanding the needs of scientists in social sciences, historical sciences and law with respect to the creation of Web archives.
Proceedings of the 1st International Workshop on Dataset PROFIling & fEderated Search for Linked Data (PROFILES 2014), co-located with the 11th Extended Semantic Web Conference (ESWC 2014), Anissaras, Crete, Greece, 26 May 2014. Demidova, E.; Dietze, S.; Szymanski, J.; Breslin, J. (2014). (Vol. 1151) CEUR Workshop Proceedings.
ASTERIX: an open source system for "Big Data" management and analysis (demo) Alsubaiee, Sattam; Altowim, Yasser; Altwaijry, Hotham; Behm, Alexander; Borkar, Vinayak; Bu, Yingyi; Carey, Michael; Grover, Raman; Heilbron, Zachary; Kim, Young-Seok; Li, Chen; Onose, Nicola; Pirzadeh, Pouria; Vernica, Rares; Wen, Jian (2012). 5(12) 1898–1901.

At UC Irvine, we are building a next generation parallel database system, called ASTERIX, as our approach to addressing today's "Big Data" management challenges. ASTERIX aims to combine time-tested principles from parallel database systems with those of the Web-scale computing community, such as fault tolerance for long running jobs. In this demo, we present a whirlwind tour of ASTERIX, highlighting a few of its key features. We will demonstrate examples of our data definition language to model semi-structured data, and examples of interesting queries using our declarative query language. In particular, we will show the capabilities of ASTERIX for answering geo-spatial queries and fuzzy queries, as well as ASTERIX' data feed construct for continuously ingesting data.
Information integration over time in unreliable and uncertain environments Pal, Aditya; Rastogi, Vibhor; Machanavajjhala, Ashwin; Bohannon, Philip (2012). 789–798.

Often an interesting true value such as a stock price, sports score, or current temperature is only available via the observations of noisy and potentially conflicting sources. Several techniques have been proposed to reconcile these conflicts by computing a weighted consensus based on source reliabilities, but these techniques focus on static values. When the real-world entity evolves over time, the noisy sources can delay, or even miss, reporting some of the real-world updates. This temporal aspect introduces two key challenges for consensus-based approaches: (i) due to delays, the mapping between a source's noisy observation and the real-world update it observes is unknown, and (ii) missed updates may translate to missing values for the consensus problem, even if the mapping is known. To overcome these challenges, we propose a formal approach that models the history of updates of the real-world entity as a hidden semi-Markovian process (HSMM). The noisy sources are modeled as observations of the hidden state, but the mapping between a hidden state (i.e. real-world update) and the observation (i.e. source value) is unknown. We propose algorithms based on Gibbs Sampling and EM to jointly infer both the history of real-world updates as well as the unknown mapping between them and the source values. We demonstrate using experiments on real-world datasets how our history-based techniques improve upon history-agnostic consensus-based approaches.
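For orientation, the static weighted-consensus baseline that the abstract contrasts against can be sketched as follows. This is a minimal, history-agnostic illustration that assumes source reliabilities are already known; the function name, the source names and the numbers are hypothetical, and the sketch does not implement the HSMM-based method of the paper.

```python
from collections import defaultdict


def weighted_consensus(observations, reliability):
    """Return the reported value with the largest total reliability weight.

    observations: mapping source -> reported value
    reliability:  mapping source -> weight (assumed to be known in advance)
    """
    if not observations:
        raise ValueError("no observations")
    support = defaultdict(float)
    for source, value in observations.items():
        # Sources without a known reliability contribute no weight.
        support[value] += reliability.get(source, 0.0)
    return max(support.items(), key=lambda item: item[1])[0]


# Hypothetical snapshot of three sources reporting a stock price.
reports = {"feed_a": 101.5, "feed_b": 101.5, "feed_c": 99.0}
weights = {"feed_a": 0.4, "feed_b": 0.3, "feed_c": 0.6}
consensus = weighted_consensus(reports, weights)
```

Because each snapshot is resolved independently, a source that merely reports an update late is treated as conflicting with the others, which is the weakness the history-based approach targets.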
Creating a searchable web archive Gomes, Daniel; Cruz, David; Miranda, João; Costa, Miguel; Fontes, Simão (2012).

The web has become a mass means of publication that has been replacing printed media. However, its information is extremely ephemeral. Currently, most of the information available on the web is less than 1 year old. There are several initiatives worldwide that struggle to archive information from the web before it vanishes. However, search mechanisms to access this information are still limited and do not satisfy users, who demand performance similar to live-web search engines. This paper presents some of the work developed to create an efficient and effective searchable web archive service, from data acquisition to user interface design. The results of research were applied in practice to create the Portuguese Web Archive, which has been publicly available since January 2010. It supports full-text search over 1 billion contents archived from 1996 to 2010. The developed software is available as an open source project.
The History of Web Archiving Toyoda, M.; Kitsuregawa, M. (2012). 100(Special Centennial Issue) 144–1443.

This paper describes the history and the current challenges of archiving massive and extremely diverse amounts of user-generated data in an international environment on the World Wide Web and the technologies required for interoperability between service providers and for preserving their contents in the future.
User browsing behavior-driven web crawling Liu, Minghai; Cai, Rui; Zhang, Ming; Zhang, Lei (2011). 87–92.

To optimize the performance of web crawlers, various page importance measures have been studied to select and order URLs in crawling. Most sophisticated measures (e.g. breadth-first and PageRank) are based on link structure. In this paper, we treat the problem from another perspective and propose to measure page importance through mining user interest and behaviors from web browsing logs. Unlike most existing approaches, which work on single URLs, in this paper both the log mining and the crawl ordering are performed at the granularity of URL patterns. The proposed URL pattern-based crawl orderings are capable of properly predicting the importance of newly created (unseen) URLs. Promising experimental results proved the feasibility of our approach.
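Working at the granularity of URL patterns can be illustrated with a small sketch. It assumes a deliberately simple pattern definition (purely numeric path segments are replaced by a wildcard) and uses raw visit counts from a browsing log as the importance score; both choices, as well as all function names and URLs, are illustrative assumptions rather than the method of the paper.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def url_pattern(url):
    """Generalize a URL by replacing purely numeric path segments with '*'."""
    parts = urlsplit(url)
    segments = ["*" if re.fullmatch(r"\d+", seg) else seg
                for seg in parts.path.split("/")]
    return parts.netloc + "/".join(segments)


def pattern_scores(visited_urls):
    """Count logged visits per URL pattern."""
    return Counter(url_pattern(u) for u in visited_urls)


def crawl_order(candidates, scores):
    """Sort candidate URLs by the visit count of their pattern, highest first."""
    return sorted(candidates, key=lambda u: scores[url_pattern(u)], reverse=True)


# Hypothetical browsing log and newly discovered, never-visited URLs.
log = [
    "http://example.com/news/101",
    "http://example.com/news/102",
    "http://example.com/about",
]
scores = pattern_scores(log)
new_urls = ["http://example.com/about/team", "http://example.com/news/999"]
order = crawl_order(new_urls, scores)
```

The unseen URL `http://example.com/news/999` inherits the score of its pattern, which is how a pattern-level model can rank pages that never appeared in the log.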
Collaborative search in electronic health records Zheng, Kai; Mei, Qiaozhu; Hanauer, David A (2011). 18(3) 282–291.

Objective: A full-text search engine can be a useful tool for augmenting the reuse value of unstructured narrative data stored in electronic health records (EHR). A prominent barrier to the effective utilization of such tools originates from users' lack of search expertise and/or medical-domain knowledge. To mitigate the issue, the authors experimented with a ‘collaborative search’ feature through a homegrown EHR search engine that allows users to preserve their search knowledge and share it with others. This feature was inspired by the success of many social information-foraging techniques used on the web that leverage users' collective wisdom to improve the quality and efficiency of information retrieval. Design: The authors conducted an empirical evaluation study over a 4-year period. The user sample consisted of 451 academic researchers, medical practitioners, and hospital administrators. The data were analyzed using a social-network analysis to delineate the structure of the user collaboration networks that mediated the diffusion of knowledge of search. Results: The users embraced the concept with considerable enthusiasm. About half of the EHR searches processed by the system (0.44 million) were based on stored search knowledge; 0.16 million utilized shared knowledge made available by other users. The social-network analysis results also suggest that the user-collaboration networks engendered by the collaborative search feature played an instrumental role in enabling the transfer of search knowledge across people and domains. Conclusion: Applying collaborative search, a social information-foraging technique popularly used on the web, may provide the potential to improve the quality and efficiency of information retrieval in healthcare.
Discovering URLs through user feedback Bai, Xiao; Cambazoglu, B. Barla; Junqueira, Flavio P. (2011). 77–86.

Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers, but not discovered by the Yahoo! Web crawler. We show that a high fraction of URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler later than the time they are first accessed by users. One important conclusion of our work is that web search engines can benefit greatly from user feedback in the form of toolbar logs for passive URL discovery.
Dremel: interactive analysis of web-scale datasets Melnik, Sergey; Gubarev, Andrey; Long, Jing Jing; Romer, Geoffrey; Shivakumar, Shiva; Tolton, Matt; Vassilakis, Theo (2010). 3(1-2) 330–339.

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
Do you want to take notes?: identifying research missions in Yahoo! search pad Donato, Debora; Bonchi, Francesco; Chi, Tom; Maarek, Yoelle (2010). 321–330.

Addressing users' information needs has been one of the main goals of Web search engines since their early days. In some cases, users cannot see their needs immediately answered by search results, simply because these needs are too complex and involve multiple aspects that are not covered by a single Web or search results page. This typically happens when users investigate a certain topic in domains such as education, travel or health, which often require collecting facts and information from many pages. We refer to this type of activity as "research missions". These research missions account for 10% of users' sessions and more than 25% of all query volume, as verified by a manual analysis that was conducted by Yahoo! editors.

We demonstrate in this paper that such missions can be automatically identified on-the-fly, as the user interacts with the search engine, through careful runtime analysis of query flows and query sessions.

The on-the-fly automatic identification of research missions has been implemented in Search Pad, a novel Yahoo! application that was launched in 2009, and that we present in this paper. Search Pad helps users keep track of results they have consulted. Its novelty, however, is that unlike previous note-taking products, it is automatically triggered only when the system decides, with a fair level of confidence, that the user is undertaking a research mission and thus is in the right context for gathering notes. Beyond the Search Pad specific application, we believe that changing the level of granularity of query modeling, from an isolated query to a list of queries pertaining to the same research mission, so as to better reflect a certain type of information need, can be beneficial in a number of other Web search applications. Session-awareness is growing and it is likely to play, in the near future, a fundamental role in many on-line tasks: this paper presents a first step on this path.
Annotating named entities in Twitter data with crowdsourcing Finin, Tim; Murnane, Will; Karandikar, Anand; Keller, Nicholas; Martineau, Justin; Dredze, Mark (2010). 80–88.

We describe our experience using both Amazon Mechanical Turk (MTurk) and CrowdFlower to collect simple named entity annotations for Twitter status updates. Unlike most genres that have traditionally been the focus of named entity experiments, Twitter is far more informal and abbreviated. The collected annotations and annotation techniques will provide a first step towards the full study of named entity recognition in domains like Facebook and Twitter. We also briefly describe how to use MTurk to collect judgements on the quality of "word clouds."
A Taxonomy of Collaboration in Online Information Seeking Golovchinsky, Gene; Pickens, Jeremy; Back, Maribeth (2009). abs/0908.0704

People can help other people find information in networked information seeking environments. Recently, many such systems and algorithms have proliferated in industry and in academia. Unfortunately, it is difficult to compare the systems in meaningful ways because they often define collaboration in different ways. In this paper, we propose a model of possible kinds of collaboration, and illustrate it with examples from literature. The model contains four dimensions: intent, depth, concurrency and location. This model can be used to classify existing systems and to suggest possible opportunities for design in this space.
A study of link farm distribution and evolution using a time series of web snapshots Chung, Young-joo; Toyoda, Masashi; Kitsuregawa, Masaru (2009). 9–16.

In this paper, we study the overall link-based spam structure and its evolution, which would be helpful for the development of robust analysis tools and for research on Web spamming as a social activity in cyberspace. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, the so-called core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we can find new large link farms during each iteration, and this trend continues for at least 10 iterations. In addition, we measure the spamicity of such link farms. Next, the evolution of link farms is examined over two years. Results show that almost all large link farms no longer grow while some of them shrink, and many large link farms are created in one year.
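The first step described above, separating the largest SCC (the core) from the remaining components, can be sketched with a plain Kosaraju-style decomposition. The toy graph, the function names and the rule of treating every non-trivial SCC outside the core as a candidate link farm are assumptions made for illustration; the node filtering and recursive decomposition of the core are not shown.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to a list of successors."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # First pass: record nodes in order of DFS completion (iterative DFS).
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = {node: [] for node in nodes}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)

    # Second pass: explore the reversed graph in decreasing finish time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def split_core(graph):
    """Return the largest SCC and the other SCCs with more than one node."""
    components = strongly_connected_components(graph)
    core = max(components, key=len)
    others = [c for c in components if c is not core and len(c) > 1]
    return core, others


# Hypothetical host graph: a 4-node cycle, a 2-node cycle and a dangling node.
web = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a", "x"],
       "x": ["y"], "y": ["x", "z"]}
core, candidate_farms = split_core(web)
```

The iterative depth-first search avoids Python's recursion limit, which matters for graphs far larger than this toy example.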
Webarchivierung und Web Archive Mining: Notwendigkeit, Probleme und Lösungsansätze Rauber, Andreas; Kaiser, Max (M. Knoll; A. Meier, eds.) (2009). 268

In recent years, libraries and archives have increasingly taken on the task of collecting content from the World Wide Web in addition to conventional publications, in order to preserve this valuable part of our cultural heritage and to keep important information available in the long term. These massive data collections offer fascinating opportunities to quickly access important information that has already been lost on the live Web. They are an indispensable source for researchers who, in the future, will want to trace the societal and technological development of our time. On the other hand, such a data collection constitutes an entirely new body of data that raises not only legal but also numerous ethical questions concerning its use. These questions will grow to the extent that the technical capabilities for automatic analysis and interpretation of the data become more powerful. Since most Web archiving initiatives are aware of this problem, use of the data currently remains heavily restricted in most cases, or some kind of "opt-out" option is provided through which website owners can exclude their pages from inclusion in a Web archive. With either approach, Web archives cannot realize their full potential for use. This article begins by briefly describing the technologies used to collect Web content for archiving purposes. It questions assumptions concerning the free availability of the data and different kinds of use. Building on this, it identifies a series of open questions whose resolution could allow broader access to and better use of Web archives.
Finding high-quality content in social media Agichtein, Eugene; Castillo, Carlos; Donato, Debora; Gionis, Aristides; Mishne, Gilad (2008). 183–194.

The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions (social media sites) becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, that can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.
Socio-Sense: A System for Analysing the Societal Behavior from Long Term Web Archive Kitsuregawa, Masaru; Tamura, Takayuki; Toyoda, Masashi; Kaji, Nobuhiro in Y. Zhang, G. Yu, E. Bertino, G. Xu (eds.) (2008). (Vol. 4976) 1–8.

We introduce the Socio-Sense Web analysis system. The system applies structural and temporal analysis methods to a long-term Web archive to obtain insight into real society. We present an overview of the system and its core methods, followed by excerpts from case studies on consumer behavior analyses.
A user reputation model for a user-interactive question answering system Chen, Wei; Zeng, Qingtian; Wenyin, Liu; Hao, Tianyong (2007). 19(15) 2091–2103.

In this paper, we propose a user reputation model and apply it to a user-interactive question answering system. It combines the social network analysis approach and the user rating approach. Social network analysis is applied to analyze the impact of participant users' relations on their reputations. User rating is used to acquire direct judgment of a user's reputation based on other users' experiences with this user. Preliminary experiments show that the reputations computed with our proposed reputation model can reflect the actual reputations of the simulated roles and therefore fit in well with our user-interactive question answering system. Copyright © 2006 John Wiley & Sons, Ltd.
RankMass crawler: a crawler with high personalized pagerank coverage guarantee Cho, Junghoo; Schonfeld, Uri (2007). 375–386.

Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover "most" of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing, in the context of a system that is given a set of trusted pages, a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the "important" part of the Web they will download after crawling a certain number of pages and (2) give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high "coverage" of the Web with a relatively small number of pages.
Distributed Indexing of Large-Scale Web Collections Costa, M.; Silva, M. (2005). 3(1) 2–8.

Sidra is a new indexing and ranking system for large-scale Web collections. Sidra creates multiple distributed indexes, organized and partitioned by different ranking criteria, aimed at supporting contextualized queries over hypertexts and their metadata. This paper presents the architecture of Sidra and the algorithms used to create its indexes. Performance measurements on the Portuguese Web data show that Sidra's indexing times and scalability are comparable to those of global Web search engines.
User-centric Web crawling Pandey, Sandeep; Olston, Christopher (2005). 401–411.

Search engines are the primary gateways of information access on the Web today. Behind the scenes, search engines crawl the Web to populate a local indexed repository of Web pages, used to answer user search queries. In an aggregate sense, the Web is very dynamic, causing any repository of Web pages to become out of date over time, which in turn causes query answer quality to degrade. Given the considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not feasible for a search engine to maintain its repository exactly synchronized with the Web. In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
Archiving the World Wide Web Lyman, Peter (2002). 38–51.