The Neptuno Project

Project overview

Newspaper archives are a highly valuable information asset for a wide range of information consumers: students, researchers, historians, business professionals, the general public, and not least, news writers themselves. With the introduction of digital formats and internet technologies in the news industry, a whole new market of online services for archive news redistribution, syndication, aggregation, and brokering has emerged within a few years. Providing technology for news archive construction, management, access, publication, and billing is an important business nowadays.

The information collected from everyday news is huge in volume (e.g. LexisNexis claims to handle over 3.3 billion documents), very loosely organized (e.g. compared to a book library), and grows without a global a priori structure, as news stories add up and evolve unpredictably. This ever-growing corpus of archived news results from the coordinated but largely autonomous work of a team of reporters, whose primary goal is not to build an archive, but to deliver the best possible information product for immediate consumption. Reporters are often assisted by librarians and archive specialists, who help classify, index, and annotate news items as they are sent to the archive, using special-purpose archive management software. In addition, powerful search and navigation mechanisms are needed for information consumers to find their way through the archive.

Current technology typically provides keyword-based search (often by fields: body, headline, section, lead, byline), browsing facilities inside newspaper issues, and, in online newspapers, navigation through static hand-made hyperlinks between news materials (e.g. links to earlier background stories). Aspects that can be improved include: a) the limited expressive power of keyword search; b) weak interrelation between archive items, so that users may need to combine several indirect queries manually before they can get answers to complex queries; c) the lack of a commonly adopted standard representation for sharing archive news across newspapers; d) the lack of internal consensus on content description terminology among reporters and archivers; e) the lack of involvement of reporters in the archiving process.

Neptuno is a joint project that applies Semantic Web technologies to improve the current state of the art in digital news archive management and exploitation. The goal of the project is to develop a high-quality semantic archive for the Diari SEGRE newspaper where a) reporters and archivers have more expressive means to describe and annotate news materials, b) reporters and readers are provided with better search and browsing capabilities than those currently available, and c) the archive system is open to integration in an electronic news marketplace.

The main components of the platform being developed are:
  • An ontology for archive news, based on journalists' and archivers' expertise and practice. The ontology integrates current dominant standards like NewsML and Subject Reference from IPTC, adapted and extended (mappings are defined wherever a direct inclusion is not appropriate).
  • A knowledge base where archive materials are described using the ontology. A DB-to-ontology conversion module will be developed for the automatic integration of existing legacy archive materials into the knowledge base.
  • A semantic search module, where meaningful information needs can be expressed in terms of the ontology, and more accurate answers are supplied.
  • A visualization and navigation module to a) display individual archive items, parts or combinations of items, and groups of items, and b) provide semantic navigation facilities based on automatically inferred links between materials (news threads, paths, dynamic clusters).
  • A personalisation module to improve search and adapt navigation to user profiles.
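
As a rough illustration of how the knowledge base, the DB-to-ontology conversion, and the semantic search components above could fit together, the following minimal sketch converts legacy records into subject-predicate-object triples and answers a query by following the subject hierarchy. Everything in it is an assumption made for illustration: the namespace, the property names (headline, byline, hasSubject), the subject classes, and the sample records are hypothetical and are not taken from the actual Neptuno ontology or archive. It also uses plain Python data structures rather than Jena, which the project itself relies on.

```python
# Hypothetical namespace and vocabulary; NOT the actual Neptuno ontology.
NS = "http://example.org/neptuno#"

# Illustrative subject taxonomy: child class -> parent class.
SUBCLASS_OF = {
    "Football": "Sport",
    "Basketball": "Sport",
    "Sport": "Subject",
    "Politics": "Subject",
}


def row_to_triples(row):
    """Convert one legacy archive record (a dict) into (s, p, o) triples."""
    subject = NS + "news_" + str(row["id"])
    return [
        (subject, "rdf:type", "News"),
        (subject, "headline", row["headline"]),
        (subject, "byline", row["byline"]),
        (subject, "hasSubject", row["subject"]),
    ]


def is_subclass(cls, ancestor):
    """True if cls equals ancestor or descends from it in SUBCLASS_OF."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def search_by_subject(triples, wanted):
    """Return news URIs whose subject is `wanted` or any subclass of it."""
    return sorted(
        s for (s, p, o) in triples
        if p == "hasSubject" and is_subclass(o, wanted)
    )


# Made-up sample records standing in for rows of a legacy archive database.
legacy_rows = [
    {"id": 1, "headline": "Local team wins derby", "byline": "A. Reporter", "subject": "Football"},
    {"id": 2, "headline": "Council approves budget", "byline": "B. Reporter", "subject": "Politics"},
    {"id": 3, "headline": "Playoff final tonight", "byline": "C. Reporter", "subject": "Basketball"},
]

knowledge_base = [t for row in legacy_rows for t in row_to_triples(row)]
sport_news = search_by_subject(knowledge_base, "Sport")
print(sport_news)
```

A query for "Sport" returns the football and basketball items even though neither is annotated with "Sport" directly, which is the kind of answer a plain keyword search on the annotation would miss.
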
The Diari SEGRE reporters were the primary users of the archive exploitation functionalities. Further functionality suitable for the general public was developed as an extension of the project. To ensure archive quality and the success of the system (i.e. reporters willing to use it), it is essential to incorporate the reporters' understanding of and views on the domain into the system. To this end, both archivers and reporters participated in the ontology definition process in order to reach a consensus. The system also supports direct contributions from reporters to the archive building process, with minimal overhead for the user.

The ontology classes and properties have been created with Protégé 2.0 in RDF format. The current version includes 1,330 classes and 44 properties. The application modules have been built using Jena 2.1. The demo has been set up with a subset of the archive comprising the contents produced during August 2003: 3,257 news items and 1,631 photographs. The knowledge base for these contents contains 7,968 instances and 71,454 RDF statements, which take up 11 MB of disk space in RDF/XML format. The Neptuno ontology integrates the IPTC Subject Reference System standard. This part of the Neptuno ontology is used for the classification of news materials, and as an element for search and navigation in the Neptuno semantic newspaper archive. We have prepared an RDF Schema representation of this standard, available for download.

Download IPTC Subject Reference Ontology.
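
To give an idea of what an RDF Schema representation of a subject classification can look like, the sketch below serializes a small subject hierarchy as RDFS classes in N-Triples syntax. It is only an assumption about the general shape of such a representation, not the content of the downloadable file: the namespace is hypothetical, and the codes and labels are placeholders rather than actual IPTC subject codes. Only the RDF and RDFS terms (rdf:type, rdfs:Class, rdfs:label, rdfs:subClassOf) are standard.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
NS = "http://example.org/subject#"  # hypothetical namespace

# (code, label, parent code or None). Placeholder codes, NOT real IPTC codes.
SUBJECTS = [
    ("S01", "sport", None),
    ("S0101", "football", "S01"),
    ("S0102", "basketball", "S01"),
]


def to_ntriples(subjects):
    """Serialize a subject hierarchy as RDFS classes, one N-Triples line each."""
    lines = []
    for code, label, parent in subjects:
        uri = f"<{NS}{code}>"
        # Each subject becomes an RDFS class with a human-readable label.
        lines.append(f"{uri} <{RDF_TYPE}> <{RDFS}Class> .")
        lines.append(f'{uri} <{RDFS}label> "{label}" .')
        # Narrower subjects are declared as subclasses of their parent.
        if parent is not None:
            lines.append(f"{uri} <{RDFS}subClassOf> <{NS}{parent}> .")
    return lines


ntriples = to_ntriples(SUBJECTS)
print("\n".join(ntriples))
```

Modelling narrower subjects with rdfs:subClassOf means that a news item classified under a narrow subject can also be retrieved through any of its broader subjects.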

The project was funded in 2003 by the Spanish Ministry of Science and Technology, Grant FIT-150500-2003-511.


Project publications

P. Castells, M. Fernández, D. Vallet. An Adaptation of the Vector-Space Model for Ontology-Based Information Retrieval. IEEE Transactions on Knowledge and Data Engineering 19(2), February 2007.
Draft version: PDF

P. Castells, F. Perdrix, E. Pulido, M. Rico, J. M. Fuentes, V. R. Benjamins, J. Contreras, E. Piqué, J. Cal, J. Lorés, T. Granollers. Newspaper Archives on the Semantic Web. In Navarro-Prieto, Raquel; Lorés-Vidal, Jesús (Eds.). HCI related papers of Interacción 2004. ISBN: 1-4020-4204-3. Springer Verlag, 2006.

D. Vallet, M. Fernández, P. Castells. An Ontology-Based Information Retrieval Model. 2nd European Semantic Web Conference (ESWC 2005). Heraklion, Greece, May 2005. Springer Verlag Lecture Notes in Computer Science, Vol. 3532. Gómez-Pérez, A.; Euzenat, J. (Eds.), 2005, ISBN: 3-540-26124-9, pp. 455-470.

P. Castells, F. Perdrix, E. Pulido, M. Rico, R. Benjamins, J. Contreras, J. Lorés. Neptuno: Semantic Web Technologies for a Digital Newspaper Archive. 1st European Semantic Web Symposium (ESWS 2004). Heraklion, Greece, May 2004. Springer Verlag Lecture Notes in Computer Science, Vol. 3053. Davies, J.; Fensel, D.; Bussler, C.; Studer, R. (Eds.), 2004, XIII, ISBN: 3-540-21999-4, pp. 445-458.

M. Fernández, D. Vallet, P. Castells. Automatic Annotation and Semantic Search from Protégé. Demo at the 8th International Protégé Conference. July 2005, Madrid, Spain.
Demo abstract: PDF

P. Castells, E. Pulido, C. Carranza, M. Rico, F. Perdrix, E. Piqué, J. Cal, R. Benjamins, J. Contreras, J. Lorés, T. Granollers. Neptuno: tecnologías de la web semántica para una hemeroteca digital. V Congreso en Interacción Persona-Ordenador (Interacción 2004). Lleida, mayo 2004, ISBN: 84-609-1266-3, pp. 306-313.


F. Perdrix, E. Piqué, J. Cal, P. Castells, E. Pulido, M. Rico, J. Lorés, T. Granollers, M. González, J. Badia. Uso de escenarios aplicados a la Ingeniería de los Requisitos para la creación de una hemeroteca digital. V Congreso en Interacción Persona-Ordenador (Interacción 2004). Lleida, mayo 2004, ISBN: 84-609-1266-3, pp. 130-133.
Extended version: PDF

P. Castells and J. A. Macías. Context-Sensitive User Interface Support for Ontology-Based Web Applications. 1st International Semantic Web Conference (ISWC 2002), Poster Session. Sardinia, Italy, June 2002.
PDF (abstract)
PDF (poster)

P. Castells and J. A. Macías. An Adaptive Hypermedia Presentation Modeling System for Custom Knowledge Representations. Proceedings of the World Conference on the WWW and Internet (WebNet 2001). Orlando, Florida, October 2001, pp. 148-153.