Richard Cyganiak
Boris Villazon-Terrazas

The Sitemap Protocol is an easy way for webmasters to inform search engines about the pages on their sites that are available for crawling. It is supported by major search engines such as Google, as well as by data-focused search engines like Sindice. Sitemap4rdf is a command-line tool that generates sitemap.xml files for websites that publish Linked Data from a SPARQL endpoint.

Download sitemap4rdf

v0.2.1 (alpha), released 2010-08-27

About sitemap4rdf

Linked Data is a method of publishing machine-readable data on the Web using the RDF technology stack. Data search engines and other data consumers often want a local copy of such datasets for performance reasons. This often requires crawling the entire site, which is an expensive, fragile, and unpredictable process.

Website crawling can be made more efficient and predictable by using the Sitemap Protocol, originally developed by Google and now supported by all major search engines, as well as data search engines such as Sindice. It consists of a sitemap.xml file that is usually placed in the website root directory and contains a list of all the URLs to be crawled.
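To make the file format concrete, here is a minimal Python sketch that builds a sitemap.xml document from a list of URLs. It is not part of sitemap4rdf and does not claim to reproduce its output exactly; the function name `build_sitemap` is hypothetical, and only the element names and namespace come from the Sitemap Protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap Protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, lastmod=None, changefreq=None):
    """Return a sitemap.xml document (as a string) listing the given URLs.

    Hypothetical helper for illustration; lastmod and changefreq are
    applied to every entry when given.
    """
    # Serialize the Sitemap namespace as the default namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        if lastmod:
            ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
        if changefreq:
            ET.SubElement(entry, f"{{{SITEMAP_NS}}}changefreq").text = changefreq
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap(["http://example.com/resource/a"], changefreq="monthly"))
```

Each URL becomes one `url` entry with a `loc` child; `lastmod` and `changefreq` are optional in the protocol.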

Sitemap4rdf is a command-line tool that generates sitemap.xml files for Linked Data sites that have a SPARQL endpoint. Sitemap4rdf queries the endpoint to retrieve a list of all URLs, and generates the sitemap.xml, which then must be uploaded to the site.

Features include support for Sitemap compression, and support for Sitemap splitting and index files for large sites.
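As a rough illustration of what splitting and compression involve, the Python sketch below chunks a URL list, builds a Sitemap index, and gzips a document. It is an assumption-laden sketch, not sitemap4rdf's implementation: the helper names and the `sitemapN.xml.gz` file names are hypothetical, and the 50,000-URL limit is the per-file maximum from the Sitemap Protocol, which may differ from the threshold the tool actually uses.

```python
import gzip
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-file URL limit defined by the Sitemap Protocol


def split_urls(urls, chunk_size=MAX_URLS):
    """Split a list of URLs into chunks of at most chunk_size entries."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


def build_index(siteroot, filenames):
    """Return a Sitemap index document pointing at each Sitemap file."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<sitemapindex xmlns="{SITEMAP_NS}">',
    ]
    for name in filenames:
        lines.append(f"  <sitemap><loc>{escape(siteroot + name)}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


def compress(xml_text):
    """Gzip-compress a Sitemap document for upload."""
    return gzip.compress(xml_text.encode("utf-8"))


if __name__ == "__main__":
    urls = [f"http://example.com/resource/{i}" for i in range(5)]
    chunks = split_urls(urls, chunk_size=2)
    # Hypothetical file names, one per chunk.
    names = [f"sitemap{i}.xml.gz" for i in range(1, len(chunks) + 1)]
    print(build_index("http://example.com/", names))
```

The index file is what crawlers fetch first; it only lists the locations of the individual Sitemap files.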

Quick start

You need:

What to do:

  1. Download and extract the archive to a suitable location.

  2. Run sitemap4rdf from the command line, specifying your SPARQL endpoint and the prefix of the URLs to include in the Sitemap:

    sitemap4rdf http://example.com/sparql http://example.com/resource/

    (Use ./sitemap4rdf on Linux or OS X.) This generates one or more sitemap*.xml files in the current directory.

  3. Optionally, study the documentation below for further configuration options, or put all your configuration into a configuration file.

  4. Upload the generated Sitemap files to your website. They should be in the root directory, e.g., http://example.com/sitemap.xml.

  5. Optionally, link the Sitemap files in robots.txt. This will ensure that compatible web crawlers will discover your Sitemap automatically.

    If your site doesn't yet have a robots.txt file in the root directory, create it. Then add the following line:

    Sitemap: http://yoursite.com/sitemap.xml

    Or, for large sites where sitemap4rdf splits the Sitemap into multiple files:

    Sitemap: http://yoursite.com/sitemap_index.xml

  6. Submit your Sitemap to search engines. For further information, see Sitemap submission for Google and Sindice's Sitemap submission form.
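After editing robots.txt in step 5, it can be useful to check that a crawler would actually find the Sitemap line. The sketch below uses Python's standard `urllib.robotparser` module (its `site_maps()` method requires Python 3.8 or later); the function name `sitemaps_in_robots` is hypothetical, and the check works on the file contents rather than fetching them from your site.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_in_robots(robots_txt):
    """Return the Sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow:\nSitemap: http://yoursite.com/sitemap.xml\n"
    print(sitemaps_in_robots(sample))
```

An empty result means the Sitemap line is missing or malformed.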

Configuration options

Run the sitemap4rdf command without parameters for a full list of configuration options.

Explanations of the individual options can be found in the example configuration file in the next section.

Working with configuration files

All arguments and parameters can be set in a configuration file, which is then passed to the tool like this:

sitemap4rdf --config config.xml

Here is an example configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<Sitemap4rdf sparqlEndpoint="http://geo.linkeddata.es/sparql" uriPrefix="http://geo.linkeddata.es/">
    <!-- sparqlEndpoint is a SPARQL endpoint URL.
         uriPrefix is the common URL prefix shared by all URLs on the site; only matching
         URLs will be included in the Sitemap -->

    <!-- The date of last modification of the site. This date should be in W3C Datetime format. 
         This format allows you to omit the time portion, if desired, and use YYYY-MM-DD. -->
    <Param name="lastmod" value="2010-08-08"/>    

    <!-- How frequently the page is likely to change. This value provides general information to search engines 
         and may not correlate exactly to how often they crawl the page. Possible values:
         always, hourly, daily, weekly, monthly, yearly, never -->
    <Param name="changefreq" value="monthly"/> 

    <!-- The base location of the Sitemap files, needed when a sitemap_index.xml is being created -->
    <Param name="siteroot" value="http://geo.linkeddata.es/"/>    

    <!-- output directory -->
    <Param name="outputdir" value="/home/ev/"/> 

    <!-- Specifies a regular expression; any URL not matching it will not be included -->
    <Param name="exclude" value="Murcia"/>    

    <!-- Whether to gzip-compress the Sitemap file -->
    <Param name="gzip" value="no"/> 

</Sitemap4rdf>
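To sanity-check a configuration file before running the tool, a short script can read it back into a dictionary. The following Python sketch assumes only the structure shown in the example above (attributes on the root element plus `Param` elements with `name` and `value`); `load_config` is a hypothetical helper, not part of sitemap4rdf, and the tool itself may validate differently.

```python
import xml.etree.ElementTree as ET

# Values listed for changefreq in the example configuration file.
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def load_config(xml_bytes):
    """Parse a sitemap4rdf-style configuration into a flat dictionary."""
    root = ET.fromstring(xml_bytes)
    # Root attributes: sparqlEndpoint and uriPrefix.
    config = dict(root.attrib)
    # Each <Param name="..." value="..."/> becomes one dictionary entry.
    for param in root.findall("Param"):
        config[param.get("name")] = param.get("value")
    changefreq = config.get("changefreq")
    if changefreq is not None and changefreq not in CHANGEFREQ_VALUES:
        raise ValueError(f"unexpected changefreq: {changefreq!r}")
    return config


if __name__ == "__main__":
    with open("config.xml", "rb") as handle:
        print(load_config(handle.read()))
```

Reading the file as bytes lets the XML parser honor the encoding declared in the first line.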

SPARQL Query

The SPARQL query that is actually run against the endpoint:

SELECT DISTINCT ?n
WHERE { ?n a [] . 
	FILTER (REGEX(STR(?n), "http://geo.linkeddata.es/" )) .  
} 

Here, http://geo.linkeddata.es/ is the uriPrefix from the configuration.
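To preview which URLs the query would return for your own endpoint, you can run it yourself. The Python sketch below builds the same query for a given prefix and reads the results; it is an illustration rather than the tool's code. It assumes the endpoint accepts the query as a GET parameter and can return the standard SPARQL JSON results format, and the function names are hypothetical. The prefix is inserted into the query without escaping, so it should not contain double quotes or backslashes.

```python
import json
import urllib.parse
import urllib.request


def build_query(uri_prefix):
    """Return the URL-listing query for the given URI prefix."""
    return (
        "SELECT DISTINCT ?n\n"
        "WHERE { ?n a [] .\n"
        '  FILTER (REGEX(STR(?n), "%s")) .\n'
        "}" % uri_prefix
    )


def parse_results(results):
    """Extract the ?n values from a SPARQL JSON results document."""
    return [binding["n"]["value"] for binding in results["results"]["bindings"]]


def fetch_uris(endpoint, uri_prefix):
    """Run the query against the endpoint (assumes GET and JSON support)."""
    params = urllib.parse.urlencode({"query": build_query(uri_prefix)})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_results(json.load(response))


if __name__ == "__main__":
    for uri in fetch_uris("http://example.com/sparql", "http://example.com/resource/"):
        print(uri)
```

Because the filter is a regular expression match rather than a strict prefix test, URLs that merely contain the prefix string would also be returned.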

Support and feedback

You can contact the authors via email:
Richard Cyganiak
Boris Villazon-Terrazas

License, source code and development

License: This tool is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

Source Code: The latest source code is available from the project's SVN repository.
