612 West 115th Street, New York NY 10025 USA • kermit@kermitproject.org
Frank da Cruz
The Kermit Project
Columbia University
December 2010
Last update: Sun Mar 7 08:35:11 2021
Download: http://kermit.columbia.edu/ftp/scripts/ckermit/ksitemap
Requires: C-Kermit 9.0.
Totally data driven, ksitemap reads a file-list file (or “filelist” for short) containing the names and attributes of the pages and images to be included in the sitemap. The filelist file is kept in the web directory itself, but it need not be world readable.
The ksitemap script should work under any Unix operating system (Linux, Mac OS X, NetBSD, Solaris, etc.) that has C-Kermit 9.0 installed, although the top line, which gives the pathname of the C-Kermit executable, might need to be changed. In Unix the ksitemap script must, of course, also be given execute permission (chmod +x). Ksitemap has not yet been tested in VMS.
If you give a directory name without a filename, 'filelist' is used as the filename.
$ ksitemap /www/filelist       (absolute)
$ ksitemap ~/web/filelist      (symbolic)
$ ksitemap web/filelist        (relative)
$ ksitemap ../web/filelist     (relative)
$ ksitemap /www/               (absolute directory, no filename)
$ ksitemap                     (no argument, see just below)
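If you invoke ksitemap without a command-line argument, it consults the KSITEMAPDIR environment variable. If that variable names your web directory, for example: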
export KSITEMAPDIR=/net/w/0/htdocs/username/web/
and the name of the file-list file is 'filelist', then you can run ksitemap from any directory any time without any command-line argument.
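The lookup rules above can be sketched in a few lines of Python. This is an illustration only, not the script's actual code: the function name resolve_filelist is hypothetical, and falling back to the current directory when KSITEMAPDIR is unset is an assumption that the text above does not confirm.

```python
import os

def resolve_filelist(arg=None, env=None):
    """Return the path of the file list a ksitemap-style tool would read.

    arg is the optional command-line argument; env is a mapping of
    environment variables (defaults to the real environment).
    """
    env = os.environ if env is None else env
    if arg is None:
        # No argument: use KSITEMAPDIR. Falling back to the current
        # directory when it is unset is an ASSUMPTION of this sketch.
        arg = env.get("KSITEMAPDIR", "./")
    arg = os.path.expanduser(arg)  # expands a leading "~"
    if arg.endswith("/") or os.path.isdir(arg):
        # Directory without a filename: 'filelist' is the default name.
        return os.path.join(arg, "filelist")
    return arg
```

For example, resolve_filelist("/www/") and resolve_filelist("/www/filelist") both name the same file.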
To invoke for debugging and testing, do:
$ DEBUG=1 ksitemap args
This gives progress messages and it writes the sitemap.xml file in a "tmp" directory.
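The file list can contain comment lines, which begin with a number sign (#) and are ignored:
# This is a comment line
It can also contain blank lines, which are likewise ignored. Nonblank, non-comment lines are in this format: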
tag=value
An equal sign (=) separates the tag from the value. Any whitespace (blanks or tabs) before or after the equal sign is ignored. The following three lines have identical effect:
home=http://www.xyzcorp.com/
home = http://www.xyzcorp.com/
home= http://www.xyzcorp.com/
If you need to include an equal sign in the value itself, surround the value with ASCII doublequotes. If you want the value itself to be enclosed in doublequotes, put three of them on each end (see the section on programming considerations for an explanation). Examples:
cap=View from the Empire State Building looking East
cap="A+B=C"
cap="""Caption within doublequotes"""
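The splitting and quoting rules above behave much like CSV parsing with '=' as the field separator. The following Python sketch uses the standard csv module as a stand-in for whatever ksitemap does internally; the function name parse_line is hypothetical, and corner cases may differ from the real script.

```python
import csv

def parse_line(line):
    """Split one filelist line into (tag, values).

    Returns None for blank lines and for comment lines starting with '#'.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # '=' is the separator; a field in doublequotes may contain '=',
    # and a doubled doublequote inside it stands for one literal quote.
    fields = next(csv.reader([stripped], delimiter="=", skipinitialspace=True))
    tag = fields[0].strip()
    values = [field.strip() for field in fields[1:]]
    return tag, values

print(parse_line('home = http://www.xyzcorp.com/'))
print(parse_line('cap="A+B=C"'))
print(parse_line('cap="""Caption within doublequotes"""'))
```

Treating the tripled doublequotes as CSV quoting explains why three are needed on each end: the outer pair delimits the field, and each inner doubled pair yields one literal doublequote.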
The first few lines define parameters for the whole website:
Tag | Status | Value
encoding | Depends | sitemap.xml files are encoded in UTF-8. If your filelist file is encoded in some other character set (such as ISO-8859-1) for the purpose of including non-ASCII characters (such as accented letters or non-Roman letters), you must declare its encoding so ksitemap can convert the text to UTF-8. If your file-list file is ASCII, or it is already UTF-8, this item is optional. Otherwise this item is required, and it should come first, so ksitemap can convert all the lines in the file appropriately. The value is the MIME name of the character set used in the file-list file. For a list of supported encodings, see this page.
home | Required | The URL of the website's home directory (with no filename part)
geo | Optional | The default geographical location for images, if any
lic | Optional | The default filename, if any, for a page containing copyright or license information for the site's original images
.macroname | Optional | Definition for macro with given name
These items should come before any of the page-specific items that are described below. If you include a geo or lic tag before any url tag (see below), these will be used for any image for which you do not specify a geo or lic tag. In other words the ones in the top section are global and the ones in an img section are local to that image.
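The effect of the encoding item can be pictured as a decode followed by a re-encode. The sketch below is only an illustration in Python: both function names are hypothetical, and it assumes the declared character set is ASCII-compatible so that the encoding line itself can be found before any conversion takes place.

```python
def declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Find the value of the first 'encoding' line, or return the default."""
    # Latin-1 maps every byte to a character, so this decode cannot fail;
    # it is used only to locate the (ASCII) encoding line.
    for line in raw.decode("latin-1").splitlines():
        tag, sep, value = line.partition("=")
        if sep and tag.strip() == "encoding":
            return value.strip()
    return default

def filelist_to_utf8(raw: bytes) -> bytes:
    """Convert file-list bytes from their declared encoding to UTF-8."""
    return raw.decode(declared_encoding(raw)).encode("utf-8")
```

A file list that is already ASCII or UTF-8 passes through unchanged, which matches the statement that the item is optional in those cases.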
The "home" line's value is the URL of the website root directory, ending with slash, for example:
home=http://kermit.columbia.edu/
This is used to form the full URLs of the files and images in the website. Example:
home=http://kermit.columbia.edu/
lic=copyright.html
This results in the URL of the license file being:
http://kermit.columbia.edu/copyright.html
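As a rough Python sketch of these two ideas (joining a name to the home URL, and falling back from an image's own geo or lic value to the site-wide one), with hypothetical function names and plain dictionaries standing in for the parsed file list:

```python
def full_url(home: str, name: str) -> str:
    """Join the site root URL and a filename relative to it."""
    # The home value is documented as ending with a slash; normalizing
    # here just makes the sketch tolerant of a missing one.
    return home.rstrip("/") + "/" + name.lstrip("/")

def image_attr(tag: str, image: dict, site: dict):
    """Return the image's own value for tag, else the site default, else None."""
    return image.get(tag, site.get(tag))

site = {"home": "http://kermit.columbia.edu/", "lic": "copyright.html"}
print(full_url(site["home"], image_attr("lic", {}, site)))
```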
Macros allow you to use variables in value strings. For example, given:
.year=2010
Then any occurrence of \m(year) in a value string is replaced by 2010.
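Macro substitution of this kind can be imitated in Python with a regular expression. This is a sketch only: the function name expand_macros is hypothetical, and leaving undefined macro names untouched is an assumption, since the text does not say what ksitemap does with them.

```python
import re

def expand_macros(value: str, macros: dict) -> str:
    """Replace each \\m(name) in value with the definition of that macro."""
    def substitute(match):
        # Undefined names are left as-is (an ASSUMPTION of this sketch).
        return macros.get(match.group(1), match.group(0))
    return re.sub(r"\\m\((\w+)\)", substitute, value)

macros = {"year": "2010"}               # from the line ".year=2010"
print(expand_macros(r"cal-\m(year).html", macros))
```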
The remainder of the file list contains lines for each file and image you want to include in your sitemap. For each page, the lines should appear in the following order:
Tag | Status | Value
url | Required | Name of an html file relative to the website's root directory.
pri | Optional | Priority of the page, 0.0 to 1.0
For each URL, the page date is supplied automatically from the modification date of the file, and the change frequency (daily, weekly, monthly, yearly) is chosen according to how recently the file was last modified.
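A Python sketch of how such values could be derived follows. The age thresholds are hypothetical, since the text does not state the cutoffs ksitemap uses, and the date is formatted in UTC here purely to keep the example deterministic.

```python
import datetime

def lastmod(mtime: float) -> str:
    """Format a modification timestamp as a YYYY-MM-DD date (in UTC)."""
    stamp = datetime.datetime.fromtimestamp(mtime, datetime.timezone.utc)
    return stamp.strftime("%Y-%m-%d")

def changefreq(age_days: float) -> str:
    """Map the age of a file to a change frequency.

    The 7/30/365-day cutoffs are ASSUMPTIONS made for illustration.
    """
    if age_days < 7:
        return "daily"
    if age_days < 30:
        return "weekly"
    if age_days < 365:
        return "monthly"
    return "yearly"
```

In practice the timestamp would come from a call such as os.path.getmtime(path), and the age would be the difference between the current time and that timestamp, divided by 86400.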
For redirects, a URL entry can have two values; for example:
url=index.html=index-en.html
This means that the first filename is an HTTP Redirect to the second
filename; that is, the first name is a pointer to a file having the second
name. For example, suppose you have a website with calendars for different
years:
url=cal.html=cal-2011.html
If you have a lot of files using this naming convention, you can use a macro so the variable string can be defined (and changed) in just one place instead of lots of places:
.year=2011
url=cal.html=cal-\m(year).html
url=jan.html=jan-\m(year).html
url=feb.html=feb-\m(year).html
etc...
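Reading a two-valued url entry amounts to splitting its value once more on the equal sign. A minimal Python sketch, with the hypothetical name split_url_value, applied to a value whose macros have already been expanded:

```python
def split_url_value(value: str):
    """Return (name, target) for a url value.

    target is None when the entry is an ordinary page, not a redirect.
    """
    name, sep, target = value.partition("=")
    return name.strip(), (target.strip() if sep else None)

print(split_url_value("cal.html=cal-2011.html"))
print(split_url_value("cable.html"))
```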
If there are images on the page that you want to include in the sitemap:
Tag | Status | Value
img | Required | Name of an image file in the root directory or in a subdirectory.
cap | Optional | A text caption for the image
title | Optional | A text title for the image
geo | Optional | The geographical localization of this image only
lic | Optional | The URL of a license page for this image only
Here's a brief example that has three files. For the first file (index.html), a priority is specified; for the others, the default priority is accepted. The second file is in a subdirectory. The third file has images. Comments, blank lines, and indentation are used for clarity, but they do not affect the result. Note that there may be, but need not be, whitespace around the equal sign.
# ksitemap filelist for building sitemap.xml

encoding = UTF-8
home=http://kermit.columbia.edu/
geo=New York City USA
lic=copyright.html

url=index.html
  pri=1.0

url=cudocs/ilosetup.html

url=cable.html
  img=connectors-340.jpg
    cap=Male and Female RS-232 Connectors
    title=Serial Data Connectors
  img=modemcable.jpg
    cap=Modem Cable Schematic
    geo=Bedford MA
  img=nullmodem-480.jpg
    cap=Null Modem Cable Schematic
    lic=special.html
    geo=Batey Caño - Yamasá
The resulting sitemap.xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://kermit.columbia.edu/</loc>
    <lastmod>2010-12-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://kermit.columbia.edu/cudocs/ilosetup.html</loc>
    <lastmod>2010-12-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://kermit.columbia.edu/cable.html</loc>
    <lastmod>2010-12-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
    <image:image>
      <image:loc>http://kermit.columbia.edu/connectors-340.jpg</image:loc>
      <image:caption>Male and Female RS-232 Connectors</image:caption>
      <image:title>Serial Data Connectors</image:title>
      <image:geo_location>New York City USA</image:geo_location>
      <image:license>http://kermit.columbia.edu/copyright.html</image:license>
    </image:image>
    <image:image>
      <image:loc>http://kermit.columbia.edu/modemcable.jpg</image:loc>
      <image:caption>Modem Cable Schematic</image:caption>
      <image:geo_location>Bedford MA</image:geo_location>
      <image:license>http://kermit.columbia.edu/copyright.html</image:license>
    </image:image>
    <image:image>
      <image:loc>http://kermit.columbia.edu/nullmodem-480.jpg</image:loc>
      <image:caption>Null Modem Cable Schematic</image:caption>
      <image:geo_location>Batey Caño - Yamasá</image:geo_location>
      <image:license>http://kermit.columbia.edu/special.html</image:license>
    </image:image>
  </url>
</urlset>
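For readers who want to see how output of this shape can be produced, here is a minimal Python sketch using the standard xml.etree.ElementTree module. It is not how ksitemap itself (a Kermit script) writes the file; the function name build_sitemap and the dictionary layout of its input are hypothetical, and the element order simply follows the example above.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG = "http://www.google.com/schemas/sitemap-image/1.1"

def build_sitemap(pages):
    """Serialize a list of page dictionaries as a sitemap XML string."""
    ET.register_namespace("", SM)        # default namespace, no prefix
    ET.register_namespace("image", IMG)  # image: prefix
    urlset = ET.Element("{%s}urlset" % SM)
    for page in pages:
        url = ET.SubElement(urlset, "{%s}url" % SM)
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            ET.SubElement(url, "{%s}%s" % (SM, tag)).text = page[tag]
        for image in page.get("images", []):
            node = ET.SubElement(url, "{%s}image" % IMG)
            # Optional image fields are written only when present.
            for tag in ("loc", "caption", "title", "geo_location", "license"):
                if tag in image:
                    ET.SubElement(node, "{%s}%s" % (IMG, tag)).text = image[tag]
    return ET.tostring(urlset, encoding="unicode")

page = {
    "loc": "http://kermit.columbia.edu/cable.html",
    "lastmod": "2010-12-07", "changefreq": "daily", "priority": "0.5",
    "images": [{"loc": "http://kermit.columbia.edu/modemcable.jpg",
                "caption": "Modem Cable Schematic",
                "geo_location": "Bedford MA"}],
}
print(build_sitemap([page]))
```

The returned string has no XML declaration or indentation; a caller would add the declaration line when writing the result to sitemap.xml.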
The following statement uses \fsplit() to split a filelist line into two pieces, the tag and the value:
.\%9 := \fsplit(\m(line),&x,=,CSV) # Split line on '='
Another observation about
A more serious problem was noted when adding the macro capability to
ksitemap, namely that
Finally it should be noted that ksitemap takes pains to expand macros only
after verifying that a line contains “
ksitemap / Kermit sitemap script / The Kermit Project / Columbia University / December 2010 / Updated 7 March 2021