Bilingual Website Search-String Tabulator

C-Kermit 9.0 Book: Using C-Kermit Download the Weblog script

C-Kermit 9.0: Bilingual Website Search-String Tabulator

Frank da Cruz
fdc@columbia.edu
5 April 2010 (updated 27 July 2013)

This page presents a short script that illustrates new features in C-Kermit 9.0 (added in Alpha.05) that provide script-level access to some of Kermit's character-set features.

Suppose you have a bilingual website in English and Spanish and you want to produce a report from the Web log showing (say) the 20 the most popular Google searches. This is an interesting problem because:

Search strings might be in Spanish or English or a mixture of the two.
The accented form of Spanish letters might or might not have been entered by the user; for example, "español" might have been spelled "espanol".
The accented letters of Spanish might be encoded in different character sets, such as ISO-8859-1, UTF-8, or some Windows code page such as CP 1252
The character set of the search string might or might not be identified.
The search string might be lowercase, uppercase, or any mixture.
Google generally substitutes plus signs for any spaces in the search string.
Since the search string itself is part of the URL, it is further encoded according to RFC1738, so it won't contain control, reserved, or non-ASCII characters. In this encoding, each such character becomes a percent sign (%) followed by two hexadecimal digits. And then for some reason, the %-sign in the hex escape is doubled in the web log.

For example, the search “amazon en español” might show up in the log as:

http://www.google.com/search?q=amazon+en+espa%%C3%%B1ol &ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:es-ES:official&client=firefox-a

The search string exhibits hex encoding, and the clause “&ie=utf-8” identifies the character set. In another example, the search arrives in ISO-8859-1 Latin Alphabet 1:

http://www.google.com/search?hl=es&client=ms-rim&channel=browser &ie=ISO-8859-1&q=color+Casta%%F1o+en+ingl%%E9s&btnG=Buscar&lr=

The web log has one line per website access. Each line consists of nine Tab-separated fields (\9 is Tab), of which the 8th is the referring URL and the 5th is the name of the file that was served. Here is a skeleton program that reads the file and selects entries in which HTML files were served in response to a Google search:

fopen /read filename \%i                    # Open the input file
if fail exit 1                              # Exit if file wasn't opened.

while true {
    fread /line \%i line                    # Read line and trim blanks
    if fail break                           # Quit loop on end of file
    void \fsplit(\m(line),&a,\9,ALL,,1)     # Split line into fields
    if not \findex(.html,\&a[5]) continue   # Skip record if file not HTML
    .google := \findex(.google.,\&a[8])     # Did access come from Google?
    if not google continue                  # If not skip this record
}

(The only reason we're filtering out non-Google searches is to keep the example simple; other search engines have different syntax.) Here we see an important new C-Kermit 9.0 feature:

The \fsplit() function uses the new symbolic Include Set, ALL. This tells Kermit to split the line into its Tab-separated components using Tab and only Tab as the break character (this would have been rather difficult in previous C-Kermit releases, especially if the line contained 8-bit characters). The flag value of 1 says not to collapse adjacent tabs in case there are any blank fields. This is the procedure you would use to read any Tab-Separated-Value (TSV) file with C-Kermit.

Now let's see how to process each record, which has been split into the array \&a[]:

void \fsplit(\&a[8],&b,&?,ALL,,1)           # Split Google URL into clauses
undef charset string                        # Clear the result variables
for i 1 \fdim(&b) 1 {                       # Look for &q= and &ie=
    void \fsplit(\&b[i],&c,=,ALL)           # Split this clause on "="
    if equ "\&c[1]" "q" .string := \&c[2]        # If 'q' set query string
    else if equ "\&c[1]" "ie" .charset := \&c[2] # If 'ie' set charset
}
if not def string continue                  # If no query string skip this record
.string := \fsqueeze(\flower(\fdecodehex(\freplace(\m(string),+,\32))))

So first we split the record into its fields, one of which is the URL. Then we split the URL into its "clauses". Then we split each clause into its ID and its value. If we have a query (q) clause its value is the (encoded) search string. If we have an Input Encoding (ie) clause, its value is the MIME character-set name. Once we have the search string, we do:

.string := \fsqueeze(\flower(\fdecodehex(\freplace(\m(string),+,\32))))

The inner function \freplace() simply replaces all plus signs with blanks. The new \fdecodehex() function undoes the hex encoding, so that %%E9, %%F1, etc, are replaced by the bytes they represent. Then the function flower() lowercases the result, and finally the new \fsqueeze() function converts all runs of internal whitespace to single blanks.

\freplace(\m(string),+,\32) =   this is a string
\fdecodehex(one%%2C%%20two%%2C%%20three)    =   one, two, three
\flower(This Is a STRING) =   this is a string
\fsqueeze(this   is   a   string) =   this is a string

The \fdecodehex() function assumes that the hex escape sequence is %% but you can specify a different one as a second argument (for example, '=' for email in Quoted-Printable format):

\fdecodehex(one=2C=20two=2C=20three,=) = one, two, three

Now comes the interesting part:

if def charset {
    _increment cset<\fupper(\m(charset))>
    if debug echo "CONVERTING [\m(string)] \m(charset)->ISO-8859-1"
    .string := \fcvtcset(\m(string),\m(charset),iso-8859-1)
    if debug show mac string
} else if equal "\fstringtype(\m(string))" "UTF8" {
    .string := \fcvtcset(\m(string),UTF-8,iso-8859-1)
    if debug show mac string
    _increment cset<UNDECLARED_UTF-8_DETECTED>
} else {
    _increment cset<UNDECLARED>
}

The object of this program is to normalize the search strings so those that differ only in character set or alphabetic case or spacing can be counted together. In this example we're going to convert all the search strings to ISO 8859-1. The first clause above says "If the character set was declared in the URL, then convert the string from the declared character set to ISO-8859-1." The conversion is done with the new fcvtcset() function: given a string and two character-set names, it converts the string from the first set to the second (if the two are the same, it does nothing).

Now, in order for this to be driven by the character-set names used in Google's &ie= clause, Kermit had to become aware of the MIME character-set designators. Kermit has always had its own names for character sets, which preceded MIME's; e.g. LATIN1 for ISO Latin Alphabet 1. MIME has different names, e.g. ISO-8859-1 for Latin Alphabet 1. As of Alpha.05, C-Kermit 9.0 understands the MIME names as well, thus you can feed either Kermit names or MIME names to any of Kermit's commands or functions that take character-set names. A new equivalence table is available HERE.

We're not done yet: the majority of Google URLs are missing the character-set ID, yet many of them still contain UTF8 data, which would need to be converted to have a consistent set of strings. For this, C-Kermit 9.0 Alpha.05 has another new function, \fstringtype(), seen in the second clause above:

} else if equal "\fstringtype(\m(string))" "UTF8" {

This function can tell whether its argument is 7-bit text, 8-bit text, UTF-8, or binary (not text at all). Luckily, the format of UTF-8 is just about unmistakable. When the IF statement is true, we convert the untagged UTF-8 to Latin-1. If the string is not UTF-8, we leave it alone because there is no reliable way to know that the character set is, and in any case, it's mostly likely Latin-1 or CP1252, which are compatible as far as Spanish is concerned.

Finally, a couple more transformations:

.string := \fsubstitute(\m(string),ÁÉÍÓÚÜÑ,áéíóúüñ)
.string := \freplace(\m(string),espanol,español)

The first one lowercases any uppercase accented Spanish letters, since \flower() presently handles only A-Z, unaccented. Second, we replace the word "espanol" with "español", because this is a very common case and it can be done safely.

UPDATE: (July 2013) In C-Kermit 9.0.304 (as of development release Dev.06), the \fsubstitute() step is no longer necessary. The \flower() and \fupper() functions work as expected with in this case, using the internationalization features (locale) of the underlying operating system as explained in this page.

Statistics

Normalizing the search strings was the hard part; the statistics are done by accumulating totals in the associative arrays search<> and cset<>. Associative arrays are described in the C-Kermit 7.0 release notes. Each time we have a normalized string and its character set, we increment its associative array element:

_increment cset<\fupper(\m(charset))>
_increment search<\m(string)>

The _increment command is like the increment command (no underscore) except it computes the name of the variable before incrementing it.

To dump out the results, we convert the associative arrays to regular arrays, sort them in the desired order, and display them as follows:

echo
echo "Records:  \flpad(\m(lines),8)"
echo "Google:   \flpad(\m(google),8)"
.n := \faaconvert(search,&a,&b)
echo "Unique:   \flpad(\m(n),8)"
.m := \faaconvert(cset,&c,&d)
echo "Charsets: \flpad(\m(m),8)"
echo
echo Charsets by by frequency...
array sort /reverse /numeric &d &c
for i 1 m 1 {
    echo \flpad(\m(i),3). \flpad([\&d[i]],8) \&c[i]
}
if > n 20 .n = 20
echo
echo Top \m(n) searches by frequency...
array sort /reverse /numeric &b &a
for i 1 n 1 {
    echo \flpad(\m(i),3). \flpad([\&b[i]],8) \&a[i]
}

The output looks like this:

Records:     28801
Google:      11442
Unique:       5546
Charsets:        6

Charsets by frequency...
  1.   [7646] UNDECLARED
  2.   [1962] UTF-8
  3.   [1512] UNDECLARED_UTF-8_DETECTED
  4.    [116] ISO-8859-1
  5.     [20] UTF8
  6.      [1] GB2312

Top 20 searches by frequency...
  1.    [649] amazon español
  2.    [647] amazon
  3.    [418] amazon.com
  4.    [185] dia de la madre 2010
  5.    [117] dia de la madre 2010 colombia
  6.    [109] www.amazon.com
  7.    [106] dia de la madre
  8.     [95] cuando es el dia de la madre
  9.     [88] amazon.com español
 10.     [75] amazon en español
 11.     [69] dominican hair products
 12.     [65] www.amazon.com en español
 13.     [64] dia international de la mujer
 14.     [46] dia del maestro
 15.     [46] nombres de paises en inglés
 16.     [44] dominican hair products for hair growth
 17.     [42] zapatos altos para dama
 18.     [42] amazon español venezuela
 19.     [37] amazon.es
 20.     [37] dia de la madre 2010 chile

Without the normalization, there would be many variations of each string and each variation would be counted separately.

Download the weblog script Download C-Kermit 9.0 Character Set Names

`\freplace(\m(string),+,\32)`	=	`this is a string`
`\fdecodehex(one%%2C%%20two%%2C%%20three)`	=	`one, two, three`
`\flower(This Is a STRING)`	=	`this is a string`
`\fsqueeze(this is a string)`	=	`this is a string`