Caucus Statistics and the Events table Last revised: 22 July 2003 I. Introduction
Caucus 4.5 adds the capability to record user "events" in a relational database ("RDBMS") such as MySQL or Oracle. The major purpose of recording these events is to be able to extract statistical information about Caucus usage, e.g. the length of users' sessions, number of users on at different times of the day, average size of responses, average time to write responses... and so on.By using an RDBMS to log such events, we store a wealth of information that can be selectively extracted and analyzed later, without knowing up front what all the intended or possible uses are.
See Installing MySQL on Linux and Installing MySQL on Solaris for information about using MySQL with Caucus.
II. Licensing
The Caucus events module was funded by a private company (not CaucusCare), and must currently be licensed separately (on top of) any existing Caucus 4.x license. Contact info@caucuscare.com for more information and pricing. It is intended, once the investment in this module has been recouped, that it will thereafter become part of the standard Caucus distribution.III. "Events" log table
Each Caucus "event" is stored as a row in a table called Events. The rows are defined below. The columns Event, Userid, Sessionid, and Time will always have a value (are not null). Any of the other columns may be null (may not have a value), depending on the type of event.
Column Data type Meaning event varchar(12) Event name userid varchar(80) Caucus user's userid sessionid varchar(20) Unique identifier of a single user session time timestamp Date/time of the event duration int Duration in seconds of the event (e.g. 'addresp' duration is time in seconds from viewing item until clicking "post" size int Size in bytes (e.g. 'addresp' is size of new response) conf int Conference number item int Item number r0 int Response number, or first response in a range rn int Last response number in a range iface varchar(40) Caucus interface name (e.g. "CC44") object varchar(80) Object of event, e.g. user creation is userid of new user s1 varchar(80) Arbitrary string #1 s2 varchar(80) Arbitrary string #2 s3 varchar(80) Arbitrary string #3 n1 int Arbitrary number #1 n2 int Arbitrary number #2 n3 int Arbitrary number #3 IV. $logevent() function
Events are logged in Caucus by calling the $logevent() function from within a CML page. The syntax of the $logevent() call is:eval $logevent (event iface duration size conf item r0 rn object s1 s2 s3 n1 n2 n3)
Only the event argument is required. All the rest are optional, although any given argument requires that there be some value for all of the previous arguments. E.g. if you wish to supply a value for r0, you must supply values for event, iface, duration, size, conf, and item.The Caucus 4.5 distribution includes a default set of events, created by calls to $logevent() in the CML files. The list of event names (event column) is not, however, restricted to those supplied by CaucusCare. Local Caucus administrators may add as many additional events as they like, in whichever CML files they choose. If you do create new events that you find useful, please let us know about them!
V. Standard Events
Here are the standard events, as logged by the standard Caucus CML files. All events record columns Event, Userid, Sessionid, and Time. Except in very rare cases, all events will also record Iface.
Name Description login Caucus user logs in. logout Caucus user logs out: clicks on quit, or session timesout, or swebs process receives a SIGHUP signal. Except in very rare circumstances (such as a sudden host shutdown), every login will have a matching logout. A login will never have more than one matching logout. Records Duration as the # of seconds since matching login. If the logout occurs due to a timeout or SIGHUP, the duration is the time from 'login' until the last logged event before the logout (otherwise the session duration would be artificially high). read User read a range of responses to an item. Records Conf, Item, and the complete range of responses that appear on the page as R0 through RN (item text is "response" 0). additem User created a new item. Records Conf, Item, Size (in characters), and Duration (# of seconds between the appearence of the "add item" page and clicking on "Post"). addresp User added a response. Records Conf, Item, R0, Size (in characters), and Duration (# of seconds between the first appearence of the "add response" box and clicking on "Post"). leftconf User was in a conference, and left it. Records Conf, and Duration. Duration is the # of seconds the user was "in" the conference. leftitem User was in an item, and left it. Records Conf, Item, and Duration. Duration is the # of seconds the user was "in" the item. (Includes successive views of the same item, and any editing or posting done while on the same item.) ifspage User visited an "IFS" page (from a %ifs_page_link() or %ifs_popup() macro). Records Conf, Item, and R0. help User clicks on "Help" button. Records name of help file for current page in S1. adduser New user is created (by manager or by self-registration). deluser User is deleted. For the future, we're also considering adding the following events: Register (new user registration), UploadFile, DeleteItem, DeleteResp, EditItem, EditResp. Suggestions for additional "events" are most welcome!
VI. Notes on programming events
Events, as logged in Caucus, typically come in two forms:
- Isolated events
- Context event
An isolated event has data about the event, but otherwise is unconnected to anything else that happened. The 'login' event is the simplest case of an isolated event.
A context event marks something that happened in a larger context. Typically that means that the event embeds or records some information that relates to something else that happened earlier. The simplest case of this is an event that measures duration -- in order to have a duration, there must have been something else that happened to mark the start of the duration. The 'logout' event is a context event, because it measures the duration from the 'login' until the 'logout'.
This measuring of "something else" is implemented in the CML pages. For example, when a 'login' event is logged (in the startup.i CML file), it also saves the current time in the CML variable 'time_login'. Then when the 'logout' event is logged, the relevant CML page (goodbye.cml) can calculate the difference between the current time and the 'time_login' time, and log that as the duration of the 'logout' event.
The 'out of the box' Caucus CML pages use a simple convention for variable names for recording this kind of context information. 'time_xxx' is the time at which event 'xxx' happened. 'event_yyy' is some other kind of information about 'yyy' that is kept for a later call to $logevent().
For example, the 'leftconf' event uses CML variables time_cnum and event_cnum to record the time of entry to a conference, and the conference number of that conference. (See leftconf.i for details.)
VII. Other related tables
When making queries against the Events table, it would be very useful to also know two additional kinds of information:
- The actual names of conferences (rather than just the number in 'events.conf').
- Which Caucus user groups 'userid' belongs to.
This information is made available in two additional tables, Confnames and Groups. The data in these two tables are automatically updated at regular intervals (via cron) from the raw Caucus data. (In a later version of Caucus, everything will live in RDBMS tables, and this issue of table updating will disappear.)
The layout of these two tables is quite simple:
Table Column Data type Meaning Groups name varchar(80) Name of group file userid varchar(80) Actual userid of person included in this group Confnames name varchar(80) Conference name (using underscores) cnum int Conference number (same as Events.conf) The Groups table actually contains all the user-to-group relationships, including groups included in groups. More precisely, if user X is in group B, and group B is included in group A, then there will be a row in Groups containing "B, X", and another row containing "A, X".
VII. Writing Queries
A basic familiarity with writing SQL queries is assumed, although the psuedo-english-language nature of the SQL language is pretty logical. Some good introductions to SQL can be found at:
- riki-lb1.vet.ohio-state.edu/mqlin/computec/tutorials/SQLTutorial.htm
- www.highcroft.com/highcroft/sql_intro.pdf (same as above, but as a PDF)
- www.1keydata.com/sql/sql.html
The best way to write complicated Caucus queries is to start with a very simple query, and extend it piece by piece until you've reached the desired goal. The rest of this section describes an example that works just that way.
For this example, the final goal is to see a list of conferences, in order of popularity, as measured among a particular group of users called "mygroup". Popularity will be measured by amount of time spent in the conference.
Syntax notes: each new "piece" of our example query is shown in boldface. SQL keywords are shown in UPPER CASE; table names and column names are shown in lower case; alias are shown in Mixed Case. (SQL doesn't really care about upper or lower case, this is just a convention that makes SQL easier to read. It also doesn't care about line breaks -- you could write these queries as a single line, or using as many separate lines as you like.)
- Start with an extremely simple query that simply finds the "duration" column in all rows in the Events table -- which measures time spent in a conference. This just gives us a pile of numbers that are times in seconds spent in any conference by anyone.
SELECT duration FROM events WHERE event='leftconf'
- Now calculate the sum of the durations. That gives us a single number across all conferences.
SELECT SUM(duration) FROM events WHERE event='leftconf'
- Give the sum a name (aka "alias") 'TimeSec' that will appear in the output, and that we can use later to reference the sum.
SELECT SUM(duration) TimeSec FROM events WHERE event='leftconf'
- Divvy the sum up into subtotals, each by conference. Show ("select") the conference number so we know which subtotal goes with which conference.
SELECT conf, SUM(duration) TimeSec FROM events WHERE event='leftconf' GROUP BY confBy this point, we are now seeing output containing rows of two numbers; the first number is the conference number, the second number is the total duration spent by everyone in that conference. We're getting close to something useful.
- Order the output to show the most popular (highest duration) conference first, and so on ("DESC" ie descending order).
SELECT conf, SUM(duration) TimeSec FROM events WHERE event='leftconf' GROUP BY conf ORDER BY TimeSec DESC
- Now for the fun part: we connect, or "join" the data in two different tables to give us more useful information. In particular, conference number isn't all that useful. What we really want is the conference name -- but that lives in the table Confnames.
Think of "joining" two tables as if you were gluing them together horizontally, and then just selecting only certain combinations of the rows you want from each. You do this by finding some common element across the two tables, and saying "join them by that element". In our example, that common element is the conference number. Here's the result:
SELECT confnames.name, SUM(duration) TimeSec FROM confnames, events WHERE events.event='leftconf' AND confnames.cnum = events.conf GROUP BY events.conf ORDER BY TimeSec DESCNow that we're SELECTing from two tables, we put the table name in front of the columns we're examining (e.g. events.event and confnames.cnum). This isn't absolutely necessary, but it's clearer when there's more than one table involved.Now we've actually got something useful -- a measure of the popularity of conferences, in order, with names.
- The real goal was to measure the conference popularity among a particular group of users. So we want to "reduce" the results to only consider people in user group "mygroups". This involves yet another join -- this time against the Groups table.
SELECT confnames.name, SUM(duration) TimeSec FROM confnames, events, groups WHERE events.event='leftconf' AND confnames.cnum = events.conf AND events.userid = groups.userid AND groups.name = 'mygroups' GROUP BY events.conf ORDER BY TimeSec DESC
- The last query gave us the popularity measure "for all time" (or at least during the time that the conferences existed and the statistics module was installed!). Frequently we'll want to measure things only over a particular time range, such as the last 30 days. Here's how we can modify the previous query to restrict it in time:
SELECT confnames.name, SUM(duration) TimeSec FROM confnames, events, groups WHERE events.event='leftconf' AND confnames.cnum = events.conf AND events.userid = groups.userid AND groups.name = 'mygroups' AND TO_DAYS(events.time) >= TO_DAYS(SYSDATE()) - 30 GROUP BY events.conf ORDER BY TimeSec DESCTO_DAYS() converts a date-time value into the number of days since a (theoretical) year 0. SYSDATE() means "the current date and time". So the new AND expression reduces the rows selected (and added up into the SUM) to only those with an event date-time less than 30 days old (day number >= now minus 30 days).There are many functions for manipulating dates and times; see the date and time section of the MySQL reference manual for details.