Challenges from Click Stream Data

The common thread through all quantitative techniques discussed is the need for data. Fortunately, a natural byproduct of users accessing pages is a data set that contains the sequence of URLs they visited, how long they viewed them, and at what time. This data set is called the click stream. To maximize its potential, managers can merge the click stream with demographic and purchase information. Three potential sources exist for collecting click stream data: (1) The host server (the computer of the site being visited) keeps a record of visits, usually called a server log. As a user requests a page, the server records identifying information (IP address, previous URL visited, and browser type) in the server log.
(2) A third party can capture information about web requests. For example, if a user connects through an Internet service provider (ISP) or commercial on-line service (COS), such as AOL, it can record any requests the user makes as it relays them to the requested server. Because many ISPs and COSs cache their users' requested pages, they do not pass all requests on to the server; instead, they serve many pages from local cache archives to speed up responses to user requests. Unfortunately, this means that server logs contain only a subset of the viewings that occur. Drèze and Zufryden discuss some of the challenges of using server log data to measure advertising effectiveness. (3) A final, and perhaps the most reliable, source of click stream data is a program installed on the computer where the browser is running. It can "watch" the user and record the URL of each page viewed in the browser window, as well as other application programs that the user is running.
It records the actual pages viewed and thus avoids the problem of cached requests. Such a program can also record how long windows are active. The drawback is that the analyst must choose the individuals and obtain their consent to participate in such a panel. Generally, web users are randomly sampled to construct a representative panel. The information from this panel can be projected to the national population using statistical inference. The largest provider of such information is Media Metrix [Coffey 1999].
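To make the server-log source in (1) concrete, the following is a minimal sketch of extracting the identifying fields (IP address, previous URL, and browser type) from one log entry. It assumes the widely used Combined Log Format, which the text does not name; the sample entry, the pattern, and the function name `parse_log_line` are hypothetical and for illustration only.

```python
import re
from datetime import datetime

# Combined Log Format (assumed here): host, identity, user, [time],
# "request", status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the bracketed timestamp into a timezone-aware datetime.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Hypothetical log entry (documentation IP address and example domain).
SAMPLE = (
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98)"'
)

record = parse_log_line(SAMPLE)
print(record["ip"], record["referrer"], record["agent"])
```

A log parsed this way contains only the requests that actually reached the server, so pages served from an ISP or browser cache never appear in it.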
The click stream of an actual user session as collected by Media Metrix shows that the user frequently views the same page repeatedly and sometimes pauses to do other tasks between page views (for example, running other applications or watching television). Only five of the 12 viewings the user requested could generate a "hit" to the server. This illustrates the advantage of collecting data at a user's machine rather than at a host site: it includes all requests, eliminating a potential source of bias. Information about where and how frequently users access web sites is used for various tasks. Marketers use such information to target banner ads. For example, users who often visit business sites may receive targeted banner ads for financial services even while browsing at nonbusiness sites.
Web managers may use this information to understand consumer behavior at their site. Additionally, it can be used to compare competing web sites. Members of the financial community use such information to value dot-com companies. Analysts use click stream information to track trends in a particular site or within a general community. Financial analysts find this type of intelligence useful for assessing the values of companies because many traditional accounting and finance measures can be poor predictors of firms' values. Another use of click stream data is to profile visitors to a web site.
Identifying characteristics of visitors to a site is an important precept of personalization. One way to find out characteristics of visitors is to ask them to fill out a survey. However, not everyone is willing to fill one out, creating what is known in marketing research as a self-selection bias. The information may be inaccurate as well. For example, visitors may give invalid mailing addresses to protect their privacy or inaccurately report incomes to inflate their egos. Also, completing a survey takes time, and the effort required may severely skew both the type of individuals who complete it and the results.
An alternative way to profile users is with click stream data. The demographic profiles of sites reported by companies like Media Metrix can be used to determine what type of individuals visit a site. For example, Media Metrix reports that 66 percent of visitors to ivillage.com are female. Even without knowing anything about a user except that they visit ivillage.com, the odds are two to one that a visitor is female. This is quite reasonable because ivillage.com offers content geared toward issues of primary concern to women. Some gaming sites appeal primarily to teenage boys, and sports sites may draw predominantly adult men.
On the other hand, such portals as Yahoo! and Excite draw audiences that are fairly representative of the web as a whole. Media Metrix can identify demographic characteristics of visitors using information provided to them by panelists. However, simply knowing the web sites visited by a user and the profiles of these web sites (that is, the demographic characteristics of a sample of users) is enough to make a good prediction about a visitor's demographics. For example, suppose we wish to predict whether a user is a woman. In general, about 45 percent of web users are female. Therefore, without knowing what sites a person visited, one would guess that there is a 45 percent probability of the user being female and a 55 percent probability of the user being male.
If forced to choose, one would guess the user to be male, but this guess would often be wrong since the odds are almost even. However, if one knows that this individual visited ivillage.com, whose visitors are 66 percent women, the estimate that this user is female can be improved. This is a Bayesian hypothesis updating problem, and an analyst could apply Bayes' formula to recompute the probability that the user is female using this new information. The original probability of being female is denoted by p = .45, and the new information we have is p′ = .66. The updated probability, or posterior probability, of our hypothesis is denoted by p″ = .62. In other words, the probability that this is a female user has increased from 45 percent to 62 percent. While most of the web sites visited by this individual indicate the user is most likely a female, some of the sites visited (aol.com, eplay.com, halcyon.com, lycos.com, and netradio.net) might point to the individual being male. However, based on information from all 22 sites, the probability that the user is female is 99.97 percent.
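The updating step can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not necessarily the exact formula used here: it combines the prior (.45) with a site's female share (.66) as if the share were independent evidence, and it treats visits to different sites as statistically independent. The function names and the list of site shares in the second call are hypothetical.

```python
def update(prior, site_share):
    """Combine a prior probability of 'female' with one site's female share.

    Assumed form: posterior = p*s / (p*s + (1 - p)*(1 - s)).
    """
    numerator = prior * site_share
    return numerator / (numerator + (1 - prior) * (1 - site_share))


def posterior_female(prior, site_shares):
    """Apply the update once per visited site (independence assumed)."""
    probability = prior
    for share in site_shares:
        probability = update(probability, share)
    return probability


# One site: prior .45, site whose visitors are 66 percent female.
single = update(0.45, 0.66)

# Several sites with hypothetical female shares; shares below .5 pull
# the estimate toward male, shares above .5 pull it toward female.
several = posterior_female(0.45, [0.66, 0.70, 0.40, 0.75])

print(round(single, 3), round(several, 3))
```

With the rounded inputs .45 and .66, this form gives a posterior of roughly .61, close to the 62 percent reported above; the small gap may reflect rounding of the inputs or a difference in the exact formula. Each additional site shifts the estimate further, which is how evidence from many sites can accumulate into a near-certain prediction.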
To assess the accuracy of this technique, I applied it to actual usage information from a representative sample of 19,000 web users with one month of usage and known gender. Users in this sample vary a great deal: some visit only one site, while others visit hundreds. If the model predicts more than an 80 percent probability that the user is male, then the user is predicted to be male. Similar predictions are made for female users. There was enough information to classify 60 percent of users as either male or female. Of the users classified as male, 81 percent are male, and of the users classified as female, 96 percent are female.
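The 80 percent decision rule and the two accuracy figures can be expressed as a short evaluation routine. The sketch below is an illustration under assumptions: the function names are hypothetical, and the five probabilities and genders are made-up values, not the 19,000-user sample.

```python
def classify(p_female, threshold=0.80):
    """Return 'female', 'male', or None when neither probability exceeds the threshold."""
    if p_female > threshold:
        return "female"
    if 1 - p_female > threshold:
        return "male"
    return None


def evaluate(probabilities, actual, threshold=0.80):
    """Return (share of users classified, accuracy within each predicted class)."""
    labels = [classify(p, threshold) for p in probabilities]
    classified = [(label, truth) for label, truth in zip(labels, actual)
                  if label is not None]
    coverage = len(classified) / len(labels)
    accuracy = {}
    for gender in ("male", "female"):
        hits = [truth == gender for label, truth in classified if label == gender]
        accuracy[gender] = sum(hits) / len(hits) if hits else None
    return coverage, accuracy


# Hypothetical posterior probabilities of 'female' and known genders.
probs = [0.95, 0.90, 0.10, 0.15, 0.50]
truth = ["female", "male", "male", "male", "female"]

coverage, accuracy = evaluate(probs, truth)
print(coverage, accuracy)
```

Raising the threshold trades coverage for accuracy: fewer users are classified, but those who are classified are more likely to be classified correctly.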
The close agreement between the predicted and actual genders supports the accuracy of this technique. More advanced techniques that appropriately account for statistical dependence between web site visits can increase the accuracy of these predictions. In this example, all of the web sites visited by a user were known, but this is not typical. However, many web advertising agencies, such as DoubleClick and Flycast, have such information for their advertising partners since they serve all their partners' banner ads. These partners give them wide coverage of the web. User profiling techniques allow them to accurately predict characteristics of their users without ever having to ask a user to fill out a survey.
Predicting the gender of an individual seems innocuous, but the same techniques can be applied to other, more sensitive demographic variables, such as income (for example, does a user make more than $75,000?). Just as there are sites that provide valuable information about gender, there are sites that provide information about income (for example, millionaire.com, wsj.com, and businessweek.com). While this example illustrates that these techniques can be used successfully, questions about privacy also need to be considered before using these techniques in practice. The collection, processing, and use of click stream data are not cost free, as some would claim. A major portal can collect 30 to 50 gigabytes of information from web accesses each day in a server log.
Even a small site will generate several megabytes. To analyze these large data sets, managers need data mining and statistical techniques. These techniques generally require skill, and their inaccurate application can result in poor or misleading predictions about visitors and what they want to see. If, for example, a visitor is wrongly classified as male, the visitor could be shown messages that would not appeal to a female user or, even worse, cause her to leave the site and never return. Therefore, to apply these techniques appropriately, one should take into account the value of personalization and the costs of misclassifying a visitor.
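One simple way to weigh the value of personalization against the cost of misclassification is an expected-value comparison. The following is a minimal sketch under stated assumptions: the payoff values are hypothetical, the function names are illustrative, and a real application would estimate gains and costs from data.

```python
def expected_gain(p_correct, gain_if_correct, cost_if_wrong):
    """Expected payoff of showing a targeted message instead of a neutral one."""
    return p_correct * gain_if_correct - (1 - p_correct) * cost_if_wrong


def should_personalize(p_correct, gain_if_correct, cost_if_wrong):
    """Personalize only when the expected payoff is positive."""
    return expected_gain(p_correct, gain_if_correct, cost_if_wrong) > 0


def break_even_probability(gain_if_correct, cost_if_wrong):
    """Classification probability above which personalizing pays off."""
    return cost_if_wrong / (gain_if_correct + cost_if_wrong)


# Hypothetical payoffs: a well-targeted message is worth 1 unit, while a
# mistargeted one costs 3 units (for example, a visitor who never returns).
GAIN, COST = 1.0, 3.0

print(should_personalize(0.62, GAIN, COST))    # weak evidence
print(should_personalize(0.9997, GAIN, COST))  # strong evidence
print(break_even_probability(GAIN, COST))
```

Under these assumed payoffs, a visitor classified with 62 percent confidence should see the neutral message, while one classified with near certainty should see the targeted one. The higher the cost of a mistake relative to the gain, the more confident the classification must be before personalizing.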