naše kosmišmiš

Counting web page unique users in an approximative way


... without storing user sensitive information ...

PaPaČut, 2021
Back to the home page.

Why?

I am running two small websites, this one and then the other one on digital design. For both of them I want to see basic access statistics (page hit rate and user access count). I was using for a long time "navrcholu.cz" for this, but this service is not available anymore. I was also experimenting with Google Analytics, but, frankly, I am not a fan of cookies.

So I was pondering what solution to choose... I wanted something, which is

Finally, I really like to learn new things and this adventure would bring me into the server-side scripting world. I had the chance to use all what I know about SQL and learn PHP ... - sharpen the tools I do not use in my daily life of a guy dealing with microelectronics.

And in the end I decided to write this brief description of the solution since someone can find it useful or inspiring. Full source codes are available in the end of the document.

Web Access Monitoring Specification - requirements

  1. Each page shall have its own counter.
    Note: This requirement is handled by the solution published by me recently.
  2. Unique users accessing the web server shall be counted.
    Clarification: If one user loads from the web server five webpages (either distinct ones, or one page more times, or any combination of these), then the system should count only one user.
  3. GDPR compliance
    Neither personal data, nor derived data which can lead to the user identification, shall be stored by the system, so no user consent is needed.

Note, that I am not an expert on GDPR. All below is provided "as it is", I am not responsible for any problems which can arise from usage of the whole solution. See also Terms of use below.

The "Unique" User Identification

User identification is the biggest problem we are facing, here. Counting the users will need to execute the following steps when a web page is being accessed by a browser:

  1. Web server detects the request to provide a page.
  2. User identification is obtained, server user access database is searched for this user (IDENT).
  3. If the user already accessed the server, we have no new user.
  4. If the user is not found in the database, then we have a new user. We add the user to the local database and increment user counter.
  5. Page is provided to the browser.

The problem lies in the step IDENT. A simple way how to identify a user is using her or his IP address. However, IP address is a kind of personal information; if we store it into the database, we are dealing with personal data. This I want to definitely avoid.

The solution to the problem is simple: we transform the IP address into another user identifier in an irreversible way. The new identifier shall be unique enough (requirement of uniqueness) to allow counting of the unique users (at least approximatively, I will explain later - "unique") and still fuzzy enough (requirement of fuzziness) not to allow anyone to reconstruct the original IP addresses at all. Let us have a look at these two requirements, now.

Fuzziness: How to calculate such an identifier? Hashing of the IP address can help us. To get the new identifier "fuzzy enough", we have to respect a simple rule: IP address has at least 32 bits (4 billions of addresses); the final hash shall not allow anyone to reconstruct the IP, thus it shall be "narrow enough" in terms of bits. If we choose 8 bit hash, then we reduce 32 bits to 8 bits, reduction by 24 bits. This means that each eight bit hash value will represent about 16 millions of IPs (2^32/2^8 = 2^24). This reduction (aliasing) is very important and actually disallows to reconstruct the original IP from the hash. So the algorithm will be as follows:

  1. Web server detects the request to provide a page.
  2. User IP address is obtained and an 8 bit hash is calculated, then the IP address is discarded and not used anymore.
  3. Server user access database is searched for the hash.
  4. If the hash is found, then the user (probably) already accessed the server, we have no new user.
  5. If the hash is not found, then we have a new user. We add it to the local database and increment user counter.
  6. Page is provided to the browser.

Pearson hashing algorithm was chosen for the hashing itself in my implementation.

Uniqueness: We need to monitor the number of unique users, but with 8 bit hash we know, that about 16 millions of IPs are mapped to one value. If we proceed according to the algorithm above, we will be able to identify only at most 256 users and then the system will stop counting. Database will contain all the 256 possible values and no more accessess will be logged. So we will utilize also current time, modified approach follows:

  1. Web server detects the request to provide a page.
  2. User IP address is obtained and 8 bit hash is calculated, then the IP address is discarded and not used anymore.
  3. Server user access database is searched for the hash.
  4. If the hash is found and database indicates that we already searched for it in the last five minutes, then this user already (probably) accessed the server. We have no new user. Database row for this hash is updated, new access time (current time) is logged, there.
  5. If the hash is found and database indicates that the last seach for this one happened more than five minutes ago, then we can have a new user. We increment user counter. Database row for this hash is updated, new access time (current time) is logged, there.
  6. If the hash is not found, user is not found in the database, then we have a new user. We add the hash to the local database along with the current time and increment user counter.
  7. Page is provided to the browser.

This solution has two limitations.

  1. If two users with different IPs leading to the same hash value (collision) access the web at moments spaced less than five minutes apart, we count them as one user. However, for my web server, this is not a big limitation. First, my web is popular, but not that much :-) to see such a collision often. Second, if this happens, the counted number of unique users will be pessimistic, lower than in reality, not higher than reality is. In any case, for a web server with lots of accesses this can be a limitation and the solution would be unsuitable "as is described here" for such an application .
  2. If the same user accesses the web after more than five minutes, we count her/him as a new unique user. This can make the user counter optimistic, bigger than reality is. You can tweak this by prolonging the five minute interval, but for longer timeout you will be facing issue with hash collisions - the probability of a different user with the same hash increases. For my personal purposes I do not see this limitation as a problem. See Conclusion below for more on the five minute timeout.

To conclude, the presented solution is not able to count unique users in the exact sense as e.g., defined here. However, for my website this is a sufficient solution.

Architecture of the Solution

The access counter solution is composed of the following components:

Database

For the part counting the page accessess, see the previous post. Only the part related to the user counting is described here.

There is a database behind the counter, created in MySQL, stored on the server side. You need to set it up and choose some login name and password for it. The following SQL code was used to create the database, database schema is apparent from it (and very simple, here):

CREATE TABLE user_counting (
  hash smallint NOT NULL PRIMARY KEY,
  request_time int,
  visitor_no int
);

The table web_counter_table contains the following columns:

The content of the table can look like the following (after a while on my web :-) ):

hash (smallint) request_time (int) visitor_no (int)
119163978901000003
4163942268400001
97163943267000002
120163943519000001

PHP script counting the users

The following code is handling the user counter, the important part of it is excerpted below. Place the code into a dedicated file like counter.php. The code was simplified before I published it, here, but all necessary is there.

Note:

  
    
    <?php
        $C_VISITOR_TIMEOUT = 300;
        
        function pearsonHash($input_string) {
          $f_transposition = array(
            1, 87, 49, 12, 176, 178, 102, 166, 121, 193, 6, 84, 249, 230, 44, 163,
            14, 197, 213, 181, 161, 85, 218, 80, 64, 239, 24, 226, 236, 142, 38, 200,
            110, 177, 104, 103, 141, 253, 255, 50, 77, 101, 81, 18, 45, 96, 31, 222,
            25, 107, 190, 70, 86, 237, 240, 34, 72, 242, 20, 214, 244, 227, 149, 235,
            97, 234, 57, 22, 60, 250, 82, 175, 208, 5, 127, 199, 111, 62, 135, 248,
            174, 169, 211, 58, 66, 154, 106, 195, 245, 171, 17, 187, 182, 179, 0, 243,
            132, 56, 148, 75, 128, 133, 158, 100, 130, 126, 91, 13, 153, 246, 216, 219,
            119, 68, 223, 78, 83, 88, 201, 99, 122, 11, 92, 32, 136, 114, 52, 10,
            138, 30, 48, 183, 156, 35, 61, 26, 143, 74, 251, 94, 129, 162, 63, 152,
            170, 7, 115, 167, 241, 206, 3, 150, 55, 59, 151, 220, 90, 53, 23, 131,
            125, 173, 15, 238, 79, 95, 89, 16, 105, 137, 225, 224, 217, 160, 37, 123,
            118, 73, 2, 157, 46, 116, 9, 145, 134, 228, 207, 212, 202, 215, 69, 229,
            27, 188, 67, 124, 168, 252, 42, 4, 29, 108, 21, 247, 19, 205, 39, 203,
            233, 40, 186, 147, 198, 192, 155, 33, 164, 191, 98, 204, 165, 180, 117, 76,
            140, 36, 210, 172, 41, 54, 159, 8, 185, 232, 113, 196, 231, 47, 146, 120,
            51, 65, 28, 144, 254, 221, 93, 189, 194, 139, 112, 43, 71, 109, 184, 209);
      
          $curr_hash = 0;
          for ($index = 0; $index < strlen($input_string); $index++) {
            $curr_char = $input_string[$index];
            $curr_hash = $f_transposition[$curr_hash ^ ord($curr_char)];
          }
          return $curr_hash;
        }

        $dbc = mysqli_connect('localhost', 'username_for_the_db', 'password_for_the_db', 'web_counter_database');
        #======================================================================
        # user-based access counting
        #======================================================================

        # Compute hash of the IP address
        $user_id = pearsonHash($_SERVER["REMOTE_ADDR"]);
        $result = mysqli_query($dbc, "START TRANSACTION");

        # Is it in the table from the last five minutes?        
        #if it is in the hashed_accesses table in the last five minutes --> update the time in the table
        #if it is not in the hashed_accesses table in the last five minutes --> add it to the table or update the record which is there, increase unique visitors's count in the hash record

        $query = "SELECT hash,request_time,visitor_no FROM user_counting WHERE hash=\"".$user_id."\"";
        $result = mysqli_query($dbc, $query);
        $num_rows = mysqli_num_rows($result);

        if ((!$result)||($num_rows==0)) {
          $query = "INSERT INTO user_counting (hash, request_time, visitor_no) VALUES (".$user_id.",".$_SERVER["REQUEST_TIME"].",1)";
        } else {
          $row = mysqli_fetch_row($result);
          if ( ($_SERVER["REQUEST_TIME"]-$row[1]) > $C_VISITOR_TIMEOUT ) {
            $query = "UPDATE user_counting SET visitor_no = visitor_no + 1, request_time = ".$_SERVER["REQUEST_TIME"]." WHERE hash=".$user_id;
          } else {
            $query = "UPDATE user_counting SET request_time = ".$_SERVER["REQUEST_TIME"]." WHERE hash=".$user_id;
          }
        }

        $result = mysqli_query($dbc, $query);

        $result = mysqli_query($dbc, "COMMIT");

        mysqli_close($dbc);
        ?>
        <body>
    <html>

Access Reporting

I wrote a simple PHP code reporting the counter statistics in a table. You can place it e.g., in counter_report.php, simplified code follows below. Note that anyone can open this page in the browser, there is no security. As it is, counter values can be publicly accessed.

    <table style="border: 1px solid black">
        <tr>
            <th>Hash</th>
            <th>Timestamp</th>
            <th>Request time</th>
            <th>Visitor no.</th>
        </tr>

       <?php
          $dbc = mysqli_connect('localhost', 'username_for_the_db', 'password_for_the_db', 'web_counter_database');

          $query = "SELECT * FROM user_counting";
          $result = mysqli_query($dbc, $query);

          if ($debug) {
            print_r($result);
          };

          $total = 0;
          $num_rows = mysqli_num_rows($result);
          if ((!$result)||($num_rows==0)) {
            debug_echo("NOT FOUND<br>");
          } else {
            while ($row = mysqli_fetch_row($result)) {
              printf("<tr><td>%3d</td><td>%d</td><td>%s</td><td>%05d</td></tr>\n", $row[0], $row[1], date("d.m.y H:i:s",$row[1]), $row[2]);
              $total = $total + $row[2];
            }
          }
          printf("<tr><td><b>Total</b></td><td></td><td></td><td><b>%05d</b></td></tr>", $total);

          mysqli_close($dbc);
       ?> 

The report can then look like the following

Hash Timestamp Request time Visitor no.
119163978901018.12.21 01:56:5000003
4163942268413.12.21 20:11:2400001
97163943267013.12.21 22:57:5000002
Total00006

Limitations, Next Steps, Conclusion

The solution has some limitations and some potential improvements are also obvious and possible. Selected items from my "todo list" for the future development follows:

To conclude, the presented solution is not able to count unique users in the exact sense as e.g., defined here. The first limitation is brought by the timeout, after which we count the visitor as a new one. The second limitation lies in the hashing which can sometimes result into counting two otherwise distinctive users as only one. However, for my website this is a sufficient solution.

Source Codes

The whole package with the PHP source codes and list of SQL command to create the database schema is available here. In addition to what is described above, the package contains also user trace monitoring: reporting of the sequences of web pages visited by the unqiue "hashes". Using it I can see how the user went through the web and optimize the pages accordingly. Again, the monitoring works only with hashes, no user data are stored, there.

And, by the way, you are visitor no. 924 since the last moment I reset the counter.

Terms of use

I am neither an expert on GDPR, nor on development for web. So by using information at this site, you agree that the author is not liable to you or to any other party for any direct, indirect, incidental, special, consequential or exemplary damages, including but not limited to, damages for loss of profits, goodwill, use, data, or other intangible losses, regardless of whether the author was advised of the possibility of such damage. In no event shall author's liability exceed zero U.S. dollar (USD 0.00) in aggregate. In the case that applicable law may not allow the limitation or exclusion of liability or incidental or consequential damages, the above limitation or exclusion may not apply to you, although author's liability will be limited to the fullest extent permitted by applicable law.