IRUS processes repository downloads in line with the COUNTER Code of Practice. COUNTER provides a list of well-known robots whose usage should be excluded as a bare minimum, but the list is not intended to be comprehensive.
We eliminate known robot entries using the COUNTER user agent exclusion list, which is also used as part of the COUNTER audit process. It works reasonably well at eliminating robot usage on traditional scholarly publisher websites, which sit behind an authentication/authorisation barrier, but it is not enough in the open access repository world. To address this, IRUS goes a step further and applies additional exclusions.
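The user agent check can be sketched as a simple pattern match. The patterns below are illustrative only, in the style of the COUNTER exclusion list; the real list is longer and maintained by COUNTER.

```python
import re

# Hypothetical excerpt in the style of the COUNTER robots list
# (illustrative patterns, not the actual maintained list).
COUNTER_PATTERNS = [
    r"bot",
    r"spider",
    r"crawl",
    r"^Wget",
]

_compiled = [re.compile(p, re.IGNORECASE) for p in COUNTER_PATTERNS]

def is_known_robot(user_agent: str) -> bool:
    """Return True if the user agent matches any exclusion pattern."""
    return any(p.search(user_agent) for p in _compiled)

print(is_known_robot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(is_known_robot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))        # False
```

Any download whose user agent matches is dropped before statistics are calculated; the remaining, harder cases are handled by the additional IRUS exclusions below.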
Additional IRUS exclusions
IRUS adds further filters to remove more user agents identified as robots and applies simple thresholds for 'overactive' IP addresses.
Besides 'good' bots like Googlebot, there are:
- 'bad' bots that don’t declare themselves as bots but are mostly harmless
- a host of others: hackers, spammers, dictionary attackers, etc.
For those bad actors we need additional filtering mechanisms above and beyond the COUNTER exclusions.
So, we additionally exclude entries from:
- usage events where the HTTP Referrer has been identified as fake (e.g. false Google referrers)
- IP addresses with 40 or more downloads in a single day
- IP address + User Agent combinations with 10 or more downloads of a single item in a single day
- IP address ranges, grouped by the first three octets, that have 300 or more downloads in a day
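The three count-based thresholds above can be sketched as a filtering pass over one day's download events. The field names and event structure here are illustrative assumptions, not the actual IRUS log schema.

```python
from collections import Counter

# Thresholds as described in the text.
IP_DAILY_LIMIT = 40        # 40+ downloads from one IP in a day
IP_UA_ITEM_LIMIT = 10      # 10+ downloads of one item by one IP + user agent in a day
SUBNET_DAILY_LIMIT = 300   # 300+ downloads from one first-three-octets range in a day

def subnet(ip: str) -> str:
    """Group an IPv4 address by its first three octets, e.g. '192.0.2'."""
    return ".".join(ip.split(".")[:3])

def excluded_keys(events):
    """Find the IPs, IP+UA+item combinations, and ranges over threshold.

    `events` is an iterable of dicts with 'ip', 'ua' and 'item' keys
    (a hypothetical schema) covering a single 24-hour period.
    """
    by_ip = Counter(e["ip"] for e in events)
    by_combo = Counter((e["ip"], e["ua"], e["item"]) for e in events)
    by_subnet = Counter(subnet(e["ip"]) for e in events)
    bad_ips = {ip for ip, n in by_ip.items() if n >= IP_DAILY_LIMIT}
    bad_combos = {k for k, n in by_combo.items() if n >= IP_UA_ITEM_LIMIT}
    bad_subnets = {s for s, n in by_subnet.items() if n >= SUBNET_DAILY_LIMIT}
    return bad_ips, bad_combos, bad_subnets

def keep(event, bad_ips, bad_combos, bad_subnets) -> bool:
    """True if the event survives all three exclusion rules."""
    return (event["ip"] not in bad_ips
            and (event["ip"], event["ua"], event["item"]) not in bad_combos
            and subnet(event["ip"]) not in bad_subnets)
```

Counting per calendar day matches the daily processing window described below; an event is dropped if it trips any one of the three rules.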
IRUS exclusions are processed daily for a 24-hour period.
These thresholds have been derived empirically by analysing several years' worth of usage data logs and, during an audit review, the COUNTER auditors agreed that these appeared to be reasonable extra measures to remove robotic/rogue activity from our statistics.
Changes and updates
There is always more that could be done to improve robot detection and exclusion. We also understand that consistency and comparability over time are important, and that frequent changes to processing rules would reduce the value of the IRUS dataset.
In future, we may consider implementing alternative or additional ways to identify robot and rogue usage. Where possible or sensible, this may mean we reprocess and replace statistics. Any proposed changes will be communicated to participants via the mailing list and in consultation with the user group.
Note that we may also reprocess and replace statistics for other reasons, for example where there has been a change to the repository platform.
Identifying whether something is "genuine", or "robot" usage is not always clear cut. For example:
- automated downloads are not necessarily robots, e.g. an institution carrying out a major literature search might use a script to cross-search a number of databases and repositories – should this be excluded?
- what actually constitutes genuine usage, e.g. a lecturer asks a class of 30 students to practise downloading the same three items – is this genuine usage?
In IRUS, as with most other statistical packages, we are using downloads and clicks as a proxy for usage, i.e. we are making a qualitative statement using a quantitative measure – this can never be totally accurate.
All measurements are ultimately a judgement call and necessarily arbitrary. What distinguishes IRUS from other statistics packages is that we work to a transparent, global standard that is applied consistently across all participating repositories.