Iliou, Christos, Kostoulas, Theodoros, Tsikrika, Theodora, Katos, Vasilis, Vrochidis, Stefanos and Kompatsiaris, Ioannis (2020) Web Bot Detection Dataset.
Automated programs (bots) are responsible for a large percentage of website traffic. These bots, called web bots, vary in sophistication based on their purpose, ranging from simple automated scripts to advanced web bots that have a browser fingerprint, support the main browser functionalities, and exhibit humanlike behaviour. Advanced web bots are especially appealing to malicious web bot creators because their browserlike fingerprint and humanlike behaviour reduce their detectability. Malicious purposes of web bots include, but are not limited to, content scraping, vulnerability scanning, account takeover, distributed denial of service attacks, marketing fraud, carding, spam, and buying all the available stock of specific limited products to resell later at a higher price (i.e., scalper bots).
Thus, web servers must be equipped with tools to detect such malicious web bots. To that end, state-of-the-art approaches, both in academia and in commercial solutions, propose combining rule-based techniques with machine-learning-based methods [1]. For the latter, the browsing behaviours of visitors are used to train machine learning models that distinguish web bots from human visitors [1].
Early machine-learning-based web bot detection methods examined the web logs of the visitors, while more recent approaches use the mouse movements that the visitors perform [1]. The lack of datasets that include mouse movements of humans browsing the web (alone or in combination with the respective web logs) motivated the creation of this dataset, which contains the web logs and mouse movements of (i) humans, (ii) moderate web bots that have a browser fingerprint, and (iii) advanced web bots that have a browser fingerprint and also exhibit humanlike behaviour. This dataset can be used to research web bot detection and evasion techniques that use and/or combine web logs with mouse movements.
The dataset was collected using a web server hosting web pages crawled from Wikipedia (https://www.wikipedia.org/) and consists of two parts, one for each of the two evaluation phases of [1]:
1. For the first evaluation phase, the web server hosted 61 web pages from five different categories/topics crawled from Wikipedia, and 50 human sessions were generated by a closed set of participants, i.e., the authors of [1]; in each session, the authors visited the web server for an adequate (not predefined) period of time to generate sufficient data for our experiments.
2. For the second evaluation phase, an expanded version of the same web server was used; this web server hosted a total of 110 web pages from 11 categories/topics (including the content used in the first version of the web server), again crawled from Wikipedia. In this case, 28 users were asked to visit this web server and to create two sessions each (resulting in a total of 56 human sessions). Each user was instructed to spend about 15–20 minutes per session.
In both evaluation phases, we created the same number of moderate web bot sessions and of advanced web bot sessions as the number of sessions generated by the humans. Details about the behaviour of these bots can be found in [1].
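As a quick sanity check on the figures above, the following minimal Python sketch tallies the expected number of sessions per evaluation phase. It assumes that each bot class (moderate and advanced) separately matches the human session count; the names `HUMAN_SESSIONS`, `CLASSES`, and `sessions_per_phase` are illustrative only and are not taken from the dataset or its README.

```python
# Human session counts as stated in the description.
HUMAN_SESSIONS = {
    "phase1": 50,      # generated by the authors of [1]
    "phase2": 28 * 2,  # 28 users, two sessions each
}

# Assumption: moderate and advanced bots EACH have as many sessions as humans.
# These class labels are hypothetical, not the dataset's own naming.
CLASSES = ("human", "moderate_bot", "advanced_bot")


def sessions_per_phase(human_sessions):
    """Return the expected session count per class for every phase."""
    return {
        phase: {cls: n for cls in CLASSES}
        for phase, n in human_sessions.items()
    }


counts = sessions_per_phase(HUMAN_SESSIONS)
totals = {phase: sum(per_class.values()) for phase, per_class in counts.items()}
print(totals)
```

Under that assumption, the tally gives 150 sessions for the first phase and 168 for the second; the README included in the dataset remains the authoritative source for the actual file counts.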
Specific details about the dataset and its structure can be found in a README file included in the dataset.
[1] Christos Iliou, Theodoros Kostoulas, Theodora Tsikrika, Vasilis Katos, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2021. Detection of Advanced Web Bots by Combining Web Logs with Mouse Behavioural Biometrics. Digital Threats 2, 3, Article 24 (September 2021), 26 pages. https://doi.org/10.1145/3447815