WUSTL-IIOT-2021 Dataset for IIoT Cybersecurity Research

Here we present WUSTL-IIoT-2021, a dataset of industrial Internet of Things (IIoT) network data intended for cybersecurity research. The dataset was built using our IIoT testbed described in [1]. The testbed is designed to emulate real-world industrial systems as closely as possible and to allow real cyber-attacks to be carried out against it. We collected 2.7 GB of data over a total of about 53 hours. We then pre-processed and cleansed the data by removing rows with missing values, corrupted values (i.e., invalid entries), and extreme outliers. The dataset that we used and uploaded here is a smaller version of that collection, a little over 400 MB in size. If interested, see [2], [3], and [4] for our other research papers that use this dataset in their case studies.

Important Note: After downloading the dataset, remove the following columns: 'StartTime', 'LastTime', 'SrcAddr', 'DstAddr', 'sIpId', 'dIpId'. Their values are unique to the attacks and would expose the type of the attack to the model, so a model trained with them would not generalize to unseen data. In the rest of this article, we assume you have removed these columns.
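The column removal described above can be sketched with Python's standard csv module. This is a minimal illustration, not the authors' tooling: the function name drop_columns and the sample rows are made up, and the code assumes the downloaded file is a CSV with a header row.

```python
import csv
import io

# Columns the note above says to remove before training.
LEAKY_COLUMNS = ["StartTime", "LastTime", "SrcAddr", "DstAddr", "sIpId", "dIpId"]


def drop_columns(src, dst, columns=LEAKY_COLUMNS):
    """Copy CSV rows from src to dst, omitting the named columns.

    Works row by row, so the whole file never has to fit in memory.
    Returns the list of column names that were kept.
    """
    reader = csv.DictReader(src)
    if reader.fieldnames is None:
        raise ValueError("input has no header row")
    kept = [name for name in reader.fieldnames if name not in columns]
    # extrasaction="ignore" silently skips the columns we are dropping.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# Tiny in-memory demonstration; the values are made up.
sample = io.StringIO(
    "StartTime,SrcAddr,Sport,Dport,Traffic\n"
    "t0,a,1234,502,normal\n"
)
cleaned = io.StringIO()
kept_columns = drop_columns(sample, cleaned)
print(kept_columns)  # ['Sport', 'Dport', 'Traffic']
```

For the real file, pass file objects opened with newline="" in place of the in-memory buffers. Streaming row by row keeps memory use low for a file of several hundred megabytes.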

Although all the samples are labeled with the type of attack they belong to (under the column 'Traffic'), you can simplify the problem into a binary classification by labeling all attack traffic as class 1 and normal traffic as class 0. The specifics of our dataset are given in Table 1.
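The binary relabeling can be sketched as follows. The exact strings stored in the 'Traffic' column are not listed here, so the value "normal" for benign rows is an assumption, as are the example attack names; check the actual column values before relying on it.

```python
def to_binary_label(traffic, normal_label="normal"):
    """Return 0 for normal traffic and 1 for any attack type.

    normal_label is an assumption; verify it against the 'Traffic' column.
    """
    return 0 if traffic.strip().lower() == normal_label.lower() else 1


# Made-up 'Traffic' values, for illustration only.
traffic_values = ["normal", "DoS", "Reconn", "normal", "Backdoor"]
binary = [to_binary_label(value) for value in traffic_values]
print(binary)  # [0, 1, 1, 0, 1]
```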

Table 1: Specifics of the developed dataset.
Dataset                     WUSTL-IIoT
Number of observations      1,194,464
Number of features          41
Number of attack samples    87,016
Number of normal samples    1,107,448

Please note that we have deliberately built our dataset to be imbalanced, since this is the realistic scenario in real-world settings. We generated command injection, reconnaissance, and DoS attacks against the testbed to have a large variety of attack records in our dataset. Attack traffic makes up less than 8% of the dataset, which keeps it as similar as possible to real-world industrial control systems. The statistics of the dataset are shown in Table 2. The average data rate was 419 kbit/s, and the average packet size was measured as 76.75 bytes. Since DoS attacks are usually heavy in traffic and in the number of samples, we deliberately devoted about 90% of the attack samples to them. The other types of attacks happen less frequently, and when they do, they generate only a small amount of traffic.

Table 2: Statistical information of the traffic types in our developed dataset.
Traffic type                   Percentage (%)
Share of all traffic:
  Normal Traffic               92.72
  Total Attack Traffic         7.28
Share of attack traffic:
  Command Injection Traffic    0.31
  DoS Traffic                  89.98
  Reconnaissance Traffic       9.46
  Backdoor Traffic             0.25
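As a quick consistency check, the percentages in Table 2 can be recomputed from the counts in Table 1. The short sketch below uses only numbers printed in the two tables. It also shows that the four attack-type rows add up to 100%, which indicates that they are shares of the attack traffic rather than of all traffic.

```python
# Counts from Table 1.
total = 1_194_464
attack = 87_016
normal = 1_107_448

# The two classes account for every observation.
assert attack + normal == total

attack_share = round(100 * attack / total, 2)
normal_share = round(100 * normal / total, 2)
print(attack_share, normal_share)  # 7.28 92.72

# Attack-type rows of Table 2 (percent).
attack_types = {
    "Command Injection": 0.31,
    "DoS": 89.98,
    "Reconnaissance": 9.46,
    "Backdoor": 0.25,
}
subtype_total = round(sum(attack_types.values()), 2)
print(subtype_total)  # 100.0
```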

In addition, to provide more insight into our dataset, we discuss the selected features. An important step in developing a dataset is selecting and extracting features from the traffic. In designing ours, we chose features whose values change during the attack phases compared to the normal operation phases. If a selected feature does not vary during the attacks, then even the best algorithm will not be able to detect an intrusion or an anomalous situation using that feature. In our study, we reviewed the potential features using the Argus tool [5] and chose 41 features that are common in network flows and also change during the attack phases. Table 3 shows the chosen features along with their descriptions.
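To illustrate why a feature must change during attacks to be useful, the sketch below compares a feature's mean in normal and attack rows, scaled by its overall standard deviation. This screening score is only an illustration and is not the selection procedure used for the dataset; the function name and all numbers are made up.

```python
from statistics import mean, pstdev


def standardized_mean_difference(normal_values, attack_values):
    """Absolute difference of class means divided by the overall spread.

    Returns 0.0 when the feature is constant, i.e. it carries no signal.
    """
    spread = pstdev(list(normal_values) + list(attack_values))
    if spread == 0:
        return 0.0
    return abs(mean(attack_values) - mean(normal_values)) / spread


# Made-up numbers: a packet rate that jumps during an attack, and a
# feature that has the same value in both phases.
rate_normal, rate_attack = [10, 12, 11, 9], [95, 110, 102, 99]
flat_normal, flat_attack = [64, 64, 64, 64], [64, 64, 64, 64]

print(standardized_mean_difference(rate_normal, rate_attack) > 1.0)  # True
print(standardized_mean_difference(flat_normal, flat_attack))        # 0.0
```

A score near zero means the feature looks the same in both phases, so no classifier could separate the classes with it alone.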

Table 3: Selected traffic features to build our dataset.
Features                                Type     Descriptions
Mean flow (mean)                        Float    The average duration of the active flows
Source Port (Sport)                     Integer  Source port number
Destination Port (Dport)                Integer  Destination port number
Source Packets (Spkts)                  Integer  Source → Destination packet count
Destination Packets (Dpkts)             Integer  Destination → Source packet count
Total Packets (Tpkts)                   Integer  Total transaction packet count
Source Bytes (Sbytes)                   Integer  Source → Destination byte count
Destination Bytes (Dbytes)              Integer  Destination → Source byte count
Total Bytes (TBytes)                    Integer  Total transaction byte count
Source Load (Sload)                     Float    Source bits per second
Destination Load (Dload)                Float    Destination bits per second
Total Load (Tload)                      Float    Total bits per second
Source Rate (Srate)                     Float    Source packets per second
Destination Rate (Drate)                Float    Destination packets per second
Total Rate (Trate)                      Float    Total packets per second
Source Loss (Sloss)                     Float    Source packets retransmitted/dropped
Destination Loss (Dloss)                Float    Destination packets retransmitted/dropped
Total Loss (Tloss)                      Float    Total packets retransmitted/dropped
Total Percent Loss (Ploss)              Float    Percent packets retransmitted/dropped
Source Jitter (ScrJitter)               Float    Source jitter in milliseconds
Destination Jitter (DrcJitter)          Float    Destination jitter in milliseconds
Source Interpacket (SIntPkt)            Float    Source interpacket arrival time in milliseconds
Destination Interpacket (DIntPkt)       Float    Destination interpacket arrival time in milliseconds
Protocol (Proto)                        Char     Transaction protocol
Duration (Dur)                          Integer  Record total duration
TCP RTT (TcpRtt)                        Float    TCP connection setup round-trip time, the sum of 'synack' and 'ackdat'
Idle Time (Idle)                        Float    Time since the last packet activity. This value is useful in real-time processing and is the current time minus the last time.
Sum (sum)                               Integer  Total accumulated duration of aggregated records
Min (min)                               Integer  Minimum duration of aggregated records
Max (max)                               Integer  Maximum duration of aggregated records
Source Diff Serve Byte (sDSb)           Integer  Source differentiated services (DiffServ) byte value
Source TTL (sTtl)                       Float    Source → Destination TTL value
Destination TTL (dTtl)                  Float    Destination → Source TTL value
Source App Byte (SAppBytes)             Integer  Source → Destination application bytes
Destination App Byte (DAppBytes)        Integer  Destination → Source application bytes
Total App Byte (TotAppByte)             Integer  Total application bytes
SYN_Ack (SynAck)                        Float    TCP connection setup time, the time between the SYN and the SYN_ACK packets
Run Time (RunTime)                      Float    Total active flow run time. This value is generated through aggregation and is the sum of the record durations.
Source TOS (sTos)                       Integer  Source type of service (TOS) byte value
Source Active Jitter (SrcJitAct)        Float    Source active jitter in milliseconds
Destination Active Jitter (DstJitAct)   Float    Destination active jitter in milliseconds

Further, we have studied the importance of the features by ranking them according to how much they help the algorithm distinguish normal traffic from attack traffic. In this permutation-based technique, the values of each feature are permuted randomly, one feature at a time, creating new datasets. The machine learning model is trained on these datasets, and the increase in classification error is measured for each. If the increase is high, the feature is important; conversely, if it is low, the feature is considered unimportant. For each feature, the "model reliance" or importance coefficient is defined as the ratio of the model's error after permutation to its baseline error when none of the variables are permuted. For more detailed information, we refer readers to [6].
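The model reliance ratio can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code used for the study: it scores a fixed, already fitted predictor on the permuted data instead of retraining it, which is a common way to compute permutation importance, and it uses a toy rule in place of a random forest. All names and data are made up.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows that predict() classifies incorrectly."""
    wrong = sum(predict(row) != label for row, label in zip(rows, labels))
    return wrong / len(rows)


def model_reliance(predict, rows, labels, feature, rng):
    """Ratio of the error after permuting one feature to the baseline error."""
    baseline = error_rate(predict, rows, labels)
    if baseline == 0:
        raise ValueError("baseline error is zero; the ratio is undefined")
    # Shuffle only the chosen column, leaving every other column in place.
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    permuted_rows = [
        row[:feature] + [value] + row[feature + 1:]
        for row, value in zip(rows, shuffled)
    ]
    return error_rate(predict, permuted_rows, labels) / baseline


# Toy data: feature 0 drives the label, feature 1 is ignored by the model.
rng = random.Random(0)
rows = [[i % 2, rng.random()] for i in range(200)]
labels = [row[0] for row in rows]
for i in range(10):  # flip a few labels so the baseline error is not zero
    labels[i] = 1 - labels[i]


def predict(row):
    """Stand-in for a trained classifier: it only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0


print(model_reliance(predict, rows, labels, 0, rng))  # expected to be well above 1
print(model_reliance(predict, rows, labels, 1, rng))  # 1.0, the feature is unused
```

A ratio of 1 means that permuting the feature did not change the error, so the model does not rely on it. Larger ratios indicate stronger reliance.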

As we report later in the results, random forest showed the best classification performance, so we picked this algorithm to calculate the importance. Figure 1 shows the five most important features in our dataset along with their importance coefficients, normalized so that the 41 feature importance values sum to 1. Although these are the top five features, the importance threshold showed that all 41 features are relevant enough to be used for training.
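The normalization mentioned above can be sketched as dividing each importance value by the total. The exact normalization used for Figure 1 is not spelled out here, so this is an assumption, and the scores below are hypothetical placeholders rather than results from the dataset.

```python
def normalize_importances(raw):
    """Scale importance scores so that they sum to 1."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    return {name: value / total for name, value in raw.items()}


# Hypothetical raw scores; these are not results from the dataset.
raw_scores = {"feature_a": 6.0, "feature_b": 3.0, "feature_c": 1.0}
normalized = normalize_importances(raw_scores)
print(normalized)  # {'feature_a': 0.6, 'feature_b': 0.3, 'feature_c': 0.1}
```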

Figure 1: Five most important features in WUSTL-IIoT-2021.

Download the WUSTL-IIOT-2021 Dataset for IIoT Cybersecurity Research from HERE (106,192,911 bytes)

Please cite this dataset as follows:

Acknowledgment: This work was supported in part by the grant ID NPRP-10-0206-170360 funded by the Qatar National Research Fund (QNRF), and by the NSF under Grant CNS-1718929. The statements made herein are solely the responsibility of the authors.

References:

  1. M. Zolanvari, M. A. Teixeira, L. Gupta, K. M. Khan, and R. Jain, "Machine learning-based network vulnerability analysis of Industrial Internet of Things," in IEEE Internet of Things Journal 6 (2019), pp. 6822-6834, http://www.cse.wustl.edu/~jain/papers/vulnerab.htm.
  2. M. Zolanvari, M. Teixeira, and R. Jain, "Effect of Imbalanced Datasets on Security of Industrial IoT Using Machine Learning," in Proceedings of IEEE ISI (Intelligence and Security Informatics), November 2018, http://www.cse.wustl.edu/~jain/papers/imb_isi.htm.
  3. M. Zolanvari, Z. Yang, K. M. Khan, R. Jain, and N. Meskin, "TRUST XAI: A Novel Model for Explainable AI with An Example Using IIoT Security," in IEEE Internet of Things Journal, to appear, accepted September 2021.
  4. M. Zolanvari, A. Ghubaish, and R. Jain, "ADDAI: Anomaly Detection using Distributed AI," in Proceedings of IEEE ICNSC (International Conference on Networking, Sensing and Control), to appear October 2021, accepted September 2021.
  5. Argus. Available online: https://qosient.com/argus/ (accessed October 2021).
  6. A. Fisher, C. Rudin, and F. Dominici, "Model Class Reliance: Variable importance measures for any machine learning model class, from the 'Rashomon' perspective," 2018, arXiv:1801.01489.