WUSTL-IIOT-2021 Dataset for IIoT Cybersecurity Research

Here, we present WUSTL-IIoT-2021, a dataset of network traffic from an industrial Internet of Things (IIoT) system, intended for cybersecurity research. The dataset was developed using our IIoT testbed described in [1]. The purpose of the testbed is to emulate real-world industrial systems as closely as possible and to allow real cyber-attacks to be carried out. We collected 2.7 GB of data over a total of about 53 hours. We pre-processed and cleansed the dataset by removing rows with missing values, corrupted values (i.e., invalid entries), and extreme outliers. The dataset that we used and uploaded here is a smaller version of that collection, a little over 400 MB. References [2], [3], and [4] present our other research papers that use this dataset in their case studies.

Important Note: After downloading the dataset, remove these columns: 'StartTime', 'LastTime', 'SrcAddr', 'DstAddr', 'sIpId', 'dIpId'. They are unique to the attacks and would expose the attack type to the model, so the trained model would not generalize to unseen data. In the rest of this article, we assume you have removed these columns.
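
As a minimal sketch of the column removal described above, the following uses only the Python standard library. The sample rows, their values, and the helper name drop_columns are invented for illustration; only the six column names come from the note.

```python
import csv
import io

# Columns the note above says to remove before training.
LEAKY_COLUMNS = ["StartTime", "LastTime", "SrcAddr", "DstAddr", "sIpId", "dIpId"]


def drop_columns(rows, columns=LEAKY_COLUMNS):
    """Return copies of the row dicts without the given columns."""
    unwanted = set(columns)
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


# Tiny made-up sample standing in for the downloaded CSV (all values invented).
SAMPLE_CSV = (
    "StartTime,LastTime,SrcAddr,DstAddr,sIpId,dIpId,Sport,Dport,Traffic\n"
    "t0,t1,10.0.0.1,10.0.0.2,100,200,59034,502,normal\n"
    "t2,t3,10.0.0.3,10.0.0.2,101,201,80,502,DoS\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = drop_columns(rows)
print(list(cleaned[0].keys()))  # ['Sport', 'Dport', 'Traffic']
```

If the file is loaded with pandas instead, the same step would typically be a single call to DataFrame.drop with the columns argument.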

Although all the samples are labeled with the type of attack they belong to (under the column 'Traffic'), you can simplify the problem into binary classification by labeling all attack traffic as class 1 and normal traffic as class 0. Specifics of our dataset are in Table 1.
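
The binary relabeling can be sketched as below. The label strings in the example list (including "normal" as the name of benign traffic) are assumptions made for illustration; check the actual values in the 'Traffic' column before using this.

```python
def to_binary_label(traffic, normal_label="normal"):
    """Map a 'Traffic' value to 0 (normal) or 1 (any attack type).

    `normal_label` is an assumed spelling of the benign class.
    """
    return 0 if traffic.strip().lower() == normal_label else 1


# Hypothetical 'Traffic' values, invented for this example.
traffic_column = ["normal", "DoS", "Reconn", "normal", "CommInj", "Backdoor"]
binary = [to_binary_label(t) for t in traffic_column]
print(binary)  # [0, 1, 1, 0, 1, 1]
```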

Table 1: Specifics of the developed dataset.
Dataset	WUSTL-IIoT
Number of observations	1,194,464
Number of features	41
Number of attack samples	87,016
Number of normal samples	1,107,448

Please note that we deliberately built our dataset to be imbalanced, since this is the realistic scenario in real-world settings. We generated command injection, reconnaissance, DoS, and backdoor attacks against the testbed to have a large variety of attack records in our dataset. The percentage of attack traffic in the dataset is less than 8%. This imbalance makes the dataset as similar as possible to traffic in real-world industrial control systems. Table 2 shows the statistics of the dataset; the average data rate was 419 kbit/s, and the average packet size was 76.75 bytes. Since DoS attacks are usually heavy in traffic and in the number of samples, we deliberately devoted about 90% of the attack traffic to them. Other types of attacks happen less frequently and, when they happen, generate only a small amount of traffic.

Table 2: Statistical information of the traffic types in our developed dataset. The four attack-type rows are percentages of the total attack traffic.
Traffic type	Percentage (%)
Normal Traffic	92.72
Total Attack Traffic	7.28
Command Injection Traffic	0.31
DoS Traffic	89.98
Reconnaissance Traffic	9.46
Backdoor Traffic	0.25
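
The two levels of percentages in Table 2 (normal versus attack over all samples, and each attack type as a share of the attack traffic) can be recomputed from a label column roughly as follows. This is a sketch: the helper name traffic_shares and the toy label strings are invented, while the two counts in the last lines are taken from Table 1.

```python
from collections import Counter


def traffic_shares(labels, normal_label="normal"):
    """Return (overall %, per-attack-type % of attack traffic)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples given")
    normal = counts.get(normal_label, 0)
    attack_total = total - normal
    overall = {
        "normal": 100.0 * normal / total,
        "attack": 100.0 * attack_total / total,
    }
    # Each attack type as a share of attack traffic only.
    within_attacks = {
        name: 100.0 * n / attack_total
        for name, n in counts.items()
        if name != normal_label
    }
    return overall, within_attacks


# Toy labels, invented for illustration.
toy_labels = ["normal"] * 90 + ["DoS"] * 8 + ["Reconn"] * 2
overall, within_attacks = traffic_shares(toy_labels)
print(overall)         # {'normal': 90.0, 'attack': 10.0}
print(within_attacks)  # {'DoS': 80.0, 'Reconn': 20.0}

# Attack share implied by the counts in Table 1.
attack_samples, normal_samples = 87_016, 1_107_448
attack_pct = round(100 * attack_samples / (attack_samples + normal_samples), 2)
print(attack_pct)  # 7.28
```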

In addition, to provide more insight into our dataset, we discuss the selected features. An important step in developing a dataset is selecting and extracting features from the traffic. In designing ours, we chose features whose values change during the attack phases compared to the normal operation phases. If a selected feature does not vary during the attacks, then even the best algorithm will not be able to detect an intrusion or an anomalous situation using that feature. In our study, we reviewed the potential features using the Argus tool [5] and chose 41 features that are common in network flows and also change during the attack phases. Table 3 shows the chosen features along with their descriptions.

Table 3: Selected traffic features to build our dataset.
Feature	Type	Description
Mean flow (mean)	Float	Average duration of the active flows
Source Port (Sport)	Integer	Source port number
Destination Port (Dport)	Integer	Destination port number
Source Packets (Spkts)	Integer	Source → Destination packet count
Destination Packets (Dpkts)	Integer	Destination → Source packet count
Total Packets (Tpkts)	Integer	Total transaction packet count
Source Bytes (Sbytes)	Integer	Source → Destination byte count
Destination Bytes (Dbytes)	Integer	Destination → Source byte count
Total Bytes (TBytes)	Integer	Total transaction byte count
Source Load (Sload)	Float	Source bits per second
Destination Load (Dload)	Float	Destination bits per second
Total Load (Tload)	Float	Total bits per second
Source Rate (Srate)	Float	Source packets per second
Destination Rate (Drate)	Float	Destination packets per second
Total Rate (Trate)	Float	Total packets per second
Source Loss (Sloss)	Float	Source packets retransmitted/dropped
Destination Loss (Dloss)	Float	Destination packets retransmitted/dropped
Total Loss (Tloss)	Float	Total packets retransmitted/dropped
Total Percent Loss (Ploss)	Float	Percent packets retransmitted/dropped
Source Jitter (ScrJitter)	Float	Source jitter in milliseconds
Destination Jitter (DrcJitter)	Float	Destination jitter in milliseconds
Source Interpacket (SIntPkt)	Float	Source interpacket arrival time in milliseconds
Destination Interpacket (DIntPkt)	Float	Destination interpacket arrival time in milliseconds
Protocol (Proto)	Char	Transaction protocol
Duration (Dur)	Integer	Record total duration
TCP RTT (TcpRtt)	Float	TCP connection setup round-trip time; the sum of 'synack' and 'ackdat'
Idle Time (Idle)	Float	Time since the last packet activity (current time minus last time); useful in real-time processing
Sum (sum)	Integer	Total accumulated duration of aggregated records
Min (min)	Integer	Minimum duration of aggregated records
Max (max)	Integer	Maximum duration of aggregated records
Source Diff Serve Byte (sDSb)	Integer	Source differentiated services (DiffServ) byte value
Source TTL (sTtl)	Float	Source → Destination TTL value
Destination TTL (dTtl)	Float	Destination → Source TTL value
Source App Byte (SAppBytes)	Integer	Source → Destination application bytes
Destination App Byte (DAppBytes)	Integer	Destination → Source application bytes
Total App Byte (TotAppByte)	Integer	Total application bytes
SYN_Ack (SynAck)	Float	TCP connection setup time; the time between the SYN and the SYN_ACK packets
Run Time (RunTime)	Float	Total active flow run time; generated through aggregation as the sum of the records' durations
Source TOS (sTos)	Integer	Source TOS byte value
Source Jitter (SrcJitAct)	Float	Source idle jitter in milliseconds
Destination Jitter (DstJitAct)	Float	Destination active jitter in milliseconds

Further, we have studied the importance of the features. They are ranked by how much each one helps the algorithm distinguish normal traffic from attack traffic. In this permutation-based technique, the values of one feature at a time are randomly permuted, creating a new dataset for each feature. The machine learning model is trained on these datasets, and the increase in classification error is measured for each. If the increase is high, the feature is important; conversely, if it is low, the feature is considered unimportant. For each feature, the "model reliance" or importance coefficient is defined as the ratio of the model's error after permutation to the baseline error when none of the variables are permuted. For more detailed information, we refer readers to [6].
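
The model reliance ratio described above can be illustrated with a small sketch. It makes several simplifying assumptions that are not part of our study: the data are a deterministic toy set, the "model" is a fixed threshold rule rather than a random forest, and the model is only re-scored on the permuted copy instead of being retrained. The names error_rate, model_reliance, and toy_model are hypothetical.

```python
import random


def error_rate(model, rows, labels):
    """Fraction of rows the model misclassifies."""
    wrong = sum(1 for row, y in zip(rows, labels) if model(row) != y)
    return wrong / len(rows)


def model_reliance(model, rows, labels, feature, rng):
    """Error with `feature` permuted, divided by the unpermuted error."""
    baseline = error_rate(model, rows, labels)
    if baseline == 0:
        raise ValueError("baseline error is zero; the ratio is undefined")
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)  # permute only this feature's column
    permuted_rows = [dict(row, **{feature: v}) for row, v in zip(rows, shuffled)]
    return error_rate(model, permuted_rows, labels) / baseline


# Toy data: x0 tracks the label except for every 10th row, x1 is unrelated.
labels = [i % 2 for i in range(200)]
rows = [
    {"x0": (1 - y) if i % 10 == 0 else y, "x1": (i * 7 % 13) / 13}
    for i, y in enumerate(labels)
]


def toy_model(row):
    """Fixed stand-in classifier that looks only at x0."""
    return 1 if row["x0"] > 0.5 else 0


rng = random.Random(0)  # seeded so the permutation is repeatable
reliance = {
    name: model_reliance(toy_model, rows, labels, name, rng)
    for name in ("x0", "x1")
}
print(reliance)
```

Here the ratio for x1 is exactly 1 because the toy model never reads that feature, while permuting x0 raises the error well above the baseline of 0.1, so its ratio is greater than 1.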

As we report later in the results, random forest showed the best classification performance, so we picked this algorithm to calculate the importance. Figure 1 shows the top five features in our dataset along with their importance coefficients, normalized so that the 41 feature importance values sum to 1. While these are the top five features, the importance threshold showed that all 41 features are relevant enough to be used for training.
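
The normalization and top-five selection mentioned above amount to the following. The feature names and raw values are placeholders invented for this sketch; they are not the values behind Figure 1.

```python
def normalize_importances(raw):
    """Scale importance values so that they sum to 1."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("importance values must have a positive sum")
    return {name: value / total for name, value in raw.items()}


def top_features(importances, k=5):
    """Return the k (name, value) pairs with the largest importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Placeholder raw coefficients (hypothetical feature names and values).
raw = {"f1": 8.0, "f2": 4.0, "f3": 2.0, "f4": 1.0, "f5": 1.0}
normalized = normalize_importances(raw)
print(sum(normalized.values()))     # 1.0
print(top_features(normalized, 2))  # [('f1', 0.5), ('f2', 0.25)]
```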

Download the WUSTL-IIOT-2021 Dataset for IIoT Cybersecurity Research from HERE (106,192,911 bytes)

Please cite this dataset as follows:

M. Zolanvari, M. A. Teixeira, L. Gupta, K. M. Khan, and R. Jain. "WUSTL-IIOT-2021 Dataset for IIoT Cybersecurity Research," Washington University in St. Louis, USA, October 2021, http://www.cse.wustl.edu/~jain/iiot2/index.html

Acknowledgment: This work was supported in part by the grant ID NPRP-10-0206-170360 funded by the Qatar National Research Fund (QNRF), and by the NSF under Grant CNS-1718929. The statements made herein are solely the responsibility of the authors.

References:

[1] M. Zolanvari, M. A. Teixeira, L. Gupta, K. M. Khan, and R. Jain, "Machine learning-based network vulnerability analysis of Industrial Internet of Things," in IEEE Internet of Things Journal 6 (2019), pp. 6822-6834, http://www.cse.wustl.edu/~jain/papers/vulnerab.htm
[2] M. Zolanvari, M. Teixeira, and R. Jain, "Effect of Imbalanced Datasets on Security of Industrial IoT Using Machine Learning," in Proceedings of IEEE ISI (Intelligence and Security Informatics), November 2018, http://www.cse.wustl.edu/~jain/papers/imb_isi.htm
[3] M. Zolanvari, Z. Yang, K. M. Khan, R. Jain, and N. Meskin, "TRUST XAI: A Novel Model for Explainable AI with An Example Using IIoT Security," in IEEE Internet of Things Journal, to appear, accepted September 2021.
[4] M. Zolanvari, A. Ghubaish, and R. Jain, "ADDAI: Anomaly Detection using Distributed AI," in Proceedings of IEEE ICNSC (International Conference on Networking, Sensing and Control), to appear October 2021, accepted September 2021.
[5] Argus. Available online: https://qosient.com/argus/ (accessed October 2021).
[6] A. Fisher, C. Rudin, and F. Dominici, "Model Class Reliance: Variable importance measures for any machine learning model class, from the 'Rashomon' perspective," 2018, arXiv: 1801.01489.