Important Note: After downloading the dataset, remove the following columns: 'StartTime', 'LastTime', 'SrcAddr', 'DstAddr', 'sIpId', and 'dIpId'. Their values are unique to the attacks and would reveal the attack type to the model, so a model trained with them would not generalize to unseen data. The rest of this article assumes these columns have been removed.
Although all samples are labeled with the type of attack they belong to (in the 'Traffic' column), you can simplify the task to binary classification by labeling all attack traffic as class 1 and normal traffic as class 0. The specifics of our dataset are given in Table 1.
Dataset | WUSTL-IIoT |
---|---|
Number of observations | 1,194,464 |
Number of features | 41 |
Number of attack samples | 87,016 |
Number of normal samples | 1,107,448 |
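The preprocessing described before Table 1 (dropping the identifying columns and collapsing the attack types into a binary label) can be sketched with pandas as follows. This is a minimal sketch under stated assumptions: the string that marks normal traffic in the 'Traffic' column (`"normal"` here), the new column name `label`, and the helper `prepare_binary` are all hypothetical, and the small frame below is made-up data standing in for the real CSV.

```python
import pandas as pd

# Columns the article says to remove because they identify individual attacks.
LEAKY_COLUMNS = ["StartTime", "LastTime", "SrcAddr", "DstAddr", "sIpId", "dIpId"]


def prepare_binary(df: pd.DataFrame, normal_label: str = "normal") -> pd.DataFrame:
    """Drop the identifying columns and add a 0/1 label derived from 'Traffic'.

    `normal_label` is an assumption: check which string marks normal traffic
    in your copy of the CSV. The column name 'label' is also hypothetical.
    """
    # errors="ignore" lets the helper run even if some columns are already gone.
    out = df.drop(columns=LEAKY_COLUMNS, errors="ignore")
    # Any traffic type other than the normal label counts as an attack (class 1).
    out["label"] = (out["Traffic"] != normal_label).astype(int)
    return out


# Tiny made-up frame standing in for the real file loaded with pd.read_csv.
sample = pd.DataFrame({
    "StartTime": ["t0", "t1", "t2"],
    "SrcAddr": ["a", "b", "c"],
    "Sport": [502, 1234, 80],
    "Traffic": ["normal", "DoS", "normal"],
})
prepared = prepare_binary(sample)
print(prepared.columns.tolist())
print(prepared["label"].tolist())
```

If you train a binary classifier on the result, the 'Traffic' column itself should also be excluded from the feature matrix, since it encodes the label.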
Please note that we deliberately built the dataset to be imbalanced, since this is the realistic scenario in real-world settings. We generated command injection, reconnaissance, and DoS attacks against the testbed to obtain a large variety of attack records. Attack traffic makes up less than 8% of the dataset, which keeps the system as close as possible to real-world industrial control systems. The statistics of the dataset are shown in Table 2; the rows for individual attack types are percentages of the attack traffic, not of all traffic. The average data rate was 419 kbit/s, and the average packet size was 76.75 bytes. Since DoS attacks are usually heavy in both traffic volume and number of samples, we deliberately devoted about 90% of the attack samples to them. Other types of attacks occur less frequently and, when they do, generate only a small amount of traffic.
Traffic type | Percentage (%) |
---|---|
Normal Traffic | 92.72 |
Total Attack Traffic | 7.28 |
Command Injection Traffic | 0.31 |
DoS Traffic | 89.98 |
Reconnaissance Traffic | 9.46 |
Backdoor Traffic | 0.25 |
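As a quick consistency check, the normal and total-attack percentages in Table 2 can be recomputed from the sample counts in Table 1. The sketch below also converts the per-attack-type rows into approximate shares of the whole dataset, on the reading that those rows are percentages of the attack traffic (they sum to 100). The variable names are illustrative only.

```python
# Sample counts from Table 1.
TOTAL = 1_194_464
ATTACK = 87_016
NORMAL = 1_107_448

# Attack-type rows from Table 2, read as percentages of the attack traffic.
ATTACK_SHARES = {
    "Command Injection": 0.31,
    "DoS": 89.98,
    "Reconnaissance": 9.46,
    "Backdoor": 0.25,
}

# Class percentages of the whole dataset, rounded as in Table 2.
attack_pct = round(100 * ATTACK / TOTAL, 2)
normal_pct = round(100 * NORMAL / TOTAL, 2)

# Approximate share of each attack type in the whole dataset.
overall_pct = {
    name: share * attack_pct / 100 for name, share in ATTACK_SHARES.items()
}

print(f"attack: {attack_pct}%  normal: {normal_pct}%")
for name, pct in overall_pct.items():
    print(f"{name}: about {pct:.2f}% of all traffic")
```

The per-type figures are approximations because they are derived from rounded percentages rather than from raw counts.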
To provide more insight into our dataset, we also discuss the selected features. An important step in developing a dataset is selecting and extracting features from the traffic. In designing ours, we chose features whose values change during the attack phases compared to the normal operation phases. If a selected feature does not vary during the attacks, then even the best algorithm will not be able to detect an intrusion or an anomalous situation using that feature. In our study, we reviewed the potential features using the Argus tool [5] and chose 41 features that are common in network flows and also change during the attack phases. Table 3 lists the chosen features along with their descriptions.
Features | Type | Descriptions |
---|---|---|
Mean flow (mean) | Float | The average duration of the active flows |
Source Port (Sport) | Integer | Source port number |
Destination Port (Dport) | Integer | Destination port number |
Source Packets (Spkts) | Integer | Source → Destination packet count |
Destination Packets (Dpkts) | Integer | Destination → Source packet count |
Total Packets (Tpkts) | Integer | Total transaction packet count |
Source Bytes (Sbytes) | Integer | Source → Destination bytes count |
Destination Bytes (Dbytes) | Integer | Destination → Source bytes count |
Total Bytes (TBytes) | Integer | Total transaction bytes count |
Source Load (Sload) | Float | Source bits per second |
Destination Load (Dload) | Float | Destination bits per second |
Total Load (Tload) | Float | Total bits per second |
Source Rate (Srate) | Float | Source packets per second |
Destination Rate (Drate) | Float | Destination packets per second |
Total Rate (Trate) | Float | Total packets per second |
Source Loss (Sloss) | Float | Source packets retransmitted/dropped |
Destination Loss (Dloss) | Float | Destination packets retransmitted/dropped |
Total Loss (Tloss) | Float | Total packets retransmitted/dropped |
Total Percent Loss (Ploss) | Float | Percent packets retransmitted/dropped |
Source Jitter (ScrJitter) | Float | Source jitter in milliseconds |
Destination Jitter (DrcJitter) | Float | Destination jitter in milliseconds |
Source Interpacket (SIntPkt) | Float | Source interpacket arrival time in milliseconds |
Destination Interpacket (DIntPkt) | Float | Destination interpacket arrival time in milliseconds |
Protocol (Proto) | Char | Transaction protocol |
Duration (Dur) | Integer | Record total duration |
TCP RTT (TcpRtt) | Float | TCP connection setup round-trip time, the sum of 'synack' and 'ackdat' |
Idle Time (Idle) | Float | Time since the last packet activity. This value is useful in real-time processing and equals the current time minus the last time. |
Sum (sum) | Integer | Total accumulated duration of aggregated records |
Min (min) | Integer | Minimum duration of aggregated records |
Max (max) | Integer | Maximum duration of aggregated records |
Source DiffServ Byte (sDSb) | Integer | Source differentiated services (DiffServ) byte value |
Source TTL (sTtl) | Float | Source → Destination TTL value |
Destination TTL (dTtl) | Float | Destination → Source TTL value |
Source App Byte (SAppBytes) | Integer | Source → Destination application bytes |
Destination App Byte (DAppBytes) | Integer | Destination → Source application bytes |
Total App Byte (TotAppByte) | Integer | Total application bytes |
SYN_Ack (SynAck) | Float | TCP connection setup time, the time between the SYN and the SYN_ACK packets |
Run Time (RunTime) | Float | Total active flow run time. This value is generated through aggregation and is the sum of the records' durations. |
Source TOS (sTos) | Integer | Source TOS byte value |
Source Active Jitter (SrcJitAct) | Float | Source active jitter in milliseconds |
Destination Active Jitter (DstJitAct) | Float | Destination active jitter in milliseconds |
Further, we studied the importance of the features, ranking them by how much they help the algorithm distinguish normal traffic from attack traffic. The ranking uses a permutation-based technique: the values of one feature at a time are randomly permuted, creating a new dataset for each feature. The machine learning model is trained on these datasets, and the increase in classification error is measured for each. If the increase is high, the feature is important; conversely, if it is low, the feature is considered unimportant. For each feature, the "model reliance" or importance coefficient is defined as the ratio of the model's error after permutation to the baseline error when none of the variables are permuted. For more detailed information, we refer readers to [6].
As we report later in the results, random forest showed the best classification performance, so we picked this algorithm to calculate the importance. Figure 1 shows the top five features in our dataset along with their importance coefficients, normalized so that the 41 feature importance values sum to 1. Although these are the top five features, the importance threshold indicated that all 41 features are relevant enough to be used for training.
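The ratio-based importance coefficient described above can be illustrated with a short, dependency-free sketch. Several things here are assumptions rather than the authors' procedure: the sketch evaluates a fixed, already-built predictor on each permuted copy of the data (a common variant) instead of retraining, it uses a toy threshold rule in place of a random forest, and the data and the names `error_rate` and `model_reliance` are made up for illustration.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows that `predict` classifies incorrectly."""
    wrong = sum(predict(row) != y for row, y in zip(rows, labels))
    return wrong / len(rows)


def model_reliance(predict, rows, labels, seed=0):
    """Per-feature ratio: error with that feature permuted / baseline error."""
    rng = random.Random(seed)
    baseline = error_rate(predict, rows, labels)
    if baseline == 0:
        raise ValueError("baseline error is zero, so the ratio is undefined")
    ratios = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # permute one feature, leave the others intact
        permuted = [row[:j] + (v,) + row[j + 1:] for row, v in zip(rows, column)]
        ratios.append(error_rate(predict, permuted, labels) / baseline)
    return ratios


# Made-up data: feature 0 drives the label, feature 1 is unrelated noise.
rows = [(i / 200, (i * 7 % 13) / 13) for i in range(200)]
labels = [int(x0 > 0.5) for x0, _ in rows]
for i in range(0, 200, 20):
    labels[i] = 1 - labels[i]  # flip a few labels so the baseline error is nonzero


def predict(row):
    """Toy stand-in for a trained classifier: thresholds feature 0 only."""
    return int(row[0] > 0.5)


ratios = model_reliance(predict, rows, labels)
normalized = [r / sum(ratios) for r in ratios]  # coefficients sum to 1
print(ratios, normalized)
```

In this toy setup, permuting the noise feature leaves the predictions unchanged, so its ratio stays at 1, while permuting the informative feature raises the error well above the baseline. A single permutation per feature is noisy; averaging over several random permutations is a common refinement.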
Download the WUSTL-IIOT-2021 Dataset for IIoT Cybersecurity Research from HERE (106,192,911 bytes)
Please cite this dataset as follows:
Acknowledgment: This work was supported in part by the grant ID NPRP-10-0206-170360 funded by the Qatar National Research Fund (QNRF), and by the NSF under Grant CNS-1718929. The statements made herein are solely the responsibility of the authors.
References: