Studying Tor Network Traffic Using Hidden Markov Modeling and Dynamic Learning by Tamer Sameeh
Experimentation techniques help test Tor’s performance and uncover security problems, and they let researchers run Tor experiments privately and safely without harming live Tor users. However, researchers who use these techniques typically configure them to generate network traffic based on simplifying assumptions and invalid measurements, without analyzing the efficacy of their configuration choices.
A recently published paper developed a novel technique for dynamically learning Tor network traffic models using hidden Markov modeling and privacy-preserving measurement techniques. The researchers conducted a safe yet detailed measurement study of Tor using 17 relay nodes (comprising around 2% of Tor’s bandwidth) over a period of 6 months, producing models that can be used to generate sequences of streams and packets. They showed that their measurement results and traffic models can be used to generate traffic flows in private Tor networks, and that their models are more realistic than traditional and alternative traffic generation techniques.
Contributions of this research paper:
The authors of this paper made four major contributions that are likely to influence research on Tor security and performance. These are:
Measuring Tor:
The researchers conducted a large measurement study of Tor. Using a novel privacy-preserving measurement tool known as PrivCount on 17 relay nodes that account for around 2% of Tor’s overall bandwidth, they measured Tor network features over a period of 3 months. The study offers a thorough analysis of Tor traffic, including the number of active and inactive circuits, the number of active and inactive clients, the number and types of streams and their distribution per client, and the number of inbound and outbound bytes and their distribution per stream. These measurements are expected to yield accurate Tor traffic models and a more detailed understanding of Tor and how it is used.
Learning Tor traffic:
The researchers also designed new techniques for dynamically and safely learning Tor traffic models using hidden Markov modeling (HMM). They extended the PrivCount measurement tool to support these techniques and deployed it on their relays to measure (i.e., train) Tor packet and stream models over a 3-month period. Their evaluation showed that, even though Tor traffic varies greatly over short periods, their best model instances fit Tor traffic quite well. The models can be used to generate streams within a circuit and packets within a stream using standard techniques for sampling from probability distributions.
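To make the generation step concrete, here is a minimal Python sketch of how packets within a stream could be sampled from a hidden Markov model. It is not the researchers' model: the state names, the transition and emission probabilities, the exponential delay distribution, and the function names `pick` and `generate_packets` are all illustrative assumptions.

```python
import random

# Hypothetical toy HMM for packets within one stream. The states,
# probabilities, and delay parameters are illustrative assumptions,
# not values measured or published by the researchers.
START = {"burst": 0.7, "idle": 0.3}
TRANSITIONS = {
    "burst": {"burst": 0.8, "idle": 0.15, "end": 0.05},
    "idle": {"burst": 0.6, "idle": 0.3, "end": 0.1},
}
# Each state emits a packet direction and an inter-packet delay drawn
# from an exponential distribution with the given mean (in seconds).
EMISSIONS = {
    "burst": {"to_client": 0.9, "mean_delay": 0.01},
    "idle": {"to_client": 0.5, "mean_delay": 2.0},
}


def pick(weights, rng):
    """Sample one key from a {key: probability} mapping."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def generate_packets(rng, max_packets=1000):
    """Walk the HMM and return a list of (time, direction) packets."""
    packets = []
    now = 0.0
    state = pick(START, rng)
    while state != "end" and len(packets) < max_packets:
        emission = EMISSIONS[state]
        # Emit one packet in the current state.
        now += rng.expovariate(1.0 / emission["mean_delay"])
        if rng.random() < emission["to_client"]:
            direction = "to_client"
        else:
            direction = "to_server"
        packets.append((now, direction))
        # Move to the next hidden state (or end the stream).
        state = pick(TRANSITIONS[state], rng)
    return packets


if __name__ == "__main__":
    rng = random.Random(42)
    for time, direction in generate_packets(rng)[:5]:
        print(f"{time:.4f}s {direction}")
```

The same walk could in principle be reused one level up, with states that emit stream start times within a circuit instead of packets within a stream. Seeding the random number generator keeps a generated trace reproducible across experiment runs.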
Developing traffic models:
The researchers also built a set of modeling semantics and a traffic generation tool, named “TGen”, that can be used to create complex behavioral patterns. TGen gives configurable control over the creation of TCP connections, as well as the size, duration, and schedule of packet streams. They described two new client models: one based on the most common protocols used on Tor (i.e., HTTP and BitTorrent), and one that uses their Tor measurement results and HMM stream and packet models as the basis for traffic generation. These models are relevant across a range of Tor experimentation tools and research topics.
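As a rough illustration of what a protocol-based client model controls, the Python sketch below produces a schedule of transfers for an HTTP-like client and a BitTorrent-like client. It does not use TGen or its configuration format. The `Transfer` type, the function names, and every size and timing value are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Transfer:
    start: float   # seconds since the client started
    download: int  # bytes requested from the server
    upload: int    # bytes sent to the server


def web_client(rng, duration, mean_think=30.0):
    """HTTP-like behavior: small downloads separated by think times.

    All sizes and timings here are assumed values, not measurements.
    """
    transfers = []
    now = 0.0
    while True:
        # Wait an exponentially distributed "think time" before the next fetch.
        now += rng.expovariate(1.0 / mean_think)
        if now >= duration:
            return transfers
        transfers.append(Transfer(now, rng.randint(50_000, 2_000_000), 500))


def bulk_client(rng, duration, size=5_000_000, pause=1.0):
    """BitTorrent-like behavior: repeated large transfers in both directions."""
    transfers = []
    now = 0.0
    while now < duration:
        transfers.append(Transfer(now, size, size))
        now += pause
    return transfers


if __name__ == "__main__":
    rng = random.Random(1)
    print(len(web_client(rng, 600.0)), "web transfers in 10 minutes")
    print(len(bulk_client(rng, 600.0)), "bulk transfers in 10 minutes")
```

A measurement-driven client model would replace these fixed assumptions with values sampled from measured distributions and learned HMMs.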
Supporting Tor research:
The researchers’ Tor measurements and improved traffic models facilitate the meaningful exploration of open Tor research problems. Their results will improve research conducted with general-purpose, packet-level Tor experimentation tools. For instance, evaluating the efficiency of proposed Tor load balancing algorithms is more meaningful when the background traffic (i.e., the number and distribution of circuits, streams, and packets) within a Tor test network is more realistic (i.e., more similar to conditions on the live Tor network). Their results will also improve research that uses higher-level Tor flow or circuit simulations that run over long simulated durations. For instance, proposed secure bandwidth measurement algorithms have greatly benefited from experiments that involve many iterations of full network measurement to quantify feedback effects and the time required to reach a steady state.
These simulations can use the proposed models as generators to ensure an accurate distribution of flows over arbitrary timescales. The changes to PrivCount and Shadow that were necessary to carry out this research have been contributed to the open-source community and merged into the PrivCount and Shadow projects. Moreover, the authors of the paper have released their PrivCount measurement data, TGen HMM models, and Shadow experimentation models so that researchers and developers can benefit from their work.