Add routing logic for Ping
This card depends on GH-585.
Calculating the route for a Ping is not trivial.
First, if we're wondering about the latency of a link, we don't want to Ping the Node at the far end of that link: we want to Ping a Node one hop beyond it, for two reasons. First, a Node shouldn't know that one of its links is being latency-checked, so that an evil Node can't respond differently to latency checks than to normal traffic. Second, the logic involved in responding to a Ping is different from the logic involved in routing normal data, so we wouldn't get an accurate result even from an honest Node. Therefore, we will refer to the distant Node we're pinging as the Ping target, and to the one-hop-closer Node at the far end of the link whose latency we're checking as the check target.
Second, since evil Nodes can play around with latencies in order to impugn honest Nodes near them, we should check latency from more than one route, to give us a chance to miss the evil Nodes and get honest latency figures.
Third, if a check target has only one neighbor, it's still possible to latency-check it by sending the Ping in through and back out through that same neighbor; but an evil Node could easily detect such a Ping, because real traffic never doubles back that way.
Even though routing latency checks is different from routing normal data, we'd still like to use the Routing Engine for it, because that means keeping our routing code in one place so that our changes can be simpler.
Come up with a way to use the Routing Engine to generate latency-check routes that check the same link, but use different relay Nodes and different Ping targets. You'll probably need to make several calls to the Routing Engine, and you might need to manually reverse a calculated route segment so that latency-check routes are the same over and back.
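Here's a minimal sketch of the manual-reversal idea, assuming the Routing Engine hands back an ordered list of Nodes; the plain string names stand in for whatever route-segment type the Routing Engine actually returns:

```rust
// Sketch only: node names are placeholder strings, not the Neighborhood's
// real route types.
fn reverse_segment(over: &[&str]) -> Vec<String> {
    // The "back" half retraces the "over" half in reverse order, so both
    // directions of the latency check traverse exactly the same links.
    over.iter().rev().map(|s| s.to_string()).collect()
}

fn main() {
    let over = ["A", "B", "C", "D"]; // A = originating Node, D = Ping target
    let back = reverse_segment(&over);
    assert_eq!(back, vec!["D", "C", "B", "A"]);
    println!("over: {:?}, back: {:?}", over, back);
}
```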
Leave in the logic for computing and logging the latency of new links, but modify it to designate the Node at the far end of the new link as the check target and to select a Node beyond it as the Ping target. If the new Node has only one neighbor (a situation the Neighborhood can represent, but which should never happen in real life), don't send the Ping.
Suggestion: add a parameter to the Routing Engine that is a Set of public keys of Nodes that you'd really rather not use for routing, but will if you absolutely have to. These Nodes can be given a temporarily high undesirability score.
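A sketch of how that parameter might feed into routing cost; `node_cost`, the penalty constant, and string public keys are all illustrative assumptions, not the real Routing Engine API:

```rust
use std::collections::HashSet;

// Hypothetical penalty: large enough that undesirable Nodes lose to any
// normal route, but finite so they can still be used as a last resort.
const UNDESIRABILITY_PENALTY: u64 = 100_000;

fn node_cost(base_cost: u64, public_key: &str, undesirable: &HashSet<&str>) -> u64 {
    if undesirable.contains(public_key) {
        base_cost + UNDESIRABILITY_PENALTY
    } else {
        base_cost
    }
}

fn main() {
    let avoid: HashSet<&str> = ["key_b"].into_iter().collect();
    assert_eq!(node_cost(10, "key_a", &avoid), 10);
    assert_eq!(node_cost(10, "key_b", &avoid), 100_010);
}
```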
Here's a response from ChatGPT that will eventually be helpful:
This is a classic network tomography problem, where you want to infer per-link latencies from end-to-end round-trip time (RTT) measurements. Since direct observations of individual link latencies are not available, you need an estimation algorithm that will improve over time as you collect more data.
Approach
You can model the problem as a system of linear equations, where the total RTT for each probe is the sum of the latencies of the individual links along the route. Over time, you accumulate multiple equations with different unknowns (latencies of individual links), and statistical methods can be used to refine your estimates.
Step 1: Model the Observations
If a packet traverses a path ( P = (e_1, e_2, ..., e_k) ), then the round-trip time ( T_P ) observed for that packet is:
[ T_P = 2 \sum_{i=1}^{k} L_i + \epsilon_P ]
where:
- ( L_i ) is the latency of edge ( e_i ).
- ( \epsilon_P ) represents noise and other uncertainties (such as queueing delays).
- The factor of 2 reflects the assumption that the return path is identical to the forward path, so each link's latency is incurred twice.
For multiple probes on different paths, you get a system of equations:
[ T_1 = 2(L_{a} + L_{b} + L_{c}) + \epsilon_1 ]
[ T_2 = 2(L_{b} + L_{c} + L_{d}) + \epsilon_2 ]
[ T_3 = 2(L_{a} + L_{c} + L_{e}) + \epsilon_3 ]
...
Step 2: Estimation Using Least Squares
Over time, as you collect multiple RTT measurements along overlapping routes, you can estimate the latencies by solving the system using least squares regression:
[ \mathbf{T} = 2\mathbf{A} \mathbf{L} + \mathbf{\epsilon} ]
where:
- ( \mathbf{T} ) is a vector of observed RTTs.
- ( \mathbf{A} ) is the routing matrix (a binary matrix indicating which links were used in each measurement).
- ( \mathbf{L} ) is the vector of unknown link latencies.
- ( \mathbf{\epsilon} ) is the noise term.
The standard least squares solution is:
[ \hat{\mathbf{L}} = \frac{1}{2} (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{T} ]
If ( \mathbf{A}^T \mathbf{A} ) is singular or ill-conditioned (common if paths overlap insufficiently), you can use regularized least squares (ridge regression) or Bayesian inference.
Step 3: Bayesian Filtering for Time-Varying Latencies
Since latencies change over time, a Kalman filter or particle filter can be used to continuously update estimates. Given prior estimates ( \hat{L}_i^{(t-1)} ), a new measurement updates the belief about ( L_i ):
[ \hat{L}_i^{(t)} = \alpha \cdot \hat{L}_i^{(t-1)} + (1 - \alpha) \cdot \text{New Estimate} ]
where ( \alpha ) is a forgetting factor that allows adaptation to slow changes.
Alternatively, a Gaussian Process (GP) regression can model each link’s latency as a smoothly varying function over time.
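The update rule in Step 3 is plain exponential smoothing, which is much simpler than a full Kalman or particle filter; a sketch (function and variable names are mine, not part of any existing code):

```rust
// Exponential smoothing with forgetting factor `alpha`: values near 1.0
// trust the historical estimate, values near 0.0 trust the newest sample.
fn smooth(previous: f64, new_estimate: f64, alpha: f64) -> f64 {
    alpha * previous + (1.0 - alpha) * new_estimate
}

fn main() {
    let mut latency = 20.0; // prior estimate for a link, in ms
    for sample in [22.0, 21.0, 30.0] {
        latency = smooth(latency, sample, 0.9);
        println!("updated estimate: {:.2} ms", latency);
    }
}
```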
Summary of the Approach
- Collect multiple RTT measurements across different paths over time.
- Solve for individual link latencies using least squares regression.
- Apply filtering techniques (Kalman, particle filter, or Gaussian Processes) to adapt estimates as latencies change.
Here's a Rust demo from ChatGPT, using the nalgebra library for matrix math. Please note that the data structures used by this demo are nothing like the ones used by our Neighborhood; our version of this demo, should we decide to write one, should use our own data structures.
use nalgebra::{DMatrix, DVector};
use std::collections::HashMap;

/// Represents a network graph with latencies for each edge.
struct Network {
    edges: HashMap<(usize, usize), f64>, // (Node1, Node2) -> Latency
}

impl Network {
    /// Creates a new network with given edges and latencies.
    fn new(edges: Vec<((usize, usize), f64)>) -> Self {
        Network { edges: edges.into_iter().collect() }
    }

    /// Simulates an RTT measurement along a given path.
    fn measure_rtt(&self, path: &[usize]) -> f64 {
        let mut total_latency = 0.0;
        for window in path.windows(2) {
            match self.edges.get(&(window[0], window[1])) {
                Some(&latency) => total_latency += latency,
                None => panic!("Path contains unknown edge {:?}", (window[0], window[1])),
            }
        }
        2.0 * total_latency // Round-trip time (RTT): assumes back == over
    }

    /// Estimates individual link latencies from multiple RTT measurements.
    fn estimate_latencies(
        &self,
        paths: &[Vec<usize>],
        rtt_measurements: &[f64],
    ) -> HashMap<(usize, usize), f64> {
        let num_edges = self.edges.len();
        let num_measurements = paths.len();

        // Assign a column index to each edge
        let edge_index: HashMap<(usize, usize), usize> =
            self.edges.keys().enumerate().map(|(i, &edge)| (edge, i)).collect();

        // Build the path-incidence matrix A
        let mut a_matrix = DMatrix::<f64>::zeros(num_measurements, num_edges);
        for (i, path) in paths.iter().enumerate() {
            for window in path.windows(2) {
                if let Some(&idx) = edge_index.get(&(window[0], window[1])) {
                    a_matrix[(i, idx)] = 2.0; // 2.0: each edge is traversed over and back
                }
            }
        }

        // Convert RTT measurements into a vector
        let t_vector = DVector::from_column_slice(rtt_measurements);

        // Solve the normal equations: L = (Aᵀ A)⁻¹ Aᵀ T
        // (the factor of 2 is already baked into the entries of A)
        let at_a = a_matrix.transpose() * &a_matrix;
        let at_t = a_matrix.transpose() * &t_vector;
        let estimated_latencies = at_a
            .try_inverse()
            .expect("Aᵀ A is singular: not enough independent paths to determine every edge")
            * at_t;

        // Convert results back to a HashMap keyed by edge
        edge_index
            .iter()
            .map(|(&edge, &idx)| (edge, estimated_latencies[idx]))
            .collect()
    }
}

fn main() {
    // Define a sample network with known latencies
    let network = Network::new(vec![
        ((1, 2), 10.0),
        ((2, 3), 20.0),
        ((3, 4), 15.0),
        ((1, 3), 25.0),
        ((2, 4), 30.0),
    ]);

    // Define paths taken by probe packets. Note: with five unknown edges we
    // need at least five linearly independent paths, or Aᵀ A is singular.
    let paths = vec![
        vec![1, 2, 3],    // RTT = 2(10 + 20) = 60
        vec![2, 3, 4],    // RTT = 2(20 + 15) = 70
        vec![1, 3, 4],    // RTT = 2(25 + 15) = 80
        vec![1, 2, 4],    // RTT = 2(10 + 30) = 80
        vec![1, 2, 3, 4], // RTT = 2(10 + 20 + 15) = 90
    ];

    // Simulated RTT measurements
    let rtt_measurements: Vec<f64> = paths.iter().map(|p| network.measure_rtt(p)).collect();

    // Estimate per-link latencies
    let estimated_latencies = network.estimate_latencies(&paths, &rtt_measurements);

    // Print estimated latencies
    println!("Estimated Link Latencies:");
    for ((n1, n2), latency) in estimated_latencies {
        println!("Edge ({}, {}) -> {:.2} ms", n1, n2, latency);
    }
}
It occurs to me that even if ChatGPT is not hallucinating on the math (we'll want to cover our version with good tests to find out), there's a basic inconsistency in this demo.
At the end of measure_rtt(), we multiply the total latency by 2.0, implying that the latency of the "back" route segment is identical to the latency of the "over" segment. However, the key of the HashMap of latencies is a simple ordered tuple, implying that the edge from A to B may have a different latency than the edge from B to A.
If we're to make the simplifying assumption that A->B = B->A, then adding (B, A) -> 25.0 to the HashMap should overwrite (A, B) -> 30.0 if it already exists. That will mean creating a special struct for the key and implementing PartialEq and Hash for it.
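A sketch of such a key struct: if the constructor normalizes the endpoint order, the derived PartialEq and Hash implementations do the right thing, and we only need hand-written impls if we keep the endpoints in arrival order. Names and usize endpoints are illustrative, matching the demo rather than our Neighborhood types:

```rust
use std::collections::HashMap;

// An undirected edge key: (a, b) and (b, a) hash and compare equal because
// the constructor stores the endpoints in sorted order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct EdgeKey {
    low: usize,
    high: usize,
}

impl EdgeKey {
    fn new(a: usize, b: usize) -> Self {
        if a <= b { EdgeKey { low: a, high: b } } else { EdgeKey { low: b, high: a } }
    }
}

fn main() {
    let mut latencies: HashMap<EdgeKey, f64> = HashMap::new();
    latencies.insert(EdgeKey::new(1, 2), 30.0);
    latencies.insert(EdgeKey::new(2, 1), 25.0); // overwrites the (1, 2) entry
    assert_eq!(latencies.len(), 1);
    assert_eq!(latencies[&EdgeKey::new(1, 2)], 25.0);
}
```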
ChatGPT offered various enhancements to this basic algorithm (Kalman filtering, particle filtering, Gaussian Process regression) to deal with noise from measurement error and ordinary network latency fluctuation; my counsel is to start with the simple least-squares method above and only worry about more advanced methods if it doesn't work out.
One vulnerability of the ping-one-beyond scheme is that somebody who is intent on slowing down the Network can put up a cluster of several evil Nodes instead of just one. In that case, pinging through an evil check target to an evil Ping target would allow the Ping target to detect the Ping nature of the request and alert the check target to relay the response immediately rather than inserting its evil delay.
A way to circumvent this (credit to @czarte) is to apply the Ping-analysis logic not only to Ping responses, but to everyday normal data responses as well. Any time the local Node receives a normal data response, it should run it through the same calculations it would use on a Ping response. That way, Nodes that treat Pings differently from normal data would at least have their latencies averaged in; we might even be able to identify and ban them.
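One way that identification step might look, as a sketch: keep separate RTT samples for Ping responses and normal data responses over the same link, and flag a large gap. The ratio threshold and all names here are illustrative assumptions, not a designed API:

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// A Node answering Pings much faster than it relays real traffic is
// suspicious: honest Nodes should show similar latencies for both.
fn looks_suspicious(ping_rtts: &[f64], data_rtts: &[f64], ratio_threshold: f64) -> bool {
    if ping_rtts.is_empty() || data_rtts.is_empty() {
        return false; // not enough evidence either way
    }
    mean(data_rtts) / mean(ping_rtts) > ratio_threshold
}

fn main() {
    let pings = [20.0, 21.0, 19.0]; // Pings answered promptly
    let data = [80.0, 95.0, 90.0];  // evil delay inserted on real traffic
    assert!(looks_suspicious(&pings, &data, 2.0));
}
```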