Fix a flaw in the noise-removing code in our onion service statistics
I think I found a flaw in the aggregation part of our onion service statistics.
For context, I'm quoting a paragraph from our technical report where we explain how we're extrapolating network totals from hidden-service statistics:
"When processing hidden-service statistics, we need to handle the fact that they have been obfuscated by relays. As first step, we're attempting to remove the additive Laplace-distributed noise by rounding up or down to the nearest multiple of bin_size
. The idea is that it's most likely that noise was added to the closest right side of a bin than to the right side of another bin. In step two, we're subtracting half of bin_size
, because the relay added between 0 and bin_size − 1
to the originally observed value."
In our Java code, these two steps turned into:
/** Removes noise from a reported stats value by rounding to the nearest
* right side of a bin and subtracting half of the bin size. */
private long removeNoise(long reportedNumber, long binSize) {
long roundedToNearestRightSideOfTheBin =
((reportedNumber + binSize / 2) / binSize) * binSize;
long subtractedHalfOfBinSize =
roundedToNearestRightSideOfTheBin - binSize / 2;
return subtractedHalfOfBinSize;
}
I think that this code has a flaw: in (reportedNumber + binSize / 2) / binSize
, we use integer truncation. However, as described in that Wikipedia article, "for negative numbers truncation does not round in the same direction as the floor function: truncation always rounds toward zero, the floor function rounds towards negative infinity."
I'm going to attach a graph (once this ticket exists) that visualizes the effect for reported values around zero. In short: the step function has one really long step, and that seems flawed or at least inconsistent.
My suggestion would be to change that Java code and use Math.floorDiv()
rather than integer truncation. Something like this:
diff --git a/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java b/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
index f2abc789..888c8959 100644
--- a/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
+++ b/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
@@ -245,7 +245,7 @@ public class Parser {
* right side of a bin and subtracting half of the bin size. */
private long removeNoise(long reportedNumber, long binSize) {
long roundedToNearestRightSideOfTheBin =
- ((reportedNumber + binSize / 2) / binSize) * binSize;
+ Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize;
long subtractedHalfOfBinSize =
roundedToNearestRightSideOfTheBin - binSize / 2;
return subtractedHalfOfBinSize;
See the JavaDocs for that method for details. Quoting: "Normal integer division operates under the round to zero rounding mode (truncation). This operation instead acts under the round toward negative infinity (floor) rounding mode. The floor rounding mode gives different results than truncation when the exact result is negative."
I'm yet unclear whether this might affect overall results. Right now, we're handling all negative reported values wrong by adding 1 binSize
to them which we shouldn't do. Of course, negative reported values should be the exception, not the rule, but there's a reason why we're keeping them in the process, just like we're keeping values that we think are too large because of noise. We'll have to see.
But regardless, I think we need to fix this. Opinions?