userstats-bridge-country graph is undercounting users

mentioned in issue onionoo#40022 (closed)

marked this issue as related to onionoo#40022 (closed)

added Doing label

changed milestone to %Metrics OKR Q1 - Q2 2022

mentioned in merge request !20 (closed)

I did some analysis to see what a corrected graph for Snowflake would look like. My estimate of the degree of undercounting was off: it would be correct when comparing to Onionoo, but the graph on metrics.tpo has different numbers than Onionoo, for reasons I don't understand.

Source code

The "manual" and "Onionoo" series have the relationship I expect: before 2022-01-31 they have a ratio of 1:1; between 2022-02-01 and 2022-03-15 (tpo/anti-censorship/pluggable-transports/snowflake#40095 (closed)) they have a ratio of 4:1; between 2022-03-16 and 2022-04-11 (tpo/anti-censorship/pluggable-transports/snowflake#40110 (closed)) they have a ratio of 8:1; and after 2022-04-12 (tpo/anti-censorship/pluggable-transports/snowflake#40111 (closed)) they have a ratio of 4:1.

The metrics.tpo series also agrees exactly with the other two before 2022-01-31, except for some 1-day dips caused by I don't know what. In this graph, I divided out the frac factor that otherwise would inflate the estimated count in the metrics.tpo series by 10–30%.

But after 2022-02-01 (after the beginning of load balancing on the snowflake bridge), the metrics.tpo series diverges. It is no longer proportional to the number of directory requests; it's usually bigger. But then, there is a stretch between 2022-02-28 and 2022-03-09 where metrics.tpo and Onionoo are exactly equal again, just as they were at the left side of the graph.

What could cause the metrics.tpo estimate to be different than the Onionoo estimate? There is only one bridge reporting snowflake statistics. I've looked over the reproducible metrics document, and I don't find anything that could cause a difference other than frac, which has been removed from this graph.

My only guess is that it has something to do with the sparseness of descriptor publishing. I noticed that since the addition of load balancing, the instances do not publish extra-info descriptors at regular intervals. Notice that the descriptors are dense between 2022-02-28 and 2022-03-09, which is also where the metrics.tpo and Onionoo graphs agree. The descriptors are sparse where there were 8 instances, between 2022-03-18 and 2022-04-12, which is also where the metrics.tpo graph is the most noisy and spiky.

added Q2 label

mentioned in issue tpo/community/support#40050 (closed)

assigned to @hiro

I have a vague idea of why this might happen but I am not sure yet.

While working on !20 (closed) I went through how the data parsed from descriptors is added to the DB on meronense. This data is then used to produce the estimation that we publish on metrics.tpo.

My feeling is that on onionoo we use the log text file to produce a node status and bandwidth document. And that is overwritten when a new descriptor arrives. So the ration between the actual count and onionoo is exactly N:1.

On the other hand on metrics web we add records to the DB and we check the intervals. Ex:

https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/master/src/main/sql/clients/init-userstats.sql#L22

CREATE TABLE imported (

  -- The 40-character upper-case hex string identifies a descriptor
  -- uniquely and is used to join metrics (responses, bytes, status)
  -- published by the same node (relay or bridge).
  fingerprint CHARACTER(40) NOT NULL,

  -- The node type is used to decide the statistics that this entry will
  -- be part of.
...

On the imported table we allow different entries to be added for the same fingerprint and we process the intervals later:

https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/master/src/main/sql/clients/init-userstats.sql#L322

-- If the existing entry that we're currently looking at starts before
    -- the previous entry ends, we have created two overlapping entries in
    -- the last iteration, and that is not allowed.  Undo the previous
    -- change.
    IF cur.merged_start IS NOT NULL AND
        cur.merged_start < last_end AND
        undo_end IS NOT NULL AND undo_val IS NOT NULL THEN
      UPDATE merged SET stats_end = undo_end, val = undo_val
        WHERE id = last_id;
      undo_end := NULL;
      undo_val := NULL;

...

My guess is that depending on how the descriptors arrive we are counting for some of the flakeys, because we are not overwriting the record as we would on onionoo document files.

But after 2022-02-01 (after the beginning of load balancing on the snowflake bridge), the metrics.tpo series diverges. It is no longer proportional to the number of directory requests; it's usually bigger. But then, there is a stretch between 2022-02-28 and 2022-03-09 where metrics.tpo and Onionoo are exactly equal again, just as they were at the left side of the graph.

What could cause the metrics.tpo estimate to be different than the Onionoo estimate? There is only one bridge reporting snowflake statistics. I've looked over the reproducible metrics document, and I don't find anything that could cause a difference other than frac, which has been removed from this graph.

As a data point, I sat down on the weekend tweaking your excellent scripts (thanks!) just checking how things would look like for a random obfs4 bridge. I picked dktoke and for that one Onionoo and metrics.tpo give out essentially the same values (as one would expect):

So, I guess the issues you see might not be related to recent code changes on our side at least.

mentioned in issue analysis#40012 (closed)

I am testing patch where I also take into account nicknames and not just fingerprints on: https://metrics.hiro.network/userstats-bridge-country.html

@dcf I have run your code with the data generated from my patch and here are the results:

It seems on some days I am counting way more users. This could either be a mistake in my code, or something else in the metrics codebase that I didn't consider.

The patch is running live on: https://metrics.hiro.network/userstats-bridge-country.html

The idea is that the tuple (fingerprint, nickname) is used here to count stats instead of overwriting the data for the different flakeys.

I have run the comparison again. This time I added metrics.tpo values.

The data for my patch is showing spikes that we don't see in metrics.tpo. So this might be just something I have introduced when adding the nickname to the count.

It's confusing to me why the metrics.hiro and manual data are equals on some days except for the spikes.

I have run some more analysis and here is a summary.

I have calculated the ratio between metrics.torproject.org data, and respectively: metrics.hiro.network, manual flakey count and onionoo.

date	metrics.hiro	flakey manual	onionoo
2022-05-14	2.560865353567969	2.1018953826283497	0.5305134000645786
2022-05-15	3.737969676994067	2.1249225444957154	0.5286750164798946
2022-05-16	2.2173486088379706	2.1116955810147298	0.5250409165302782
2022-05-17	2.6631422435156122	2.133057987774657	0.5242028746076326
2022-05-18	2.2255566311713455	2.09270732494353	0.5301710229106164
2022-05-19	4.62880658436214	3.5715006858710563	0.8957475994513031
2022-05-20	6.599210526315789	3.5134473684210525	0.8742105263157894
2022-05-21	2.22069825436409	2.0847568578553615	0.5099750623441397
2022-05-22	2.2286073166902183	2.110844716596012	0.524886167373214
2022-05-23	2.244562666249413	2.0959505554686277	0.5241746205601627
2022-05-24	2.703240058910162	1.9187083946980854	0.4798232695139912
2022-05-25	2.2493498049414824	2.127140767230169	0.5352730819245773
2022-05-26	2.2245245090459256	2.0741039121694755	0.526828513994124
2022-05-27	2.2689687209662432	2.082331991328585	0.5252400123877361

I guess the ratio is not always the same (give or take) because in the case of our current metrics.tpo and onionoo logic it might depend on which status is overwritten.

Hence I thought to look at the onionoo ratio between respectively: metrics.hiro.network, manual flakey count and metrics.tpo

date	metrics.hiro	flakey manual	metrics
2022-05-14	4.827145465611686	3.962002434570907	1.8849665246500305
2022-05-15	7.070448877805486	4.019336034912718	1.8915211970074812
2022-05-16	4.223192019950124	4.021963840399002	1.9046134663341645
2022-05-17	5.080365584620234	4.069145918688938	1.9076583674755752
2022-05-18	4.19780888618381	3.9472306755934268	1.8861838101034694
2022-05-19	5.167534456355283	3.987173047473201	1.116385911179173
2022-05-20	7.5487658037326915	4.018994581577363	1.143889223359422
2022-05-21	4.354523227383863	4.087958435207824	1.960880195599022
2022-05-22	4.245886927909064	4.02152856715525	1.9051749925216872
2022-05-23	4.282089552238806	3.998573134328358	1.9077611940298507
2022-05-24	5.633824432166974	3.99878146101903	2.0841006752608964
2022-05-25	4.202247191011236	3.973935621014273	1.868205283935621
2022-05-26	4.222483122982096	3.936962136777223	1.8981508658643969
2022-05-27	4.319870283018868	3.9645341981132076	1.9038915094339623

I noticed that the ratio with metrics.hiro to onionoo is closer to the ratio of the manual count. So I started to wonder if we have some issue in how we handle time intervals.

I hence printed the descriptors distribution with the same R script from @dcf :

There is some correlation between descriptors being more or less sparse and when we have spikes in the data. Then I realized that in the manual code we "Keep only the most recent "published" for each "end"" and discard previous descriptors. In metrics.tpo instead we go through a rather complicated merge and combine operation in the database to sum up adjacent intervals. I have started documenting this here: https://gitlab.torproject.org/tpo/network-health/team/-/wikis/metrics/website/clients-stats but otherwise this logic is contained in the DB code: https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/master/src/main/sql/clients/init-userstats.sql#L362

I was hence wondering if we can make the flakeys publish more descriptors and see if we have data spikes. What do you think @dcf?

Then I was also wondering why this happens only with snowflake? Do the flakeys publish more descriptors than any other relay/bridge? I wonder if we just happen to notice with snowflake because we have a way to check the data from flakeys but we aren't doing the same with other relays/bridges.

I have calculated the ratio between metrics.torproject.org data, and respectively: metrics.hiro.network, manual flakey count and onionoo.

I am not understanding this table. On 2022-05-22, the graph shows manual and metrics.hiro having the same value, about 13500. But the ratios are different, 2.229 and 2.111 respectively. If the numerators and denominators are the same, the ratios should be the same. And if I estimate metrics.tpo = 6000 from the graph, I get that the ratio for both should be 13500 / 6000 = 2.250. What am I missing?

In this head scratching issue I see no spikes in the last 4 days.

I hence printed frequencies of interval end in descriptors and found four spikes round the time we also have spikes in the data:

Linking #40034 here. Could give some hints.

userstats-bridge-country graph is undercounting users

Designs

Child items 0

Activity