Here is the scenario that has been encountered by asn. The following is in chronological order:
[02:21:27] An intro point (IP) is picked and circuit A is launched.
[02:23:47] Circuit A never completes and thus is marked for close which is NOT free thus still in our HS circuitmap.
[02:23:48] A second later, the HS housekeeping kicks in (every 1 sec) and we notice that we are missing a circuit for this IP so we launch a new circuit B. What happens then is this new circuit B takes the place of circuit A in the HS circuitmap with the same hs_token (the token in this case is the IP authentication public key).
At this point, we have circuit A marked for close and we have circuit B that was launched and replaced circuit A with the same hs_token which identify the IP. So far so good.
[02:23:48] At the same second, we get a circuit_build_failed() which is only called from circuit_about_to_free()->circuit_close_all_marked() which means it is freeing circuit A leading to a hs_circuitmap_remove_circuit() which will use its hs_token to remove circuit B because it is the same token.
We end up with no circuit for the intro point IP in our circuitmap.
[02:23:49] A second after our circuit build failed which removed circuit B from the circuitmap, we cleanup the IP from our list through the house keeping and we launch a new circuit to a new intro point. The reason why we removed the IP is either because it has expire, no node_t could be found or we retried too many times on that circuit which the allowed number of retries is 3 right now. When we do remove an IP from our service list, we also try to close the circuit but because circuit B got removed from the circuitmap, we can't find it thus it is not closed.
[02:24:54] Circuit B completes but we are unable to find the IP in our service list. And boom we get the warning:
In summary, we end up with an intro circuit being established but not in our circuitmap leading to the failed lookup when we remove that IP from our service list for whatever reason.
Seems like one thing we could do is actually remove a circuit from the HS circuitmap when we mark for close because at that point, the circuit is good as dead. We avoid this race between mark for close and actually being freed where in between a new circuit to the same destination with the same hs_token can be launched.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items
0
Link issues together to show that they're related.
Learn more.
Trac: Description: Here is the scenario that has been encountered by asn. The following is in chronological order:
[02:21:27] An intro point (IP) is picked and circuit A is launched.
[02:23:47] Circuit A never completes and thus is marked for close which is NOT free thus still in our HS circuitmap.
[02:23:48] A second later, the HS housekeeping kicks in (every 1 sec) and we notice that we are missing a circuit for this IP so we launch a new circuit B. What happens then is this new circuit B takes the place of circuit A in the HS circuitmap with the same hs_token (the token in this case is the IP authentication public key).
At this point, we have circuit A marked for close and we have circuit B that was launched and replaced circuit A with the same hs_token which identify the IP. So far so good.
[02:23:48] At the same second, we get a circuit_build_failed() which is only called from circuit_about_to_free()->circuit_close_all_marked() which means it is freeing circuit A leading to a hs_circuitmap_remove_circuit() which will use its hs_token to remove circuit B because it is the same token.
We end up with no circuit for the intro point IP in our circuitmap.
[02:23:49] A second after our circuit build failed which removed circuit B from the circuitmap, we cleanup the IP from our list through the house keeping and we launch a new circuit to a new intro point. The reason why we removed the IP is either because it has expire, no node_t could be found or we retried too many times on that circuit which the allowed number of retries is 3 right now. When we do remove an IP from our service list, we also try to close the circuit but because circuit B got removed from the circuitmap, we can't find it thus it is not closed.
[02:24:54] Circuit B completes but we are unable to find the IP in our service list. And boom we get the warning:
In summary, we end up with an intro circuit being established but not in our circuitmap leading to the failed lookup when we remove that IP from our service list for whatever reason.
Seems like one thing we could do is actually remove a circuit from the HS circuitmap when we mark for close because at that point, the circuit is good as dead. We avoid this race between mark for close and actually being freed where in between a new circuit to the same destination with the same hs_token can be launched.
to
Here is the scenario that has been encountered by asn. The following is in chronological order:
[02:21:27] An intro point (IP) is picked and circuit A is launched.
[02:23:47] Circuit A never completes and thus is marked for close which is NOT free thus still in our HS circuitmap.
[02:23:48] A second later, the HS housekeeping kicks in (every 1 sec) and we notice that we are missing a circuit for this IP so we launch a new circuit B. What happens then is this new circuit B takes the place of circuit A in the HS circuitmap with the same hs_token (the token in this case is the IP authentication public key).
At this point, we have circuit A marked for close and we have circuit B that was launched and replaced circuit A with the same hs_token which identify the IP. So far so good.
[02:23:48] At the same second, we get a circuit_build_failed() which is only called from circuit_about_to_free()->circuit_close_all_marked() which means it is freeing circuit A leading to a hs_circuitmap_remove_circuit() which will use its hs_token to remove circuit B because it is the same token.
We end up with no circuit for the intro point IP in our circuitmap.
[02:23:49] A second after our circuit build failed which removed circuit B from the circuitmap, we cleanup the IP from our list through the house keeping and we launch a new circuit to a new intro point. The reason why we removed the IP is either because it has expire, no node_t could be found or we retried too many times on that circuit which the allowed number of retries is 3 right now. When we do remove an IP from our service list, we also try to close the circuit but because circuit B got removed from the circuitmap, we can't find it thus it is not closed.
[02:24:54] Circuit B completes but we are unable to find the IP in our service list. And boom we get the warning:
In summary, we end up with an intro circuit being established but not in our circuitmap leading to the failed lookup when we remove that IP from our service list for whatever reason.
Seems like one thing we could do is actually remove a circuit from the HS circuitmap when we mark for close because at that point, the circuit is good as dead. We avoid this race between mark for close and actually being freed where in between a new circuit to the same destination with the same hs_token can be launched.
But I found a problem with the theory above: when we call hs_circuitmap_remove_circuit() in step 4, shouldn't we either hit the "Could not find circuit" warning, or fail the "tmp == circ" assertion?
OK I had this happen to me again on my unstable service.
Here is the creation of the circuit:
Sep 23 02:33:29.000 [info] hs_circ_launch_intro_point(): Launching a circuit to intro point $809814F480D046EF977004BD545914BFFA47C06D~uhkakw at 185.148.145.74 for service 4acth47i6kxnvkewtm6q7ib2s3ufpo5sqbsnzjpbi7utijcltosqemad.Sep 23 02:33:29.000 [info] origin_circuit_new(): Circuit 697 chose an idle timeout of 681 based on 652 seconds of predictive building remaining.Sep 23 02:33:29.000 [info] onion_pick_cpath_exit(): Using requested exit node '$809814F480D046EF977004BD545914BFFA47C06D~uhkakw at 185.148.145.74'Sep 23 02:33:29.000 [info] select_entry_guard_for_circuit(): Selected primary guard Unnamed ($3E53D3979DB07EFD736661C934A1DED14127B684) for circuit....
and then after a second:
Sep 23 02:33:30.000 [warn] Unknown introduction point auth key on circuit 3571433146 for service 4acth47i6kxnvkewtm6q7ib2s3ufpo5sqbsnzjpbi7utijcltosqemadSep 23 02:33:30.000 [info] circuit_mark_for_close_(): Circuit 3571433146 (id: 697) marked for close at src/or/hs_service.c:2711 (orig reason: 12, new reason: 0)...Sep 23 02:33:30.000 [info] circuit_free(): Circuit 0 (id: 697) has been freed.
Note the 697 id which is the only identifier. There are no other references to 3571433146 or 697 in the logs.
There are no cleanup_intro_points() logs between creation and destruction (it only took a second). I'm not sure where the circuit went, or maybe the circuitmap was never updated? :/
BTW note that this is an unstable hidden service where internet goes up and down a few times a day. I have 200 intro circuit creations in 3 days.
My cleanup intro point logs always have maxed out retries:
Sep 23 01:46:41.000 [info] cleanup_intro_points(): Intro point $9C900A7F6F5DD034CFFD192DAEC9CCAA813DB022 (retried: 4 times). Removing it.Sep 23 01:46:41.000 [info] cleanup_intro_points(): Intro point $CC865174BE13444B251CCB2F82B3F0F9A614DC53 (retried: 4 times). Removing it.Sep 23 01:46:41.000 [info] cleanup_intro_points(): Intro point $60AEEBDC21E6D5411853BDA03478604FFBA00B20 (retried: 4 times). Removing it.Sep 23 02:09:19.000 [info] cleanup_intro_points(): Intro point $F1D71314781D23800C3C62CB720BBCF449B27399 (retried: 4 times). Removing it.
also i got this once
Sep 23 02:33:30.000 [info] circuit_free(): Circuit 0 (id: 697) has been freed.
Ok, those latest logs were enlightening and 99.9% certainty now :)
The offending function is hs_service_intro_circ_has_closed() called in circuit_about_to_free().
Same scenario has above but just simpler for the sake of the example:
mark for close intro circuits
launch new ones to same intro points.
about_to_free() the circuits in step (1).
--> This calls has_closed() which removes the intro point object if it has been retried too many times.
New circuit opens, boom no intro point object and we get the warning of Unknown auth key....
From asn's log, which suspend/wakeup many times his tor, we happen to try intro points many times (successfully) and in their 18h-24h lifetime, it was done more than 3 times which triggered this condition in the has closed function:
This can only happen if the service launch the new circuits before the free() of the previous circuits happens and the circuit retry counter has reached > 3.
Couple of possible fixes here. First, we should NOT clean anything in that function. We should ONLY do that in cleanup_intro_points() which would have cleanup everything properly (circuit included). For this, we need to move the "remember failing intro point" logic in there in case we've tried too many times.
Second, this circuit_retries counter should probably not be incremented when we just go out of hibernation and tor is assuming at that point that all circuits are not working. We either do a callback from hibernation into the HS subsystem to cleanup those counters or another possible option would be to reset to 0 the counter on a successful intro point establishment.
Agreed on only doing cleanup in cleanup_intro_points().
Not sure about the circuit_retries counter situation. A real hibernation callback is probably not easy to do in a cross-platform way, and resetting the retries count is a recipe for guard discovery attacks. However, IIUC this bug is not as impotant as the cleanup issue.
I pushed bug23603_test with a draft unittest here. It needs improvements.
David, I did not understand your suggested fix for cleanup_intro_points to make it remove intro points that will get cleaned up anyway. I think my confusion stems from the fact that we increase intro point circuit_retries when we launch a circuit and not when we fail it.
This means that there are cases (say network down) where we will launch a 4th retry on an intro point, the retry will succeed, and then we will still clean it up since it reached 4 retries. I think the right way to do this is to increase the retry counter when we fail the circuit in hs_service_intro_circ_has_closed(). No?
I think the suggested fix also creates confusion over the semantics of MAX_INTRO_POINT_CIRCUIT_RETRIES since we would then have in cleanup_intro_points:
This logic looks confusing to me. What's the meaning of max after all? I think all this confusion stems from the fact that circuit_retries gets incremented on launch and not on fail.
Indeed. The circuit_retries is really about "How many time have used this IP to build a circuit" regardless of success or fail. Even if the circuit was taken down 5 min later, we still count that as a "try".
Apart from all the numerous issues with this behavior, one of the bug here is that we reach 3 try which is not above the limit of 4, so we open a new circuit and circuit_retries == 4 which makes us clean it up the second after because we've reached our max...
We should:
Allow the circuit creation up to 3 tries.
Deny circuit creation at > 3 and clean it up at > 3.
Kind of tricky here, so I think the easiest way to do that is that if circuit_retries is 3 and we want to launch, ++ it and deny the launch.
OR we only cleanup the intro point if it doesn't have a circuit established and the counter has reached the max? (Which is what v2 does, cleanups only if it realizes that the retry has reached the max and the last circuit is gone thus a retry is soon about to be launched).
If you have a more elegant solution, I'm all ears?
Indeed. The circuit_retries is really about "How many time have used this IP to build a circuit" regardless of success or fail. Even if the circuit was taken down 5 min later, we still count that as a "try".
Apart from all the numerous issues with this behavior, one of the bug here is that we reach 3 try which is not above the limit of 4, so we open a new circuit and circuit_retries == 4 which makes us clean it up the second after because we've reached our max...
We should:
Allow the circuit creation up to 3 tries.
Deny circuit creation at > 3 and clean it up at > 3.
Kind of tricky here, so I think the easiest way to do that is that if circuit_retries is 3 and we want to launch, ++ it and deny the launch.
OR we only cleanup the intro point if it doesn't have a circuit established and the counter has reached the max? (Which is what v2 does, cleanups only if it realizes that the retry has reached the max and the last circuit is gone thus a retry is soon about to be launched).
If you have a more elegant solution, I'm all ears?
Hmmm, I feel like the first solution you suggested is less racy and more straightforward than the first (at least to my mind). Curious why v2 is doing it the other way. Maybe I have not thought of something?
I'd also like to either rename circuit_retries to circuits_launched, or restore the circuit_retries semantic by only increasing it when we fail and actually retry a circuit. Because right now circuit_retries will also count the first circuit launch which is not a retry. I feel like that will also simplify logic and understanding.