I use tor only as a client, to connect icedove to the tor network with
the Torbirdy extension (on port 9050). With tor version 0.2.8.6 I
can't connect to any mail server or news feed immediately after the pc
wakes up from standby (after a "long" time in standby) and I start icedove. I
have to wait for several minutes before connecting succeeds, and that
timespan seems to be random. This does not occur after a (re)boot.
The first version I remember having this issue is 0.2.8.6-2. I
upgraded from 0.2.7.6-1 to 0.2.8.6-2, so I skipped the alpha and rc
versions and the first upload to unstable. I am very sure that the issue
did not occur in version 0.2.7.6-1, which I used for several months. I can
rule out general network connectivity problems because, for example, I can
start the Tor Browser immediately after standby.
Today I purged tor, installed version 0.2.7.6-1, copied the old "state"
file to /var/lib/tor, and set the pc in standby mode for a couple of
minutes. After waking up from standby I immediately tried to connect to
a mail server, which worked. Then I upgraded step by step to every
version of tor 0.2.8 that I could find on snapshot.debian.org and tried
to connect to a mail server immediately after waking up from standby.
Unfortunately I could not reproduce the bug then. Finally, with version
0.2.8.6-3, the bug occurred again, but only after a "long" standby time
(almost 90 minutes).
Attached are two log files from the weekend and the complete log from
today after the installation of version 0.2.7.6-1.
As you can see, the bug is not easily reproducible, and the logs don't
show any particular reason why tor does not open new circuits
immediately. Please tell me what I can do to give you more information
about the bug.
I am the original bug reporter. Over the last weeks I used version 0.2.7.6 and did not run into the timeout in thunderbird (I guess it is 2 minutes), but I still had to wait for tor to open circuits: tor still did not open new circuits immediately after standby. Until today I never saw this as a bug, but now I think tor should open new circuits immediately, as it does after a (re)boot. Today I tested version 0.2.4.27 and it behaved just like 0.2.7.6.
So in conclusion there is a regression in 0.2.8, but from my point of view the bug has been there for a long time.
Ok, it just happened to me (running "Tor 0.2.9.3-alpha-dev (git-bfaded9143d127cb)"). My tor client has been on an airplane all day, firewalled from the internet. Now I made a request to it, and it just sat there -- it received the request but did not attempt to make any circuits or anything.
I connected over the control port and asked it 'getinfo circuit-status' and it has no circuits open.
It is supposed to say
{{{
log_fn(severity, LD_APP|LD_DIR,
       "Application request when we haven't %s. "
       "Optimistically trying directory fetches again.",
       !router_have_minimum_dir_info() ?
         "used client functionality lately" :
         "received a consensus with exits");
}}}
and then start making circuits and stuff.
It's weird that it didn't even try to make a circuit for the new stream request though.
{{{
Oct 25 18:51:45.580 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.
Oct 25 21:58:43.520 [notice] Your system clock just jumped 11174 seconds forward; assuming established circuits no longer work.
Oct 25 21:58:54.825 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.
}}}
How come my Tor client didn't spring into action after it noticed the clock jump?
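(For context on what "noticed the clock jump" means mechanically, here is a minimal generic sketch of that kind of check -- made-up names and thresholds, not Tor's actual code: a periodic task compares the wall clock against the last time it ran and treats a large gap as a suspend or clock jump.)
{{{
/* Generic clock-jump check, illustrative only. */
#include <stdio.h>
#include <time.h>

#define INTERVAL 1            /* expected seconds between runs */
#define JUMP_THRESHOLD 100    /* tolerate this much extra drift */

static time_t last_run = 0;

static void
periodic_task(void)
{
  time_t now = time(NULL);
  if (last_run && now > last_run + INTERVAL + JUMP_THRESHOLD) {
    printf("Clock jumped %ld seconds forward; assuming established "
           "circuits no longer work.\n",
           (long)(now - last_run - INTERVAL));
    /* ...here a client would mark circuits unusable and, ideally,
     * kick off fresh directory fetches... */
  }
  last_run = now;
}

int
main(void)
{
  periodic_task();   /* first call just records the time */
  /* In a real daemon a once-per-second timer would drive this. */
  periodic_task();
  return 0;
}
}}}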
In 0.2.8, we redesigned the timer table.
And we redesigned how we retrieve new consensuses when our consensus has expired.
Maybe we need to tweak one or more of these things.
I think we implemented exponential backoff in 0.2.9. Not sure if that will make it better or worse.
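(For anyone following along: "exponential backoff" just means each failure multiplies the wait before the next retry, usually with a cap and some jitter. A generic sketch with made-up constants, not the actual 0.2.9 schedule code:)
{{{
/* Generic capped exponential backoff with jitter; constants are made up. */
#include <stdio.h>
#include <stdlib.h>

#define INITIAL_DELAY 1      /* seconds before the first retry */
#define MAX_DELAY     3600   /* never wait longer than an hour */

static int
next_retry_delay(int n_failures)
{
  long delay = INITIAL_DELAY;
  for (int i = 0; i < n_failures && delay < MAX_DELAY; ++i)
    delay *= 2;                       /* double after every failure */
  if (delay > MAX_DELAY)
    delay = MAX_DELAY;
  /* up to 50% random jitter so clients don't retry in lockstep */
  return (int)(delay + (rand() % (delay / 2 + 1)));
}

int
main(void)
{
  for (int f = 0; f <= 8; ++f)
    printf("after %d failures: retry in %d s\n", f, next_retry_delay(f));
  return 0;
}
}}}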
Ok, I spent a while investigating this ticket yesterday. For context, the Tor client on my travel laptop is still stuck in this bug, so I can poke it in various ways to help debug if we think of useful ways. Here is a checkpoint of my thinking process.
There are two mysteries here:
Mystery 1) Why does my Tor client not care to get any new directory information?
It's October 29 now, and my tor's datadirectory contains:
{{{
-rw------- 1 arma arma 1405537 Oct 21 12:14 cached-microdesc-consensus
-rw------- 1 arma arma 3473955 Oct 21 12:43 cached-microdescs
-rw------- 1 arma arma 0 Oct 21 12:43 cached-microdescs.new
}}}
After one of the clock jump events, my Tor tried to fetch a consensus from a relay in the fallbackdir list, then it fell back to trying to fetch a consensus from each of the directory authorities, and when one attempt failed for each of those, it seems to have permanently given up -- and that was days ago. I'm going to leave this mystery for later -- somebody please feel free to pick it up in the meantime.
Mystery 2) Why does my Tor client not recognize the new application (socks) requests and jumpstart itself into trying to fetch new directory stuff, make new circuits, and so on?
My investigations led me to be suspicious of Tor commit b1d56fc58.
Background: in the past a new socks request would come in, and connection_ap_handshake_rewrite_and_attach() would get called on it, which would call connection_ap_handshake_attach_circuit(), which would call circuit_get_open_circ_or_launch(), and that function is the one that does the "omg I don't have enough directory information and I just got a socks request, I should reset everything so we can try again" logic.
But in commit b1d56fc58, we made connection_ap_handshake_rewrite_and_attach() no longer attach it, but rather it now calls connection_ap_mark_as_pending_circuit(), which effectively queues the stream for handling the next time the main loop runs. That function adds the stream to the pending_entry_connections smartlist, and sets untried_pending_connections to 1, and returns.
Then run_main_loop_once() calls connection_ap_attach_pending(0), which goes through and calls connection_ap_handshake_attach_circuit() on each stream that needs it.
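(To make that control flow concrete, here is a toy sketch of the queue-then-drain shape with simplified stand-in types -- not the real connection_t/smartlist machinery. The point is that the socks-request path only queues the stream; if the drain step in the main loop never runs, the stream never gets attached.)
{{{
/* Toy sketch only: simplified stand-ins, not Tor's actual types. */
#include <stdio.h>

#define MAX_PENDING 16

typedef struct { int stream_id; } toy_stream_t;

static toy_stream_t *pending[MAX_PENDING];   /* cf. pending_entry_connections */
static int n_pending = 0;
static int untried_pending = 0;              /* cf. untried_pending_connections */

/* Socks-request path: queue the stream, don't attach it yet. */
static void
mark_as_pending(toy_stream_t *s)
{
  if (n_pending < MAX_PENDING)
    pending[n_pending++] = s;
  untried_pending = 1;
}

/* Main-loop path: try to attach every queued stream. If this never
 * runs (e.g. because the event loop never comes back around), queued
 * streams just sit there forever. */
static void
attach_pending(void)
{
  if (!untried_pending)
    return;
  untried_pending = 0;
  for (int i = 0; i < n_pending; ++i)
    printf("attaching stream %d to a circuit\n", pending[i]->stream_id);
  n_pending = 0;
}

int
main(void)
{
  toy_stream_t s = { 42 };
  mark_as_pending(&s);  /* a new socks request arrives */
  attach_pending();     /* ...but it only gets attached when the loop runs */
  return 0;
}
}}}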
But when I gdb attach to my unhappy Tor client and set a breakpoint on connection_ap_attach_pending and then do a new socks request, it seems that it never calls the function!
The earlier parts of connection_ap_handshake_rewrite_and_attach() are being called; for example, rep_hist_note_used_port() gets called just fine. So I think the stream is being added to pending_entry_connections, and then somehow things aren't working after that.
Mystery 1) Why does my Tor client not care to get any new directory information?
This sounds like an issue with the consensus download implementation. Here are my questions:
has your tor marked each of the directory authorities down?
has it marked all of the fallback directories down?
has it exceeded the download schedule for downloading consensuses?
There are two kinds of download_status_t in 0.2.8, one retries on a schedule regardless of success or failure, and the other retries on failure. It seems there might be a problem with the first kind (which is new in 0.2.8, and used to download the consensus).
I'd like to know what your consensus download_status_t has for all its fields, particularly the attempt and failure counts. It may well have exceeded the 255 failure count limit.
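(Purely illustrative -- simplified names, not the real download_status_t definition -- but this is the shape of the concern: 8-bit attempt/failure counters that saturate, and retry logic that gives up once the limit is hit unless something explicitly resets the record.)
{{{
/* Illustrative sketch only; field names are made up. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define MAX_FAILURES 255

typedef struct {
  uint8_t n_download_failures;   /* saturates at 255 */
  uint8_t n_download_attempts;
  time_t  next_attempt_at;
} toy_download_status_t;

static void
note_failure(toy_download_status_t *dls)
{
  if (dls->n_download_failures < MAX_FAILURES)
    dls->n_download_failures++;
}

static int
has_given_up(const toy_download_status_t *dls)
{
  /* Once the counter pegs at the limit, nothing retries again
   * unless the record is reset. */
  return dls->n_download_failures >= MAX_FAILURES;
}

/* The kind of reset a "new stream request" jumpstart would need to do. */
static void
reset_status(toy_download_status_t *dls)
{
  dls->n_download_failures = 0;
  dls->n_download_attempts = 0;
  dls->next_attempt_at = 0;
}

int
main(void)
{
  toy_download_status_t consensus_dl = { 0, 0, 0 };
  for (int i = 0; i < 300; ++i)
    note_failure(&consensus_dl);
  printf("failures=%u given_up=%d\n",
         (unsigned)consensus_dl.n_download_failures,
         has_given_up(&consensus_dl));
  reset_status(&consensus_dl);
  printf("after reset: given_up=%d\n", has_given_up(&consensus_dl));
  return 0;
}
}}}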
I also wonder how #8625 (moved) (commit 09a0f2d) affects this, but it is only in 0.2.9, and I'm not sure it would make it any better.
Mystery 2) Why does my Tor client not recognize the new application (socks) requests and jumpstart itself into trying to fetch new directory stuff, make new circuits, and so on?
This might be related to the above bug, or it might not be.
This code would need to reset the authority and fallback failures, and also the download status for the consensus.
Mystery 1) Why does my Tor client not care to get any new directory information?
To be clear this is "Tor 0.2.9.3-alpha-dev (git-bfaded9143d127cb)" for me. So we are definitely in 0.2.9 territory.
But, I am going to move the mystery 1 investigation over to #20499 (moved), which looks like it is the same mystery over there. Then we can focus on mystery 2 in this ticket -- i.e. regardless of what our directory download situation is, Tor is supposed to wake up and reset stuff when it gets a new stream request and it doesn't have enough directory information to handle it.
My investigations led me to be suspicious of Tor commit b1d56fc58.
[...]
Then run_main_loop_once() calls connection_ap_attach_pending(0), which goes through and calls connection_ap_handshake_attach_circuit() on each stream that needs it.
But when I gdb attach to my unhappy Tor client and set a breakpoint on connection_ap_attach_pending and then do a new socks request, it seems that it never calls the function!
Still suspicious of commit b1d56fc58.
Is it possible that event_base_loop() in run_main_loop_once() is never finishing for me? After all, it has a bunch of callbacks, like second_elapsed_callback() and check_descriptor_callback() and so on, which all schedule themselves to get put back in the queue once they're done. Does this event_base_loop function keep running until there are no more events?
I see some ways it can choose to finish, like if flags is non-zero, or if connection_start_reading_from_linked_conn() calls tor_event_base_loopexit(), but neither of those apply to me here.
Maybe set a breakpoint right after the call to event_base_loop() and see if we reach there? Though we should be reaching it if we get any incoming connections, since they count as events.
I set breakpoints all over run_main_loop_once() and they never catch. Maybe my gcc has optimized the function so it's hard to gdb on, but it looks suspiciously like event_base_loop(base, 0) just never finishes.
Where, in the handling of incoming connection events, do we tell event_base_loop that it should finish?
In theory, calls to event_base_loopbreak() or event_base_loopexit() will make event_base_loop() stop. But yeah, we don't seem to do that at an appropriate time.
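(A tiny self-contained libevent 2.1+ demo of exactly those semantics -- not Tor code: with flags == 0 the loop keeps dispatching for as long as something keeps re-adding events, and only an explicit event_base_loopexit()/event_base_loopbreak() makes it return.)
{{{
/* Minimal libevent demo: a self-rescheduling timer keeps
 * event_base_loop(base, 0) from returning on its own. */
#include <stdio.h>
#include <sys/time.h>
#include <event2/event.h>

static int ticks = 0;

static void
tick_cb(evutil_socket_t fd, short what, void *arg)
{
  struct event *self = arg;
  struct timeval one_sec = { 1, 0 };
  (void)fd; (void)what;
  printf("tick %d\n", ++ticks);
  evtimer_add(self, &one_sec);   /* always reschedule, much like
                                    second_elapsed_callback() does */
  if (ticks >= 3) {
    /* The timer is still pending, so with flags == 0 the loop would
     * never return by itself; only this explicit loopexit stops it. */
    event_base_loopexit(event_get_base(self), NULL);
  }
}

int
main(void)
{
  struct event_base *base = event_base_new();
  struct event *timer = evtimer_new(base, tick_cb, event_self_cbarg());
  struct timeval one_sec = { 1, 0 };
  evtimer_add(timer, &one_sec);
  event_base_loop(base, 0);      /* returns only because of loopexit above */
  event_free(timer);
  event_base_free(base);
  return 0;
}
}}}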
Okay; I think that branch bug19969_028 in my public repository may have a fix here. If we take this one, we should add more tickets in 0.3.0 to clean up this code and simplify the control flow; I tried to do this with as little collateral damage as possible.
For a simpler fix, we could add a connection_ap_attach_pending() call to second_elapsed_callback().
Trac: Status: new → needs_review; Keywords: regression deleted; regression, 029-backport, 028-backport added
Awesome. It builds for me and runs in a trivial test. The code looks plausible -- though I admit that there could be bugs in the way the patch calls the libevent stuff and I wouldn't know it.
I think you're right that the patch should ultimately make its way into 0.2.8 -- this is the current stable, and folks on crummy network connections are really hurting from this regression (if I were them I would be sticking to 0.2.7 at present, which is a poor idea for other reasons).
To be thorough, if this bug is what we think it is, then there are cases where the rest of the things in run_main_loop_once() never get called. Those are:
{{{
/* All active linked conns should get their read events activated,
 * so that libevent knows to run their callbacks. */
SMARTLIST_FOREACH(active_linked_connection_lst, connection_t *, conn,
                  event_active(conn->read_event, EV_READ, 1));

called_loop_once = smartlist_len(active_linked_connection_lst) ? 1 : 0;
}}}
Does this one matter? It might. But since it's only about linked conns, those will probably only fire when there's other activity anyway? Worth thinking more about before we brush it off.
{{{
/* Make sure we know (about) what time it is. */
update_approx_time(time(NULL));
}}}
No worries, this one happens in second_elapsed_callback() too.