 
Hi all. I've only been on this list for a relatively short while so this is most likely in an FAQ I should be digging thru...

I've noticed that the uptime for some of the physical probes we've installed is very short. According to the control panel, one in particular has connect times of around 2-3 minutes and disconnect times of 4-6 minutes. This cycle is repeated every 10 minutes or so. All day, every day.

Superficially that sort of pattern suggests some sort of continuous reboot loop. However, if I ping the probe remotely, I get just an occasional packet loss. Certainly no long gaps that you'd expect from a continuous reboot.

That I can ping it remotely also suggests that external connectivity isn't the problem. Besides, the periodicity is just too regular and frequent for a routing loss or similar. (In any event, if the network it's on is losing connectivity every 5 minutes I would have plenty of other symptoms.)

The problem probe is also reported as getting UDMs assigned and running them so I'm not quite sure how it manages to squeeze those tests in if that connectivity status report is accurate.

So my main question is, should I be concerned about the control panel connectivity reports? And if I should be, is there much I can do beyond ensuring the device has good connectivity?

Mark.
 
Hi,

On 2013/11/20 16:32, Mark Delany wrote:
> I've noticed that the uptime for some of the physical probes we've installed is very short. According to the control panel, one in particular has connect times of around 2-3 minutes and disconnect times of 4-6 minutes. This cycle is repeated every 10 minutes or so. All day, every day.
> Superficially that sort of pattern suggests some sort of continuous reboot loop.

This is not normal. The probe is not rebooting, but it is constantly losing its connection to the controller. And it doesn't seem to report any results.

I'll send it to a different controller to see if it makes a difference.

Philip
 
On 20Nov13, Philip Homburg allegedly wrote:
> Hi,
> On 2013/11/20 16:32, Mark Delany wrote:
>> I've noticed that the uptime for some of the physical probes we've installed is very short.
> This is not normal. The probe is not rebooting, but it is constantly losing its connection to the controller. And it doesn't seem to report any results.
> I'll send it to a different controller to see if it makes a difference.

You probably know this Philip, but that made an instant difference. That probe is now reporting being up 100% since the switch.

Just out of curiosity, do you have a theory on why switching controllers makes a difference? If it's a bit of a mystery at the controller end, is there anything we can do to help diagnose the problem at the probe end? As you can see I have a nicely reproducible situation if it's of use.

Mark.
 
Hi Mark,

On 2013/11/20 20:50, Mark Delany wrote:
> You probably know this Philip, but that made an instant difference. That probe is now reporting being up 100% since the switch.
> Just out of curiosity, do you have a theory on why switching controllers makes a difference?

We have some ideas of what it might be. I'll look if I can find out what is going on from our end.

Philip
 
On 2013/11/21 11:05, Philip Homburg wrote:
> Hi Mark,
> On 2013/11/20 20:50, Mark Delany wrote:
>> You probably know this Philip, but that made an instant difference. That probe is now reporting being up 100% since the switch.
>> Just out of curiosity, do you have a theory on why switching controllers makes a difference?
> We have some ideas of what it might be. I'll look if I can find out what is going on from our end.

The cause was... incorrect MSS clamping (together with broken Linux Ethernet drivers).

First a bit of background information. In a small fraction of the Internet, path MTU discovery does not work. Unfortunately, Linux does not have PMTU blackhole detection enabled by default. This causes some number of probes to fail, mostly probes that connect over IPv6. The failure mode is that the probes connect fine, but when they want to report results the probe hits the PMTU blackhole. The connection times out, the probe connects again and the same thing happens. Over and over again.

One way out of this is to tell the probe host to fix the PMTU problem. But we cannot be sure that it is a PMTU problem, and the probe host may not be able to fix it.

An easy way around this problem is reducing the MSS: if the controller sends a smaller MSS to the probe then the PMTU blackhole can be avoided. A quick and dirty way to cause this lower MSS to be sent is to lower the MTU on the controller's interface. So after verifying that it works, we started running the controllers with MTU 1400. Problem solved.

Well, not quite. What 'mtu 1400' really does seems to depend on the Ethernet driver. In some cases the controller continues to receive packets up to the normal Ethernet MTU of 1500, it just does not send anything bigger than 1400. In this case, the trick works great. In other cases, and that includes the 'ctr-ams04' controller, the Ethernet driver considers everything above 1400 as a framing error and discards it.

Normally, this does not cause any problems. Controllers almost exclusively use TCP connections and for TCP we have the MSS option to keep the packets smaller than the lowered MTU.

Enter middleboxes. Middleboxes, like home routers, have been doing MSS clamping for years. This way most users never notice PMTU problems. However, in this case, MSS clamping causes the whole thing to fail.
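
The relationship between MTU and MSS described above can be checked with a little arithmetic. The following is a minimal sketch, not code from the controllers: the function names are made up for illustration, and it assumes IPv4 and TCP headers without options (20 bytes each). TCP options such as timestamps are carved out of the MSS, so MSS plus headers is an upper bound on the packet size.

```python
# Fixed header sizes, assuming no IP options.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def mss_for_mtu(mtu: int, ipv6: bool = False) -> int:
    """MSS a host would advertise for an interface with the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


def largest_packet(mss: int, ipv6: bool = False) -> int:
    """Upper bound on the IP packet size a sender produces for a peer MSS."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mss + ip_header + TCP_HEADER


def is_blackholed(mss_seen_by_sender: int, receiver_mtu: int) -> bool:
    """True if full-sized segments exceed what a strict receiver accepts.

    Models a driver that discards every packet larger than its MTU.
    """
    return largest_packet(mss_seen_by_sender) > receiver_mtu


if __name__ == "__main__":
    print(mss_for_mtu(1400))          # controller interface with 'mtu 1400'
    print(mss_for_mtu(1500))          # plain Ethernet
    print(is_blackholed(1360, 1400))  # sender honours the lowered MSS
    print(is_blackholed(1460, 1400))  # sender believes a full Ethernet MSS
```

As long as the sender sees the lowered MSS, nothing it sends exceeds 1400 bytes. The trouble starts only when something on the path makes the sender believe a larger value.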

I got permission from Mark to run tcpdump on his probe, so I can show the packets sent by the controller and how they are received by his probe.

This is what we see on the controller:

    14:01:59.723671 IP probeXXXXXX.53447 > ctr-ams04.atlas.ripe.net.https: Flags [S], seq 1044848773, win 14600, options [mss 1452,sackOK,TS val 622493 ecr 0,nop,wscale 2], length 0
    14:01:59.723696 IP ctr-ams04.atlas.ripe.net.https > probeXXXXX.53447: Flags [S.], seq 4267097689, ack 1044848774, win 13480, options [mss 1360,sackOK,TS val 1898121349 ecr 622493,nop,wscale 7], length 0

The controller receives an MSS of 1452 from the probe, which is a weird number because the probe is connected to Ethernet. So this suggests that MSS clamping is going on and that the probe is connecting over PPPoE. Then the controller responds with an MSS of 1360, which is the MTU of 1400 minus the IPv4 and TCP headers.

At the probe, however, it looks quite different:

    14:01:57.758623 IP probeXXXXX.53447 > ctr-ams04.atlas.ripe.net.https: Flags [S], seq 1044848773, win 14600, options [mss 1460,sackOK,TS val 622493 ecr 0,nop,wscale 2], length 0
    14:01:58.129470 IP ctr-ams04.atlas.ripe.net.https > probeXXXXX.53447: Flags [S.], seq 4267097689, ack 1044848774, win 13480, options [mss 1452,sackOK,TS val 1898121349 ecr 622493,nop,wscale 7], length 0

So the probe actually sent 1460 as expected, but now the MSS of ctr-ams04 is suddenly raised to 1452! The net result is that the probe starts sending packets bigger than 1400, which get dropped by the Ethernet driver on ctr-ams04, and we effectively have a PMTU blackhole.
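
To compare captures like these without reading them by eye, the MSS option can be pulled out of each tcpdump line. This is only a sketch: `extract_mss` is a hypothetical helper, and the regular expression assumes the `options [mss N,...]` text layout seen in the output above. The two sample lines are the SYN-ACK from ctr-ams04 as captured on the controller and on the probe.

```python
import re
from typing import Optional

# Matches the MSS option as tcpdump prints it, e.g. "options [mss 1360,sackOK,...".
MSS_PATTERN = re.compile(r"\bmss (\d+)\b")


def extract_mss(tcpdump_line: str) -> Optional[int]:
    """Return the MSS value printed in a tcpdump line, or None if absent."""
    match = MSS_PATTERN.search(tcpdump_line)
    return int(match.group(1)) if match else None


SYNACK_AT_CONTROLLER = (
    "14:01:59.723696 IP ctr-ams04.atlas.ripe.net.https > probeXXXXX.53447: "
    "Flags [S.], seq 4267097689, ack 1044848774, win 13480, "
    "options [mss 1360,sackOK,TS val 1898121349 ecr 622493,nop,wscale 7], length 0"
)
SYNACK_AT_PROBE = (
    "14:01:58.129470 IP ctr-ams04.atlas.ripe.net.https > probeXXXXX.53447: "
    "Flags [S.], seq 4267097689, ack 1044848774, win 13480, "
    "options [mss 1452,sackOK,TS val 1898121349 ecr 622493,nop,wscale 7], length 0"
)

if __name__ == "__main__":
    sent = extract_mss(SYNACK_AT_CONTROLLER)
    received = extract_mss(SYNACK_AT_PROBE)
    print(f"controller sent mss {sent}, probe received mss {received}")
    if sent != received:
        print("MSS option was rewritten somewhere on the path")
```

The sequence and acknowledgement numbers are identical in both lines, which is what identifies them as the same packet seen at two points on the path.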

To make sure I assign blame to the right party (after all, the NCC also has firewalls, etc.) I also captured the same exchange for a probe at my home.

First on ctr-ams04:

    15:08:10.852178 IP probeYYYYY.52323 > ctr-ams04.atlas.ripe.net.https: Flags [S], seq 2626247187, win 14600, options [mss 1460,sackOK,TS val 61346 ecr 0,nop,wscale 2], length 0
    15:08:10.852203 IP ctr-ams04.atlas.ripe.net.https > probeYYYYY.52323: Flags [S.], seq 1208489868, ack 2626247188, win 13480, options [mss 1360,sackOK,TS val 1902092478 ecr 61346,nop,wscale 7], length 0

And then on the probe:

    15:08:09.021948 IP probeYYYYY.52323 > ctr-ams04.atlas.ripe.net.https: Flags [S], seq 2626247187, win 14600, options [mss 1460,sackOK,TS val 61346 ecr 0,nop,wscale 2], length 0
    15:08:09.039768 IP ctr-ams04.atlas.ripe.net.https > probeYYYYY.52323: Flags [S.], seq 1208489868, ack 2626247188, win 13480, options [mss 1360,sackOK,TS val 1902092478 ecr 61346,nop,wscale 7], length 0

Finally, it came as a surprise to me that the probe connected just fine when I sent it to our test controller, which is ctr-ams01. Ctr-ams01 is in the same network as ctr-ams04, so I did not expect it to make a difference. It turns out that ctr-ams01 is a virtual machine and the driver for VMware does not have a problem with packets bigger than 1400.

Philip
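
The comparison across the two probes can be summarised by looking at each handshake packet's MSS at the sender and at the receiver. The sketch below is illustrative only: the `MssObservation` type and `classify` function are invented for this example, and the values are the ones from the four captures above. It needs a capture at both ends, so it is not something a probe could run on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MssObservation:
    label: str
    sent: int      # MSS in the packet as captured at the sender
    received: int  # MSS in the same packet as captured at the receiver


def classify(obs: MssObservation) -> str:
    """Describe what happened to the MSS option between the two capture points."""
    if obs.received == obs.sent:
        return "unchanged"
    if obs.received < obs.sent:
        return "lowered in transit"
    return "raised in transit"


# MSS values read from the tcpdump output for Mark's probe and the home probe.
OBSERVATIONS = [
    MssObservation("Mark's probe, SYN to ctr-ams04", sent=1460, received=1452),
    MssObservation("Mark's probe, SYN-ACK from ctr-ams04", sent=1360, received=1452),
    MssObservation("home probe, SYN to ctr-ams04", sent=1460, received=1460),
    MssObservation("home probe, SYN-ACK from ctr-ams04", sent=1360, received=1360),
]

if __name__ == "__main__":
    for obs in OBSERVATIONS:
        print(f"{obs.label}: {obs.sent} -> {obs.received} ({classify(obs)})")
```

Lowering an MSS in transit is ordinary clamping and is harmless here. The case that breaks things is the MSS being raised, because the sender is then told it may use packets larger than the receiver will accept.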
 
Never a boring moment in network computing :-(

Kudos for figuring it out.

Daniel

On 22.11.2013, at 16:11, Philip Homburg <philip.homburg@ripe.net> wrote:
> The cause was... incorrect MSS clamping (together with broken Linux Ethernet drivers). ...
 
On 2013/11/22 17:31, Daniel Karrenberg wrote:
> Never a boring moment in network computing :-(
> Kudos for figuring it out.

I had help from Robert, Colin, and our whiteboard.

That triggers the question of whether we should add detection mechanisms for (broken) middleboxes to Atlas...

Philip

participants (3)
- Daniel Karrenberg
- Mark Delany
- Philip Homburg