Over the past couple of weeks I’ve been experiencing a strange problem with one of our clients’ networks. They paid for a 5Mbps connection but had the ISP temporarily open up the pipe to their DR site to 100Mbps so that the initial replication of servers would move a lot faster. The problem was that we weren’t getting anything close to 100Mbps whenever we ran speed tests. Results averaged in the 40Mbps range and sometimes dropped as low as 18Mbps.
Before I go any further, let me give you a brief overview of the current design. The client has their own /24 block and ASN. The same ISP that provides their primary internet at HQ also provides internet at the DR site, along with the leased line that connects the two sites. eBGP peerings have been established with the ISP at each site, along with an iBGP peering between HQ and the DR site. They’re also using EIGRP as their IGP.
After the above configurations were completed I performed some initial tests. We were receiving a partial table at the DR site while only getting a default route at HQ. To prevent the upstream ISP from sending the full table unexpectedly, I configured a prefix list at both sites to only allow a default route in. Further tests were performed to ensure the IGP was routing properly and that both sites were learning routes from each other. The issue arose when we were ready to test replication of the servers at HQ to the DR site. The ISP informed us that they had opened up the pipe to 100Mbps and that we could begin replication. However, a number of bandwidth tests on Speedtest didn’t agree with what they had told us. Speeds fluctuated greatly but never got anywhere close to 100Mbps. This was where the troubleshooting process began, and it is what highlighted my BGP design #FAIL.
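For anyone who wants to see the logic of that default-only filter spelled out, here is a minimal Python sketch. It is a model of the behaviour, not router configuration: the helper name `allow_default_only` is made up for illustration, and the sample prefixes are documentation ranges rather than anything from the client’s real table.

```python
import ipaddress

# The only prefix the inbound filter accepts: the IPv4 default route.
DEFAULT_ROUTE = ipaddress.ip_network("0.0.0.0/0")


def allow_default_only(received):
    """Return only the default route from a list of received prefixes.

    Hypothetical helper: models an inbound prefix list that permits
    0.0.0.0/0 exactly and denies everything else.
    """
    return [p for p in received if ipaddress.ip_network(p) == DEFAULT_ROUTE]


if __name__ == "__main__":
    # Sample partial table; the non-default prefixes are documentation ranges.
    partial_table = ["0.0.0.0/0", "192.0.2.0/24", "198.51.100.0/24"]
    print(allow_default_only(partial_table))  # ['0.0.0.0/0']
```

Matching on the exact prefix matters here: a filter that merely matched "anything inside 0.0.0.0/0" would let the whole table through.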
Now, there could be any number of reasons why I wasn’t getting the desired results. So how do you go about troubleshooting such an issue? A carefully planned process and my good old buddy Google. That’s how! After testing on my laptop, the next step was to test from some other desktops and servers in the network. This would take my POS laptop, which freezes a zillion times a day, out of the equation. The tests performed on the servers yielded the same results.
OK, on to the next set of tests. Before performing any more tests internally, I decided it would be more productive to start at the ISP’s fibre switch and work my way into the network. It really didn’t make sense to run a bunch of tests on the internal network when the problem might be with the ISP’s switch, the border router or even the PIX (shoot me now), all of which sit inline before the internal network. The link comes in from the ISP via fibre into their switch, which connects to the border router via Ethernet. The border router then connects to the PIX.
To confirm I was getting the promised bandwidth from the ISP, I connected my laptop directly to the fibre switch, gave it the IP address used for the BGP peering and ran the tests. Results were good. I was getting 98Mbps on average, which proved that we were indeed getting the full bandwidth. The next step was to reconnect the fibre switch to the border router and then connect my laptop to the router. The results of these tests weren’t promising.
So now I’m baffled. Could it be the border router? It can’t possibly be! It’s a brand new 2901 ISR Generation 2 and, as you know, the new ISRs are built to handle a lot more bandwidth. Time to hit up my friend Google. I found another user with the exact same model of router and the exact same design, using a fibre switch from their upstream ISP, who was having the exact same problem.
A lot of users were saying it could be a duplex/speed issue with the ports, but this wasn’t the case for me as both speed and duplex had negotiated correctly. I saw @jastorino on Skype around the same time I was researching the issue, so I decided to pitch my problem to him and see what input he had. One idea that came up was that the ISP might be rate limiting based on the source address of the traffic. This made sense: when I connected directly to the ISP’s switch using their IP I got full bandwidth, but when I connected to the router using IPs from the client’s block I didn’t. The ISP denied that they were rate limiting as I was suggesting. Still, at this point I knew I must be on to something. Why would I get full bandwidth when traffic was sourced from the ISP’s IP but not when it was sourced from the client’s IP? Something’s not right.
Moving on, I checked how the client’s block was being seen from a couple of BGP looking glass servers. A traceroute to an IP at HQ revealed something quite interesting. The path traffic was taking to reach HQ was through the DR site. This meant that traffic returning to the HQ network was not taking the same path as the traffic leaving it. This led to a loud outburst of ooooooooooooooooooooh while at my desk :). Everyone around me wanted to know what happened.
So now I’ve found the problem! The internet-facing link at the DR site hadn’t been opened up to 100Mbps, while the leased line had. Return traffic was coming in through the DR site and was being limited by the DR’s internet link. How do I fix it? My first thought was to split the /24 block into two /25s and reconfigure BGP to advertise one from each site to the upstream. This would allow return traffic to take the same path as the exit traffic, depending on which block it belonged to. Unfortunately, because ISPs generally won’t accept or advertise any prefix longer (more specific) than a /24, this wasn’t an option. AS path prepending was the next fix to try for this scenario.
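Prepending the client’s AS three times to the advertisements sent from the DR site made that AS path appear much longer than the one advertised from HQ. Because BGP’s path selection process prefers the shorter AS path when the higher-priority attributes are equal, return traffic now chooses the HQ site.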
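The effect of the prepend can be sketched in a few lines of Python. This is a deliberately simplified model: it assumes every attribute that BGP checks before AS path length (weight, local preference and so on) is equal, so the shortest AS path wins. Real best-path selection has more steps than this. The ASN 64496 is a reserved documentation number standing in for the client’s real ASN, and `prepend` and `best_path` are hypothetical helpers, not anything from a router or library.

```python
CLIENT_AS = 64496  # documentation ASN standing in for the client's real ASN


def prepend(as_path, asn, times):
    """Return a new AS path with `asn` prepended `times` times."""
    return [asn] * times + list(as_path)


def best_path(candidates):
    """Pick the site whose advertisement has the shortest AS path.

    Simplification: assumes all earlier tie-breakers are equal.
    """
    return min(candidates, key=lambda site: len(candidates[site]))


if __name__ == "__main__":
    # What the upstream sees for the client's /24 from each site.
    adverts = {
        "HQ": [CLIENT_AS],
        "DR": prepend([CLIENT_AS], CLIENT_AS, 3),
    }
    print(adverts["DR"])       # [64496, 64496, 64496, 64496]
    print(best_path(adverts))  # HQ
```

Without the prepend both paths have length 1, so the decision falls to later tie-breakers, which is presumably how the DR site ended up attracting the return traffic in the first place.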
I know this probably isn’t the best solution, but it worked just fine for me. I’m by no means an expert at BGP. However, working on this project taught me a lot about the protocol, and I will be sure to take those lessons into consideration in future designs.