Well, if you’ve gone through all the previous competencies in order to get here, you’re probably feeling pretty good about this one! I would think that most system engineers ready for this exam are comfortable with this area. I’m not sure which is more frequent between operating system and networking issues, but the fact that it’s hard to decide should say something.
Anyway, the potential for network issues is vast. The best way to get a good grasp of the scope of operation here is to hit up the ol’ OSI model. This site is nice and fun. If you work up the model from the bottom, you quickly zero in on the problem. However, I think it’s worth saying that a more instructive high-level model comes into view when you collapse the upper OSI layers into one another, which gives you the following:
- The Physical Layer (the actual cabling and/or wireless communication standards)
- The Data Link Layer (switching and access point devices)
- The Network Layer (the Internet Protocol and associated routing devices, including the Linux kernel)
- The Transport Layer (Transmission Control Protocol, User Datagram Protocol)
- The Application Layer (HTTP, FTP, SSH, SMTP, etc. etc.)
This model is closer to what is meant by “TCP/IP stack” (that is, the stack of hardware and software used to provide services at each layer) – the OSI model’s upper layers are now typically serviced by single protocols such as those listed, so administratively, they can be considered together from a high level. With modern protocols, we don’t typically suffer from protocol-based failures, so isolating the particular service layer in which a protocol failure occurred is usually beyond the scope of an engineer’s care – that’s for the protocol design and implementation teams. Understanding the technical aspects of various application layer protocols is useful for other reasons, of course, but not generally for network troubleshooting.
Understanding that each layer provides services to the layer directly above it (the list above runs from lowest to highest) helps to narrow down the problem. If IP traffic can’t be routed to the destination, the problem is at the network layer (layer three) or below, and one needn’t scrutinize the application until that matter is resolved. Working from the foundation upwards through this model, as was previously noted, is probably the highest level view of network troubleshooting.
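The bottom-up idea can be sketched as a tiny shell helper that runs one check per layer and reports the first layer that fails. This is only an illustration: the name first_failing_layer is made up, and the commands in the example call (and the 192.0.2.10 address, which comes from the documentation range) are placeholders for whatever checks suit your environment.

```shell
# Run checks from the lowest layer upward and report the first layer that fails.
# Arguments come in pairs: a label for the layer, then a command line that tests it.
first_failing_layer() {
    while [ "$#" -ge 2 ]; do
        label="$1"
        check="$2"
        shift 2
        # A non-zero exit status from the check means this layer is suspect.
        if ! sh -c "$check" >/dev/null 2>&1; then
            echo "$label"
            return 1
        fi
    done
    echo "none"
}

# Example call (placeholder address and checks):
#   first_failing_layer \
#       "link"        "ip -o link show | grep -q 'state UP'" \
#       "network"     "ping -c 1 -W 2 192.0.2.10" \
#       "application" "curl -sS -o /dev/null --max-time 5 http://192.0.2.10/"
```

Stopping at the first failure mirrors the reasoning above: there is little point in testing the application while a lower layer is known to be broken.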
Now, while I’m going to give something of an overview of troubleshooting practice below, I really do recommend the DevOps Troubleshooting: Linux Server Best Practices book cited below in the Resources section. All of its chapters are sound, and most of it will be known to seasoned system administrators, but it’s an excellent introduction, a great refresher, and a nice reference.
The Physical Layer
If an application, for instance, suddenly ceases to send and receive traffic as usual, you may first choose to inspect the application itself. If it seems to be operating fine (all processes are running as expected, system resource usage indicates that it is idle) and perhaps the logs direct you to a networking issue, you might begin with a simple ping. If the system is known to respond to ICMP requests and it does not, the issue may be as low as the physical layer (a cable is bad or unplugged, or perhaps networking hardware has failed).
I’d typically execute ip link and look for interfaces whose state is UP. You could even do ip link | grep 'state UP' if you have too many interfaces to visually parse (which would be impressive); matching on state UP rather than on UP alone avoids hits on the UP flag, which only shows that the interface is administratively enabled. If nothing is up, you need to figure out which interface should be up based on the cabling arrangement and bring it up. On Red Hat-style systems, you may need to configure the relevant /etc/sysconfig/network-scripts/ifcfg-interfacename file to ensure the interface is set up properly, but beyond this, it’s unlikely that troubleshooting at this layer will be tested in the LFCE examination since you don’t have physical access to the box (and it’s a VM anyway). You could consider a missing virtual NIC to be a physical layer problem, I suppose, but you’re probably not going to have access to the management operating system, so you won’t be expected to worry about that sort of thing either.
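As a rough sketch of that check, the snippet below filters the one-line-per-interface output of ip -o link show for interfaces reporting state UP. The function name up_interfaces is made up for illustration, and the parsing assumes the usual iproute2 output layout. It reads from standard input so it can also be tried against saved output.

```shell
# Print the names of interfaces whose operational state is UP.
# Expects `ip -o link show` output on stdin (one interface per line).
up_interfaces() {
    awk '/state UP/ {
        name = $2
        sub(/:$/, "", name)   # drop the trailing colon after the name
        sub(/@.*/, "", name)  # drop the "@parent" suffix shown for VLAN devices
        print name
    }'
}

# Typical use on a live system:
#   ip -o link show | up_interfaces
```

An empty result suggests that no interface has a working link, which points at cabling, the (virtual) NIC, or an interface that simply has not been brought up.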
If the machine has a valid connection configured to the desired network and no hardware failure seems apparent, then perhaps you should check
The Data Link Layer
If your system connects through a switch or sits on a VLAN, that equipment (or its configuration) may have failed. Wireless access points and other layer two hardware are lumped into this category for troubleshooting purposes. Again, this is probably not going to be tested too rigorously on the LFCE examination since it predominantly concerns stand-alone hardware devices whose management is outside of the scope of the LFCE.
Typically, device failures at this layer will result in a broader outage than your system alone, so it’s normally easy to determine. I can think of one instance in which an optical fibre network switch suffered a singular longwave transmitter failure, causing abnormalities which periodically disrupted the data link layer (and weren’t recognized by the MPIO software). It was helpful to understand the service layers for the fibre network when the problem was reported to me since I am not typically directly involved with the fibre network hardware.
If this isn’t the case, however, and the system does recognize the Ethernet (or other medium) connection and is capable of sending and receiving information across that connection, then move on to
The Network Layer
This is probably the lowest layer of great focus for the LFCE examination. If networking infrastructure (including, perhaps, your kernel’s routing table) is to blame for the problem, this is perhaps the hangout of the most frequent culprits. This layer includes the Internet Protocol system for addressing and routing packets to their destinations. If this cannot reliably occur, any operations of your systems which depend on this functionality will fail.
For a given Linux/GNU system about which I know little (which is an important assumption for our purposes here, since we’ll be thrown into a testing environment with which we are afforded no prior familiarity), I typically proceed in troubleshooting at this layer as follows (though admittedly I’m going to blend layers below since it’s just easier to do it all in one place):
1. Examine the routing table of the system with ip route. This gives you a good idea of the system’s understanding of its options for issuing and receiving communication.
2. Examine the system’s routing policy rules with ip rule. This allows you to see if the system will choose routes according to any system-specific policies which deviate from the norm.
3. Examine the system’s firewall with iptables -L. Many, many layer three troubles are the result of the kernel simply dismissing or preventing communication attempts. Rule this out before proceeding any further.
   A. Even though it’s technically a layer 4 issue, ensure that the proper protocol types (TCP/UDP) are specified in your rules. If, for example, DNS traffic is intended to be permitted to the system by allowing communication over port 53 but the only protocol specified is TCP, ordinary lookups will fail (DNS queries normally travel over UDP, with TCP used for zone transfers and for responses too large for a UDP datagram).
4. Attempt to establish a connection of the kind experiencing trouble:
   A. If the target system is known to accept and respond with ICMP, a simple ping is often sufficient to determine whether or not communication is possible at a basic level.
   B. The mtr command can provide you with an understanding of the network infrastructure which must be traversed to reach the target system.
      - This utility functions by sending packets to the provided destination but setting the TTL (time to live) value on the packets to increasing integers, starting with 1. This causes the first packet to reach the first hop along the route to the destination, where the routing device receiving the packet decrements the TTL value by 1 (it’s a hop count) and, finding the packet to now have reached the end of its life, sends an ICMP Time Exceeded message back to the packet sender, providing the sender with knowledge of the routing device’s IP address (along with other information contained in the ICMP packet metadata). The second packet reaches the second routing device with similar results, and so on and so forth. The reason you sometimes see hops with only question marks in place of an IP address or domain name is that the device has been configured not to send ICMP Time Exceeded messages (or not to receive ICMP messages at all, etc.). So, you can’t always gather tons of information about the devices, but it can be a useful diagnostic utility.
   C. nmap is a great utility for determining the ports/services available on the target system.
   D. If all of that works well, attempt to connect to the target system over the protocol experiencing trouble:
      - curl is useful for HTTP/FTP
      - openssl is useful for validating SSL/TLS connections (and was used in determining POODLE vulnerabilities a while back)
      - telnet is useful for plain-text protocols (like SMTP)
   E. If the server system doesn’t respond properly to protocol-specific communication in Step 4.D above, you may try running tcpdump or iptraf on the server to ensure that it actually receives the traffic sent to it.
   F. If the system has a particularly confusing firewall configuration or you suspect that the firewall might be the problem, you could (if permitted) temporarily disable the firewall and test communications. To be safer, you could establish a firewall rule to permit all communications between the client system being used to test the service and the server system.
5. If in the course of step 4 above, the system is found to respond over certain protocols but not others (e.g. ping works, demonstrating ICMP connectivity between the systems, but curl fails to reach the website hosted on the target system),
   A. Verify that neither system is firewalling the communication.
   B. Verify that the routing infrastructure is able to route traffic of the type being used (HTTP or HTTPS in the curl example above).
   C. Continue to testing at the Application Layers below.
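The TTL trick described for mtr can be imitated with plain ping, which may help make the mechanism concrete. This is only a sketch, not how mtr itself is implemented: the function names hop_from_ping and trace_with_ping are hypothetical, the -t (TTL), -W (timeout), and -n (numeric) options are those of the iputils ping found on most Linux distributions, and the parsing assumes that version’s English output.

```shell
# Classify the output of a single `ping -n -c 1 -t TTL HOST` run (read from stdin).
# Prints "hop ADDR" for an ICMP Time Exceeded reply, "destination ADDR" for an
# echo reply, or "no-reply" when nothing came back.
hop_from_ping() {
    awk '
        /Time to live exceeded/ { print "hop", $2; found = 1 }
        / bytes from /          { addr = $4; sub(/:$/, "", addr)
                                  print "destination", addr; found = 1 }
        END                     { if (!found) print "no-reply" }'
}

# Probe with TTL 1, 2, 3, ... until the destination answers or max_hops is reached.
trace_with_ping() {
    host="$1"
    max_hops="${2:-15}"
    ttl=1
    while [ "$ttl" -le "$max_hops" ]; do
        result=$(ping -n -c 1 -W 2 -t "$ttl" "$host" 2>/dev/null | hop_from_ping)
        echo "$ttl $result"
        case "$result" in
            destination*) break ;;
        esac
        ttl=$((ttl + 1))
    done
}

# Example (placeholder address): trace_with_ping 192.0.2.10
```

A "no-reply" line corresponds to the question-mark hops: the probe expired at a device that chose not to send a Time Exceeded message back.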
If you can actually establish a connection with the target system using the protocol experiencing the trouble, your problem is not at this layer (unless it’s some sort of periodic issue which was not detected by your troubleshooting efforts). Though already covered (by step 3.A above) in basically the only important way relevant to troubleshooting, allow me to briefly cover
The Transport Layer
Knowing the difference in behavior between TCP and UDP is very important. TCP is a connection-oriented, stateful protocol: it establishes a connection with a handshake and keeps track of information about the state of that connection (such as sequence numbers and acknowledgements, which let it reorder segments and retransmit lost ones). UDP is connectionless and stateless: it simply sends datagrams without establishing a connection and offers no guarantee of delivery or ordering. Ordinary UDP traffic is routed across subnets like any other IP traffic, but UDP is also the transport for broadcast-based services such as DHCP discovery, and broadcast traffic does not cross subnet boundaries unless a routing device is configured to relay it. Knowing this can be very helpful when determining a problem’s likely causes, and it is unavoidably important for understanding application communications and troubleshooting them.
However, when it comes to troubleshooting, it is generally (there are probably good counter-examples which aren’t coming to mind at the moment) safe to expect TCP and UDP to function properly if the Network Layer checks out and IP traffic is being handled appropriately. Still, firewall rules can easily be made to block entire protocols, for example, so understanding this layer is important for troubleshooting, but it is usually not addressed directly (well, except for TCP…ha).
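Because a rule can match one transport protocol and silently ignore the other, it can help to list which protocols your rules actually mention for a given port. The sketch below does this for the output of iptables -S (iptables-save output works too). protocols_for_port is a hypothetical helper name, and it only understands simple --dport matches, not multiport lists or port ranges.

```shell
# Print each protocol (tcp, udp, ...) that appears in a rule matching the given
# destination port. Expects `iptables -S` output on stdin.
protocols_for_port() {
    awk -v port="$1" '
        {
            proto = ""
            hit = 0
            for (i = 1; i < NF; i++) {
                if ($i == "-p") proto = $(i + 1)
                if ($i == "--dport" && $(i + 1) == port) hit = 1
            }
            if (hit && proto != "") print proto
        }' | sort -u
}

# Typical use (as root):
#   iptables -S INPUT | protocols_for_port 53
```

For a service such as DNS, which uses both UDP and TCP on port 53, seeing only one protocol in the output is a hint that the rules are incomplete.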
So if the Network Layer checks out and the target system seems receptive to the protocol being used for communication, you’re left with
The Application Layers
This is where the ability to establish basic connections with systems is useful in testing. Step 4.D above covers the basic idea regarding the establishment of test communications between the problematic systems, but if basic connections are capable of being established, then you need to investigate whether or not the application itself is faulty.
I make great use of netstat to determine what processes are listening on which ports. Head to the server (as opposed to the client) in the problematic situation and run netstat -luntap to see all the current relationships (including listening relationships) between the software running on the system and the various ports exposed to the network. You can grep for the port number in which you are interested if necessary. If you don’t see the expected process listening on the expected port, you’ve got yourself some sort of application problem.
If you do see the expected process listening on the expected port, you may have some sort of application hang which is preventing it from responding to requests. From here, you need to troubleshoot the application according to the appropriate process. Generally, check the logs and look for process failures. Modifying configuration and/or restarting the software may be necessary.
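Picking the relevant rows out of netstat output can be scripted. The sketch below assumes the column layout printed by the net-tools netstat -luntap command (run it as root so the PID/Program name column is filled in); listeners_on_port is a hypothetical helper name.

```shell
# Show sockets bound to the given local port: TCP sockets in LISTEN state and
# any UDP sockets. Expects `netstat -luntap` output on stdin.
# Output columns: protocol, local address, PID/program.
listeners_on_port() {
    awk -v port="$1" '
        $1 ~ /^(tcp|udp)/ {
            n = split($4, parts, ":")       # the port is the last ":"-separated piece
            if (parts[n] != port) next
            if ($1 ~ /^tcp/ && $6 != "LISTEN") next
            print $1, $4, $NF
        }'
}

# Typical use on the server (as root):
#   netstat -luntap | listeners_on_port 80
```

No output for the port you care about means nothing is listening there, which is the application problem described above.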
Other issues which may be preventing your application from functioning as expected could be related to the available infrastructure. If the application depends on reverse DNS lookups to authenticate communications, for example, it may refuse to respond to communications if the system cannot reach DNS to resolve IP addresses. Generally, I examine the following infrastructure components depending on the general issues with which they are related:
- Use dig or nslookup to check out DNS functionality for all systems involved in the problematic communication. Make sure each system can resolve one another (if they’re supposed to be able to do so, of course) and that each system can make proper general use of their respective DNS resources (resolve google.com, for example). DNS is a very, very frequent culprit in networking issues.
- Make sure the system has the appropriate IP address(es) for its NIC(s), and if it is to use DHCP, ensure that it is doing so successfully. Most networks automate the process of DHCP service discovery and use, so you’re probably not going to be troubleshooting at this level, but you should certainly be aware that it’s a possible issue.
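A quick forward-and-reverse check with dig might look like the following sketch. check_dns is a hypothetical helper name; it relies only on dig +short and dig -x, takes the first IPv4 address returned, and is meant to be run on each system involved in the problematic communication.

```shell
# Resolve a name to its first IPv4 address, then look that address up in
# reverse DNS. Prints one line describing what was found.
check_dns() {
    name="$1"
    addr=$(dig +short "$name" A |
        awk '/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && !found { print; found = 1 }')
    if [ -z "$addr" ]; then
        echo "forward lookup failed for $name"
        return 1
    fi
    ptr=$(dig +short -x "$addr" | awk 'NR == 1')
    if [ -z "$ptr" ]; then
        echo "$name -> $addr, but no PTR record for $addr"
        return 2
    fi
    echo "$name -> $addr -> $ptr"
}

# Example: check_dns google.com
```

The reverse step matters for services that authenticate clients by reverse lookup: a name can resolve perfectly well while its address has no PTR record.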
That’s about it. Of course, troubleshooting network issues can lead one through just about every component of a system, especially when application issues are chased down. If anyone can think of anything else which deserves mention here, let me know, but this is the general toolkit and framework which I follow (off the top of my head, that is) when I troubleshoot networking issues.
Resources
- Manual pages
- curl (1)
- dig (1)
- ip (8)
- Specifically, sections on the ip route and ip rule commands.
- iptables (8)
- iptraf (8)
- mtr (8)
- nmap (1)
- openssl (1)
- ping (8)
- tcpdump (8)
- telnet (1)
- The Linux Bible, 8th Edition
- Chapter 14: Administering Networking
- This is good for reference regarding the various networking components of the system, but it doesn’t provide a particular troubleshooting method.
- DevOps Troubleshooting: Linux Server Best Practices
- Chapter 5: Is the Server Down? Tracking Down the Source of Network Problems
- Chapter 6: Why Won’t the Hostnames Resolve? Solving DNS Server Issues
- Chapter 7: Why Didn’t My Email Go Through? Tracing Email Problems
- Chapter 8: Is the Website Down? Tracking Down Web Server Problems
- See the list of manual pages above
- Many applications use rsyslog to output their problems here, so it’s always worth a check
- By default, iptables log entries are funneled here through rsyslog.
- Troubleshoot networking issues.
- That’s basically it.
Unfortunately, this is a difficult competency for which to practice in a lab environment. The major problem is that you pretty much have to make your own problems before you can troubleshoot them, and if you make your own problems, you already know what they are. Hopefully, at some point in the future, I will put together some virtual machines with perhaps a few scripts designed to configure the systems problematically so that people can troubleshoot them. The vision is something of a downloadable Linux/GNU obstacle course, which is how I imagine the pinnacle of a training environment. We’ll see if I can get it done eventually, but for now the best way to practice is to use a Linux machine at home and set up some complexity in hopes of running into a problem or two.