TCP Notes
A few notes on TCP.
TCP/IP
TCP/IP is a suite of protocols that governs how data is transmitted over the internet. Two key protocols in this suite are IP (Internet Protocol) and TCP (Transmission Control Protocol).
IP
IP is a connectionless protocol that handles the addressing and routing of data packets between devices on a network. It sends each packet independently without establishing a connection, meaning it does not guarantee delivery or packet order, focusing only on directing packets to the correct destination using IP addresses.
TCP
While IP is responsible for routing packets across networks, TCP ensures that these packets are delivered correctly and in the right order. It establishes a connection between the sender and receiver, manages retransmissions of lost packets, and organizes the data into a continuous stream, ensuring that communication remains reliable.
POSIX APIs
/*
    client                    server          related
    1. socket();              1. socket();    fcntl()
    2. bind(); // optional    2. bind();      epoll
    3. connect();             3. listen();    epoll_create()
    4. send();                4. accept();    epoll_ctl()
    5. recv();                5. recv();      epoll_wait()
    6. close();               6. send();      ...
                              7. close();
*/
// creates a new socket
int socket(int domain, int type, int protocol);
// binds a socket to a specific local IP address and port
// if omitted on the client side, the operating system will
// automatically assign an available local IP address and an
// ephemeral port for the connection
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
// initiates a connection to another socket
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
// marks a socket as a passive socket to accept incoming connection requests
int listen(int sockfd, int backlog);
// accepts an incoming connection on a listening socket
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
// sends data over a connected socket
ssize_t send(int sockfd, const void *buf, size_t len, int flags);
// receives data from a connected socket
ssize_t recv(int sockfd, void *buf, size_t len, int flags);
// closes a socket
int close(int sockfd);
// changes the behavior of a file descriptor
// for example, setting it to non-blocking mode, controlling file locking,
// or modifying file access settings
int fcntl(int fd, int cmd, ... /* arg */);
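The call order in the table above can be sketched as a single-process round trip over the loopback interface. This is a minimal sketch, not a production server: the helper names (`listen_on_loopback`, `connect_to_loopback`, `loopback_round_trip`), the backlog of 8, and the "ping" payload are all assumptions made for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: server steps 1-3 (socket, bind, listen) on 127.0.0.1.
 * Port 0 asks the OS to pick a free port, which is reported through *port.
 * Returns the listening fd, or -1 on error. */
int listen_on_loopback(unsigned short *port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);
    socklen_t len = sizeof addr;

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(fd, 8) == -1 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) == -1) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Hypothetical helper: client steps 1 and 3 (socket, connect); bind() is
 * skipped, so the OS assigns the local address and an ephemeral port. */
int connect_to_loopback(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Runs both sides in one process. Returns 0 if "ping" crossed the connection. */
int loopback_round_trip(void) {
    unsigned short port = 0;
    int result = -1;
    int listen_fd = listen_on_loopback(&port);
    if (listen_fd == -1) return -1;

    int client_fd = connect_to_loopback(port);
    int server_fd = (client_fd == -1) ? -1 : accept(listen_fd, NULL, NULL);
    char buf[4] = {0};

    if (server_fd != -1 &&
        send(client_fd, "ping", 4, 0) == 4 &&
        recv(server_fd, buf, sizeof buf, MSG_WAITALL) == 4 &&
        memcmp(buf, "ping", 4) == 0) {
        result = 0;
    }
    if (server_fd != -1) close(server_fd);
    if (client_fd != -1) close(client_fd);
    close(listen_fd);
    return result;
}
```

Calling `loopback_round_trip()` from `main` should return 0 when loopback networking is available. The blocking `connect()` can return before `accept()` is called because the kernel completes the handshake on behalf of the listening socket.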
TCB
TCB, or Transmission Control Block, is a data structure used by TCP to store information about a specific connection. It keeps track of the state of the TCP connection and maintains various details necessary for managing the reliable, ordered delivery of data. The TCB includes buffers like wmem for outgoing data and rmem for incoming data.
When socket() is called, a file descriptor is assigned to the created socket, and a TCB is allocated underneath (in kernel space), along with its components such as wmem and rmem.
When bind() is called, the IP address and port information are assigned to the TCB.
When send() is called, data is copied into the wmem buffer.
When recv() is called, data is copied from the rmem buffer.
send() and recv() are similar to write() and read(): they only copy data to and from these buffers. They are not directly responsible for actual network data transmission, which the kernel's TCP implementation performs separately by sending what is queued in wmem and placing received data in rmem.
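The sizes of these buffers can be inspected from user space. The sketch below assumes that wmem and rmem correspond to the socket's send and receive buffers as exposed through the standard `SO_SNDBUF` and `SO_RCVBUF` socket options; the function name `socket_buffer_sizes` is made up for illustration.

```c
#include <sys/socket.h>

/* Reads the kernel's send and receive buffer sizes (in bytes) for a socket.
 * Returns 0 on success, -1 on error (errno is set by getsockopt). */
int socket_buffer_sizes(int sockfd, int *send_bytes, int *recv_bytes) {
    socklen_t len = sizeof *send_bytes;
    if (getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, send_bytes, &len) == -1) {
        return -1;
    }
    len = sizeof *recv_bytes;
    if (getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, recv_bytes, &len) == -1) {
        return -1;
    }
    return 0;
}
```

The values reported are whatever the operating system chose for that socket, so they differ between systems and configurations.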
TCP Connection State Diagram
IETF. (1981). RFC 793: Transmission Control Protocol. Retrieved from https://www.ietf.org/rfc/rfc793.txt, p. 22.
[Page 22]
September 1981
Transmission Control Protocol
Functional Specification
                              +---------+ ---------\      active OPEN
                              |  CLOSED |            \    -----------
                              +---------+<---------\   \   create TCB
                                |     ^              \   \  snd SYN
                   passive OPEN |     |   CLOSE        \   \
                   ------------ |     | ----------       \   \
                    create TCB  |     | delete TCB         \   \
                                V     |                      \   \
                              +---------+            CLOSE    |    \
                              |  LISTEN |          ---------- |     |
                              +---------+          delete TCB |     |
                   rcv SYN      |     |     SEND              |     |
                  -----------   |     |    -------            |     V
 +---------+      snd SYN,ACK  /       \   snd SYN          +---------+
 |         |<-----------------           ------------------>|         |
 |   SYN   |                    rcv SYN                     |   SYN   |
 |   RCVD  |<-----------------------------------------------|   SENT  |
 |         |                    snd ACK                     |         |
 |         |------------------           -------------------|         |
 +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
   |           --------------   |     |   -----------
   |                  x         |     |     snd ACK
   |                            V     V
   |  CLOSE                   +---------+
   | -------                  |  ESTAB  |
   | snd FIN                  +---------+
   |                   CLOSE    |     |    rcv FIN
   V                  -------   |     |    -------
 +---------+          snd FIN  /       \   snd ACK          +---------+
 |  FIN    |<-----------------           ------------------>|  CLOSE  |
 | WAIT-1  |------------------                              |   WAIT  |
 +---------+          rcv FIN  \                            +---------+
   | rcv ACK of FIN   -------   |                            CLOSE  |
   | --------------   snd ACK   |                           ------- |
   V        x                   V                           snd FIN V
 +---------+                  +---------+                   +---------+
 |FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
 +---------+                  +---------+                   +---------+
   |                rcv ACK of FIN |                 rcv ACK of FIN |
   |  rcv FIN       -------------- |    Timeout=2MSL -------------- |
   |  -------              x       V    ------------        x       V
    \ snd ACK                 +---------+delete TCB         +---------+
     ------------------------>|TIME WAIT|------------------>| CLOSED  |
                              +---------+                   +---------+

                      TCP Connection State Diagram
                               Figure 6.
SYN queue and accept queue
The SYN queue is used by the server to temporarily hold half-open connections during the initial handshake phase of a TCP connection. When a client sends a SYN packet to initiate a connection, the server responds with a SYN-ACK and places an entry for the connection in the SYN queue, where it stays until the client's final ACK arrives.
Once the handshake is finished and the connection is established, the connection is moved to the accept queue, which holds fully established connections that are ready for communication.
When the server calls the accept() system call, it retrieves a connection from the accept queue.
TCP Header Format
IETF. (1981). RFC 793: Transmission Control Protocol. Retrieved from https://www.ietf.org/rfc/rfc793.txt, p. 14.
[Page 14]
3. FUNCTIONAL SPECIFICATION
3.1. Header Format
TCP segments are sent as internet datagrams. The Internet Protocol
header carries several information fields, including the source and
destination host addresses [2]. A TCP header follows the internet
header, supplying information specific to the TCP protocol. This
division allows for the existence of host level protocols other than
TCP.
TCP Header Format
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                            TCP Header Format

          Note that one tick mark represents one bit position.

                               Figure 3.
Source Port: 16 bits
The source port number.
Destination Port: 16 bits
The destination port number.
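To make the layout concrete, here is a minimal sketch that decodes the fixed 20-byte part of the header from a raw byte buffer. The struct, enum, and function names are illustrative, not taken from any system header. Multi-byte fields are read in network byte order (big-endian), and the options area is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed part of the TCP header, decoded into host byte order.
 * All names here are illustrative. */
struct tcp_header {
    uint16_t source_port;
    uint16_t dest_port;
    uint32_t seq;
    uint32_t ack;
    uint8_t  data_offset;     /* header length in 32-bit words (4 bits) */
    uint8_t  flags;           /* low 6 bits: URG ACK PSH RST SYN FIN */
    uint16_t window;
    uint16_t checksum;
    uint16_t urgent_pointer;
};

/* Bit positions of the six control flags within the flags byte. */
enum {
    FLAG_FIN = 0x01, FLAG_SYN = 0x02, FLAG_RST = 0x04,
    FLAG_PSH = 0x08, FLAG_ACK = 0x10, FLAG_URG = 0x20
};

static uint16_t read_be16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decodes the first 20 bytes of buf into *out.
 * Returns 0 on success, or -1 if the buffer is too short or the data
 * offset is smaller than the 5 words the fixed header occupies. */
int parse_tcp_header(const uint8_t *buf, size_t len, struct tcp_header *out) {
    if (len < 20) return -1;
    out->source_port    = read_be16(buf);
    out->dest_port      = read_be16(buf + 2);
    out->seq            = read_be32(buf + 4);
    out->ack            = read_be32(buf + 8);
    out->data_offset    = buf[12] >> 4;
    out->flags          = buf[13] & 0x3F;
    out->window         = read_be16(buf + 14);
    out->checksum       = read_be16(buf + 16);
    out->urgent_pointer = read_be16(buf + 18);
    return out->data_offset >= 5 ? 0 : -1;
}
```

Reading byte by byte rather than casting the buffer to a struct avoids alignment and host-endianness problems.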
Three-way Handshake
- SYN: The client initiates the connection by sending a SYN packet (SYN bit set to 1) with a randomly chosen initial sequence number.
- SYN-ACK: The server responds with a SYN-ACK packet, acknowledging the client's SYN and sending its own initial sequence number.
- ACK: The client sends an ACK packet to confirm receipt of the server's SYN-ACK, completing the handshake.
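The sequence and acknowledgment numbers in the three steps above follow a simple rule: a SYN consumes one sequence number, so each side acknowledges the other's initial sequence number plus one. The sketch below only computes the header values; it sends nothing over a network, and the type and function names are illustrative.

```c
#include <stdint.h>

enum { FLAG_SYN = 0x02, FLAG_ACK = 0x10 };

/* The header fields that matter for the handshake. */
struct segment {
    uint8_t  flags;
    uint32_t seq;
    uint32_t ack;   /* meaningful only when FLAG_ACK is set */
};

/* Fills out[0..2] with the SYN, SYN-ACK and ACK segments for the given
 * initial sequence numbers (ISNs). Arithmetic wraps modulo 2^32, as
 * TCP sequence numbers do. */
void three_way_handshake(uint32_t client_isn, uint32_t server_isn,
                         struct segment out[3]) {
    /* 1. SYN: the client announces its ISN. */
    out[0] = (struct segment){ FLAG_SYN, client_isn, 0 };
    /* 2. SYN-ACK: the server announces its ISN and acknowledges the SYN. */
    out[1] = (struct segment){ FLAG_SYN | FLAG_ACK, server_isn, client_isn + 1 };
    /* 3. ACK: the client acknowledges the server's SYN. */
    out[2] = (struct segment){ FLAG_ACK, client_isn + 1, server_isn + 1 };
}
```

For example, with a client ISN of 100 and a server ISN of 300, the SYN-ACK carries ack 101 and the final ACK carries seq 101 and ack 301.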
The 5-tuple
The connections, whether the half-open ones in the SYN queue or the established ones in the accept queue, are identified by a tuple of 5 elements: (source_ip, source_port, dest_ip, dest_port, protocol).
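As a rough illustration, the tuple can be modeled as a lookup key. The struct below is a sketch for IPv4 only, and its names are hypothetical; real kernels use their own internal representations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative 5-tuple key for an IPv4 connection. */
struct conn_key {
    uint32_t source_ip;
    uint16_t source_port;
    uint32_t dest_ip;
    uint16_t dest_port;
    uint8_t  protocol;     /* IP protocol number; 6 is TCP */
};

/* Two keys name the same connection only if all five elements match. */
bool conn_key_equal(const struct conn_key *a, const struct conn_key *b) {
    return a->source_ip == b->source_ip &&
           a->source_port == b->source_port &&
           a->dest_ip == b->dest_ip &&
           a->dest_port == b->dest_port &&
           a->protocol == b->protocol;
}
```

The comparison goes field by field rather than through `memcmp`, because the struct may contain padding bytes with unspecified values. Two connections from the same client to the same server port differ only in `source_port`, which is enough to keep them apart.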
DDoS SYN Flood
A DDoS SYN flood attack targets the SYN queue of a server by sending a massive number of SYN packets to initiate TCP connections, often with spoofed IP addresses (meaning the source IP address is forged to be from a random or unreachable IP address). The server responds with SYN-ACK packets, but because the attacker doesn't complete the 3-way handshake with the final ACK packet, the server’s SYN queue fills up with half-open connections. This consumes server resources and prevents legitimate connections, leading to a Denial of Service (DoS). The attack exploits the server's inability to distinguish between legitimate and malicious SYN requests, causing service disruption.
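In some contexts, a DDoS attack is also referred to as a CC (Challenge Collapsar) attack, a term that usually denotes a flood aimed at the application layer rather than at the SYN queue.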
ET or LT for accept
The accept() system call retrieves a connection from the accept queue, which holds fully established connections.
For accept(), Level-Triggered (LT) mode notifies the application as long as there are pending connections in the accept queue, so notifications repeat until all pending connections have been handled.
In contrast, Edge-Triggered (ET) mode notifies the application only once when new connections arrive, so the application must call accept() in a loop until the queue is empty. ET requires a non-blocking listening socket: once the queue is drained, a blocking accept() would stall the loop, whereas a non-blocking accept() returns -1 immediately and sets errno to EAGAIN or EWOULDBLOCK.
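#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

// set a socket to non-blocking mode
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) {
        perror("fcntl F_GETFL");
        return -1;
    }
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        perror("fcntl F_SETFL O_NONBLOCK");
        return -1;
    }
    return 0;
}

// accept in ET mode with non-blocking fd
// (wrapped in a function so the fragment compiles; the name accept_all is
// illustrative, and listen_fd must already be non-blocking)
void accept_all(int listen_fd) {
    while (1) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd == -1) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                // no more connections to accept
                break;
            }
            // any other error: report it and stop, rather than falling
            // through and using an invalid descriptor
            perror("accept");
            break;
        }
        // set the accepted socket to non-blocking
        if (set_nonblocking(conn_fd) == -1) {
            close(conn_fd);
            continue;
        }
        // connection accepted
        printf("accepted connection, fd %d\n", conn_fd);
    }
}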
LT is easier to use, while ET is suited for high-performance applications with careful design.
backlog in listen()
The implementation of the backlog parameter in the listen() system call has evolved over time. In early implementations, it primarily defined the size of the SYN queue, holding half-open connections. Modern systems distinguish between the SYN queue and the accept queue, with backlog influencing both.
On Linux, for example, since kernel 2.2 backlog specifies the length of the accept queue and is silently capped at net.core.somaxconn, while net.ipv4.tcp_max_syn_backlog sets the maximum length of the SYN queue. While backlog serves as a hint for connection handling, the actual behavior depends on system-specific settings, kernel versions, and dynamic adjustments, meaning the size of the connection queues can vary significantly.
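On Linux, the cap can be read from procfs. The following is a small sketch under that assumption. The function names are illustrative, and `read_somaxconn()` returns -1 on systems where the file does not exist.

```c
#include <stdio.h>

/* Reads net.core.somaxconn from procfs (Linux only).
 * Returns the value, or -1 if it cannot be read. */
int read_somaxconn(void) {
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    if (f == NULL) return -1;

    int value = -1;
    if (fscanf(f, "%d", &value) != 1) value = -1;
    fclose(f);
    return value;
}

/* Accept-queue length that results when a requested backlog is capped
 * at somaxconn. */
int effective_backlog(int requested, int somaxconn) {
    return requested > somaxconn ? somaxconn : requested;
}
```

For example, `effective_backlog(1024, 128)` yields 128: asking for a larger backlog than the system allows has no effect until the sysctl itself is raised.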
MTU
MTU, or the Maximum Transmission Unit, defines the largest packet size that can be transmitted over a network without fragmentation. TCP uses the Path MTU Discovery mechanism to determine the optimal packet size to improve efficiency and avoid fragmentation-related overhead.
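As a worked example of what the MTU means for TCP, the sketch below computes the largest payload that fits in one unfragmented packet. The numbers in the usage note are assumptions: 1500 bytes is the usual Ethernet MTU, and 20 bytes each is the minimum size of an IPv4 header and of a TCP header without options.

```c
/* Largest TCP payload (in bytes) that fits in one unfragmented packet,
 * given the MTU and the two header sizes. Returns 0 if the headers alone
 * do not fit. The function name is illustrative. */
unsigned max_tcp_payload(unsigned mtu, unsigned ip_header, unsigned tcp_header) {
    unsigned headers = ip_header + tcp_header;
    return mtu > headers ? mtu - headers : 0;
}
```

With those assumed values, `max_tcp_payload(1500, 20, 20)` gives 1460 bytes per segment. A smaller MTU anywhere along the path reduces this figure.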
Sliding Window
The sliding window is TCP's general mechanism for managing and tracking data flow between sender and receiver. The window size determines how much data can be in flight (sent but not yet acknowledged) at any given time. This allows efficient flow control: the sender can transmit multiple segments without waiting for each acknowledgment, and as acknowledgments arrive the window slides forward to admit new data.
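The sender-side bookkeeping can be sketched with three variables, named here after the SND.UNA, SND.NXT and SND.WND variables of RFC 793. This is a simplified model that tracks byte counts only. It does no retransmission or buffering, and the function names are illustrative.

```c
#include <stdint.h>

struct send_window {
    uint32_t snd_una;   /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;   /* next sequence number to be sent */
    uint32_t snd_wnd;   /* window size in bytes */
};

/* Bytes that may still be sent before an acknowledgment is needed. */
uint32_t usable_window(const struct send_window *w) {
    uint32_t in_flight = w->snd_nxt - w->snd_una;   /* wraps modulo 2^32 */
    return in_flight >= w->snd_wnd ? 0 : w->snd_wnd - in_flight;
}

/* Records the sending of up to len bytes, limited by the usable window.
 * Returns the number of bytes actually admitted. */
uint32_t window_send(struct send_window *w, uint32_t len) {
    uint32_t n = usable_window(w);
    if (len < n) n = len;
    w->snd_nxt += n;
    return n;
}

/* Slides the window forward when ack acknowledges new in-flight data.
 * Acknowledgments outside the in-flight range are ignored. */
void window_ack(struct send_window *w, uint32_t ack) {
    uint32_t acked = ack - w->snd_una;
    uint32_t in_flight = w->snd_nxt - w->snd_una;
    if (acked > 0 && acked <= in_flight) w->snd_una = ack;
}
```

Starting from sequence number 1000 with a 3000-byte window, sending 2000 bytes leaves 1000 usable. An acknowledgment for 2500 then frees the 1500 bytes it covers.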
Congestion Window
The congestion window (cwnd) is the part of the sliding window mechanism that deals with congestion control. It limits how much data the sender can have in flight based on the network's capacity to handle traffic, and it is adjusted dynamically to prevent congestion. At any moment the sender is bound by the smaller of cwnd and the window advertised by the receiver.
Congestion Control
Congestion control in TCP contains different phases that manage the growth and reduction of the congestion window (cwnd) to ensure efficient and stable data transmission. These phases help avoid network congestion, minimize packet loss, and optimize throughput. The primary phases are:
- Slow Start: Increases the congestion window (cwnd) exponentially until the slow-start threshold (ssthresh) is reached. (Quickly explores the available bandwidth.)
- Congestion Avoidance: Grows the cwnd linearly after the ssthresh is reached, allowing for more controlled data transmission. (Gradually increases the sending rate in a controlled manner to avoid congestion once the network's capacity is better understood.)
- Fast Retransmit/Fast Recovery: Retransmits a lost packet as soon as the loss is detected via three duplicate ACKs, without waiting for a timeout. The ssthresh is set to about half of the current cwnd and the cwnd is cut to roughly that value (multiplicative decrease), after which it grows linearly again. (Efficiently recovers from isolated packet loss.)
- Timeout Retransmission: When a packet is not acknowledged before the retransmission timer expires, the ssthresh is set to about half of the current cwnd and the cwnd is reset to one segment. TCP then enters slow start again until the ssthresh is reached. (Recovers from severe network congestion or loss when duplicate ACKs are not received.)
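The phases above can be sketched as a small state update, counted in whole segments. This is a simplified Reno-style model under stated assumptions, loosely following RFC 5681: it skips the temporary window inflation of fast recovery, and on a timeout it resets cwnd to one segment while setting ssthresh to half the previous cwnd (at least two segments). All names are illustrative.

```c
#include <stdint.h>

struct cc_state {
    uint32_t cwnd;               /* congestion window, in segments */
    uint32_t ssthresh;           /* slow-start threshold, in segments */
    uint32_t acks_since_growth;  /* ACK counter used for linear growth */
};

static uint32_t half_at_least_two(uint32_t cwnd) {
    uint32_t half = cwnd / 2;
    return half < 2 ? 2 : half;
}

/* Called for each ACK that acknowledges new data. */
void on_ack(struct cc_state *s) {
    if (s->cwnd < s->ssthresh) {
        s->cwnd += 1;                    /* slow start: doubles per round trip */
    } else if (++s->acks_since_growth >= s->cwnd) {
        s->cwnd += 1;                    /* congestion avoidance: +1 per round trip */
        s->acks_since_growth = 0;
    }
}

/* Fast retransmit / fast recovery: multiplicative decrease. */
void on_triple_duplicate_ack(struct cc_state *s) {
    s->ssthresh = half_at_least_two(s->cwnd);
    s->cwnd = s->ssthresh;
    s->acks_since_growth = 0;
}

/* Retransmission timeout: fall back to slow start from one segment. */
void on_timeout(struct cc_state *s) {
    s->ssthresh = half_at_least_two(s->cwnd);
    s->cwnd = 1;
    s->acks_since_growth = 0;
}

/* The sender may have at most min(cwnd, rwnd) segments in flight, where
 * rwnd is the window advertised by the receiver. */
uint32_t send_limit(const struct cc_state *s, uint32_t rwnd) {
    return s->cwnd < rwnd ? s->cwnd : rwnd;
}
```

Starting from cwnd 1 and ssthresh 8, seven ACKs bring cwnd to 8, and eight further ACKs raise it to 9. Three duplicate ACKs then drop it to 4, and a timeout drops it to 1 with ssthresh 2.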
The 4-way handshake
The 4-way handshake is the process TCP uses to gracefully terminate a connection between a client and a server. It can be initiated by either the client, the server, or both.
- Initiation: Either the client or server sends a FIN to signal that it has no more data to send.
- Acknowledgment: The receiving side sends an ACK to acknowledge the FIN.
- Second FIN: The side that hasn’t yet sent a FIN sends its own FIN once it finishes transmitting its remaining data.
- Final ACK: The receiving side sends a final ACK to confirm the second FIN, completing the connection termination.
If both sides initiate the termination, it is also handled gracefully (see State Diagram).
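The closing half of the state diagram can be written as a transition function. The sketch below covers only the states and events involved in termination, starting from ESTAB. The enum and function names are illustrative, and events that the diagram does not define for a state leave the state unchanged.

```c
enum tcp_state {
    ST_ESTABLISHED, ST_FIN_WAIT_1, ST_FIN_WAIT_2, ST_CLOSING,
    ST_TIME_WAIT, ST_CLOSE_WAIT, ST_LAST_ACK, ST_CLOSED
};

enum tcp_event {
    EV_CLOSE,            /* the application calls close(); a FIN is sent */
    EV_RCV_FIN,          /* the peer's FIN arrives; an ACK is sent */
    EV_RCV_ACK_OF_FIN,   /* the peer acknowledges our FIN */
    EV_TIMEOUT_2MSL      /* the TIME WAIT timer expires */
};

/* Returns the state that follows s when event e occurs. */
enum tcp_state next_state(enum tcp_state s, enum tcp_event e) {
    switch (s) {
    case ST_ESTABLISHED:
        if (e == EV_CLOSE) return ST_FIN_WAIT_1;          /* active close */
        if (e == EV_RCV_FIN) return ST_CLOSE_WAIT;        /* passive close */
        break;
    case ST_FIN_WAIT_1:
        if (e == EV_RCV_ACK_OF_FIN) return ST_FIN_WAIT_2;
        if (e == EV_RCV_FIN) return ST_CLOSING;           /* both sides closed at once */
        break;
    case ST_FIN_WAIT_2:
        if (e == EV_RCV_FIN) return ST_TIME_WAIT;
        break;
    case ST_CLOSING:
        if (e == EV_RCV_ACK_OF_FIN) return ST_TIME_WAIT;
        break;
    case ST_CLOSE_WAIT:
        if (e == EV_CLOSE) return ST_LAST_ACK;
        break;
    case ST_LAST_ACK:
        if (e == EV_RCV_ACK_OF_FIN) return ST_CLOSED;
        break;
    case ST_TIME_WAIT:
        if (e == EV_TIMEOUT_2MSL) return ST_CLOSED;
        break;
    case ST_CLOSED:
        break;
    }
    return s;
}
```

The side that sends the first FIN passes through FIN WAIT-1, FIN WAIT-2 and TIME WAIT, while the other side passes through CLOSE WAIT and LAST-ACK. When both FINs cross, each side goes through CLOSING instead of FIN WAIT-2.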
Communicate without a TCP server
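Two TCP sockets can communicate without a central TCP server by establishing a direct connection between two devices, with neither side calling listen() or accept(). Instead, both sides bind() to agreed-upon ports and call connect() toward each other at about the same time. Their SYN packets cross, and each side moves from SYN SENT through SYN RCVD to ESTAB (the "rcv SYN" transition out of SYN SENT in the state diagram), which is commonly called a simultaneous open. This approach is similar to peer-to-peer (P2P) communication, where devices (peers) exchange data directly with each other, without the need for a centralized server to manage the connection.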