The only I/O model that provides true scalability on Windows NT platforms is overlapped I/O using completion ports for notification. In Chapter 5, we covered the various methods of socket I/O and explained that for a large number of connections, completion ports offer the greatest flexibility and ease of implementation. Mechanisms like WSAAsyncSelect and select are provided for easier porting from Windows 3.1 and UNIX, respectively, but are not designed to scale. The event-based models are not scalable because of the operating system limit of simultaneous wait events.
Another major advantage of overlapped I/O is that several Microsoft-specific extensions can be called only in an overlapped manner. When you use overlapped I/O, there are several options for how the notifications can be received. Event-based notification is not scalable because the operating system limit of waiting on 64 objects per thread necessitates using many threads. This is not only inefficient but also requires a lot of housekeeping overhead to assign events to available worker threads. Overlapped I/O with callbacks is not an option for several reasons. First, many of the Microsoft-specific extensions do not allow Asynchronous Procedure Calls (APCs) for completion notification. Second, because of how APCs are handled on Windows, it is possible for an application thread to starve. Once a thread goes into an alertable wait, all pending APCs are handled on a first-in, first-out (FIFO) basis. Now consider the situation in which a server has a connection established and posts an overlapped WSARecv with a completion function. When there is data to receive, the completion routine fires and posts another overlapped WSARecv. Depending on timing conditions and how much work is performed within the APC, another completion function may be queued before the first one returns (because there is more data to be read). This can cause the server's thread to starve for as long as there is pending data on that socket.
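To make the cost of the 64-object limit concrete, here is a minimal sketch of the arithmetic and of the housekeeping it implies. It assumes one event per connection and that each worker thread waits on a full set of 64 events; the names `MAX_WAIT_OBJECTS`, `threads_needed`, and `event_slot` are illustrative only and are not part of Winsock.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the per-thread wait limit of 64 objects stated in the text. */
#define MAX_WAIT_OBJECTS 64

/* Worker threads needed if every connection owns one event and each
   thread waits on at most MAX_WAIT_OBJECTS events (rounded up). */
size_t threads_needed(size_t connections)
{
    return (connections + MAX_WAIT_OBJECTS - 1) / MAX_WAIT_OBJECTS;
}

/* The housekeeping part: map a connection index to the worker thread
   that owns its event and to the slot within that thread's wait array. */
void event_slot(size_t connection, size_t *thread, size_t *slot)
{
    assert(thread != NULL && slot != NULL);
    *thread = connection / MAX_WAIT_OBJECTS;
    *slot = connection % MAX_WAIT_OBJECTS;
}
```

Under these assumptions a server with 10,000 connections would need 157 threads just to wait for notifications, which is the kind of overhead a completion port avoids.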
Before delving deeper into the architecture of scalable Winsock applications, let's discuss the Microsoft-specific extensions that will aid us in developing scalable servers. These APIs are TransmitFile, AcceptEx, ConnectEx, TransmitPackets, DisconnectEx, and WSARecvMsg. There is a related extension function, GetAcceptExSockaddrs, which is used in conjunction with AcceptEx.
Before describing each of the extension API functions, it is important to point out that these functions are defined in MSWSOCK.H. Also, only three of the functions (TransmitFile, AcceptEx, and GetAcceptExSockaddrs) are actually exported from MSWSOCK.DLL. However, applications should avoid using those exports. Instead, applications should dynamically load each extension function, which is required in any case for all the remaining extension APIs. Not all providers have to support these APIs, so it is best to explicitly load them from the provider you are using. See Chapter 7 and the SIO_GET_EXTENSION_FUNCTION_POINTER ioctl for an example of how to load the extension APIs.
AcceptEx :
Perhaps the most useful extension API for scalable TCP/IP servers is AcceptEx. This function allows the server to post an asynchronous call that will accept the next incoming client connection. This function is defined as :
- BOOL
- PASCAL FAR
- AcceptEx (
- IN SOCKET sListenSocket,
- IN SOCKET sAcceptSocket,
- IN PVOID lpOutputBuffer,
- IN DWORD dwReceiveDataLength,
- IN DWORD dwLocalAddressLength,
- IN DWORD dwRemoteAddressLength,
- OUT LPDWORD lpdwBytesReceived,
- IN LPOVERLAPPED lpOverlapped
- );
The four parameters that follow sAcceptSocket are related. The lpOutputBuffer is required and is filled in with the local and remote addresses for the client connection as well as an optional buffer to receive the first data chunk received from the client. The dwReceiveDataLength indicates how many bytes of the supplied buffer should be used to receive data sent by the client. An application may choose not to receive data and may specify zero. The dwLocalAddressLength specifies the size of the socket address structure corresponding to the address family of the client socket plus 16 bytes. The local address of the client socket connection is placed in the lpOutputBuffer following the receive data if specified. The dwRemoteAddressLength is the same. The remote address of the client connection will be written to the lpOutputBuffer following the receive data (if specified) and the local address. Note that dwReceiveDataLength may be zero but dwLocalAddressLength and dwRemoteAddressLength cannot be.
The lpdwBytesReceived indicates the number of bytes received on the newly established client connection if the operation succeeds immediately. Finally, lpOverlapped is the WSAOVERLAPPED structure for this overlapped operation. This parameter is required; if you want to perform a blocking accept call, just use accept or WSAAccept.
Before going any further, let's take a quick look at an example using the AcceptEx function. The following code creates an IPv4 listening socket and posts a single AcceptEx :
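The way a single buffer is divided among the receive data and the two addresses can be sketched as plain arithmetic. The following is a portable illustration, not Winsock code: it hard-codes 16 as the size of an IPv4 SOCKADDR_IN so that it compiles anywhere, and the `accept_layout` type and `plan_accept_buffer` function are hypothetical names introduced only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: sizeof(SOCKADDR_IN) is 16 bytes; hard-coded so the sketch
   does not depend on the Windows headers. */
#define SOCKADDR_IN_SIZE 16u
/* Each address slot is the address structure size plus 16 bytes. */
#define ADDR_SLOT_SIZE (SOCKADDR_IN_SIZE + 16u)

/* Hypothetical helper type describing how the output buffer is split. */
typedef struct {
    size_t data_length;   /* value to pass as dwReceiveDataLength       */
    size_t local_offset;  /* where the local address region begins      */
    size_t remote_offset; /* where the remote address region begins     */
} accept_layout;

/* Give everything except the two address slots to receive data. */
accept_layout plan_accept_buffer(size_t buffer_size)
{
    accept_layout layout;
    assert(buffer_size >= 2 * ADDR_SLOT_SIZE); /* address slots are mandatory */
    layout.data_length = buffer_size - 2 * ADDR_SLOT_SIZE;
    layout.local_offset = layout.data_length;
    layout.remote_offset = layout.data_length + ADDR_SLOT_SIZE;
    return layout;
}
```

For a 1024-byte buffer this leaves 960 bytes for data, with the address regions starting at offsets 960 and 992. The offsets only mark regions of the buffer; the addresses themselves should still be decoded with GetAcceptExSockaddrs rather than read directly.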
- SOCKET s, sclient;
- HANDLE hCompPort;
- LPFN_ACCEPTEX lpfnAcceptEx=NULL;
- GUID GuidAcceptEx=WSAID_ACCEPTEX;
- // The WSAOVERLAPPEDPLUS type will be described in detail in
- // Chapter 12 and includes a WSAOVERLAPPED structure as well as
- // context information for the overlapped operation.
- WSAOVERLAPPEDPLUS ol;
- SOCKADDR_IN salocal;
- DWORD dwBytes;
- char buf[1024];
- int buflen=1024;
- // Create the completion port
- hCompPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
- NULL,
- (ULONG_PTR)0,
- 0
- );
- // Create the listening socket
- s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
- // Associate listening socket to completion port
- CreateIoCompletionPort((HANDLE)s,
- hCompPort,
- (ULONG_PTR)0,
- 0
- );
- // Bind the socket to the local port
- salocal.sin_family = AF_INET;
- salocal.sin_port = htons(5150);
- salocal.sin_addr.s_addr = htonl(INADDR_ANY);
- bind(s, (SOCKADDR *)&salocal, sizeof(salocal));
- // Set the socket to listening
- listen(s, 200);
- // Load the AcceptEx function
- WSAIoctl(s,
- SIO_GET_EXTENSION_FUNCTION_POINTER,
- &GuidAcceptEx,
- sizeof(GuidAcceptEx),
- &lpfnAcceptEx,
- sizeof(lpfnAcceptEx),
- &dwBytes,
- NULL,
- NULL
- );
- // Create the client socket for the accepted connection
- sclient = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
- // Initialize our "extended" overlapped structure
- memset(&ol, 0, sizeof(ol));
- ol.operation = OP_ACCEPTEX;
- ol.client = sclient;
- lpfnAcceptEx(s,
- sclient,
- buf,
- buflen - ((sizeof(SOCKADDR_IN) + 16) * 2),
- sizeof(SOCKADDR_IN) + 16,
- sizeof(SOCKADDR_IN) + 16,
- &dwBytes,
- &ol.overlapped
- );
- // Call GetQueuedCompletionStatus within the completion function
- // After the AcceptEx operation completes associate the accepted client
- // socket with the completion port
Also be aware that because of the high-performance nature of AcceptEx, the listening socket's socket attributes are not automatically inherited by the client socket. For the client socket to inherit them, the server must call setsockopt with SO_UPDATE_ACCEPT_CONTEXT on the client socket handle. See Chapter 7 for more information.
Another point to be aware of, which we mentioned in Chapter 5, is that if a receive buffer is specified to AcceptEx (for example, dwReceiveDataLength is greater than zero), then the overlapped operation will not complete until at least one byte of data has been received on the connection. So a malicious client could post many connections but never send any data. Chapter 5 discusses methods to prevent this by using the SO_CONNECT_TIME socket option. The AcceptEx function is available on Windows NT 4.0 and later versions.
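The decision a server makes after querying SO_CONNECT_TIME on each pending accept socket can be sketched as a small predicate. This is a portable illustration of the policy only: the getsockopt call itself is omitted, the function name `should_close_pending_accept` and the idle limit are assumptions, and the sketch relies on SO_CONNECT_TIME reporting 0xFFFFFFFF for a socket that is not yet connected.

```c
#include <assert.h>
#include <stdint.h>

/* Value SO_CONNECT_TIME reports when the socket is not connected. */
#define NOT_CONNECTED 0xFFFFFFFFu

/* Returns 1 if a pending AcceptEx socket has been connected for longer
   than max_idle_seconds without completing (that is, without sending
   any data), and 0 otherwise. connect_seconds is the value obtained
   from getsockopt with SO_CONNECT_TIME. */
int should_close_pending_accept(uint32_t connect_seconds,
                                uint32_t max_idle_seconds)
{
    /* The limit must not collide with the "not connected" sentinel. */
    assert(max_idle_seconds != NOT_CONNECTED);

    if (connect_seconds == NOT_CONNECTED)
        return 0; /* no client yet: keep the AcceptEx outstanding */

    return connect_seconds > max_idle_seconds;
}
```

A server might run this check periodically over its outstanding AcceptEx sockets and close those for which it returns 1, which forces the corresponding AcceptEx to complete with an error.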
GetAcceptExSockaddrs :
This is really a companion function to AcceptEx because it is required to decode the local and remote addresses contained within the buffer passed to the accept call. As you remember, a single buffer will contain any data received on the connection as well as the local and remote addresses for that connection. Any data indicated to be received will always be placed at the start of this buffer followed by the addresses. However, these addresses are in a packed form and the GetAcceptExSockaddrs function will decode them into the appropriate SOCKADDR structure for the address family. This function is defined as :
- VOID
- PASCAL FAR
- GetAcceptExSockaddrs (
- IN PVOID lpOutputBuffer,
- IN DWORD dwReceiveDataLength,
- IN DWORD dwLocalAddressLength,
- IN DWORD dwRemoteAddressLength,
- OUT struct sockaddr **LocalSockaddr,
- OUT LPINT LocalSockaddrLength,
- OUT struct sockaddr **RemoteSockaddr,
- OUT LPINT RemoteSockaddrLength
- );
The following code decodes the addresses from the buffer used in the earlier AcceptEx example :
- // buf and buflen were defined previously
- SOCKADDR *lpLocalSockaddr=NULL,
- *lpRemoteSockaddr=NULL;
- int LocalSockaddrLen=0,
- RemoteSockaddrLen=0;
- LPFN_GETACCEPTEXSOCKADDRS lpfnGetAcceptExSockaddrs=NULL;
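- // Load the GetAcceptExSockaddrs function
- // (s and dwBytes were defined in the AcceptEx example)
- GUID GuidGetAcceptExSockaddrs=WSAID_GETACCEPTEXSOCKADDRS;
- WSAIoctl(s,
- SIO_GET_EXTENSION_FUNCTION_POINTER,
- &GuidGetAcceptExSockaddrs,
- sizeof(GuidGetAcceptExSockaddrs),
- &lpfnGetAcceptExSockaddrs,
- sizeof(lpfnGetAcceptExSockaddrs),
- &dwBytes,
- NULL,
- NULL
- );
- // Decode the local and remote addresses from the buffer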
- lpfnGetAcceptExSockaddrs(
- buf,
- buflen - ((sizeof(SOCKADDR_IN) + 16) * 2),
- sizeof(SOCKADDR_IN) + 16,
- sizeof(SOCKADDR_IN) + 16,
- &lpLocalSockaddr,
- &LocalSockaddrLen,
- &lpRemoteSockaddr,
- &RemoteSockaddrLen
- );
TransmitFile :
TransmitFile is an extension API that allows an open file to be sent on a socket connection. This frees the application from having to manually open the file and repeatedly perform a read from the file, followed by writing that chunk of data on the socket. Instead, an open file handle is given along with the socket connection and the file data is read and sent on the socket all within kernel mode. This prevents the multiple kernel transitions required when you perform the file read yourself. This API is defined as :
- BOOL
- PASCAL FAR
- TransmitFile (
- IN SOCKET hSocket,
- IN HANDLE hFile,
- IN DWORD nNumberOfBytesToWrite,
- IN DWORD nNumberOfBytesPerSend,
- IN LPOVERLAPPED lpOverlapped,
- IN LPTRANSMIT_FILE_BUFFERS lpTransmitBuffers,
- IN DWORD dwReserved
- );
The TransmitFile function is useful for file-based I/O such as Web servers. In addition, one beneficial feature of TransmitFile is the capability of specifying the flags TF_DISCONNECT and TF_REUSE_SOCKET. When both of these flags are specified, the file and/or memory buffers are transmitted and the socket is disconnected once the send operation has completed. Also, the socket handle passed to the API can then be used as the client socket in AcceptEx or the connecting socket in ConnectEx. This is extremely beneficial because socket creation is very expensive. A server can use AcceptEx to handle client connections, then use TransmitFile to send data (specifying these flags), and afterward the socket handle may be used in a subsequent call to AcceptEx.
Note that you can call TransmitFile with a NULL file handle and NULL lpTransmitBuffers but still specify TF_DISCONNECT and TF_REUSE_SOCKET. This call will not send any data but allows the socket to be reused in AcceptEx. This is a good workaround for platforms that do not support the DisconnectEx API discussed later in this chapter. Finally, the TransmitFile function is available on Windows NT 4.0 and later versions. Also, because TransmitFile is geared toward server applications, it is fully functional only on server versions of Windows. On home and professional versions, there may be only two outstanding TransmitFile (or TransmitPackets) calls at any given time. If there are more, then they are queued and not processed until the executing calls are finished.
TransmitPackets :
The TransmitPackets extension is similar to TransmitFile because it too is used to send data. The difference between them is that TransmitPackets can send both files and memory buffers in any number and order. This function is defined as :
- typedef
- BOOL
- (PASCAL FAR * LPFN_TRANSMITPACKETS) (
- SOCKET hSocket,
- LPTRANSMIT_PACKETS_ELEMENT lpPacketArray,
- DWORD nElementCount,
- DWORD nSendSize,
- LPOVERLAPPED lpOverlapped,
- DWORD dwFlags
- );
- typedef struct _TRANSMIT_PACKETS_ELEMENT {
- ULONG dwElFlags;
- #define TP_ELEMENT_MEMORY 1
- #define TP_ELEMENT_FILE 2
- #define TP_ELEMENT_EOP 4
- ULONG cLength;
- union {
- struct {
- LARGE_INTEGER nFileOffset;
- HANDLE hFile;
- };
- PVOID pBuffer;
- };
- } TRANSMIT_PACKETS_ELEMENT, *PTRANSMIT_PACKETS_ELEMENT,
- FAR *LPTRANSMIT_PACKETS_ELEMENT;
A word of caution about using TransmitPackets with datagram sockets : the system is able to process and queue the send requests extremely fast, and it is possible that too many datagrams will pile up in the protocol driver. At this point, for unreliable protocols it is perfectly acceptable for the system to drop packets before they are even sent on the wire!
The TransmitPackets extension API is available on Windows XP and later versions and is subject to the same type of limitation as TransmitFile. On a non-server version of Windows, there can be only two outstanding TransmitPackets (or TransmitFile) calls at any given time.
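To illustrate how files and memory buffers can be mixed in one element array, here is a hedged sketch that builds a three-element sequence: a memory header, a file, and a memory trailer. It is portable stand-in code, not the real API: `tp_element` is a hypothetical mirror of TRANSMIT_PACKETS_ELEMENT with a named union and fixed-width stand-ins for ULONG, LARGE_INTEGER, and HANDLE, and the helper names are invented for this example. The flag values are the ones shown in the definition above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values as given in the structure definition in the text. */
#define TP_ELEMENT_MEMORY 1u
#define TP_ELEMENT_FILE   2u
#define TP_ELEMENT_EOP    4u

/* Hypothetical portable mirror of TRANSMIT_PACKETS_ELEMENT. */
typedef struct {
    uint32_t flags;   /* stands in for dwElFlags */
    uint32_t length;  /* stands in for cLength   */
    union {
        struct { int64_t offset; void *handle; } file;
        void *buffer;
    } u;
} tp_element;

/* Describe a block of memory to send. */
tp_element tp_memory(void *buffer, uint32_t length)
{
    tp_element e;
    assert(buffer != NULL);
    memset(&e, 0, sizeof e);
    e.flags = TP_ELEMENT_MEMORY;
    e.length = length;
    e.u.buffer = buffer;
    return e;
}

/* Describe a range of an open file to send. */
tp_element tp_file(void *handle, int64_t offset, uint32_t length)
{
    tp_element e;
    assert(handle != NULL);
    memset(&e, 0, sizeof e);
    e.flags = TP_ELEMENT_FILE;
    e.length = length;
    e.u.file.offset = offset;
    e.u.file.handle = handle;
    return e;
}

/* Fill out[0..2] with header, file, and trailer, in that order.
   Returns the element count to pass alongside the array. */
size_t tp_build_response(tp_element out[3], char *header,
                         void *file, uint32_t file_length, char *trailer)
{
    out[0] = tp_memory(header, (uint32_t)strlen(header));
    out[1] = tp_file(file, 0, file_length);
    out[2] = tp_memory(trailer, (uint32_t)strlen(trailer));
    return 3;
}
```

In a real program the same ordering would be expressed with TRANSMIT_PACKETS_ELEMENT entries and passed to the function pointer obtained for TransmitPackets, together with the element count.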
ConnectEx :
The ConnectEx extension function is a much-needed API available with Windows XP and later versions. This function allows for overlapped connect calls. Previously, the only way to issue multiple connect calls without using one thread for each connect was to use multiple non-blocking connects, which can be cumbersome to manage. This function is defined as :
- typedef
- BOOL
- (PASCAL FAR *LPFN_CONNECTEX) (
- IN SOCKET s,
- IN const struct sockaddr FAR *name,
- IN int namelen,
- IN PVOID lpSendBuffer,
- IN DWORD dwSendDataLength,
- OUT LPDWORD lpdwBytesSent,
- IN LPOVERLAPPED lpOverlapped
- );
As with the AcceptEx function, because ConnectEx is designed for performance, any previously set socket options or attributes are not automatically copied to the connected socket. To apply them, the application must call setsockopt with SO_UPDATE_CONNECT_CONTEXT on the socket after the connection is established. In addition, as with AcceptEx, socket handles that have been “disconnected and re-used,” either by TransmitFile, TransmitPackets, or DisconnectEx, may be used as the socket parameter to ConnectEx.
There isn't anything difficult about the ConnectEx API. The only requirement is that the socket passed into ConnectEx must already be bound with a call to bind. There are no special flags, and it is simply an overlapped version of connect with the optional bonus of sending a block of data after the connection is established.
DisconnectEx :
This extension API is simple. It takes a socket handle, performs a transport-level disconnect, and prepares the socket handle for re-use in a subsequent AcceptEx call. Both the TransmitFile and TransmitPackets APIs allow the socket to be disconnected and re-used after the send operation completes, but this standalone API was introduced for those applications that don't use either of those two APIs before shutting down. This extension API is available with Windows XP or later versions. However, for Windows 2000 or Windows NT 4.0 it is possible to call TransmitFile with a null file handle and buffers but specify the disconnect and re-use flags, which will achieve the same results. This API is defined as :
- typedef
- BOOL
- (PASCAL FAR * LPFN_DISCONNECTEX) (
- IN SOCKET s,
- IN LPOVERLAPPED lpOverlapped,
- IN DWORD dwFlags,
- IN DWORD dwReserved
- );
WSARecvMsg :
This last extension function is not especially relevant to the discussion of high-performance, scalable I/O, but it is new to Windows XP (and later versions) and we chose to be consistent and cover it with the rest of the extension APIs. WSARecvMsg is essentially a more elaborate WSARecv that also returns information about which interface the packet was received on. This is useful for datagram sockets that are bound to the local wildcard address on a multihomed machine and need to know which interface a packet arrived on. This function is defined as :
- typedef
- INT
- (PASCAL FAR * LPFN_WSARECVMSG) (
- IN SOCKET s,
- IN OUT LPWSAMSG lpMsg,
- OUT LPDWORD lpdwNumberOfBytesRecvd,
- IN LPWSAOVERLAPPED lpOverlapped,
- IN LPWSAOVERLAPPED_COMPLETION_ROUTINE lpCompletionRoutine
- );
- typedef struct _WSAMSG {
- LPSOCKADDR name; /* Remote address */
- INT namelen; /* Remote address length */
- LPWSABUF lpBuffers; /* Data buffer array */
- DWORD dwBufferCount; /* Number of elements in the array */
- WSABUF Control; /* Control buffer */
- DWORD dwFlags; /* Flags */
- } WSAMSG, *PWSAMSG, * FAR LPWSAMSG;
By default, no control information is returned when WSARecvMsg is called. To enable control information, one or more socket options must be set on the socket, indicating the type of information to be returned. Currently, only one option is supported, which is IP_PKTINFO for IPv4 and IPV6_PKTINFO for IPv6. These options return information about which local interface the packet was received on. See Chapter 7 for more information about setting these options.
Once the appropriate socket option is set and the WSARecvMsg completes, the control information requested is returned via the Control buffer specified in the WSAMSG parameter. Each type of information requested is preceded by a WSACMSGHDR structure that indicates the type of information following as well as its size. This header structure is defined as :
- typedef struct _WSACMSGHDR {
- SIZE_T cmsg_len;
- INT cmsg_level;
- INT cmsg_type;
- /* followed by UCHAR cmsg_data[] */
- } WSACMSGHDR, *PWSACMSGHDR, FAR *LPWSACMSGHDR;
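Walking the control buffer amounts to reading a header, checking its level and type, and skipping ahead by its length. The sketch below shows that loop in portable C. It is a simplified stand-in, not the Winsock mechanism: `cmsg_hdr` mirrors the three fields of WSACMSGHDR, `find_cmsg` is a hypothetical helper, and the alignment rule (round up to the size of `size_t`) is an assumption made for illustration. On Windows, the WSA_CMSG_FIRSTHDR, WSA_CMSG_NXTHDR, and WSA_CMSG_DATA macros should be used instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical portable mirror of the WSACMSGHDR fields. */
typedef struct {
    size_t cmsg_len;   /* header plus data, in bytes */
    int    cmsg_level;
    int    cmsg_type;
} cmsg_hdr;

/* Assumption: headers and data are aligned to sizeof(size_t). */
#define CMSG_ALIGN_UP(n) \
    (((size_t)(n) + sizeof(size_t) - 1) & ~(sizeof(size_t) - 1))

/* Scan a control buffer for the first message with the given level and
   type. Returns a pointer to its data and stores the data length in
   *data_len, or returns NULL if no such message is found. */
const unsigned char *find_cmsg(const unsigned char *control,
                               size_t control_len,
                               int level, int type, size_t *data_len)
{
    const size_t hdr_size = CMSG_ALIGN_UP(sizeof(cmsg_hdr));
    size_t offset = 0;

    assert(control != NULL && data_len != NULL);

    while (offset <= control_len && control_len - offset >= hdr_size) {
        cmsg_hdr hdr;
        memcpy(&hdr, control + offset, sizeof hdr);

        /* Stop on a malformed or truncated message. */
        if (hdr.cmsg_len < hdr_size || hdr.cmsg_len > control_len - offset)
            break;

        if (hdr.cmsg_level == level && hdr.cmsg_type == type) {
            *data_len = hdr.cmsg_len - hdr_size;
            return control + offset + hdr_size;
        }
        offset += CMSG_ALIGN_UP(hdr.cmsg_len); /* advance to next header */
    }
    return NULL;
}
```

With IP_PKTINFO enabled, the data located this way would be the packet information structure describing the receiving interface.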