Connecting a socket in C

From LQWiki

A socket is like a file descriptor: it can be written to and read from, and it can also be used as a listening socket. The program listed here creates a socket and connects it to a remote computer. We assume that there is a web-server at the other end. Once the socket is connected, the program sends a request, reads what the remote machine sends back over the socket and prints it to the screen. Then the program exits.

connect_socket.c

 /*1*/ #include <stdio.h>
 /*2*/ #include <string.h>
 /*3*/ #include <stdlib.h>
 /*4*/ #include <unistd.h>
 /*5*/ #include <fcntl.h>

 /*6*/ #include <netinet/tcp.h>
 /*7*/ #include <sys/socket.h>
 /*8*/ #include <sys/types.h>
 /*9*/ #include <netinet/in.h>
 /*10*/ #include <netdb.h>

 /*11*/ int socket_connect(char *host, in_port_t port){
 /*12*/         struct hostent *hp;
 /*13*/         struct sockaddr_in addr;
 /*14*/         int on = 1, sock;     

 /*15*/         if((hp = gethostbyname(host)) == NULL){
 /*16*/                 herror("gethostbyname");
 /*17*/                 exit(1);
 /*18*/         }
 /*19*/         bcopy(hp->h_addr, &addr.sin_addr, hp->h_length);
 /*20*/         addr.sin_port = htons(port);
 /*21*/         addr.sin_family = AF_INET;
 /*22*/         sock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
 /*23*/         setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, (const char *)&on, sizeof(int));
 /*24*/         if(sock == -1){
 /*25*/                 perror("setsockopt");
 /*26*/                 exit(1);
 /*27*/         }
 /*28*/         if(connect(sock, (struct sockaddr *)&addr, sizeof(struct sockaddr_in)) == -1){
 /*29*/                 perror("connect");
 /*30*/                 exit(1);
 /*31*/         }
 /*32*/         return sock;
 /*33*/ }
 
 /*34*/ #define BUFFER_SIZE 1024

 /*35*/ int main(int argc, char *argv[]){
 /*36*/         int fd;
 /*37*/         char buffer[BUFFER_SIZE];

 /*38*/         if(argc < 3){
 /*39*/                 fprintf(stderr, "Usage: %s <hostname> <port>\n", argv[0]);
 /*40*/                 exit(1); 
 /*41*/         }
       
 /*42*/         fd = socket_connect(argv[1], atoi(argv[2])); 

 /*43*/         write(fd, "GET /\r\n", strlen("GET /\r\n"));  
 
 /*44*/         bzero(buffer, BUFFER_SIZE);

 /*45*/         while(read(fd, buffer, BUFFER_SIZE - 1) != 0){
 /*46*/                 fprintf(stderr, "%s", buffer);
 /*47*/                 bzero(buffer, BUFFER_SIZE);
 /*48*/         }

 /*49*/         shutdown(fd, SHUT_RDWR); 
 /*50*/         close(fd); 

 /*51*/         return 0;
 /*52*/ }

Note first that the line numbers at the beginning of each line are only there to make it easier to go through the program. They are written as C comments, so the compiler ignores them, but you may remove them if you prefer.

To compile the program you need a C compiler such as GCC. Provided that you saved the code in a file called connect_socket.c, type the following in your shell:

$ gcc -o connect_socket connect_socket.c 

This will produce a program called connect_socket. You can try it using:

$ ./connect_socket linuxdocs.org 80

Web-servers typically listen on port 80, so if there is a web-server at linuxdocs.org you should get its index page.

Now to the program, what does it do? We start at line 35. This is where all C programs start (this is not always true, but in most cases at least). This is a function called main that takes two arguments: int argc and char *argv[]. argc is the number of arguments to the program, counting the program name itself. The arguments are the ones you provided on the command line, for example "linuxdocs.org" and "80" in the case above. They are stored in argv, which is an array of pointers to strings (char arrays). The program name is always stored in the first element, so if you want to print the program's name you can try:

printf("Hello I'm a program called: %s\n", argv[0]);

Further, at line 36, we declare a file descriptor called fd. We will use it later to read from the socket we are about to create. Also, a char array named buffer is declared to store the incoming data in; more about that later. We see that we want the array to be of size BUFFER_SIZE. Previously, on line 34, we defined BUFFER_SIZE to be 1024. The #define statement is a preprocessor directive. For now we just state that at every place where we use the word BUFFER_SIZE we get the value 1024.

At line 38 we check that the user has supplied the program with the correct number of arguments.

/*38*/         if(argc < 3){
/*39*/                 fprintf(stderr, "Usage: %s <hostname> <port>\n", argv[0]);
/*40*/                 exit(1); 
/*41*/         }

That is, we test argc < 3. Recall that argv should hold three entries here: first the name of the program, then the two arguments the user supplied. So we are really interested in argv[1] and argv[2]. If the user fails to provide enough arguments we print an error message on line 39 and exit the program on line 40. What is this stderr you can see on line 39, the first argument to fprintf? There are three standard file streams that most operating systems provide: stdin, stdout and stderr. stderr is an unbuffered output stream, so everything written to it is passed on to the terminal right away. stdout, however, is usually buffered and may hold on to its data for a while. If the program terminates in some bad way, like a segmentation fault, it's not guaranteed that things written to stdout will be displayed. Therefore error messages are usually written to stderr so that we are sure to see them. If you're interested in fprintf you should check its man page: man fprintf. Please note that it's considered good conduct to check the arguments provided to a program; it can't really hurt.

On line 42 we call the function socket_connect. It looks like this:

/*42*/         fd = socket_connect(argv[1], atoi(argv[2])); 

The first argument provided to the program is the host we wish to connect to, so we simply pass that argument to the function. The second argument is the port. Here, however, we get a string, i.e. "80", but need an integer. The conversion is done by the function atoi(), which takes a string and tries to convert it to an integer. atoi has no way to report an error: if you give it "hello" it typically returns 0, which cannot be told apart from a genuine "0", and if the number is too large to fit in an int the behaviour is undefined. So be careful not to assume that whatever the user typed was a valid port number.

Now back to line 11 where the function socket_connect is defined. The first part looks like this:
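If you want the program to reject bad port arguments instead of guessing, strtol can be used in place of atoi, since it reports where parsing stopped and whether the value overflowed. The following is only a sketch: the helper name parse_port and its return convention are made up for this example and are not part of the program above.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not part of connect_socket.c.
 * Parses a decimal TCP port. Returns 0 and stores the port in *out on
 * success, or -1 if the text is not a number in the range 1..65535. */
int parse_port(const char *text, unsigned short *out)
{
        char *end;
        long value;

        errno = 0;
        value = strtol(text, &end, 10);
        if (end == text || *end != '\0')        /* no digits, or trailing junk */
                return -1;
        if (errno == ERANGE || value < 1 || value > 65535)
                return -1;
        *out = (unsigned short)value;
        return 0;
}
```

In main you could then call parse_port(argv[2], &port) and print the usage message when it returns -1, rather than passing atoi(argv[2]) straight to socket_connect.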

/*11*/ int socket_connect(char *host, in_port_t port){
/*12*/         struct hostent *hp;
/*13*/         struct sockaddr_in addr;
/*14*/         int on = 1, sock;     

Here we have declared a variable hp, a pointer to a struct hostent. This variable is used later when we try to figure out the host address associated with the hostname that we gave the program. We'll look at that a bit later. addr is a variable of type struct sockaddr_in. This variable is used later when we open the connection to the remote host. Further, we declare a variable called on, which helps us later. And last a variable called sock; this is the actual file descriptor that we will associate the opened socket with later on.

Now we come to the part where we actually try to resolve the host address associated with the host name. This is done on line 15.

/*15*/         if((hp = gethostbyname(host)) == NULL){
/*16*/                 herror("gethostbyname");
/*17*/                 exit(1);
/*18*/         }

The function called gethostbyname takes a char* as argument, which might be something like "www.google.com" or "192.168.0.1". It returns a pointer to a struct of type struct hostent. We check that we get something sensible out of it, i.e. if the pointer was assigned the value NULL, an error has occurred. Exactly what happened is unknown at this point, but the function herror can tell us. Therefore we call herror with the argument "gethostbyname". For example, we might try to look up a hostname that does not exist.

$ ./connect_socket www.hshsasjdhas.dfhsaj 80
gethostbyname: Unknown host

The manpage for gethostbyname will get you additional useful information.
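The gethostbyname manpage also notes that the function is considered obsolete and that new programs should use getaddrinfo instead. As a rough sketch of what that could look like for the IPv4 case used in this article (the helper name resolve_ipv4 is invented here, and it only keeps the first address returned):

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

/* Hypothetical helper, not part of connect_socket.c.
 * Resolves host and service (e.g. "80") to the first IPv4 TCP address.
 * Returns 0 on success, or a getaddrinfo error code that can be turned
 * into a message with gai_strerror(). */
int resolve_ipv4(const char *host, const char *service, struct sockaddr_in *out)
{
        struct addrinfo hints, *res;
        int rc;

        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_INET;              /* IPv4 only, as in the article */
        hints.ai_socktype = SOCK_STREAM;        /* TCP */

        rc = getaddrinfo(host, service, &hints, &res);
        if (rc != 0)
                return rc;
        /* The result already has family, address and port (in network
         * byte order) filled in, so no separate htons call is needed. */
        memcpy(out, res->ai_addr, sizeof *out);
        freeaddrinfo(res);
        return 0;
}
```

The filled-in struct can be passed to connect in the same way as addr in the listing.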

On line 19 we take the result from gethostbyname, that is hp, and use the part of the struct called h_addr. This part contains the IP address of the host, typically encoded as 4 bytes. This is not always the case, so rather than assuming anything about the length we use hp->h_length, a field that tells us exactly how long the address is. You should check, again, the manpage for gethostbyname if you're interested in what the struct contains. Anyway, we use bcopy to copy the address into the part of addr called sin_addr. Note that bcopy takes the source first and the destination second. bcopy works on memory addresses, so we cannot hand it the sin_addr field itself; we have to give it a pointer to the field. That is done by using the & operator. This might be very confusing at first, but rather than covering all the details here we just state that you have to do it like that. Anyway, now the address is copied into addr.

/*19*/         bcopy(hp->h_addr, &addr.sin_addr, hp->h_length);
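Two details are worth a second look here: addr is never cleared before use, and bcopy is a legacy function whose argument order is the reverse of memcpy. A possible variant using memset and memcpy is sketched below. The helper fill_addr is hypothetical, and the length check is an added precaution that the original program does not have.

```c
#include <string.h>
#include <netinet/in.h>

/* Hypothetical helper, not part of connect_socket.c.
 * Builds an IPv4 socket address from raw address bytes (such as
 * hp->h_addr) and a port given in host byte order.
 * Returns 0 on success, -1 if the address is not 4 bytes long. */
int fill_addr(struct sockaddr_in *addr, const void *ip, size_t ip_len,
              in_port_t port)
{
        if (ip_len != sizeof addr->sin_addr)    /* would not fit in sin_addr */
                return -1;
        memset(addr, 0, sizeof *addr);          /* no stale stack bytes */
        memcpy(&addr->sin_addr, ip, ip_len);    /* memcpy(dest, src, n) */
        addr->sin_port = htons(port);
        addr->sin_family = AF_INET;
        return 0;
}
```

With this helper, lines 19 to 21 would collapse into a single call such as fill_addr(&addr, hp->h_addr, hp->h_length, port).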

What about the port? Internet connected machines can listen on at most 65536 different ports (numbered 0 to 65535), but mostly they listen on just a few of them. The typical webserver listens on port 80, so that is what we try. Now comes an interesting function called htons.

/*20*/         addr.sin_port = htons(port);

Some architectures like SPARC use something called BIG ENDIAN byte order, and some like Intel and clones use LITTLE ENDIAN byte order. What is the difference? It all has to do with the order in which they store the bytes of an integer. As an example, take a 32 bit integer. We can think of it as being built out of 4 bytes: one byte is 8 bits, thus 8*4 = 32. Call the bytes A,B,C,D, where A is the first byte, B the second and so on. If BIG ENDIAN stores them as A,B,C,D, then LITTLE ENDIAN stores them as D,C,B,A. Alright, so they have different ways to encode the same number. What's kind of interesting here is that it's not a reverse order of the bits, but of the bytes. Now, if we wish to send binary data from one machine to another, it is very useful to know how both machines interpret and encode integers. And we are about to send a packet over the Internet to a host of unknown architecture, so we had better take some precautions. To deal with this matter it has been decided that the Internet uses BIG ENDIAN byte order. Simple as that. The htons function, which is short for host to network short, changes the byte order of a 16 bit value such as a port number if necessary. A way to check what byte order your machine has is to run the following test:

printf("%d\n", htons(666)); 

If it prints 39426 you're on a machine that uses LITTLE ENDIAN, and if it prints 666 you're on a BIG ENDIAN machine. Continuing with line 21, we simply tell the addr struct that we are interested in the Address Family InterNET, AF_INET.

/*21*/         addr.sin_family = AF_INET;
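Coming back to byte order for a moment, the htons(666) test can be checked by hand: 666 is 0x029A, and swapping its two bytes gives 0x9A02, which is 39426. The two small functions below are illustrative only (both names are made up for this sketch). One detects the byte order of the host directly, the other does the swap that htons performs on a little-endian machine.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical helpers for illustration, not part of connect_socket.c. */

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int is_little_endian(void)
{
        uint16_t probe = 1;
        return *(unsigned char *)&probe == 1;   /* is the low byte stored first? */
}

/* Swaps the two bytes of a 16 bit value. */
uint16_t swap16(uint16_t v)
{
        return (uint16_t)((v << 8) | (v >> 8));
}
```

On a little-endian host htons(666) and swap16(666) give the same value; on a big-endian host htons leaves the value unchanged.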

When this is done we create a socket. As told by the socket manpage (try man -S 2 socket if you get nothing, or unrelated info), this simply creates a communication endpoint. The socket is not connected to anything yet, but we specify some interesting attributes for it. First we use something called PF_INET, which specifies which protocol family we want to use. PF_INET corresponds to Protocol Family IPv4. You could for instance use PF_INET6, which corresponds to IPv6, or PF_IPX, which is the Novell protocol, and so on. Then we tell the socket function that we are interested in using SOCK_STREAM; this argument corresponds to the type of communication. SOCK_STREAM typically corresponds to two way reliable communication. You could for example use SOCK_DGRAM here if you want to send datagram packets. Last, we specify that we are interested in a TCP connection by giving IPPROTO_TCP as argument. Again, check the socket manpage for more details.

/*22*/         sock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);

More options; this is getting more and more complex. We use setsockopt to set some options for the socket. First we tell it to use IPPROTO_TCP, again. Then we specify an option for this protocol, namely that we are interested in no delay communication, using TCP_NODELAY. Recall the variable on from before, which we set to 1. Sending it to the function setsockopt means that we are interested in enabling TCP_NODELAY rather than disabling it; 0 would do that. Interestingly enough we send a pointer: recall that & gets the address of a variable, in this case on. We also tell how large this variable is by sending sizeof(int) as the last argument. setsockopt is a quite useful function that can manipulate a lot of properties that sockets have. Check out the manpage for setsockopt to get more details.

/*23*/         setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, (const char *)&on, sizeof(int));
/*24*/         if(sock == -1){
/*25*/                 perror("setsockopt");
/*26*/                 exit(1);
/*27*/         }

If this option manipulation fails for some reason, maybe because an option we try to enable is not available for the type of communication we want to use, setsockopt returns -1. Note, however, that the check on lines 24 to 27 does not look at that return value: it tests sock, so what it really detects is a failure of the socket call on line 22, and the return value of setsockopt is ignored. The general idea is the same as with gethostbyname before, but we use another error function here, perror. Check the manpage for perror if you're interested.
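Since the listing calls setsockopt before it has tested sock, and never looks at what setsockopt returns, a more careful version would check both calls separately. The following is one way to write that, as a sketch only. The helper name make_tcp_socket is invented, and it returns -1 instead of calling exit so that the caller can decide what to do.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Hypothetical helper, not part of connect_socket.c.
 * Creates a TCP socket with TCP_NODELAY enabled.
 * Returns the descriptor, or -1 after reporting which call failed. */
int make_tcp_socket(void)
{
        int on = 1;
        int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

        if (sock == -1) {                       /* check socket() first */
                perror("socket");
                return -1;
        }
        if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on) == -1) {
                perror("setsockopt");
                close(sock);                    /* do not leak the descriptor */
                return -1;
        }
        return sock;
}
```

This way the message printed by perror names the call that actually failed.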

Now, at last, we are ready to connect our socket to a remote machine. The function connect does this. We use the sock variable that we have done a lot of things with. We also use the addr variable, which holds the information about where we wish to connect. Observe that we cast the address of addr to a pointer of type struct sockaddr. Again we use the & operator to get the address of the struct. We also tell connect how large this struct is by using sizeof(struct sockaddr_in). Then we check the return value; if it is -1 we have problems. For instance, we might try to connect to a port that the machine doesn't listen on. For example:

$ ./connect_socket localhost 6677
connect: Connection refused

/*28*/         if(connect(sock, (struct sockaddr *)&addr, sizeof(struct sockaddr_in)) == -1){
/*29*/                 perror("connect");
/*30*/                 exit(1);
/*31*/         }
/*32*/         return sock;

Error handling here is very similar to the above examples. Now we return the newly connected socket on line 32. Back in the main function, we want to read things from this socket.

But how do we make the web-server at the other end send us anything? Luckily the procedure is very simple. Sending "GET /\r\n" to a web-server tells the server that we want the root of the server, which often defaults to the index page. This is the oldest and simplest form of an HTTP request; many servers today expect a fuller request and may answer with an error page instead. The "\r\n" is just a standard way of telling the server at the other end that we won't send anything else on the same line, so it's safe to interpret the line as is. The function write does this for us. It takes 3 parameters: the file-descriptor fd, that is the socket, then the string we want to send, that is "GET /\r\n", and last the length of that string.
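Calling exit inside socket_connect is fine for a small demo, but a helper that hands the failure back to its caller is easier to reuse. A minimal sketch of that idea follows. The name connect_to is made up, and it assumes the address struct has already been filled in as on lines 19 to 21.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper, not part of connect_socket.c.
 * Connects a new TCP socket to the given IPv4 address.
 * Returns the connected descriptor, or -1 on failure. */
int connect_to(const struct sockaddr_in *addr)
{
        int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

        if (sock == -1) {
                perror("socket");
                return -1;
        }
        if (connect(sock, (const struct sockaddr *)addr, sizeof *addr) == -1) {
                perror("connect");
                close(sock);            /* release the socket before giving up */
                return -1;
        }
        return sock;
}
```

The caller can then print its own message or try another address when -1 comes back.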

/*43*/         write(fd, "GET /\r\n", strlen("GET /\r\n"));  
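Line 43 ignores the return value of write. On a socket, write may send fewer bytes than requested, or fail, so a more defensive program loops until everything has been sent. Here is a sketch of such a loop. The helper name write_all is an assumption of this example, not something the original program defines.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper, not part of connect_socket.c.
 * Writes all len bytes of data to fd, retrying after short writes.
 * Returns 0 on success, -1 on error (errno is set by write). */
int write_all(int fd, const char *data, size_t len)
{
        while (len > 0) {
                ssize_t n = write(fd, data, len);
                if (n == -1) {
                        if (errno == EINTR)     /* interrupted, try again */
                                continue;
                        return -1;
                }
                data += n;                      /* skip what was sent */
                len -= (size_t)n;
        }
        return 0;
}
```

Line 43 could then be written as write_all(fd, "GET /\r\n", strlen("GET /\r\n")), with a check of the result.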

After that we take the buffer we declared before and set all bytes in this buffer to 0. This is to avoid junk data that the buffer might contain.

/*44*/         bzero(buffer, BUFFER_SIZE);

Then, as long as read indicates that there are still things to read, we read from the socket. read returns the number of bytes that have been read. We simply assume that if we get the result 0 then we have read all available data. This is generally true when we work with blocking IO: read waits until it can read something, which is good since it might take some time for the data to travel over the Internet, and it returns 0 only once the other end has closed the connection. Note that read returns -1 on error, which this loop does not check for. The arguments to read are the file-descriptor fd, i.e. the one we are reading from, the char array buffer in which we store the data, and lastly the maximum number of bytes we want to read each time. But why not read exactly BUFFER_SIZE bytes? We read at most BUFFER_SIZE - 1 bytes so that the last byte in the char array is always 0, thanks to the call to bzero before. When we print the contents of the buffer using fprintf on line 46, fprintf must know when to stop printing, and it stops when it sees 0, or '\0' if you want the char value for 0. Otherwise we would print other things in memory that come after the buffer, something that might end in a segmentation fault when we try to read memory we have no access to. After we have printed the message we bzero the buffer again and continue until no data is left to print.

/*45*/         while(read(fd, buffer, BUFFER_SIZE - 1) != 0){
/*46*/                 fprintf(stderr, "%s", buffer);
/*47*/                 bzero(buffer, BUFFER_SIZE);
/*48*/         }
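The loop above keeps going when read returns -1, and printing with "%s" stops at the first 0 byte in the data. An alternative that uses the byte count returned by read, and therefore needs no bzero, is sketched below. The helper name copy_fd is hypothetical, and the choice to pass the output stream as a parameter is just for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not part of connect_socket.c.
 * Copies everything that can be read from fd to the stream out.
 * Returns the number of bytes copied, or -1 on a read or write error. */
long copy_fd(int fd, FILE *out)
{
        char buffer[1024];
        long total = 0;
        ssize_t n;

        for (;;) {
                n = read(fd, buffer, sizeof buffer);
                if (n == 0)                     /* other end closed the connection */
                        break;
                if (n == -1) {
                        if (errno == EINTR)     /* interrupted, try again */
                                continue;
                        return -1;
                }
                /* Write exactly the n bytes we received. */
                if (fwrite(buffer, 1, (size_t)n, out) != (size_t)n)
                        return -1;
                total += n;
        }
        return total;
}
```

In main, lines 44 to 48 could be replaced by a call such as copy_fd(fd, stdout).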

When we're done we shut down the socket using shutdown, and specify that we are no longer interested in reading (RD) or writing (WR) by passing SHUT_RDWR. After that we close the file descriptor using close and return 0 to the shell, just for the sake of good conduct.

/*49*/         shutdown(fd, SHUT_RDWR); 
/*50*/         close(fd); 

/*51*/         return 0;

That's about it. Quite frankly, this example might be a bit hard to begin with, since it's lengthy and contains a lot of socket yadda yadda. But I assume that most people would want something more 'useful' than another hello world described in great detail.