
Tuesday 23 February 2016

MongoDB Recipes: Securing MongoDB using x.509 Certificate

Enabling SSL x.509 certificate to authenticate the members of a replica set

Instead of using simple plain-text keyfiles, MongoDB replica set or sharded cluster members can use x.509 certificates to verify their membership.
The member certificate, used for internal authentication to verify membership to the sharded cluster or a replica set, must have the following properties (Source: https://docs.mongodb.org/manual/tutorial/configure-x509-member-authentication/#member-x-509-certificate):
  • A single Certificate Authority (CA) must issue all the x.509 certificates for the members of a sharded cluster or a replica set.
  • The Distinguished Name (DN), found in the member certificate’s subject, must specify a non-empty value for at least one of the following attributes: Organization (O), the Organizational Unit (OU) or the Domain Component (DC).
  • The Organization attributes (O's), the Organizational Unit attributes (OU's), and the Domain Components (DC's) must match those from the certificates for the other cluster members.
  • Either the Common Name (CN) or one of the Subject Alternative Name (SAN) entries must match the hostname of the server, used by the other members of the cluster.
  • If the certificate includes the Extended Key Usage (extendedKeyUsage) setting, the value must include clientAuth (“TLS Web Client Authentication”).

We have an existing MongoDB 3.2.1 replica set running on three Ubuntu 14.04 VMs, named server2, server3 and server4. This replica set does not have authentication enabled. We are going to update it to use x.509 certificates for member authentication.
For this document I am going to use self-signed certificates. Self-signed certificates are less secure because there is no trusted third party vouching for their authenticity, but for an internal private setup they are acceptable.
First we will create our own CA (certificate authority), which will sign the certificates of our mongodb servers. I will start by creating the private key for the CA.
openssl genrsa -out mongoPrivate.key -aes256

Next I will create the CA certificate:
openssl req -x509 -new -extensions v3_ca -key mongoPrivate.key -days 1000 -out mongo-CA-cert.crt
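To sanity-check the CA certificate we just generated, we can print its subject, issuer and validity period (an optional verification step, not part of the original flow):
openssl x509 -in mongo-CA-cert.crt -noout -subject -issuer -dates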

Our CA is ready to sign our certificates. Now we will create private keys for our mongodb servers and generate a CSR (certificate signing request) file for each mongodb server. We will then use the CA certificate created above to generate certificates for our mongod servers from the CSR files.
A single command creates both the private key and the CSR for a server. We have to create a private key and CSR on every mongodb node, so we will run this command on each of our three servers: server2, server3 and server4.

openssl req -new -nodes -newkey rsa:2048 -keyout server2.key -out server2.csr

While generating the CSR on our 3 servers, we will keep all fields in the certificate request the same, except the Common Name. We will set the Common Name to the hostname of the server on which we run the command.
We are using the -nodes option, which leaves the private key unencrypted, so that we don't have to supply a password when starting our MongoDB server.
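If you prefer not to answer the interactive prompts, the DN can be supplied with -subj; a sketch for server2, assuming the attribute values that show up in the authentication logs later in this post:
openssl req -new -nodes -newkey rsa:2048 -keyout server2.key -out server2.csr \
    -subj "/C=IN/ST=Maharashtra/L=Mumbai/O=Pe-Kay/OU=Technologies/CN=server2/emailAddress=pranabksharma@gmail.com"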
Creating Private Key and CSR for server2:
[screenshot]

Now we will sign server2's CSR with our CA certificate and generate server2's public certificate.
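A typical signing command looks like the following (a sketch, assuming the CA key and certificate created above; the output file name server2.crt and the 1000-day validity are my own choices):
openssl x509 -req -in server2.csr -CA mongo-CA-cert.crt -CAkey mongoPrivate.key -CAcreateserial -out server2.crt -days 1000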

Creating Private Key and CSR for server3:
[screenshot]

Creating Private Key and CSR for server4:
[screenshot]
Generating certificates of server3 and server4:
[screenshot]
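Once a certificate has been issued, a quick way to confirm that its DN meets the member-certificate requirements listed earlier and that it chains back to our CA (a sanity check; the .crt file names are assumptions carried over from the signing step above):
openssl x509 -in server3.crt -noout -subject
openssl verify -CAfile mongo-CA-cert.crt server3.crt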
Once certificates for all three servers are generated, we copy each server's certificate and the CA certificate to that server.
We have to concatenate the private key and the public certificate of a server into a single .pem file.
The cat command concatenates the private key and the certificate into a PEM file suitable for MongoDB.
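A sketch of that concatenation step, assuming the key and certificate file names used above (run the matching command on each server):
cat server2.key server2.crt > server2.pem    # on server2
cat server3.key server3.crt > server3.pem    # on server3
cat server4.key server4.crt > server4.pem    # on server4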

Changes in MongoDB configuration file:

In all three mongodb servers we have to add the config options shown below. For PEMKeyFile and clusterFile we have to use the .pem file of that particular server.
Config file in server3:
net:
  port: 27017
  bindIp: 0.0.0.0
  ssl:
        mode: preferSSL
        PEMKeyFile: /mongodb/config/server3.pem
        CAFile: /mongodb/config/mongo-CA-cert.crt
        clusterFile: /mongodb/config/server3.pem

security:
  clusterAuthMode: x509


Explanation:
mode: preferSSL
Connections between servers use TLS/SSL. For incoming connections, the server accepts both TLS/SSL and non-TLS/non-SSL. For details please read https://docs.mongodb.org/manual/reference/configuration-options/#net.ssl.mode
For the time being we are only configuring SSL connections between our servers; clients can still connect to the MongoDB server without TLS/SSL.

PEMKeyFile: /mongodb/config/server3.pem
The .pem file that we created using the private key and certificate of that particular server.
Note: We created the servers' private keys unencrypted, so we do not have to specify the net.ssl.PEMKeyPassword option. If the private key is encrypted and we do not specify the net.ssl.PEMKeyPassword option, mongod or mongos will prompt for a passphrase.

CAFile: /mongodb/config/mongo-CA-cert.crt
This option expects a .pem file containing the root certificate chain from the Certificate Authority. In our case we have only the single CA certificate, so the path of the .crt file is specified here.

clusterFile: /mongodb/config/server3.pem
This option specifies the .pem file that contains the x.509 certificate-key file for membership authentication. If we skip this option, the cluster uses the .pem file specified in the PEMKeyFile setting. In our case it is the same file, so we could have skipped this option.

clusterAuthMode: x509
The authentication mode used for cluster authentication. Please visit the link https://docs.mongodb.org/manual/reference/configuration-options/#security.clusterAuthMode for details.

Now we are ready to start the MongoDB servers of our replica set. After starting the mongod processes, if we check the logs we can see lines like the ones shown below:
2016-02-24T10:20:39.424+0530 I ACCESS   [conn8]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "emailAddress=pranabksharma@gmail.com,CN=server2,OU=Technologies,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN" }
2016-02-24T10:20:40.254+0530 I ACCESS   [conn9]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "emailAddress=pranabksharma@gmail.com,CN=server4,OU=Technologies,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN" }
It seems the x.509 authentication between our replica set members is working.
Now log in to the primary of the replica set and create our initial user administrator using the localhost exception. For details on creating users you may read http://pe-kay.blogspot.in/2016/02/update-existing-mongodb-replica-set-to.html and http://pe-kay.blogspot.in/2016/02/mongodb-enable-access-control.html; I am not going to cover that in this post.


Enabling client SSL x.509 certificate authentication

Each unique x.509 client certificate corresponds to a single MongoDB user, so we have to create a client x.509 certificate for every user that needs to connect to the database. We cannot use one client certificate to authenticate more than one user. First we will create the client certificates and then we will add the corresponding mongodb users.
While generating the CSR for a client, we have to remember the following points:
  • A single Certificate Authority (CA) must issue the certificates for both the client and the server.
  • A client x.509 certificate's subject, which contains the Distinguished Name (DN), must differ from the certificates that we generated for our mongodb servers server2, server3 and server4. The subjects must differ in at least one of the following attributes: Organization (O), Organizational Unit (OU) or Domain Component (DC). If we do not change any of these attributes, we will get an error ("Cannot create an x.509 user with a subjectname that would be recognized as an internal cluster member.") when we add a user for that client.
    In the below screenshot I tried to create a user for one of the clients. In this case I created the client certificate keeping all the attributes the same as the member x.509 certificate, except the Common Name, so I got the error:
    [screenshot]

So for my client certificate I am going to change the Organizational Unit attribute. For our mongodb server certificates (server2, server3 and server4) we generated CSRs with the Organizational Unit set to Technologies; for the client certificate I am going to use Technologies-Client. I am also setting the Common Name to the name of the user, root, which is going to connect to the database. All the remaining attributes in the client certificate CSR are kept the same as in the server certificates.
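For reference, the client key and CSR can also be generated non-interactively; a sketch, assuming the DN values that appear in the certificate subject shown later in this post:
openssl req -new -nodes -newkey rsa:2048 -keyout rootuser.key -out rootuser.csr \
    -subj "/C=IN/ST=Maharashtra/L=Mumbai/O=Pe-Kay/OU=Technologies-Client/CN=root/emailAddress=pranabksharma@gmail.com"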
After that, using the same CA as for our server certificates, we signed the client CSR and generated our client certificate.
Then we concatenated the user’s private key and the public certificate and created the .pem file for that user.
Our client certificate is created; now we have to add a user in MongoDB for this certificate. To authenticate with a client certificate, we have to add a MongoDB user using the subject value taken from the client certificate.
The subject of a certificate can be extracted using the below command:
openssl x509 -in mongokey/rootuser.pem -inform PEM -subject -nameopt RFC2253
Using the subject value emailAddress=pranabksharma@gmail.com,CN=root,OU=Technologies-Client,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN we will add one user in the $external database (Users created on the $external database should have credentials stored externally to MongoDB). So using the subject I created one user with root role in the admin database:
rep1:PRIMARY> db.getSiblingDB("$external").runCommand({
... createUser: "emailAddress=pranabksharma@gmail.com,CN=root,OU=Technologies-Client,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN",
... roles: [{role: "root", db: "admin"}]
... })

Our client certificate is created and the corresponding user is created, so we are ready to connect to mongodb using the SSL certificate. For that we first have to change the ssl mode option from preferSSL to requireSSL in all our mongod server config files. This makes all the servers accept only SSL connections.
net:
  port: 27017
  bindIp: 0.0.0.0
  ssl:
        mode: requireSSL
        PEMKeyFile: /mongodb/config/server4.pem
        CAFile: /mongodb/config/mongo-CA-cert.crt

If we change mode: preferSSL to mode: requireSSL and try to connect to mongod in the usual way, we get the following error:
Error: network error while attempting to run command 'isMaster' on host '<hostname>:<port>'
[screenshot]
Now that we have changed the ssl mode to requireSSL, clients have to connect over SSL and we will authenticate with x.509 certificates rather than a username and password.
Let's connect to mongod using our client certificate:
mongo --ssl --sslPEMKeyFile ./mongokey/rootuser.pem --sslCAFile ./ssl/CA/mongo-CA-cert.crt --host server2
After getting connected, if we try to run any query, we get the following error:
[screenshot]
As the user has the root role and we connected to the admin database, it should be able to get the list of collections. So what is the issue?
If we run the connection status query, we can see that the authentication information is blank.
rep1:PRIMARY> db.runCommand({'connectionStatus' : 1})
{
        "authInfo" : {
                "authenticatedUsers" : [ ],
                "authenticatedUserRoles" : [ ]
        },
        "ok" : 1
}

Well, when our mongo shell connects to mongod using the ssl certificate as shown above, only the connection part is completed. To authenticate, we have to use the db.auth() method against the $external database as shown below:

rep1:PRIMARY> db.getSiblingDB("$external").auth(
...   { mechanism: "MONGODB-X509",
... user: "emailAddress=pranabksharma@gmail.com,CN=root,OU=Technologies-Client,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN"
... }
... )

[screenshot]
Once authenticated, we can run our queries and do admin tasks.
Now say we have another client certificate with subject emailAddress=pranabksharma@gmail.com,CN=reader,OU=Technologies-Client,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN and we connected to the database using this certificate. After getting connected we have to authenticate, so let's try to authenticate using a different user (emailAddress=pranabksharma@gmail.com,CN=root,OU=Technologies-Client,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN), and we will get an error:
Error: Username "emailAddress=pranabksharma@gmail.com,CN=root,OU=Technologies-Client,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN" does not match the provided client certificate user "emailAddress=pranabksharma@gmail.com,CN=reader,OU=Technologies-Client,O=Pe-Kay,L=Mumbai,ST=Maharashtra,C=IN"
We have to authenticate as the same user whose certificate we used to connect to the MongoDB server; we can't connect with one user's certificate and then authenticate as a different user.

Monday 22 February 2016

MongoDB Recipes: Update existing MongoDB replica set to use access control

The Story: I have a three-node MongoDB replica set running in a private environment. MongoDB was not configured to use access control, so anybody could connect and do anything (actually it was protected by a firewall and only the application server could connect to this setup, so some level of protection was there). One day I was informed that our analytics team needs access to this mongodb server and should only be given read access to the pekay database. Also, the sales team's application needs read-write access to the pekay database. So I have to create users with proper roles, and for that I have to enable access control in my MongoDB replica set.
In this story, my 3 node mongodb replica set is running mongodb 3.2.1 on three Ubuntu 14.04 VMs.
So to enable access control we have to restart each node with proper config options.
Normally, with a standalone mongodb server, we have to set the --auth command line option or the security.authorization configuration file setting. You may also read this related post: http://pe-kay.blogspot.in/2016/02/mongodb-enable-access-control.html
Let's try that:
I have added the following setting to my mongodb config file on all three replica set members and restarted the mongod servers.
security:
  authorization: enabled

Now, after starting my mongod servers, when I connect to any of them I get:
[screenshot]
All 3 mongod servers are in RECOVERING state and my mongod log files are filling up with the below error messages:
2016-02-22T14:43:14.400+0530 I ACCESS   [conn1] Unauthorized not authorized on admin to execute command { replSetHeartbeat: "rep1", configVersion: 3, from: "server3:27017", fromId: 1, term: 1 }
2016-02-22T14:43:16.146+0530 I ACCESS   [conn2] Unauthorized not authorized on admin to execute command { replSetHeartbeat: "rep1", configVersion: 3, from: "server2:27017", fromId: 0, term: 1 }
2016-02-22T14:43:18.341+0530 I REPL     [ReplicationExecutor] Error in heartbeat request to server2:27017; Unauthorized not authorized on admin to execute command { replSetHeartbeat: "rep1", configVersion: 3, from: "server4:27017", fromId: 2, term: 1 }
2016-02-22T14:43:18.341+0530 I REPL     [ReplicationExecutor] Error in heartbeat request to server3:27017; Unauthorized not authorized on admin to execute command { replSetHeartbeat: "rep1", configVersion: 3, from: "server4:27017", fromId: 2, term: 1 }


So these are all authorization issues. What went wrong?
Well, to enable access control on replica sets and sharded clusters we also have to enable internal authentication. We can use a keyfile or x.509 for internal authentication. In this document I am going to use a keyfile, which is the simplest to set up.
First we have to create a keyfile, and the keyfile has to be present on all the nodes of my replica set. This keyfile contains a password, and all the nodes must share the same password. Be sure to put a strong, long password in the keyfile. As this is my test setup, I am choosing a simple password:
$ echo "mypassword" > /mongodb/config/key
You can also use the openssl or md5sum command to generate a random password, as shown below. Set the same password in all three servers' keyfiles.
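For example, a long random key can be generated with openssl (the byte count is arbitrary; any sufficiently long random string works):
$ openssl rand -base64 741 > /mongodb/config/key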
Change the permission of this keyfile to 600.
$ chmod 600 /mongodb/config/key
I am changing my mongod config file to set the security.keyFile option; alternatively, you can start mongod with the --keyFile command line option.
Note: Enabling the security.keyFile option also implies authorization: enabled, so we do not have to set authorization: enabled in our config file.
security:
       keyFile: /mongodb/config/key


So we start the mongod servers of our replica set with the keyFile option, and this time the replica set starts normally.
Now using the localhost exception, I will connect to the primary and create the initial user administrator.
[screenshot]
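A minimal sketch of that createUser call (assuming the uadmin/abc123 credentials used in the next step and the userAdminAnyDatabase role typically given to the first administrator):
db.getSiblingDB("admin").createUser({
  user: "uadmin",
  pwd: "abc123",
  roles: [{ role: "userAdminAnyDatabase", db: "admin" }]
})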
Authenticate with the newly created user and create the additional users.
rep1:PRIMARY> db.auth("uadmin", "abc123");
Create a user with the root role, so that we can use it to run database/cluster administration commands.
rep1:PRIMARY> db.createUser({
... user: "rootuser",
... pwd: "abc123",
... roles: [{ role: "root", db: "admin"}]
... })
Successfully added user: {
        "user" : "rootuser",
        "roles" : [
                {
                        "role" : "root",
                        "db" : "admin"
                }
        ]
}
rep1:PRIMARY>


Users for the analytics team and the sales team on the pekay database:
[screenshot]
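A sketch of what those two createUser calls might look like (the user names and passwords here are placeholders; the roles follow the requirements described at the top of this post):
// read-only user for the analytics team (placeholder name/password)
db.getSiblingDB("pekay").createUser({
  user: "analytics_user",
  pwd: "analytics_password",
  roles: [{ role: "read", db: "pekay" }]
})
// read-write user for the sales team's application (placeholder name/password)
db.getSiblingDB("pekay").createUser({
  user: "sales_app",
  pwd: "sales_password",
  roles: [{ role: "readWrite", db: "pekay" }]
})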
That's it, our replica set now has authentication enabled.

Thursday 11 February 2016

MongoDB Recipes: Enabling access control in MongoDB

Enabling access control in MongoDB requires Authentication and Authorization. Authentication is the process of verifying the identity of a client. Authorization determines what a user can do once he/she is connected to the database.
To enable access control, we have to start the MongoDB server with the --auth command line option or the security.authorization configuration file setting.
For this example I am using mongodb config file:

[screenshot]
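The relevant part of that config file is just the authorization setting discussed above; a minimal sketch (the rest of the file, such as the storage and net sections, stays as it was):
security:
  authorization: enabled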
Starting mongodb with access control:
[screenshot]
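A typical way to start it (the config file path here is an assumption):
mongod --config /mongodb/config/mongod.conf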
Once mongodb is started, we can connect to the MongoDB server from localhost without any username/password and create our first user; this is called the localhost exception.
Localhost Exception: With the localhost exception, after you enable access control, connect to the localhost interface and create the first user in the admin database. The first user must have privileges to create other users, such as a user with the userAdmin or userAdminAnyDatabase role.
Source: https://docs.mongodb.org/manual/core/security-users/#localhost-exception
Now we will connect to the admin database of mongodb using the mongo shell and create our first admin user.
root@ubuntutest:/mongodb# mongo admin
> db.createUser(
{
     user: "admino",
     pwd: "admin123",
     roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
}
)
After creating our first user, if we try to create another user, we will get an exception:
[screenshot]
We have to authenticate with the first user that we created and then we can create the next user.
> db.auth("admino","admin123")
Let's create a user who can only read data from any database on the MongoDB server. The readAnyDatabase role is a special role on the admin database that grants a user privileges to read any database.
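A sketch of that createUser call, run while connected to the admin database (the pranabs/abc123 credentials are the ones used later in this post):
db.createUser({
  user: "pranabs",
  pwd: "abc123",
  roles: [{ role: "readAnyDatabase", db: "admin" }]
})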
Our user is created; now let's connect with this user to the callcenter database and read some data:
[screenshot]
Oops... the login failed! What happened? We created the user with the readAnyDatabase role, so why is the user getting an exception while connecting to the callcenter database?
Well, when we created the user we gave the "db" name as "admin", so we have to authenticate the user against the admin db. We have to log in as:
mongo callcenter -u pranabs -p "abc123" --authenticationDatabase "admin"

For testing I am creating a user pekay who can read/write in the callcenter database:
> db.createUser(
... {
... user: "pekay",
... pwd: "pekay123",
... roles: [ { role: "readWrite", db: "callcenter" }]
... }
... )
Successfully added user: {
        "user" : "pekay",
        "roles" : [
                {
                        "role" : "readWrite",
                        "db" : "callcenter"
                }
        ]
}
The user is created; now I will log in with this user and update one record:
mongo callcenter -u pekay -p "pekay123"
> db.account.findOne()
{
        "_id" : ObjectId("56bc4ccfc0952e06f50b99ef"),
        "username" : "user1",
        "password" : "abc123"
}
> db.account.update({"username" : "user1"},{"username" : "user1","password" : "user123"})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.account.findOne()
{
        "_id" : ObjectId("56bc4ccfc0952e06f50b99ef"),
        "username" : "user1",
        "password" : "user123"
}

Now say I want some data from another database, support, but the pekay user has access only to the callcenter database. To get data from the support database I am using the pranabs account, which has the readAnyDatabase role:
[screenshot]
Again I need to update a document in the callcenter database, so I switched back to the callcenter database:
Line1: > db
Line2: callcenter
Line3: > use admin
Line4: switched to db admin
Line5: > db.auth("pranabs", "abc123")
Line6: 1
Line7: > use support
Line8: switched to db support
Line9: > db.issues.findOne()
Line10: { "_id" : ObjectId("56bc5476d509294313459003"), "issueID" : 1 }
Line11: > use callcenter
Line12: switched to db callcenter
Line13: >  db.account.update({"username" : "user1"},{"username" : "user1","password" : "newpassword"})
Line14: WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
Line15: > db.account.findOne()
Line16: {
        "_id" : ObjectId("56bc4ccfc0952e06f50b99ef"),
        "username" : "user1",
        "password" : "newpassword"
}
A few steps back I authenticated as the pranabs user (in Line5) and then switched back to the callcenter database (in Line11), but I could still update a document in the callcenter database (in Line13). This should not happen: the pranabs user has only the readAnyDatabase role, so it should not be able to update a document. Why did this happen?
Let's investigate.
First, get the connection status:
> db.runCommand({"connectionStatus": 1})
{
        "authInfo" : {
                "authenticatedUsers" : [
                        {
                                "user" : "pekay",
                                "db" : "callcenter"
                        },
                        {
                                "user" : "pranabs",
                                "db" : "admin"
                        }
                ],
                "authenticatedUserRoles" : [
                        {
                                "role" : "readWrite",
                                "db" : "callcenter"
                        },
                        {
                                "role" : "readAnyDatabase",
                                "db" : "admin"
                        }
                ]
        },
        "ok" : 1
}
Here we can see that our connection is authenticated for two users, pekay and pranabs, so over this connection we have the combined permissions of both users and can perform any operation permitted to either of them. Each time we run the db.auth method on an existing connection, the user and its roles are added to the list of authenticated users and roles for that connection.
Continuing the above example, I will now authenticate using another user named support. After running the db.auth method, we can see the support user has been added to the authenticatedUsers list and the roles of the support user have also been added to the authenticatedUserRoles list.
[screenshot]
So this was just a small introduction to MongoDB's access control; for details you can always visit the official MongoDB documentation.

Tuesday 9 February 2016

Dead datanode detection

The Namenode determines whether a datanode is dead or alive by using heartbeats. Each DataNode sends a heartbeat message to the NameNode every 3 seconds (the default value). This heartbeat interval is controlled by the dfs.heartbeat.interval property in the hdfs-site.xml file.
If a datanode dies, the namenode waits a little over 10 minutes before removing it from the list of live nodes. Until the datanode is marked dead, if we try to copy data into HDFS we may get an error when the dead datanode is selected for storing a block of the data. In the below screenshot, our datanode running on 192.168.10.75 has actually died, but the namenode has not marked it dead yet. So while copying a file to HDFS, the block write operation to the datanode 192.168.10.75 fails and we get the below error:
[screenshot]
The time period for determining whether a datanode is dead is calculated as:
2 * dfs.namenode.heartbeat.recheck-interval + 10 * 1000 * dfs.heartbeat.interval
The default values for dfs.namenode.heartbeat.recheck-interval is 300000 milliseconds (5 minutes) and dfs.heartbeat.interval is 3 seconds.
So if we use the default values in the above formula:
2 * 300000 + 10 * 1000 * 3 = 630000 milliseconds
Which is 10 minutes and 30 seconds. So after 10 minutes and 30 seconds, the namenode marks a datanode as dead.

In some cases this 10-minute-30-second interval is too long, and we can reduce it, mainly by adjusting the dfs.namenode.heartbeat.recheck-interval property. For example, suppose we want the timeout to be around 4 to 5 minutes before a datanode is marked dead. We can set the dfs.namenode.heartbeat.recheck-interval property in the hdfs-site.xml file:
<property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>120000</value>
</property>


Now let's calculate the timeout again:
2 * 120000 + 10 * 1000 * 3 = 270000 milliseconds
This is 4 minutes and 30 seconds. So now the namenode will mark a datanode as dead after 4 minutes and 30 seconds.
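While tuning this, an easy way to watch a datanode move from the live list to the dead list is the dfsadmin report (not part of the original steps):
hdfs dfsadmin -report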

Friday 5 February 2016

Configuring Federated HDFS Cluster with High Availability (HA) and Automatic Failover

Requirement: We have to configure federated namenodes for our new Hadoop cluster. We need one federated namenode for each of our two departments, sales and analytics. In my last post http://pe-kay.blogspot.in/2016/02/change-single-namenode-setup-to.html I wrote about a federated namenode setup; you can refer to that as well. The federated HDFS cluster that we are creating is high priority and users will be completely dependent on it, so we cannot afford downtime. Therefore we have to enable HA for our federated namenodes. There will be a total of 4 namenodes: two namenodes (1 active and 1 standby) for the sales namespace and two for analytics.

To demonstrate this requirement, I am going to use 4 VirtualBox VMs running Ubuntu 14.04. The VMs are named server1, server2, server3 and server4.
There are two ways to configure HDFS High Availability: using the Quorum Journal Manager or using shared storage. With shared storage in an HA design, there is always a risk of the shared storage itself failing. As I am using a more recent version of hadoop (2.6), I will use the Quorum Journal Manager, which provides an additional level of HA by using a group of JournalNodes. When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. So if we configure multiple journalnodes, we can tolerate journalnode failures as well. For details you can read the official hadoop documents.
For configuring HA with automatic failover we also need Zookeeper. For details of Zookeeper, you can visit the link https://zookeeper.apache.org/

Placement of different Services

Namenodes:
1) server1 : Will run active/standby namenode for namespace sales
2) server2: Will run active/standby namenode for namespace analytics
3) server3: Will run active/standby namenode for namespace sales
4) server4: Will run active/standby namenode for namespace analytics

Datanodes:
For this demo, I will run datanodes on all the four VMs.

Journal Nodes:
Journal nodes will run on three VMs server1, server2 and server3

Zookeeper:
Zookeeper will run on three VMs: server2, server3 and server4.


Zookeeper Configuration

First I am going to configure a three-node Zookeeper cluster. I am not going to write about Zookeeper and its configuration in detail; I will only cover the configs that are necessary for this document. I have downloaded and deployed Zookeeper on the server2, server3 and server4 servers. I am using the default Zookeeper configuration; for our zookeeper ensemble I have added the below lines to the Zookeeper configuration file:

server.1=server2:2222:2223
server.2=server3:2222:2223
server.3=server4:2222:2223
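For reference, a minimal zoo.cfg matching this setup might look like the following (the tickTime/initLimit/syncLimit values are the ZooKeeper sample defaults and are an assumption; dataDir and clientPort match the data directory and quorum addresses used elsewhere in this post):
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/hdfs_data/zookeeper
clientPort=2181
server.1=server2:2222:2223
server.2=server3:2222:2223
server.3=server4:2222:2223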


Create directory for storing Zookeeper data and log files:
hadoop@server2:~$ mkdir /home/hadoop/hdfs_data/zookeeper
hadoop@server3:~$ mkdir /home/hadoop/hdfs_data/zookeeper
hadoop@server4:~$ mkdir /home/hadoop/hdfs_data/zookeeper
Create Zookeeper ID files:
hadoop@server2:~$ echo 1 > /home/hadoop/hdfs_data/zookeeper/myid
hadoop@server3:~$ echo 2 > /home/hadoop/hdfs_data/zookeeper/myid
hadoop@server4:~$ echo 3 > /home/hadoop/hdfs_data/zookeeper/myid

Start Zookeeper cluster
hadoop@server2:~$ zkServer.sh start
hadoop@server3:~$ zkServer.sh start
hadoop@server4:~$ zkServer.sh start
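To confirm the ensemble formed correctly, zkServer.sh can report each node's role (one node should report leader, the others follower):
hadoop@server2:~$ zkServer.sh status
hadoop@server3:~$ zkServer.sh status
hadoop@server4:~$ zkServer.sh status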


HDFS Configuration

core-site.xml:
We are going to use ViewFS and define which paths map to which namenode. For details about ViewFS you may visit this https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/ViewFs.html link. Now the clients will load the ViewFS plugin and look for mount table information in the configuration file.
<property>
    <name>fs.defaultFS</name>
    <value>viewfs:///</value>
</property>

Here we are mapping a folder to a namespace
Note: In last blog post on HDFS federation http://pe-kay.blogspot.in/2016/02/change-single-namenode-setup-to.html, I mapped the path to a namenode’s URL, but as we are now into a HA configuration so I have to set the mapping to the respective nameservice.
<property>
    <name>fs.viewfs.mounttable.default.link./sales</name>
    <value>hdfs://sales</value>
</property>
<property>
    <name>fs.viewfs.mounttable.default.link./analytics</name>
    <value>hdfs://analytics</value>
</property>


We have to specify a directory on the JournalNode machines where the edits and other local state used by the JournalNodes will be stored. Create this directory on all the nodes where the journalnodes will run (see the commands after the property below).

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/home/hadoop/hdfs_data/journalnode</value>
</property>
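As mentioned, this directory should be created on the three journalnode hosts, for example:
hadoop@server1:~$ mkdir -p /home/hadoop/hdfs_data/journalnode
hadoop@server2:~$ mkdir -p /home/hadoop/hdfs_data/journalnode
hadoop@server3:~$ mkdir -p /home/hadoop/hdfs_data/journalnode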



hdfs-site.xml
Note: I have included only those configurations which are necessary to configure federated HA cluster.

<!-- Below properties are added for NameNode Federation and HA -->
<!-- Nameservices for our two federated namespaces sales and analytics -->
<property>
    <name>dfs.nameservices</name>
    <value>sales,analytics</value>
</property>

For each nameservice we will define 2 namenodes; one will be the active namenode and the other will be the standby.
<!-- Unique identifiers for each NameNodes in the sales nameservice -->
<property>
  <name>dfs.ha.namenodes.sales</name>
  <value>sales-nn1,sales-nn2</value>
</property>

<!-- Unique identifiers for each NameNodes in the analytics nameservice -->
<property>
  <name>dfs.ha.namenodes.analytics</name>
  <value>analytics-nn1,analytics-nn2</value>
</property>


<!-- RPC address for each NameNode of sales namespace to listen on -->
<property>
    <name>dfs.namenode.rpc-address.sales.sales-nn1</name>
        <value>server1:8020</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.sales.sales-nn2</name>
        <value>server3:8020</value>
</property>

<!-- RPC address for each NameNode of analytics namespace to listen on -->
<property>
    <name>dfs.namenode.rpc-address.analytics.analytics-nn1</name>
        <value>server2:8020</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.analytics.analytics-nn2</name>
        <value>server4:8020</value>
</property>



<!-- HTTP address for each NameNode of sales namespace to listen on -->
<property>
    <name>dfs.namenode.http-address.sales.sales-nn1</name>
        <value>server1:50070</value>
</property>
<property>
    <name>dfs.namenode.http-address.sales.sales-nn2</name>
        <value>server3:50070</value>
</property>

<!-- HTTP address for each NameNode of analytics namespace to listen on -->
<property>
    <name>dfs.namenode.http-address.analytics.analytics-nn1</name>
        <value>server2:50070</value>
</property>
<property>
    <name>dfs.namenode.http-address.analytics.analytics-nn2</name>
        <value>server4:50070</value>
</property>


A single set of JournalNodes can provide storage for multiple federated namesystems. So I will configure the same set of JournalNodes running on server1, server2 and server3 for both the nameservices sales and analytics.
<!-- Addresses of the JournalNodes which provide the shared edits storage, written to by the Active NameNode and read by the Standby NameNode -->
Shared edit storage in JournalNodes for sales namespace.
<property>
  <name>dfs.namenode.shared.edits.dir.sales</name>
  <value>qjournal://server1:8485;server2:8485;server3:8485/sales</value>
</property>

Shared edit storage in JournalNodes for analytics namespace.
<property>
  <name>dfs.namenode.shared.edits.dir.analytics</name>
  <value>qjournal://server1:8485;server2:8485;server3:8485/analytics</value>
</property>


<!-- Configuring automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<property>
   <name>ha.zookeeper.quorum</name>
   <value>server2:2181,server3:2181,server4:2181</value>
</property>

<!-- Fencing method that will be used to fence the Active NameNode during a failover -->
<!-- sshfence: SSH to the Active NameNode and kill the process -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
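Note that sshfence relies on the hadoop user being able to SSH from each NameNode host to the other NameNode of the same nameservice without a password, using the private key configured above. If that is not already in place, a sketch of the setup (the host pairs follow the service placement listed earlier):
hadoop@server1:~$ ssh-keygen -t rsa              # if no key exists yet; leave the passphrase empty
hadoop@server1:~$ ssh-copy-id hadoop@server3     # sales: server1 <-> server3
hadoop@server3:~$ ssh-copy-id hadoop@server1
hadoop@server2:~$ ssh-copy-id hadoop@server4     # analytics: server2 <-> server4
hadoop@server4:~$ ssh-copy-id hadoop@server2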


<!-- Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active -->
<property>
  <name>dfs.client.failover.proxy.provider.sales</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>


<property>
  <name>dfs.client.failover.proxy.provider.analytics</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>


For more about the configuration properties, you can visit https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

ZooKeeper Initialization

To create the required znode for the sales nameservice, run the below command from one of the Namenodes of the sales nameservice:
hadoop@server1:~/.ssh$ hdfs zkfc -formatZK

To create the required znode for the analytics nameservice, run the below command from one of the Namenodes of the analytics nameservice:
hadoop@server2:~/.ssh$ hdfs zkfc -formatZK

Zookeeper znode tree before running above commands:
[screenshot]
Zookeeper znode tree after running above commands:
[screenshot]
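The same check can also be done from the command line with the ZooKeeper client; after both formatZK runs, the parent znode (by default /hadoop-ha) should contain one child znode per nameservice, i.e. sales and analytics (a sketch, assuming the default parent znode name):
hadoop@server2:~$ zkCli.sh -server server2:2181
ls /hadoop-ha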

Start JournalNodes

hadoop@server1:~$ hadoop-daemon.sh start journalnode
hadoop@server2:~$ hadoop-daemon.sh start journalnode
hadoop@server3:~$ hadoop-daemon.sh start journalnode

Format Namenodes

While formatting the namenodes, I will use the -clusterID option to provide a name for the hadoop cluster we are creating. This lets us use the same clusterID for all the namenodes of the cluster.
Formatting the namenodes of the sales nameservice: We have to run the format command (hdfs namenode -format) on one of the NameNodes of the sales nameservice. Our sales nameservice will be on server1 and server3; I am running the format command on server1.
hadoop@server1:~$ hdfs namenode -format -clusterID myCluster
One of the namenodes (on server1) of the sales nameservice has been formatted, so we should now copy the contents of the NameNode metadata directories over to the other, unformatted NameNode on server3.
Start the namenode on server1 and run the hdfs namenode -bootstrapStandby command on server3.
hadoop@server1:~$ hadoop-daemon.sh start namenode
hadoop@server3:~$ hdfs namenode -bootstrapStandby

Start the namenode in server3:
hadoop@server3:~$ hadoop-daemon.sh start namenode

Formatting the namenodes of the analytics nameservice: We have to run the format command (hdfs namenode -format) on one of the NameNodes of the analytics nameservice. Our analytics nameservice will be on server2 and server4; I am running the format command on server2.
hadoop@server2:~$ hdfs namenode -format -clusterID myCluster
One of the namenodes (on server2) of the analytics nameservice has been formatted, so we should now copy the contents of the NameNode metadata directories over to the other, unformatted NameNode on server4.
Start the namenode on server2 and run the hdfs namenode -bootstrapStandby command on server4.
hadoop@server2:~$ hadoop-daemon.sh start namenode
hadoop@server4:~$ hdfs namenode -bootstrapStandby
Start the namenode in server4:
hadoop@server4:~$ hadoop-daemon.sh start namenode

Start remaining services


Start the ZKFailoverController process (zkfc, a ZooKeeper client that also monitors and manages the state of the NameNode) on all the VMs where the namenodes are running.

hadoop@server1:~$ hadoop-daemon.sh start zkfc
hadoop@server2:~$ hadoop-daemon.sh start zkfc
hadoop@server3:~$ hadoop-daemon.sh start zkfc
hadoop@server4:~$ hadoop-daemon.sh start zkfc


Start DataNodes
hadoop@server1:~$ hadoop-daemon.sh start datanode
hadoop@server2:~$ hadoop-daemon.sh start datanode
hadoop@server3:~$ hadoop-daemon.sh start datanode
hadoop@server4:~$ hadoop-daemon.sh start datanode



Checking our cluster

sales namespace:
Let's check the namenodes of the sales namespace.
Open the URLs http://server1:50070 and http://server3:50070 in a web browser.
We can see that the namenode on server1 is active right now.
[screenshot]
The namenode on server3 is standby.
[screenshot]
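The same active/standby state can be queried from the command line with hdfs haadmin (a sketch; with federation the -ns option selects the nameservice, and the namenode IDs are the ones defined in hdfs-site.xml):
hadoop@server1:~$ hdfs haadmin -ns sales -getServiceState sales-nn1
hadoop@server1:~$ hdfs haadmin -ns sales -getServiceState sales-nn2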


analytics namespace:
Open the URLs http://server2:50070 and http://server4:50070 in a web browser.
server2 is active now:
[screenshot]

server4 is standby:
[screenshot]

Now I will copy two files into our two namespace folders, /sales and /analytics, and will check whether they end up on the correct namenodes:
[screenshot]
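A sketch of those copy commands (the local file names are hypothetical; the viewfs mount table routes /sales and /analytics to the right nameservice):
hadoop@server1:~$ hdfs dfs -put sales_data.txt /sales/
hadoop@server1:~$ hdfs dfs -put analytics_report.txt /analytics/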
Checking on server1 (the active namenode for the /sales nameservice), we can see that the namenode running on server1 has only the files related to the sales namespace.
[screenshot]

Similarly, checking on server2 (the active namenode for the /analytics nameservice), we can see that the namenode running on server2 has only the files related to the analytics namespace.
[screenshot]

If we try to read from a standby namenode, we get an error. In the below screenshot I tried to read from server3 (the standby namenode for the /sales nameservice) and got an error.

[screenshot]

Let's put a few more files and check whether automatic failover is working:
[screenshot]
I am killing the active namenode of the /sales namespace running on server1.
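A sketch of how the active NameNode process can be killed for this test (jps shows the NameNode process ID; the pid is obviously a placeholder):
hadoop@server1:~$ jps                 # note the pid of the NameNode process
hadoop@server1:~$ kill -9 <NameNode pid>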
If we check the cluster health, we can see that server1 is down.
[screenshot]
Now if we check the standby namenode of sales namespace running on server3, we can see that it has become active now:
[screenshot]
Let's check the files: they are all still available. HA is working, and so is automatic failover.
[screenshot]
I am starting the namenode in server1 again.

Now if we check the status of the namenode on server1, we can see that it has become the standby, as expected.
[screenshot]


One Final Note:
Once initial configurations are done, you can start the cluster in the following order:
First start the Zookeeper services:
hadoop@server2:~$ zkServer.sh start
hadoop@server3:~$ zkServer.sh start
hadoop@server4:~$ zkServer.sh start
After that, start the journalnodes:
hadoop@server1:~$ hadoop-daemon.sh start journalnode
hadoop@server2:~$ hadoop-daemon.sh start journalnode
hadoop@server3:~$ hadoop-daemon.sh start journalnode
Finally, start all the namenodes, datanodes and ZKFailoverController processes using the start-dfs.sh script.