By Manish Kumar – Big Data Architect at DataMetica
Kerberos has become the de facto standard for strong authentication in Hadoop clusters, large and small. Hadoop implementers have chosen Kerberos over SSL because it authenticates users without sending passwords over the network and relies on time-limited tickets generated with symmetric-key cryptography. In practice, however, when you are dealing with hundreds of nodes, simply enabling Kerberos is not enough. You need to avoid some common pitfalls that leave authentication in Hadoop more vulnerable than it should be.
This blog post describes those pitfalls. It is intended to help you analyse Kerberos authentication against your own enterprise systems and processes, rather than stopping at just enabling it.
Pitfall 1: Relying completely on Automated Kerberos Wizards for enabling Kerberos
The automated Kerberos wizards provided by vendors such as Hortonworks and Cloudera do a good job of enabling Kerberos authentication across your cluster nodes, but they are not enough on their own. You have to take a few additional steps. One important step is enabling Kerberos for components that are not directly managed by Ambari or Cloudera Manager. Prominent examples include Hue, Solr on HDFS and the HttpFS service. First, identify which services used in production are not managed by your vendor distribution. Then factor them into the security upgrade plan for your cluster, because enabling Kerberos manually for these components can be error-prone and time-consuming.
Pitfall 2: Managing Kerberos Principals for users without security policies
In any Hadoop cluster, several users, such as developers, access Hadoop services. These users need to be set up in Kerberos, and it is imperative that they are created with security policies in place. These policies include password expiry after a set number of days or months, strong password rules and a ticket lifetime of no more than a few hours. End-user accounts are more prone to attack, so Kerberos ticket lifetimes for such users should not exceed a few hours. Most users do not destroy their tickets once their work is complete, so you should make sure the tickets expire on their own after a short time.
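On an MIT Kerberos KDC, password rules of this kind can be expressed with kadmin's add_policy command. The sketch below only builds the query string and does not contact a KDC. The policy name hadoop_users and the specific limits are illustrative assumptions, and the accepted time format should be checked against your KDC version.

```python
def build_add_policy_query(policy_name, max_life_days, min_length, min_classes):
    """Return a kadmin query string that creates a password policy.

    -maxlife is the maximum password lifetime, -minlength the minimum
    password length and -minclasses the minimum number of character classes.
    """
    if max_life_days <= 0 or min_length <= 0 or min_classes <= 0:
        raise ValueError("policy limits must be positive")
    return (
        f'add_policy -maxlife "{max_life_days} days" '
        f"-minlength {min_length} -minclasses {min_classes} {policy_name}"
    )


# Hypothetical policy: passwords expire after 90 days and need at least
# 12 characters drawn from at least 3 character classes.
query = build_add_policy_query("hadoop_users", 90, 12, 3)
print(query)
```

A string like this could then be passed to kadmin with its -q option by an administrator, once the values match your organisation's policy.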
Pitfall 3: Not checking whether deployed code is Kerberos-compliant
Every Hadoop cluster has code deployed to run specific business logic. Making Hadoop itself Kerberos-compliant does not serve the purpose if that deployed code can no longer run. Your code needs changes so that it obtains a Kerberos ticket and authenticates itself with Kerberos before using Hadoop components. Your security upgrade should include plans for developing and deploying these changes. Moreover, each ecosystem component uses Kerberos in its own way. For example, the Flume HDFS sink is configured for Kerberos authentication differently from the Flume Kafka source or sink. In our opinion, this is a very important and complex piece of the upgrade, so it has to be properly planned and tested.
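As a minimal sketch of what "obtaining a ticket before using Hadoop" can look like for a script, the example below wraps the standard kinit and klist command-line tools. The principal name and keytab path are hypothetical, and real applications may instead use the login mechanism of their own framework.

```python
import subprocess


def build_kinit_command(principal, keytab_path):
    """Build the command that obtains a ticket from a keytab (no password prompt)."""
    return ["kinit", "-k", "-t", keytab_path, principal]


def has_valid_ticket():
    """Return True if the credential cache holds an unexpired ticket.

    `klist -s` prints nothing and reports the result through its exit status.
    A missing klist binary is treated as "no ticket".
    """
    try:
        return subprocess.run(["klist", "-s"], check=False).returncode == 0
    except FileNotFoundError:
        return False


def ensure_ticket(principal, keytab_path):
    """Authenticate from the keytab only when no valid ticket is present."""
    if not has_valid_ticket():
        subprocess.run(build_kinit_command(principal, keytab_path), check=True)


# Hypothetical application principal and keytab location.
cmd = build_kinit_command("etl_app@EXAMPLE.COM", "/etc/security/keytabs/etl_app.keytab")
print(" ".join(cmd))
```

A job would call ensure_ticket(...) once at startup, before its first HDFS or Hive call.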
Pitfall 4: Generating long duration Kerberos tickets without frequent renewal for your Application accounts
Any code that runs on a Hadoop cluster uses application accounts created at the Unix and Kerberos level. Most deployments first authenticate the application account with Kerberos and then use the generated ticket to access Hadoop services. By default, this ticket is valid for 24 hours. In reality, most application code does not run for 24 hours, so the ticket remains valid long after it is needed. This is a problem: if attackers get hold of the ticket, by analysing network traffic or by breaking into the application account, they can exploit it for the rest of those 24 hours. It is therefore very important to set the Kerberos ticket lifetime to the minimum required duration and to renew the ticket if more time is needed.
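The idea can be illustrated with a small sketch: request a short ticket lifetime with a longer renewable window, and compare how long an unused ticket stays valid. The kinit options -l (lifetime), -r (renewable lifetime) and -R (renew) are standard; the principal, keytab path and the 3-hour and 12-hour values are assumptions for illustration, and the KDC may cap whatever the client requests.

```python
def kinit_with_lifetime(principal, keytab_path, lifetime, renewable_life):
    """Build a kinit command for a short-lived but renewable ticket."""
    return ["kinit", "-l", lifetime, "-r", renewable_life,
            "-k", "-t", keytab_path, principal]


def kinit_renew():
    """Build the command that renews the current, still valid, renewable ticket."""
    return ["kinit", "-R"]


def idle_exposure_hours(ticket_lifetime_hours, job_runtime_hours):
    """Hours a ticket stays valid after the job that needed it has finished."""
    return max(0, ticket_lifetime_hours - job_runtime_hours)


# A 2-hour job with the default 24-hour ticket versus a 3-hour ticket.
default_exposure = idle_exposure_hours(24, 2)
short_exposure = idle_exposure_hours(3, 2)
print(default_exposure, short_exposure)

# Hypothetical principal and keytab path.
print(" ".join(kinit_with_lifetime(
    "etl_app@EXAMPLE.COM", "/etc/security/keytabs/etl_app.keytab", "3h", "12h")))
```

If the job occasionally runs longer than expected, it can issue the renew command before the ticket expires instead of starting with a long lifetime.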
Pitfall 5: Having only one KDC server for your production environment
High availability is key to any production Hadoop cluster, and you want to avoid single points of failure. Having just one KDC introduces one into your production environment: if the KDC is down, none of the Hadoop services will work. It is therefore imperative to set up Kerberos in a master-slave configuration that can withstand the failure of at least one KDC server.
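On the client side, a master-slave setup shows up as more than one kdc entry for the realm in krb5.conf. The sketch below renders such a stanza; the realm and host names are hypothetical, and replication between the KDCs themselves has to be configured separately.

```python
def render_realm_stanza(realm, kdc_hosts, admin_server):
    """Render a krb5.conf [realms] stanza listing every KDC for the realm."""
    if len(kdc_hosts) < 2:
        raise ValueError("list at least two KDCs to survive a single failure")
    lines = ["[realms]", f"  {realm} = {{"]
    lines += [f"    kdc = {host}" for host in kdc_hosts]
    lines.append(f"    admin_server = {admin_server}")
    lines.append("  }")
    return "\n".join(lines)


# Hypothetical realm with one master and one slave KDC.
stanza = render_realm_stanza(
    "EXAMPLE.COM",
    ["kdc1.example.com", "kdc2.example.com"],
    "kdc1.example.com",
)
print(stanza)
```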
Pitfall 6: Using default “max_retries” and “kdc_timeout” Kerberos configuration
By default, max_retries is 3 and kdc_timeout is 30 seconds. This means that if the KDC is down, Hadoop services try to connect to it 3 times, waiting 30 seconds each time. Your services would therefore be unavailable for roughly 90 seconds or more before switching to the slave KDC, which also causes jobs to break. This is not desirable in a production environment. Ideally, set “max_retries” to 1 or 0 and “kdc_timeout” to the lowest value that is practical for your network.
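The arithmetic behind the 90-second figure is simple enough to sketch. The function below follows the reasoning in the text, where each attempt waits for the full timeout; it assumes at least one attempt is always made, and the 5-second value is only an example.

```python
def worst_case_wait_seconds(max_retries, kdc_timeout_seconds):
    """Time spent waiting on an unreachable KDC before the next one is tried.

    Assumption: a value of 0 still results in a single attempt.
    """
    attempts = max(1, max_retries)
    return attempts * kdc_timeout_seconds


default_wait = worst_case_wait_seconds(3, 30)  # defaults quoted above
tuned_wait = worst_case_wait_seconds(1, 5)     # example of a tuned setting
print(default_wait, tuned_wait)
```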
Pitfall 7: Backing up Kerberos Database files
Kerberos database files are encrypted Unix files containing user names and passwords. If you make them part of your standard backup process, you distribute those user names and passwords to multiple systems, and the backup systems end up holding multiple copies of them. In a nutshell, by backing them up this way you create more access points for attackers.
Pitfall 8: No strategy for secure clusters’ data migration
Data migration between clusters is a very common scenario. You need strategies for moving data from a secure cluster to a non-secure cluster, and from one secure cluster to another. This may involve adjusting Kerberos rules and can also affect performance.
Pitfall 9: No strategy for identifying potential authentication breaches using Kerberos logs
You need a way to integrate Kerberos logs with your centralised security and monitoring tools. Kerberos logs contain useful information about failed login attempts and ticket reuse.
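Before the logs reach a central tool, a first pass can be as simple as counting failure messages. The sketch below is only an illustration: the marker strings and the sample lines are assumptions and are deliberately simplified, so adjust both to the messages your KDC actually writes.

```python
from collections import Counter

# Markers treated as authentication failures (an assumption, not a full list).
FAILURE_MARKERS = ("PREAUTH_FAILED", "CLIENT_NOT_FOUND")


def count_failures(log_lines, markers=FAILURE_MARKERS):
    """Count log lines per failure marker (each line is counted at most once)."""
    counts = Counter()
    for line in log_lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
                break
    return counts


def markers_over_threshold(counts, threshold):
    """Return the markers seen at least `threshold` times, sorted by name."""
    return sorted(m for m, n in counts.items() if n >= threshold)


# Made-up, simplified lines for illustration only; real KDC logs carry more fields.
sample = [
    "AS_REQ 10.0.0.5: PREAUTH_FAILED: alice@EXAMPLE.COM",
    "AS_REQ 10.0.0.5: PREAUTH_FAILED: alice@EXAMPLE.COM",
    "AS_REQ 10.0.0.9: CLIENT_NOT_FOUND: mallory@EXAMPLE.COM",
    "AS_REQ 10.0.0.7: ISSUE: bob@EXAMPLE.COM",
]
counts = count_failures(sample)
print(dict(counts), markers_over_threshold(counts, 2))
```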
Pitfall 10: Not provisioning all users of your cluster on all your cluster nodes
One of the requirements of Kerberos on a Hadoop cluster is that all users of the cluster must be provisioned on all servers in the cluster. Hadoop lets you submit and execute arbitrary code across a cluster of machines. Every individual task on the cluster runs with the username and UID of the submitting user, and Hadoop looks up the user’s details on the nodes where those tasks run. The assumption is that, as an administrator, you do not trust your users and need to restrict their access to any and all services running on those servers, including standard Linux services such as the local filesystem. Users can either exist in the local /etc/passwd file or, more commonly, be provisioned by having the servers access a network-based directory service such as OpenLDAP or Active Directory.
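A quick way to verify provisioning is to resolve each expected user on every node. The sketch below uses Python's pwd module, which goes through the system's name service and so should also find users served by a directory, provided the node is configured for it. The user names are hypothetical, and the lookup function is injectable so the logic can be tried without touching the real user database.

```python
import pwd


def missing_users(usernames, lookup=pwd.getpwnam):
    """Return the users that cannot be resolved on this node.

    `lookup` must raise KeyError for unknown users, as pwd.getpwnam does.
    """
    missing = []
    for name in usernames:
        try:
            lookup(name)
        except KeyError:
            missing.append(name)
    return missing


# Demonstration with a stand-in user database instead of the real one.
known = {"alice": 1001}
result = missing_users(["alice", "bob"], lookup=known.__getitem__)
print(result)
```

Run on each node with the default lookup, an empty result means every listed user resolves there.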
Pitfall 11: Enabling Remote SSH Access of end users on all cluster nodes
While all users of the cluster must be provisioned on all of its servers, it is not necessary to enable local or remote shell access for all of those users. A best practice is to provision users with a default shell of /sbin/nologin and to restrict SSH access using the AllowUsers, DenyUsers, AllowGroups and DenyGroups settings in the /etc/ssh/sshd_config file.
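To audit this, you can list the regular users whose shell still allows an interactive login. The sketch below parses passwd-format text; the set of non-login shells and the minimum UID of 1000 are assumptions that vary between distributions, and it covers only local accounts, not users served by a directory.

```python
NON_LOGIN_SHELLS = {"/sbin/nologin", "/usr/sbin/nologin", "/bin/false"}


def users_with_login_shell(passwd_text, min_uid=1000):
    """List regular users in passwd-format text whose shell allows login."""
    result = []
    for line in passwd_text.splitlines():
        fields = line.strip().split(":")
        if line.startswith("#") or len(fields) != 7:
            continue  # skip comments and malformed lines
        name, uid, shell = fields[0], int(fields[2]), fields[6]
        if uid >= min_uid and shell not in NON_LOGIN_SHELLS:
            result.append(name)
    return result


# Hypothetical entries in /etc/passwd format.
sample_passwd = """root:x:0:0:root:/root:/bin/bash
alice:x:1001:1001:Alice:/home/alice:/sbin/nologin
bob:x:1002:1002:Bob:/home/bob:/bin/bash
"""
login_users = users_with_login_shell(sample_passwd)
print(login_users)
```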
Pitfall 12: Not using Hadoop Group Mapping Cache wisely
Hadoop caches user-to-group mappings and calls the group mapping implementation only when entries in the cache expire. By default, the group cache is configured to expire every 300 seconds (5 minutes). This value should be small enough for updates to be reflected quickly, but not so small that it causes unnecessary calls to your LDAP server or other group provider.
Remember that looking up groups from the cache improves performance.
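The trade-off can be seen in a small time-to-live cache. This is an illustrative sketch of the mechanism, not Hadoop's implementation; in Hadoop the expiry is controlled by the hadoop.security.groups.cache.secs property. The lookup function and the fake clock below are stand-ins for a real group provider and real time.

```python
import time


class GroupCache:
    """Cache user-to-group lookups and refresh an entry only after it expires."""

    def __init__(self, lookup, ttl_seconds=300, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # user -> (expiry_time, groups)

    def groups_for(self, user):
        now = self._clock()
        entry = self._entries.get(user)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no call to the group provider
        groups = self._lookup(user)
        self._entries[user] = (now + self._ttl, groups)
        return groups


calls = []


def fake_ldap_lookup(user):
    """Stand-in for an expensive LDAP query; records each call."""
    calls.append(user)
    return ["analysts"]


now = [0.0]
cache = GroupCache(fake_ldap_lookup, ttl_seconds=300, clock=lambda: now[0])
cache.groups_for("alice")   # first request: goes to the provider
now[0] = 299.0
cache.groups_for("alice")   # within 300 seconds: served from the cache
now[0] = 300.0
cache.groups_for("alice")   # entry expired: goes to the provider again
print(len(calls))
```

A group change made in the directory just after the first lookup would stay invisible until the entry expires, which is why the expiry should not be set too high either.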
Pitfall 13: Not judiciously choosing the maximum renewable lifetime of Kerberos tickets for jobs running 24x7
Hadoop streaming jobs in a Kerberos-secured cluster run 24x7. Every Kerberos ticket has a lifetime associated with it. A streaming job obtains a Kerberos ticket at startup and renews it whenever it is about to expire, and this continues until the job is restarted. The problem with this setup is that a Kerberos ticket can only be renewed for a limited period. The default is 7 days. After that, renewal no longer works and the streaming jobs fail. When you start the jobs again, the cycle repeats. So, while planning security upgrades, you need a strategy for streaming jobs. One common technique is to increase the renewable period of tickets to a year or so, although this is not a very clean approach.
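The renewal cycle can be made concrete with a small decision function. This is a sketch under stated assumptions: a 24-hour ticket, the 7-day renewable period mentioned above and a 10-minute safety margin. How the job obtains a new ticket once renewal is exhausted is left to your deployment.

```python
from datetime import datetime, timedelta


def next_action(now, ticket_expires, renew_until, margin=timedelta(minutes=10)):
    """Decide what a long-running job should do about its ticket.

    "ok":      the ticket is still comfortably valid
    "renew":   the ticket is close to expiry but can still be renewed
    "relogin": the renewable period is exhausted, so a new ticket is needed
    """
    if now + margin < ticket_expires:
        return "ok"
    if now < ticket_expires and ticket_expires < renew_until:
        return "renew"
    return "relogin"


start = datetime(2024, 1, 1)
renew_until = start + timedelta(days=7)

# One hour into the first 24-hour ticket.
early = next_action(start + timedelta(hours=1), start + timedelta(hours=24), renew_until)
# Five minutes before the first ticket expires.
near_expiry = next_action(start + timedelta(hours=23, minutes=55),
                          start + timedelta(hours=24), renew_until)
# Five minutes before the renewable period ends; the ticket expiry is capped at it.
end_of_week = next_action(renew_until - timedelta(minutes=5), renew_until, renew_until)
print(early, near_expiry, end_of_week)
```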
By avoiding the pitfalls above, you can achieve a smoother Kerberos implementation while improving your cluster’s availability, performance, security and maintainability.