New Arrival

We welcomed our newest family member in late July – JAJ, already a smart and expressive little girl. She hasn’t seen much yet, but Mom and Dad are counting the days until she can enjoy the local greenery and other attractions. So far she has gotten a quick visit to a family farm stand – a sneak preview of a much bigger world!

Scrum Master Class

I have a new mission at Acquia – I am taking over as Scrum Master for my team. Bwahahahaha …

I’ve been doing Scrum ever since I started full time in tech at Axeda. Stand-ups and sprints are old hat by now, but thanks to Acquia I have, for the first time, received formal training in Scrum and a certification from Scrum Inc. in Boston.


Effectiveness
How does Scrum help? Why do it?

You use Scrum because you want to improve team velocity without increasing team resources.

I liked the fact that the Scrum Master class itself was organized as a Sprint using Scrum. This let us “live the example” and see first-hand the effectiveness of the techniques.

History
What is Scrum? What is its relationship to Lean and Agile?

Scrum is different from Lean and Agile but it derives from both. It came about as an evolution in workflow process based on techniques pioneered at Toyota. Scrum is an adaptation of manufacturing floor process to software engineering.
Lean

  • Eliminate waste
  • Understand Value Stream Analysis
  • Implement Single Piece Continuous Flow

Agile

  • Rapid prototyping and a fail-fast mentality
  • Let the customer determine, jointly with the producer, what the product will be

Requirements
What do you have to have and do in order to be doing Scrum authentically?

To do authentic Scrum, you must incorporate these particular three artifacts, five events, and three roles into the team’s process.

3 Artifacts

  • Product Backlog – vision, priorities
  • Sprint Backlog – known work, capacity
  • Product Increment – scrum board, burndown, velocity

5 Events

  • Backlog Refinement
  • Sprint Planning
  • Daily Scrum
  • Sprint Review
  • Retrospective

3 Roles

  • Product Owner
  • Scrum Master
  • Team

Values
What do these workflow philosophies consider worthwhile?

Scrum inherits values from Agile. The five values of Scrum are focus, courage, openness, commitment, and respect.
Agile Manifesto

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

Happiness is important because it is a precursor to great team performance and high velocity; team velocity will predictably drop after happiness does. Scrum practice works to improve happiness by favoring intrinsic motivators over extrinsic ones – purpose, mastery, and autonomy over money, power, and status. The workplace should ensure the extrinsic rewards are present to the extent that they no longer preoccupy thought, but the flow of the team should emphasize the intrinsic motivators.

Best Practices
What are the best practices included in Scrum that help the teams that adopt them?

There were numerous best practices discussed during the class. The first one that I’ll be introducing to my team is the observance of strict Scrum rules during our standup.

During our standup, we will kick off with a short inspirational music clip and physically stand up to report our answers to “What did you do yesterday?”, “What will you do today?” and “What are your impediments, if any?”

Since we are a remote team doing our standup over a call, we will determine the order of statuses by the order in which team members join the call. Since not everyone joins right at 10:30, the first five minutes will be used for team sync and/or a single ticket triage. At 10:35 the alarm goes off, and the team gets up for standup! Post-scrums are optional: anyone with a post-scrum says who they need, and everyone else drops off.

One further thing I learned that I will mention is about measuring velocity. While you can’t compare velocity across teams, you can compare acceleration. Subtle but important.
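A toy example of that arithmetic, with made-up story-point numbers:

```shell
# Made-up story-point velocities for two hypothetical teams, last three sprints.
# Raw velocity is not comparable across teams (each team estimates differently).
team_a=(10 12 15)
team_b=(40 42 44)

# Acceleration - the sprint-over-sprint change in velocity - is comparable.
accel_a=$(( team_a[2] - team_a[1] ))
accel_b=$(( team_b[2] - team_b[1] ))
echo "Team A accelerated by $accel_a points, Team B by $accel_b"
```

Team B’s raw velocity is higher, but Team A is improving faster – and that improvement is the number you can compare.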

Terminology
What Scrum terminology provides useful ideas?

Kaizen (evolution) and Kaikaku (revolution). Kaikaku is the philosophy of challenging entrenched dogma, refusing to accept waste no matter how it is disguised.

Forms of waste

  • Muda – work in progress but not finished
  • Mura – inconsistency
  • Muri – unreasonable demands

Types of Waste and their Mnemonic
DOWNTIME – Defects, Overproduction, Waiting, Not utilizing talent, Transportation, Inventory, Movement, Extra processing

“Don’t take my word for it” – any change we make should have a follow-up to ensure the data shows the change is helping team velocity.

JJ Sutherland, Alex Sheive, Sara Jarjoura

All in all, I had a fantastic time and learned a ton. I am stoked to bring this to the team and see how we do.

Certified Scrum Master

First On Call And It Was Fun

This week I went on call for the first time at Acquia, and it was actually a lot of fun. As a Cloud Engineer, I’m on the escalation path for issues with our Amazon services – for example, if something goes wrong with an Amazon instance, I file the ticket with Amazon customer support and handle its resolution.

What’s fantastic is that we have 24 hour support, but I am only on call during my work hours. Our support shifts and on call rotation “follow the sun.” I have colleagues on my team who work in Europe and Australia. In the morning (from my perspective) my European colleague passes the shift to me, and in the evening I pass the shift to my Australian colleague. Neat!

One of my buddies on the team is taking a vacation that happened to have an on call week right in the middle. I am enjoying it so much that I volunteered to take his shift. Looking forward to getting more exposure to the “hot seat” – so far so good!

Resiliency and Game Day Exercises at Acquia

In March of 2017 I came across the idea of “Game Day” in the DevOps Handbook by Gene Kim and others. Game Day is brilliantly advocated by Jesse Robbins in his presentation from 2011. It’s the idea that deliberately staging periodic system outages forces engineers to think about and design for resiliency in those systems. The extreme example is Chaos Monkey, which operates under only one constraint: the outages should happen during working hours. Other than that, the outages caused by Chaos Monkey can happen anywhere in the system (even production!) and at any time.

Game Day is a step removed from Chaos Monkey, conceived of as a planned activity for engineers to resolve systemic outages. The resiliency exercises held at Acquia were yet another step away from the extreme towards the approachable. Our exercise included two activities, one geared for non-support engineers and the other for support. The non-support engineers had to bring back up a down site, and the support engineers had to attack and compromise an insecure site. The idea was to challenge engineers to step outside their comfort zone, and attempt to resolve technical challenges beyond the requirements of their every-day work.

The Team
The personalities involved in Game Day were a strong influence on the event. There’s Amin Astaneh, an Ops manager with the temperament of the proverbial town crier, faithfully and urgently supporting us in our DevOps transformation. Then there’s Apollo Clark, expert in secure systems who contributed the idea of doing a security vulnerability exercise. Finally there’s James Goin, seasoned Ops warrior relentlessly invested in the improvement of systems administration, including resiliency and disaster recovery training.

Constraints
It just so happened that the idea for Game Day came two months in advance of Acquia’s annual engineering-wide event called Build Week, a truly awesome gathering of the entire team at Acquia HQ in Boston (read more on Dries’ blog!). Holding our Game Day at the same time would allow it to reach a broader audience across the company, so we requested a slot on the calendar. We ended up with 8-9pm on the Tuesday during Build Week. We had our opportunity!

Build Week imposed two constraints that had a significant and positive influence on our interpretation of Game Day. The whole event needed to fit in a single hour, and the event had to be accessible to engineers other than just the Ops subject matter experts. A Game Day exercise typically involves only the core engineering team which works directly with critical systems, and it takes however long they need to bring the systems back up. These constraints made the whole thing more approachable, and inspired the introduction of an Easy Mode and a Hard Mode.

Game Day as Exercise
The original idea was to have a troubleshooting session with an Acquia development installation of a Drupal site (managed enterprise-grade Drupal being the chief product of Acquia). The site would have some failure that either smaller teams or the whole group would have to resolve. Since we needed to accommodate varying levels and areas of expertise in the product, we settled on two modes, Easy Mode and Hard Mode, that participants would opt into based on their familiarity with troubleshooting techniques: Easy Mode for those who don’t handle troubleshooting support calls as part of their regular day job, Hard Mode for those who do.

The Identity Crisis
At this point, it hit home for me that the exercise was not going to be what I had originally intended – it wasn’t going to be a cookie-cutter Game Day. Although this seemed disappointing at the time, looking back it was a blessing in disguise, since it motivated us to create a new idea instead of copying someone else’s.

Apollo’s suggestion, which we ended up following, was to stage a Hard Mode Capture the Flag exercise instead of a site outage. Capture the Flag in a security context is an exercise where teams gain access to privileged resources in a system by leveraging security vulnerabilities. We could hide hashes – randomized strings of a fixed length – throughout the site. The winner of the competition would be the team that found all the hashes first.
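As a sketch of how such flags can be minted (not necessarily the exact tooling we used), hashing a few random bytes yields a fixed-length string:

```shell
# Generate one fixed-length "flag": 32 hex characters from hashing random bytes.
flag=$(head -c 32 /dev/urandom | md5sum | awk '{print $1}')
echo "$flag"
```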

The exercise would demonstrate that a site that works from a user perspective can still need work to become secure and performant. Easy Mode would still include some troubleshooting, which would then flow directly into the Capture the Flag exercise.

Trying It Out
We ran through the whole event a few weeks before Build Week: Easy Mode troubleshooting took up the first half hour, transitioning to Hard Mode Capture the Flag for the second half hour. It was still mostly a thought experiment at this stage, and shockingly for me, it worked really, really well.

During Easy Mode, non-Ops engineers drove the resolution with Ops experts only acting as consultants. Once the site was back up, we switched over to Capture the Flag. For this run through we only had one shared site for all the Hard Mode participants. One mischievous participant who found the site credentials deliberately locked out everyone else. This incident motivated much of the end-game setup for prevention of cross-site hacking.

Game Day!
Our Game Day-inspired exercise followed the flow established in our run through, with the addition of the isolated environments for Capture the Flag.

The Easy Mode troubleshooting took less time than we had allowed for, putting the start of Hard Mode right on time. The teams dove in, probing their environment – a Drupal site – for weaknesses. The narrative revolved around a fictional user submitting a forum question about how to enable the PHP module in Drupal, which would allow access to the bash shell on the server. The fictional admin replied that she had enabled the module for him and reset his login to a “temporary password.” These were the credentials the participants were expected to use to hack the site, and since the user had access to the PHP module, they could use it to gain shell access on the server – and from there, easy access to the privileged resources and opportunities to discover the hashes.

When time ran out at 8:55, three of our twelve teams (forty participants in all) had found all five of their hashes. The first team with all five hashes won the grand prize, an invitation for morning coffee with our resident tech celebrity, Drupal founder and Acquia CTO Dries Buytaert. As an aside, when I thanked Dries for agreeing to have coffee with our winners, he graciously replied, “No, thank you – now I get to have coffee!”

Epilogue
The decision to pivot from the established Game Day resulted in a new kind of learning in the spirit of Game Day. This learning was more accessible for our engineers and bridged the gap between where we are and where we are headed. While this isn’t the end of the story, I think it’s a fantastic start. Game Day, Day 2, here we come …

In All Sincerity, Wow Aristotle

I’m reading Julian Marias’s History of Philosophy, and Aristotle is blowing me away so much that I have to share here.

Have you spent your whole life hearing about the “soul” as if it were basically Casper the Friendly Ghost overlaid onto your body?
For Aristotle, the life of an entity consists of its nourishment, growth and self-consumption. Thus the soul is the form or realization of a living body. The soul “informs,” or gives form to, the matter of a living thing, giving it its corporal being and making it a live body; that is, it is not a question of the soul’s being superimposed on the body or added to it; rather, the body is a living body because it has a soul. According to Aristotle’s definition (De Anima, II, 1), the soul is the realization or first entelechy of a natural organic body. If the eye were a living creature, Aristotle says, its soul would be its sight. The eye is the matter of sight, and if sight is lacking there is no eye; and just as the eye, strictly speaking, is the physical eye united with the power of sight, so the soul and the body make up the living thing.

Marias, Julian (2012-10-02). History of Philosophy (pp. 78-79). Dover Publications. Kindle Edition.
That’s awesome. My reading of this is that Aristotle’s sense of soul was closer to our metaphorical use of the word – as in the phrase, “the soul of poetry.” For him the soul is that thing which is capable of arriving at his definition of perfection, a state of being so utterly oneself that it could be considered an archetype.

Here’s this guy surrounded by people who sacrifice animals to ensure a good harvest, and he goes and creates a system of reason so non-superstitious that it forms the basis of philosophy and logic to this day. It takes a mind that is not only precise but incredibly alert and discriminating about its own thought to shut out the contexts around it. What a context, too – the mythology of the Greeks, a highly influential artifact of culture in its own right.

Instead of allowing his unique mind to become submerged in these beliefs, he rejects superstition and the ephemeral contexts around him. His philosophy, while influenced by his predecessors and contemporaries, is the product of a painstaking selectivity about which influences to permit.

Aristotle is an artisan of the mind, whittling away excess thought, as if he were Michelangelo chipping away the unneeded marble.

The resulting cognitive construct is pure and uncluttered, universal and timeless.  What a superlative gift to offer to the rest of us humans …

Next Actions – Stand Against the Trump Regime

Now that Trump has declared his intention to act against the people of the United States, I reached out to a friend of mine who gave me a list of next actions she and her wife have already taken for fighting the regime.
– Attend the protests & marches
– Sign the petitions on http://front.moveon.org
– Join your local Indivisible chapter. These groups are pretty new and disorganized, so not a lot of unified action yet, but these are providing opportunities to fight the travel ban
– Become a monthly donor to the ACLU, Planned Parenthood, CAIR, MIRA, and others.
– Subscribe to protesting journalists and periodicals – NYT, Boston Globe, Christian Science Monitor
– Find your Congressional representatives https://www.senate.gov/general/contact_information/senators_cfm.cfm and call or email them daily voicing your concern.
For Massachusetts these are:
Ask friends & family in other states to call their Republican representatives
– Commit to the 10 actions / 100 days campaign from the women’s march: https://www.womensmarch.com/100/.
From Chelsea:  My wife & I had a few friends over on Saturday and we made postcards together.
Clearly we are not artists 🙂 But it was a great way to bond while taking action.

– Sign up for SwingLeft
– Donate to a local mosque and write them a letter expressing support.  Send care packages and letters to Muslim friends, neighbors, students.
– Get involved in our community and get more involved in local politics.   Have you ever thought about running for local office?
– For the spiritual minded, attend church, for support and as a way to become more involved with like-minded local people.
– Speak out on social media and stay tuned for news about protests and other events.
With all the liberties I have enjoyed throughout my lifetime, this is a chance to pay them forward for the next generation.

 

DevOps Insights from REDtalks 14

I recently had the good fortune to encounter Tom McGonagle, an SE with F5, via the Boston DevOps chatroom moderated by Dave Fredricks. I had been invited to post in Dave’s newly inspired mentor/mentee topic channel, which I welcomed, as I had been looking for guidance around a side project of mine. Tom contacted me through chat, and before the morning was out, we were enjoying a crisp pair of pizzas, the artful pies you can only get downtown.

We exchanged impressions on working in the tech industry, on the big-hearted, quirky and iconic culture that makes being an engineer among engineers so incredibly rewarding.  We concluded with an invite from Tom to one of the meetups he co-organizes, Hackernest in Artisan’s Asylum, so I marked my calendar and went on my way.

Before the week was out, Tom sent me a link to REDtalks #14: Tom & David on the Principles & Practices of DevOps with host Nathan Pearce, featuring Tom along with fellow DevOps specialist and Bentley U alumnus David Yates.

When I sat down to listen, I expected an informative piece with some new-to-me tidbits here and there.

This podcast captivated me. Rather than listening passively from one end to the other, I found myself skipping back and forth to make sure I was getting exactly what was being said. For Tom specifically, as the one who reached out to me: congratulations – this is fantastic.

Here are my (extensive!) notes from this most excellent podcast.

Yates – 6:10 – DevOps Handbook by Gene Kim and the three ways

  1. continuous delivery – testing and QA as a first class object, how do you pull that left in the pipeline and do it early, often, iteratively and incrementally
  2. continuous intelligence – how do you pull it all into a central location and make sense of what was happening in your application and infrastructure
  3. continuous learning – “fail early and fail often”, don’t be afraid to take risks, you can only learn by practicing and getting better, experimentation as culture, that includes getting the components of the infrastructure to harmonize with each other

Yates – 11:30 – teams uniting around a common mission

  • Quarter over quarter, having a common goal as to how the team can get better.  One of those goals can be customer education.

OKRs – Google’s term, objectives and key results

McGonagle – 12:31 – CAMS

  • CAMS is culture, automation, monitoring, and sharing. Sharing is critical: as a devops engineer, devops consultant, or DevOps SME at F5, there is a fiduciary responsibility to share these idea viruses. One of the idea viruses that I’m hot on right now is the idea of agile networking – my language for the application of agile and devops principles to the field of network engineering … it’s part and parcel of being part of the devops community; you have to share. As part of my sharing, David and I organize the Boston area Jenkins meetup group – the largest Jenkins meetup group in the world. It’s part of getting out into the community and getting people aware of and interested in DevOps.

McGonagle – 14:00 – 9 Practices of DevOps

Practice 1: 14:15 – Configuration Management – you can templatize your configurations and drive your autonomic infrastructures that self-build, self-configure and self-automate

  • Question from Yates on Practice #1: 16:20 – What are the best practices around Configuration management?
  • Answer about best practices from McGonagle at 16:40 –  use facts to drive your configuration, intelligence gathering about the server, self-identifying and self-configuring

Yates – 21:00 – a big motivator for devops is that it’s the marriage of modern management and IT best practices, with positive feedback between business requirements and IT delivery

Yates – 21:31 – business reasons that gives DevOps legs

Yates – 21:45 – DevOps from all points of view, IT best practices

Practice 2: 22:59 – Continuous integration – a robot such as Jenkins that takes your code from a source code management repository and builds it and tests it in a continuous way, every time a developer commits code the robot tests it against the functional and unit tests, it enables the developers to have awareness of the quality of the code

  • McGonagle – 25:40 – Linting – check the code for the appropriate format, which eliminates an enormous amount of errors, a test that can be orchestrated through a tool like Jenkins
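As a minimal illustration of the idea (the script name and contents here are invented), even bash can lint itself with its parse-only flag:

```shell
# Write a trivial deploy script, then syntax-check it without executing it.
# bash -n parses the file and exits non-zero on syntax errors - a cheap lint
# step a CI robot like Jenkins can run on every commit.
cat > /tmp/example-deploy.sh <<'EOF'
echo "deploying release"
EOF
bash -n /tmp/example-deploy.sh && echo "lint passed"
```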

Practice 3: 26:40 Automated testing – TDD, test driven development, build the test into your CI infrastructure, “write the unit test before the code”

  • Yates – 27:53 – TDD is one of the core principles of the XP Agile framework, make sure you know it works before you roll it out, especially for security

Practice 4: 29:15 – Infrastructure as Code – software project for your infrastructure with all the benefits applied to infrastructure, infrastructure is programmable and extensible, saves time and validates the process

  • Yates – 34:14 – canary release – don’t put out a new release everywhere at once, put it out in an isolated deployment so it can be rolled back quickly, if it succeeds then roll it out more widely

Practice 5: 35:40 – Continuous delivery – the way the code is rolled out, there’s a button that’s pushed to release – do you push a button to release?

Practice 6: 35:40 – Continuous deployment – the code constantly goes to production – do you create a button to release?

Practice 7: 18:16 – Continuous monitoring – metrics driven devops, APM – application performance monitoring, instrumenting your code to expose various qualities about your code and infrastructure to a metrics gathering tool

  • McGonagle – 39:27 – ACAMS+ – add Agile to culture, automation, monitoring and sharing, plus whatever is important to you

Practice 8: 40:30 – Develop an engaged and inclusive culture to encourage collaboration and shared ownership

  • Tom’s Amish barn-raising post – a culture in which all teams are working toward the same goal
  • Yates – 41:44 – students run three sprints using scrum, the most important thing you can do is own the product you’re going to deliver, having empathy for teammates, easier to say than do

Practice 9: 43:47 – Actively participate in communities of practice to become a lifelong learner of technology development (don’t be a jerk!) – going to conferences, being a speaker, a good participant, a nice person, a listener, the benefit is the learning opportunities it creates

My final takeaway is I am humbled by the privilege of being able to work in an industry distinguished by a culture of enthusiasm, passion and ownership.

While no profession can be exempt from drudgery, the devops culture of cheerful collaboration has, by virtue of its effectiveness, become an accepted prerequisite for deploying a successful product.  As a result, the typical corporate cynicism is mitigated and even replaced by an expressive and generous optimism.  Innovative and disruptive indeed.

Darkness, Redemption and the President

I was devastated on Wednesday by the election turnout. When I came home from work I curled up in bed with the lights out. I yelled at Jeff because he wasn’t angry enough. It was painful to me that he didn’t seem to be hurting like I was.

For hours that night I let the darkness shrivel me up and push him away.  I refused to give him our customary good night kiss. That hurt him, and it made me feel good that I hurt him.

At some point that night, in the dark, I realized this was not a path I wanted to go down.
[pullquote class=”left”]Terrorists win if we are terrified to live our lives. Hatred wins if we hate the people who share our lives.[/pullquote]I love many people who disagree with me. A point of pride, one I brag about, is that Jeff and I can love each other while disagreeing on most things. But what can unite us when we disagree on something so fundamental? If I question his conscience, are we even compatible anymore?

Terrorists win if we are terrified to live our lives. Hatred wins if we hate the people who share our lives.  I can’t love only part of Jeff, or cherry-pick what parts of him I think are ok to love.  And I can’t do that to them either.

I’m not speaking hypothetically, or generically. These are actual people, family, friends, who depend on me and who love me. How can I let them down by blaming them for a situation I already refused to own? I let Trump get elected, this is ultimately on me. Am I speaking figuratively or collectively? Probably not as much as I’d like to think.


TO PROTEST

Let’s talk protests. Jeff offered to go with me. I ultimately decided against protesting … for now.

[pullquote class=”left”]By protesting an outcome we recognize as fair, we are weakening the impact of protests to come.[/pullquote]Why not protest? What is there to protest right now? If Hillary Clinton had been elected, there would be no basis for protesting, so why is there one now? I want to protest injustice in the system, not outcomes. We should not protest the rules because our team lost by them, and no one should protest the mere fact of Trump being president.  By protesting an outcome we recognize as fair, we are weakening the impact of protests to come.

Ultimately, Jeff and I care about the same things. We disagree on how to get there, but we are fundamentally united in our agreement on principles of behavior, government and ethics.

Not everyone who supported Donald Trump agreed with him on principle.  Some supported him because they considered Hillary Clinton to be a worse threat to the United States, or because they considered Trump’s economic policies to be beneficial (whatever those might be).  For those people, the above statement applies, because the underlying principles of their decision were aligned with my own.

For the others who did agree with Trump’s principles, as far as I am concerned, it comes back to the Christian principle of love thy neighbor as thyself.  This is purely because my personal spirituality includes having faith in that principle. I’ve looked down the other path, and it’s not something I would want for myself or anyone else I care about.

In the meantime, regroup, reorganize, blog, let your voice be heard. Save your strength, because the times are coming when we will have injustices to protest, and targeted lives to defend.  Let them come.

Resolving Hadoop Problems on Kerberized CDH 5.X

I ran into a problem in which I had a Kerberized CDH cluster and couldn’t run any hadoop commands from the command line, even with a valid Kerberos ticket.

So with a valid ticket, this would fail:
hadoop fs -ls /
WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Here is what I learned and how I ended up resolving the problem. I have linked to Cloudera doc for the current version where possible, but some of the doc seems to be present only for older versions.

Please note that the problem comes down to a configuration issue but that Kerberos itself and Cloudera Manager were both installed correctly. Many of the problems I ran across while searching for answers came down to Kerberos or Hadoop being installed incorrectly. The problem I had occurred even though both Hadoop and Kerberos were functional, but they were not configured to work together properly.

TL;DR

MAKE SURE YOU HAVE A TICKET

Run klist as the user that will execute the hadoop command.

sudo su - myuser
klist

If you don’t have a ticket, it will print:

klist: Credentials cache file '/tmp/krb5cc_0' not found

If you try to do a hadoop command without a ticket you will get the GSS INITIATE FAILED error by design:
WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

In other words, that is not an install problem. If this is your situation, take a look at http://www.roguelynn.com/words/explain-like-im-5-kerberos/ . For other troubleshooting of Kerberos in general, check out https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html
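If a missing ticket is your problem, the fix is simply to get one. A sketch, with the principal and realm as placeholders for your environment:

```shell
# Obtain a TGT for the user, verify it is cached, then retry the command.
kinit myuser@EXAMPLE.COM    # prompts for the password and caches a TGT
klist                       # should now list a krbtgt/EXAMPLE.COM ticket
hadoop fs -ls /             # succeeds once a valid ticket is cached
```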


CDH Default HDFS User and Group Restrictions

A default install of Cloudera has user and group restrictions on the execution of hadoop commands, including a specific ban on certain users (more on page 57 of http://www.cloudera.com/documentation/enterprise/5-6-x/PDF/cloudera-security.pdf). There are several properties that deal with this.

Specifically for the hdfs user: if you are trying to use it to execute hadoop commands, make sure you have removed hdfs from the banned.users configuration property.

1) Unprivileged User and Write Permissions

The Cloudera-recommended way to execute hadoop commands is to create an unprivileged user with a matching principal, instead of using the hdfs user. A gotcha is that this user also needs its own directory under /user; if your unprivileged user does not have one, you may run into WRITE permission denied errors.
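A sketch of that setup for a hypothetical unprivileged user named alice (the username is an example, not from the docs):

```shell
# Create a home directory in HDFS for the unprivileged user and hand it over,
# so her jobs have a location they are permitted to write to.
sudo -u hdfs hadoop fs -mkdir /user/alice
sudo -u hdfs hadoop fs -chown alice:alice /user/alice
```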

Cloudera Knowledge Article

http://community.cloudera.com/t5/CDH-Manual-Installation/How-to-resolve-quot-Permission-denied-quot-errors-in-CDH/ta-p/36141

2) Datanode Ports and Data Directory Permissions
Another related issue is that Cloudera sets dfs.datanode.data.dir permissions to 750 on a non-kerberized cluster, but requires 700 on a kerberized cluster. With the wrong directory permissions set, the Kerberos install will fail. The datanode ports must also be set to privileged values below 1024; the recommended values are 1006 for the HTTP port and 1004 for the datanode port.
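In hdfs-site.xml, those settings look roughly like this (the bind addresses are examples; the property names are from the Hadoop/CDH documentation):

```xml
<!-- Kerberized datanode: 700 on the data dirs, privileged ports below 1024 -->
<property>
  <name>dfs.datanode.data.dir.perm</name>
  <value>700</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:1004</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:1006</value>
</property>
```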

Datanode Directory

http://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hdfs_cluster_deploy.html

Datanode Ports

http://www.cloudera.com/documentation/archive/manager/4-x/4-7-2/Configuring-Hadoop-Security-with-Cloudera-Manager/cmchs_enable_security_s9.html

3) Service Specific Configuration Tasks

On page 60 of the CDH security doc, there are steps to kerberize Hadoop services. Make sure you did these!

MapReduce

sudo -u hdfs hadoop fs -chown mapred:hadoop ${mapred.system.dir}

HBase

sudo -u hdfs hadoop fs -chown -R hbase ${hbase.rootdir}

Hive

sudo -u hdfs hadoop fs -chown hive /user/hive

YARN

rm -rf ${yarn.nodemanager.local-dirs}/usercache/*

All of these steps EXCEPT the YARN one can happen at any time. The YARN step must happen after the Kerberos installation because it removes the user cache left over from non-kerberized YARN. The first MapReduce job you run after the Kerberos install should repopulate it with kerberized user cache data.

YARN User Cache
http://stackoverflow.com/questions/29397509/yarn-application-exited-with-exitcode-1000-not-able-to-initialize-user-directo

Kerberos Principal Issues

1) Short Name Rules Mapping
Kerberos principals are “mapped” to OS-level service users. For example, hdfs/WHATEVER@REALM maps to the service user ‘hdfs’ in your operating system only because of a name mapping rule set in Hadoop’s core-site. Without name mapping, Hadoop wouldn’t know which user is authenticated by which principal.

If you are using a principal that should map to hdfs, make sure the principal name resolves correctly to hdfs according to these Hadoop rules.

Good
(has a name mapping rule by default)

  • hdfs@REALM
  • hdfs/_HOST@REALM

Bad
(no name mapping rule by default)

  • hdfs-TAG@REALM

The “bad” example will not work unless you add a rule to accommodate it.

Name Rules Mapping
http://www.cloudera.com/documentation/archive/cdh/4-x/4-5-0/CDH4-Security-Guide/cdh4sg_topic_19.html
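A rule accommodating the “bad” example might look like the following in core-site.xml; EXAMPLE.COM is a placeholder realm. The rule builds the string hdfs-TAG@EXAMPLE.COM from a one-component principal, matches it, and rewrites the whole thing to the short name hdfs:

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](hdfs-TAG@EXAMPLE.COM)s/.*/hdfs/
    DEFAULT
  </value>
</property>
```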

2) Keytab and Principal Key Version Numbers Must Match
The Key Version Number (KVNO) is the version of the key that is actively being used (as if you had a house key but then changed the lock on the door so it used a new key, the old one is no longer any good). Both the keytab and principal have a KVNO and the version number must match.

By default, when you use ktadd or xst to export a principal to a keytab, kadmin randomizes the principal’s keys, which increments the principal’s KVNO. Any previously exported keytab, or any service still using the old key, is left with a stale KVNO, so you can accidentally create a mismatch.

Use -norandkey with kadmin.local (or kadmin, where supported) when exporting a principal to a keytab; this keeps the principal’s existing keys, so no KVNO mismatch is created.
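For example, exporting a principal without touching its keys; the keytab path and principal name are placeholders:

```shell
# -norandkey keeps the principal's existing keys, so its KVNO is unchanged
# and previously issued tickets and keytabs stay valid
kadmin.local -q 'ktadd -norandkey -k /etc/hdfs.keytab hdfs@EXAMPLE.COM'
```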

In general, whenever you are having principal authentication issues, check that the KVNO of the principal and the keytab match:
Principal
kadmin.local -q 'getprinc myprincipalname'

Keytab
klist -kte mykeytab

Creating Principals
http://www.cloudera.com/documentation/archive/cdh/4-x/4-3-0/CDH4-Security-Guide/cdh4sg_topic_3_4.html

Security Jars and JAVA Home

1) Java Version Mismatch with JCE Jars
Hadoop needs the Java security JCE Unlimited Strength jars installed in order to use AES-256 encryption with Kerberos. Both Hadoop and Kerberos need to have access to these jars. This is easy to miss because you can think you have the security jars installed when you really don’t.

JCE Configurations to Check

  • the jars are the right version – the correct security jars are bundled with Java, but if you install them after the fact you have to make sure the version of the jars corresponds to the version of Java or you will continue to get errors.
    To troubleshoot, check the md5sum hash of the JCE jars from a brand new download of the same exact JDK that you’re using against the md5sum hash of the ones on the Kerberos server.
  • the jars are in the right location ( JAVA_HOME/jre/lib/security )
  • Hadoop is configured to look for them in the right place. Check if there is an export statement for JAVA_HOME to the correct Java install location in /etc/hadoop/conf/hadoop-env.sh
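A quick way to run the version check from the first bullet; local_policy.jar and US_export_policy.jar are the standard JCE policy jar names, and the JAVA_HOME fallback path here is just an example:

```shell
# Hash the installed policy jars; compare against the hashes of the jars
# from a fresh download of the same exact JDK version
JCE_DIR="${JAVA_HOME:-/usr/java/default}/jre/lib/security"
md5sum "$JCE_DIR/local_policy.jar" "$JCE_DIR/US_export_policy.jar"
```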

If Hadoop has JAVA_HOME set incorrectly, it will fail with GSS INITIATE FAILED. If the jars are not in the right location, Kerberos won’t find them and will give an error that it doesn’t support the AES-256 encryption type (UNSUPPORTED ENCTYPE).

Cloudera with JCE Jars
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cm_sg_s2_jce_policy.html

Troubleshooting JCE Jars
https://community.cloudera.com/t5/Cloudera-Manager-Installation/Problem-with-Kerberos-amp-user-hdfs/td-p/6809

Ticket Renewal with JDK 6 and MIT Kerberos 1.8.1 and Higher

Cloudera has an issue documented at http://www.cloudera.com/documentation/archive/cdh/3-x/3u6/CDH3-Security-Guide/cdh3sg_topic_14_2.html in which tickets must be renewed before hadoop commands can be issued. This only happens with Oracle JDK 6 Update 26 or earlier and package version 1.8.1 or higher of the MIT Kerberos distribution. To check the package, do an rpm -qa | grep krb5 on CentOS/RHEL or aptitude search krb5 -F "%c %p %d %V" on Debian/Ubuntu.

The workaround given by Cloudera is to do a regular kinit as you would, then do a kinit -R to force the ticket to be renewed.
kinit -kt mykeytab myprincipal
kinit -R

And finally, the issue I actually had which I could not find documented anywhere …

Configuration Files and Ticket Caching


There are two important configuration files for Kerberos: krb5.conf, which configures the Kerberos client libraries, and kdc.conf, which configures the krb5kdc service and the KDC database. My problem was that the krb5.conf file had this property:
default_ccache_name = KEYRING:persistent:%{uid}

This set my cache name to the KEYRING:persistent type, scoped to the user uid ( explained at https://web.mit.edu/kerberos/krb5-1.13/doc/basic/ccache_def.html ). When I did a kinit, the ticket was created in /tmp because the cache name was being set elsewhere as /tmp, so the cache my user actually wrote to and the cache named in krb5.conf did not match. Cloudera services obtain authentication with files generated at runtime in /var/run/cloudera-scm-agent/process , and these all export the cache name environment variable ( KRB5CCNAME ) before doing their kinit, which overrides the default. That’s why Cloudera could obtain tickets but my hadoop user couldn’t.

The solution was to remove the default_ccache_name line from krb5.conf and allow kinit to store credentials in /tmp , which is the MIT Kerberos default value DEFCCNAME ( documented at https://web.mit.edu/kerberos/krb5-1.13/doc/mitK5defaults.html#paths ).
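For reference, the relevant piece of my /etc/krb5.conf with the fix applied; the realm is a placeholder:

```
[libdefaults]
  default_realm = EXAMPLE.COM
  # removed: default_ccache_name = KEYRING:persistent:%{uid}
  # with no default_ccache_name, kinit falls back to FILE:/tmp/krb5cc_<uid>
```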

Liked this post and want to hear more? Follow me at https://twitter.com/saranicole and connect at https://www.linkedin.com/in/sarastreeter

Cloudera and Kerberos installation guides

Step-by-Step
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_sg_intro_kerb.html
Advanced troubleshooting
http://www.cloudera.com/documentation/enterprise/5-6-x/PDF/cloudera-security.pdf , starting on page 48