What’s the impact of naming?
Before we dig into the naming problem itself, I’d like to stop for a bit and think about the impact of naming. We know that identifiers make up around 70% of source code, but do we know how much time our team can save by having good identifier names?
Secondly, can we use the information about time savings to make better decisions while programming or reviewing someone else’s code? I’d like to have a strategy for making a quick decision about names – without depending on my faulty intuition.
Uncle Bob wrote that “the ratio of time spent reading versus writing is well over 10 to 1”4. This is probably true for some codebases. But let’s not stop there. Let’s try to find a more accurate number based on real-life data. Such data can be found in a Microsoft Research paper that summarises responses from several hundred professional developers working at Microsoft.
Researchers created a developer activity catalogue:
- understanding the code,
- unit testing,
- code maintenance overhead (building, compiling),
- other code related activities,
- non code activities.
It turned out that “no single activity accounts for most of developer’s time”. The median times spent on each activity were very close. The research also showed that those numbers are subject to wide variability depending on teams and lifecycles. Based on median values from the paper, the “understanding the code” activity accounts for about 10% of an average developer’s time. The same can be said about both the writing and communicating activities, which are also related to identifier naming5.
Initially, I only wanted to know how much time we spend on reading the code. However, there was another side of this research that grabbed my attention. Developers were asked to name the most difficult problems they face in their work. The highest-scoring problem was related to the “understanding the code” activity: 66% of respondents considered it a serious problem.
Most difficult problems
| This is a serious problem for me | % agree |
|---|---|
| Understanding the rationale behind a piece of code | 66% |
| Understanding code that someone else wrote | 56% |
| Understanding the history of a piece of code | 51% |
| Understanding code that I wrote a while ago | 17% |
Most serious problems for Microsoft developers5
Another piece of research went into even more detail. Researchers analysed the impact of different attributes of names on the programmer’s ability to process them6. They concluded that “longer names take an average of 20.1 s longer to process.”
Based on the research and experiments, we can conclude that:
- We spend a substantial amount of time trying to understand the code.
- Understanding the rationale behind a piece of code is the most difficult problem we face.
- Longer names take an average of 20 seconds longer to process. This means that if you use only long descriptive identifiers, the roughly 70% of your code made up of identifiers takes much longer to process.
Cargo cult of naming advice
Unfortunately, I think I don’t really understand the naming problem. Instead, I have always followed “good naming practices” almost blindly. The majority of “naming convention”-related advice in professional literature is very superficial. It states that I should or should not do something, and then “proves” that claim with some anecdotes.
In the “Clean Code” book4, the chapter about names includes the following advice:
- Use intention-revealing names.
- Avoid disinformation.
- Make meaningful distinctions.
- Use pronounceable names.
- Use searchable names.
- Avoid encodings.
- Avoid mental mapping.
- Don’t be cute.
- Pick one word per concept.
- Don’t pun.
- Use solution domain names.
- Use problem domain names.
- Add meaningful context.
- Don’t add gratuitous context.
Where does it come from?
Each of the above is explained using an example and some individual reasoning. The author uses phrases like “I prefer…”, “I don’t want” and “I choose”. Don’t get me wrong, this is a great book. However, I can’t help feeling that those statements are based only on intuition and experience. There are no references to any research or experiments. That makes it susceptible to cargo cult, because programmers tend to follow this advice without a deep understanding of its purpose. If you have ever heard an argument like “because Uncle Bob says so”, then you definitely feel my pain.
But my feelings aside, let’s list the problems with this kind of “best practices” approach:
- Some of the advice seems to be addressed to a single programmer. It overlooks more general causes of bad naming, where more than one person changes a piece of code over time.
- Most of it focuses on syntactical aspects, e.g. lowercase package names or CamelCased class names.
- When it does address semantical aspects of names, it is very vague, just stating that “names should be meaningful, descriptive and self-documenting”.
Clearly, only requiring “meaningful”, “descriptive”, or “self-documenting” names is insufficient. First, a name not only needs to be meaningful but reflect the correct meaning. Second, the “correct” meaning and name of a concept is naturally highly debatable.
– Deißenböck & Pizka 3
In this post I am going to give you some experimental research data and formal definitions of good names. This way, I hope you can become a better programmer by moving away from folklore of naming practices – and base your naming skills on science.
Dealing with the naming problem
We now see that the naming problem has a big impact on code comprehension. The naming activity may look straightforward, but we all know it isn’t!
In this section I want to go through the main reasons why that is the case.
The hardest thing about choosing good names is that it requires good descriptive skills and a shared cultural background.
“Clean Code” 4
Formal definition of a name
In the great paper about consistent and concise naming3, the authors introduced two spaces, denoted C and N:
- C is a concept space: it includes all concepts relevant within a certain scope (project, company, team),
- N is a name space: it includes all possible names that programmers use in the source code.
During the lifetime of an application, programmers build a formal relation R between those two spaces: they assign names from N to concepts from C.
Having this definition, we can now try to find out what it means for the naming relation R to be consistent. First, let’s look at two things that make the naming inconsistent.
Homonyms and synonyms
Homonyms are words with more than one meaning. In terms of source code, that means that programmers use one identifier (e.g. book) to name more than one concept (e.g. a book and a booking activity).
Synonyms are different words with the same meaning. A typical example is using two different identifiers, like
number, to name one concept – account number.
The mixture of synonyms and homonyms, which is commonly found in source codes, maximizes confusion and aggravates comprehension efforts enormously.
– Deißenböck & Pizka 3
Rule #1: Choose consistent names
A naming relation R is consistent if and only if the mapping is bijective, i.e. each identifier name from N is paired with only one concept from C and each concept from C is paired with only one identifier name from N.
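The bijectivity condition can be checked mechanically once the relation R is written down. Below is a minimal Python sketch, assuming the relation is available as a set of (name, concept) pairs. The relation itself, the helper names and the identifier account_number are hypothetical and only serve as an illustration; they do not come from the paper.

```python
from collections import defaultdict

# Hypothetical naming relation R, written as (identifier name, concept) pairs.
naming_relation = {
    ("book", "book"),
    ("book", "booking activity"),          # homonym: one name, two concepts
    ("account_number", "account number"),
    ("number", "account number"),          # synonym: two names, one concept
    ("title", "book title"),
}


def find_homonyms(relation):
    """Return names that are paired with more than one concept."""
    concepts_by_name = defaultdict(set)
    for name, concept in relation:
        concepts_by_name[name].add(concept)
    return {n: c for n, c in concepts_by_name.items() if len(c) > 1}


def find_synonyms(relation):
    """Return concepts that are paired with more than one name."""
    names_by_concept = defaultdict(set)
    for name, concept in relation:
        names_by_concept[concept].add(name)
    return {c: n for c, n in names_by_concept.items() if len(n) > 1}


def is_consistent(relation):
    """The relation is consistent when it has neither homonyms nor synonyms."""
    return not find_homonyms(relation) and not find_synonyms(relation)


print(find_homonyms(naming_relation))
print(find_synonyms(naming_relation))
print(is_consistent(naming_relation))
```

The hard part in practice is not the check itself but obtaining the relation: the pairs above have to be written down by someone who knows which concept each identifier stands for.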
Rule #2: Choose correct names
In order to define correctness, we need to introduce one more thing about concept space C: a partial ordering ⊑ which orders concepts according to their level of abstraction. If concept set C contains both permutation and transformation concepts, then permutation ⊑ transformation, i.e. transformation is a generalization of permutation.
Hence, the naming relation R is correct if all identifier names from N correspond to the concept they are manifesting or a generalization of this concept. So, if you have a function that calculates and returns a permutation and you named it transformation, it’s a correct name. However, the name is not concise enough.
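To make the correctness rule concrete, here is a small Python sketch. It assumes a simplified ordering in which every concept has at most one direct generalization (a real partial order may have several), and the GENERALIZATION table and function names are made up for this illustration.

```python
# Hypothetical fragment of a concept space: each concept maps to its direct
# generalization (None marks the most general concept in this fragment).
GENERALIZATION = {
    "permutation": "transformation",
    "transformation": None,
}


def generalizations(concept):
    """Return the concept itself followed by everything above it in the ordering."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = GENERALIZATION[concept]
    return chain


def is_correct(name, concept):
    """A name is correct if it denotes the concept or one of its generalizations."""
    return name in generalizations(concept)


# A function that computes a permutation may correctly be called "transformation"...
print(is_correct("transformation", "permutation"))
# ...but a function that computes a general transformation may not be called "permutation".
print(is_correct("permutation", "transformation"))
```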
Rule #3: Choose concise names
The naming relation R is concise if all identifier names from N have exactly the same name as the concept they stand for. That means that if your function calculates a permutation, then you should use permutation as its name.
Sounds easy, right? But what if your concept space C had two different permutation-like concepts?
Unfortunately, naming any function permutation would then not be concise (it would be correct, though). Why? Because it is a generalization of at least two concepts from your concept space C.
The key to keep comprehensibility and detailing of identifiers in balance is to control the content of the concept space
– Deißenböck & Pizka 3
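The difference between a correct and a concise name can be sketched in a few lines of Python. The two permutation-like concepts used below (cyclic permutation and even permutation) are hypothetical stand-ins chosen for this illustration, as are the table and function names.

```python
# Hypothetical concept space: each concept maps to the set of concepts that
# generalize it (i.e. everything above it in the ordering).
GENERALIZED_BY = {
    "cyclic permutation": {"permutation", "transformation"},
    "even permutation": {"permutation", "transformation"},
    "permutation": {"transformation"},
    "transformation": set(),
}


def is_correct(name, concept):
    """Correct: the name denotes the concept or one of its generalizations."""
    return name == concept or name in GENERALIZED_BY[concept]


def is_concise(name, concept):
    """Concise: the name is exactly the name of the concept it stands for."""
    return name == concept


def concepts_covered_by(name):
    """All more specific concepts that the given name generalizes."""
    return {c for c, ups in GENERALIZED_BY.items() if name in ups}


# Naming a cyclic-permutation function "permutation" is correct but not concise,
# because "permutation" covers more than one concept in this space.
print(is_correct("permutation", "cyclic permutation"))
print(is_concise("permutation", "cyclic permutation"))
print(concepts_covered_by("permutation"))
```

The sketch also shows why the size of the concept space matters: the more specific concepts it contains, the fewer short names remain concise.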
Advice from “Clean Code” revisited
Let’s go back to the “Clean Code” book advice that we tend to follow and talk about. Having the aforementioned formal definition and the 3 rules of good names, we can map each piece of advice from the book to one of those 3 rules!
| Naming advice from “Clean Code”4 | Rule from “Concise and Consistent Naming”3 |
|---|---|
| use intention-revealing names | choose correct names |
| avoid disinformation | choose correct names |
| make meaningful distinctions | choose consistent names |
| use pronounceable names | choose correct names |
| use searchable names | choose correct names |
| avoid encodings | choose consistent names |
| avoid mental mapping | choose consistent names |
| don’t be cute | choose consistent names |
| pick one word per concept | choose consistent names |
| don’t pun | choose consistent names |
| use solution domain names | choose concise names |
| use problem domain names | choose concise names |
| add meaningful context | choose concise names |
| don’t add gratuitous context | choose concise names |
Controlling the concept space
The immediate conclusion from the formal definition of consistent and concise naming is that we need to be able to prove that any given concept is represented with only one very specific name. We can make sure that this is the case by closely controlling the concept space. Unfortunately, in the vast majority of our projects it’s practically impossible! Whenever I am adding a new identifier or changing the name of an existing one, I am using my best judgement and intuition, and focusing on best syntactic practices. I am unable to implement any efficient semantic naming practices. I am unable to follow the 3 formal rules. And the root cause is that I don’t have direct access to the set of concepts in the scope of my application.
This gets even harder when we collaborate with other people. The mapping between names in the code and concepts in the application scope is rarely documented. To sum up: in many projects we expect people to implicitly agree on all terms in the code and remember those agreements during the project’s lifetime… That’s why naming is so hard!
As you see, the problem is even bigger than expected. Informal advice (like that in “Clean Code”) is helpful only because we don’t have anything better.
Identifier quality and code comprehension
Now we know that in order to create good names, we need to precisely name the concepts in a finite, well-defined set. This is just one side of the problem, though. Even if we had concise, correct & consistent names, they could still cause problems in code comprehension!
There are multiple research papers on identifier quality. Lawrie and others observed that longer names can slow other programmers down (even 20 s slower for reading a single identifier!6). They attributed this to overloading the programmer’s short-term memory. They also tested and confirmed the hypothesis that names with ties to the programmer’s persistent memory can vastly improve code comprehension (programmers remember them easily and retain them longer). For example:
toString has strong ties to Java programmers’ memory, while
Maximal comprehension occurs when the pressure to create longer, more expressive names is balanced against limited programmer short-term memory resources.
– Binkley, Lawrie, Maex, Morrell 6
Rule #4: Choose shorter names
Whenever you are naming a concept, you need to choose a name that will minimise comprehension time. Choose as few syllables as possible and use words that have ties to the programmer’s (or your coworker’s) memory.
But doesn’t it contradict the previous rules? How can we choose names that are both concise and very short? Again, the power lies in controlling the whole concept space. In order to have short & concise names, we need to make sure that our concept space is as small as possible. And this alone is hard work, because every time someone wants to introduce a new concept, we need to be able to rethink the whole set again!
Maybe we should use abbreviations?
Takang and others ran an experiment to test the following hypothesis: “programs that contain ‘full’ identifier names are more understandable than those with abbreviated identifier names”7. However, the quantitative data they collected didn’t support this hypothesis. Binkley and others shed some light on this discovery by presenting the following two code snippets6. Which one do you think takes longer to process?
    distance_between_abscissae = first_abscissa - second_abscissa
    distance_between_ordinates = first_ordinate - second_ordinate
    cartesian_distance = square_root(
        distance_between_abscissae * distance_between_abscissae +
        distance_between_ordinates * distance_between_ordinates)

    dx = x1 - x2
    dy = y1 - y2
    dist = sqrt(dx * dx + dy * dy)
In a different paper Lawrie and others conclude that “abbreviations are just as useful as the full-words”8. However, please remember that we need to use only generally known abbreviations, like dist for distance, len for length and char for character. When you use char as an abbreviation for characteristic, the above conclusion doesn’t hold.
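If you want to try the two snippets yourself, they can be wrapped into runnable Python functions as below. This is only a sketch: the function names and parameter order are my own assumptions, and square_root from the first snippet is replaced with math.sqrt.

```python
from math import sqrt


def cartesian_distance(first_abscissa, first_ordinate,
                       second_abscissa, second_ordinate):
    """Verbose variant, using the long identifiers from the first snippet."""
    distance_between_abscissae = first_abscissa - second_abscissa
    distance_between_ordinates = first_ordinate - second_ordinate
    return sqrt(
        distance_between_abscissae * distance_between_abscissae +
        distance_between_ordinates * distance_between_ordinates)


def dist(x1, y1, x2, y2):
    """Abbreviated variant, using the short identifiers from the second snippet."""
    dx = x1 - x2
    dy = y1 - y2
    return sqrt(dx * dx + dy * dy)


# Both variants compute the same value; only the reading effort differs.
print(cartesian_distance(0, 0, 3, 4), dist(0, 0, 3, 4))
```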
Rename refactoring revisited
Renaming is even harder than naming: it’s just naming combined with convincing people to unlearn something… According to a survey9, only 8% of developers think that renaming is straightforward. Even though IDEs support rename refactorings, renaming can still cause problems. The obvious problems are:
- Renaming that breaks API compatibility.
- Renaming names that are bound only at runtime (e.g. component discovery by name).
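The second problem is easy to reproduce. The Python sketch below uses a hypothetical handler class whose methods are discovered by name at runtime; the class, the handle_ prefix and the dispatch function are all invented for this illustration.

```python
class ReportHandlers:
    """Hypothetical component whose methods are looked up by name at runtime."""

    def handle_invoice(self):
        return "invoice report"


def dispatch(handlers, kind):
    # The method name is assembled from a string, so a rename refactoring
    # has no static reference to follow from here to handle_invoice.
    method = getattr(handlers, "handle_" + kind, None)
    if method is None:
        raise LookupError(f"no handler for {kind!r}")
    return method()


print(dispatch(ReportHandlers(), "invoice"))
```

If handle_invoice were renamed, the string "invoice" passed by callers would silently stop matching, and the failure would only show up when dispatch is called.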
But a more general problem is that the IDE renaming tool is too constrained, because it only refactors names, not concepts. Every time a concept changes (or a new one is introduced), programmers need to somehow make sure that all names for this concept are aligned with the change. The same goes for identifiers: every time an identifier name changes, programmers need to be sure that the name is still valid with respect to all concepts and other related names. Knowing that human memory is very limited, we can suspect that at least some of the names will not be updated, and this is when decay starts to spread in the codebase (compare with the Broken Windows theory).
Renaming and Code Decay
Deißenböck and Pizka ran an experiment with their students that included renaming a vital concept in the middle of the project lifetime. The conclusions were:
- The problem was ignored for a couple of weeks.
- After some time students working on the program were not able to comprehend its original meaning.
- They considered it a mess.
- Re-engineering was very extensive even though the program was fairly small (13,000 LOC).3
The above can also be easily confirmed by looking at the number of distinct identifiers used in open source projects. Data taken from the “Concise and Consistent Naming” paper:
- Eclipse 3.0M7 – 94,829 different identifiers (around the same number of words as in Oxford Advanced Learner’s Dictionary) which are compounds of 7,233 unique words.
- Sun’s JDK 1.4.2 (1.3 MLOC) – 42,869 different identifiers that are compounds of 6,426 different words.
- Tomcat 5.0.30 (317 kLOC) – 11,656 different identifiers composed from 2,587 words.
As you see, some popular software projects use thousands of unique words in their source code. If you compare those numbers with the number of English words needed to understand academic papers (5,000), you can see that they are definitely too big. This means there are lots of synonyms in the source code and renaming anything would require a substantial amount of manual, error-prone work.
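You can get similar numbers for your own codebase with a few lines of Python. The sketch below assumes that identifiers have already been extracted into a list and that they follow camelCase or snake_case conventions. The splitting rules are my own assumption and not necessarily the ones used in the paper.

```python
import re


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    # First split on underscores and other non-word characters...
    for part in re.split(r"[_\W]+", identifier):
        # ...then split each part on case changes, keeping acronyms together.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
    return [word.lower() for word in words]


def vocabulary_stats(identifiers):
    """Return (number of distinct identifiers, number of distinct words)."""
    distinct_identifiers = set(identifiers)
    distinct_words = {
        word
        for identifier in distinct_identifiers
        for word in split_identifier(identifier)
    }
    return len(distinct_identifiers), len(distinct_words)


# Hypothetical sample of identifiers taken from a codebase.
sample = ["accountNumber", "account_number", "number",
          "parseHTTPResponse", "toString", "toString"]
print(vocabulary_stats(sample))
```

A large gap between the two counts is not a problem by itself, but a vocabulary with thousands of words is a hint that the concept space is not under control.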
Naming is hard. We dived deep into the topic and extracted 4 rules of naming that all developers should follow. However, following those rules in today’s development environments is very hard and requires a lot of manual work. The biggest problem of naming is that programmers don’t have tools to analyse and stop name decay. And since names are about 70% of the source code, this contributes hugely to software decay as a whole. To get better at naming, we need to invent better tools.
This article appeared originally at http://michalplachta.com/2017/01/22/folklore-and-science-of-naming-practices/.