============================
TPAU R Scripting Style Guide
============================
Brian Gregor
2/5/09

Purpose
=======

As time has gone by since TPAU began using R to script models, and as more people have learned to write R scripts, the need for script writing conventions has increased. Writing scripts according to a common style improves readability and understanding. This makes it easier for people to collaborate on a project. It also makes scripts easier to maintain and extend.

While R's flexible object structures simplify the management of data and scripting of calculations, they also pose challenges for understanding scripts. R supports vectors, matrices, and arrays of any number of dimensions. Several types of iterators for these data structures are built into the language so that any number of operations may be carried out on them without having to program loops or other iterators. R also supports lists (flexible collections of data) and data frames (lists that allow data to be displayed and addressed like a matrix). While these data structures simplify programming and data management for models, their use requires that script writers and readers keep mental track of data structure dimensionality. This can make it difficult to understand a script written by someone else, or even a script one wrote oneself months earlier.

One purpose of these R script writing guidelines is to describe a set of object naming conventions that incorporate descriptions of data structures into the object names. This greatly reduces the effort required to understand scripts.

Another purpose of these guidelines is to establish conventions for laying out and commenting scripts. Script layout and commenting can contribute greatly to creating understandable scripts.
The ideal to strive for is to write scripts that are self-documenting, where all one has to do is read the comments and procedures to understand what the script does and how it does it. The layout of a script should ideally be like an outline of a document which leads readers through the logic of the procedures carried out by the script. The right balance of layout, comments and programming transparency can improve readability greatly. Achieving this balance can be challenging, however, because scripts are stored as plain ASCII text files and because the tendency of most people writing scripts (i.e. programs) is to stop writing when the script produces the right results.

A scripting style guide can encourage writers to finish their writing so that it results in a script that doesn't just work, but is also well laid out and readable. A style guide can also increase the ability of script writers to lay out scripts effectively despite the limitations of plain ASCII text. Furthermore, rules for layout can facilitate the production of documentation by enabling the use of programs that automatically process the script to produce nicely formatted output. Specifically, these guidelines propose the use of the Markdown text markup structure, which produces nice ASCII layouts and, when processed using an appropriate program such as markdown.lua, creates nicely formatted HTML output. Since the output is HTML, it has the additional benefit that hyperlinks can be embedded so that documentation for scripts or portions of scripts can be related to one another.

Object Naming Conventions
=========================

Everything in R is an object (in the sense of object-oriented programming). However, the conventions proposed in this document do not follow those of most object-oriented languages.
There are several reasons for this:

1) Although R is an object-oriented language, the syntax for calling object methods in R is different from that of the common object-oriented languages such as C++, Java and Python. In these languages, functions defined for an object (known as methods in object-oriented programming terms) are applied using the "object.method" syntax. For example, myObject.print, where print is a function being applied to myObject. R, which has strong functional programming roots, does not use this syntax. Instead the syntax is "function(object)"; for example, print(myObject). Because the "." does not have this special meaning in R, it can be used for other purposes in a naming convention.

2) Other language naming conventions don't worry about conveying the structure or dimensionality of the data in the object names. That's because in most cases they support only very simple data structures. It is up to the programmer to create objects to represent more complex data structures. The use of multidimensional data structures is so common in R scripts, however (at least in travel modeling applications), that special notation is useful.

3) Creating a scheme for describing object data structures and dimensionality in the names requires the use of special characters in the names. Unfortunately, only two special characters may be used in R object names: the period (.) and the underscore (_).

4) Other language object naming conventions are geared to programmers and to languages that are geared to writing large systems. These conventions deal with issues that are not a large concern for the R scripts that we write. What is needed for our scripts is to express computational steps as clearly and simply as possible.

Object Name Prefix and Suffix
-----------------------------

Object names will have a prefix and one or more suffixes. The object name prefix is separated from the object name suffix by one or more periods or underscores.
The prefix is used to describe what the object is or what the object does. For example, an object named "NumHh" would contain data on the number of households. The suffixes of an object name are used to describe the structure of the data. The prefix is separated from the suffix by a period (.), a double period (..) or an underscore (_). The type of separator identifies the data structure as a vector, matrix or array, as a data frame, or as a list. This is described in detail below.

Prefix Naming
-------------

It is important that object prefixes be chosen so that they are readily understood. It is common for computer programmers to use abbreviations or terse names. This practice should be avoided. It is much more important that names be understandable to others than it is for a script writer to save a few keystrokes in the course of writing a script. For example, PmPeakVolume is preferable to pmvol.

A convention is necessary in order to facilitate the use of names that include more than one word or abbreviation. Some programmers use underscores or periods to do this. However, since these characters are used to separate the name prefix and suffix, intercapping is used instead to separate words in a name prefix. The first letter of each word or abbreviation in a name is capitalized, as in the example above (PmPeakVolume). Notice that the abbreviation "PM" is included as "Pm" in this example.

There is one exception to the convention of starting every word or abbreviation in a name with a capital letter. For data objects, this should always be done, but the names of functions should start with an uncapitalized letter. A function that balances a two-dimensional matrix might be called "balanceMatrix", for example. This capitalization difference helps to distinguish functions from data.

Functions should be further distinguished from data by using verbs as the first word in a name.
This emphasizes the nature of functions, which is to perform actions, and distinguishes them from data objects, which are passive. Examples of appropriate function names are "generateTrips", "distributeTrips", and "chooseModes". These names are preferable to "tripGeneration", "tripDistribution", and "modeChoice".

Objects that have boolean (true/false) values should start with "Is" or some other word like "Has" which conveys the presence or absence of an attribute. This makes it easier to understand indexing operations. For example, say you have data on a large sample of trucks and you want to determine the characteristics of different truck weight classes. You define two boolean vectors to identify whether each truck is heavy (IsHeavyTruck) or light (IsLightTruck). Then you could get the average payload value for heavy trucks with a statement such as "mean(PayloadValue[IsHeavyTruck])". Using this naming approach makes the statement easier to understand.

The intercapping scheme presented in these guidelines differs from the schemes used for other object-oriented languages, where classes are distinguished from objects by whether the first letter is capitalized. The distinction between classes and objects is not important for the R scripts that we write because they are not written in the common object-oriented language fashion. It is more important that data be distinguished from functions. In addition, the intercapping scheme presented here provides a consistent approach when new data object names are built from existing data object names. The ability to operate on objects in this way is an example of what is called "computing on the language". It is a powerful capability that can greatly simplify scripts and is used fairly often in the scripts we write. The advantage of using the intercapping approach in these guidelines is that data object names created by joining other data object names together will comply with the naming convention.
In comparison, data object names using the object-oriented language naming approach that starts with an uncapitalized letter will not.

Suffix Separators
-----------------

A suffix is separated from a prefix by either a ".", "..", or "_". Objects that have none of these are scalar values. An object that uses a "." for a suffix separator is either a vector, matrix or array. Which it is depends on the abbreviations included in the following suffix. This is explained below. An object that uses a ".." for a suffix separator is a data frame. A "_" separator indicates a list.

Denoting Dimensionality of Vectors, Matrices and Arrays
-------------------------------------------------------

The dimensionality of objects is denoted by two-letter abbreviations which stand for the dimensions. This is best explained through the use of examples. These examples will use the following abbreviations:

"Zn" stands for zones (e.g. 101, 102, 103 ...)

"Sz" stands for household size categories (e.g. 1 person, 2 persons, 3 persons, 4+ persons)

"Wk" stands for number of workers (e.g. 0 worker, 1 worker, 2 workers, 3+ workers)

"NumHh" would be a scalar value. For example, the total number of households in a model area.

"NumHh.Zn" would be a vector of the number of households by zone.

"NumHh.ZnSz" would be a matrix of the number of households by zone and household size where the rows of the matrix are the zone dimension and the columns are the size dimension.

"NumHh.ZnSzWk" would be a three-dimensional array of the number of households where the rows and columns are as in the previous example and the third dimension is the number of workers in the household.

Note that since the suffix indicates the dimensionality of the object, it is unnecessary to provide this information in the prefix. For example, "HhByZoneAndSize.ZnSz" would be a redundant name. The object name prefix should only identify the quantity of the object and not the dimensionality.
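
The relationship between these four names can be sketched in R. This is only a minimal illustration under assumed values: the zone numbers, category labels, and household counts below are made up for the example and do not come from any model. It shows how each suffix tells the reader which dimensions remain after a summation.

```r
# Illustrative data only: 3 zones x 4 size categories x 4 worker categories.
NumHh.ZnSzWk <- array(
  1:48,
  dim = c(3, 4, 4),
  dimnames = list(
    c("101", "102", "103"),
    c("1person", "2person", "3person", "4+person"),
    c("0worker", "1worker", "2worker", "3+worker")
  )
)

# Sum over workers: the result is a matrix of households by zone and size.
NumHh.ZnSz <- apply(NumHh.ZnSzWk, c(1, 2), sum)

# Sum over size and workers: the result is a vector of households by zone.
NumHh.Zn <- apply(NumHh.ZnSzWk, 1, sum)

# Sum over everything: the result is a scalar, so the name has no suffix.
NumHh <- sum(NumHh.ZnSzWk)

dim(NumHh.ZnSzWk)   # three dimensions: zone, size, workers
dim(NumHh.ZnSz)     # two dimensions: zone, size
length(NumHh.Zn)    # one dimension: zone
```

Because the prefix stays the same throughout, a reader can tell at a glance that all four objects hold the same quantity at different levels of aggregation.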
To keep object names from becoming overly long, abbreviations should ordinarily be no longer than two characters. The first character must be a capital letter and the second character must be a lower case letter. One-character abbreviations may be used if the resulting abbreviations are still readily understood and distinguishable from one another.

In addition to using an abbreviation in notation to describe dimensionality, the abbreviation should be defined as an object containing the names associated with the dimension the abbreviation represents. This facilitates initializing objects having that dimension and naming objects with that dimension. For example:

    Sz <- c("1person", "2person", "3person", "4+person")
    Wk <- c("0worker", "1worker", "2worker", "3+worker")

Creating a matrix of households by size and workers and initializing it to zero could be done as follows:

    NumHh.SzWk <- matrix(0, nrow=length(Sz), ncol=length(Wk), dimnames=list(Sz,Wk))

Care should be taken to choose abbreviations that are easy to remember. In addition, dimension abbreviations that refer to similar quantities should share the same initial letter. For example, the abbreviation Zn might be used to refer to all model zones while Zi and Ze might be used to refer to internal and external zones respectively.

There are a number of additional benefits of creating naming vectors for each of the dimension abbreviations. The previous example showed how this approach makes it easy to name the dimensions of objects that are created. The same capabilities can also be used to check whether an object completely represents a dimension, to place an object in the correct order for a dimension, and to index a dimension.
Here are examples:

*Checking Completeness*

    if(all(Zn %in% names(NumHh.Zn))) print("All zones are accounted for.")

*Ordering According to a Dimension*

    NumHh.ZnSz <- NumHh.ZnSz[Zn,]

*Indexing a Dimension*

Assume that Zn is a vector of all model zone names and that Ze is a vector of external zone names:

    Trips.ZeZe <- Trips.ZnZn[Ze,Ze]

Using naming vectors to index a dimension also facilitates the understanding of iteration in "for" loops. Here is a trivial example:

    ValueToAdd <- 1
    for(sz in Sz){
        for(wk in Wk){
            NumHh.SzWk[sz, wk] <- NumHh.SzWk[sz, wk] + ValueToAdd
            ValueToAdd <- ValueToAdd + 1
        }
    }

Note how the example uses the same abbreviations, but with uncapitalized letters, to refer to the individual indexes as the loops iterate across the Sz and Wk dimensions. This improves understandability and avoids the need to remember the size of dimensions when setting up a loop. Compare the alternative:

    ValueToAdd <- 1
    for(i in 1:4){
        for(j in 1:4){
            NumHh.SzWk[i, j] <- NumHh.SzWk[i, j] + ValueToAdd
            ValueToAdd <- ValueToAdd + 1
        }
    }

Ideally the naming vectors should be defined in one section at the beginning of a script. Even better, the names should be planned out early in the design process and a list created which contains all of the naming vectors. This list can then be attached to make all of the names available.

Sometimes a script will define objects that are useful for computing intermediate results, but are relatively temporary. The convention for temporary vectors is simply to end the name with a period. It will be assumed that this denotes a vector (e.g. Employment. indicates a vector of employment values of unspecified length). If the object has more than one dimension, then the number of dimensions can be indicated with the abbreviations 2d, 3d, 4d, etc. (e.g. Emp.2d). Alternatively, capital letters that are commonly used to denote generic variables may be used to indicate the number of dimensions. For example, a three-dimensional array might be called Emp.XYZ.
This latter form can be combined with named dimensions (e.g. Emp.ZnX). If the size of a generic dimension is small (< 10) and it is desirable to keep track of that size, a number can be included in the generic dimension abbreviation. For example, Emp.ZnX3 would indicate a matrix where the rows represent zones and there are three "undefined" columns.

If dimension abbreviations have been chosen carefully so that similar dimensions share the same initial letter, a generic notation may be used to refer to all dimensions of that type. This is particularly useful in functions to indicate that the function may be used for any of the similar types. For example, a generic set of zones might be represented by the abbreviation Zx. A function which operates on any set of zones might be written with names that use the Zx notation. This reinforces to the reader that the function is designed to work on zones and is not specific to any particular set of zones.

Denoting Dimensionality of Lists
--------------------------------

Lists are very flexible collections of data. The elements of lists can have different types of data. In addition, list elements can themselves be lists. Therefore, the naming of lists may require multiple suffixes. This is described in more detail below.

The underscore is used to indicate that an object is a list. If the elements of the list are named, then a "dimension" representing the names is indicated by an abbreviation following the underscore. For example, if the abbreviation Ur stands for a set of urban areas, then a list containing information about projects by urban area might be named Projects_Ur.

If the list contains the same form of data structure in each element (e.g. each element is a matrix of employment by zone and economic sector), then the name of the list can easily reflect the structure of the data.
Building on the example in the previous sections, NumHh_Ur.ZxSz would describe the number of households by urban area and zone and size. Each element of the list contains a matrix for an urban area. The rows of the matrices represent zones and the columns represent household size. Note that the Zx notation communicates that the row names all represent zones, but that the names are not the same for all of the list elements.

Lists can very easily contain very complex data structures. It can be very difficult to read and understand scripts that operate on complex lists. For that reason, the use of complex lists should be avoided in scripts when possible. One way to determine whether a list is complex is to write a name that represents its structure.

There are times, however, when a complex or difficult to name list is still a very useful data structure. In such cases, the list nature of an object would be indicated by a trailing underscore in the name with no following abbreviations. One such use of a list is to collect a variety of data into one data structure to avoid cluttering of the global workspace with many object names. For example, it may be useful to collect the information on a number of different variables into a list called Vars_. Then the list may be "attached" to the workspace so that the individual elements of the list may be addressed using their names.

Complex lists are also frequently used in R to contain the results of statistical analyses or to represent classes of data. When lists are used in this way, they should be carefully defined and documented. In addition, "accessor" functions should be written to pull data out of the list so that script writers and readers don't have to understand the structure of the list in detail. They only need to understand what functions to use to pull data out of the list.

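
The list naming and accessor ideas above can be sketched together in R. This is a hedged illustration rather than code from an actual model: the urban area names (UrbanA, UrbanB), the zone numbers, the household counts, and the function name sumHhByZone are all hypothetical and chosen only to follow the conventions in this guide.

```r
# Hypothetical naming vectors for urban areas and household size categories.
Ur <- c("UrbanA", "UrbanB")
Sz <- c("1person", "2person", "3person", "4+person")

# Hypothetical zone names; each urban area has its own set of zones.
ZoneNames_Ur <- list(
  UrbanA = c("101", "102", "103"),
  UrbanB = c("201", "202")
)

# A list by urban area in which every element is a zone-by-size matrix.
# Zx is used because the zone names differ from element to element.
# Every cell is set to 1 household purely for illustration.
NumHh_Ur.ZxSz <- lapply(ZoneNames_Ur, function(Zx) {
  matrix(1, nrow = length(Zx), ncol = length(Sz), dimnames = list(Zx, Sz))
})

# Hypothetical accessor function: returns households by zone for one
# urban area so that callers need not know how the list is organized.
sumHhByZone <- function(NumHh_Ur.ZxSz, ur) {
  rowSums(NumHh_Ur.ZxSz[[ur]])
}

# Iterate over urban areas using the naming vector.
for(ur in Ur){
  NumHh.Zx <- sumHhByZone(NumHh_Ur.ZxSz, ur)
  print(NumHh.Zx)
}
```

The function name starts with a verb and an uncapitalized letter, while every data object starts with a capital letter, so the two kinds of object are easy to tell apart when reading the loop.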

Denoting Dimensionality of Data Frames
--------------------------------------

Data frames are a special type of list which can be represented in tabular form and can be operated on in ways like a matrix. Since they are lists, they can also be operated on with list functions. Data frames provide a convenient means of representing a set of data records. Each row represents an observation (e.g. household or trip). The columns represent variables of different types. The double period was chosen as a suffix separator for data frames because it has some visual similarity to both a period and an underscore.

For data frames, the row dimension is much more likely to be named using this convention than the column dimension. Rows may be named because they represent a set of cases or observations. They might, for example, represent different TAZs. The columns, however, are likely to contain different types of data that may or may not be related to one another. For example, TAZ data may contain Census block group references, population, household characteristics, and employment characteristics. This data frame could be called TazData..Zn. This tells you that the object is a data frame and that the rows are named.

If there is a strong conceptual tie between the columns of a data frame then the dimension can be named, but in such cases it is likely that the data are all of the same type and should be stored in a matrix rather than a data frame. For example, if the data on employment by economic sector were extracted from the TAZ data, you could use an abbreviation representing the employment sectors. However, there would be no need to store this data in a data frame because it is all of the same type and could be stored in a matrix.

Where a data frame does not have logical row or column names, the data frame would be denoted by the double periods at the end without any following suffix. For example, a data frame of household trip data might not have any logical row names.
While each record would have a unique ID, it might not be worthwhile to name the rows with the IDs. A name like SurveyTrips.. could be used in this instance.

Script Organization and Layout
==============================

TO BE DONE