plasTeX 3.0 — A Python Framework for Processing LaTeX Documents: Python Classes

4.2.1 Python Classes

Both LaTeX commands and environments can be implemented as Python classes. plasTeX includes a base class for each one: Command for commands and Environment for environments. For the most part, these two classes behave in the same way. Both are responsible for parsing their arguments, organizing their child nodes, incrementing counters, etc., much like their LaTeX counterparts. There is also a variant of the Environment class called NoCharSubEnvironment, which temporarily turns off the character substitutions described in Section 6.3.10. The Python macro class feature set is based on common LaTeX conventions, so if the LaTeX macro you are implementing in Python uses standard LaTeX conventions, your job will be very easy. If you are doing unconventional operations, you will probably still succeed; you just might have to do a little more work.

The three most important parts of the Python macro API are: 1) the args attribute, 2) the invoke method, and 3) the digest method. When writing your own macros, these are used the most by far.

The args Attribute

The args attribute is a string attribute on the class that indicates what the arguments to the macro are. In addition to indicating the number of arguments, whether they are mandatory or optional, and what characters surround each argument as in LaTeX, the args string also gives a name to each argument and can indicate the content type of the argument (i.e. int, float, list, dictionary, string, etc.). The name given to each argument determines the key that the argument is stored under in the attributes dictionary of the class instance. Below is a simple example of a macro class.

from plasTeX import Command, Environment

class framebox(Command):
    # Raw string so that the backslash in \framebox is kept literally
    r""" \framebox[width][pos]{text} """
    args = '[ width ] [ pos ] text'

In the args string of the \framebox macro, three arguments are defined. The first two are optional and the third one is mandatory. Once each argument is parsed, it is put into the attributes dictionary under the name given in the args string. For example, the attributes dictionary of an instance of \framebox will have the keys “width”, “pos”, and “text” once it is parsed, and they can be accessed in the usual Python way.

self.attributes['width']
self.attributes['pos']
self.attributes['text']

In plasTeX, a mandatory argument is written in the args string with no grouping characters; any argument that is surrounded by grouping characters is optional. This includes arguments surrounded by parentheses (( )), square brackets ([ ]), and angle brackets (< >). This also lets you combine multiple versions of a command into one macro. For example, the \framebox command also has a form that looks like: \framebox(x_dimen,y_dimen)[pos]{text}. This leads to the Python macro class in the following code sample that encompasses both forms.

from plasTeX import Command, Environment

class framebox(Command):
    # Raw string so that the backslashes in the docstring are kept literally
    r"""

    \framebox[width][pos]{text} or
    \framebox(x_dimen,y_dimen)[pos]{text}

    """
    args = '( dimens ) [ width ] [ pos ] text'

The only thing to keep in mind is that in the second form, the pos argument is going to end up under the width key in the attributes dictionary since it is the first argument in square brackets, but this can be fixed up in the invoke method if needed. Also, if an optional argument is not present on the macro, the value of that argument in the attributes dictionary is set to None.
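The fix-up mentioned above could look roughly like the following. This is only a sketch that works on a plain dictionary instead of a real macro instance; the helper name fix_framebox_attributes is hypothetical, and in an actual macro the same logic would run inside invoke once the arguments have been parsed.

```python
def fix_framebox_attributes(attributes):
    """Hypothetical helper: move a misfiled 'pos' out of the 'width' slot.

    When the parenthesized dimensions are present, the first argument in
    square brackets is really 'pos', even though it was stored as 'width'.
    """
    fixed = dict(attributes)
    if fixed.get('dimens') is not None and fixed.get('pos') is None:
        fixed['pos'] = fixed.get('width')
        fixed['width'] = None
    return fixed


# Second form: (x_dimen,y_dimen)[pos]{text}, as it would look after parsing
second_form = {'dimens': ['3cm', '1cm'], 'width': 'c', 'pos': None, 'text': 'hi'}
# First form: [width][pos]{text}, which needs no change
first_form = {'dimens': None, 'width': '3cm', 'pos': 'c', 'text': 'hi'}

fixed_second = fix_framebox_attributes(second_form)
fixed_first = fix_framebox_attributes(first_form)
```

The first form is left untouched because the dimens key is None whenever the parenthesized argument is absent.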

As mentioned earlier, it is also possible to convert arguments to data types other than the default (a document fragment). A list of the available types is shown in the table below.

Name	Purpose
str	expands all macros then sets the value of the argument in the attributes dictionary to the string content of the argument
chr	same as ‘str’
char	same as ‘str’
cs	sets the attribute to an unexpanded control sequence
label	expands all macros, converts the result to a string, then sets the current label to the object that is in the currentlabel attribute of the document context. Generally, an object is put into the currentlabel attribute if it incremented a counter when it was invoked. The value stored in the attributes dictionary is the string value of the argument.
id	same as ‘label’
idref	expands all macros, converts the result to a string, retrieves the object that was labeled by that value, then adds the labeled object to the idref dictionary under the name of the argument. This type of argument is used in commands like \ref that must reference other objects. The nice thing about ‘idref’ is that it gives you a reference to the object itself which you can then use to retrieve any type of information from it such as the reference value, title, etc. The value stored in the attributes dictionary is the string value of the argument.
ref	same as ‘idref’
nox	just parses the argument, but doesn’t expand the macros
list	converts the argument to a Python list. By default, the list item separator is a comma (,). You can change the item separator in the args string by appending a set of parentheses surrounding the separator character immediately after ‘list’. For example, to specify a semi-colon separated list for an argument called “foo” you would use the args string: “foo:list(;)”. It is also possible to cast the type of each item by appending another colon and the data type from this table that you want each item to be. However, you are limited to one data type for every item in the list.
dict	converts the argument to a Python dictionary. This is commonly used by arguments set up using LaTeX’s ‘keyval’ package. By default, key/value pairs are separated by commas, although this character can be changed in the same way as the delimiter in the ‘list’ type. You can also cast each value of the dictionary using the same method as the ‘list’ type. In all cases, keys are converted to strings.
dimen	reads a dimension and returns an instance of dimen
dimension	same as ‘dimen’
length	same as ‘dimen’
number	reads an integer and returns a Python integer
count	same as ‘number’
int	same as ‘number’
float	reads a decimal value and returns a Python float
double	same as ‘float’
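To make the ‘list’ and ‘dict’ conversions more concrete, here is a rough plain-Python sketch of the kind of values they produce. The function names convert_list and convert_dict are made up for this illustration; plasTeX itself works on TeX tokens and expands macros first, which this sketch does not attempt.

```python
def convert_list(text, separator=',', cast=str):
    """Split text on the separator and cast every non-empty item."""
    return [cast(item.strip()) for item in text.split(separator) if item.strip()]


def convert_dict(text, separator=',', cast=str):
    """Split key=value pairs; keys stay strings, values are cast."""
    result = {}
    for pair in text.split(separator):
        if not pair.strip():
            continue
        key, _, value = pair.partition('=')
        result[key.strip()] = cast(value.strip())
    return result


# Roughly what 'foo:list(;):int' would yield for the argument "1; 2; 3"
numbers = convert_list('1; 2; 3', separator=';', cast=int)

# Roughly what a keyval-style argument cast to int would yield
options = convert_dict('width=3, height=4', cast=int)
```

As in the table above, a single cast applies to every item of the list (or every value of the dictionary).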

There are also several argument types used for lower-level routines. These don’t parse typical LaTeX arguments; they are used for the somewhat more free-form TeX arguments.

Name	Purpose
Dimen	reads a TeX dimension and returns an instance of dimen
Length	same as ‘Dimen’
Dimension	same as ‘Dimen’
MuDimen	reads a TeX mu-dimension and returns an instance of mudimen
MuLength	same as ‘MuDimen’
Glue	reads a TeX glue parameter and returns an instance of glue
Skip	same as ‘Glue’
Number	reads a TeX integer parameter and returns a Python integer
Int	same as ‘Number’
Integer	same as ‘Number’
Token	reads an unexpanded token
Tok	same as ‘Token’
XToken	reads an expanded token
XTok	same as ‘XToken’
Args	reads tokens up to the first begin group (i.e. {)

To use one of the data types, simply append a colon (:) and the data type name to the attribute name in the args string. Going back to the \framebox example, the argument in parentheses would be better represented as a list of dimensions. The width parameter is also a dimension, and the pos parameter is a string.

from plasTeX import Command, Environment

class framebox(Command):
    # Raw string so that the backslashes in the docstring are kept literally
    r"""

    \framebox[width][pos]{text} or
    \framebox(x_dimen,y_dimen)[pos]{text}

    """
    args = '( dimens:list:dimen ) [ width:dimen ] [ pos:chr ] text'
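If it helps to see how such an args string breaks down, the sketch below splits one into a description per argument. It is an illustration only, not the parser plasTeX uses: the function describe_args and the dictionary layout are invented here, and it assumes the whitespace-separated style used in the examples above.

```python
# Closing delimiter for each opening delimiter of an optional argument.
DELIMITERS = {'[': ']', '(': ')', '<': '>'}


def describe_args(args):
    """Return one dict per argument named in an args string."""
    tokens = args.split()
    specs = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in DELIMITERS:
            # Optional argument written as: open-delimiter name[:type] close-delimiter
            if i + 2 >= len(tokens) or tokens[i + 2] != DELIMITERS[tok]:
                raise ValueError('unbalanced delimiter %r in %r' % (tok, args))
            spec = tokens[i + 1]
            optional = True
            delimiters = tok + DELIMITERS[tok]
            i += 3
        else:
            # No grouping characters: a mandatory argument
            spec = tok
            optional = False
            delimiters = '{}'
            i += 1
        # Everything after the first colon is the type (e.g. 'list:dimen')
        name, _, type_name = spec.partition(':')
        specs.append({'name': name, 'type': type_name or None,
                      'optional': optional, 'delimiters': delimiters})
    return specs


specs = describe_args('( dimens:list:dimen ) [ width:dimen ] [ pos:chr ] text')
names = [spec['name'] for spec in specs]
```

The names collected this way are the same keys you would later look up in the attributes dictionary.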

The invoke Method

The invoke method is responsible for creating a new document context, parsing the macro arguments, and incrementing counters. In most cases, the default implementation will work just fine, but you may want to do some extra processing of the macro arguments or counters before letting the parsing of the document proceed. There are actually several methods in the API that are called within the scope of the invoke method: preParse, preArgument, postArgument, and postParse.

The order of execution is quite simple. Before any arguments have been parsed, the preParse method is called. The preArgument and postArgument methods are called before and after each argument, respectively. Then, after all arguments have been parsed, the postParse method is called. The default implementations of these methods handle the stepping of counters and setting the current labeled item in the document. By default, macros that have been “starred” (i.e. have a ‘*’ before the arguments) do not increment the counter. You can override this behavior in one of these methods if you prefer.
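The call order described above can be pictured with a small stand-in class. This is a toy model, not plasTeX code: the class ToyMacro is hypothetical, the hook signatures are simplified (the real methods take additional parameters such as the TeX processor), and no actual parsing happens.

```python
class ToyMacro:
    """Stand-in that only records the order in which the hooks run."""

    def __init__(self, argument_names):
        self.argument_names = argument_names
        self.calls = []

    def preParse(self):
        self.calls.append('preParse')

    def preArgument(self, name):
        self.calls.append('preArgument:' + name)

    def postArgument(self, name):
        self.calls.append('postArgument:' + name)

    def postParse(self):
        self.calls.append('postParse')

    def invoke(self):
        self.preParse()
        for name in self.argument_names:
            self.preArgument(name)
            # The argument itself would be parsed at this point
            self.postArgument(name)
        self.postParse()
        return self.calls


calls = ToyMacro(['width', 'text']).invoke()
```

Overriding any one of the four hooks in a subclass changes only that step, which is why they are usually a better place for small adjustments than invoke itself.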

The most common reason for overriding the invoke method is to post-process the arguments in the attributes dictionary, or add information to the instance. For example, the \color command in LaTeX’s color package could convert the LaTeX color to the correct CSS format and add it to the CSS style object.

from plasTeX import Command, Environment

def latex2htmlcolor(arg):
    if ',' in arg:
        red, green, blue = [float(x) for x in arg.split(',')]
        red = min(int(red * 255), 255)
        green = min(int(green * 255), 255)
        blue = min(int(blue * 255), 255)
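    else:
        try:
            # A single number is treated as a gray level, scaled like rgb
            gray = min(int(float(arg) * 255), 255)
        except ValueError:
            # Anything else is assumed to be a color name; pass it through
            return arg.strip()
        red = green = blue = gray
    return '#%.2X%.2X%.2X' % (red, green, blue)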

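class color(Environment):
    args = 'color:str'
    def invoke(self, tex):
        # Pass self explicitly when calling the base class implementation
        Environment.invoke(self, tex)
        # The parsed arguments are stored in the attributes dictionary
        self.style['color'] = latex2htmlcolor(self.attributes['color'])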

While simple things like attribute post-processing are the most common use of the invoke method, you can also do very advanced things such as changing category codes or iterating over the tokens in the TeX processor directly, as the verbatim environment does.

One other feature of the invoke method that may be of interest is the return value. Most invoke method implementations do not return anything (or return None). In this case, the macro instance itself is sent to the output stream. However, you can also return a list of tokens. If a list of tokens is returned, instead of the macro instance, those tokens are inserted into the output stream. This is useful if you don’t want the macro instance to be part of the output stream or document. In this case, you can simply return an empty list.
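The return-value rule can be summarized in a few lines. The following is a simplified model of the behavior described above, not the actual plasTeX processing loop; the classes Visible and Hidden and the function emit are hypothetical.

```python
class Visible:
    def invoke(self):
        # Nothing returned: the instance itself goes to the output stream
        return None


class Hidden:
    def invoke(self):
        # Empty token list: nothing at all goes to the output stream
        return []


def emit(macro):
    """Return what would be placed in the output stream for this macro."""
    result = macro.invoke()
    if result is None:
        return [macro]
    return list(result)


visible = Visible()
visible_output = emit(visible)
hidden_output = emit(Hidden())
```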

The digest Method

The digest method is responsible for converting the output stream into the final document structure. For commands, this generally doesn’t mean anything since they just consist of arguments which have already been parsed. Environments, on the other hand, have a beginning and an ending which surround tokens that belong to that environment. In most cases, the tokens between the \begin and \end need to be absorbed into the childNodes list.

The default implementation of the digest method should work for most macros, but there are instances where you may want to do some extra processing on the document structure. For example, the \caption command within figures and tables uses the digest method to populate the enclosing figure/table’s caption attribute.

from plasTeX import Command, Environment

class Caption(Command):
    args = '[ toc ] self'

    def digest(self, tokens):
        res = Command.digest(self, tokens)

        # Look for the figure environment that we belong to
        node = self.parentNode
        while node is not None and not isinstance(node, figure):
            node = node.parentNode

        # If the figure was found, populate the caption attribute
        if isinstance(node, figure):
            node.caption = self

        return res

class figure(Environment):
    args = '[ loc:str ]'
    caption = None
    class caption_(Caption):
        macroName = 'caption'
        counter = 'figure'

More advanced uses of the digest method might be to construct more complex document structures. For example, tabular and array structures in a document get converted from a simple list of tokens to complex structures with lots of style information added (see section 3.3.3). One simple example of a digest that does something extra is shown below. It looks for the first node with the name “item” then bails out.

from plasTeX import Command, Environment

class toitem(Command):
    def digest(self, tokens):
        """ Throw away everything up to the first 'item' token """
        for tok in tokens:
            if tok.nodeName == 'item':
                # Put the item back into the stream
                tokens.push(tok)
                break

One of the more advanced uses of the digest method is on the sectioning commands: \section, \subsection, etc. The digest method on sections absorbs tokens based on the level attribute, which indicates the hierarchical level of the node. When digested, each section absorbs all tokens until it reaches a section that has a level that is equal to or higher than its own level. This creates the overall document structure as discussed in section 3.
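The absorbing rule for sections can be sketched without any plasTeX machinery. The function below is a simplified stand-in that works on (level, title) pairs instead of real tokens; the name nest_sections and the dictionary layout are assumptions made for this illustration. Note that a "higher" section in the hierarchy has a numerically smaller level.

```python
def nest_sections(flat):
    """Group a flat list of (level, title) pairs into a tree of sections."""
    # The root sits above every real level, so it is never closed
    root = {'title': None, 'level': float('-inf'), 'children': []}
    open_sections = [root]
    for level, title in flat:
        node = {'title': title, 'level': level, 'children': []}
        # A section stops absorbing when it meets a section at an equal
        # or higher hierarchical level (numerically less than or equal)
        while open_sections[-1]['level'] >= level:
            open_sections.pop()
        open_sections[-1]['children'].append(node)
        open_sections.append(node)
    return root['children']


flat = [(1, 'A'), (2, 'A.1'), (2, 'A.2'), (1, 'B')]
tree = nest_sections(flat)
top_titles = [node['title'] for node in tree]
sub_titles = [node['title'] for node in tree[0]['children']]
```

Here the two level-2 entries end up as children of the first level-1 section, and the second level-1 section closes it.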

Other Nifty Methods and Attributes

There are many other attributes and methods on macros that can be used to affect their behavior. For a full listing, see the API documentation in section 6.1. Below are descriptions of some of the more commonly used attributes and methods.

The level attribute

The level attribute is an integer that indicates the hierarchical level of the node in the output document structure. The values of this attribute are taken from LaTeX: \part is -1, \chapter is 0, \section is 1, \subsection is 2, etc. To create your own sectioning commands, you can either subclass one of the existing sectioning macros, or simply set the level attribute of your macro to the appropriate number.

The macroName attribute

The macroName attribute is used when you are creating a LaTeX macro whose name is not a legal Python class name. For example, the macro \@ifundefined has a ‘@’ in the name which isn’t legal in a Python class name. In this case, you could define the macro as shown below.

class ifundefined_(Command):
    macroName = '@ifundefined'

The counter attribute

The counter attribute associates a counter with the macro class. It is simply a string that contains the name of the counter. Each time that an instance of the macro class is invoked, the counter is incremented (unless the macro has a ‘*’ argument).

The ref attribute

The ref attribute contains the value normally returned by the \ref command.

The title attribute

The title attribute retrieves the “title” attribute from the attributes dictionary. This attribute is also overridable.

The fullTitle attribute

The same as the title attribute, but also includes the counter value at the beginning.

The tocEntry attribute

The tocEntry attribute retrieves the “toc” attribute from the attributes dictionary. This attribute is also overridable.

The fullTocEntry attribute

The same as the tocEntry attribute, but also includes the counter value at the beginning.

The style attribute

The style attribute is a CSS style object. Essentially, this is just a dictionary where the key is the CSS property name and the value is the CSS property value. It has an attribute called inline which contains an inline version of the CSS properties for use in the style= attribute of HTML elements.
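As a rough picture of what such a style object does, the sketch below builds a dictionary with an inline property. The class ToyStyle is hypothetical and only mimics the behavior described above; the exact string format produced by the real style object may differ.

```python
class ToyStyle(dict):
    """Dictionary of CSS properties with an inline string form."""

    @property
    def inline(self):
        # Join the properties in insertion order as "name:value" pairs
        return '; '.join('%s:%s' % (name, value) for name, value in self.items())


style = ToyStyle()
style['color'] = '#FF0000'
style['font-weight'] = 'bold'
inline_css = style.inline
```

The resulting string is the kind of value a renderer could place in the style= attribute of an HTML element.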

The id attribute

This attribute contains a unique ID for the object. If the object was labeled by a \label command, the ID for the object will be that label; otherwise, an ID is generated.

The source attribute

The source attribute contains the LaTeX source representation of the node and all of its contents.

The currentSection attribute

The currentSection attribute contains the section that the node belongs to.

The expand method

The expand method is a thin wrapper around the invoke method. It simply invokes the macro and returns the result of expanding all of the tokens. Unlike invoke, you will always get the expanded node (or nodes); you will not get a None return value.

The paragraphs method

The paragraphs method does the final processing of paragraphs in a node’s child nodes. It makes sure that all content is wrapped within paragraph nodes. This method is generally called from the digest method.