OCaml CAS

Oct. 12, 2021

Try It First

Every language has its perks. C++ has move constructors, Python has a library for everything, JavaScript sorts numbers lexicographically (duh). Well, my favorite parts of OCaml are its type system and its pattern-matching capabilities.

Don't you hate it when you see a C function like this:

int socket(int, int, int)

And now contrast that with OCaml's socket declaration:

val Unix.socket : Unix.socket_domain -> Unix.socket_type -> int -> Unix.file_descr

Doesn't the latter read much more clearly? Also notice how (if you know what a socket domain and socket type are) you don't need any documentation to understand this code.

1. Let's Start Parsing

Well, not yet

Before we start parsing, let's lay out our plan; it surprisingly (or maybe not) takes a bit of work to go from "1 + 1" to "2". This is because math notation is sometimes tricky and other times just downright frustrating. There are tons of math expression parsing algorithms out there; one of the most popular is the shunting-yard algorithm. However, we'll take a different approach that's a little more generalizable (at a cost).

The first step will be a lexical pass that tokenizes our raw input. Essentially, we convert the input, which is a sequence of characters, into a sequence of tokens. A token can be thought of as a "unit" of our input, which we'll use to build something with more structure in the next step.

An example would be converting something like 1 + 23 into [Token("1"), Token("+"), Token("23")].

Next comes the syntactic pass, which takes these tokens and constructs a tree structure called an abstract syntax tree. This structure gives the expression meaning and makes processing much easier.

Continuing the above example, our syntax tree will look like
       Token("+")
        /    \
Token("1")  Token("23")
(You may need to make sure your window is wide enough so the tree is formatted properly)

Finally, we can convert this syntax tree into an expression tree. The main benefit is that we're ultimately interested in performing mathematical operations, not syntactical ones.

After the conversion, our tree looks like
  +
 / \
1  23
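
In fact, we can summarize the whole pipeline as (pseudo) type signatures, in the same style as the Unix.socket declaration above. The Tokenizer and Parser parts match the code we'll write below; the Expression module name and the exact entry-point names are just my shorthand for this overview, and error handling is slightly simplified (errors carry the input position where things went wrong):

val Tokenizer.tokenize : string -> (Tokenizer.t list, int) Result.t
val Parser.parse : Tokenizer.t list -> Parser.t option
val Expression.of_parse_tree : Parser.t -> (Expression.t, int) Result.t
val Expression.to_float : Expression.t -> float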

2. Lexical Analysis

Tokens, tokens, tokens

The first step in parsing input is tokenization. Before we dive into any code, let's define what a token is. We can classify tokens into the following types:

type token_type =
    | Function
    | Constant
    | Variable
    | LeftParenthesis
    | RightParenthesis
    | Comma
    | Equals
    | Plus
    | Minus
    | Multiply
    | Divide
    | Exponentiate
    | Negate

The token types are all pretty self-explanatory, as they describe exactly what characters they represent. However, note that some of them depend on the context of the surrounding characters. For example, a "-" can be either a Minus or a Negate depending on what comes before it. Thus, let's define a function that'll tell us which.

let minus_or_negate prev_token =
  match prev_token with
  | None -> Negate
  | Some token ->
    (match token with
    | Function | Constant | Variable -> Minus
    | LeftParenthesis -> Negate
    | RightParenthesis -> Minus
    | Comma | Equals -> Negate
    | Plus | Minus | Multiply | Divide | Exponentiate | Negate -> Negate)
;;
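
For example, a quick sanity check:

(* "-" after a value or a closing parenthesis subtracts, while "-" at the
   start of the input or right after another operator negates. *)
let _examples =
  [ minus_or_negate (Some RightParenthesis) (* Minus,  as in "(1 + 2) - 3" *)
  ; minus_or_negate None                    (* Negate, as in "-3"          *)
  ; minus_or_negate (Some Multiply)         (* Negate, as in "2 * -3"      *)
  ]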

Now, using the previous token (if it exists), we can figure out if the next "-" is a Minus or Negate.

Next, since we want this calculator to follow conventional math grammar as closely as possible, we want to allow implicit multiplication. This means something like \(xyz\) should be parsed as \(x \times y \times z\). However, we must be careful, because a function application like \(\sin(\pi)\) should not be parsed as \(s \times i \times n \times \pi\). Thus, we must differentiate between functions, constants, and variables.

The simplest of the three to detect are numbers. Simply because, well, they're numbers. They begin with either a digit or a decimal point. We can write a simple function to extract a number from the head of a char list and return it along with the tail:

let tokenize_number_value =
  let rec tok_numbers pos seen_decimal = function
    | ('0' .. '9' as digit) :: rem_chars ->
      (match tok_numbers (pos + 1) seen_decimal rem_chars with
      | None -> None
      | Some (parsed_number, rem) -> Some (digit :: parsed_number, rem))
    | '.' :: rem_chars ->
      if seen_decimal
      then None
      else (
        match tok_numbers (pos + 1) true rem_chars with
        | None -> None
        | Some (parsed_number, rem) -> Some ('.' :: parsed_number, rem))
    | chars -> Some ([], chars)
  in
  tok_numbers 0 false
;;
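
For example, here's how I'd expect it to behave (Core's String.to_list gives us the char list):

let _ = tokenize_number_value (String.to_list "12.5+3")
(* Some (['1'; '2'; '.'; '5'], ['+'; '3']) -- the number plus the untouched rest *)

let _ = tokenize_number_value (String.to_list "1.2.3")
(* None -- a second decimal point invalidates the whole number *)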

Note that we actually don't have to care about any leading '-' because the Negate token will match any dashes first.

Next come functions and variables. The way we'll approach this is to (for now) treat every contiguous string of letters as a function, even if it's really meant to be several individual variables.

let tokenize_char_value =
  let rec tok_chars = function
    | (('a' .. 'z' | 'A' .. 'Z') as ch) :: rem_chars ->
      let parsed_chars, rem = tok_chars rem_chars in
      ch :: parsed_chars, rem
    | chars -> [], chars
  in
  tok_chars
;;
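
This one behaves analogously:

let _ = tokenize_char_value (String.to_list "sin(x)")
(* (['s'; 'i'; 'n'], ['('; 'x'; ')']) *)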

Then, we'll define a resolve_keywords function that will check each Function token to see if it really exists as a function, and if not, break it up into a list of single-character variables.

let rec resolve_keywords repl_state = function
  | [] -> []
  | tok :: rem_toks ->
    let rest = resolve_keywords repl_state rem_toks in
    (match tok with
    | { token = Function; pos; value } ->
      (match Repl_state.keyword_type repl_state value with
      | Some Repl_state.Constant -> { token = Constant; pos; value } :: rest
      | Some Repl_state.Function -> { token = Function; pos; value } :: rest
      | None ->
        let split =
          List.mapi
            ~f:(fun i a -> { token = Variable; pos = pos + i; value = Char.to_string a })
            (String.to_list value)
        in
        split @ rest)
    | _ -> tok :: rest)
;;

Note the use of a Repl_state module. This is what we'll use to keep track of all the defined constants and functions. More on this will be covered later.
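
We won't define Repl_state in this post, but from how it's used above (and in tokenize below), its interface has to look roughly like the following. This is just my reconstruction for readability, not the actual module:

module type Repl_state_intf = sig
  type t

  (* The kind of keyword a name resolves to. *)
  type keyword =
    | Constant (* e.g. "pi" *)
    | Function (* e.g. "sin" *)

  (* A fresh state containing the built-in constants and functions. *)
  val init : t

  (* Look up a name; None means it isn't a keyword and should be
     split into single-character variables. *)
  val keyword_type : t -> string -> keyword option
end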

Finally, we have the pieces to perform tokenization. We'll do this by iteratively chopping off the head of the input string, parsing it as a token, and accumulating the tokens into a list. The "chopping" function looks like this:

let get_next_token prev_token chars =
  match chars with
  | '(' :: rem_chars -> Some ((LeftParenthesis, "("), rem_chars)
  | ')' :: rem_chars -> Some ((RightParenthesis, ")"), rem_chars)
  | ',' :: rem_chars -> Some ((Comma, ","), rem_chars)
  | '=' :: rem_chars -> Some ((Equals, "="), rem_chars)
  | '+' :: rem_chars -> Some ((Plus, "+"), rem_chars)
  | '-' :: rem_chars -> Some ((minus_or_negate prev_token, "-"), rem_chars)
  | '*' :: '*' :: rem_chars -> Some ((Exponentiate, "**"), rem_chars)
  | '*' :: rem_chars -> Some ((Multiply, "*"), rem_chars)
  | '/' :: rem_chars -> Some ((Divide, "/"), rem_chars)
  | '^' :: rem_chars -> Some ((Exponentiate, "^"), rem_chars)
  | ('a' .. 'z' | 'A' .. 'Z') :: _ ->
    let parsed_chars, rem_chars = tokenize_char_value chars in
    let parsed_str = String.of_char_list parsed_chars in
    Some ((Function, parsed_str), rem_chars)
  | ('0' .. '9' | '.') :: _ ->
    (match tokenize_number_value chars with
    | None -> None
    | Some (parsed_digits, rem_chars) ->
      let parsed_num = String.of_char_list parsed_digits in
      Some ((Constant, parsed_num), rem_chars))
  | _ -> None
;;

And the accumulating function looks like this:

let tokenize ?(repl_state = Repl_state.init) input_str =
  let rec state_machine prev_token pos = function
    | [] -> Result.Ok []
    | (' ' | '\t') :: chars -> state_machine prev_token (pos + 1) chars
    | chars ->
      (match get_next_token prev_token chars with
      | None -> Result.Error pos
      | Some ((token, value), rem_chars) ->
        (match state_machine (Some token) (pos + String.length value) rem_chars with
        | Result.Error pos -> Result.Error pos
        | Result.Ok lst -> Result.Ok ({ token; pos; value } :: lst)))
  in
  let open Result in
  String.to_list input_str |> state_machine None 0 >>| resolve_keywords repl_state
;;

Note that we skip any whitespace in the input, since it doesn't convey any meaningful information.
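
Putting it all together, tokenizing a small input should give something like this (pos is the character offset into the input):

let _ = tokenize "1 + 23"
(* Ok
   [ { token = Constant; pos = 0; value = "1" }
   ; { token = Plus;     pos = 2; value = "+" }
   ; { token = Constant; pos = 4; value = "23" }
   ] *)

let _ = tokenize "1 $ 2"
(* Error 2 -- the position of the character we couldn't tokenize *)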

This wraps up tokenization, and next, we'll go over parsing.

3. Parsing

Tokens all the way down

Parsing is a fairly tricky step because it requires us to create some "structure" out of a flat list of tokens. This structure is called a syntax tree, and it lets us figure out the order in which things should be evaluated. For us, it's defined like this:

type t =
  | Leaf of Tokenizer.t
  | PrefixOp of
      { token : Tokenizer.t
      ; child : t
      }
  | InfixOp of
      { token : Tokenizer.t
      ; left : t
      ; right : t
      }
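
For instance, the tree we drew in section 1 for "1 + 23" is this value, with the tokens written out by hand:

let _example : t =
  InfixOp
    { token = Tokenizer.{ token = Plus; pos = 2; value = "+" }
    ; left = Leaf Tokenizer.{ token = Constant; pos = 0; value = "1" }
    ; right = Leaf Tokenizer.{ token = Constant; pos = 4; value = "23" }
    }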

Parsing something into a syntax tree is usually done with the help of a grammar definition, which not only rigorously defines the language, but also tells us how to construct the tree.

Let's start with a set of symbols that a sequence of tokens can match

type expr_type =
  | ExprBase
  | ExprEq
  | ExprComma
  | ExprPlusMinus
  | ExprMultDiv
  | ExprNegate
  | ExprExp
  | ExprParenthesis
  | ExprFunctionApply
  | ExprValue
  | ExprFunction

Every valid math expression (when tokenized) should match ExprBase. We'll define rules for expanding each of these symbols, recursively working down to the terminal symbols ExprValue and ExprFunction. A terminal symbol matches exactly one token: constant and variable tokens match ExprValue, while function tokens match ExprFunction.

At an abstract level, we can think of defining the rules like so:
ExprBase -> ExprEq
ExprEq -> ExprComma "=" ExprEq
ExprEq -> ExprComma
ExprComma -> ExprComma "," ExprPlusMinus
ExprComma -> ExprPlusMinus
ExprPlusMinus -> ExprPlusMinus "+"/"-" ExprMultDiv
ExprPlusMinus -> ExprMultDiv
...
You can see how each expression (on the left-hand side of the arrow) can be "expanded" into a rule (the right-hand side of the arrow). Thus, if we want to match a sequence of tokens against an ExprEq expression, we first try to match it with something of the form ExprComma, followed by an "=" token, followed by an ExprEq. If that fails, we try to match it against an ExprComma expression instead. Recursively doing this should let us match every valid math expression with one of these rules, which tells us how to create the syntax tree.
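
To make this concrete, here's (roughly) how "1 + 2 * 3" gets matched, eliding the expansions that fail along the way:

ExprBase -> ExprEq -> ExprComma -> ExprPlusMinus        (fall through the promotions)
ExprPlusMinus -> ExprPlusMinus "+" ExprMultDiv          (split into "1" and "2 * 3")
  "1"     : ExprPlusMinus -> ... -> ExprValue           (falls through to a terminal)
  "2 * 3" : ExprMultDiv -> ExprMultDiv "*" ExprNegate   (split into "2" and "3")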

There are quite a few different substitution types we can perform, so let's define them neatly in a variant type:

type substitution =
  | Promotion of expr_type
  | InfixSubstitution of (expr_type * Tokenizer.token_type * expr_type)
  | PrefixSubstitution of (Tokenizer.token_type * expr_type)
  | ImplicitMultiply of (expr_type * expr_type)
  | ParenthesesReduce of expr_type
  | TerminalValue
  | TerminalFunction

We could have defined each substitution as a plain list of expressions, but this is a bit neater because it also tells us how to construct the syntax tree. For example, InfixSubstitution (ExprComma, Tokenizer.Equals, ExprEq) encodes the rule ExprEq -> ExprComma "=" ExprEq and says to build an InfixOp node from the two halves.

Now, with the substitutions available to us, let's define the substitution rules for each expression:

let get_substitutions = function
  | ExprBase -> [ Promotion ExprEq ]
  | ExprEq ->
    [ Promotion ExprComma; InfixSubstitution (ExprComma, Tokenizer.Equals, ExprEq) ]
  | ExprComma ->
    [ Promotion ExprPlusMinus
    ; InfixSubstitution (ExprComma, Tokenizer.Comma, ExprPlusMinus)
    ]
  | ExprPlusMinus ->
    [ Promotion ExprMultDiv
    ; InfixSubstitution (ExprPlusMinus, Tokenizer.Plus, ExprMultDiv)
    ; InfixSubstitution (ExprPlusMinus, Tokenizer.Minus, ExprMultDiv)
    ]
  | ExprMultDiv ->
    [ Promotion ExprNegate
    ; InfixSubstitution (ExprMultDiv, Tokenizer.Multiply, ExprNegate)
    ; InfixSubstitution (ExprMultDiv, Tokenizer.Divide, ExprNegate)
    ; ImplicitMultiply (ExprMultDiv, ExprNegate)
    ]
  | ExprNegate ->
    [ Promotion ExprExp
    ; PrefixSubstitution (Tokenizer.Negate, ExprNegate)
    ; PrefixSubstitution (Tokenizer.Negate, ExprParenthesis)
    ]
  | ExprExp ->
    [ Promotion ExprFunctionApply
    ; Promotion ExprParenthesis
    ; InfixSubstitution (ExprFunctionApply, Tokenizer.Exponentiate, ExprExp)
    ]
  | ExprFunctionApply ->
    [ Promotion ExprValue
    ; Promotion ExprFunction
    ; Promotion ExprParenthesis
    ; PrefixSubstitution (Tokenizer.Function, ExprParenthesis)
    ]
  | ExprParenthesis -> [ ParenthesesReduce ExprBase ]
  | ExprValue -> [ TerminalValue ]
  | ExprFunction -> [ TerminalFunction ]
;;

This is a bit lengthy, but if you look at each set of rules individually, it's pretty manageable to figure out. Most of these substitutions are InfixSubstitutions, simply because most math notation is infix.

Now, for the actual parsing, we'll use two mutually recursive functions. One tries to match a list of tokens against an expression symbol, while the other tries to match the tokens against a single substitution rule. The former iterates through all the substitutions available for the given expression; the latter divides the tokens into pieces and recursively calls the first function on each part of the substitution. Together, they look like this:

let rec parse_as_expr expr tokens =
  let rec match_substitutions = function
    | [] -> None
    | sub :: rem_subs ->
      (match perform_sub sub tokens with
      | Some tree -> Some tree
      | None -> match_substitutions rem_subs)
  in
  get_substitutions expr |> match_substitutions

and perform_sub = function
  | Promotion expr -> parse_as_expr expr
  | InfixSubstitution (expr_left, infix_token, expr_right) ->
    let rec scan rev_left right_tokens =
      match right_tokens with
      | [] -> None
      | token :: rem_right_tokens ->
        let go_next () = scan (token :: rev_left) rem_right_tokens in
        if Tokenizer.equal_token_type Tokenizer.(token.token) infix_token
        then (
          match parse_as_expr expr_right rem_right_tokens with
          | None -> go_next ()
          | Some right ->
            (match parse_as_expr expr_left (List.rev rev_left) with
            | None -> go_next ()
            | Some left -> Some (InfixOp { token; left; right })))
        else go_next ()
    in
    scan []
  | PrefixSubstitution (prefix_token, expr_child) ->
    (function
    | [] -> None
    | token :: rem_tokens ->
      if Tokenizer.equal_token_type Tokenizer.(token.token) prefix_token
      then (
        match parse_as_expr expr_child rem_tokens with
        | None -> None
        | Some child -> Some (PrefixOp { token; child }))
      else None)
  | ImplicitMultiply (expr_left, expr_right) ->
    let rec scan rev_left right_tokens =
      match rev_left, right_tokens with
      | [], [] -> None
      | [], token :: rem_right_tokens -> scan [ token ] rem_right_tokens
      | rev_left, token :: rem_right_tokens ->
        let go_next () = scan (token :: rev_left) rem_right_tokens in
        (match parse_as_expr expr_right right_tokens with
        | None -> go_next ()
        | Some right ->
          let left_tokens = List.rev rev_left in
          (match parse_as_expr expr_left left_tokens with
          | None -> go_next ()
          | Some left ->
            Some
              (InfixOp
                 { token = Tokenizer.{ token = Tokenizer.Multiply; pos = -1; value = "" }
                 ; left
                 ; right
                 })))
      | _, [] -> None
    in
    scan []
  | ParenthesesReduce expr ->
    (function
    | [] -> None
    | tok :: rem ->
      if Tokenizer.equal_token_type Tokenizer.(tok.token) Tokenizer.LeftParenthesis
      then (
        let rev_toks = List.rev rem in
        match rev_toks with
        | [] -> None
        | tok :: mid ->
          if Tokenizer.equal_token_type Tokenizer.(tok.token) Tokenizer.RightParenthesis
          then parse_as_expr expr (List.rev mid)
          else None)
      else None)
  | TerminalValue ->
    (function
    | [ token ] ->
      (match Tokenizer.(token.token) with
      | Tokenizer.Constant | Tokenizer.Variable -> Some (Leaf token)
      | _ -> None)
    | _ -> None)
  | TerminalFunction ->
    (function
    | [ token ] ->
      (match Tokenizer.(token.token) with
      | Tokenizer.Function -> Some (Leaf token)
      | _ -> None)
    | _ -> None)
;;

Again, the second function looks very long, but it's really just one big match that tries to perform a given type of substitution (returning None if it fails).
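
With these two functions in hand, the top-level entry point is just a matter of matching the full token list against the start symbol (assuming we expose it as parse):

(* Parse a complete token list by matching it against the start symbol. *)
let parse tokens = parse_as_expr ExprBase tokens

(* For the tokens of "1 + 23", this should give the tree from section 1:
   Some (InfixOp { token = <"+">; left = Leaf <"1">; right = Leaf <"23"> }) *)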

4. Expressions

Almost home

Now that we have a syntax tree, we can create an expression tree that we can recursively evaluate to compute some math expression.

type t =
  | Node of string
  | Apply of (string * t)
  | Comma of (t * t)
  | Equate of (t * t)
  | Add of (t * t)
  | Subtract of (t * t)
  | Multiply of (t * t)
  | Divide of (t * t)
  | Exponentiate of (t * t)
  | Negate of t

It's a pretty straightforward definition because at this point, we've converted the data into a highly organized structure.

The conversion from a syntax tree is also pretty straightforward because both trees have the same structure. We're just mapping nodes of one tree to nodes of the other.

let node x = Node x
let apply fn x = Apply (fn, x)
let ( @@| ) fn x = apply fn x
let ( @| ) a b = Comma (a, b)
let ( =| ) a b = Equate (a, b)
let ( +| ) a b = Add (a, b)
let ( -| ) a b = Subtract (a, b)
let ( *| ) a b = Multiply (a, b)
let ( /| ) a b = Divide (a, b)
let ( **| ) a b = Exponentiate (a, b)
let ( -/ ) a = Negate a
let rec of_parse_tree =
  let ( >>== ) a f =
    match a with
    | Result.Error e -> Result.Error e
    | Result.Ok x -> Result.Ok (f x)
  in
  let ( >>>= ) a f =
    match a with
    | Result.Ok x, Result.Ok y -> Result.Ok (f x y)
    | Result.Error e, _ -> Result.Error e
    | _, Result.Error e -> Result.Error e
  in
  let open Result in
  function
  | Parser.Leaf x -> Ok (Node Tokenizer.(x.value))
  | Parser.PrefixOp { token; child } ->
    (match Tokenizer.(token.token) with
    | Negate -> of_parse_tree child >>== ( -/ )
    | Function -> of_parse_tree child >>== ( @@| ) Tokenizer.(token.value)
    | _ -> Error Tokenizer.(token.pos))
  | Parser.InfixOp { token; left; right } ->
    (match Tokenizer.(token.token) with
    | Comma -> (of_parse_tree left, of_parse_tree right) >>>= ( @| )
    | Equals -> (of_parse_tree left, of_parse_tree right) >>>= ( =| )
    | Plus -> (of_parse_tree left, of_parse_tree right) >>>= ( +| )
    | Minus -> (of_parse_tree left, of_parse_tree right) >>>= ( -| )
    | Multiply -> (of_parse_tree left, of_parse_tree right) >>>= ( *| )
    | Divide -> (of_parse_tree left, of_parse_tree right) >>>= ( /| )
    | Exponentiate -> (of_parse_tree left, of_parse_tree right) >>>= ( **| )
    | _ -> Error Tokenizer.(token.pos))
;;
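
These operators also make it cheap to write expression trees by hand, which is handy for testing. For our running example:

(* Built by hand with the operators above... *)
let _expected = node "1" +| node "23" (* Add (Node "1", Node "23") *)

(* ...which is exactly what of_parse_tree should return (wrapped in Ok)
   when given the syntax tree parsed from "1 + 23". *)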

Finally, we can evaluate this expression tree recursively. The entire function is quite lengthy (not because of the complexity, but because of the sheer number of supported functions), so I'll just include the first few lines:

let rec to_float = function
  | Add (x, y) -> to_float x +. to_float y
  | Subtract (x, y) -> to_float x -. to_float y
  | Multiply (x, y) -> to_float x *. to_float y
  | Divide (x, y) -> to_float x /. to_float y
  | Exponentiate (x, y) -> to_float x **. to_float y
  | Negate x -> to_float x |> Float.neg
  | Node "e" -> Float.exp 1.
  | Node "pi" -> Float.pi
  | Node x ->
    (try Float.of_string x with
    | _ -> Float.nan)
  | Apply ("sqrt", x) -> Float.sqrt @@ to_float x
  | Apply ("cbrt", x) -> to_float x **. (1. /. 3.)
  | Apply ("log", x) -> Float.log10 @@ to_float x
  | Apply ("ln", x) -> Float.log @@ to_float x
  | Apply ("exp", x) -> Float.exp @@ to_float x
  | Apply ("max", x) ->
    (match List.map ~f:to_float (to_list x) with
    | [] -> Float.nan
    | num :: rem -> List.fold rem ~init:num ~f:Float.max)
  | Apply ("min", x) ->
    (match List.map ~f:to_float (to_list x) with
    | [] -> Float.nan
    | num :: rem -> List.fold rem ~init:num ~f:Float.min)
  | Apply ("abs", x) -> Float.abs @@ to_float x
  | Apply ("sgn", x) ->
    let y = to_float x in
    if Float.(y < 0.) then -1. else if Float.(y > 0.) then 1. else 0.
  | Apply ("sin", x) -> Float.sin @@ to_float x
   ...
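
Evaluating our running example (plus a couple of the functions above):

let _ = to_float (node "1" +| node "23") (* 24. *)
let _ = to_float ("sqrt" @@| node "2")   (* 1.41421356... *)
let _ = to_float (node "x")              (* nan: a variable has no float value (yet) *)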

And that wraps up the calculator from input string all the way to evaluation.

5. Conclusion

Steps towards a true Computer Algebra System

We saw that with a couple hundred lines of code, we can implement everything from a parser all the way to an expression tree. However, we're not quite finished yet. After all, we still haven't implemented symbolic computations... or have we?

If you run the code, you'll see that the to_float function returns NaN when your input has variables. However, this is more a consequence of there being no conversion from a variable to a float than a lack of support for symbols. In fact, if you print out the expression tree, you'll see that it does indeed contain any variables in your input. The next step is simply defining the rules of mathematics so that our program can manipulate the expression tree into meaningful representations.

However, this post is long enough already, so that will be a future project. In the meantime, check it out on GitHub and run it for yourself.