parsing overview
This page gives developers a broad overview of how Note Splitter’s core splitting algorithm works.
The main steps the program goes through are:
tokenization
parsing
splitting
formatting
tokenization
After the user has selected which plain text file to split and what to split by, Note Splitter reads the file and assigns a category to each line. This process is called lexical analysis, or tokenization. A line's category can be, for example, header, blockquote, footnote, empty line, or table row. Each line and its category are combined into a token, which is this program's term for an object representing one small part of a file together with data about that part. At this point in the program, each token holds exactly one line of the file, and all the tokens are in one list.
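As a rough illustration of this first pass, the sketch below classifies each line on its own using regular expressions. The `Token` dataclass, the `PATTERNS` table, and the `tokenize` function are hypothetical names chosen for this example; the real Lexer class recognizes more categories and may work quite differently.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only. The first pattern that
# matches a line decides its category; anything else is plain Text.
PATTERNS = [
    ("EmptyLine", re.compile(r"^\s*$")),
    ("Header", re.compile(r"^#{1,6} ")),
    ("Blockquote", re.compile(r"^> ")),
    ("UnorderedListItem", re.compile(r"^\s*[*+-] ")),
    ("OrderedListItem", re.compile(r"^\s*\d+[.)] ")),
]


@dataclass
class Token:
    kind: str  # the category assigned to the line
    text: str  # the line's original plain text


def tokenize(text: str) -> list:
    """Return one Token per line, looking at each line in isolation."""
    tokens = []
    for line in text.splitlines():
        kind = next(
            (name for name, pattern in PATTERNS if pattern.match(line)),
            "Text",
        )
        tokens.append(Token(kind, line))
    return tokens
```

Because a header pattern requires a space after the `#` characters, a tag line such as `#first-tag` falls through to `Text` in this sketch.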
Some features of plain text files can only be understood correctly in context. That is why the next step re-checks the token types, this time comparing adjacent tokens. For example, a code block in a markdown file might contain a Python comment that, without context, looks like a markdown header:
print('this is code inside a code block, and . . .')
# this is a Python comment, not a markdown header.
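One way to picture this context-aware second pass is the sketch below, which relabels every token between two code fences as code. It assumes tokens are simple `(kind, text)` pairs; the `retag_code` function and that pair layout are hypothetical and are not taken from Note Splitter's source.

```python
# Built from single backticks so this example contains no literal fence.
FENCE = "`" * 3


def retag_code(tokens):
    """Second pass: relabel every token between two code fences as Code."""
    result = []
    inside_block = False
    for kind, text in tokens:
        if text.lstrip().startswith(FENCE):
            result.append(("CodeFence", text))
            inside_block = not inside_block  # a fence opens or closes a block
        elif inside_block:
            result.append(("Code", text))  # context overrides the first guess
        else:
            result.append((kind, text))
    return result


# Categories as a context-free first pass might have guessed them.
first_pass = [
    ("Text", FENCE + "python"),
    ("Text", "print('this is code inside a code block')"),
    ("Header", "# this is a Python comment, not a markdown header."),
    ("Text", FENCE),
    ("Header", "# a real header"),
]
second_pass = retag_code(first_pass)
```

After this pass, the comment line is tagged as code while the header that follows the closing fence keeps its original category.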
Here’s an example that shows the result of tokenization with each token’s type on the left, and its plain text on the right:
Header | # sample markdown
Text | #first-tag #second-tag
UnorderedListItem | * bullet point 1
UnorderedListItem | * bullet point 2
EmptyLine |
Text | here is text
OrderedListItem | 1. ordered
OrderedListItem | 2. list
EmptyLine |
Header | ## second header
Text | #third-tag
CodeFence | ```python
Code | print('this code is inside a code block')
Code | while True:
Code |     print(eval(input('>>> ')))
CodeFence | ```
EmptyLine |
The tokens page lists every token type this program uses and shows their hierarchy, and the Lexer class shows how this program tokenizes text.
parsing
Next, an optional step groups some tokens together into larger tokens. For example, table row tokens that are next to each other are put together into one table token, and two code fence tokens surrounding code tokens become a code block token. This process is called syntax analysis, or parsing. The inner tokens are still tokens, but the overall token list is now shorter and more organized. (This step is optional because the extra layer of organization is sometimes unnecessary and only makes operations more difficult.)
Continuing from the previous example, here is the result of parsing:
Header | # sample markdown
Text | #first-tag #second-tag
TextList | * bullet point 1
| * bullet point 2
EmptyLine |
Text | here is text
TextList | 1. ordered
| 2. list
EmptyLine |
Header | ## second header
Text | #third-tag
CodeBlock | ```python
| print('this code is inside a code block')
| while True:
|     print(eval(input('>>> ')))
| ```
EmptyLine |
Now we have a syntax tree. This data structure can simplify many operations such as splitting a file, merging multiple files, moving parts of a file around, etc.
Parsing occurs in the SyntaxTree constructor, which is in parser_.py.
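The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the SyntaxTree constructor itself: the `Token` dataclass with a `children` list and the `parse` function are assumptions made for this example, and only list items and code blocks are grouped.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    kind: str
    text: str = ""
    children: list = field(default_factory=list)  # inner tokens, if any


LIST_ITEMS = {"UnorderedListItem", "OrderedListItem"}


def parse(tokens):
    """Group runs of related line tokens into larger tokens."""
    tree = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.kind in LIST_ITEMS:
            # Gather the whole run of adjacent items of the same kind.
            j = i
            while j < len(tokens) and tokens[j].kind == token.kind:
                j += 1
            tree.append(Token("TextList", children=tokens[i:j]))
            i = j
        elif token.kind == "CodeFence":
            # Gather everything up to the next fence.
            j = i + 1
            while j < len(tokens) and tokens[j].kind != "CodeFence":
                j += 1
            j = min(j + 1, len(tokens))  # include the closing fence if present
            tree.append(Token("CodeBlock", children=tokens[i:j]))
            i = j
        else:
            tree.append(token)
            i += 1
    return tree
```

The inner tokens are kept in `children`, so nothing from the tokenization step is lost; the top-level list simply becomes shorter.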
splitting
Note Splitter takes the syntax tree and the user’s choice of what to split by, and splits the syntax tree into sections. Each section’s tokens are put together into a Section token, and each Section token is a smaller syntax tree that is still easy to modify.
Continuing from the previous example, here’s the result of splitting where the user chose to split by headers of all levels:
Section | # sample markdown
| #first-tag #second-tag
| * bullet point 1
| * bullet point 2
|
| here is text
| 1. ordered
| 2. list
|
Section | ## second header
| #third-tag
| ```python
| print('this code is inside a code block')
| while True:
|     print(eval(input('>>> ')))
| ```
|
Once again, all the previous tokens still exist and can be accessed; they have simply been grouped together inside other tokens. You can see the code for splitting in splitter.py.
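To make the idea concrete, here is a minimal sketch of splitting by headers of all levels. It assumes the tree is a flat list of `(kind, text)` pairs and that each section is a plain list; the `split_by_headers` function is hypothetical, and the code in splitter.py may handle many more cases.

```python
def split_by_headers(tree):
    """Start a new section at every Header token.

    Any content that appears before the first header is kept in a
    section of its own rather than being dropped.
    """
    sections = []
    for kind, text in tree:
        if kind == "Header" or not sections:
            sections.append([])
        sections[-1].append((kind, text))
    return sections


tree = [
    ("Header", "# sample markdown"),
    ("Text", "#first-tag #second-tag"),
    ("EmptyLine", ""),
    ("Header", "## second header"),
    ("Text", "#third-tag"),
]
sections = split_by_headers(tree)
```

Each resulting section still contains the original tokens, so later steps can keep modifying them.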
formatting
The last big step before saving is formatting, which includes:
Reducing header levels in sections that don’t have level 1 headers.
Copying any relevant footnotes and global tags from the source file into each section (if enabled in settings).
Converting the sections back into strings.
The code for formatting can be found in formatter_.py.
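The first and last formatting steps listed above might look something like the sketch below. It assumes a section is a list of `(kind, text)` pairs and that markdown headers start with `#` characters; `normalize_header_levels` and `to_string` are hypothetical names, not functions from formatter_.py, and the footnote and tag copying step is left out.

```python
def normalize_header_levels(section):
    """Shift headers up so the section's highest-level header becomes level 1."""
    levels = [
        len(text) - len(text.lstrip("#"))
        for kind, text in section
        if kind == "Header"
    ]
    if not levels or min(levels) == 1:
        return section  # nothing to reduce
    shift = min(levels) - 1
    # Only Header tokens change, so '#' lines inside code stay untouched.
    return [
        (kind, text[shift:] if kind == "Header" else text)
        for kind, text in section
    ]


def to_string(section):
    """Convert a section back into plain text, one token per line."""
    return "\n".join(text for _, text in section)


section = [
    ("Header", "## second header"),
    ("Text", "#third-tag"),
    ("Header", "### details"),
]
formatted = to_string(normalize_header_levels(section))
```

Working on tokens rather than raw text is what lets this step tell a real header apart from a tag or a code comment that also begins with `#`.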
further reading
Syntax trees are most often used to process code. Although the resources below talk mostly about code, the ideas still apply to working with a plain text syntax tree.
An excellent introduction to abstract syntax trees (ASTs) is given in ASTs - What are they and how to use them.
Wikipedia explains lexical analysis and related topics in great depth.
An advanced and detailed resource covering all of the steps above and more can be found in Crafting Interpreters.