At the start of summer, I wrote code for the Physics department at Cambridge to help them rewrite ~500 pages of dense, mathematical lecture notes. It’s a fun little hack I pieced together to extract metadata left behind by a proprietary extension, buried deep within a .pptx file.

Prelude: how the department got here

Say you’re a professor in the late 90s, and you’re trying to deliver a course on particle physics. For whatever reason, you choose to present your lectures with Microsoft PowerPoint, along with an extension that allows you to use LaTeX. The addon is quite pleasant to use; equations behave like images, and can be edited with a right click.

Fast forward thirty years, and some five hundred slides, to today. The PowerPoint extension no longer works, and none of the equations in the handout are editable any more.

The slides are passed on to a new lecturer (Dr Lester), who genuinely tries to make this work: editing equations like images, inserting his own equations as screenshots, and manually adjusting equation numbering across many slides. He also comes up with a genuinely commendable strategy to incrementally rewrite from PowerPoint to LaTeX. He first exports all slides as PDFs and inserts them with \includepdf into a LaTeX document, which allows him to essentially layer images over existing slides. Now he can write equations in LaTeX and have them compile into a slide deck!

Ultimately, it’s still too painful to use: correct equation and slide numbering is non-negotiable in a teaching setting, and editing existing content is too error-prone.

Dr Lester did what pretty much every software developer has once considered: giving up and asking management to approve a rewrite. Having proven that a full rewrite was necessary, he showed that he simply didn’t have the time to rewrite 500 slides of dense mathematics and got funding to hire a student worker to write the lectures.

Fun Hacking

Dr Lester actually framed the project as unskilled work, and I applied with the suspicion that I could automate a lot of the role.

I thought that it was possible to scrape content from the PowerPoint file and print it into a LaTeX file. My initial plan was to export every image from the PowerPoint and put it through an online image-to-LaTeX API, but I realized there was a better solution. I speculated that there had to be some sort of LaTeX metadata inside somewhere, which I could potentially extract, saving many hours of work.

Anyway, the first step for me was to explore the python-pptx library I was using to deserialize PowerPoint files. I found it relatively straightforward to extract text from slides, but I had a hard time digging through things to find any metadata:

>>> from inspect import getmembers
>>> from pprint import pprint
>>> pprint(getmembers(prs))
[('__class__', <class 'pptx.presentation.Presentation'>),

 ... omitted 27 dunder methods ...

 ('_element', <Element {http://schemas.openxmlformats.org/presentationml/2006/main}presentation at 0x7fa0a535e440>),
 ('_part', <pptx.parts.presentation.PresentationPart object at 0x7fa0a56ad1d0>),
 ('core_properties', <pptx.parts.coreprops.CorePropertiesPart object at 0x7fa0a5682910>),
 ('element', <Element {http://schemas.openxmlformats.org/presentationml/2006/main}presentation at 0x7fa0a535e440>),
 ('notes_master', <pptx.slide.NotesMaster object at 0x7fa0a4f3df90>),
 ('part', <pptx.parts.presentation.PresentationPart object at 0x7fa0a56ad1d0>),
 ('save', <bound method Presentation.save of <pptx.presentation.Presentation object at 0x7fa0a58b3650>>),
 ('slide_height', 6858000),
 ('slide_layouts', <pptx.slide.SlideLayouts object at 0x7fa0a52d7bd0>),
 ('slide_master', <pptx.slide.SlideMaster object at 0x7fa0a4f3fc10>),
 ('slide_masters', <pptx.slide.SlideMasters object at 0x7fa0a4f3fc90>),
 ('slide_width', 9906000),
 ('slides', <pptx.slide.Slides object at 0x7fa0a4f3fbd0>)]

Fun fact: PowerPoint files are just zip archives, mostly full of plaintext XML. You can find any old .pptx file on your machine and unzip it:

$ unzip -d output/H02_Dirac/ input/H02_Dirac.pptx
$ exa output/ --tree --only-dirs
output/H02_Dirac
├── _rels
├── docProps
└── ppt
   ├── _rels
   ├── handoutMasters
   │  └── _rels
   ├── media
   ├── notesMasters
   │  └── _rels
   ├── notesSlides
   │  └── _rels
   ├── slideLayouts
   │  └── _rels
   ├── slideMasters
   │  └── _rels
   ├── slides
   │  └── _rels
   ├── tags
   └── theme
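
The same structure can be inspected from Python without unzipping anything to disk. Here is a minimal sketch using the standard library's zipfile module; the helper name list_parts is made up for illustration and is not part of the original script:

```python
import zipfile
from pathlib import Path


def list_parts(pptx_path: Path, prefix: str = "") -> list[str]:
    """Return the sorted names of archive members that start with `prefix`.

    Hypothetical helper: a .pptx is a zip archive, so ZipFile can open it directly.
    """
    with zipfile.ZipFile(pptx_path) as archive:
        return sorted(name for name in archive.namelist() if name.startswith(prefix))
```

Calling list_parts(Path("input/H02_Dirac.pptx"), "ppt/tags/") should list only the tag files, assuming the archive is laid out like the tree above.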

I think I dug around through these files for a while, trying to loosely understand the file structure. Maybe at one point I grepped for something LaTeX-related, and hit the lottery:

$ rg -F '\frac' --files-with-matches
ppt/tags/tag284.xml
ppt/tags/tag408.xml
ppt/tags/tag414.xml
 ...
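
For anyone without ripgrep to hand, roughly the same search can be run against the archive itself. This is a sketch under the assumption that the XML parts are UTF-8 encoded; members_containing is a hypothetical name, not something from the original script:

```python
import zipfile
from pathlib import Path


def members_containing(pptx_path: Path, needle: str) -> list[str]:
    """Return the names of .xml members whose raw bytes contain `needle`."""
    needle_bytes = needle.encode("utf-8")
    with zipfile.ZipFile(pptx_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # literal substring match, like a fixed-string grep
            if name.endswith(".xml") and needle_bytes in archive.read(name)
        )
```

Passing a raw string such as r"\frac" keeps Python from treating the backslash as an escape, which is the same trap the shell sets for an unquoted pattern.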

Damn! Let’s cat a file and see what we got:

$ cat ppt/tags/tag433.xml
<p:tagLst
  xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
  xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
  xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <p:tag 
    name="SOURCE" 
    val="\documentclass{article}\pagestyle{empty}&#xA;\usepackage{partiiiparticles}&#xA;\begin{document}&#xA;&#xA;$a = \sqrt{\frac{1}{2}(\gamma+1)}, \ \ \ \ b =\sqrt{\frac{1}{2}(\gamma-1)}$&#xA;\end{document}&#xA;"
  />
  <p:tag name="EXTERNALNAME" val="TP_tmp"/>
  <p:tag name="BLEND" val="0"/>
  <p:tag name="TRANSPARENT" val="0"/>
  <p:tag name="RESOLUTION" val="1200"/>
  <p:tag name="WORKAROUNDTRANSPARENCYBUG" val="0"/>
  <p:tag name="ALLOWFONTSUBSTITUTION" val="0"/>
  <p:tag name="BITMAPFORMAT" val="pngmono"/>
  <p:tag name="ORIGWIDTH" val="138"/>
  <p:tag name="PICTUREFILESIZE" val="6105"/>
</p:tagLst>

We have LaTeX!! This is actually a deliverable the client might want!! All of this exploration has not been pointless!
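
To see what a parser makes of a file like this, here is a small sketch using the standard library's xml.etree.ElementTree (not the xmltodict approach used later in the script). The read_tags helper and the shortened SAMPLE document are made up for illustration; only the namespace URI and the tag names come from the listing above:

```python
import xml.etree.ElementTree as ET

# PresentationML namespace, copied from the xmlns:p attribute in the listing
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"

# Shortened, hypothetical stand-in for a real tagNNN.xml file
SAMPLE = r"""<p:tagLst xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <p:tag name="SOURCE" val="\begin{document}&#xA;$x = 2$&#xA;\end{document}&#xA;"/>
  <p:tag name="RESOLUTION" val="1200"/>
</p:tagLst>"""


def read_tags(xml_text: str) -> dict[str, str]:
    """Map each <p:tag> name attribute to its val attribute."""
    root = ET.fromstring(xml_text)
    return {tag.get("name"): tag.get("val") for tag in root.iter(f"{{{P_NS}}}tag")}


tags = read_tags(SAMPLE)
print(tags["SOURCE"])  # the &#xA; character references come back as real newlines
```

The useful detail is that the parser decodes the &#xA; references for you, so the SOURCE value is ready to paste into a .tex file.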

Okay, to really package this into something useful for Dr Lester & co, I need some way to find out which slide this LaTeX string actually corresponds to. That way, I can produce all of the equations in order, and speed up the rewrite significantly.

A lot more digging shows that the tag XML file is referenced in yet more files:

$ rg "tag433"
[Content_Types].xml
2:[Omitted long line with 1 matches]

ppt/slides/_rels/slide54.xml.rels
2:[Omitted long line with 1 matches]

Bingo!

Now we just need to put together a quick script to make a nice .tex file for Dr Lester. Firstly, let’s extract all of our .pptx files:

import zipfile
from pathlib import Path


def unzip_presentation(pptx_path: Path, output: Path) -> Path:
    """Extract a .pptx archive into <output>/zip and return that directory."""
    output_dir = output / "zip"
    output_dir.mkdir(parents=True, exist_ok=True)

    # A .pptx file is a zip archive, so zipfile can extract it directly
    with zipfile.ZipFile(pptx_path, "r") as zip_ref:
        zip_ref.extractall(output_dir)

    assert output_dir.is_dir()
    return output_dir

Okay, next we want to get a mapping from tag number to LaTeX source code. You can’t reliably regex XML, so I opted to parse the XML with xmltodict and validate the result with pydantic:

from pathlib import Path
from typing import List, NewType

import xmltodict
from pydantic import BaseModel, Field

# `Latex` is a distinct type for static type checkers; at runtime it is a plain str
# this lets the type hints say which strings hold LaTeX source
Latex = NewType("Latex", str)


def tags_to_latex(zip_path: Path) -> dict[str, Latex]:
    """a dictionary from tag to latex, eg. '200' to 'x = 2'"""
    class Tag(BaseModel):
        name: str = Field(alias="@name")
        val: str = Field(alias="@val")

    class Data(BaseModel):
        xmlns_a: str = Field(alias="@xmlns:a")
        xmlns_r: str = Field(alias="@xmlns:r")
        xmlns_p: str = Field(alias="@xmlns:p")
        tag: List[Tag] = Field(alias="p:tag")

    class TagList(BaseModel):
        p_tag_lst: Data = Field(alias="p:tagLst")

    tag_path = zip_path / "ppt" / "tags"
    assert tag_path.is_dir()

    result: dict[str, Latex] = {}

    for file_path in tag_path.glob("*.xml"):
        taglist = TagList.parse_obj(xmltodict.parse(file_path.read_text()))
        tag_dict = {tag.name: tag.val for tag in taglist.p_tag_lst.tag}
        if "SOURCE" in tag_dict:
            # convert filepath 'tag200.xml' into '200'
            # slice off the fixed preamble up to and including \begin{document}
            # and its newline (88 characters), plus the trailing \end{document}
            # and newline (15 characters); this only works while every equation
            # shares the exact preamble shown in tag433.xml
            result[file_path.stem[3:]] = Latex(tag_dict["SOURCE"][88:-15])
    return result
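
The [88:-15] slice above depends on every equation carrying exactly the same preamble. If that ever stops being true, a more forgiving alternative is to cut on the \begin{document} and \end{document} markers themselves. This is a sketch of that idea, not what the original script does; document_body is a hypothetical helper:

```python
import re

# Capture everything between \begin{document} and \end{document}, newlines included
_BODY = re.compile(r"\\begin\{document\}(.*?)\\end\{document\}", re.DOTALL)


def document_body(source: str) -> str:
    """Return the text inside the document environment, stripped of outer whitespace.

    Falls back to the whole (stripped) string if the markers are missing.
    """
    match = _BODY.search(source)
    return (match.group(1) if match else source).strip()
```

Unlike the slice, this version also trims the blank lines around the equation, so the two are not drop-in identical.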

We can do something similar to get a mapping from slide number to a list of tag ids:

def slides_to_tags(zip_path: Path) -> dict[str, list[str]]:
    class Relationship(BaseModel):
        id: str = Field(alias="@Id")
        type: str = Field(alias="@Type")
        target: str = Field(alias="@Target")
 
    class Relationships(BaseModel):
        xmlns: str = Field(alias="@xmlns")
        relationship: List[Relationship] = Field(alias="Relationship")
 
        # very annoying fix for xmltodict:
        # if <Relationships> contains several <Relationship> child tags, 'relationship' is a list
        # if <Relationships> contains only one <Relationship> child tag, 'relationship' is a dict
        @validator("relationship", pre=True, always=True, allow_reuse=True)
        def ensure_list(cls, value):
            if not isinstance(value, list):
                return [value]
            return value
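
The listing above stops after the models, so here is a sketch of how the rest of the mapping could be built. It swaps in the standard library's xml.etree.ElementTree rather than xmltodict and pydantic, and it assumes that tag relationships point at targets ending in tags/tagNNN.xml, as the rg output for tag433 suggests. The function name slides_to_tags_sketch is hypothetical:

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by .rels files in Open Packaging Conventions archives
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def slides_to_tags_sketch(zip_path: Path) -> dict[str, list[str]]:
    """Map slide number (e.g. '54') to the tag ids (e.g. ['433']) it references."""
    rels_dir = zip_path / "ppt" / "slides" / "_rels"
    result: dict[str, list[str]] = {}
    for rels_file in rels_dir.glob("slide*.xml.rels"):
        slide_match = re.fullmatch(r"slide(\d+)\.xml\.rels", rels_file.name)
        if slide_match is None:
            continue
        tag_ids = []
        for rel in ET.parse(rels_file).getroot().iter(f"{{{REL_NS}}}Relationship"):
            # assumption: tag targets look like '../tags/tag433.xml'
            tag_match = re.search(r"tags/tag(\d+)\.xml$", rel.get("Target", ""))
            if tag_match:
                tag_ids.append(tag_match.group(1))
        result[slide_match.group(1)] = tag_ids
    return result
```

One caveat with keying on the file name: the number in slideNN.xml is not guaranteed to match the slide's position in the presentation, so it is worth spot-checking a few slides before trusting the order.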

All that is left is to integrate this with information already provided by python-pptx to get a pleasant output .tex file for Dr Lester.

def create_text(
    prs, output: Path, slides_mapper: dict[str, list[str]], tag_mapper: dict[str, Latex]
) -> None:
    text_dir = output / "text"
    text_dir.mkdir(parents=True, exist_ok=True)
 
    # text_runs collects the output for one slide: a header comment,
    # every text run on the slide, then the LaTeX source of its equations
    for idx, slide in enumerate(prs.slides, 1):
        text_runs = [f"% BEGIN SLIDE {idx}\n"]
        file = text_dir / f"slide{idx:03}.tex"
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
                text_runs.append("\n")
        for tag_id in slides_mapper.get(str(idx), []):
            text_runs.append(tag_mapper[tag_id])
        file.write_text("".join(text_runs))
 
 
for input_file, output_dir in get_pptx_files():
    logging.debug(input_file)
    prs = Presentation(input_file)
    output_zip_dir = unzip_presentation(input_file, output_dir)
    tags_mapper = tags_to_latex(output_zip_dir)
    slides_mapper = slides_to_tags(output_zip_dir)
    create_text(prs, output_dir, slides_mapper, tags_mapper)

Bon Appétit

Of course, the resulting LaTeX isn’t perfect: it omits images, and there’s a lot of editorial work still left to do on the layout and presentation. But it’s a very good start.

All in all, 150 lines of Python to save several days’ worth of work.

I’d say that’s a paycheck I earned!