Hacking PowerPoint for Fun & Profit

At the start of summer, I wrote code for the Physics department at Cambridge to help them rewrite ~500 pages of dense, mathematical lecture notes. It's a fun little hack I pieced together to extract metadata from a proprietary extension deep within a pptx file.

The department actually framed the project as unskilled work, and I applied with the suspicion that I could automate a lot of the role.

I thought it would be possible to scrape content from the PowerPoint file and export it into a LaTeX file. My initial plan was to export every image from the PowerPoint and put it through an online image-to-LaTeX API, but I realized there was a better solution. I speculated that there had to be some sort of LaTeX metadata inside somewhere, which I could potentially extract, saving many hours of work.

Anyway, the first step for me was to explore the python-pptx library I was using to deserialize PowerPoint files. I found it relatively straightforward to extract text from slides, but I had a hard time digging through things to find any metadata:

>>> from inspect import getmembers
>>> from pprint import pprint
>>> pprint(getmembers(prs))
[('__class__', <class 'pptx.presentation.Presentation'>),

 ... omitted 27 dunder methods ...

 ('_element', <Element {http://schemas.openxmlformats.org/presentationml/2006/main}presentation at 0x7fa0a535e440>),
 ('_part', <pptx.parts.presentation.PresentationPart object at 0x7fa0a56ad1d0>),
 ('core_properties', <pptx.parts.coreprops.CorePropertiesPart object at 0x7fa0a5682910>),
 ('element', <Element {http://schemas.openxmlformats.org/presentationml/2006/main}presentation at 0x7fa0a535e440>),
 ('notes_master', <pptx.slide.NotesMaster object at 0x7fa0a4f3df90>),
 ('part', <pptx.parts.presentation.PresentationPart object at 0x7fa0a56ad1d0>),
 ('save', <bound method Presentation.save of <pptx.presentation.Presentation object at 0x7fa0a58b3650>>),
 ('slide_height', 6858000),
 ('slide_layouts', <pptx.slide.SlideLayouts object at 0x7fa0a52d7bd0>),
 ('slide_master', <pptx.slide.SlideMaster object at 0x7fa0a4f3fc10>),
 ('slide_masters', <pptx.slide.SlideMasters object at 0x7fa0a4f3fc90>),
 ('slide_width', 9906000),
 ('slides', <pptx.slide.Slides object at 0x7fa0a4f3fbd0>)]

Fun Fact: PowerPoint files are just zipped directories of (mostly) plaintext XML. You can find any old .pptx file on your machine and unzip it:

$ unzip -d output/H02_Dirac/ input/H02_Dirac.pptx
$ exa output/ --tree --only-dirs
output/H02_Dirac
├── _rels
├── docProps
└── ppt
   ├── _rels
   ├── handoutMasters
   │  └── _rels
   ├── media
   ├── notesMasters
   │  └── _rels
   ├── notesSlides
   │  └── _rels
   ├── slideLayouts
   │  └── _rels
   ├── slideMasters
   │  └── _rels
   ├── slides
   │  └── _rels
   ├── tags
   └── theme
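The same poking around can be done from Python with the standard library's zipfile module, without unpacking anything to disk. This is a minimal sketch: `list_parts` is a hypothetical helper name, and the tiny in-memory archive stands in for a real deck.

```python
import io
import zipfile


def list_parts(pptx, prefix=""):
    """Return the sorted member names of a .pptx (a zip archive), filtered by prefix."""
    with zipfile.ZipFile(pptx) as archive:
        return sorted(name for name in archive.namelist() if name.startswith(prefix))


# Hypothetical stand-in for a real deck: a tiny archive built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("ppt/slides/slide1.xml", "<sld/>")
    archive.writestr("ppt/tags/tag1.xml", "<tagLst/>")
buffer.seek(0)

print(list_parts(buffer, prefix="ppt/tags/"))
```

Passing a path such as input/H02_Dirac.pptx instead of the buffer should list the parts of a real presentation in the same way.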

I think I dug around through these files for a while, trying to loosely understand the file structure. Maybe at one point I grepped for something LaTeX-related, and hit the lottery:

$ rg -F '\frac' --files-with-matches
ppt/tags/tag284.xml
ppt/tags/tag408.xml
ppt/tags/tag414.xml
 ...

Damn! Let's cat a file and see what we got:

$ cat ppt/tags/tag433.xml
<p:tagLst
  xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
  xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
  xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <p:tag 
    name="SOURCE" 
    val="\documentclass{article}\pagestyle{empty}&#xA;\usepackage{partiiiparticles}&#xA;\begin{document}&#xA;&#xA;$a = \sqrt{\frac{1}{2}(\gamma+1)}, \ \ \ \ b =\sqrt{\frac{1}{2}(\gamma-1)}$&#xA;\end{document}&#xA;"
  />
  <p:tag name="EXTERNALNAME" val="TP_tmp"/>
  <p:tag name="BLEND" val="0"/>
  <p:tag name="TRANSPARENT" val="0"/>
  <p:tag name="RESOLUTION" val="1200"/>
  <p:tag name="WORKAROUNDTRANSPARENCYBUG" val="0"/>
  <p:tag name="ALLOWFONTSUBSTITUTION" val="0"/>
  <p:tag name="BITMAPFORMAT" val="pngmono"/>
  <p:tag name="ORIGWIDTH" val="138"/>
  <p:tag name="PICTUREFILESIZE" val="6105"/>
</p:tagLst>
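Note that the newlines in SOURCE are stored as `&#xA;` character references, so any XML parser will hand back real newlines. A minimal sketch using the standard library's ElementTree (not the xmltodict approach used later in this post); `read_tags` and the shortened `SAMPLE` document are made up for illustration, and only the namespace URI is copied from the file above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the tag file above.
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"

# Shortened, hypothetical tag file in the same shape as tag433.xml.
SAMPLE = (
    f'<p:tagLst xmlns:p="{P_NS}">'
    '<p:tag name="SOURCE" val="\\begin{document}&#xA;$x = 2$&#xA;\\end{document}&#xA;"/>'
    '<p:tag name="RESOLUTION" val="1200"/>'
    "</p:tagLst>"
)


def read_tags(xml_text):
    """Map each <p:tag> name to its val attribute."""
    root = ET.fromstring(xml_text)
    return {tag.get("name"): tag.get("val") for tag in root.iter(f"{{{P_NS}}}tag")}


tags = read_tags(SAMPLE)
print(tags["SOURCE"])
```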

We have LaTeX!! This is actually a deliverable the client might want!! All of this exploration has not been pointless!

Okay, to really package this into something useful for Dr Lester & co, I need some way to find out which slide this LaTeX string actually corresponds to. That way, I can produce all of the equations in order, and speed up the rewrite significantly.

A lot more digging shows that the tag XML file is referenced in yet more files:

$ rg "tag433"
[Content_Types].xml
2:[Omitted long line with 1 matches]

ppt/slides/_rels/slide54.xml.rels
2:[Omitted long line with 1 matches]

Bingo!

Now we just need to put together a quick script to make a nice .tex file for Dr Lester. Firstly, let's extract all of our .pptx files:

import zipfile
from pathlib import Path


def unzip_presentation(input: Path, output: Path) -> Path:
    output_dir = output / "zip"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Extract the contents of the pptx file into <output>/zip
    with zipfile.ZipFile(input, "r") as zip_ref:
        zip_ref.extractall(output_dir)

    assert output_dir.is_dir()
    return output_dir

Okay, next we want to get a mapping from tag number to LaTeX source code. You can't reliably regex XML, so I opted to parse the XML with xmltodict and validate the result with pydantic:

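from typing import List, NewType

import xmltodict
from pydantic import BaseModel, Field, validator

# `Latex` is a distinct type for static type checkers; at runtime its values are plain strings
# this effectively allows us to describe LaTeX through Python's type system
Latex = NewType("Latex", str)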


def tags_to_latex(zip_path: Path) -> dict[str, Latex]:
    """a dictionary from tag to latex, eg. '200' to 'x = 2'"""
    class Tag(BaseModel):
        name: str = Field(alias="@name")
        val: str = Field(alias="@val")

    class Data(BaseModel):
        xmlns_a: str = Field(alias="@xmlns:a")
        xmlns_r: str = Field(alias="@xmlns:r")
        xmlns_p: str = Field(alias="@xmlns:p")
        tag: List[Tag] = Field(alias="p:tag")

    class TagList(BaseModel):
        p_tag_lst: Data = Field(alias="p:tagLst")

    tag_path = zip_path / "ppt" / "tags"
    assert tag_path.is_dir()

    result = {}

    for file_path in tag_path.glob("*.xml"):
        taglist = TagList.parse_obj(xmltodict.parse(file_path.read_text()))
        tag_dict = {tag.name: tag.val for tag in taglist.p_tag_lst.tag}
        if "SOURCE" in tag_dict:
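            # convert filepath 'tag200.xml' into '200'
            # drop the fixed 88-character preamble (everything up to and including
            # "\begin{document}" and its newline) and the 15-character
            # "\end{document}" + newline suffix; this assumes every SOURCE
            # uses exactly the preamble shown in tag433.xml
            result[file_path.stem[3:]] = Latex(tag_dict["SOURCE"][88:-15])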
    return result
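The `[88:-15]` slice only works while every SOURCE string carries the exact same preamble. If the preamble might vary between equations, one alternative is to search for the document markers instead. A minimal sketch, not the code used for the project; `document_body` is a hypothetical helper, and unlike the slice it also strips surrounding whitespace.

```python
def document_body(source):
    """Return the text between \\begin{document} and \\end{document}, stripped."""
    begin, end = "\\begin{document}", "\\end{document}"
    start = source.find(begin)
    stop = source.rfind(end)
    if start == -1 or stop == -1 or stop < start:
        # Not a complete document: fall back to the whole string.
        return source.strip()
    return source[start + len(begin):stop].strip()


source = (
    "\\documentclass{article}\\pagestyle{empty}\n"
    "\\usepackage{partiiiparticles}\n"
    "\\begin{document}\n\n"
    "$a = \\sqrt{\\frac{1}{2}(\\gamma+1)}$\n"
    "\\end{document}\n"
)
print(document_body(source))
```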

We can do something similar to get a mapping from slide number to a list of tag ids:

def slides_to_tags(zip_path: Path) -> dict[str, list[str]]:
    class Relationship(BaseModel):
        id: str = Field(alias="@Id")
        type: str = Field(alias="@Type")
        target: str = Field(alias="@Target")

    class Relationships(BaseModel):
        xmlns: str = Field(alias="@xmlns")
        relationship: List[Relationship] = Field(alias="Relationship")

        # very annoying fix for xmltodict:
        # if the <Relationships> tag contains many <Relationship> child tags, 'relationship' is a list
        # if the <Relationships> tag contains only one <Relationship> child tag, 'relationship' is a dict
        @validator("relationship", pre=True, always=True, allow_reuse=True)
        def ensure_list(cls, value):
            if not isinstance(value, list):
                return [value]
            return value
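The excerpt above stops at the models. What remains is to read each ppt/slides/_rels/slideN.xml.rels file and collect the tag ids it points at. Here is one way that step could look, sketched with the standard library's ElementTree rather than xmltodict and pydantic, which also sidesteps the list-versus-dict quirk. The helper name, the sample XML, its placeholder Type values, and the assumed "../tags/tagN.xml" shape of Target are all illustrative, not taken from the original script.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of OPC relationship (.rels) files.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Assumed shape of a Target pointing at a tag part, e.g. "../tags/tag433.xml".
TAG_TARGET = re.compile(r"tags/tag(\d+)\.xml$")


def tag_ids_in_rels(rels_xml):
    """Return the tag ids referenced by one slideN.xml.rels document, in file order."""
    root = ET.fromstring(rels_xml)
    ids = []
    # iter() yields matching elements whether there is one child or many.
    for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
        match = TAG_TARGET.search(rel.get("Target", ""))
        if match:
            ids.append(match.group(1))
    return ids


# Hypothetical rels content; the Type values are shortened placeholders.
SAMPLE_RELS = (
    f'<Relationships xmlns="{RELS_NS}">'
    '<Relationship Id="rId1" Type="tags" Target="../tags/tag433.xml"/>'
    '<Relationship Id="rId2" Type="slideLayout" Target="../slideLayouts/slideLayout2.xml"/>'
    "</Relationships>"
)
print(tag_ids_in_rels(SAMPLE_RELS))
```

Keying the result by the N in slideN.xml.rels would then give a mapping of the shape `dict[str, list[str]]` that the signature above promises.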

All that is left is to integrate this with the information already provided by python-pptx to get a pleasant output .tex file for Dr Lester:

def create_text(
    prs, output: Path, slides_mapper: dict[str, list[str]], tag_mapper: dict[str, Latex]
) -> None:
    text_dir = output / "text"
    text_dir.mkdir(parents=True, exist_ok=True)

    # text_runs collects the strings for one slide:
    # a header comment, every text run, then the slide's LaTeX sources
    for idx, slide in enumerate(prs.slides, 1):
        text_runs = [f"% BEGIN SLIDE {idx}\n"]
        file = text_dir / f"slide{idx:03}.tex"
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
                text_runs.append("\n")
        for tag_id in slides_mapper.get(str(idx), []):
            text_runs.append(tag_mapper[tag_id])
        file.write_text("".join(text_runs))
        text_runs = []


for input_file, output_dir in get_pptx_files():
    logging.debug(input_file)
    prs = Presentation(input_file)
    output_zip_dir = unzip_presentation(input_file, output_dir)
    tags_mapper = tags_to_latex(output_zip_dir)
    slides_mapper = slides_to_tags(output_zip_dir)
    create_text(prs, output_dir, slides_mapper, tags_mapper)

Bon Appétit

Of course, the resulting LaTeX isn't perfect -- it omits images, and there's a lot of editorial work still left to do with the layout / presentation. But it's a very good start.

All in all, 150 lines of Python to save several days' worth of work.

I'd say that's a paycheck I earned!