Hacking PowerPoint for Fun & Profit
At the start of summer, I wrote code for the Physics department at Cambridge to help them rewrite ~500 pages of dense, mathematical lecture notes. It's a fun little hack I pieced together to extract metadata from a proprietary extension deep within a pptx file.
The department actually framed the project as unskilled work, and I applied with the suspicion that I could automate a lot of the role.
I thought that it was possible to scrape content from the PowerPoint file and export it into a LaTeX file. My initial plan was to export every image from the PowerPoint and put it through an online image-to-LaTeX API, but I realized there was a better solution. I speculated that there had to be some sort of LaTeX metadata inside somewhere, which I could potentially extract, saving many hours of work.
Anyway, the first step for me was to explore the python-pptx library I was using to deserialize PowerPoint files. I found it relatively straightforward to extract text from slides, but I had a hard time digging through things to find any metadata:
>>> from inspect import getmembers
>>> from pprint import pprint
>>> pprint(getmembers(prs))
[('__class__', <class 'pptx.presentation.Presentation'>),
... omitted 27 dunder methods ...
('_element', <Element {http://schemas.openxmlformats.org/presentationml/2006/main}presentation at 0x7fa0a535e440>),
('_part', <pptx.parts.presentation.PresentationPart object at 0x7fa0a56ad1d0>),
('core_properties', <pptx.parts.coreprops.CorePropertiesPart object at 0x7fa0a5682910>),
('element', <Element {http://schemas.openxmlformats.org/presentationml/2006/main}presentation at 0x7fa0a535e440>),
('notes_master', <pptx.slide.NotesMaster object at 0x7fa0a4f3df90>),
('part', <pptx.parts.presentation.PresentationPart object at 0x7fa0a56ad1d0>),
('save', <bound method Presentation.save of <pptx.presentation.Presentation object at 0x7fa0a58b3650>>),
('slide_height', 6858000),
('slide_layouts', <pptx.slide.SlideLayouts object at 0x7fa0a52d7bd0>),
('slide_master', <pptx.slide.SlideMaster object at 0x7fa0a4f3fc10>),
('slide_masters', <pptx.slide.SlideMasters object at 0x7fa0a4f3fc90>),
('slide_width', 9906000),
('slides', <pptx.slide.Slides object at 0x7fa0a4f3fbd0>)]
Fun Fact: PowerPoint files are just zipped directories of plaintext XML. You can find any old .pptx file on your machine and unzip it:
$ unzip -d output/H02_Dirac/ input/H02_Dirac.pptx
$ exa output/ --tree --only-dirs
output/H02_Dirac
├── _rels
├── docProps
└── ppt
    ├── _rels
    ├── handoutMasters
    │   └── _rels
    ├── media
    ├── notesMasters
    │   └── _rels
    ├── notesSlides
    │   └── _rels
    ├── slideLayouts
    │   └── _rels
    ├── slideMasters
    │   └── _rels
    ├── slides
    │   └── _rels
    ├── tags
    └── theme
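The same peek inside the archive works from Python without unzipping anything to disk. This is a minimal sketch using only the standard library's zipfile module; the function name list_part_dirs is made up for illustration and is not part of the script described in this post.

```python
import zipfile
from pathlib import Path, PurePosixPath


def list_part_dirs(pptx_path: Path) -> list[str]:
    """Return the sorted directories that directly contain an entry in the archive."""
    with zipfile.ZipFile(pptx_path, "r") as archive:
        dirs = {
            str(PurePosixPath(name).parent)
            for name in archive.namelist()
            # skip entries that sit at the archive root, e.g. [Content_Types].xml
            if "/" in name.rstrip("/")
        }
    return sorted(dirs)
```

Calling list_part_dirs(Path("input/H02_Dirac.pptx")) should give roughly the same picture as the tree above, with full paths instead of indentation.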
I think I dug around through these files for a while, trying to loosely understand the file structure. Maybe at one point I grepped for something LaTeX-related, and hit the lottery:
$ rg \frac --files-with-matches
ppt/tags/tag284.xml
ppt/tags/tag408.xml
ppt/tags/tag414.xml
...
Damn! Let's cat a file and see what we got:
$ cat ppt/tags/tag433.xml
<p:tagLst
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
<p:tag
name="SOURCE"
val="\documentclass{article}\pagestyle{empty}
\usepackage{partiiiparticles}
\begin{document}

$a = \sqrt{\frac{1}{2}(\gamma+1)}, \ \ \ \ b =\sqrt{\frac{1}{2}(\gamma-1)}$
\end{document}
"
/>
<p:tag name="EXTERNALNAME" val="TP_tmp"/>
<p:tag name="BLEND" val="0"/>
<p:tag name="TRANSPARENT" val="0"/>
<p:tag name="RESOLUTION" val="1200"/>
<p:tag name="WORKAROUNDTRANSPARENCYBUG" val="0"/>
<p:tag name="ALLOWFONTSUBSTITUTION" val="0"/>
<p:tag name="BITMAPFORMAT" val="pngmono"/>
<p:tag name="ORIGWIDTH" val="138"/>
<p:tag name="PICTUREFILESIZE" val="6105"/>
</p:tagLst>
We have LaTeX!! This is actually a deliverable the client might want!! All of this exploration has not been pointless!
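To sanity-check that the SOURCE value is easy to get at programmatically, here is a small sketch that reads it with the standard library's xml.etree.ElementTree. This is not the approach the final script takes (that one uses xmltodict and pydantic), and read_source_tag is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# PresentationML namespace, as declared by xmlns:p in the tag file
PML_NS = "{http://schemas.openxmlformats.org/presentationml/2006/main}"


def read_source_tag(xml_text: str) -> Optional[str]:
    """Return the value of the SOURCE tag, or None if the file has no such tag."""
    root = ET.fromstring(xml_text)
    for tag in root.iter(f"{PML_NS}tag"):
        # the name/val attributes carry no namespace prefix
        if tag.get("name") == "SOURCE":
            return tag.get("val")
    return None
```

Tag files without a SOURCE entry simply return None, which makes it easy to skip anything that isn't an equation.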
Okay, to really package this into something useful for Dr Lester & co, I need some way to find out which slide this LaTeX string actually corresponds to. That way, I can produce all of the equations in order, and speed up the rewrite significantly.
A lot more digging shows that the tagged XML file is referenced in yet more files:
$ rg "tag433"
[Content_Types].xml
2:[Omitted long line with 1 matches]
ppt/slides/_rels/slide54.xml.rels
2:[Omitted long line with 1 matches]
Bingo!
Now we just need to put together a quick script to make a nice .tex file for Dr Lester. Firstly, let's extract all of our .pptx files:
import zipfile
from pathlib import Path


def unzip_presentation(input: Path, output: Path) -> Path:
    output_dir = output / "zip"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Extract the contents of the pptx file to the temporary directory
    with zipfile.ZipFile(input, "r") as zip_ref:
        zip_ref.extractall(output_dir)

    assert output_dir.is_dir()
    return output_dir
Okay, next we want to get a mapping from tag number to LaTeX source code. You can't reliably regex XML, so I opted to parse the XML with xmltodict and validate it with pydantic:
from pathlib import Path
from typing import List, NewType

import xmltodict
from pydantic import BaseModel, Field

# instances of `Latex` are strings with the type Latex
# this effectively allows us to describe Latex through Python's type system
Latex = NewType("Latex", str)


def tags_to_latex(zip_path: Path) -> dict[str, Latex]:
    """a dictionary from tag to latex, eg. '200' to 'x = 2'"""

    class Tag(BaseModel):
        name: str = Field(alias="@name")
        val: str = Field(alias="@val")

    class Data(BaseModel):
        xmlns_a: str = Field(alias="@xmlns:a")
        xmlns_r: str = Field(alias="@xmlns:r")
        xmlns_p: str = Field(alias="@xmlns:p")
        tag: List[Tag] = Field(alias="p:tag")

    class TagList(BaseModel):
        p_tag_lst: Data = Field(alias="p:tagLst")

    tag_path = zip_path / "ppt" / "tags"
    assert tag_path.is_dir()

    result = {}

    for file_path in tag_path.glob("*.xml"):
        taglist = TagList.parse_obj(xmltodict.parse(file_path.read_text()))
        tag_dict = {tag.name: tag.val for tag in taglist.p_tag_lst.tag}
        if "SOURCE" in tag_dict:
            # convert filepath 'tag200.xml' into '200'
            # remove \begin{document} and \end{document} from the document source
            # (the fixed offsets assume every SOURCE value shares the same preamble)
            result[file_path.stem[3:]] = tag_dict["SOURCE"][88:-15]
    return result
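The hard-coded slice only works while every SOURCE string starts with exactly the same preamble. If that ever stops being true, one alternative is to search for the document markers instead. A minimal sketch, assuming each source contains at most one document environment; extract_document_body is a hypothetical helper, not something the original script has:

```python
def extract_document_body(source: str) -> str:
    """Return the text between the begin/end document markers, stripped."""
    begin, end = r"\begin{document}", r"\end{document}"
    start = source.find(begin)
    stop = source.rfind(end)
    if start == -1 or stop == -1 or stop < start:
        # not a full document: fall back to the whole string
        return source.strip()
    return source[start + len(begin):stop].strip()
```

This trades a couple of string searches for not having to count characters whenever the preamble changes.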
We can do something similar to get a mapping from slide number to a list of tag ids:
from typing import List

from pydantic import BaseModel, Field, validator


def slides_to_tags(zip_path: Path) -> dict[str, list[str]]:
    class Relationship(BaseModel):
        id: str = Field(alias="@Id")
        type: str = Field(alias="@Type")
        target: str = Field(alias="@Target")

    class Relationships(BaseModel):
        xmlns: str = Field(alias="@xmlns")
        relationship: List[Relationship] = Field(alias="Relationship")

        # very annoying fix for xmltodict:
        # if <Relationships> tag contains many <Relationship> child tags, 'relationship' is a list
        # if <Relationships> tag contains only one <Relationship> child tag, 'relationship' is a dict
        @validator("relationship", pre=True, always=True, allow_reuse=True)
        def ensure_list(cls, value):
            if not isinstance(value, list):
                return [value]
            return value
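The snippet above only shows the models, not the loop that walks the .rels files. As a rough, self-contained sketch of how that step could look, here is a version that swaps xmltodict and pydantic for the standard library's ElementTree. The function name slides_to_tags_sketch is hypothetical, and the "../tags/tagNNN.xml" target format is an assumption based on the rg output earlier.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# namespace used by OPC relationship (.rels) files
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def slides_to_tags_sketch(zip_path: Path) -> dict[str, list[str]]:
    """Map slide number to the tag ids referenced by that slide's .rels file."""
    result: dict[str, list[str]] = {}
    rels_dir = zip_path / "ppt" / "slides" / "_rels"
    for rels_file in sorted(rels_dir.glob("slide*.xml.rels")):
        slide_match = re.fullmatch(r"slide(\d+)\.xml\.rels", rels_file.name)
        if slide_match is None:
            continue
        root = ET.fromstring(rels_file.read_bytes())
        tag_ids = []
        for rel in root.iter(f"{RELS_NS}Relationship"):
            # assumed target format: "../tags/tag433.xml"
            tag_match = re.search(r"tags/tag(\d+)\.xml$", rel.get("Target", ""))
            if tag_match:
                tag_ids.append(tag_match.group(1))
        result[slide_match.group(1)] = tag_ids
    return result
```

Because ElementTree always yields a sequence of child elements, the single-child versus many-children quirk that the validator works around does not come up here.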
All that is left is to integrate this with information already provided by python-pptx to get a pleasant output .tex file for Dr Lester.
def create_text(
    prs, output: Path, slides_mapper: dict[str, list[str]], tag_mapper: dict[str, Latex]
) -> None:
    text_dir = output / "text"
    text_dir.mkdir(parents=True, exist_ok=True)

    for idx, slide in enumerate(prs.slides, 1):
        # text_runs will be populated with a list of strings,
        # one for each text run in presentation
        text_runs = [f"% BEGIN SLIDE {idx}\n"]
        file = text_dir / f"slide{idx:03}.tex"
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
                text_runs.append("\n")
        for tag_id in slides_mapper.get(str(idx), []):
            text_runs.append(tag_mapper[tag_id])
        file.write_text("".join(text_runs))
import logging

from pptx import Presentation

for input_file, output_dir in get_pptx_files():
    logging.debug(input_file)
    prs = Presentation(input_file)
    output_zip_dir = unzip_presentation(input_file, output_dir)
    tags_mapper = tags_to_latex(output_zip_dir)
    slides_mapper = slides_to_tags(output_zip_dir)
    create_text(prs, output_dir, slides_mapper, tags_mapper)
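The loop leans on a get_pptx_files helper that isn't shown here. A plausible sketch, assuming the input/ and output/ layout from the unzip command earlier (the name get_pptx_files_sketch and both default paths are guesses, not the original code):

```python
from pathlib import Path
from typing import Iterator


def get_pptx_files_sketch(
    input_root: Path = Path("input"), output_root: Path = Path("output")
) -> Iterator[tuple[Path, Path]]:
    """Yield (input_file, output_dir) pairs, one per .pptx directly under input_root."""
    for input_file in sorted(input_root.glob("*.pptx")):
        # e.g. input/H02_Dirac.pptx -> output/H02_Dirac
        yield input_file, output_root / input_file.stem
```

Sorting the glob keeps the processing order stable between runs, which makes the debug log easier to follow.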
Bon Appétit
Of course, the resulting LaTeX isn't perfect -- it omits images, and there's a lot of editorial work still left to do with the layout / presentation. But it's a very good start.
All in all, 150 lines of Python to save several days' worth of work.
I'd say that's a paycheck I earned!