Hacking PowerPoint for Fun & Profit
At the start of summer, I wrote code for the Physics department at
Cambridge to help them rewrite ~500 pages of dense, mathematical lecture
notes. It's a fun little hack I pieced together to extract metadata from
a proprietary extension deep within a pptx file.
The department actually framed the project as unskilled work, and I applied with the suspicion that I could automate a lot of the role.
I thought that it was possible to scrape content from the PowerPoint file and export it into a LaTeX file. My initial plan was to export every image from the PowerPoint and put it through an online image-to-LaTeX API, but I realized there was a better solution. I speculated that there had to be some sort of LaTeX metadata inside somewhere, which I could potentially extract, saving many hours of work.
Anyway, the first step for me was to explore the
python-pptx library I was using to deserialize PowerPoint
files. I found it relatively straightforward to extract text from
slides, but I had a hard time digging through things to find any
metadata:
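>>> from inspect import getmembers
>>> from pprint import pprint
>>> from pptx import Presentation
>>> prs = Presentation("input/H02_Dirac.pptx")  # any .pptx file will do
>>> pprint(getmembers(prs))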
[('__class__', <class 'pptx.presentation.Presentation'>),
... omitted 27 dunder methods ...
('_element', <Element {http://schemas.openxmlformats.org/presentationml/2006/main}presentation at 0x7fa0a535e440>),
('_part', <pptx.parts.presentation.PresentationPart object at 0x7fa0a56ad1d0>),
('core_properties', <pptx.parts.coreprops.CorePropertiesPart object at 0x7fa0a5682910>),
('element', <Element {http://schemas.openxmlformats.org/presentationml/2006/main}presentation at 0x7fa0a535e440>),
('notes_master', <pptx.slide.NotesMaster object at 0x7fa0a4f3df90>),
('part', <pptx.parts.presentation.PresentationPart object at 0x7fa0a56ad1d0>),
('save', <bound method Presentation.save of <pptx.presentation.Presentation object at 0x7fa0a58b3650>>),
('slide_height', 6858000),
('slide_layouts', <pptx.slide.SlideLayouts object at 0x7fa0a52d7bd0>),
('slide_master', <pptx.slide.SlideMaster object at 0x7fa0a4f3fc10>),
('slide_masters', <pptx.slide.SlideMasters object at 0x7fa0a4f3fc90>),
('slide_width', 9906000),
('slides', <pptx.slide.Slides object at 0x7fa0a4f3fbd0>)]
Fun Fact: PowerPoint Files are just zipped directories of plaintext
XML. You can find any old .pptx file on your machine and
unzip it:
$ unzip -d output/H02_Dirac/ input/H02_Dirac.pptx
$ exa output/ --tree --only-dirs
output/H02_Dirac
├── _rels
├── docProps
└── ppt
   ├── _rels
   ├── handoutMasters
   │  └── _rels
   ├── media
   ├── notesMasters
   │  └── _rels
   ├── notesSlides
   │  └── _rels
   ├── slideLayouts
   │  └── _rels
   ├── slideMasters
   │  └── _rels
   ├── slides
   │  └── _rels
   ├── tags
   └── theme
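You don't even need to unzip anything to get that overview, since Python's zipfile module can read the archive directly. Here's a minimal sketch of the idea. The helper name `top_level_dirs` is made up for this example, and the archive it runs on is a tiny stand-in built in memory, not a real deck.

```python
import io
import zipfile


def top_level_dirs(pptx: zipfile.ZipFile) -> list[str]:
    """Return the sorted top-level directory names inside an open archive."""
    dirs = set()
    for name in pptx.namelist():
        head, sep, _ = name.partition("/")
        if sep:  # skip files at the archive root, eg. [Content_Types].xml
            dirs.add(head)
    return sorted(dirs)


# Stand-in archive with made-up contents (an assumption for the demo).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("_rels/.rels", "<Relationships/>")
    archive.writestr("docProps/core.xml", "<cp/>")
    archive.writestr("ppt/slides/slide1.xml", "<p:sld/>")
    archive.writestr("ppt/tags/tag1.xml", "<p:tagLst/>")

with zipfile.ZipFile(buffer) as archive:
    print(top_level_dirs(archive))
```

On a real file you would open it with `zipfile.ZipFile("input/H02_Dirac.pptx")` instead of the in-memory buffer.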
I think I dug around through these files for a while, trying to loosely understand the file structure. Maybe at one point I grepped for something LaTeX-related, and hit the lottery:
$ rg \frac --files-with-matches
ppt/tags/tag284.xml
ppt/tags/tag408.xml
ppt/tags/tag414.xml
...
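The same search can be done from Python without extracting the archive first. This is only a sketch of the grep step, not the code I actually ran: `members_containing` is a hypothetical helper, and the archive below is a made-up stand-in with two tag files.

```python
import io
import zipfile


def members_containing(pptx: zipfile.ZipFile, needle: str) -> list[str]:
    """Return the names of XML members whose text contains `needle`."""
    hits = []
    for name in pptx.namelist():
        if not name.endswith(".xml"):
            continue  # skip images and other binary media
        text = pptx.read(name).decode("utf-8", errors="replace")
        if needle in text:
            hits.append(name)
    return sorted(hits)


# Stand-in archive with made-up contents (an assumption for the demo).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ppt/slides/slide1.xml", "<p:sld/>")
    archive.writestr("ppt/tags/tag1.xml", '<p:tag name="RESOLUTION" val="1200"/>')
    archive.writestr("ppt/tags/tag2.xml", r'<p:tag name="SOURCE" val="$\frac{1}{2}$"/>')

with zipfile.ZipFile(buffer) as archive:
    matches = members_containing(archive, r"\frac")
print(matches)
```

The raw string `r"\frac"` keeps the backslash literal, so only files that really contain the LaTeX command match.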
Damn! Let's cat a file and see what we got:
$ cat ppt/tags/tag433.xml
<p:tagLst
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
<p:tag
name="SOURCE"
val="\documentclass{article}\pagestyle{empty}
\usepackage{partiiiparticles}
\begin{document}

$a = \sqrt{\frac{1}{2}(\gamma+1)}, \ \ \ \ b =\sqrt{\frac{1}{2}(\gamma-1)}$
\end{document}
"
/>
<p:tag name="EXTERNALNAME" val="TP_tmp"/>
<p:tag name="BLEND" val="0"/>
<p:tag name="TRANSPARENT" val="0"/>
<p:tag name="RESOLUTION" val="1200"/>
<p:tag name="WORKAROUNDTRANSPARENCYBUG" val="0"/>
<p:tag name="ALLOWFONTSUBSTITUTION" val="0"/>
<p:tag name="BITMAPFORMAT" val="pngmono"/>
<p:tag name="ORIGWIDTH" val="138"/>
<p:tag name="PICTUREFILESIZE" val="6105"/>
</p:tagLst>
We have LaTeX!! This is actually a deliverable the client might want!! All of this exploration has not been pointless!
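To show how little it takes to pull the SOURCE value out of a file like this, here's a small sketch using the standard library's xml.etree.ElementTree. The `SAMPLE` document is a cut-down, made-up version of the tag file above, and `read_tags` is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the `p:` prefix in the tag file above.
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"

# Cut-down stand-in for a real ppt/tags/tagN.xml file (made-up contents).
SAMPLE = r"""<p:tagLst xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <p:tag name="SOURCE" val="\begin{document} $x = 2$ \end{document}"/>
  <p:tag name="RESOLUTION" val="1200"/>
</p:tagLst>"""


def read_tags(xml_text: str) -> dict[str, str]:
    """Map each <p:tag> name attribute to its val attribute."""
    root = ET.fromstring(xml_text)
    return {tag.get("name"): tag.get("val") for tag in root.iter(f"{{{P_NS}}}tag")}


tags = read_tags(SAMPLE)
print(tags["SOURCE"])
```

ElementTree addresses elements by their full namespace, which is why the lookup uses `{namespace}tag` rather than `p:tag`.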
Okay, to really package this into something useful for Dr Lester & co, I need some way to find out which slide this LaTeX string actually corresponds to. That way, I can produce all of the equations in order, and speed up the rewrite significantly.
A lot more digging shows that the tag XML file is referenced in yet more files:
$ rg "tag433"
[Content_Types].xml
2:[Omitted long line with 1 matches]
ppt/slides/_rels/slide54.xml.rels
2:[Omitted long line with 1 matches]
Bingo!
Now we just need to put together a quick script to make a nice
.tex file for Dr Lester. Firstly, let's extract all of our
.pptx files:
import zipfile
from pathlib import Path


def unzip_presentation(input: Path, output: Path) -> Path:
    output_dir = output / "zip"
    output_dir.mkdir(parents=True, exist_ok=True)
    # Extract the contents of the pptx file to the output directory
    with zipfile.ZipFile(input, "r") as zip_ref:
        zip_ref.extractall(output_dir)
    assert output_dir.is_dir()
    return output_dir
Okay, next we want to get a mapping from tag number to LaTeX source
code. You can't
reliably regex XML, so I opted to parse the XML with
xmltodict and validate with pydantic:
from typing import List, NewType

import xmltodict
from pydantic import BaseModel, Field

# instances of `Latex` are strings with the type Latex
# this effectively allows us to describe Latex through Python's type system
Latex = NewType("Latex", str)


def tags_to_latex(zip_path: Path) -> dict[str, Latex]:
    """a dictionary from tag to latex, eg. '200' to 'x = 2'"""

    class Tag(BaseModel):
        name: str = Field(alias="@name")
        val: str = Field(alias="@val")

    class Data(BaseModel):
        xmlns_a: str = Field(alias="@xmlns:a")
        xmlns_r: str = Field(alias="@xmlns:r")
        xmlns_p: str = Field(alias="@xmlns:p")
        tag: List[Tag] = Field(alias="p:tag")

    class TagList(BaseModel):
        p_tag_lst: Data = Field(alias="p:tagLst")

    tag_path = zip_path / "ppt" / "tags"
    assert tag_path.is_dir()
    result: dict[str, Latex] = {}
    for file_path in tag_path.glob("*.xml"):
        taglist = TagList.parse_obj(xmltodict.parse(file_path.read_text()))
        tag_dict = {tag.name: tag.val for tag in taglist.p_tag_lst.tag}
        if "SOURCE" in tag_dict:
            # convert filepath 'tag200.xml' into '200'
            # drop the first 88 characters (the preamble up to and including
            # \begin{document}) and the last 15 (\end{document} plus newline);
            # this assumes every SOURCE uses the same fixed-length preamble
            result[file_path.stem[3:]] = Latex(tag_dict["SOURCE"][88:-15])
    return result
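The fixed `[88:-15]` slice works because every equation in these decks shares the same preamble, but it would silently mangle a SOURCE with a different header. A less brittle alternative, which is not what the script above does, is to search for the document markers. Here's a sketch: `document_body` is a hypothetical helper, and unlike the slice it also strips surrounding whitespace.

```python
BEGIN = r"\begin{document}"
END = r"\end{document}"


def document_body(source: str) -> str:
    """Return the text between the document environment markers, stripped.

    Falls back to the whole (stripped) string if either marker is missing.
    """
    start = source.find(BEGIN)
    end = source.rfind(END)
    if start == -1 or end == -1 or end < start:
        return source.strip()
    return source[start + len(BEGIN):end].strip()


# Same preamble as the tag file shown earlier, with a shorter equation.
source = (
    "\\documentclass{article}\\pagestyle{empty}\n"
    "\\usepackage{partiiiparticles}\n"
    "\\begin{document}\n"
    "\n"
    "$x = 2$\n"
    "\\end{document}\n"
)
print(document_body(source))
```

For this preamble the two approaches agree up to whitespace: `source[88:-15].strip()` gives the same string.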
We can do something similar to get a mapping from slide number to a list of tag ids:
from pydantic import BaseModel, Field, validator


def slides_to_tags(zip_path: Path) -> dict[str, list[str]]:
    class Relationship(BaseModel):
        id: str = Field(alias="@Id")
        type: str = Field(alias="@Type")
        target: str = Field(alias="@Target")

    class Relationships(BaseModel):
        xmlns: str = Field(alias="@xmlns")
        relationship: List[Relationship] = Field(alias="Relationship")

        # very annoying fix for xmltodict:
        # if <Relationships> contains many <Relationship> child tags, 'Relationship' is a list
        # if <Relationships> contains only one <Relationship> child tag, 'Relationship' is a dict
        @validator("relationship", pre=True, always=True, allow_reuse=True)
        def ensure_list(cls, value):
            if not isinstance(value, list):
                return [value]
            return value
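For a rough idea of how the slide-to-tag mapping can be built from the `.rels` files, here's a standalone sketch. It swaps xmltodict and pydantic for the standard library's ElementTree, so it is not the function above. The name `slides_to_tags_sketch`, the `hypothetical/...` relationship types, and the assumption that tag targets look like `../tags/tag433.xml` are all mine.

```python
import re
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by the <Relationships> root of .rels files.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def slides_to_tags_sketch(zip_path: Path) -> dict[str, list[str]]:
    """Map slide number to tag ids, eg. '54' to ['433']."""
    rels_dir = zip_path / "ppt" / "slides" / "_rels"
    result: dict[str, list[str]] = {}
    for rels_path in sorted(rels_dir.glob("slide*.xml.rels")):
        slide_match = re.fullmatch(r"slide(\d+)\.xml\.rels", rels_path.name)
        if slide_match is None:
            continue
        tag_ids = []
        root = ET.parse(rels_path).getroot()
        for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
            # Assumed target shape: '../tags/tag433.xml'
            tag_match = re.search(r"tags/tag(\d+)\.xml$", rel.get("Target", ""))
            if tag_match:
                tag_ids.append(tag_match.group(1))
        result[slide_match.group(1)] = tag_ids
    return result


# Demo on a made-up extracted tree containing a single slide.
with tempfile.TemporaryDirectory() as tmp:
    rels_dir = Path(tmp) / "ppt" / "slides" / "_rels"
    rels_dir.mkdir(parents=True)
    (rels_dir / "slide54.xml.rels").write_text(
        f'<Relationships xmlns="{RELS_NS}">'
        '<Relationship Id="rId1" Type="hypothetical/tags" Target="../tags/tag433.xml"/>'
        '<Relationship Id="rId2" Type="hypothetical/layout" Target="../slideLayouts/slideLayout2.xml"/>'
        "</Relationships>"
    )
    mapping = slides_to_tags_sketch(Path(tmp))
print(mapping)
```

ElementTree always yields an iterable of child elements, so the one-child-versus-many quirk that the validator works around doesn't come up here.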
All that is left is to integrate this with information already
provided by python-pptx to get a pleasant output
.tex file for Dr Lester.
def create_text(
    prs, output: Path, slides_mapper: dict[str, list[str]], tag_mapper: dict[str, Latex]
) -> None:
    text_dir = output / "text"
    text_dir.mkdir(parents=True, exist_ok=True)
    # text_runs will be populated with a list of strings,
    # one for each text run in presentation
    for idx, slide in enumerate(prs.slides, 1):
        text_runs = [f"% BEGIN SLIDE {idx}\n"]
        file = text_dir / f"slide{idx:03}.tex"
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
                text_runs.append("\n")
        for tag_id in slides_mapper.get(str(idx), []):
            text_runs.append(tag_mapper[tag_id])
        file.write_text("".join(text_runs))
        text_runs = []
import logging

from pptx import Presentation

for input_file, output_dir in get_pptx_files():
    logging.debug(input_file)
    prs = Presentation(input_file)
    output_zip_dir = unzip_presentation(input_file, output_dir)
    tags_mapper = tags_to_latex(output_zip_dir)
    slides_mapper = slides_to_tags(output_zip_dir)
    create_text(prs, output_dir, slides_mapper, tags_mapper)
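The loop calls `get_pptx_files`, which isn't shown in this post. Here's one hypothetical shape it could take, assuming the `input/` and `output/` layout from the unzip example earlier: each deck in the input directory gets its own output directory named after the file. The default paths and parameter names are guesses for illustration only.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def get_pptx_files(
    input_dir: Path = Path("input"), output_root: Path = Path("output")
) -> Iterator[tuple[Path, Path]]:
    """Yield (pptx file, per-deck output directory) pairs, sorted by name."""
    for input_file in sorted(input_dir.glob("*.pptx")):
        yield input_file, output_root / input_file.stem


# Demo on a throwaway directory with one empty stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "input"
    input_dir.mkdir()
    (input_dir / "H02_Dirac.pptx").touch()
    (input_dir / "notes.txt").touch()  # ignored: not a .pptx file
    pairs = [
        (pptx.name, out.name)
        for pptx, out in get_pptx_files(input_dir, Path(tmp) / "output")
    ]
print(pairs)
```

The function only computes paths. Creating the output directory is left to `unzip_presentation` and `create_text`, which both call `mkdir` themselves.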
Bon Appétit
Of course, the resulting LaTeX isn't perfect -- it omits images, and there's a lot of editorial work still left to do with the layout / presentation. But it's a very good start.
All in all, 150 lines of Python to save several days' worth of work.
I'd say that's a paycheck I earned!