- Created by chrisgrieser
- 🗓 Last Updated: 11/04/23 16:18:15
- 🌟 Stars on GitHub: 60
- Please consider supporting the creator by Starring or Sponsoring them on GitHub!
From their README
PDF Annotation Extractor
An Alfred Workflow to extract annotations as Markdown & insert Pandoc Citations as References.
Automatically determines correct page numbers, merges highlights across page breaks, prepends a YAML header with bibliographic information, and adds some more small quality-of-life conveniences.
Installation
- Requirement: Alfred 5 with Powerpack
- Install Homebrew
- Install `pdfannots2json` by pasting the following into your terminal: `brew install mgmeyers/pdfannots2json/pdfannots2json`
- Download the latest release.
- Set the hotkey by double-clicking the sky-blue field at the top left.
- Set up the workflow configuration inside the app.
Usage
Requirements for the PDF
- The PDF Annotation Extractor works on any PDF that has valid annotations saved in the PDF file. (Some PDF readers like Skim or Zotero 6 do not store annotations in the PDF itself by default.)
- The filename of the PDF must be exactly the citekey (without `@`), optionally followed by an underscore and some text like `{citekey}_{title}.pdf`. The citekey must not contain underscores (`_`).
Note
You can achieve such a filename pattern with automatic renaming rules of most reference managers, for example with the ZotFile plugin for Zotero or the AutoFile feature of BibDesk.
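The filename rule above can be illustrated with a minimal sketch. The function name `citekey_from_filename` and the example filenames are hypothetical and not part of the workflow; the sketch only relies on the stated rule that the citekey contains no underscores, so everything before the first underscore is the citekey.

```python
from pathlib import Path


def citekey_from_filename(pdf_path: str) -> str:
    """Return the citekey encoded in a PDF filename (hypothetical helper).

    Assumes the pattern `{citekey}.pdf` or `{citekey}_{title}.pdf`.
    """
    stem = Path(pdf_path).stem  # filename without directory and ".pdf"
    # The citekey must not contain underscores, so split at the first one.
    return stem.split("_", 1)[0]


print(citekey_from_filename("Grieser2023_Platform Governance.pdf"))
print(citekey_from_filename("/Users/me/Papers/Grieser2023.pdf"))
```

Both calls print `Grieser2023`. A citekey that itself contained an underscore would be cut off at that point, which is why the rule forbids underscores in citekeys.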
Basics
Use the hotkey to trigger the Annotation Extraction on the PDF file currently selected in Finder.
Annotation Types extracted
- Highlight ➡️ bullet point, quoting text and prepending the comment
- Underline ➡️ output to Drafts.app; they are not included in the annotations.
- Free Comment ➡️ blockquote of the comment text
- Strikethrough ➡️ Markdown strikethrough
- Rectangle ➡️ image
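As a rough illustration of the mapping above, the following sketch renders three of the annotation types as Markdown. The function `render_annotation`, the type names, and the exact layout of a highlight line (comment first, then the quoted text) are assumptions made for this example; the workflow's real output may differ, for instance by appending a Pandoc citation.

```python
def render_annotation(kind: str, text: str = "", comment: str = "") -> str:
    """Render one annotation as Markdown (hypothetical sketch).

    `text` is the marked PDF text, `comment` is the note attached to it.
    """
    if kind == "highlight":
        # Bullet point quoting the text, with the comment prepended if present.
        prefix = f"{comment} " if comment else ""
        return f'- {prefix}"{text}"'
    if kind == "free_comment":
        # Free comments become a blockquote of the comment text.
        return f"> {comment}"
    if kind == "strikethrough":
        # Markdown strikethrough syntax.
        return f"~~{text}~~"
    raise ValueError(f"annotation type not covered by this sketch: {kind}")


print(render_annotation("highlight", text="platforms govern apps", comment="key claim:"))
print(render_annotation("free_comment", comment="compare with chapter 2"))
print(render_annotation("strikethrough", text="outdated figure"))
```

Underlines and rectangles are left out on purpose, since they are routed elsewhere (Drafts.app and image files respectively) rather than rendered inline.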
Automatic Page Number Identification
Instead of the PDF page numbers, this workflow retrieves information about the real page numbers from the BibTeX library and inserts them. If there is no page data in the BibTeX entry (e.g., monographs), you are prompted to enter the page number manually.
- In that case, enter the real page number of your first PDF page.
- In case there is content before the actual text (e.g., a foreword or Table of Contents), the real page number `1` often occurs later in the PDF. In that case, you must enter a negative page number, reflecting the true page number the first PDF page would have. Example: Your PDF is a book which has a foreword and uses Roman numerals for it; real page number 1 is PDF page number 12. If you continued the numbering backwards, the first PDF page would have page number `-10`, so you enter the value `-10` when prompted for a page number.
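The arithmetic behind this can be sketched as follows. The helper names and the assumption that the first page is read from the start of a BibTeX `pages` value such as `205--230` are illustrative only; the source does not describe how the workflow parses the field.

```python
import re
from typing import Optional


def first_page_from_bibtex(pages: str) -> Optional[int]:
    """Return the first page of a BibTeX `pages` value like "205--230".

    Hypothetical helper; returns None if the value has no leading number.
    """
    match = re.match(r"\s*(\d+)", pages)
    return int(match.group(1)) if match else None


def real_page(pdf_page: int, first_page_number: int) -> int:
    """Map a 1-based PDF page index to the real (printed) page number.

    `first_page_number` is the real page number of the first PDF page,
    which may be negative when front matter precedes page 1.
    """
    return first_page_number + (pdf_page - 1)


# Book example from above: the first PDF page is entered as -10.
print(real_page(12, -10))  # real page 1 is PDF page 12
print(real_page(1, -10))   # the first PDF page itself

# Journal article whose BibTeX entry has pages = {205--230}.
start = first_page_from_bibtex("205--230")
print(real_page(3, start))
```

With an offset of `-10`, PDF page 12 maps to real page 1, matching the example in the text. Note that the backwards count passes through 0, which is why the first page comes out as `-10` rather than `-11`.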
Annotation Codes
Insert these special codes at the beginning of an annotation to invoke special actions on that annotation. Annotation Codes do not apply to Strikethroughs. (You can run the Alfred command `acode` to display a cheat sheet showing all the following information.)
- `+`: Merge this highlight/underline with the previous highlight/underline. Works for annotations on the same page (= skipping text in between) and for annotations across two pages.
- `? foo` (free comments): Turns "foo" into a Question Callout (`> [!QUESTION]`) and moves it up. (Callouts are Obsidian-specific syntax.)
- `##`: Turns highlighted/underlined text into a heading that is added at that location. The number of `#` determines the heading level. If the annotation is a free comment, the text following the `#` is used as heading instead (space after `#` required).
- `=`: Adds highlighted/underlined text as tags to the YAML frontmatter (mostly used for Obsidian as output). If the annotation is a free comment, uses the text after the `=`. In both cases, the annotation is removed afterwards.
- `_` (highlights only): Removes the `_` and creates a copy of the annotation, but with the type `underline`. This annotation code avoids having to highlight and underline the same text segment to have it in both places.
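To make the rules above concrete, here is a minimal sketch of how a leading annotation code could be recognized. It is not the workflow's actual implementation: the function `parse_annotation_code`, the type names (`highlight`, `underline`, `free_comment`, `strikethrough`), and the action labels are all hypothetical, and the sketch assumes the code sits at the start of the annotation's comment.

```python
import re


def parse_annotation_code(comment: str, kind: str) -> tuple[str, str]:
    """Return (action, remaining_text) for a leading annotation code.

    Hypothetical sketch; `kind` is the annotation type.
    """
    text = comment.lstrip()
    if kind == "strikethrough":
        return ("none", text)  # codes do not apply to strikethroughs
    if text.startswith("+") and kind in ("highlight", "underline"):
        return ("merge_with_previous", text[1:].lstrip())
    if text.startswith("?") and kind == "free_comment":
        return ("question_callout", text[1:].lstrip())
    # One or more "#", followed by either nothing or whitespace plus text.
    heading = re.match(r"(#+)(\s+.*)?$", text, re.DOTALL)
    if heading:
        level = len(heading.group(1))
        return (f"heading_level_{level}", (heading.group(2) or "").strip())
    if text.startswith("="):
        return ("add_tags", text[1:].strip())
    if text.startswith("_") and kind == "highlight":
        return ("copy_as_underline", text[1:].lstrip())
    return ("none", text)


print(parse_annotation_code("+", "highlight"))
print(parse_annotation_code("? foo", "free_comment"))
print(parse_annotation_code("## Methods", "free_comment"))
print(parse_annotation_code("_ important", "underline"))
```

Because a space after `#` is required, a comment such as `#hashtag` is not treated as a heading in this sketch, and the `_` code is ignored on anything but highlights.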
Extracting Images
- The respective image is saved in the `attachments` subfolder of the output folder, and named `{citekey}_image{n}.png`.
- The image is embedded in the Markdown file with the `![[ ]]` syntax, e.g. `![[filename.png|foobar]]`
- Any `rectangle` type annotation in the PDF is extracted as an image.
- If the rectangle annotation has any comment, it is used as the alt-text for the image. (Note that some PDF readers like PDF Expert do not allow you to add a comment to rectangular annotations.)
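The naming and embedding rules above could look like this in code. The function `image_embed` is a hypothetical helper, and starting the counter `n` at 1 is an assumption, since the source does not say where the numbering begins.

```python
def image_embed(citekey: str, n: int, comment: str = "") -> tuple[str, str]:
    """Return (relative_path, markdown_embed) for the n-th extracted image.

    Hypothetical sketch of the `{citekey}_image{n}.png` naming scheme.
    """
    filename = f"{citekey}_image{n}.png"
    # Images are stored in the `attachments` subfolder of the output folder.
    relative_path = f"attachments/{filename}"
    # The rectangle's comment, if any, becomes the alt-text after the pipe.
    embed = f"![[{filename}|{comment}]]" if comment else f"![[{filename}]]"
    return relative_path, embed


path, embed = image_embed("Grieser2023", 1, "foobar")
print(path)
print(embed)
```

This prints `attachments/Grieser2023_image1.png` and `![[Grieser2023_image1.png|foobar]]`. Without a comment, the pipe and alt-text are simply omitted.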
Troubleshooting
- Update to the latest version of `pdfannots2json` by running `brew upgrade pdfannots2json` in your terminal.
- This workflow does not work with annotations that are not actually saved in the PDF file. Some PDF readers like Skim or Zotero 6 store annotations outside the PDF by default, but you can tell those PDF readers to save the notes in the actual PDF.
- This workflow sometimes does not work when the PDF has bigger free-form annotations (e.g., from using a stylus on a tablet). Delete all those annotations that are "free form" and the workflow should work.
- If the hotkey does not work when triggered in Preview, most likely the Alfred app does not have permission to access the app. You can give Alfred permission in the macOS System Settings.
- There are some cases where the extracted text is all jumbled up. In that case, it is a problem with the upstream `pdfannots2json`. The issue is tracked here, and you can also report your problem.
Note
As a fallback, you can use `pdfannots` as extraction engine, as a different PDF engine sometimes fixes issues. This requires installing pdfannots via `pip3 install pdfannots`, and switching the fallback engine via `aconf`. Note that `pdfannots` does not support image extraction or extracting only recent annotations, so generally you want to keep using `pdfannots2json`.
Credits
Thanks
- Thanks to Andrew Baumann for pdfannots, which caused me to develop this workflow (even though it does not use `pdfannots` anymore).
- Also many thanks to @mgmeyers for pdfannots2json, which enabled many improvements to this workflow.
- I also thank @StPag for his ideas on annotation codes.
- Icons created by Freepik/Flaticon.
About the Developer
In my day job, I am a sociologist studying the social mechanisms underlying the digital economy. For my PhD project, I investigate the governance of the app economy and how software ecosystems manage the tension between innovation and compatibility. If you are interested in this subject, feel free to get in touch!