Scraping All Paper Titles and URLs from an Academia.edu page using Google Apps Script
With an appendix: a recommended add-on for automatically converting Markdown-formatted text from ChatGPT into Google Docs formatting
Previous posts on the topic of automation using Google Apps Script: “Automate Image Downloading Using Google Apps Script” (Aug 29, 2023); "Scripting the Talmud: Automated Talmudic Text Extraction and Formatting"; "Scripting the Talmud Part#2: Automated Rashi Text Extraction and digital layout of tzurat daf". See also: "Data Transformation with the Amazing New ChatGPT-4 ‘Code Interpreter’, with Google Colab"
Introduction
Academia.edu is a popular research-sharing platform, but it doesn't offer sitemaps or any simple way to extract URLs directly. This lack of direct access makes a scraping approach valuable.
To clarify the need: Google Sheets, Excel, and Word all let users insert and copy-paste hyperlinked text. However, extracting a large number of raw URLs from that text is tedious, since there are no built-in functions for doing so.
Web scraping is the technique of extracting data programmatically, in bulk, from websites. It's a valuable tool in scenarios where there is too much data to extract manually.
In this post, I dive into a solution that uses Google Apps Script to scrape the titles and URLs of my papers on my Academia.edu page (registration required), and populate them into a Google Sheet.
Some relevant technical terms defined
Anchor text - Wikipedia: “The anchor text, link label or link text is the visible, clickable text in an HTML hyperlink.”
Regular expression - Wikipedia: “A regular expression (shortened as regex [...]) is a sequence of characters that specifies a match pattern in text.”
Google Apps Script - Wikipedia: “Google Apps Script is a scripting platform developed by Google for light-weight application development in the Google Workspace platform.”
Web scraping - Wikipedia: “Web scraping [...] is data scraping used for extracting data from websites.”
Markdown - Wikipedia: “Markdown is a lightweight markup language for creating formatted text using a plain-text editor.”
Step-by-Step Guide
1. Setting up Google Sheets:
Start by creating a new Google Sheet. This will be our destination for the scraped URLs and link names.
2. Accessing Google Apps Script:
From your Google Sheet, click on Extensions > Apps Script to open the Google Apps Script editor.
3. Writing the Script:
Our scrapeURLs function will:
Fetch the HTML content of the webpage.
Use regular expressions to identify and extract URLs and their associated anchor texts.
Populate the Google Sheet with the extracted data.
Here's the script:
function scrapeURLs() {
  var url = "https://yeshiva.academia.edu/EzraBrand"; // Replace this with the URL you want to scrape
  var response = UrlFetchApp.fetch(url);
  var htmlContent = response.getContentText();

  // Use regex to match all <a> tags, capturing the href and the anchor text
  var regex = /<a[^>]*href="([^"]+)"[^>]*>([^<]+)<\/a>/g;
  var match;
  var links = [];
  while ((match = regex.exec(htmlContent)) !== null) {
    var linkUrl = match[1];
    var linkText = match[2].trim();
    links.push([linkUrl, linkText]);
  }

  // Write URLs and link names to the active Google Sheet in a single batch call
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  if (links.length > 0) {
    sheet.getRange(1, 1, links.length, 2).setValues(links);
  }
}
// Alternative helper (not called by scrapeURLs): recursively collects URLs
// from a parsed XML element tree. This approach was abandoned because
// Academia.edu's pages aren't well-formed XML (see the note below).
function getUrlsFromElement(element) {
  var urls = [];
  var children = element.getChildren();
  children.forEach(function(child) {
    if (child.getName() === "a") {
      var link = child.getAttribute("href");
      if (link) {
        urls.push(link.getValue());
      }
    }
    urls = urls.concat(getUrlsFromElement(child));
  });
  return urls;
}
Note: This method relies on regex and assumes that the link text does not contain any inner HTML tags. Regex was necessary because Academia.edu doesn’t serve a consistent, well-formed XML structure that a parser could handle. If the structure of Academia.edu changes or becomes more complex, the regex approach may need adjustment.
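As a quick sanity check, the same regex can be exercised outside Apps Script on a small snippet of sample HTML. The HTML and URLs below are invented for illustration; only the regex itself comes from the script above:

// Sample HTML, invented for illustration
const html =
  '<p><a class="title" href="https://www.academia.edu/12345/Example_Paper">Example Paper</a>' +
  ' and <a href="https://www.academia.edu/about">About</a></p>';

// Same pattern as in scrapeURLs: capture the href and the anchor text
const regex = /<a[^>]*href="([^"]+)"[^>]*>([^<]+)<\/a>/g;

const links = [];
let match;
while ((match = regex.exec(html)) !== null) {
  links.push([match[1], match[2].trim()]);
}

console.log(links);
// [ [ 'https://www.academia.edu/12345/Example_Paper', 'Example Paper' ],
//   [ 'https://www.academia.edu/about', 'About' ] ]

Note that the second capture group, [^<]+, is exactly why nested tags inside the anchor text would break the match.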
4. Running the Script:
After saving your script in the Google Apps Script editor, click the play (▶️) button to execute the scrapeURLs function.
5. Checking Results:
Return to your Google Sheet. The sheet should now display two columns: one with the scraped URLs and the other with their corresponding link texts.
I then used Google Sheets’ REGEXMATCH function to filter for the URLs that were articles. These are the URLs where a number immediately follows the domain (academia.edu/<number>/...):
=REGEXMATCH(A2, "https://www\.academia\.edu/\d+/")
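The same filter can be expressed in plain JavaScript, which is handy for testing the pattern before putting it in a sheet. The sample URLs below are invented for illustration:

// JavaScript equivalent of the REGEXMATCH filter above
const isArticleUrl = (url) => /https:\/\/www\.academia\.edu\/\d+\//.test(url);

// Sample URLs, invented for illustration
const urls = [
  "https://www.academia.edu/12345/Example_Paper",
  "https://www.academia.edu/about",
  "https://yeshiva.academia.edu/EzraBrand",
];

const articles = urls.filter(isArticleUrl);
console.log(articles); // [ 'https://www.academia.edu/12345/Example_Paper' ]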
First few rows of final output:
Conclusion
While platforms like Google Sheets, Excel, and Word are powerful in their own right, they sometimes lack the specialized tools needed for tasks like URL extraction. Coupled with websites like Academia.edu not offering direct access to lists of URLs, scraping becomes a valuable skill. Using Google Apps Script, I’ve shown how you can quickly extract and catalog URLs and anchor texts from a webpage.
Appendix - Using Google Docs with ‘Markdown’
‘Markdown’, a lightweight markup language, is used by ChatGPT for formatting. However, when transferring Markdown content into platforms like Google Docs, the raw formatting symbols (like hashes and asterisks) are exposed.
An example, from this blog post:
### Step-by-Step Guide
**1. Setting up Google Sheets:**
Enter the "Markdown To Docs (GdocifyMd)" add-on for Google Docs. This tool converts Markdown content into perfectly formatted Google Docs text. It eliminates the need for manual formatting or external conversions. So far, I can recommend it. It’s available as a free Google Docs add-on in the G Suite Marketplace.
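As a rough illustration of what such a conversion involves, here is a minimal sketch that handles only the two Markdown features shown above (heading hashes and bold asterisks). This is my own toy example, not the add-on's actual implementation; a real converter maps these markers to Docs paragraph and text styles rather than simply stripping them:

// Minimal sketch: strip the two Markdown markers shown in the example above.
// A real converter would apply actual Docs heading/bold styles instead.
function stripBasicMarkdown(line) {
  return line
    .replace(/^#{1,6}\s+/, "")          // remove leading heading hashes
    .replace(/\*\*([^*]+)\*\*/g, "$1"); // unwrap bold markers
}

console.log(stripBasicMarkdown("### Step-by-Step Guide"));          // "Step-by-Step Guide"
console.log(stripBasicMarkdown("**1. Setting up Google Sheets:**")); // "1. Setting up Google Sheets:"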