Guides
Extract & summarize websites
֍
Extract and summarize content from web pages

In this example, we'll generate a concise summary of the films showing at a local indie movie theater.

Before we begin the Substrate part of this example, we'll first fetch the html content of the theater's calendar page, and use BeautifulSoup (opens in a new tab) to find urls for all films showing today or tomorrow.

script.py

import requests, re
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
today = datetime.today()
tomorrow = today + timedelta(days=1)
today_str = today.strftime("%Y-%m-%d")
tomorrow_str = tomorrow.strftime("%Y-%m-%d")
response = requests.get(f"https://metrograph.com/calendar")
soup = BeautifulSoup(response.text, "html.parser")
pattern = r"^/film/\?vista_film_id="
pattern_re = re.compile(pattern)
urls = []
today_el = soup.find(id=f"calendar-list-day-{today_str}")
tomorrow_el = soup.find(id=f"calendar-list-day-{tomorrow_str}")
for anchor in today_el.find_all("a", href=True):
if pattern_re.match(anchor["href"]):
urls.append(f"https://metrograph.com{anchor['href']}")
for anchor in tomorrow_el.find_all("a", href=True):
if pattern_re.match(anchor["href"]):
urls.append(f"https://metrograph.com{anchor['href']}")
urls = list(set(urls))

Now, we'll process all of these urls in parallel using Substrate nodes.

First, initialize Substrate and create a model for film info, which we'll use to generate a JSON summary of the film web page's content.

Python
TypeScript

from substrate import (
Substrate,
ComputeJSON,
ComputeText,
RunPython,
sb,
)
from pydantic import BaseModel, Field
s = Substrate(api_key=YOUR_API_KEY)
class Film(BaseModel):
title: str = Field(..., description="The film's title")
url: str = Field(..., description="The film's url from the top of the text")
description: str = Field(..., description="A short summary of the film including genre, director, and year.")
showtimes: list[str] = Field(..., description="List of showtimes for the film.")

For each of the urls we found on the calendar page, we'll:

  1. Get the markdown content of the film page using RunPython
  2. Generate a JSON summary of the markdown content using ComputeJSON
  3. Collect all the JSON summaries and generate a markdown summary using ComputeText
Python
TypeScript

def markdown(url: str):
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify
print(url)
res = requests.get(url)
soup = BeautifulSoup(res.content, "html.parser")
return markdownify(str(soup))
summaries = []
mds = []
for url in urls:
# 1. Get the markdown content of the page
md = RunPython(
function=markdown,
kwargs={
"url": url,
},
pip_install=["requests", "beautifulsoup4", "markdownify"],
)
mds.append(md)
# 2. Generate a JSON summary of the markdown content
summary = ComputeJSON(
prompt=sb.concat(
"Summarize the following markdown content from a web page describing a film playing at local theater\n ",
md.future.output),
json_schema=Film.model_json_schema(),
node="Llama3Instruct8B",
)
summaries.append(summary)
# 3. Collect all the JSON summaries and generate a markdown summary
markdown = ComputeText(
prompt=sb.concat(
"Generate markdown summarizing the following movies. Include title in brackets followed by the url of the film in parentheses, e.g. [title](url). Include a very concise one sentence description. Include showtimes. Today is the earliest date that appears, replace the appropriate dates with TODAY and TOMORROW. Do not include preamble before the markdown. Categorize the movies into a few genres.\n",
*[sb.jq(s.future.json_object, "@json") for s in summaries],
),
node="Llama3Instruct70B",
)
res = s.run(markdown)

Example Output

Thriller/Suspense

Lost Highway (opens in a new tab) - A jazz musician's life is turned upside down when he is accused of murdering his wife. Showtimes: TODAY at 7:50pm, Monday June 10 at 9:30pm, Thursday June 13 at 4:30pm

Three Days of the Condor (opens in a new tab) - A CIA codebreaker must survive a deadly conspiracy within his own agency. Showtimes: 9:30pm

Action/Crime

AGGRO DR1FT (opens in a new tab) - A tormented assassin takes on a Miami crime syndicate while balancing family life. Showtimes: TODAY at 10:30pm