Andreas Demmelbauer 4025c6f580 polishing 5 years ago

Feedcake

“Give me a piece of cake and I want the whole cake.”

The Problem

Most news platforms don’t provide the full article via RSS/Atom.
That alone wouldn’t be a big problem, but some of them do creepy 1984-ish tracking on their websites, or put up paywalls for users with privacy add-ons.

Goal of this script

Get a full-featured news feed (full articles, including images) from various news sites.

Benefits for the user

  • They don’t need to visit the website to read the articles
  • No ads
  • No tracking

Possible downsides for the user

  • Articles don’t get updated once they are scraped
  • Articles arrive with some delay
  • Interactive/special elements in articles may not work

What it does

  • Fetches the news feed from the original website
  • Scrapes the content of new entries and saves it into a directory structure
  • Excludes articles whose title contains a string from the ‘exclude’ list
  • Saves a full-featured RSS file
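The exclude filter and the file naming above can be sketched like this (a minimal illustration only; the helper names `is_excluded` and `entry_filename` are hypothetical and don’t reflect feedcake.py’s actual code):

```python
import re

def is_excluded(title, excludes):
    """True if any string from the 'exclude' list appears in the title."""
    return any(word.lower() in title.lower() for word in excludes)

def entry_filename(title):
    """Derive a safe file name for a scraped article's HTML."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug + ".html"

# Example: filter a fetched feed before scraping the articles.
entries = [
    {"title": "Local news: cake festival"},
    {"title": "ADVERTORIAL: buy stuff now"},
]
excludes = ["advertorial"]
kept = [e for e in entries if not is_excluded(e["title"], excludes)]
```

Matching is case-insensitive here; whether the real script compares case-sensitively is up to its implementation.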

... and what it doesn’t

  • Managing when it scrapes (but install instructions for crontab are included)
  • Serving the feeds and assets via HTTPS (use your favorite web server for that)
  • Dealing with article comments
  • Archiving feeds (content and assets are kept, but without metadata)
  • Using any sort of database (the file structure is everything)
  • Cleaning up old assets
  • Automatically updating the basedir if it changed

Ugly stuff?

  • The HTML files (feed content) are stored alongside the assets, even though they don’t need to be exposed via HTTPS.
  • Almost no exception handling yet.

How to use

  • git clone this project and enter the directory
  • Install python3, pip and virtualenv
  • Create a virtualenv: virtualenv -p python3 ~/.virtualenvs/feedcake
  • Activate your new virtualenv: source ~/.virtualenvs/feedcake/bin/activate
  • Switch into the project directory: cd feedcake
  • Install the requirements: pip3 install -r requirements.txt
  • Copy the config example: cp config.example.json config.json
  • Edit config.json
  • Copy the cron example: cp cron-example.sh cron.sh
  • Edit cron.sh
  • Make cron.sh executable: chmod +x cron.sh
  • Add a cronjob for cron.sh: crontab -e
    • */5 * * * * /absolute/path/to/cron.sh >> /path/to/logfile 2>&1
  • setup your webserver:
    • Point your web server at the feeds directory; you should protect that HTTP path with basic authentication.
    • Make the assets_url specified in the config point to the assets directory.
  • After running the script for the first time, your desired feed is available at base_url/destination (e.g. https://yourdomain.tld/some-url/newspaper.xml)
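For the web-server step, a server block along these lines could work (a sketch for nginx; the paths, hostname, URL prefixes, and htpasswd file are assumptions to be adapted to your config.json):

```nginx
server {
    listen 443 ssl;
    server_name yourdomain.tld;

    # Feeds: protected with basic authentication
    location /some-url/ {
        alias /path/to/feedcake/feeds/;
        auth_basic "Feedcake";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }

    # Assets: public, without authentication,
    # matching the assets_url in config.json
    location /assets/ {
        alias /path/to/feedcake/assets/;
    }
}
```

Keeping the assets location outside the authenticated path matters, since feed readers fetch images without credentials.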

TODOs

  • Handle exceptions
  • Decide what should happen with old news articles and assets that are no longer listed in the current feed