Feedcake

“Give me a piece of cake and I want the whole cake.”

The Problem

Most news platforms don’t give you the full article via RSS/Atom.
That alone wouldn’t be a big problem, but some of them do crazy 1984-ish stuff on their websites, or they have put up paywalls for users with privacy add-ons.

Goal of this script

Getting a full-featured news feed (full articles with images) from various news pages

Benefits for the user

  • They don’t need to visit the website to read the articles
  • No ads
  • No tracking

Possible downsides for the user

  • articles don’t get updated once they are scraped
  • articles arrive with some delay
  • interactive/special elements in articles may not work

What it does

  • fetches the news feed from the original website
  • scrapes the contents of new entries and saves them into a directory structure
  • saves a full-featured RSS file
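
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code in feedcake.py: the function names, the hashed per-entry directory layout and the sample feed are all hypothetical, and the actual scraping of article pages is left out. It shows how the file structure alone can tell new entries apart from ones that were already saved.

```python
import hashlib
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample feed; a real run would download this from the news site.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>https://news.example/a</link><guid>a-1</guid></item>
<item><title>Second</title><link>https://news.example/b</link><guid>b-2</guid></item>
</channel></rss>"""


def entry_dir(base: Path, guid: str) -> Path:
    # Assumed layout: one directory per entry, named after a hash of its GUID.
    return base / hashlib.sha256(guid.encode("utf-8")).hexdigest()[:16]


def new_entries(feed_xml: str, base: Path) -> list:
    """Return (guid, link) pairs for entries that have no directory on disk yet."""
    root = ET.fromstring(feed_xml)
    result = []
    for item in root.iter("item"):
        link = item.findtext("link")
        guid = item.findtext("guid") or link  # fall back to the link if no GUID
        if not entry_dir(base, guid).exists():
            result.append((guid, link))
    return result


def store_entry(base: Path, guid: str, html: str) -> Path:
    """Save the scraped HTML of one entry into its own directory."""
    target = entry_dir(base, guid)
    target.mkdir(parents=True, exist_ok=True)
    (target / "index.html").write_text(html, encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        for guid, link in new_entries(SAMPLE_FEED, base):
            # Placeholder content; real scraping would fetch and clean the article.
            store_entry(base, guid, f"<p>scraped from {link}</p>")
        print("entries still new:", len(new_entries(SAMPLE_FEED, base)))
```

Because an entry counts as new only while its directory is missing, running the sketch twice on the same feed stores nothing the second time, which fits the approach of using the file structure instead of a database.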

... and what it doesn’t

  • Managing when it scrapes (use crontab or something else for that)
  • Serving the feeds and assets via HTTPS (use your favorite web server for that)
  • Dealing with article comments
  • Archiving feeds (content and assets are kept, but without metadata)
  • Using some sort of database (the file structure is everything)
  • Cleaning up old assets
  • Automatically updating the basedir if it changed

Ugly stuff?

  • the HTML files (feed content) are stored alongside the assets, even though they don’t need to be exposed via HTTPS.

How to use

  • git clone this project and enter directory
  • install python3, pip and virtualenv
  • Create virtualenv: virtualenv -p python3 ~/.virtualenvs/feedcake
  • Activate your new virtualenv: source ~/.virtualenvs/feedcake/bin/activate
  • switch into the project’s directory: cd feedcake
  • Install requirements: pip3 install -r requirements.txt
  • copy the example config: cp config.example.json config.json
  • edit config.json
  • copy the cron-example: cp cron-example.sh cron.sh
  • edit cron.sh
  • make cron.sh executable: chmod +x cron.sh
  • add cronjob for cron.sh: crontab -e
    • */5 * * * * /absolute/path/to/cron.sh > /path/to/logfile 2>&1
  • set up your web server:
    • let your web server point to the feeds directory. You should protect its HTTP path with basic authentication.
    • let the assets_url specified in the config point to the assets directory.
  • After running the script for the first time, your desired feed is available at base_url/destination (e.g. https://yourdomain.tld/some-url/newspaper.xml)
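
As a rough illustration of how the feed address in the last step comes together, the sketch below joins base_url and destination. The config layout is an assumption: only the names base_url, destination and assets_url come from this README, while the feeds list, the helper name feed_urls and the sample values are hypothetical, so check the example config shipped with the project for the real structure.

```python
import json

# Hypothetical config excerpt; the real config file may be structured differently.
CONFIG_TEXT = """
{
  "base_url": "https://yourdomain.tld/some-url",
  "assets_url": "https://yourdomain.tld/some-assets",
  "feeds": [
    {"destination": "newspaper.xml"}
  ]
}
"""


def feed_urls(config: dict) -> list:
    """Build the public URL of every feed as base_url/destination."""
    base = config["base_url"].rstrip("/")  # avoid a double slash when joining
    return [f"{base}/{feed['destination'].lstrip('/')}" for feed in config["feeds"]]


config = json.loads(CONFIG_TEXT)
for url in feed_urls(config):
    print(url)
```

These are the URLs you would subscribe to in your feed reader, using the basic authentication credentials configured in the web server.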

TODOs

  • Decide what should happen with old news articles and assets which are not listed in the current feed anymore.