# Feedcake

## "Give me a slice of cake and I want the whole cake."

### The Problem

Most news platforms don't give you the full article via RSS/Atom.
That alone wouldn't be a big problem, but some of them do crazy 1984-ish stuff on their websites, or they have built paywalls for users with privacy add-ons.

### Goal of this script

Getting a full-featured news feed (full articles with images) from various news pages.

### Benefits for the user

* They don't need to visit the website to read the articles
* No ads
* No tracking

### Possible downsides for the user

* articles don't get updated once they are scraped
* articles arrive with some delay
* interactive/special elements in articles may not work

### What it does

* Fetching the news feed from the original website
* scraping the contents of new entries and saving them into a directory structure
* excluding articles if a string from the 'exclude' list is included in the title
* saving a full-featured RSS file

### ... and what it doesn't

* Managing when it scrapes (but install instructions for crontab are included)
* serving the feeds and assets via HTTPS (use your favorite web server for that)
* Dealing with article comments
* Archiving feeds (content and assets are kept, but without metadata)
* Using some sort of database (the file structure is everything)
* Cleaning up old assets
* Automatically updating the basedir if it changed

### Ugly stuff?

* the HTML files (feed content) get stored alongside the assets, even though they don't need to be exposed via HTTPS.
* almost no exception handling yet.

### How to use

* git clone this project and enter directory
* install python3, pip and virtualenv
* Create virtualenv: `virtualenv -p python3 ~/.virtualenvs/feedcake`
* Activate your new virtualenv: `source ~/.virtualenvs/feedcake/bin/activate`
* switch into the project's directory: `cd feedcake`
* Install requirements: `pip3 install -r requirements.txt`
* copy the config-example: `cp config-example.json config.json`
* edit `config.json`
* copy the cron-example: `cp cron-example.sh cron.sh`
* edit `cron.sh`
* make `cron.sh` executable: `chmod +x cron.sh`
* add a cronjob for `cron.sh`: `crontab -e`
    * `*/5 * * * * /absolute/path/to/cron.sh >> /path/to/logfile 2>&1`
* set up your webserver:
    * let your webserver point to the `feeds` directory. You should protect the HTTP path with basic authentication.
    * let the `assets_url` specified in the config point to the `assets` directory.
* After running the script for the first time, your desired feed is available at `base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`)

### TODOs

* Handle exceptions
* Decide what should happen with old news articles and assets that are no longer listed in the current feed.
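
### Example: how the exclude rule could work

The exclude rule listed under "What it does" (skip an article if a string from the 'exclude' list is included in its title) can be sketched roughly as follows. This is only an illustration, not the script's actual code: the function names `is_excluded` and `filter_entries`, the dictionary shape of an entry, and the case-sensitive matching are all assumptions.

```python
def is_excluded(title, exclude):
    """Return True if any non-empty string from `exclude` occurs in `title`.

    Assumption: matching is a plain, case-sensitive substring test.
    Empty strings are skipped, since "" is a substring of every title.
    """
    return any(pattern in title for pattern in exclude if pattern)


def filter_entries(entries, exclude):
    """Keep only the entries whose title matches none of the exclude strings.

    Assumption: each entry is a dict with a "title" key (hypothetical shape).
    """
    return [entry for entry in entries if not is_excluded(entry["title"], exclude)]


if __name__ == "__main__":
    entries = [
        {"title": "Liveticker: Election night"},
        {"title": "New bike lanes open downtown"},
    ]
    # Only the second entry survives an exclude list containing "Liveticker".
    for entry in filter_entries(entries, ["Liveticker"]):
        print(entry["title"])
```

With a substring test like this, a short exclude string can match more titles than intended, so longer and more specific strings are probably the safer choice in `config.json`.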