Feedcake
“Give me a piece of cake and I want the whole cake.”
The Problem
Most news platforms don’t deliver the full article via RSS/Atom.
That alone wouldn’t be a big problem, but some of them do crazy 1984-ish stuff
on their websites, or they have built paywalls that lock out users with
privacy add-ons.
Goal of this script
Getting a full-featured news feed (full articles with images) from various
news pages
Benefits for the user
- They don’t need to visit the website to read the articles
- No ads
- No tracking
Possible downsides for the user
- articles don’t get updated once they are scraped
- articles arrive with some delay
- interactive/special elements in articles may not work
What it does
- fetches the news feed from the original website
- scrapes the contents of new entries and saves them into a directory structure
- excludes articles whose title contains a string from the ‘exclude’ list
- saves a full-featured RSS file
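The title-based exclusion step could look roughly like the sketch below. It is not the script's actual code: the function name is_excluded and the case-sensitive substring match are assumptions, and only the idea of checking an ‘exclude’ list against the title comes from the description above.

```python
def is_excluded(title, exclude):
    """Return True if any string from the exclude list occurs in the title.

    Assumption: plain, case-sensitive substring matching.
    """
    return any(term in title for term in exclude)


# Hypothetical feed entries, reduced to their titles.
titles = ["Sport: Cup final tonight", "Election results are in"]
exclude = ["Sport"]

# Keep only the entries whose title matches none of the exclude strings.
kept = [t for t in titles if not is_excluded(t, exclude)]
print(kept)  # → ['Election results are in']
```

With an empty exclude list, nothing is filtered out, because any() over an empty sequence is False.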
... and what it doesn’t
- Managing when it scrapes (but install instructions for crontab are included)
- serving the feeds and assets via HTTPS (use your favorite web server for that)
- Dealing with article comments
- Archiving feeds (content and assets are kept, but without metadata)
- Using some sort of database (the file structure is everything)
- Cleaning up old assets
- Automatically updating the basedir if it changed.
Ugly stuff?
- the HTML files (feed content) are stored alongside the assets, even though
they don’t need to be exposed via HTTPS.
- almost no exception handling yet.
How to use
- git clone this project and enter directory
- install python3, pip and virtualenv
- Create virtualenv:
virtualenv -p python3 ~/.virtualenvs/feedcake
- Activate your new virtualenv:
source ~/.virtualenvs/feedcake/bin/activate
- switch into the project’s directory:
cd feedcake
- Install requirements:
pip3 install -r requirements.txt
- copy the config example:
cp config-example.json config.json
- edit config.json
- copy the cron example:
cp cron-example.sh cron.sh
- edit cron.sh
- make cron.sh executable:
chmod +x cron.sh
- add a cronjob for cron.sh:
crontab -e
*/5 * * * * /absolute/path/to/cron.sh >> /path/to/logfile 2>&1
- set up your web server:
  - let your web server point to the feeds directory.
    You should protect the HTTP path with basic authentication.
  - let the assets_url specified in the config point to the assets directory.
- After running the script for the first time, your desired feed is available at
base_url/destination (e.g. https://yourdomain.tld/some-url/newspaper.xml)
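As a sketch of how the feed address is composed from base_url and destination: the snippet below assumes both values are plain strings and joins them with exactly one slash. The helper name feed_url is hypothetical, and the actual layout of config.json may differ.

```python
def feed_url(base_url, destination):
    """Join base_url and destination with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + destination.lstrip("/")


# Values mirror the example above; they are placeholders, not real settings.
print(feed_url("https://yourdomain.tld/some-url", "newspaper.xml"))
# → https://yourdomain.tld/some-url/newspaper.xml
```

Stripping the slashes on both sides keeps the result stable whether or not base_url was written with a trailing slash.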
TODOs
- Handle exceptions
- Decide what should happen with old news articles and assets that are no
longer listed in the current feed.