Feedcake

Attention

This script is maintained by only one person, who is also a Python newbie.
If you don’t care about having article images, you should definitely use PyFeeds instead!
Also, it only works for a very limited subset of news sites.

The Problem

Most news platforms don’t give you the full article via RSS/Atom.
That alone wouldn’t be a big problem, but some of them do crazy 1984-ish stuff on their websites, or they put up paywalls for visitors who use privacy add-ons.

Goal of this script

Get a full-featured news feed (full articles, including images) from various news sites.

Benefits for the user

  • read full articles directly in your feed reader
  • exclude articles by keyword in title
  • no tracking
  • no ads

Possible downsides for the user

  • articles don’t get updated once they are scraped
  • articles arrive with some delay
  • interactive/special elements in articles may not work

What it does

  • Fetch the news feed from the original website
  • Scrape the content of new entries and save it into a directory structure
  • Exclude articles whose title contains a string from the ‘exclude’ list
  • Save a full-featured RSS file (a rough sketch of this flow follows below)
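To make the flow above concrete, here is a rough, simplified sketch of the idea in Python. It is not the actual feedcake.py implementation: FEED_URL, EXCLUDE and OUTPUT_DIR are placeholders invented for this example (in the real script they come from config.json), and the real script additionally downloads the article images and rewrites their URLs to the configured assets_url.

```python
# Rough sketch of the idea only -- not the actual feedcake.py implementation.
# FEED_URL, EXCLUDE and OUTPUT_DIR are made-up placeholders for this example.
import pathlib

import feedparser   # parses the original RSS/Atom feed
import requests     # fetches the full article pages

FEED_URL = "https://example.org/rss"        # the original, truncated feed
EXCLUDE = ["liveticker", "advertorial"]     # strings that disqualify a title
OUTPUT_DIR = pathlib.Path("public/feeds/example")

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # exclude articles whose title contains one of the configured strings
    if any(word.lower() in entry.title.lower() for word in EXCLUDE):
        continue
    filename = entry.get("id", entry.link).replace("/", "_") + ".html"
    target = OUTPUT_DIR / filename
    if target.exists():
        continue    # articles don't get updated once they are scraped
    # fetch and store the full article page; the real script also downloads
    # the images and rewrites their URLs so they point to the assets_url
    response = requests.get(entry.link, timeout=30)
    target.write_text(response.text, encoding="utf-8")
# finally, the stored articles are assembled into a new, full-featured RSS file
```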

... and what it doesn’t

  • Managing when it scrapes (but setup instructions for a crontab entry are included)
  • Serving the feeds and assets via HTTPS (use your favorite web server for that)
  • Dealing with article comments
  • Archiving feeds (content and assets are kept, but without their metadata)
  • Using some sort of database (the file structure is everything)
  • Cleaning up old assets
  • Automatically updating the basedir if it has changed (you have to clear the assets directory)

Ugly stuff?

  • The HTML files (feed content) are stored alongside the assets, even though they don’t need to be exposed via HTTPS.
  • Almost no exception handling yet.

How to use

  • git clone this project and enter the directory
  • install python3, pip and virtualenv
  • Create virtualenv: virtualenv -p python3 ~/.virtualenvs/feedcake
  • Activate your new virtualenv: source ~/.virtualenvs/feedcake/bin/activate
  • switch into the project’s directory: cd feedcake
  • Install requirements: pip3 install -r requirements.txt
  • copy the config example: cp config.example.json config.json
  • edit config.json
  • copy the cron example: cp cron-example.sh cron.sh
  • edit cron.sh
  • make cron.sh executable: chmod +x cron.sh
  • add cronjob for cron.sh: crontab -e
    • */5 * * * * /absolute/path/to/cron.sh >> /path/to/logfile 2>&1
  • set up your webserver:
    • point your webserver at the public/feeds directory. You should protect this HTTP path with basic authentication.
    • make the assets_url you specified in the config earlier point to the public/assets directory.
  • After running the script for the first time, your desired feed is available at base_url/destination (e.g. https://yourdomain.tld/some-url/newspaper.xml). A quick way to check this is sketched below.
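Once everything is wired up, a short Python snippet like the following can be used to check that the generated feed is reachable and parses correctly. This is only a sketch: the URL and credentials are placeholders, and it assumes you protected the path with basic authentication as recommended above.

```python
# Sanity check for the generated feed -- URL and credentials are placeholders.
import feedparser
import requests

FEED_URL = "https://yourdomain.tld/some-url/newspaper.xml"  # base_url/destination

# fetch with basic auth (as recommended above), then parse the XML
response = requests.get(FEED_URL, auth=("user", "password"), timeout=30)
response.raise_for_status()
parsed = feedparser.parse(response.text)

print(parsed.feed.get("title"), "-", len(parsed.entries), "entries")
```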

TODOs

  • Handle exceptions
  • Decide what should happen to old articles and assets which are no longer listed in the current feed.