# Feedcake
## "Give me a piece of cake and I'll want the whole cake."

### Attention
This script is maintained by a single person who is also a Python newbie.  
If you don't care about having article images, you should definitely use
[PyFeeds](https://github.com/PyFeeds/PyFeeds) instead!  
Also, it only works for a very limited subset of news sites.


### The Problem
Most news platforms don't give you the full article via RSS/Atom.  
That alone wouldn't be a big problem, but some of them do crazy 1984-ish stuff
on their websites, or they have put up paywalls for visitors using privacy add-ons.

### Goal of this script
Get a full-featured news feed (full articles with images) from various
news pages.

### Benefits for the user
* read full articles directly in your feed reader
* exclude articles by keyword in title
* no tracking
* no ads

### Possible downsides for the user
* articles don't get updated once they are scraped
* articles arrive with some delay
* interactive/special elements in articles may not work

### What it does
* fetch the news feed from the original website
* scrape the contents of new entries and save them into a directory structure
* exclude articles if a string from the 'exclude' list appears in the title
* save a full-featured RSS file
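
The exclude step can be pictured roughly like this. This is only a minimal sketch, not the script's actual code: the function names `is_excluded` and `filter_entries`, the entry layout (a dict with a `title` key), and the case-sensitive substring match are all assumptions made for illustration.

```python
def is_excluded(title, exclude):
    """Return True if any string from the exclude list occurs in the title."""
    # Plain substring match; case-sensitivity is an assumption of this sketch.
    return any(term in title for term in exclude)


def filter_entries(entries, exclude):
    """Keep only entries whose title contains none of the excluded strings."""
    return [entry for entry in entries if not is_excluded(entry["title"], exclude)]


# Hypothetical sample data, not taken from a real feed.
entries = [
    {"title": "Election results are in"},
    {"title": "Sport: cup final tonight"},
]
kept = filter_entries(entries, ["Sport"])
print([entry["title"] for entry in kept])
```

With an empty exclude list, nothing is filtered out, so every entry is kept.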

### ... and what it doesn't
* Managing when it scrapes (but install instructions for crontab are included)
* Serving the feeds and assets via HTTPS (use your favorite web server for that)
* Dealing with article comments
* Archiving feeds (content and assets are kept, but without metadata)
* Using some sort of database (the file structure is everything)
* Cleaning up old assets
* Automatically updating the basedir if it has changed.
  (you have to clear the assets directory)

### Ugly stuff?
* the HTML files (feed content) are stored alongside the assets, even though
  they don't need to be exposed via HTTPS.
* almost no exception handling yet.

### How to use
* git clone this project and enter the directory
* install python3, pip and virtualenv
* create a virtualenv: `virtualenv -p python3 ~/.virtualenvs/feedcake`
* activate your new virtualenv: `source ~/.virtualenvs/feedcake/bin/activate`
* switch into the project's directory: `cd feedcake`
* install the requirements: `pip3 install -r requirements.txt`
* copy the config example: `cp config-example.json config.json`
* edit `config.json`
* copy the cron example: `cp cron-example.sh cron.sh`
* edit `cron.sh`
* make `cron.sh` executable: `chmod +x cron.sh`
* add a cronjob for `cron.sh`: `crontab -e`
  * `*/5 * * * * /absolute/path/to/cron.sh  >> /path/to/logfile 2>&1`
* set up your web server:
  * let your web server point to the `public/feeds` directory.
    You should protect the HTTP path with basic authentication.
  * let the `assets_url` you specified in the config earlier point to the
    `public/assets` directory.
* after running the script for the first time, your desired feed is available at
  `base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`)
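
To double-check which URL to enter in your feed reader, the `base_url/destination` rule can be written out as a small helper. This is a sketch only: the function name `feed_url` is hypothetical, and it assumes the two values are plain strings as in the example above. How the script itself joins them is not shown here.

```python
def feed_url(base_url, destination):
    """Join base_url and destination with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + destination.lstrip("/")


# Placeholder values from the example above, not a real deployment.
print(feed_url("https://yourdomain.tld/some-url", "newspaper.xml"))
```

Stripping the slashes on both sides means a trailing slash in `base_url` doesn't produce a double slash in the result.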

### TODOs
* Handle exceptions
* Decide what should happen with old news articles and assets that are no
  longer listed in the current feed.