|
@@ -23,20 +23,22 @@ news pages |
|
|
### What it does |
|
|
### What it does |
|
|
* Fetching the news feed from the original website |
|
|
* Fetching the news feed from the original website |
|
|
* scrape contents of new entries and save them into a directory structure |
|
|
* scrape contents of new entries and save them into a directory structure |
|
|
|
|
|
* exclude articles if a string in the 'exclude' list is included in the title |
|
|
* save a full-featured RSS file |
|
|
* save a full-featured RSS file |
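
The title-based exclusion listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `is_excluded` and `filter_entries`, the dictionary shape of an entry, and the case-sensitive substring match are all assumptions.

```python
# Hypothetical sketch: drop feed entries whose title contains an excluded string.
# The entry layout ({"title": ...}) and both function names are assumptions.

def is_excluded(title, exclude):
    """Return True if any string from the exclude list occurs in the title."""
    # Plain, case-sensitive substring test (assumed behaviour).
    return any(term in title for term in exclude)


def filter_entries(entries, exclude):
    """Keep only the entries whose title matches no excluded string."""
    return [entry for entry in entries if not is_excluded(entry["title"], exclude)]


if __name__ == "__main__":
    entries = [
        {"title": "Sports: final score"},
        {"title": "Local news"},
    ]
    for entry in filter_entries(entries, ["Sports"]):
        print(entry["title"])
```

If matching should ignore case, both the title and the excluded strings could be lowercased before the comparison.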
|
|
|
|
|
|
|
|
### ... and what it doesn't |
|
|
### ... and what it doesn't |
|
|
* Managing when it scrapes (use crontab or sth else for that) |
|
|
|
|
|
|
|
|
* Managing when it scrapes (but install instructions for crontab are included) |
|
|
* serving the feeds and assets via HTTPS (use your favorite web server for that) |
|
|
* serving the feeds and assets via HTTPS (use your favorite web server for that) |
|
|
* Dealing with article comments |
|
|
* Dealing with article comments |
|
|
* Archiving feeds (content and assets are kept, but without metadata) |
|
|
* Archiving feeds (content and assets are kept, but without metadata) |
|
|
* Using some sort of database (the file structure is everything) |
|
|
* Using some sort of database (the file structure is everything) |
|
|
* Cleaning up old assets |
|
|
* Cleaning up old assets |
|
|
* Automaticly updating the basedir if it changed. |
|
|
|
|
|
|
|
|
* Automatically updating the basedir if it changed. |
|
|
|
|
|
|
|
|
### Ugly stuff? |
|
|
### Ugly stuff? |
|
|
* the HTML files (feed content) get stored alongside the assets, even if they don't |
|
|
* the HTML files (feed content) get stored alongside the assets, even if they don't |
|
|
need to be exposed via HTTPS. |
|
|
need to be exposed via HTTPS. |
|
|
|
|
|
* almost no exception handling yet. |
|
|
|
|
|
|
|
|
### How to use |
|
|
### How to use |
|
|
* git clone this project and enter the directory |
|
|
* git clone this project and enter the directory |
|
@@ -60,5 +62,6 @@ news pages |
|
|
`base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`) |
|
|
`base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`) |
|
|
|
|
|
|
|
|
### TODOs |
|
|
### TODOs |
|
|
|
|
|
* Handle exceptions |
|
|
* Decide what should happen with old news articles and assets which are not |
|
|
* Decide what should happen with old news articles and assets which are not |
|
|
listed in the current feed anymore. |
|
|
listed in the current feed anymore. |