| @@ -23,20 +23,22 @@ news pages | |||||
| ### What it does | ### What it does | ||||
| * Fetching the news feed from the original website | * Fetching the news feed from the original website | ||||
| * scrape contents of new entries and save them into a directory structure | * scrape contents of new entries and save them into a directory structure | ||||
| * exclude articles if a string in the 'exclude' list is included in the title | |||||
| * save a full featured RSS file | * save a full featured RSS file | ||||
| ### ... and what it doesn't | ### ... and what it doesn't | ||||
| * Managing when it scrapes (use crontab or sth else for that) | |||||
| * Managing when it scrapes (but install instructions for crontab are included) | |||||
| * serving the feeds and assets via HTTPS (use your favorite web server for that) | * serving the feeds and assets via HTTPS (use your favorite web server for that) | ||||
| * Dealing with article comments | * Dealing with article comments | ||||
| * Archiving feeds (content and assets are kept, but without metadata) | * Archiving feeds (content and assets are kept, but without metadata) | ||||
| * Using some sort of database (the file structure is everything) | * Using some sort of database (the file structure is everything) | ||||
| * Cleaning up old assets | * Cleaning up old assets | ||||
| * Automaticly updating the basedir if it changed. | |||||
| * Automatically updating the basedir if it changed. | |||||
| ### Ugly stuff? | ### Ugly stuff? | ||||
| * the HTML files (feed content) are stored alongside the assets, even though they don't | * the HTML files (feed content) are stored alongside the assets, even though they don't | ||||
| need to be exposed via HTTPS. | need to be exposed via HTTPS. | ||||
| * almost no exception handling yet. | |||||
| ### How to use | ### How to use | ||||
| * git clone this project and enter directory | * git clone this project and enter directory | ||||
| @@ -60,5 +62,6 @@ news pages | |||||
| `base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`) | `base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`) | ||||
| ### TODOs | ### TODOs | ||||
| * Handle exceptions | |||||
| * Decide what should happen with old news articles and assets which are not | * Decide what should happen with old news articles and assets which are not | ||||
| listed in the current feed anymore. | listed in the current feed anymore. | ||||
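
The title-based exclusion added under "What it does" could look roughly like the sketch below. This is not the project's actual code: the function names `is_excluded` and `filter_entries`, the dict-shaped entries with a `title` key, and the case-sensitive substring match are all assumptions made for illustration.

```python
def is_excluded(title, exclude):
    """Return True if any string from the exclude list occurs in the title.

    Assumption: plain, case-sensitive substring matching.
    """
    return any(term in title for term in exclude)


def filter_entries(entries, exclude):
    """Keep only the entries whose title matches no exclude string.

    Assumption: each entry is a dict with a 'title' key (hypothetical shape).
    """
    return [entry for entry in entries if not is_excluded(entry["title"], exclude)]


if __name__ == "__main__":
    sample = [
        {"title": "Local election results"},
        {"title": "Sponsored: best gadgets"},
    ]
    for entry in filter_entries(sample, ["Sponsored"]):
        print(entry["title"])
```

With an empty exclude list every entry is kept, so the filter is a no-op unless strings are configured.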