In my bachelor thesis I evaluated pre-processing strategies for news articles in order to predict whether they report on leaked user authentication data. It is only available in German, though.
Please note that the results are most likely outdated and the work probably contains mistakes.
A shortened version can be found here.
If you feel the need to read the longer version, see here.