Maybe I've missed this in the instructions, but what is the recommended type and size of cluster for the jobs processing btc-2010-chunk-000 or the full data set?
My current understanding is that when defining a cluster one 'Master' node is required while 'Core' and 'Task' nodes can be added on demand.
The defaults (one medium-sized Master node, two medium-sized Core nodes, no Task nodes) worked fine when loading the cse344-test-file, but even after adding 5 Task nodes I ran into problems when working with btc-2010-chunk-000.
The error message I keep getting in the log file is:
Backend error message: Error: Java heap space
which looks to me like a memory issue on the cluster (just guessing).
What combination of nodes is the best bundle for our 'larger' problems?
Alternatively: is there some setting that might make the memory management more resilient to this heap space error?
OK, after setting up a cluster consisting of 1 Master node (m1.large) and 5 Core nodes (m1.small), I still initially ran into the problems loading the data from S3 (as described here: https://class.coursera.org/datasci-002/forum/thread?thread_id=1770 ). After importing the data set locally into the HDFS of the cluster (see Fan Fei's post: https://class.coursera.org/datasci-002/forum/thread?thread_id=1770#post-8540 ), I was able to complete Problem 1, Problem 2B and Problem 3 without further problems.
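
For anyone who wants to try the same workaround, here is a rough sketch of the local import step. It is not the exact command from Fan Fei's post; it assumes `hadoop distcp` is used for the copy, and both the bucket name (EXAMPLE-BUCKET) and the HDFS target directory (/user/hadoop/) are placeholders that need to be replaced with the real values. The helper name build_hdfs_import_command is made up for illustration. The script only builds and prints the command, to be run on the Master node.

```python
import shlex


def build_hdfs_import_command(s3_uri, hdfs_dir):
    """Return the argument list for copying an S3 path into HDFS with distcp.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if not s3_uri.startswith(("s3://", "s3n://")):
        raise ValueError("expected an s3:// or s3n:// URI, got %r" % s3_uri)
    if not hdfs_dir.startswith("/"):
        raise ValueError("expected an absolute HDFS path, got %r" % hdfs_dir)
    # 'hadoop distcp <source> <destination>' copies between file systems.
    return ["hadoop", "distcp", s3_uri, hdfs_dir]


if __name__ == "__main__":
    # Placeholder bucket and target directory (assumptions, not the course values).
    cmd = build_hdfs_import_command(
        "s3n://EXAMPLE-BUCKET/btc-2010-chunk-000", "/user/hadoop/"
    )
    # Print a shell-safe version of the command to paste on the Master node.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Once the copy has finished, the load statement in the Pig script should be able to point at the HDFS path (here /user/hadoop/btc-2010-chunk-000) instead of the S3 URI.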