Overview:
With so many cost-effective hardware options available in cloud environments, storage has become cheap and easy. Once storage is cheap, the immediate requirement for any software solution is efficient code to consume and expose the data. The Agira team met a similar situation in one of our assignments last week: the solution demanded uploading very complex CSV data into an ETL database. Improved performance was the goal, so we explored implementing the code in both Ruby on Rails and Golang.
Approach:
On the Rails side, we wrote the code in Ruby and executed the upload as Sidekiq jobs. Ruby's well-known concurrency bottlenecks hurt performance, and the time taken to complete the job was not satisfactory. To handle the situation, we wrote a piece of Golang code that performs the same task. Even on the developer's laptop we could see the better performance of the Golang code.
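To illustrate why Go suits this kind of workload, here is a minimal, hypothetical worker-pool sketch. It is not the code from this experiment (the actual snippet appears further below), and `insertBatch` is a stand-in for the real database insert; the point is that goroutines and channels make running many batch uploads in parallel cheap:

```go
package main

import (
	"fmt"
	"sync"
)

// insertBatch stands in for a real database insert; it is a placeholder,
// not part of the original experiment's code.
func insertBatch(batch [][]string) {
	fmt.Printf("inserted %d rows\n", len(batch))
}

func main() {
	batches := make(chan [][]string)
	var wg sync.WaitGroup

	// Four workers drain the channel concurrently; goroutines make this
	// kind of parallel upload cheap compared to thread-based approaches.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				insertBatch(batch)
			}
		}()
	}

	// Feed dummy batches; in the real task these would be slices of CSV rows.
	for i := 0; i < 10; i++ {
		batches <- make([][]string, 200)
	}
	close(batches)
	wg.Wait()
}
```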
Technology Bundle:
- Ruby – 2.3
- Rails – 4.2.0
- Postgres gem (pg) – 0.18.4
- Postgres DB – 9.3.11
- Golang – 1.5.3
Ruby Code Snippet:
```ruby
require 'csv'

# array_header holds the CSV header row; in the full script it is assumed to
# be assigned from the first row, and is zipped with each data row below to
# build attribute hashes.
rows = CSV.read('./db/samples/myfile_sample.csv', col_sep: ',')
array_header = rows.first

rows.each_slice(200).each_with_index do |slice, i|
  array_hash = []
  slice.each_with_index do |row, j|
    next if i == 0 && j == 0 # skip the header row itself
    array_hash << Hash[[array_header, row].transpose]
  end
  RubyCsv.create(array_hash) # persist the batch of up to 200 records
end
```
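One point worth noting: `RubyCsv.create` with an array of hashes makes ActiveRecord instantiate, validate, and save each record individually, issuing one INSERT per row (Rails 4.2 has no built-in bulk insert). The Go version below instead packs each batch into a single multi-row INSERT, which accounts for part of the performance gap beyond raw language speed.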
Golang Code Snippet:
```go
// GenerateString builds batches of up to Length rows into a single multi-row
// VALUES string, inserts each batch, and recurses until no rows remain.
func (csvDataContainer *VariableInit) GenerateString() {
	// For example, csvDataContainer.Length = 100 (rows per batch).
	csvDataContainer.stringVal = "" // rows of data (as a string) to be inserted into the table
	csvDataContainer.str = ""       // comma separator between values

	// Slice off the next batch of rows.
	if len(csvDataContainer.remaining) > csvDataContainer.Length {
		csvDataContainer.first_100 = csvDataContainer.remaining[:csvDataContainer.Length]
		csvDataContainer.remaining = csvDataContainer.remaining[csvDataContainer.Length:]
	} else {
		csvDataContainer.first_100 = csvDataContainer.remaining
		csvDataContainer.remaining = nil
	}

	checkAlphanumeric := regexp.MustCompile("([a-zA-Z]+)")
	for parentIndex, recValues := range csvDataContainer.first_100 {
		csvDataContainer.stringVal += "("
		for childIndex, rec := range recValues {
			if rec == "" {
				rec = "NULL"
			}
			if checkAlphanumeric.MatchString(rec) {
				rec = "'" + rec + "'" // quote values containing letters
			}
			if childIndex == 0 {
				csvDataContainer.str = ""
			} else {
				csvDataContainer.str = ","
			}
			csvDataContainer.stringVal += csvDataContainer.str + rec
		}
		if len(csvDataContainer.first_100) == parentIndex+1 {
			csvDataContainer.stringVal += ");" // last tuple in the batch
		} else {
			csvDataContainer.stringVal += "),"
		}
	}

	psqlConn.InsertRec(csvDataContainer.stringVal)
	csvDataContainer.first_100 = nil
	fmt.Printf("--------------> %d\n", len(csvDataContainer.remaining))
	if len(csvDataContainer.remaining) != 0 {
		csvDataContainer.GenerateString()
	}
}
```
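The snippet calls `psqlConn.InsertRec`, which is defined elsewhere in the repository. Below is a minimal sketch of what such a helper might look like, assuming a `database/sql` connection and a hypothetical table name `go_csv_records`; the real implementation is in the repo linked at the end of this post. Note that this approach concatenates values directly into the SQL string, trading parameter-binding safety for bulk-insert speed:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver, assumed from the Postgres setup above
)

// PsqlConn wraps the database handle. This reconstruction is hypothetical;
// only the InsertRec call site is visible in the snippet above.
type PsqlConn struct {
	db *sql.DB
}

// InsertRec prefixes the INSERT statement onto the already-built tuple list
// (which ends in ";") and executes it. The table name is an assumption.
func (c *PsqlConn) InsertRec(values string) {
	query := "INSERT INTO go_csv_records VALUES " + values
	if _, err := c.db.Exec(query); err != nil {
		log.Fatal(err)
	}
}

func main() {
	db, err := sql.Open("postgres", "user=postgres dbname=metrics sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	psqlConn := &PsqlConn{db: db}
	psqlConn.InsertRec("(119736,'FL','CLAY COUNTY');")
}
```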
Cloud Setup:
We wanted to do extensive testing, so we decided to set up a cloud environment on AWS to measure performance against huge data volumes.
The following AWS instance configurations were considered for running the volume test:
Instance Type | vCPU | Memory (GiB) | Networking Performance | Physical Processor | Clock Speed (GHz) |
--- | --- | --- | --- | --- | --- |
c4.xlarge | 4 | 7.5 | High | Intel Xeon E5-2666 v3 | 2.9 |
c4.2xlarge | 8 | 15 | High | Intel Xeon E5-2666 v3 | 2.9 |
Record Set Sample:
Column | Data 1 | Data 2 | Data 3 | Data 4 | Data 5 |
--- | --- | --- | --- | --- | --- |
policyID | 119736 | 448094 | 206893 | 333743 | 172534 |
statecode | FL | FL | FL | FL | FL |
county | CLAY COUNTY | CLAY COUNTY | CLAY COUNTY | CLAY COUNTY | CLAY COUNTY |
eq_site_limit | 498960 | 1322376.3 | 190724.4 | 0 | 0 |
hu_site_limit | 498960 | 1322376.3 | 190724.4 | 79520.76 | 254281.5 |
fl_site_limit | 498960 | 1322376.3 | 190724.4 | 0 | 0 |
fr_site_limit | 498960 | 1322376.3 | 190724.4 | 0 | 254281.5 |
tiv_2011 | 498960 | 1322376.3 | 190724.4 | 79520.76 | 254281.5 |
tiv_2012 | 792148.9 | 1438163.57 | 192476.78 | 86854.48 | 246144.49 |
eq_site_deductible | 0 | 0 | 0 | 0 | 0 |
hu_site_deductible | 9979.2 | 0 | 0 | 0 | 0 |
fl_site_deductible | 0 | 0 | 0 | 0 | 0 |
fr_site_deductible | 0 | 0 | 0 | 0 | 0 |
point_latitude | 30.102261 | 30.063936 | 30.089579 | 30.063236 | 30.060614 |
point_longitude | -81.711777 | -81.707664 | -81.700455 | -81.707703 | -81.702675 |
line | Residential | Residential | Residential | Residential | Residential |
construction | Masonry | Masonry | Wood | Wood | Wood |
point_granularity | 1 | 3 | 1 | 3 | 1 |
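For reference, the sample above maps naturally onto a Postgres table. The schema below is a sketch inferred from the data shown; the column types and the table name are assumptions, not taken from the repository:

```go
// Inferred Postgres schema for the sample record set above. Column types and
// the table name are assumptions read off the data, not from the repository.
const createTableSQL = `
CREATE TABLE IF NOT EXISTS go_csv_records (
    policyid           integer,
    statecode          varchar(2),
    county             text,
    eq_site_limit      numeric,
    hu_site_limit      numeric,
    fl_site_limit      numeric,
    fr_site_limit      numeric,
    tiv_2011           numeric,
    tiv_2012           numeric,
    eq_site_deductible numeric,
    hu_site_deductible numeric,
    fl_site_deductible numeric,
    fr_site_deductible numeric,
    point_latitude     double precision,
    point_longitude    double precision,
    line               text,
    construction       text,
    point_granularity  integer
);`
```

Running `db.Exec(createTableSQL)` once against a connection like the one in the earlier sketch would create the table before the upload begins.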
Observation:
The code was executed against 100k, 500k, and 1 million record sets. The time taken to load the records into the database was captured from the log records. The charts below show the time taken on AWS, for each record-set volume, by the Golang and Ruby code snippets:
Code Reference:
The complete code scripts and the sample data used are available for reference in the Git repository: https://github.com/agiratech/go_vs_ruby_metrics
Conclusion:
For the scenario considered in this study, uploading records into an ETL database, the Golang code performed better than the Ruby code.
At Agira, we always believe in suggesting the best possible solution to our clients. Here too, our team applied its technical expertise and recommended the technology stack best suited to the client's product for improved performance.
We offer Golang development services for building world-class enterprise apps. We have expertise in building the most complex software solutions using Google's Go language. Chat with us now and hire Golang developers within 72 hours.