Server load problem

Hello Everyone,

I posted earlier on this site (in another sub-forum) about my website getting hacked. Thanks to your advice I took steps to protect my server, and it's working fine now.

However, a new problem has come up: the server that runs my site is under heavy load. My website consists of the site itself, a CMS, a blog, and some web apps. The web apps read data from the database and show it to users every 5 minutes; it is basically a list of stocks and related data.

I am running on AWS, on a t2.large instance. My OS is CentOS 7 and I run PHP, MySQL, CSF, etc. The t2 family uses burstable CPU credits: at night the credits accumulate, and during the day they get used up. However, the site starts slowing down when traffic increases, and CPU usage is high.

After some research I was told that I should shift to an M4 or C4 instance because t2 is meant for low-traffic sites and development, not production. Moreover, I may launch mobile apps soon, and if the load increases it will be a problem.

What should I do? Should I go for an M4 or C4 instance in AWS?

Thank you,

Regards,
GR
 
Get some monitoring/profiling tools in place to better understand where the delay in your apps is coming from.

Has traffic increased significantly? If so, do you understand why, and does your code take that into account?

Perhaps start using a cache so the system serves more reads from memory instead of processing and pulling data every. single. time.
 
You may want to look into running a cache in front of the site. If most of your traffic is from users who are not logged in, this can provide quite a bit of relief.
Even if they're logged in, a static cache will reduce load on the backend. Varnish is a good example and works very well.
 
Caching is a good idea and definitely one of the go-to options AWS would recommend. Are you actually running out of CPU credits during peak load times or is there something else causing the performance issue?
 
Thank you for your replies, guys. Here is some additional data that may help:

1) During peak times the load comes from MySQL, as my 2 online applications read stock-related data from the MySQL DB every 5 minutes.
2) Since the site runs on a CMS, page data is also read from the DB and shown to users.
3) I have no compression or optimization enabled on the site.
4) I was told that my homepage weighs about 4 MB, which is high.
5) I often get messages such as "Excessive processes running under - public_html/index.php".
6) Someone told me a t2 instance is not meant for production and I should upgrade to M4. What are your thoughts on that?

During peak hours a lot of CPU credits are used; at night, when traffic is low, they accumulate again.

What measures should I take?
 
Production and non-production/development are fuzzy labels meant to assign a qualitative value to performance capabilities. I have multiple production domain controllers and jump hosts running in Azure on their equivalent of t2 instances, and I've seen no performance-related problems. The best advice is to make sure that your workload's performance requirements match what the instance type is capable of providing.

Turning on some sort of compression seems like an easy win to reduce page load times and decrease the amount of data sent over the network. However, be wary of the increased demands on your CPU: if that's where your resource constraint lies, it could end up exacerbating the problem.

My personal recommendation is to drill deeper into the problem to find evidence which points to the bottleneck. Once you know where that is, you can start looking at options to correct the issue.
 
How many CPU cycles are you consuming at peak?

How many CPU cycles are you consuming on average during your high-load period?

How large is the database your web page is reading from?

How much of your current storage is being consumed?

What's the RAM usage of your instance look like under load?


I think you need to follow the advice others have laid out previously, but answering the questions I listed should enable us to give you a properly informed response.
 
The problem is finding someone suitable who can fix it. I spoke to 2 companies who just did a basic check for malware, said there was no malware, and told me that since my CPU usage is high I should shift to an M4 instance. I contacted some freelancers who said we need to enable compression and optimize DB access by installing memcached, and probably also enable reverse DNS or Cloudflare, etc.

1) Around 200 CPU credits are used during the high-load period, i.e. roughly 7 hours.
2) N/A
3) The website uses the DB for the CMS, the blog, and to store user registrations, products, and invoice details. The size of that DB would be about 160 MB. The site runs 2 web apps which read from the DB every 5 minutes and show data to users; the few main tables in those DBs have no more than 150 records.
4) Total disk space is 60 GB, of which 35 GB is free.
5) RAM usage is low, I think: in WHM the memory% figures were around 3-6% for a few processes and below 1% for the rest.
 
So your entire instance is running within 1 GB of RAM?

That doesn't make sense. With that small of a database you could load the entire db into cache and have few fetches outside of updates. It makes me wonder what else is going on. What processes are eating so much CPU?
 
The server has 8 GB of RAM. However, if I go to the process log in WHM I see:

1) MySQL
CPU% - 20.35
Memory% - 4.25

2) cpanelsolr : memory% is 3.36

All the rest have a memory% of 0.24 or lower.

The homepage sometimes uses 3% of CPU.
 
See if you can load the entire MySQL DB into cache. 160 MB is nothing. That should be an easy change, and it should remove some read latency and CPU overhead. It may also drive your CPU usage higher, but your resulting performance should scale linearly as well.

After that you really need to figure out where your CPU is going, what specific processes are consuming it, and under what conditions.
 
How do I load the entire MySQL DB into cache / make it use more RAM?
 
"6) Someone told me a t2 instance is not meant for production and I should upgrade to M4. What are your thoughts on that?"

The t2 family is intended for low-load applications that may need occasional higher-than-baseline performance. Lots of people use t2 instances for various production purposes; lots of people also misuse them because they didn't read how the CPU credits work and/or they didn't load test before deploying. Scaling up the instance size before understanding the issue is not good advice unless it's an emergency that is costing your business more money than the increase in operational cost and you are still going to investigate the problem after the immediate production pain has been taken care of. It sounds to me as though you are not sure why the load is high and so you should investigate the root cause first.

Are you exhausting the instance's CPU credits (sounds like "no", but please do confirm) or is burst rate simply not providing enough CPU to keep up with requests? Review your EC2 CloudWatch metrics.

I presume the DB is running on-instance rather than in RDS? Have you verified you're not IO bound when at peak load? You didn't mention it, but it's likely the instance is using a gp2 volume whose performance is modeled using baseline rate with a burst bucket. Check for high iowait ("wa" in top) and review your CloudWatch EBS metrics.

Have you tried installing atop to get a point-in-time picture of the system state for later analysis?
 
Thank you for your help. However, the problem is that I am not a Linux administrator and can only do basic monitoring.

As of now I am not running out of credits, because a good number of credits accumulated during the weekend and my t2.large has a cap of 850 credits. Approximately 200-250 credits are used during business hours, and credits are replenished at night when volume is low. The problem is one of optimization. My site runs on a CodeIgniter-based CMS; most content and data reside in the DB and are read from there. However, the biggest DB is maybe 200 MB in size and the other DBs are small. In spite of this the site runs slow, because everything was just dumped onto the server with only basic tweaking. I have CSF and ClamAV.

atop is not installed so far, and the DB is running on the same instance. As for "Have you verified you're not IO bound when at peak load?": I don't know, and I have no idea how to check.

I just installed atop and am attaching 4 images. Let me know what you think.

https://preview.************/idzLEG/atop.png
https://preview.************/cD9bob/atop2.png
https://preview.************/gNb0EG/atop3.png
https://preview.************/bUAhTb/atop4.png
 
Images attached in post. Links not working as post removes addresses.
 

The issue has been sorted for now. The developer who built the website was inserting records into a DB table to manage sessions. He never deleted those temporary records, and the table crossed the 1 million row mark, which made the site slow. Once I deleted the records, the site became fast again.
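
For reference, here is a minimal sketch of that kind of cleanup, assuming a CodeIgniter-style ci_sessions table with a Unix timestamp column (the table and column names are illustrative; adjust them to match your schema):

```sql
-- One-off cleanup: drop session rows older than 24 hours
DELETE FROM ci_sessions
WHERE `timestamp` < UNIX_TIMESTAMP() - 86400;

-- Keep the table pruned automatically so it cannot grow unbounded again
-- (requires the MySQL event scheduler to be enabled)
SET GLOBAL event_scheduler = ON;

CREATE EVENT IF NOT EXISTS purge_old_sessions
ON SCHEDULE EVERY 1 HOUR
DO
  DELETE FROM ci_sessions
  WHERE `timestamp` < UNIX_TIMESTAMP() - 86400;
```

A cron job running the same DELETE would work just as well if you would rather not enable the event scheduler.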

Thank you for your support.
 
Your problem isn't sorted.

Go find the "malware" hunting guys and get your money back.

Turn on CloudTrail globally.
Properly decouple your application.
Set CloudWatch alarms that look at very specific parts of your application.
Use triggered Lambdas; my favorite is one that jails an instance into my "INSPECT" VPC.

A T2 burstable instance shouldn't be used in this manner, as others have mentioned.
You have literally described the use case for time-based auto scaling.
Redeploy as a 2-tier application.
DNS switching between environments is good; I like non-disruptive deployments.
You could manually set up a 2nd LB and a fixed launch config, use CloudFormation with a nested Beanstalk app, straight-up Chef, whatever you think makes your life easier.
I go with CloudFormation: I like YAML for readability, and I can always run someone else's template through the designer and see what they did.

Session management as described can be done with DynamoDB, but a CloudWatch event should have triggered a push of old records to an S3 bucket.
It sounds like the person that deployed this application didn't read the identity federation and session management docs at all.

A dedicated EBS volume would have been the logical choice, but it doesn't sound like that happened.
At whatever snapshot interval you choose, you just start writing to a new DynamoDB table.
There are a lot of DB housekeeping design patterns built around an EBS volume management event.

CloudFront should be used.
ElastiCache should be used.

Web app and DB should be separated; watch SQS queue depth and make sure workers can be properly scaled.

SNS should be used liberally, because why alarm and monitor and not be notified?

I swear the DOP exam I sat last week was made up of support calls just like the original post.
 
Thanks for your feedback.

Let me share a few insights:

1) I have a very small company with fewer than ten employees and practically no IT staff.
2) Development of my website was outsourced, and it took 1 year whereas the initial timeline was 4 months.
3) Since I have no IT staff and no knowledge of Linux servers, I got my AWS server set up by a freelancer who installed CentOS and did the basic configuration.
4) The new website was moved to the new AWS server, and I spent a month or more testing it and sorting out the bugs.

I went live in the 3rd week of December. Then this happened:
1) After I went live my site got hijacked, or God knows what, and it was filled with malware.
2) For the first few days the site would work during the day, and towards evening all the data would get erased and junk content would appear in the PHP/JS scripts.
3) Then I got hold of another freelancer who fixed my website by installing CSF and ClamAV, and we took some additional measures. I also deleted the data and installed everything again.

Since then it had been working fine, which means no more malware. However, after a few days I noticed that my site was not fast and took time to open.
1) After asking a few people and doing my bit, I figured out that MySQL was hogging all the resources. This caused CPU credit exhaustion.
2) I upgraded my server to a t2.large, but the feedback I got in this thread indicated there was another problem and a server upgrade was not the solution.
3) I started to examine MySQL threads/queries and found that a query against a particular table took longer than expected (the sketch after this list shows the kind of checks involved).
4) I found out that the table held session-related temporary records. It had over a million rows.
5) The idiot who developed the site neither told me about it nor wrote a query to delete the records.
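
The checks in step 3 can be done with standard MySQL statements. A minimal sketch (it assumes a user with the PROCESS privilege, and ci_sessions stands in for whatever the suspect table is actually called):

```sql
-- Spot statements that are running long right now
SHOW FULL PROCESSLIST;

-- Log any query slower than 1 second for later analysis
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SHOW VARIABLES LIKE 'slow_query_log_file';

-- Once a suspect table is identified, check how big it has grown
SELECT COUNT(*) FROM ci_sessions;
```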

Deleting those records fixed the site, and it's working fine now. Getting work done here is not easy; very few companies do quality work, and it's difficult to determine which ones do.
As of now everything is okay, but later I may opt for CloudFront and more if I feel I need to.

The server is just one machine with EBS storage, and that machine itself runs the DB. I see no need for a separate DB server, as I am not a big operation. If traffic increases in the future, then I will see. :)
 
That makes little sense.

Check your SGs and see what ports are open, and whether 3306 in particular (or anything else besides 80) is open to the world.
There shouldn't be more than a couple of SGs; go through them.
If someone opened anonymous FTP or the like, the SG would disallow it; combing through CentOS or the application would be a waste of time.
We lock down the instance ASAP using Lambda.
A monolithic install like what you described makes triage and recovery more difficult.
If I were truly concerned I would replace the instance; we had some EC2-Classic builds like what you described, built by contractors.
I fired them and made Dev learn the associate exam material.
They have to take responsibility for their deployments.

If MFA isn't turned on, do so.
Rotate all API and SSH keys.
Do an IAM audit: check roles and see if any were made that allow cross-account access, federated access, etc., and delete them.
Any user accounts that aren't you, blast them.
Make sure there is no instance profile allowing service roles.
Use ACLs and NACLs for filtering out IPs you see trying to access your instance.
Turn on GuardDuty.

I wouldn't care much about malware, because I'm not allowing any rights that I don't know about.

Stop treating your instance like it's just a physical install you are renting and learn about AWS in your own time.

Most importantly, turn on CloudTrail, and set billing alarms so you don't find out too late that someone is using resources in regions you aren't operating in.

I highly disagree that cost should be a barrier to running MySQL in a private subnet, or that anyone running an application in production shouldn't have at least a snapshot of their instance in a very basic autoscaling group.

CloudWatch needs to look at specific points in your application; stock alarms aren't that helpful.
There are free security services: set up an ELB, turn them on, and enable access logs.

There are a lot of people out there claiming a whole lot.
Some of them will have these:

https://www.certmetrics.com/amazon/public/badge.aspx?i=5&t=c&d=2018-01-17&ci=AWS00169581&dm=80

Most of us that have these won't do what you described.

If I were you I would invest more time learning than hiring someone, given your scale.
 
Thank you for your feedback. However, you have to understand that my operation is small and I do not have the time to learn AWS in depth. If I start doing that, I won't be able to do the ten other things that I do.

However, I managed to fix the CPU load. As mentioned in my earlier post, the issue is fixed for now.
 
If you have a 160 MB dataset, tweak your MySQL settings by increasing the buffer pool size to something larger than 160 MB instead of the old 8 MB default. This will load your entire dataset into memory inside MySQL. Then, if you have a relatively static dataset, you can enable the MySQL query cache, which will speed up your queries by orders of magnitude; relatively static stock data, for example, is a good fit.
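
A minimal sketch of those checks and changes, assuming InnoDB tables (the 512 MB value is just an illustrative size with headroom above a 160 MB dataset):

```sql
-- Check the current buffer pool size (very old MySQL versions defaulted
-- to 8 MB; newer ones default to 128 MB)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Measure total data + index size so the pool can be sized with headroom
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS total_mb
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');

-- MySQL 5.7+ can resize the pool at runtime (the value is in bytes); on
-- older versions, set innodb_buffer_pool_size in my.cnf and restart mysqld.
SET GLOBAL innodb_buffer_pool_size = 512 * 1024 * 1024;
```

The query cache (query_cache_type and query_cache_size) generally has to be set in my.cnf and enabled at server startup, so plan for a restart there.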

A further speedup can be achieved by enabling compression and Expires headers in Apache so static content is cached client-side, which takes a lot of work off your server.
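
As a sketch, the Apache side might look like the following in httpd.conf or a vhost on CentOS 7, assuming mod_deflate and mod_expires are loaded (they usually are in the stock httpd 2.4 configuration):

```apache
# Compress text responses before sending them over the wire
AddOutputFilterByType DEFLATE text/html text/css application/javascript application/json

# Let browsers cache static assets so repeat visits skip the server
ExpiresActive On
ExpiresByType image/png              "access plus 1 week"
ExpiresByType image/jpeg             "access plus 1 week"
ExpiresByType text/css               "access plus 1 day"
ExpiresByType application/javascript "access plus 1 day"
```

Keep an eye on CPU after enabling DEFLATE, since compression trades CPU for bandwidth, as noted earlier in the thread.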
 