Performance Testing and Monitoring

This week we are joined by Kim Knup, who is with Songkick and tells us a tale of intrigue and guile, and the behavior of concert attendees. Wait, what? OK, not quite that juicy, but she does work with Songkick, she does test and monitor performance, and it turns out that different audiences and different fans of different performers have distinctly different approaches to how they source and buy tickets through Songkick, and Kim shares some of those examples with us. Also, in our news segment, when Apple Support is down, do we care as much as when AWS is down? In other words, do we grade quality on a curve?

Transcript:

MATTHEW HEUSSER: We’re back at The Testing Show.  Michael is out this week, so I’m going to be running the show.  I’m your host, Matt Heusser.  Today, we have two regulars.  Perze Ababa?

PERZE ABABA: Hello, everyone.

MATTHEW HEUSSER: We all know Perze.  He is the test manager at Johnson & Johnson and a serious volunteer in the testing world.  And, Jessica Ingrassellino?

JESSICA INGRASSELLINO: Hey there.

MATTHEW HEUSSER: Jess, you’re probably best known for your work at the Python Conference and Python Education Summit; but, like Perze, she does a whole lot of things.

JESSICA INGRASSELLINO: [LAUGHTER].  Indeed.

MATTHEW HEUSSER: There aren’t many people we have on this show who have a doctorate in music and are former secondary education teachers that have done testing for impressive companies in New York City.  Our special guest, of course, Kim Knup, who I met at TestBash a couple of years ago in Brighton.  You were taking photographs?

KIM KNUP: Oh, yes.  Yep.

MATTHEW HEUSSER: You took the Lean Software Test Tutorial, I think, but I don’t know much more about you, except that lately you’ve been doing performance testing.  So, tell us a little about Kim.

KIM KNUP: Okay.  Well, hello everyone.  I’m currently a software tester at Songkick.  Someone that might be a bit more well known, that’s my boss there, is Amy Phillips who talks a lot about continuous delivery and deployment, and that’s generally how we work on our products at Songkick.  In the past, I’ve been mainly a manual tester doing lots of exploratory testing using mind-maps as tools.  I organized, together with Emma Keaveny, the Brighton Tester MeetUp.  Sort of a summary.

MATTHEW HEUSSER: Great.  Lately, at Songkick, you’ve been doing some perf testing, which we’re going to talk about in a little bit.

KIM KNUP: Yes.  Yep.  I’ve discovered some performance testing.  Yeah.  I find it very interesting and fascinating.

MATTHEW HEUSSER: So, my news for today, the only really exciting thing that happened to me, yesterday I was trying to install a new Wi-Fi card on a MacBook Pro, which you can do.  You can unscrew it and put the Wi-Fi card in.  It is possible.  It is not easy.  I went to Google like, “I put the Wi-Fi card in, and it didn’t work.”  I get, “support.apple.com.”  I click on that and it says, “You have to be logged in.”  I click on that and I get an error.  I wish I had taken screen captures, but support.apple.com was essentially down for several hours yesterday.  I tried it this morning.  They had entirely re-skinned the front-end.  So, apparently they did a big deploy overnight, which didn’t take and they did another one.  That’s a reasonably big company.  I mean, it’s Fortune number one?  Fortune number?  It’s up there, it had the highest capitalization of any company on the planet.  They are an integrated hardware and software shop, and they’re pure software.  Sometimes they have these hiccups, and I think the market is going to forgive that.  I think no one’s going to care.  I couldn’t even find any news on it.  The first news I could find was an outage from last month, so apparently it happens a lot and people get over it.  I guess my question is:  Does quality matter anymore, is quality dead?

JESSICA INGRASSELLINO: I wouldn’t say that “quality is dead” in that instance, but I think people are more forgiving when there’s not an immediate impact on their bottom line.  Having a support website go down, while problematic for all the people seeking support, is maybe not the same as having Amazon EC2 go out for, say, even 5 minutes; because, in that 5 minutes, that causes potential profit loss that people really freak out about, customers freak out about.  But, Apple has such a loyal following and since it was the help situation, maybe there’s just more forgiveness in the market for that.  I would imagine that’s a potential case.

MATTHEW HEUSSER: Yeah.  It makes sense that that doesn’t track back to dollars.

KIM KNUP: I think, a lot of the time, it’s also about, while it is good enough quality for the product you’re looking at with the support center, while yet it doesn’t have any monetary impact straight away to the users.  Whereas, the S3 outage, obviously, more users were affected.  It wasn’t just people seeking Apple support.  So, “What’s good enough quality for a product so that we can move fast?”  That might be a question.  I don’t think “quality is dead,” but I think it’s more, “What is enough?”  And not, “How is it amazing, necessarily?”

PERZE ABABA: I think the other key ingredient to this too is complications when it comes to reliability as well as performance, as probably we’ll be talking about later in the podcast.  Everybody remembers what happened when S3 went down a month ago, right?  The site that Matt pasted in our podcast chat, Is It Down Right Now, was also down, [LAUGHTER], because it couldn’t pull any of its resources from AWS.  So, one thing that I realized was there was a new update for iOS, which is iOS 10.3.1.  Most of the time, when they push these updates, there’s some wonkiness around the support systems within Apple.  That’s just really my observation.  It might or might not be related to that particular event.  But it’s weird that you’re seeing that behavior on your side, but we don’t see any report.  So, it could be possible that they’re performing A/B testing on you.

MATTHEW HEUSSER: [LAUGHTER].  That could be.  That would be fascinating.  Yeah.  I’m redirected to something broken, but most people aren’t, because they can sniff my Apple ID from the cookies or whatever and know that I’m running Sierra and do whatever.  So yeah, in the course of researching this, I found a bunch of web sites that we’re going to put in the Show Notes, and they are currently down.  It’s down right now, sort of stuff.  I noticed at least one of those said, “Ubisoft.”  StumbleUpon was down, which amazed me, ubi.com was down.  Then I clicked on it and it worked just fine for me.  WordPress was down for half an hour, but that was in January.  But I think the real thing here is measuring the business impact of an outage so you can come up with reasonable service-level expectations.  IT can go to management and say, “Ah, we think we’re going to be down about this much over the next year.”  Management can say, “Yeah.  It sounds about right.”  I don’t think that conversation happens.  I think it is much more like an expected 100-percent uptime and expected that that’ll be the same cost as we were paying 5 years ago.  When we had really big windows, we could be down, we could do service maintenance on Saturday, and no one would notice.  The Context Driven Manifesto actually says that, “When you jump from context to context, you become dangerous.”  The video game guy at Microsoft who switches to work at Boeing is going to be a problem and vice versa, and I wonder if that’s true too for the insurance company website guy that goes to work at a company like Amazon or an e-commerce company that’s expected to be up 24/7.  Speaking of which, Kim was doing some performance testing at Songkick.  What does Songkick do?  Let’s start there.

KIM KNUP: Songkick is a website for music lovers in a sense.  We let you track your favorite artists in your preferred locations, so you get notifications of when they play in your area.  So, you basically never miss them live.  We also have a commerce piece where we sell tickets to some of those concerts directly to the fans as well.

MATTHEW HEUSSER: So, it’s an e-commerce community?  So, how do you make your money?  Do you make your money through ads?  Do you make your money through ticket sales?

KIM KNUP: Mostly through the ticket sale piece.  So, we run campaigns for artists where we will make them a store where they can sell their tickets directly to fans, and that’s how we make money.  We don’t have the ads on the website.

MATTHEW HEUSSER: What does performance testing in that kind of world mean?  Now, you were doing human testing before then and switched to doing it?  As a percentage, did it change or is it mostly you just think of yourself as a perf person now?  How did it change?

KIM KNUP: I don’t quite think of myself as a “performance tester” as such.  At Songkick, it’s quite unique that a tester is embedded across a couple of product teams, and one of my product teams was the commerce platform.  As part of that, I was doing a lot of user testing, like going through the store, buying tickets as a user.  You also have to get into that mindset of personas as such, like the super-fan that wants tickets at 9:00 a.m.; and, from 8:55 in the morning, they’re refreshing the website over and over again.  You try and simulate that behavior, but you just do that for one.  So then, there was the thought of, “Well, how do we actually simulate that behavior for many, many fans when there’s a high demand?”  That’s kind of how I found my way into looking at what the tests might be that we might want to run on the platform, and that’s kind of how it started.  So, I still feel like I’m an exploratory tester in general and probably always will be, but I’ve experienced performance testing now in that sense.  [LAUGHTER].

MATTHEW HEUSSER: Tell us about how you did it.  How did you simulate a bunch of users?

KIM KNUP: There’s different tools on the market.  We used, specifically, Gatling.  The main reason being that I’ve got some experience using JMeter, which has a UI, and with Gatling the scenarios are written using Scala in code.  It was a lot faster for us to pair with a developer and get them to write the scenarios as we were coming up with them.  One thing we are lucky that we can do is that from actual Onsales, so that’s when we sell tickets for a certain campaign, we can look at the logs and look at the user behavior across that and then from that we have some data to create realistic scenarios.  For example, if you have an event on sale and you can sell X amount of tickets for it, more often than not I think it’s something like 60-to-70 percent of people will buy 2 tickets rather than 1, 3, or 4.  So, we try to weight those sort of journeys into the performance test.  So, 60 percent of our requests would ask for 2 tickets and then the others would be weighted accordingly to request 1, 3, or 4, for example.
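
The weighting Kim describes can be sketched in a few lines.  This is plain Python rather than Gatling's Scala DSL, and only the 60 percent share for 2 tickets comes from the conversation; the split of the remaining 40 percent across 1, 3, and 4 tickets is an assumption made for illustration, as are the names `TICKET_WEIGHTS` and `pick_ticket_quantities`.

```python
import random
from collections import Counter

# Hypothetical weights: the 60 percent share for two tickets comes from the
# conversation; the split of the remaining 40 percent is an assumption.
TICKET_WEIGHTS = {1: 0.20, 2: 0.60, 3: 0.05, 4: 0.15}


def pick_ticket_quantities(n_requests, weights=TICKET_WEIGHTS, seed=None):
    """Return one ticket quantity per simulated request, drawn by weight."""
    rng = random.Random(seed)
    quantities = list(weights)
    return rng.choices(
        quantities,
        weights=[weights[q] for q in quantities],
        k=n_requests,
    )


if __name__ == "__main__":
    sample = pick_ticket_quantities(10_000, seed=42)
    counts = Counter(sample)
    for qty in sorted(counts):
        # Print the observed share of requests for each quantity.
        print(f"{qty} ticket(s): {counts[qty] / len(sample):.1%}")
```

In a real test the weights would be read from Onsale logs rather than hard-coded, so the mix can be changed per artist.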

MATTHEW HEUSSER: My mind goes to implementation, right?

KIM KNUP: Um-hum.

MATTHEW HEUSSER: So I start thinking about, “We need a big list of fake users.”

KIM KNUP: Um-hum.

MATTHEW HEUSSER: “We need to simulate login.  We need to capture the traffic involved in ordering a ticket.  We need to build this sort of chain of clicks and events.  We need a really big for loop to go through all of the users.  We need to ramp it up over time.”  How much of that did Gatling do for you?  What is the actual, when you’re pairing, workflow to come up with these performance tests?

KIM KNUP: In terms of actual workflow, we would look at the logs to see what typical Onsales may look like, how many users actually buy tickets, because there is a percentage that, for example, doesn’t end up checking out.  We don’t have a concept of login for this store.  It’s all guest checkout, which is kind of lucky in that sense.  We don’t have to create user accounts during our testing.  Gatling actually lets you set the amount of concurrent users you want to simulate.  So, let’s say you want to have 10,000 users, you can just put in the number of 10,000 users, and it will spin all of those up.  Because we’re looking at spike testing, to look at how the system handles being hit suddenly with loads of requests, we would actually halt all of those users until they were ready to go.  So Gatling can spin them up over time, but we wanted them all to be ready at the same time.  Then, we would just flip the data to like be on sale and they would go through the purchase flow as quickly as possible.  So, Gatling does do a lot for you once you’ve coded it.  It also spits out really nice HTML reporting, [LAUGHTER], which the business then loves to look at, because you can easily understand how many errors came up, how many successful purchases were made versus not, and was that as designed.  It’s nicely visualized.
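
The "hold everyone, then release at once" idea behind a spike test can be illustrated with a barrier.  This is a minimal Python sketch and not how Gatling does it internally; `run_spike` and the `purchase` callable are hypothetical stand-ins for a guest-checkout flow, and a real tool would use non-blocking I/O rather than one thread per user to reach numbers like 10,000.

```python
import threading
import time


def run_spike(n_users, purchase):
    """Start n_users threads, hold them at a barrier, then release them together.

    `purchase` is a hypothetical callable standing in for one guest-checkout
    flow; it receives the user index and returns True on success.
    """
    barrier = threading.Barrier(n_users)
    results = [None] * n_users
    release_times = [0.0] * n_users

    def user(i):
        barrier.wait()  # block until every simulated user is ready
        release_times[i] = time.monotonic()
        try:
            results[i] = bool(purchase(i))
        except Exception:
            results[i] = False  # count any error as a failed purchase

    threads = [threading.Thread(target=user, args=(i,)) for i in range(n_users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    succeeded = sum(1 for r in results if r)
    return {
        "succeeded": succeeded,
        "failed": n_users - succeeded,
        "release_spread_s": max(release_times) - min(release_times),
    }


if __name__ == "__main__":
    # Stand-in purchase: every tenth simulated user fails.
    print(run_spike(50, lambda i: i % 10 != 0))
```

The returned counts mirror the successful-versus-failed summary that Kim says the business likes to see in the report.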

MATTHEW HEUSSER: What were some things, when you went through that process, that you didn’t expect?

KIM KNUP: Because we sell tickets for very different artists and very different genres, it was interesting to see that fans of different artists behave very differently.  So, you have to account, [LAUGHTER], for that.  So, real life is way more complicated, [LAUGHTER], than it is in code.  Just because we would’ve performance tested in one manner based on an Onsale for one particular artist, let’s say something like the Chili Peppers or something like that, that demographic would be very different to a different artist like Little Mix or something.  We haven’t actually sold tickets for them, but I’m just saying that demographic is very different.  They behave very differently.  So, you actually end up seeing that different stuff is happening on your system.

MATTHEW HEUSSER: Give me an example.  What do they do that’s different?

KIM KNUP: So, for example, when you go and select “tickets,” on the ticket form, the older demographic tend to be a bit more relaxed and they will just click once or twice a second to try and get a good pair of tickets, and also they’re mostly on a desktop.  Whereas the younger demographic was mostly on mobile devices, and they would be in a position to tap continuously on the ticket button, and it would be something like almost 50 times a second.  I don’t even know how that’s possible, but it was definitely real human beings clicking this button over and over again.

MATTHEW HEUSSER: What is this button they’re trying to click?

KIM KNUP: The ticket.  Say it was for two tickets, it would then show them, “You have these two tickets in Row A and at Seats 1 and 2.”  But say, the next time they click, it might be in Row Z, and they might not be happy.  So, they’ll click it again to try and get the best pair of tickets.

MATTHEW HEUSSER: So, it’s like you don’t see a list of 100 tickets.  It’s just like, “Yeah.  These are the two tickets we selected for you.”  “Nope, give me a different one.  Nope, give me a different one.  Nope.”  That’s the activity?

KIM KNUP: Kind of, yeah.  So, the tickets can either be viewed in terms of pricing or in terms of area.  So, a venue—a bigger venue especially—would have different areas like balcony, front circle, back circle, sections 1-10.  They could either view them by section or by price.  A lot of the time, they view them by section because they know, “I want to be in the front circle,” and just click over and over again to get the best pair.  We had to adjust the front-end a little bit based on that scenario where people kept clicking over and over again, because it puts an extra load on the system each time a request is made.  Yeah.  But, that’s the thing we’ve learned—that users for different artists behave very differently at the end of the day.  [LAUGHTER].

MATTHEW HEUSSER: Huh, that’s fascinating.  So, I’m more of the style of, “Show me an image of the venue and then show me a bunch of things.”

KIM KNUP: Um.

MATTHEW HEUSSER: Give me 20 results and I can say, “I want different ones,” or drill into this or drill into that.  But, by saying, “Select the price range or select the area and we give you one set,” it kind of builds slot machine behavior.

KIM KNUP: Um-hum.  Yes.

MATTHEW HEUSSER: Uncertain immediate positive incentive.  Perze, you’ve done a lot of this sort of work at various different companies.  So, does this sound familiar to you?

PERZE ABABA: Yes.  I’m actually very interested.  I do have a question, Kim.

KIM KNUP: Um-hum.

PERZE ABABA: With regards to the area, when you guys are simulating load, one of the challenges I’ve noticed in performance testing for e-commerce applications is how long we kind of hold that session before the user lets it go.  The question is:  Do you guys actually simulate that?  An example that you gave earlier was you have a behavior of a user where they try to load multiple browsers at once, for example, and then see what choices they’re given; but, each of these choices, that means we’re locking that choice in the system and letting that go if they don’t choose it after a certain amount of time.  How do you guys actually simulate locking, or do you guys identify the workload on that and then simulate the load based on existing instances when people are trying to order multiple tickets at once?

KIM KNUP: Yeah.  So, there was definitely a journey there.  Some of the initial scenarios we built were based on assumptions or things we’d seen in the past, and then we started to use real data from real Onsales, like smaller ones and scaling them up to see like how many would hold onto a ticket, for how long, and not actually buy it.  If you buy a ticket through one of our stores on an artist’s website, there will be a 10-minute timer that then clears those tickets and you’ve lost them basically.  I can’t quite remember if we 100 percent simulate that actual lock sitting there for 10 minutes, but I think we do in the tests that run for 2-3 hours.  But we might not in the other ones, because the user can actually just clear their cart and then that’s immediate: those tickets come back into play.  It depends on what our objective is.
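
The hold-and-release behavior discussed here (tickets locked to a cart, returned either when the cart is cleared or when a 10-minute timer runs out) can be modeled roughly as follows.  This is a hypothetical in-memory sketch in Python, not Songkick's implementation; the class name `TicketHolds`, the seat labels, and the injectable `clock` are all assumptions added so the timer can be exercised without waiting.

```python
import time

HOLD_SECONDS = 10 * 60  # the 10-minute timer mentioned above


class TicketHolds:
    """Hypothetical in-memory model of held tickets."""

    def __init__(self, available, hold_seconds=HOLD_SECONDS, clock=time.monotonic):
        self.available = set(available)
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.holds = {}  # cart_id -> (tickets, expiry time)

    def _expire(self):
        """Return tickets from every hold whose timer has run out."""
        now = self.clock()
        expired = [c for c, (_, expiry) in self.holds.items() if expiry <= now]
        for cart_id in expired:
            self.clear(cart_id)

    def hold(self, cart_id, quantity):
        """Reserve `quantity` tickets for a cart, or return None if too few remain."""
        self._expire()
        self.clear(cart_id)  # a repeat request replaces the cart's earlier hold
        if len(self.available) < quantity:
            return None
        tickets = sorted(self.available)[:quantity]
        self.available.difference_update(tickets)
        self.holds[cart_id] = (tickets, self.clock() + self.hold_seconds)
        return tickets

    def clear(self, cart_id):
        """Release a cart's tickets immediately, as when a user empties the cart."""
        tickets, _ = self.holds.pop(cart_id, ([], None))
        self.available.update(tickets)

    def remaining(self):
        self._expire()
        return len(self.available)


if __name__ == "__main__":
    now = [0.0]  # fake clock so the example does not wait 10 minutes
    holds = TicketHolds(["A1", "A2", "A3", "A4"], clock=lambda: now[0])
    print(holds.hold("cart-1", 2), holds.remaining())
    now[0] += HOLD_SECONDS
    print(holds.remaining())
```

A load scenario built on a model like this can choose, per simulated user, whether to check out, clear the cart, or simply walk away and let the timer expire.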

PERZE ABABA: I think, when we are performing, kind of the performance analysis, there’s an actual flow, you know, that we follow.

KIM KNUP: Yeah.

PERZE ABABA: This is something that is kind of learned over the years where the load testing process that I usually follow is, “You have to identify the key scenarios that you’re looking at and then you identify the workload and then you identify which metrics you should be caring about when you’re running the tests.”

KIM KNUP: Yeah.

PERZE ABABA: You develop your test cases whether that’s using the tool—using Gatling is actually pretty cool—and then the actual simulation of the load and then the analysis of the results.  As we perform multiple rounds of testing, we go back and identify which metrics we care about.  We go down that list and say, “These are the things that we can iterate on.”  One thing that I really appreciate with the Context Driven Community was that they really set this rule that testing is about finding information from a performance perspective.  There’s a ton of information, [LAUGHTER], that we could lean on, forge into, and then we can actually go just from the UI side of things, from a web browser performance side of things, and we can go all the way back to the database looking at what the slow queries are for a given request that you’re looking at.  So, there’s a ton of opportunities where we can look at, “Which part of the system are we looking into performance?”  It’s really not easy to do this when you’re using a standard marketing site, and it’s exponentially complicated when you’re dealing with an e‑commerce site where you have to deal with regulations and financial standards.

KIM KNUP: I think one thing to add to that is we can find a lot of information using performance testing, but often you don’t quite know where to start or what to look at, unless you have a real objective or mission statement of what you’re looking for and what you’re trying to achieve to inform your actual testing.  That’s something that I learned.  You need to know what you’re actually aiming for; because, otherwise, there’s always going to be something that slows the system down and if you fixed that bit, it’s a bit like whack-a-mole: you find the next slow query.  But, “When is it actually enough that you’re happy you can handle the load that the system might get?”  Always having a mission statement in mind of what you want to achieve and why is really useful for performance testing as well as any testing.

MATTHEW HEUSSER: So, can I ask a political question?  When you do your perf testing and someone somewhere (first it’s, who is the person) says, “We need to simulate 10,000 simultaneous users and it needs to respond within so fast?”  Typically, for me, it’s always investigation.  They had no idea, and they have no idea.  So, I report how it is and ask, “Is it good enough?”  But, when you do that, do you add an extra 10 percent you just don’t tell anybody about so you know how it’s going to behave if you hit 11,000?  But you don’t share that, or do you share that?  Do you do over test?

KIM KNUP: We certainly have over tested, like you say, just in case, [LAUGHTER], and have communicated it.  But, the most important part was that we can fulfill the initial objective of, “Sell this many tickets in that timeframe with the system not slowing down.”  If we have any other information, we tend to be really open as a team.  There aren’t many politics where I am, which is nice, [LAUGHTER], that I know of, that I’m exposed to yet.

PERZE ABABA: To piggyback on that, what Kim said earlier about, “establishing an initial objective,” grounds that idea where you have to understand how your system performs on its baseline.  If I just go through a normal traffic, which you could probably look in through your Apache logs or whatever logging mechanism that you guys have in production, if your system is already in production, that’s something that you can rely on and say, “Okay.  Fine.  These are the observed think times between actions.  These are the actual number of people that show up within the hour, and these are the things that they are doing.  Five percent of them are going through login, ten percent are going through this particular feature.  These are the flows, these are the behaviors that they’re doing.”  You’re able to observe the key measures that you’re looking at.  So, whether it’s how the CPU performs in the back-end or what the response times are, what the inherent latencies are between requests.  Once you establish that baseline, then you can make a distinction in saying, “We’ve observed during these types of sales, we have like a 10X increase or a 20X increase,” and then you test against that.  It’s definitely good to see if there’s a way for you to push and find that limit; because, the moment you see that you’ve hit your objectives and your system behaves properly, then it would be interesting to find out, “What does it take to really stress the system out?”  Introduce as much load as you can.  [LAUGHTER].  You can see which systems respond to certain bottlenecks.  I’ve recently performed testing for a back-end system, and we’ve realized the response times are looking really good for this metric that we’re doing but then we’ve seen the CPU just go haywire.  We’re not even hitting a 20th of the load that we expect the system to be able to perform.  The CPU is already at like 100 percent.  Our uptimes are like 47 times more than what the CPU is able to work on, so these are things that you have to observe, you know, in the background and then see, “What are we actually doing to kind of have these behaviors show themselves to us?”  That way, we can go through our remediation and escalation if needed.
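
Perze's baseline-then-multiplier approach comes down to simple arithmetic, sketched here in Python.  The 5 percent and 10 percent flow shares and the 10X multiplier come from the conversation; the baseline of 1,200 users per hour, the "browse" flow that takes the remaining share, and the function name `target_load` are assumptions for illustration.  The optional `headroom` parameter corresponds to the extra 10 percent Matt asked about earlier.

```python
def target_load(baseline_per_hour, flow_mix, multiplier, headroom=0.0):
    """Scale an observed baseline into per-flow targets for a load test.

    flow_mix maps a flow name to its share of traffic (shares must sum to 1).
    headroom is an optional extra fraction on top of the multiplier.
    """
    if abs(sum(flow_mix.values()) - 1.0) > 1e-9:
        raise ValueError("flow shares must sum to 1")
    total = baseline_per_hour * multiplier * (1 + headroom)
    return {flow: round(total * share) for flow, share in flow_mix.items()}


if __name__ == "__main__":
    # Hypothetical baseline; in practice this would come from production logs.
    mix = {"browse": 0.85, "login": 0.05, "feature": 0.10}
    print(target_load(1200, mix, multiplier=10))
    print(target_load(1200, mix, multiplier=10, headroom=0.10))
```

Keeping the baseline, the mix, and the multiplier as separate inputs makes it easy to rerun the same test at 20X, or with a different mix for a different kind of sale.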

KIM KNUP: Alongside that, we did a lot of work to investigate like our fail-over strategy, “What if this service becomes unavailable?  What if the CPU reaches 100 percent on this box?”  We also wanted to see how that then actually plays out, so we would also do a stress test and push the limits of the system and see which bit actually does fail first and then, “How does that cascade down the system, and how does that actually look on the monitoring, just so that we are prepared should the worst ever happen?”  So, I think a proper stress test where you push the limits of the system gives you lots of invaluable information about your infrastructure as well.

PERZE ABABA: Right, and monitoring is really key for this.

KIM KNUP: Yes.

PERZE ABABA: Because if you don’t know how each of your discrete systems is behaving based on a given workload scenario that you guys are introducing, then it would be just very difficult to make an informed decision.  It might be okay where you’re observing the system; but, somewhere else, it’s already thrashing some internal systems, and it just needs a little push to kind of bring the whole system down.  For me, with the systems that I’m working on right now, which is very heavily reliant on micro-services and most of the time we’re just looking at the services that are user-facing, but then you’re looking at all the other dependent services.  If there’s no integration testing that happens of all these services in between, then it’s really a bad recipe for an incoming disaster.

JESSICA INGRASSELLINO: So, Kim, you had mentioned at the outset that your company is “unique” and that you have a lot of “teams” and test across teams.  My question is:  With such a very complex system of testing, and especially performance testing, that needs to be done on a product like this, that is e-commerce, that experiences incredible load based on essentially fan tastes, how do you balance and manage that with your other responsibilities?  How does that look in the organization?

KIM KNUP: The last couple of quarters, when I was doing this performance testing on the commerce platform, we had four product teams that varied in size.  Often me and my colleague, we try and balance out that we have an equal amount of work across the two teams that we look after.  So, there will be one that’s a bit more high profile.  I don’t want to say “more important,” but maybe more complicated and might need a bit more dedicated testing time such as the commerce platform, and then I might have something that isn’t as heavy.  We’re in a really good place with our general website, in terms of test and test strategy, the system is well understood.  The developers use continuous deployment.  There’s no dedicated testing phase; testing is just built into the development process.  It isn’t the end of the world if you aren’t as hands-on on that product team, but if you take more of a strategy role in that team.  So, it’s all about managing time and managing priorities.  Does that help a little bit?  I know it might sound a bit abstract.  [LAUGHTER].

JESSICA INGRASSELLINO: It helps a lot actually.  It’s as I hoped it would be, [LAUGHTER], but I also wasn’t certain how one might go about kind of doing all of those things.

KIM KNUP: Yeah.  I mean, with the performance testing, actually we had a dedicated slot that on a certain weekday we would spend two hours as a team doing the performance testing, or like as a subset of the team, and would run the test together and someone would be monitoring, someone would also go through the store as a user to see if there was any user-facing issues that might pop up while the system is under load, and we would get together like that.  Because we had that dedicated time, we knew this was always going to be part of the development against any new features or any back-end changes we made.  We always knew there was going to be dedicated time to do the performance testing on it.  That was, I think, our attempt to integrate performance testing and not have it as an afterthought.

JESSICA INGRASSELLINO: That’s great to hear.

MATTHEW HEUSSER: So, I have one more question and that’s, “What broke?”  You did perf testing, but what fell over first?

KIM KNUP: There were some issues where we saw high CPU spikes on the database initially, and it turned out that a tweak of a query we were making helped with that a lot.  So, there’s some legacy code in the system that isn’t as performant as it could be, and we did a lot of work on the algorithms that look for tickets and on the general database queries to help with that.  Our front-end generally was all right, as far as I remember.  Yeah, I don’t know how much detail I can go in.  But, yes.  I think the thing that worried us the most initially was the database, but it wasn’t actually too bad of a fix.

MATTHEW HEUSSER: Right.  Yeah.  There’s a lot to unpack there.  I think we’ve covered enough for today.  Maybe we can come back again.

KIM KNUP: [LAUGHTER].

MATTHEW HEUSSER: So, if you have questions about this—I’m sure we didn’t cover everything—please send your e-mail to:  [email protected]  But, I think that’s about it for the perf-testing stuff.  Does anybody have any news?  Anything exciting coming up?

PERZE ABABA: NYC Testers, we are going to announce our next upcoming MeetUp pretty soon, so watch out on our website as well as our MeetUp page, meetup.com/NYC-Testers.

MATTHEW HEUSSER: All right, everyone.  Thanks for being on the show.  Appreciate it.

KIM KNUP: Yeah.  Thank you very much for having me.

JESSICA INGRASSELLINO: Thank you.

PERZE ABABA: Thank you.

[END OF TRANSCRIPT]